Harvesting Data From the Web With Rvest: Exercises

[This article was first published on R-exercises, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The rvest package allows for simple and convenient extraction of data from the web into R, which is often called “web scraping.” Web scraping is a basic and important skill that every data analyst should master. You’ll often see it as a job requirement.

In the following exercises, you will practice your scraping skills on the “Money” section of the CNN website. All of the main functions of the rvest package will be used. Answers to these exercises are available here.

Since websites are constantly changing, some of the solutions might grow to be outdated with time. If this is the case, you are welcome to inform the author and the relevant sections will be updated.

Exercise 1
Read the HTML content of the following URL with a variable called webpage:
https://money.cnn.com/data/us_markets/
At this point, it will also be useful to open this web page in your browser.

Exercise 2
Get the session details (status, type, size) of the above mentioned URL.

Exercise 3
Extract all of the sector names from the “Stock Sectors” table (bottom left of the web page.)

Exercise 4
Extract all of the “3 Month % Change” values from the “Stock Sectors” table.

Exercise 5
Extract the table “What’s Moving” (top middle of the web page) into a data-frame.

Exercise 6
Re-construct all of the links from the first column of the “What’s Moving” table.
Hint: the base URL is “https://money.cnn.com”

Exercise 7
Extract the titles under the “Latest News” section (bottom middle of the web page.)

Exercise 8
To understand the structure of the data in a web page, it is often useful to know what the underlying attributes are of the text you see.
Extract the attributes (and their values) of the HTML element that holds the timestamp underneath the “What’s Moving” table.

Exercise 9
Extract the values of the blue percentage-bars from the “Trending Tickers” table (bottom right of the web page.)
Hint: in this case, the values are stored under the “class” attribute.

Exercise 10
Get the links of all of the “svg” images on the web page.

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)