Harvesting Data From the Web With Rvest: Exercises

August 20, 2018
(This article was first published on R-exercises, and kindly contributed to R-bloggers)

The rvest package makes it simple and convenient to extract data from the web into R, a process often called “web scraping.” Web scraping is a basic and important skill that every data analyst should master. You’ll often see it as a job requirement.

In the following exercises, you will practice your scraping skills on the “Money” section of the CNN website. All of the main functions of the rvest package will be used. Answers to these exercises are available here.

Since websites are constantly changing, some of the solutions may become outdated over time. If this is the case, you are welcome to inform the author and the relevant sections will be updated.

Exercise 1
Read the HTML content of the following URL into a variable called webpage:
https://money.cnn.com/data/us_markets/
At this point, it will also be useful to open this web page in your browser.
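
As a rough sketch of the reading step (not the official answer), read_html() parses HTML into a document object that the other rvest functions can query. To stay runnable offline, the example below parses a small made-up HTML string rather than the CNN page; the markup and the names toy_html and toy_page are assumptions for illustration only.

```r
library(rvest)

# Hypothetical markup standing in for a downloaded page.
toy_html <- "<html><head><title>Toy Markets</title></head>
<body><h1>US Markets</h1><p>Demo page</p></body></html>"

# read_html() accepts a URL, a file path, or a literal HTML string.
toy_page <- read_html(toy_html)

# Pull the page title to confirm that parsing worked.
page_title <- toy_page %>% html_node("title") %>% html_text()
print(page_title)
```

Passing the URL from the exercise instead of toy_html should work the same way, provided the page is still reachable.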

Exercise 2
Get the session details (status, type, size) of the above-mentioned URL.

Exercise 3
Extract all of the sector names from the “Stock Sectors” table (bottom left of the web page).

Exercise 4
Extract all of the “3 Month % Change” values from the “Stock Sectors” table.
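
Exercises 3 and 4 both come down to selecting table cells with a CSS selector and pulling out their text. Here is a minimal sketch on an invented table; the id "sectors" and the classes "name" and "chg" are hypothetical, so inspect the real page in your browser to find the selectors that actually apply.

```r
library(rvest)

# Made-up table; the real page's ids and classes will differ.
toy_page <- read_html("<table id='sectors'>
  <tr><th>Sector</th><th>3 Month % Change</th></tr>
  <tr><td class='name'>Energy</td><td class='chg'>+1.50%</td></tr>
  <tr><td class='name'>Utilities</td><td class='chg'>-0.75%</td></tr>
</table>")

# Select cells by CSS selector, then extract their text.
sector_names <- toy_page %>% html_nodes("#sectors td.name") %>% html_text()
changes_raw  <- toy_page %>% html_nodes("#sectors td.chg") %>% html_text()

# Strip the percent sign and convert the strings to numbers.
changes <- as.numeric(sub("%", "", changes_raw, fixed = TRUE))

print(sector_names)
print(changes)
```

The sketch uses html_nodes(), which is available in both older and newer rvest releases.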

Exercise 5
Extract the table “What’s Moving” (top middle of the web page) into a data frame.
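
For whole tables, html_table() can convert a table node into tabular data in one step. The sketch below assumes an invented table with the hypothetical id "movers" and made-up company names; it is an illustration of the function, not the solution for the CNN page.

```r
library(rvest)

# Made-up table with a header row and two data rows.
toy_page <- read_html("<table id='movers'>
  <tr><th>Company</th><th>Price</th></tr>
  <tr><td>Acme Corp</td><td>10.5</td></tr>
  <tr><td>Globex Inc</td><td>20.25</td></tr>
</table>")

# Select the single table node, parse it, and coerce to a plain data frame.
movers <- toy_page %>%
  html_node("#movers") %>%
  html_table() %>%
  as.data.frame()

print(movers)
```

The as.data.frame() call is there because the class of the object returned by html_table() can differ between rvest versions.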

Exercise 6
Re-construct all of the links from the first column of the “What’s Moving” table.
Hint: the base URL is “https://money.cnn.com”
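
Re-constructing links usually means reading each anchor's "href" attribute and prepending the base URL. A minimal sketch, assuming an invented table whose links are relative paths; the id "movers" and the paths "/quote/demo1" and "/quote/demo2" are hypothetical, and only the base URL comes from the hint above.

```r
library(rvest)

base_url <- "https://money.cnn.com"  # taken from the hint

# Made-up table whose first column holds relative links.
toy_page <- read_html("<table id='movers'>
  <tr><td><a href='/quote/demo1'>Demo One</a></td></tr>
  <tr><td><a href='/quote/demo2'>Demo Two</a></td></tr>
</table>")

# html_attr() returns the value of one named attribute per node.
relative_links <- toy_page %>%
  html_nodes("#movers td a") %>%
  html_attr("href")

# Prepend the base URL to get absolute links.
full_links <- paste0(base_url, relative_links)
print(full_links)
```

If a page mixes relative and absolute links, xml2::url_absolute(relative_links, base_url) may be a more robust choice than paste0().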

Exercise 7
Extract the titles under the “Latest News” section (bottom middle of the web page).

Exercise 8
To understand the structure of the data in a web page, it is often useful to know the underlying attributes of the text you see.
Extract the attributes (and their values) of the HTML element that holds the timestamp underneath the “What’s Moving” table.
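
To list every attribute of an element at once, html_attrs() (note the plural) returns a named character vector for a single node. The element below is invented; the id "stamp", the class "timestamp-demo" and the "data-updated" attribute are hypothetical stand-ins for whatever the real timestamp element carries.

```r
library(rvest)

# Made-up element standing in for the timestamp below a table.
toy_page <- read_html(
  "<div id='stamp' class='timestamp-demo' data-updated='demo'>Data as of 9:30am ET</div>"
)

# html_attrs() returns all attributes of the node as a named vector.
stamp_attrs <- toy_page %>% html_node("#stamp") %>% html_attrs()

print(stamp_attrs)
print(stamp_attrs[["class"]])
```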

Exercise 9
Extract the values of the blue percentage-bars from the “Trending Tickers” table (bottom right of the web page).
Hint: in this case, the values are stored under the “class” attribute.
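
When a value is encoded in a class name, one approach is to read the "class" attribute and then pull the number out with a regular expression. The sketch below assumes a made-up naming scheme ("pct40", "pct75"); the real page may encode its values differently, so adjust the pattern after inspecting the markup.

```r
library(rvest)

# Made-up list; the class names and their "pctNN" scheme are hypothetical.
toy_page <- read_html("<ul id='tickers'>
  <li><span class='bar pct40'></span></li>
  <li><span class='bar pct75'></span></li>
</ul>")

# Read the full class string of each bar element.
bar_classes <- toy_page %>%
  html_nodes("#tickers span.bar") %>%
  html_attr("class")

# Keep only the digits that follow "pct" and convert them to numbers.
bar_values <- as.numeric(sub(".*pct([0-9]+).*", "\\1", bar_classes))

print(bar_classes)
print(bar_values)
```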

Exercise 10
Get the links of all of the “svg” images on the web page.
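
One possible way to collect image links of a given type is to read the "src" attribute of every img element and keep those that end in ".svg". The markup and file paths below are invented for illustration, and the sketch assumes the images are referenced through img tags rather than embedded inline.

```r
library(rvest)

# Made-up markup mixing SVG and non-SVG images.
toy_page <- read_html("<div>
  <img src='/img/logo.svg' alt='logo'>
  <img src='/img/photo.png' alt='photo'>
  <img src='/img/chart.svg' alt='chart'>
</div>")

# Collect the source of every image on the page.
all_sources <- toy_page %>% html_nodes("img") %>% html_attr("src")

# Keep only the sources whose file name ends in ".svg".
svg_links <- all_sources[grepl("\\.svg$", all_sources)]
print(svg_links)
```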
