This is the fourth installment in our series on web scraping with R. The series includes practical examples for the other leading R web scraping packages, including RCurl and jsonlite (for JSON). This article focuses on the rvest package, targeting data using CSS selectors.
I read the email and my heart sank. As part of our latest project, my team was being asked to compile statistics for a large group of public companies. A rather diverse set of statistics, from about four different sources. And to make life better, the list was “subject to change”. Translated: be ready to update this mess at a moment’s notice….
The good news: most of the requested data was publicly available, crammed into the nooks and crannies of various financial sites.
This was a perfect use case for web scraping. An old-school update (aka the intern-o-matic model) would take about three or four hours. Even worse, it would be nearly impossible to quality check. A well-written web scraper would be faster and easier to check afterwards.
After installing the rvest and jsonlite libraries, I fired up Google and started looking for sources. The information we needed was available on several sites. After a little comparison and data validation, I settled on several preferred sources.
Important: Many websites have policies which restrict or prohibit web scraping; the same policies generally prohibit you from doing anything else useful with the data (such as compiling it). If you intend to publish the scraped data or put it to commercial use, you should consult a lawyer to understand your legal risks. This code should be used for educational purposes only. In practice, personal scraping is difficult to detect and rarely pursued (particularly at low request volumes).
Back to our example. To reduce the risk of getting a snarky legal letter, we’re going to share a couple of examples using the rvest package to grab information from Wikipedia. The same techniques can be used to pull data from other sites.
Nice Table. We’ll Take It!
In many cases, the data you want is neatly laid out on the page in a series of tables. Here’s the basic rvest pattern: you target a specific page and pick the table you want (in order of appearance). The same approach works for grabbing every item on the page within an HTML tag such as <li> — you target this with a CSS selector like “sources li”.
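Here’s a minimal sketch of that pattern. The inline HTML (including the “sources” container id and the list items) is invented so the example runs without a network connection; in a real script you would pass a page URL to read_html().

```r
library(rvest)

# Inline HTML standing in for a real page; in practice you would call
# read_html("https://en.wikipedia.org/wiki/...") with a real URL.
html <- '
  <div id="sources">
    <ul><li>Annual report</li><li>Quarterly filing</li></ul>
  </div>
  <table>
    <tr><th>Ticker</th><th>Price</th></tr>
    <tr><td>ABC</td><td>10.5</td></tr>
  </table>'
page <- read_html(html)

# Grab every table on the page, then pick one by order of appearance
first_table <- html_table(page)[[1]]

# CSS selector "#sources li": every <li> inside the element with id "sources"
items <- html_text(html_elements(page, "#sources li"))
```

Note that html_elements() is the rvest 1.0 name for the older html_nodes(); both work in current versions.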
This is just scratching the surface of what you can accomplish with CSS selector targeting. For a deeper view of the possibilities, take a look at some of the tutorials written by the jQuery community.
JSON: On a Silver Platter…
The good news is that once you’ve figured out how the request is structured, the data is usually handed to you on a silver platter. JSON is built around a dictionary structure: data is labeled (usually very well), free of display cruft, and easy to filter down to the parts you want. For a deeper look at how to work with JSON, check out our article on this topic.
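As a minimal sketch with the jsonlite package (the payload below is invented; in a real script it would come from a site’s API endpoint, e.g. by passing the URL to fromJSON()):

```r
library(jsonlite)

# Invented JSON payload standing in for an API response
json <- '[
  {"ticker": "AAA", "price": 10.5},
  {"ticker": "BBB", "price": 20.1}
]'

# fromJSON() simplifies an array of labeled records into a data frame,
# so filtering down to the parts you want is one line of base R
df <- fromJSON(json)
big <- df[df$price > 15, "ticker"]
```

Because every record is labeled, there is no screen-display cruft to strip out before you can use the result.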
While it is always nice to automate the boring stuff, there are a couple of other advantages to using web scraping over manual collection. The use of scripted processes makes it easier to replicate errors and fix them. You’re no longer at the whim of a (usually bored) human data collector (aka the intern-o-matic) grabbing the wrong fields or mis-coding a record. We have also found that large-scale database errors are detected faster with this approach. For example, in the corporate data collection project we mentioned earlier, we noticed that the websites we were scraping generally didn’t seem to collect accurate data on certain types of companies. While this would have eventually surfaced via a manual collection effort, the process-focused element of scraping forced this issue to the surface quickly. And finally, since the scraping script shrunk our refresh cycle from several hours to under a minute, we can refresh our results much more frequently.
This was the latest in our series on web scraping. Check out one of the earlier articles to learn more about scraping:
- Scraping HTML using readLines() and RCurl
- Using jsonlite to scrape data from AJAX websites
- Scraper Ergo Sum – Suggested projects for going deeper on web scraping
You may also be interested in the following:
- Accessing data for R using SPARQL (Semantic Web Queries)
- Quantmod package for getting stock price data and economic indicators
- Using R Animations to spice up your presentations
The post Webscraping with rvest: So Easy Even An MBA Can Do It! appeared first on ProgrammingR.