Web-Scraping JavaScript rendered Sites

March 26, 2016

(This article was first published on Florian Teschner, and kindly contributed to R-bloggers)

Gathering data from the web is one of the key tasks in order to generate easy data-driven insights into various topics.
Thanks to the fantastic Rvest R package web scraping is pretty straight forward.
It basically works like this; go to a website, find the right items using the selector gadget and plug the element path into your R-code.
There are various, great tutorials on how to do that (e.g. 1, 2 ).

Increasingly, I see that websites, which (un-)consciously make it hard to scrape their site by employing delayed JavaScript-based rendering. In these cases the simple approach of using rvest breaks down. Examples include the Bwin betting site or the German site busliniensuche. Busliniensuche is a site to compare bus travel providers on price, duration and schedule.

As side-note, the German bus travel market has been deregulated in 2013, hence the market is still rapidly developing. I thought it would be interesting to analyse the basic market elements and compare bus travel to the established train provider Deutsche Bahn.

As mentioned the site employs some kind of delayed JavaScript rendering. It basically loads the page content with a delay. This would be great if the web site would call a structured JSON endpoint. This is not the case, rather I believe they make it intentionally hard to gather their data in a structured form.

##How-to scrape JS-rendered websites?

One way to gather the data nonetheless is using a “headless” browser such as PhantomJS.
“A headless browser is a web browser without a graphical user interface. Headless browsers provide automated control of a web page in an environment similar to popular web browsers” (Source: Wikipedia). In order to control PhantomJS from R we need two scripts; a) the PhantomJS file and b) a R file that manipulates and runs PhantomJS. Both files are included at the end of this post.

The PhantomJS file has one parameter; the URL that is supposed to be scraped, here placed right at the beginning of the file.
The headless browser loads the URL, waits 2500 milliseconds and saves the file to disk (file-name: 1.html).
The R file changes the URL to the target site, runs the headless browser using a system call and works with the locally saved file in an rvest-like way.

So to get started on your own: Download PhantomJS, place the .exe file in your working R-directory and adapt the source code accordingly.

Happy Scraping!

To leave a comment for the author, please follow the link and comment on their blog: Florian Teschner.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.


Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)