[This article was first published on R – rud.is, and kindly contributed to R-bloggers.]
For dynamic sites, the RSelenium and/or seleniumPipes packages are super-handy tools to have in the toolbox. They interface with Selenium, which is a feature-rich environment/ecosystem for automating browser tasks. You can programmatically click buttons, press keys, follow links and extract page content because you’re scripting actions in an actual browser or a browser-like tool such as PhantomJS. Getting the server component of Selenium running was often a source of pain for R folks, but the new Docker images make it much easier to get started. For truly gnarly scraping tasks, it should be your go-to solution.
However, sometimes all you need is the rendering part and for that, there’s a new light[er]weight alternative dubbed Splash. It’s written in Python and uses Qt WebKit for rendering. To avoid deluging your system with all of the Splash dependencies, you can use the Docker images. In fact, I made it dead easy to do so. Read on!
Going for a dip
The intrepid Winston Chang at RStudio started a package to wrap Docker operations and I’ve recently joined in the fun to add some tweaks & enhancements to it that are necessary to get it on CRAN. Why point this out? Since you need to have Splash running to work with it in splashr, I wanted to make it as easy as possible. So, if you install Docker and then devtools::install_github("wch/harbor") you can then devtools::install_github("hrbrmstr/splashr") to get Splash up and running with:
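A minimal sketch of that startup sequence, assuming Docker is already running locally and both packages are installed. The name start_splash() is an assumption here (the post itself only names install_splash(), stop_splash() and the splash_svr handle), so check it against the version of splashr you have installed:

```r
library(splashr)

# Pull the Splash Docker image to the local machine (only needed once).
install_splash()

# Launch the container and keep the returned handle; it is needed
# later to shut the container down with stop_splash().
# NOTE: start_splash() is assumed to be the companion launcher.
splash_svr <- start_splash()
```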
The install_splash() function will pull the correct image to your local system and you’ll need that splash_svr object later on to stop the container. Now, you can have Splash running on any host, but this post assumes you’re running it locally.
We can test to see if the server is active:
splash("localhost") %>% splash_active()
## Status of splash instance on [http://localhost:8050]: ok. Max RSS: 70443008
Now, we’re ready to scrape!
We’ll use this site — http://www.techstars.com/companies/ — mentioned over at DataCamp’s tutorial since it doesn’t use XHR but does require rendering and it doesn’t prohibit scraping in the Terms of Service (don’t violate Terms of Service: doing so is unethical and could get you blocked, fined or worse).
Let’s scrape the “Summary by Class” table. Here’s an excerpt along with the Developer Tools view:
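You’re saying “HEY. That has <table> in the HTML so why not just use rvest?” Well, you can validate the lack of <table> elements in the raw, unrendered HTML that rvest receives: the table only shows up once the page has been rendered.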
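One way to check this yourself is to count the table nodes in the raw HTML and again in the Splash-rendered HTML. The sketch below assumes a Splash server is listening on localhost and that the site still behaves as it did when this was written. The five-second wait is an arbitrary choice, not a recommended value:

```r
library(rvest)
library(splashr)

url <- "http://www.techstars.com/companies/"

# Plain rvest only sees the HTML the server sends, before any JavaScript runs.
raw_pg <- read_html(url)
length(html_nodes(raw_pg, "table"))

# Splash renders the page first; wait = 5 gives scripts five seconds to finish.
rendered_pg <- render_html(splash("localhost"), url, wait = 5)
tables <- html_nodes(rendered_pg, "table")
length(tables)

# Convert whatever tables were found into a list of data frames.
html_table(tables)
```

If the first count is zero and the second is not, the table is being built client-side and plain rvest cannot reach it without a rendering step.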
The snapshot functions return magick objects, so you can do anything you’d like with them.
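As a rough illustration of what “anything you’d like” can mean, here is a sketch that takes a PNG snapshot with render_png() and hands it to a few magick functions. It assumes a local Splash server; the output file name and the 400-pixel width are made up for the example:

```r
library(splashr)
library(magick)

# Ask Splash for a PNG snapshot of the rendered page.
img <- render_png(splash("localhost"),
                  "http://www.techstars.com/companies/",
                  wait = 5)

# The result is a magick image, so the usual magick helpers apply.
image_info(img)

# Scale to 400 pixels wide and write a thumbnail to disk
# (hypothetical file name).
thumb <- image_scale(img, "400")
image_write(thumb, path = "techstars-thumb.png", format = "png")
```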
Since Splash is rendering the entire site (it’s a real browser), it knows all the information about the various components of a page and can return that in HAR format. You can retrieve this data and use John Harrison’s spiffy HARtools package to visualize and further analyze the data. For the sake of brevity, here’s just the main print() output from a site:
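A short sketch of retrieving that HAR data, assuming a local Splash server and the same example URL used earlier. What print() shows will depend on the site and on the splashr version, so no output is reproduced here:

```r
library(splashr)

# Render the page and collect the HAR record of every request it made.
har <- render_har(splash("localhost"),
                  "http://www.techstars.com/companies/",
                  wait = 5)

# The print method gives a compact summary of the captured requests.
print(har)
```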
You can also do some basic scripting in Splash with Lua, and coding up an interface to that capability is on the TODO list, as is adding final tests and enabling tweaks to the Docker configuration to support more of the fun things that Splash can do.
File an issue on GitHub if you have feature requests or problems and feel free to jump on board with a PR if you’d like to help put the finishing touches on the package or add some features.
Don’t forget to stop_splash(splash_svr) when you’re finished scraping!