Web Scraping JavaScript Content
Web scraping with rvest and SelectorGadget can be powerful and fun. Recently I have experimented
with trying to scrape a table from the Chronicle of Higher Education
that showed compensation for
university CEOs. With a certain amount of trial and error I used SelectorGadget
to find the fields I wanted to scrape: name, university, and compensation.
But when I went through the steps in rvest
to scrape those fields, I got nothing (a zero-length vector of results).
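For concreteness, here's a sketch of roughly what that failed attempt looked like (the URL and the .name selector are the ones that show up later in this post):

library(rvest)

# scrape the live page directly: the selector matches nothing because the
# compensation table is filled in by JavaScript after the page loads
url <- "https://www.chronicle.com/interactives/executive-compensation?cid=wsinglestory#id=table_public_2017"
read_html(url) %>%
  html_nodes(".name") %>%
  html_text()
#> character(0)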
I know next to nothing about how JavaScript is used to populate a web page, so my understanding
of what is happening is almost non-existent. I see the results on the screen, but when I use
the browser to look at the page source, I don’t see the individual compensation data.
Here’s a Stack Overflow response that explains a little bit of what is going on:
The “View Source” only shows the source the server sent at the time the browser requested the particular webpage from the server. Therefore, since these changes were made on the client side, they don’t show up on the “View Source” because they’ve been made after the original page has been delivered.
In my case, JavaScript processed by my browser is adding HTML after the original page was sent. (Or at least I think that is what is going on.)
The solution is a “headless browser”. This is a program that functions like a web browser,
but its purpose is to produce HTML rather than to put something on the screen (hence, headless, if
one thinks of the screen as the head).
This article by Brooke Watson
goes over the steps you need to follow to scrape web data created via JavaScript (and also
discusses other considerations related to web scraping). Another good source is this post by Florian Teschner. I also found a comprehensive list of headless browsers, although I haven’t looked beyond PhantomJS.
An Example
In this post I will work through scraping the Chronicle page as an example.
First we have to get the HTML containing the CEO compensation table.
To do that I will use PhantomJS as a headless browser.
I installed it on my Windows computer at work via a PhantomJS install page. I also installed it on my Mac at home
using Homebrew. I did a little bit of googling first to assure
myself that PhantomJS was reputable and safe to install. The source code is available on GitHub. Note that the
author has suspended development of PhantomJS as of March, 2018, for reasons that
are discussed at the GitHub README.
To run PhantomJS, I'll use the R system() function, telling PhantomJS to process a JavaScript file called public_fetch.js. This is based on the example code cited above, with the URL changed to point to the Chronicle of Higher Ed page and the resulting HTML written to an output file called public_ceo_export.html.
Here's the JavaScript that PhantomJS will process:
var webPage = require('webpage');
var page = webPage.create();
var fs = require('fs');
var path = 'public_ceo_export.html';

page.open('https://www.chronicle.com/interactives/executive-compensation?cid=wsinglestory#id=table_public_2017', function (status) {
  var content = page.content;
  fs.write(path, content, 'w');
  phantom.exit();
});
We can call PhantomJS from within R:
# a macos version of running phantomjs
system("phantomjs public_fetch.js")
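One practical wrinkle: system() can only run phantomjs if the executable can be found. A sketch of checking for it and, if it isn't on the PATH, calling it by an explicit location (the Windows path below is hypothetical; use wherever you installed PhantomJS):

# check whether R can find the phantomjs executable on the PATH
Sys.which("phantomjs")

# if that comes back empty, call the executable by its full path instead
system2("C:/Tools/phantomjs/bin/phantomjs.exe", args = "public_fetch.js")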
After this runs, public_ceo_export.html contains the HTML with the data table in it.
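Before parsing it, a quick sanity check (a sketch) is to confirm that the saved file really does contain the compensation text that was missing from the live page source:

# "Total Compensation" only appears if PhantomJS captured the rendered page
raw_lines <- readLines("public_ceo_export.html", warn = FALSE)
any(grepl("Total Compensation", raw_lines))
#> TRUE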
Here's what the beginning of the data table looks like in raw HTML:
[raw HTML snippet not reproduced here; its text content for the first entry reads: James R. Ramsey, University of Louisville, 2017, Total Compensation: $4,290,232, Base Pay: $55,703, Bonus Pay: $0, Other Pay: $4,233,739]
Yuck. The data is in there, but HTML is not easy for mere humans to decipher. What class names are we looking for? I used SelectorGadget to tease out the class names that mark each field. Next we'll look at some basic rvest code.

library(rvest)
library(tidyverse)
library(stringr)
# read the page that was written by the headless browser
public_page <- read_html("public_ceo_export.html")
# look for HTML bits with the ".name" class and produce them as text
name <- public_page %>%
  html_nodes(".name") %>%
  html_text()
# look for HTML bits with the ".college" class and produce a list of universities
college <- public_page %>%
  html_nodes(".college") %>%
  html_text()
print(college[1:3])   # show we have something reasonable
base <- public_page %>%
  html_nodes(".ech_base") %>%
  html_text() %>%
  str_replace_all("[[:alpha:]]|:| |\\$|,", "") %>%
  as.numeric()
bonus <- public_page %>%
  html_nodes(".ech_bonus") %>%
  html_text() %>%
  str_replace_all("[[:alpha:]]|:| |\\$|,", "") %>%
  as.numeric()
other <- public_page %>%
  html_nodes(".ech_other") %>%
  html_text() %>%
  str_replace_all("[[:alpha:]]|:| |\\$|,", "") %>%
  as.numeric()
# .ech_detail retrieves all four elements so length is 200 rather than 50
# (I figured out .ech_detail by trial and error.)
# we need to pull out every fourth element to get the total
total <- public_page %>%
  html_nodes(".ech_detail") %>%
  html_text()
total <- total[seq(1, 200, 4)]
# make sure every fourth element really is the "Total Compensation" field
if (!all(str_detect(total, "Total Compensation"))) print("something is wrong with total")
total <- total %>%
  str_replace_all("[[:alpha:]]|:| |\\$|,", "") %>%
  as.numeric()

The general pattern is that we get the HTML (loaded here into public_page), pick out the nodes for a given class with html_nodes(), extract their text with html_text(), and then use str_replace_all() to strip out everything but the digits so the result can be converted to numeric.
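As a final step, we could combine the vectors into a tibble and flag any rows where the pieces don't add up to the reported total (a quick sketch; the column names and the $1 tolerance are arbitrary choices):

# tidyverse (already loaded above) provides tibble and dplyr
ceo_pay <- tibble(
  name    = name,
  college = college,
  base    = base,
  bonus   = bonus,
  other   = other,
  total   = total
)

# flag any rows where base + bonus + other differs from the reported total
ceo_pay %>%
  mutate(parts = base + bonus + other) %>%
  filter(abs(parts - total) > 1)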
The Punch Line?

At this point, you are probably looking for the punch line: some insight about university CEO compensation. I don't have one to offer here; the point of this post is the scraping itself.

When Is Scraping OK?

Just because you can do something doesn't mean you may. You should check a site's terms of use before scraping it. FlightAware's user agreement, for example, makes it clear that automated scraping is off limits. Okey dokey, no web scraping the FlightAware web site. The "User Agreement" for the Chronicle web site is harder to interpret; the section on permitted uses didn't settle the question for me.
So can I use web scraping to download data from this Chronicle article for my own use? I don't really know.

robots.txt

Another way to check on whether web scraping is welcomed or frowned upon is to check the site's robots.txt file.
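It's easy to take a look from R (a sketch; robots.txt lives at the root of a site by convention):

# fetch and print the site's robots.txt
robots <- readLines("https://www.chronicle.com/robots.txt", warn = FALSE)
cat(robots, sep = "\n")
# the robotstxt package on CRAN can also parse and interpret this file for you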
In a robots.txt file, if there's nothing after a Disallow: line, then nothing is being blocked for that user agent. There's a semi-official site that describes robots.txt.

That wraps it up. Web scraping can be fun and useful.