Check ‘Developer Tools’ First To Avoid Heavy-ish Dependencies

[This article was first published on R – rud.is, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Guillaume Pressiat (@GuillaumePressiat) did a solid post & video on using Selenium to scrape a paginated table from understat[.]com/league/EPL/2020 (I just cannot bring myself to provide an active link to any SportsBall site). He does a great job walking folks through acquiring & orchestrating the heavy dependency that is Selenium.

I did a quick “look at browser Developer Tools” tweet a few weeks back that included the entire code for retrieving the Forbes billionaires list via the JSON file the Forbes’ site loads via an XHR request responding to a similar fine article by another R user on using Selenium to do the same thing.

If you find yourself thwarted by rvest::read_html() not returning “nodes that are clearly there” it is likely due to the page rendering nodes dynamically via javascript. Selenium orchestrates full or headless browsers and lets you scrape the dynamically rendered DOM. You can see this yourself if you first view the source of an HTML page (via the browser’s “view source” menu) and then use Developer Tools to inspect the browser session. The “view source” view (in Blink-based browsers, at least) will be the raw, unrendered source HTML from the site and the DevTools “Elements” tab will have the rendered DOM elements.

The “Nework” tab of DevTools has an “XHR” tab of its own, but if you try to use it on this SportsBall site to see the JSON it loads, you’ll be bitterly disappointed because — while it does indeed render JSON into HTML DOM nodes dynamically — that JSON is embedded in the web page:

sportsball DOM node with JSON

We can work in two different ways without the use of Selenium.

First, we’ll “cheat” and use the {V8} package, which is an R interface to a javascript virtual machine, the type of which browsers use to run javascript on web pages. I say “cheat” because we’re still depending on a chunk of a browser engine.

Let’s get some boilerplate out of the way:

library(V8)        # V8 engine
library(rvest)     # Scraping
library(stringi)   # String manipulation which we'll use later
library(tidyverse) # Duh

ctx <- v8() # create a new instance of the javascript VM

pg <- read_html("https://understat.com/league/EPL/2020") # read sportsball page

If you examine the SportsBall page you’ll see that JSON.parse in a few different locations, let’s target them all:

html_nodes(pg, xpath = ".//script[contains(., 'JSON.parse')]")
## {xml_nodeset (4)}
## [1] 
		

            

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)