Check ‘Developer Tools’ First To Avoid Heavy-ish Dependencies

Posted on April 12, 2021 by hrbrmstr in R bloggers | 0 Comments

[This article was first published on R – rud.is, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Guillaume Pressiat (@GuillaumePressiat) did a solid post & video on using Selenium to scrape a paginated table from understat[.]com/league/EPL/2020 (I just cannot bring myself to provide an active link to any SportsBall site). He does a great job walking folks through acquiring & orchestrating the heavy dependency that is Selenium.

A few weeks back I posted a quick “look at browser Developer Tools” tweet that included the entire code for retrieving the Forbes billionaires list from the JSON file the Forbes site loads via an XHR request. It was a response to a similarly fine article by another R user on using Selenium to do the same thing.

If you find yourself thwarted by rvest::read_html() not returning “nodes that are clearly there” it is likely due to the page rendering nodes dynamically via javascript. Selenium orchestrates full or headless browsers and lets you scrape the dynamically rendered DOM. You can see this yourself if you first view the source of an HTML page (via the browser’s “view source” menu) and then use Developer Tools to inspect the browser session. The “view source” view (in Blink-based browsers, at least) will be the raw, unrendered source HTML from the site and the DevTools “Elements” tab will have the rendered DOM elements.
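A minimal sketch of that gap, using a made-up inline page rather than the real site: the script would build a table in a browser, but read_html() only parses the raw source, so the table never shows up while the script text does.

```r
library(rvest)

# Hypothetical page: the <table> only exists after a browser runs the script
html <- '<html><body>
<div id="app"></div>
<script>
  var t = document.createElement("table");
  t.id = "stats";
  document.getElementById("app").appendChild(t);
</script>
</body></html>'

pg <- read_html(html)

n_tables  <- length(html_nodes(pg, "table"))   # nodes the script would create
n_scripts <- length(html_nodes(pg, "script"))  # nodes present in the raw source

n_tables
n_scripts
```

If the node you want is missing here but visible in the DevTools “Elements” tab, the data is coming from JavaScript, and the next question is where that JavaScript gets it.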

The “Network” tab of DevTools has an “XHR” tab of its own, but if you try to use it on this SportsBall site to see the JSON it loads, you’ll be bitterly disappointed: while the page does indeed render JSON into HTML DOM nodes dynamically, that JSON is embedded in the web page itself rather than fetched in a separate request.

We can work in two different ways without the use of Selenium.

First, we’ll “cheat” and use the {V8} package, which is an R interface to a javascript virtual machine, the type of which browsers use to run javascript on web pages. I say “cheat” because we’re still depending on a chunk of a browser engine.
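Before pointing {V8} at the real page, here is a small sketch of the round trip on a toy string. The variable name demoData and its contents are made up; the hex escapes simply mimic the style the site uses and spell out [{"id":"1"}].

```r
library(V8)

ctx_demo <- v8()  # fresh JavaScript context, separate from any other

# Toy script body (hypothetical), hex-escaped like the page's data
js <- "var demoData = JSON.parse('\\x5B\\x7B\\x22id\\x22\\x3A\\x221\\x22\\x7D\\x5D');"

ctx_demo$eval(js)                  # V8 resolves the \x escapes and parses the JSON
demo <- ctx_demo$get("demoData")   # copy the JavaScript value back into R

str(demo)
```

The JavaScript engine does all the unescaping and parsing, which is why this route needs no regular expressions.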

Let’s get some boilerplate out of the way:

library(V8)        # V8 engine
library(rvest)     # Scraping
library(stringi)   # String manipulation which we'll use later
library(tidyverse) # Duh

ctx <- v8() # create a new instance of the javascript VM

pg <- read_html("https://understat.com/league/EPL/2020") # read sportsball page

If you examine the SportsBall page you’ll see JSON.parse calls in a few different locations. Let’s target them all:

html_nodes(pg, xpath = ".//script[contains(., 'JSON.parse')]")
## {xml_nodeset (4)}
## [1] <script>\n\tvar datesData \t= JSON.parse('\\x5B\\x7B\\x22id\\x22\\x3A\\x2214086 ...
## [2] <script>\n\tvar teamsData = JSON.parse('\\x7B\\x2271\\x22\\x3A\\x7B\\x22id\\x22 ...
## [3] <script>\n\tvar playersData\t= JSON.parse('\\x5B\\x7B\\x22id\\x22\\x3A\\x22647\ ...
## [4] <script>\n\t\tWebFont.load({\n\t\t\tgoogle: {\n\t\t\t\tfamilies: ['Barlow:500', ...

We don’t need that last one; the first three contain all the data we need.
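The same XPath filter can be tried offline. This is only a sketch on a made-up two-script page, to show that contains(., '...') tests the text inside each <script> node:

```r
library(rvest)

# Hypothetical page: one script embeds data, the other does not
toy <- read_html('<html><head>
<script>var a = JSON.parse("[]");</script>
<script>console.log("nothing to see");</script>
</head><body></body></html>')

hits <- html_nodes(toy, xpath = ".//script[contains(., 'JSON.parse')]")

length(hits)     # how many script nodes mention JSON.parse
html_text(hits)  # their JavaScript source as plain text
```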

Turning that into data is pretty straightforward work:

html_nodes(pg, xpath = ".//script[contains(., 'JSON.parse')]") %>% 
  .[1:3] %>%         # only want the first three nodes
  html_text() %>%    # turn the nodes into text
  walk(ctx$eval)     # tell V8 to evaluate the javascript

The VM we created now has those three variables:

ctx$get(JS("Object.keys(global)"))
## [1] "print"       "console"     "global"      "datesData"   "_week"      
## [6] "_year"       "teamsData"   "playersData"

and, we can retrieve them like this:

as_tibble(ctx$get("datesData"))
## # A tibble: 380 x 8
##    id    isResult h$id  $title $short_title a$id  $title $short_title goals$h $a   
##    <chr> <lgl>    <chr> <chr>  <chr>        <chr> <chr>  <chr>        <chr>   <chr>
##  1 14086 TRUE     228   Fulham FLH          83    Arsen… ARS          0       3    
##  2 14087 TRUE     78    Cryst… CRY          74    South… SOU          1       0    
##  3 14090 TRUE     87    Liver… LIV          245   Leeds  LED          4       3    
##  4 14091 TRUE     81    West … WHU          86    Newca… NEW          0       2    
##  5 14092 TRUE     76    West … WBA          75    Leice… LEI          0       3    
##  6 14093 TRUE     82    Totte… TOT          72    Evert… EVE          0       1    
##  7 14094 TRUE     238   Sheff… SHE          229   Wolve… WOL          0       2    
##  8 14095 TRUE     220   Brigh… BRI          80    Chels… CHE          1       3    
##  9 14096 TRUE     72    Evert… EVE          76    West … WBA          5       2    
## 10 14097 TRUE     245   Leeds  LED          228   Fulham FLH          4       3    
## # … with 370 more rows, and 6 more variables: xG$h <chr>, $a <chr>, datetime <chr>,
## #   forecast$w <chr>, $d <chr>, $l <chr>

Note that we need to do some extra processing of the second one to make it a bit tidier:

ctx$get("teamsData") %>% 
  map_df(~{
    .x$history$id <- .x$id
    .x$history$title <- .x$title
    .x$history
  }) %>% 
  as_tibble()
## # A tibble: 616 x 21
##    h_a      xG   xGA  npxG  npxGA ppda$att  $def ppda_allowed$att  $def  deep
##    <chr> <dbl> <dbl> <dbl>  <dbl>    <int> <int>            <int> <int> <int>
##  1 h     0.805 0.850 0.805 0.0885       89    20              247    14    17
##  2 a     2.03  0.535 2.03  0.535       307    33              143    24    10
##  3 h     3.08  1.66  3.08  1.66        365    25              119    25     7
##  4 a     0.874 0.672 0.874 0.672       212    23              210    24     7
##  5 h     1.50  2.38  1.50  2.38        225    17              124    34     7
##  6 h     2.45  1.00  1.69  1.00        161    23              164    22     5
##  7 a     1.99  1.39  1.99  1.39        331    24              169    15    16
##  8 h     1.77  1.50  1.77  1.50        257    14              208    17     6
##  9 a     2.39  0.572 1.63  0.572       144    11              289    23     8
## 10 a     1.27  1.14  0.508 1.14        162    28              166    20     5
## # … with 606 more rows, and 13 more variables: deep_allowed <int>, scored <int>,
## #   missed <int>, xpts <dbl>, result <chr>, date <chr>, wins <int>, draws <int>,
## #   loses <int>, pts <int>, npxGD <dbl>, id <chr>, title <chr>
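To make the reshaping step easier to follow, here is a sketch on a made-up stand-in for teamsData. The team ids, names, and numbers are invented; the only assumption carried over is the shape used above (a named list with id, title, and a history data frame per team).

```r
library(purrr)
library(dplyr)  # map_df() relies on dplyr to bind the rows

# Hypothetical stand-in for ctx$get("teamsData")
teams <- list(
  `1` = list(id = "1", title = "Team A",
             history = data.frame(h_a = c("h", "a"), xG = c(0.8, 2.0))),
  `2` = list(id = "2", title = "Team B",
             history = data.frame(h_a = "h", xG = 1.5))
)

flat <- map_df(teams, ~{
  .x$history$id <- .x$id        # recycle the team id onto every match row
  .x$history$title <- .x$title  # same for the team name
  .x$history                    # return one data frame per team
})

flat
```

Each team contributes its match history with the id and title repeated on every row, and map_df() stacks the pieces into one data frame.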

The last one does not need any extra help:

as_tibble(ctx$get("playersData"))
## # A tibble: 505 x 18
##    id    player_name games time  goals xG    assists xA    shots key_passes
##    <chr> <chr>       <chr> <chr> <chr> <chr> <chr>   <chr> <chr> <chr>     
##  1 647   Harry Kane  29    2557  19    17.6… 13      6.73… 113   39        
##  2 1250  Mohamed Sa… 30    2529  19    16.1… 3       4.55… 99    40        
##  3 1228  Bruno Fern… 31    2659  16    13.4… 11      10.8… 95    87        
##  4 453   Son Heung-… 30    2509  14    9.35… 9       8.03… 55    56        
##  5 822   Patrick Ba… 31    2572  14    14.8… 7       3.44… 93    24        
##  6 5555  Dominic Ca… 26    2248  14    15.7… 0       0.95… 65    13        
##  7 3277  Alexandre … 27    1818  13    11.8… 2       1.77… 43    21        
##  8 314   Ilkay Günd… 24    1776  12    8.64… 1       3.29… 45    35        
##  9 755   Jamie Vardy 27    2230  12    16.0… 7       4.26… 64    22        
## 10 8865  Ollie Watk… 30    2700  12    13.7… 3       4.15… 81    36        
## # … with 495 more rows, and 8 more variables: yellow_cards <chr>, red_cards <chr>,
## #   position <chr>, team_title <chr>, npg <chr>, npxG <chr>, xGChain <chr>,
## #   xGBuildup <chr>

We don’t really need {V8} for this, though, if we’re willing to use some regular expressions. We have to be a bit careful since some extra, non-JSON data comes along for the ride with that first <script> tag (see the embedded image above).

We perform the same initial setup (get the text of the first three <script> tags), then we erase everything that isn’t JSON data (so all the var and javascript punctuation). By using comments = TRUE in the call to stri_replace_all_regex we can provide documentation along with the (ugly) regex.
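To see the commented regex in isolation, here is a sketch on a made-up script body shaped like the ones on the page (the variable name and payload are invented):

```r
library(stringi)

# Hypothetical script body: leading whitespace, a var assignment, a trailing ;
js <- "\n\tvar demoData \t= JSON.parse('\\x5B\\x5D');\n"

payload <- stri_replace_all_regex(js, "
^[^\\(]+\\('            # everything up to and including the first ('
|                       # OR
'\\)[;,][[:space:]]*    # the closing ') plus trailing punctuation and whitespace
$
", "", comments = TRUE, multiline = TRUE)

payload
```

With comments = TRUE the engine ignores the whitespace and the # notes inside the pattern, so what remains is just the still-escaped string that sat between the quotes.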

The creator of the SportsBall site did some encoding to make the string easier to shove into a <script> tag, so we need to undo that by converting the hex escapes to URL-style percent escapes (replace \x with %) and then decoding them with curl::curl_unescape().
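A small sketch of that decoding trick on a made-up escaped string. One assumption worth flagging: this shortcut treats every % as the start of an escape, so it would need more care if the payload held literal percent signs.

```r
library(stringi)

# Hypothetical hex-escaped payload in the same style as the page's data
escaped <- "\\x5B\\x7B\\x22id\\x22\\x3A\\x221\\x22\\x7D\\x5D"

pct  <- stri_replace_all_fixed(escaped, "\\x", "%")  # \x5B becomes %5B, and so on
json <- curl::curl_unescape(pct)                     # percent-decode to plain JSON

json
jsonlite::fromJSON(json)
```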

We could have made the regular expression uglier to avoid the other javascript cruft in the first <script> tag, but it’s just as easy to split them all into lines and pull the first line out.

Then, it’s just a matter of running each one through jsonlite::fromJSON(). I kept the result as a list and just set the names to the names of the variables above.

html_nodes(pg, xpath = ".//script[contains(., 'JSON.parse')]") %>% 
  .[1:3] %>% 
  html_text() %>% 
  stri_replace_all_regex("
^[^\\(]+\\('            # remove everything from the beginning of the line to the first ('
|                       # OR
'\\)[;,][[:space:]]*    # remove the last ') and everything after it
$
", "", comments = TRUE, multiline = TRUE) %>% 
  stri_replace_all_fixed("\\x", "%") %>% 
  curl::curl_unescape() %>% 
  stri_split_lines() %>% 
  map_chr(1) %>% 
  map(jsonlite::fromJSON) %>%
  map(as_tibble) %>% 
  set_names(c("datesData", "teamsData", "playersData")) %>% 
  str(3)
## List of 3
##  $ datesData  : tibble [380 × 8] (S3: tbl_df/tbl/data.frame)
##   ..$ id      : chr [1:380] "14086" "14087" "14090" "14091" ...
##   ..$ isResult: logi [1:380] TRUE TRUE TRUE TRUE TRUE TRUE ...
##   ..$ h       :'data.frame':  380 obs. of  3 variables:
##   ..$ a       :'data.frame':  380 obs. of  3 variables:
##   ..$ goals   :'data.frame':  380 obs. of  2 variables:
##   ..$ xG      :'data.frame':  380 obs. of  2 variables:
##   ..$ datetime: chr [1:380] "2020-09-12 11:30:00" "2020-09-12 14:00:00" "2020-09-12 16:30:00" "2020-09-12 19:00:00" ...
##   ..$ forecast:'data.frame':  380 obs. of  3 variables:
##  $ teamsData  : tibble [3 × 20] (S3: tbl_df/tbl/data.frame)
##   ..$ 71 :List of 3
##   ..$ 72 :List of 3
##   ..$ 74 :List of 3
##   ..$ 75 :List of 3
##   ..$ 76 :List of 3
##   ..$ 78 :List of 3
##   ..$ 80 :List of 3
##   ..$ 81 :List of 3
##   ..$ 82 :List of 3
##   ..$ 83 :List of 3
##   ..$ 86 :List of 3
##   ..$ 87 :List of 3
##   ..$ 88 :List of 3
##   ..$ 89 :List of 3
##   ..$ 92 :List of 3
##   ..$ 220:List of 3
##   ..$ 228:List of 3
##   ..$ 229:List of 3
##   ..$ 238:List of 3
##   ..$ 245:List of 3
##  $ playersData: tibble [505 × 18] (S3: tbl_df/tbl/data.frame)
##   ..$ id          : chr [1:505] "647" "1250" "1228" "453" ...
##   ..$ player_name : chr [1:505] "Harry Kane" "Mohamed Salah" "Bruno Fernandes" "Son Heung-Min" ...
##   ..$ games       : chr [1:505] "29" "30" "31" "30" ...
##   ..$ time        : chr [1:505] "2557" "2529" "2659" "2509" ...
##   ..$ goals       : chr [1:505] "19" "19" "16" "14" ...
##   ..$ xG          : chr [1:505] "17.650331255048513" "16.19410896115005" "13.438796618022025" "9.352356541901827" ...
##   ..$ assists     : chr [1:505] "13" "3" "11" "9" ...
##   ..$ xA          : chr [1:505] "6.7384555246680975" "4.557050030678511" "10.812157344073057" "8.036493374034762" ...
##   ..$ shots       : chr [1:505] "113" "99" "95" "55" ...
##   ..$ key_passes  : chr [1:505] "39" "40" "87" "56" ...
##   ..$ yellow_cards: chr [1:505] "1" "0" "5" "0" ...
##   ..$ red_cards   : chr [1:505] "0" "0" "0" "0" ...
##   ..$ position    : chr [1:505] "F" "F S" "M S" "F M S" ...
##   ..$ team_title  : chr [1:505] "Tottenham" "Liverpool" "Manchester United" "Tottenham" ...
##   ..$ npg         : chr [1:505] "15" "13" "8" "14" ...
##   ..$ npxG        : chr [1:505] "14.605655785650015" "11.627095961943269" "6.5883138151839375" "9.352356541901827" ...
##   ..$ xGChain     : chr [1:505] "20.556765687651932" "21.694580920040607" "22.04182725213468" "17.928756553679705" ...
##   ..$ xGBuildup   : chr [1:505] "3.99019683804363" "8.287332298234105" "8.843060294166207" "5.881684513762593" ...

You can use the cleanup code from the {V8} example to reshape that second element, and readr::type_convert() can help you turn the character vectors into something more useful.
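As a sketch of that last step, here are two rows and four columns copied from the playersData output above, run through readr::type_convert(). Keeping id as character is a choice made for this example, not something the original code does.

```r
library(tibble)
library(readr)

# A small slice of the playersData values shown above, all character
players <- tibble(
  id          = c("647", "1250"),
  player_name = c("Harry Kane", "Mohamed Salah"),
  goals       = c("19", "19"),
  xG          = c("17.650331255048513", "16.19410896115005")
)

# Guess types for the character columns, but pin id so it stays an identifier
converted <- type_convert(players, col_types = cols(id = col_character()))

str(converted)
```

Columns left unspecified are guessed from their contents, so the counts and expected-goals figures come back numeric while the names stay character.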

FIN

It really always pays to take a look at the DevTools pane before introducing heavy dependencies. More sites are using very straightforward idioms that make the JSON source data behind a dynamically rendered page readily available. Further, sites often add extra fields that you don’t see rendered, but may be useful to have around as you work with the resulting data.
