Spelunking XHRs (XMLHttpRequests) with splashr

[This article was first published on R – rud.is, and kindly contributed to R-bloggers.]

splashr has gained some new functionality since the introductory post. First, there’s a whole new Docker image for it that embeds a local web server. Why? The main request for it was to enable rendering of htmlwidgets:

library(splashr)
library(DiagrammeR)
library(htmlwidgets)

splash_vm <- start_splash(add_tempdir=TRUE)

DiagrammeR("
  graph LR
    A-->B
    A-->C
    C-->E
    B-->D
    C-->D
    D-->F
    E-->F
") %>% 
  saveWidget("/tmp/diag.html")

splash("localhost") %>% 
  render_file("/tmp/diag.html", output="html")
## {xml_document}
## <html>
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<meta charset="utf-8">\n<script src= ...
## [2] <body style="background-color: white; margin: 0px; padding: 40px;">\n<div id="htmlwidget_container">\n<div id="ht ...

splash("localhost") %>% 
  render_file("/tmp/diag.html", output="png", wait=2)

More generally, if you use the new Docker image and the add_tempdir=TRUE parameter, splashr can render any local HTML file.

The other new bits are helpers to identify content types in HAR entries. Along with get_content_type():

library(tidyverse)

map_chr(rud_har$log$entries, get_content_type)
##  [1] "text/html"                "text/html"                "application/javascript"   "text/css"                
##  [5] "text/css"                 "text/css"                 "text/css"                 "text/css"                
##  [9] "text/css"                 "application/javascript"   "application/javascript"   "application/javascript"  
## [13] "application/javascript"   "application/javascript"   "application/javascript"   "text/javascript"         
## [17] "text/css"                 "text/css"                 "application/x-javascript" "application/x-javascript"
## [21] "application/x-javascript" "application/x-javascript" "application/x-javascript" NA                        
## [25] "text/css"                 "image/png"                "image/png"                "image/png"               
## [29] "font/ttf"                 "font/ttf"                 "text/html"                "font/ttf"                
## [33] "font/ttf"                 "application/font-woff"    "application/font-woff"    "image/svg+xml"           
## [37] "text/css"                 "text/css"                 "image/gif"                "image/svg+xml"           
## [41] "application/font-woff"    "application/font-woff"    "application/font-woff"    "application/font-woff"   
## [45] "application/font-woff"    "application/font-woff"    "application/font-woff"    "application/font-woff"   
## [49] "text/css"                 "application/x-javascript" "image/gif"                NA                        
## [53] "image/jpeg"               "image/svg+xml"            "image/svg+xml"            "image/svg+xml"           
## [57] "image/svg+xml"            "image/svg+xml"            "image/svg+xml"            "image/gif"               
## [61] NA                         "application/x-javascript" NA                         NA

there are many is_...() functions for logical tests.
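For a quick sense of what a page pulls in, those content types can be tallied. The sketch below does not use a real capture: it works on a small hand-built list that mimics the shape of HAR entries (response$content$mimeType), and mime_of() is a hypothetical stand-in written for illustration, not splashr's get_content_type().

```r
# Mock HAR entries: only the fields needed for this illustration (not real data).
mock_entries <- list(
  list(response = list(content = list(mimeType = "text/html; charset=UTF-8"))),
  list(response = list(content = list(mimeType = "text/css"))),
  list(response = list(content = list(mimeType = "text/css"))),
  list(response = list(content = list()))  # no mimeType recorded
)

# Hypothetical helper: return the bare MIME type (parameters stripped) or NA.
mime_of <- function(entry) {
  mt <- entry$response$content$mimeType
  if (is.null(mt) || !nzchar(mt)) return(NA_character_)
  trimws(strsplit(mt, ";", fixed = TRUE)[[1]][1])
}

types <- vapply(mock_entries, mime_of, character(1))

# Count each type, keeping entries with no recorded type visible as NA.
type_counts <- table(types, useNA = "ifany")
print(type_counts)
```

With a real HAR object, the same tally should work on the character vector that map_chr(..., get_content_type) returns.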

But, one of the more interesting is_...() functions is is_xhr(). Sites with dynamic content usually load said content via an XMLHttpRequest, or XHR for short. Modern web apps usually return JSON in said requests and, for questions like this one on StackOverflow, it’s usually better to grab the JSON and use it as data than it is to scrape the table built by JavaScript calls.
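To make the idea of spotting an XHR in a HAR entry concrete, here is one common heuristic: many JavaScript libraries (jQuery, for one) add an X-Requested-With: XMLHttpRequest header to the requests they make. This is only a sketch on mock entries. looks_like_xhr() is a hypothetical helper, it is not necessarily how is_xhr() decides, and it will miss requests that do not send that header.

```r
# Mock HAR entries; request$headers is a list of name/value pairs (made-up data).
mock_entries <- list(
  list(request = list(url = "http://example.com/", headers = list(
    list(name = "Accept", value = "text/html")
  ))),
  list(request = list(url = "http://example.com/api/items", headers = list(
    list(name = "Accept", value = "application/json"),
    list(name = "X-Requested-With", value = "XMLHttpRequest")
  )))
)

# Hypothetical helper: TRUE when an X-Requested-With: XMLHttpRequest header is present.
looks_like_xhr <- function(entry) {
  hdrs <- entry$request$headers
  if (length(hdrs) == 0) return(FALSE)
  nms  <- tolower(vapply(hdrs, function(h) h$name, character(1)))
  vals <- tolower(vapply(hdrs, function(h) h$value, character(1)))
  any(nms == "x-requested-with" & vals == "xmlhttprequest")
}

# Positions of the entries that look like XHRs.
xhr_idx <- which(vapply(mock_entries, looks_like_xhr, logical(1)))
print(xhr_idx)
```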

Now, it’s not too hard to open Developer Tools and find those XHRs, but we can also use splashr to find them programmatically. We have to do a bit more work and use the new execute_lua() function, since we need to give the page time to load up all the data. (I’ll eventually write a mini-R-DSL around this idiom so you don’t have to grok Lua for non-complex scraping tasks.) Here’s how we’d answer that StackOverflow question today…

First, we grab the entire HAR contents (including bodies of the individual requests) after waiting a bit:

splash_local %>%
  execute_lua('
function main(splash)
  splash.response_body_enabled = true
  splash:go("http://www.childrenshospital.org/directory?state=%7B%22showLandingContent%22%3Afalse%2C%22model%22%3A%7B%22search_specialist%22%3Afalse%2C%22search_type%22%3A%5B%22directoryphysician%22%2C%22directorynurse%22%5D%7D%2C%22customModel%22%3A%7B%22nurses%22%3Atrue%7D%7D")
  splash:wait(2)
  return splash:har()
end
') -> res

pg <- as_har(res)

then we look for XHRs:

map_lgl(pg$log$entries, is_xhr) %>% which()
## 10

and, finally, we grab the JSON:

pg$log$entries[[10]]$response$content$text %>% 
  openssl::base64_decode() %>% 
  rawToChar() %>% 
  jsonlite::fromJSON() %>% 
  glimpse()
## List of 4
##  $ TotalPages  : int 16
##  $ TotalRecords: int 384
##  $ Records     :'data.frame': 24 obs. of  21 variables:
##   ..$ ID            : chr [1:24] "{5E4B0D96-18D3-4FC6-B1AA-345675F3765C}" "{674EEC8B-062A-4268-9467-5C61030B83C9}" "{3E6257FE-67A1-4F13-B377-9EA7CCBD50F2}" "{C28479E6-5458-4010-A005-84E5F35B2FEA}" ...
##   ..$ FirstName     : chr [1:24] "Mirna" "Barbara" "Donald" "Victoria" ...
##   ..$ LastName      : chr [1:24] "Aeschlimann" "Angus" "Annino" "Arthur" ...
##   ..$ Image         : chr [1:24] "" "/~/media/directory/physicians/ppoc/angus_barbara.ashx" "/~/media/directory/physicians/ppoc/annino_donald.ashx" "/~/media/directory/physicians/ppoc/arthur_victoria.ashx" ...
##   ..$ Suffix        : chr [1:24] "MD" "MD" "MD" "MD" ...
##   ..$ Url           : chr [1:24] "http://www.childrenshospital.org/doctors/mirna-aeschlimann" "http://www.childrenshospital.org/doctors/barbara-angus" "http://www.childrenshospital.org/doctors/donald-annino" "http://www.childrenshospital.org/doctors/victoria-arthur" ...
##   ..$ Gender        : chr [1:24] "female" "female" "male" "female" ...
##   ..$ Latitude      : chr [1:24] "42.468769" "42.235088" "42.463177" "42.447168" ...
##   ..$ Longitude     : chr [1:24] "-71.100558" "-71.016021" "-71.143169" "-71.229734" ...
##   ..$ Address       : chr [1:24] "{"practice_name":"Pediatrics, Inc.", "address_1":"577 Main Street", "city":&q"| __truncated__ "{"practice_name":"Crown Colony Pediatrics", "address_1":"500 Congress Street, Suite 1F""| __truncated__ "{"practice_name":"Pediatricians Inc.", "address_1":"955 Main Street", "city":"| __truncated__ "{"practice_name":"Lexington Pediatrics", "address_1":"19 Muzzey Street, Suite 105", &qu"| __truncated__ ...
##   ..$ Distance      : chr [1:24] "" "" "" "" ...
##   ..$ OtherLocations: chr [1:24] "" "" "" "" ...
##   ..$ AcademicTitle : chr [1:24] "" "" "" "Clinical Instructor of Pediatrics - Harvard Medical School" ...
##   ..$ HospitalTitle : chr [1:24] "Pediatrician" "Pediatrician" "Pediatrician" "Pediatrician" ...
##   ..$ Specialties   : chr [1:24] "Primary Care, Pediatrics, General Pediatrics" "Primary Care, Pediatrics, General Pediatrics" "General Pediatrics, Pediatrics, Primary Care" "Primary Care, Pediatrics, General Pediatrics" ...
##   ..$ Departments   : chr [1:24] "" "" "" "" ...
##   ..$ Languages     : chr [1:24] "English" "English" "" "" ...
##   ..$ PPOCLink      : chr [1:24] "http://www.childrenshospital.org/patient-resources/provider-glossary" "/patient-resources/provider-glossary" "http://www.childrenshospital.org/patient-resources/provider-glossary" "http://www.childrenshospital.org/patient-resources/provider-glossary" ...
##   ..$ Gallery       : chr [1:24] "" "" "" "" ...
##   ..$ Phone         : chr [1:24] "781-438-7330" "617-471-3411" "781-729-4262" "781-862-4110" ...
##   ..$ Fax           : chr [1:24] "781-279-4046" "(617) 471-3584" "" "(781) 863-2007" ...
##  $ Synonims    : list()
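The response text in that HAR entry is base64-encoded, which is why the pipeline decodes it before parsing. The same decode chain can be tried without a Splash instance. This is a minimal round-trip sketch on a made-up payload (the JSON here is invented for illustration and is not the hospital's data), assuming the openssl and jsonlite packages are installed.

```r
library(jsonlite)

# Made-up payload standing in for an XHR response body.
payload <- '{"TotalPages":2,"Records":[{"FirstName":"Ann"},{"FirstName":"Bob"}]}'

# Encode it to mimic a base64-encoded body as stored in a HAR entry.
encoded <- openssl::base64_encode(charToRaw(payload))

# Decode with the same three steps used on the real entry:
# base64 text -> raw bytes -> character string -> R list.
raw_body <- openssl::base64_decode(encoded)
json_txt <- rawToChar(raw_body)
decoded  <- fromJSON(json_txt)

str(decoded)
```

As in the real response, fromJSON() simplifies the array of records into a data frame, so decoded$Records can go straight into the usual tidyverse verbs.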

UPDATE: So, I wrote a mini-DSL for this:

It’s unlikely we want to rely on a running Splash instance for our production work, so I’ll be making a helper function to turn HAR XHR requests into httr function calls, similar to the way curlconverter works.
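As a rough idea of what such a conversion might involve (this is not the planned helper, just a sketch under assumptions), the method, URL and headers recorded in a HAR request can be mapped onto httr::VERB() and httr::add_headers(). The function name har_request_to_httr() and the mock request are hypothetical, and the sketch ignores cookies, query rewriting and request bodies.

```r
# Mock HAR request record: method, url, and name/value header pairs (made-up data).
mock_request <- list(
  method  = "GET",
  url     = "http://example.com/api/items?page=1",
  headers = list(
    list(name = "Accept", value = "application/json"),
    list(name = "X-Requested-With", value = "XMLHttpRequest")
  )
)

# Hypothetical helper: build a zero-argument function that replays the request with httr.
har_request_to_httr <- function(req) {
  # Flatten the name/value pairs into a named character vector for add_headers().
  hdr_values <- vapply(req$headers, function(h) h$value, character(1))
  names(hdr_values) <- vapply(req$headers, function(h) h$name, character(1))
  function() {
    httr::VERB(req$method, url = req$url, httr::add_headers(.headers = hdr_values))
  }
}

replay <- har_request_to_httr(mock_request)
# Calling replay() would perform the HTTP request; it is deliberately not called here.
```

Returning a function rather than firing the request right away keeps the conversion step separate from the network call, which makes the generated request easy to inspect before it is used.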
