Craft httr calls cleverly with curlconverter

February 10, 2016

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

Suppose you visit a site like the LA Times’ NH Primary Live Results site and wish you had the data behind its tables and visualizations:

[Screenshot: LA Times NH Primary Live Results page]

Sometimes it’s as simple as opening up your browser’s “Developer Tools” console and looking for XHR (XMLHttpRequest) calls:

[Screenshot: XHR requests listed in the browser’s Developer Tools]

You can actually see a preview of those requests (usually JSON):

[Screenshot: Developer Tools preview of a request from http://graphics.latimes.com/election-2016-new-hampshire-results/]

While you could go through all the headers and cookies and transcribe them into httr::GET or httr::POST requests, that’s tedious, especially when most browsers offer a “Copy as cURL” option. cURL is a command-line tool (with a corresponding systems programming library, libcurl) that you can use to grab data from URIs. The RCurl and curl packages in R are built on that underlying library. The cURL command line captures all of the information necessary to replicate the request the browser made for a resource. The cURL command line for the URL that gets the Republican data is:

curl 'http://graphics.latimes.com/election-2016-31146-feed.json' \
  -H 'Pragma: no-cache' \
  -H 'DNT: 1' \
  -H 'Accept-Encoding: gzip, deflate, sdch' \
  -H 'X-Requested-With: XMLHttpRequest' \
  -H 'Accept-Language: en-US,en;q=0.8' \
  -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.39 Safari/537.36' \
  -H 'Accept: */*' \
  -H 'Cache-Control: no-cache' \
  -H 'If-None-Match: "7b341d7181cbb9b72f483ae28e464dd7"' \
  -H 'Cookie: s_fid=79D97B8B22CA721F-2DD12ACE392FF3B2; s_cc=true' \
  -H 'Connection: keep-alive' \
  -H 'If-Modified-Since: Wed, 10 Feb 2016 16:40:15 GMT' \
  -H 'Referer: http://graphics.latimes.com/election-2016-new-hampshire-results/' \
  --compressed

While that’s easier than manual copy/paste transcription, these requests are uniform enough that there Has To Be A Better Way. And, now there is, with curlconverter.

The curlconverter package has (for the moment) two main functions:

  • straighten() : which returns a list with all of the necessary parts to craft an httr POST or GET call
  • make_req() : which actually returns a working httr call, pre-filled with all of the necessary information.

By default, either function reads from the clipboard (envision the workflow where you do the “Copy as cURL” then switch to R and type make_req() or req_params <- straighten()), but they can take in a vector of cURL command lines, too (NOTE: make_req() is currently limited to one while straighten() can handle as many as you want).
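The vector form can be sketched roughly as follows. This is a minimal illustration that assumes curlconverter is installed; the second command and its example.com URL are hypothetical stand-ins, not requests taken from the LA Times site.

```r
# Two cURL command lines held in a character vector.
# The first is a shortened form of the election-results request;
# the second is a made-up placeholder (hypothetical URL).
curl_cmds <- c(
  "curl 'http://graphics.latimes.com/election-2016-31146-feed.json' -H 'Accept: */*' --compressed",
  "curl 'http://example.com/feed.json' -H 'Accept: application/json'"
)

# Only call into curlconverter when it is available, so the sketch
# still runs on a machine without the package.
if (requireNamespace("curlconverter", quietly = TRUE)) {
  # Expected to yield one parsed element per command line
  parts <- curlconverter::straighten(curl_cmds)
  str(parts, max.level = 2)
}
```

Calling straighten() with no arguments instead would read the command from the clipboard, as described above.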

Let’s show what happens using the election results cURL command line:

REP <- "curl 'http://graphics.latimes.com/election-2016-31146-feed.json' -H 'Pragma: no-cache' -H 'DNT: 1' -H 'Accept-Encoding: gzip, deflate, sdch' -H 'X-Requested-With: XMLHttpRequest' -H 'Accept-Language: en-US,en;q=0.8' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.39 Safari/537.36' -H 'Accept: */*' -H 'Cache-Control: no-cache'  -H 'Cookie: s_fid=79D97B8B22CA721F-2DD12ACE392FF3B2; s_cc=true' -H 'Connection: keep-alive' -H 'If-Modified-Since: Wed, 10 Feb 2016 16:40:15 GMT' -H 'Referer: http://graphics.latimes.com/election-2016-new-hampshire-results/' --compressed"
 
resp <- curlconverter::straighten(REP)
jsonlite::toJSON(resp, pretty=TRUE)
 
    ## [
    ##   {
    ##     "url": ["http://graphics.latimes.com/election-2016-31146-feed.json"],
    ##     "method": ["get"],
    ##     "headers": {
    ##       "Pragma": ["no-cache"],
    ##       "DNT": ["1"],
    ##       "Accept-Encoding": ["gzip, deflate, sdch"],
    ##       "X-Requested-With": ["XMLHttpRequest"],
    ##       "Accept-Language": ["en-US,en;q=0.8"],
    ##       "User-Agent": ["Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.39 Safari/537.36"],
    ##       "Accept": ["*/*"],
    ##       "Cache-Control": ["no-cache"],
    ##       "Connection": ["keep-alive"],
    ##       "If-Modified-Since": ["Wed, 10 Feb 2016 16:40:15 GMT"],
    ##       "Referer": ["http://graphics.latimes.com/election-2016-new-hampshire-results/"]
    ##     },
    ##     "cookies": {
    ##       "s_fid": ["79D97B8B22CA721F-2DD12ACE392FF3B2"],
    ##       "s_cc": ["true"]
    ##     },
    ##     "url_parts": {
    ##       "scheme": ["http"],
    ##       "hostname": ["graphics.latimes.com"],
    ##       "port": {},
    ##       "path": ["election-2016-31146-feed.json"],
    ##       "query": {},
    ##       "params": {},
    ##       "fragment": {},
    ##       "username": {},
    ##       "password": {}
    ##     }
    ##   }
    ## ]

You can then use the items in the returned list to make a GET request manually (but still tediously).
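As a rough sketch of that manual route: the list below is typed out by hand to mirror the structure printed above (it is not produced by calling straighten() here), it keeps only a few of the headers, and the helper name fetch_feed() is hypothetical. The request is wrapped in a function so that nothing touches the network until you call it.

```r
# Hand-built stand-in for one element of the straighten() result,
# trimmed to a few of the fields shown in the JSON above.
req <- list(
  url = "http://graphics.latimes.com/election-2016-31146-feed.json",
  method = "get",
  headers = list(
    `X-Requested-With` = "XMLHttpRequest",
    Accept = "*/*",
    Referer = "http://graphics.latimes.com/election-2016-new-hampshire-results/"
  ),
  cookies = list(
    s_fid = "79D97B8B22CA721F-2DD12ACE392FF3B2",
    s_cc = "true"
  )
)

# httr takes named character vectors for headers and cookies
hdrs <- unlist(req$headers)
cks  <- unlist(req$cookies)

# Hypothetical helper: issues the GET only when called
fetch_feed <- function() {
  httr::GET(
    req$url,
    httr::add_headers(.headers = hdrs),
    httr::set_cookies(.cookies = cks)
  )
}
```

Even in this cut-down form, every header and cookie has to be carried over by name, which is the tedium make_req() is meant to remove.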

curlconverter’s make_req() will try to do this conversion for you automagically using httr’s little-used VERB() function. It’s easier to show than to tell:

curlconverter::make_req(REP)
VERB(verb = "GET", url = "http://graphics.latimes.com/election-2016-31146-feed.json", 
     add_headers(Pragma = "no-cache", 
                 DNT = "1", `Accept-Encoding` = "gzip, deflate, sdch", 
                 `X-Requested-With` = "XMLHttpRequest", 
                 `Accept-Language` = "en-US,en;q=0.8", 
                 `User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.39 Safari/537.36", 
                 Accept = "*/*", 
                 `Cache-Control` = "no-cache", 
                 Connection = "keep-alive", 
                 `If-Modified-Since` = "Wed, 10 Feb 2016 16:40:15 GMT", 
                 Referer = "http://graphics.latimes.com/election-2016-new-hampshire-results/"))

You probably don’t need all of those headers, but it is easier to delete the ones you don’t need than to build the call by hand through trial and error. Try assigning the output of that function to a variable and inspecting what’s returned. I think you’ll find this is a big enhancement to your workflows (if you do a lot of this “scraping without scraping”).
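For instance, a pared-down version of the generated call might look like the sketch below. Which headers the server actually requires is an assumption here (this keeps three of them and has not been checked against the site), get_rep_results() is a hypothetical name, and the feed may no longer be served at this URL.

```r
# Hypothetical wrapper around a trimmed version of the generated call.
# Nothing is fetched until the function is invoked.
get_rep_results <- function() {
  res <- httr::VERB(
    verb = "GET",
    url = "http://graphics.latimes.com/election-2016-31146-feed.json",
    httr::add_headers(
      `X-Requested-With` = "XMLHttpRequest",
      Accept = "*/*",
      Referer = "http://graphics.latimes.com/election-2016-new-hampshire-results/"
    )
  )
  httr::stop_for_status(res)  # raise an error on a 4xx/5xx response

  # Parse the JSON body into R lists / data frames
  jsonlite::fromJSON(httr::content(res, as = "text", encoding = "UTF-8"))
}
```

Calling get_rep_results() and running str() on the result is a quick way to see what the feed contains; if the request fails, add headers back one at a time.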

You can find the package on GitHub. It’s built with V8 and uses a modified version of the curlconverter Node module by Nick Carneiro.

It’s still in beta and could use some tyre kicking. Convos in the comments, issues or feature requests in GH (pls).
