Curling – exploring web request options

[This article was first published on rOpenSci Blog - R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

rOpenSci specializes in creating R libraries for accessing data resources on the web from R. Most of the time, when you request data from the web with our packages, you should have no problem. However, you will eventually run into problems. In addition, there are more advanced things you can do by modifying the requests made to web resources.

Underlying almost all of our packages are requests to web resources served over the HTTP protocol via curl. curl is a command line tool and library for transferring data with URL syntax, and it supports lots of protocols. curl has many options that you may not know about.

I'll go over some of the common and less commonly used curl options, and try to explain why you may want to use some of them.

Discover curl options

You can go to the source, that is, the curl manual page at http://curl.haxx.se/docs/manpage.html. In R, RCurl::listCurlOptions() lists the curl options; the equivalent call in httr is httr::httr_options(). httr::httr_options() gives more information for each curl option, including the libcurl variable name (e.g., CURLOPT_CERTINFO) and the type of variable (e.g., logical).
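
As a small sketch of narrowing that list down, assuming a reasonably recent httr in which httr_options() accepts a regular expression and curl_docs() is available:

```r
library("httr")

# List every curl option whose httr or libcurl name matches "timeout".
timeout_opts <- httr_options("timeout")
timeout_opts

# Open the libcurl documentation for one option in a browser
# (only useful in an interactive session).
# curl_docs("timeout")
```

The result is a data frame, so you can filter or sort it like any other.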

Other ways to use curl besides R

Perhaps the canonical way to use curl is on the command line. You can get curl for your operating system at http://curl.haxx.se/download.html, though hopefully you already have curl. Once you have curl, you can have lots of fun. For example, get the contents of the Google landing page:

curl https://www.google.com

Note: if you are on Windows you may need some extra setup to use curl on the command line. OS X and Linux have it by default. On Windows 8, installing the latest version from http://curl.haxx.se/download.html#Win64 worked for me.

Install httr

Note: RCurl is a dependency, so you'll get it when you install httr

install.packages("httr")

There are some new features in the development version of httr that you may want. If so, do:

install.packages("devtools")
devtools::install_github("hadley/httr")

Load httr

library("httr")

general option setting

With httr you can set options globally for an R session:

set_config(timeout(seconds = 2))

Or use with_config()

with_config(verbose(), {
  GET("http://www.google.com/search")
})

Or use one of the with_* shortcuts, such as with_verbose() for verbose output

with_verbose(
  GET("http://www.google.com/search")
)
#> Response [http://www.google.com/webhp]
#>   Date: 2014-12-17 07:54
#>   Status: 200
#>   Content-Type: text/html; charset=ISO-8859-1
#>   Size: 19.3 kB
#> <!doctype html><html   l...
#> function _gjh(){!_gjuc()&&window.google&&google.x&&google.x({id:"GJH"},f...
#> if (!iesg){document.f&&document.f.q.focus();document.gbqf&&document.gbqf...
#> }
#> })();</script><div id="mngb">   <div id=gbar><nobr><b class=gb1>Search</...
#> a.i.Z,window.gbar.elr&&a.i.$(window.gbar.elr()),window.gbar.elc&&window....
#> });})();</script> </div> </span><br clear="all" id="lgpd"><div id="lga">...
#> });})();</script></div></div><span id="footer"><div style="-size:10p...

Or pass into each function call

GET("http://www.google.com/search", query=list(q="httr"), timeout(seconds = 0.5))

With RCurl you can set options for a function call by passing curl options to the .opts parameter

getForm("http://www.google.com/search?q=RCurl", btnG="Search", .opts = list(timeout.ms = 20))

For all examples below I'll use httr, and pass in config options to function calls.
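
To put the httr scopes side by side, here is a minimal sketch. The only addition to what is shown above is reset_config(), which removes options set globally; the requests themselves are left as comments so the block runs without a network connection.

```r
library("httr")

# A config object can be created once and reused.
slow <- timeout(seconds = 2)

# 1. Global: applies to every later httr request in this session.
set_config(slow)
# ... make requests here ...
reset_config()  # remove the global options again

# 2. Scoped: applies only to requests made inside the braces.
# with_config(slow, {
#   GET("http://www.google.com/search")
# })

# 3. Per call: applies to this one request only.
# GET("http://www.google.com/search", slow)
```

Per-call options are the easiest to reason about, which is one reason to prefer them in examples.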

curl options in rOpenSci packages

In most of our packages we allow you to pass in any curl options, either via ... or a named parameter. We are increasingly making our packages consistent, but they may not all have this ability yet. For example, using the rgbif package, an R client for GBIF:

install.packages("rgbif")

verbose output

library("rgbif")
res <- occ_search(geometry=c(-125.0,38.4,-121.8,40.9), limit=20, config=verbose())
#> -> GET /v1/occurrence/search?geometry=POLYGON%28%28-125%2038.4%2C%20-121.8%2038.4%2C%20-121.8%2040.9%2C%20-125%2040.9%2C%20-125%2038.4%29%29&limit=20&offset=0 HTTP/1.1
#> -> User-Agent: curl/7.37.1 Rcurl/1.95.4.5 httr/0.6.0
#> -> Host: api.gbif.org
#> -> Accept-Encoding: gzip
#> -> Accept: application/json, text/xml, application/xml, */*
#> -> 
#> <- HTTP/1.1 200 OK
#> <- Content-Type: application/json
#> <- Access-Control-Allow-Origin: *
#> <- Server: Jetty(9.1.z-SNAPSHOT)
#> <- x-api-url: /v1/occurrence/search?geometry=POLYGON%28%28-125%2038.4%2C%20-121.8%2038.4%2C%20-121.8%2040.9%2C%20-125%2040.9%2C%20-125%2038.4%29%29&limit=20&offset=0
#> <- Content-Length: 48698
#> <- Accept-Ranges: bytes
#> <- Date: Tue, 16 Dec 2014 23:35:52 GMT
#> <- X-Varnish: 1067986052 1067940827
#> <- Age: 209
#> <- Via: 1.1 varnish
#> <- Connection: keep-alive
#> <- 

Print progress

res <- occ_search(geometry=c(-125.0,38.4,-121.8,40.9), limit=20, config=progress())
#> |===================================================================| 100%

You can also combine curl options – use c() in this case to combine them

c(verbose(), progress())
#> Config: 
#> List of 4
#>  $ debugfunction   :function (...)  
#>  $ verbose         :TRUE
#>  $ noprogress      :FALSE
#>  $ progressfunction:function (...)

res <- occ_search(geometry=c(-125.0,38.4,-121.8,40.9), limit=20, config=c(verbose(), progress()))
#> -> GET /v1/occurrence/search?geometry=POLYGON%28%28-125%2038.4%2C%20-121.8%2038.4%2C%20-121.8%2040.9%2C%20-125%2040.9%2C%20-125%2038.4%29%29&limit=20&offset=0 HTTP/1.1
#> -> User-Agent: curl/7.37.1 Rcurl/1.95.4.5 httr/0.6.0
#> -> Host: api.gbif.org
#> -> Accept-Encoding: gzip
#> -> Accept: application/json, text/xml, application/xml, */*
#> -> 
#> <- HTTP/1.1 200 OK
#> <- Content-Type: application/json
#> <- Access-Control-Allow-Origin: *
#> <- Server: Jetty(9.1.z-SNAPSHOT)
#> <- x-api-url: /v1/occurrence/search?geometry=POLYGON%28%28-125%2038.4%2C%20-121.8%2038.4%2C%20-121.8%2040.9%2C%20-125%2040.9%2C%20-125%2038.4%29%29&limit=20&offset=0
#> <- Content-Length: 48698
#> <- Accept-Ranges: bytes
#> <- Date: Tue, 16 Dec 2014 23:35:52 GMT
#> <- X-Varnish: 1067986052 1067940827
#> <- Age: 209
#> <- Via: 1.1 varnish
#> <- Connection: keep-alive
#> <- 
#>   |======================================================================| 100%
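
The pass-through idea can be sketched as follows. This is a hypothetical wrapper, not code from rgbif or any other rOpenSci package; the function name gbif_get and its arguments are made up for illustration, and only the host and path are taken from the verbose output above.

```r
library("httr")

# Hypothetical wrapper: any curl options given in `...` are forwarded
# unchanged to httr::GET().
gbif_get <- function(path, query = list(), ...) {
  url <- modify_url("http://api.gbif.org", path = path)
  res <- GET(url, query = query, ...)
  stop_for_status(res)  # turn HTTP errors into R errors
  content(res, as = "text", encoding = "UTF-8")
}

# Callers can then pass verbose(), progress(), timeout(), and so on:
# txt <- gbif_get("v1/occurrence/search", query = list(limit = 2), verbose())
```

Forwarding `...` means the wrapper never needs to know which curl options exist; a named parameter such as config does the same job more explicitly.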

timeout

Set a timeout for a request. If the request takes longer than the timeout, it is stopped and an error is raised.

Note: For this section and those following, I'll mention an RCurl equivalent if there is one.

GET("http://www.google.com/search", timeout(0.01))
#> Error in function (type, msg, asError = TRUE)  :
#>   Connection timed out after 16 milliseconds
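
Because a timeout surfaces as an ordinary R error, you can catch it with tryCatch() if you would rather carry on than stop. A minimal sketch, where get_or_null is a made-up helper name:

```r
library("httr")

# Hypothetical helper: return NULL instead of stopping when the request
# fails for any reason, including a timeout.
get_or_null <- function(url, ...) {
  tryCatch(
    GET(url, ...),
    error = function(e) {
      message("Request failed: ", conditionMessage(e))
      NULL
    }
  )
}

# res <- get_or_null("http://www.google.com/search", timeout(0.01))
# if (is.null(res)) message("No response; try a longer timeout")
```

Note that this catches every error from the request, not only timeouts, so check the printed message before retrying.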

verbose

Print detailed info on a curl call

Just do a HEAD request so we don't have to deal with big output

HEAD("http://had.co.nz", verbose())
#> -> HEAD / HTTP/1.1
#> -> User-Agent: curl/7.37.1 Rcurl/1.95.4.5 httr/0.6.0
#> -> Host: had.co.nz
#> -> Accept-Encoding: gzip
#> -> Accept: application/json, text/xml, application/xml, */*
#> ->
#> <- HTTP/1.1 200 OK
#> <- X-Powered-By: PHP/4.4.6
#> <- Content-type: text/html
#> <- Date: Tue, 16 Dec 2014 21:03:21 GMT
#> <- Server: LiteSpeed
#> <- Connection: Keep-Alive
#> <- Keep-Alive: timeout=5, max=100
#> <-
#> Response [http://had.co.nz/]
#>   Date: 2014-12-16 12:29
#>   Status: 200
#>   Content-Type: text/html
#> <EMPTY BODY>
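
verbose() can also show more than the headers. The sketch below assumes your httr version supports the data_out, data_in and info arguments; check ?verbose if unsure.

```r
library("httr")

# Assumption: these arguments exist in your installed httr.
# data_out = TRUE shows data sent to the server,
# data_in  = TRUE also shows data received from the server,
# info     = TRUE adds curl's own informational messages.
chatty <- verbose(data_out = TRUE, data_in = TRUE, info = TRUE)

# HEAD("http://www.google.com/search", chatty)
```

The informational messages are mostly useful when debugging https or authentication problems.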

headers

Add headers to modify requests, including authentication, setting content-type, accept type, etc.

res <- HEAD("http://www.google.com/search", add_headers(Accept = "application/json"))
res$request$opts$httpheader
#>             Accept 
#> "application/json"

Note: there are shortcuts for add_headers(Accept = "application/json") and add_headers(Accept = "application/xml"): accept_json() and accept_xml().
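
A short sketch of setting more than one header at a time. The X-Example header is made up for illustration, and content_type_json() is the analogous shortcut for the Content-Type header.

```r
library("httr")

# Several headers can be set in one add_headers() call.
# `X-Example` is a made-up header name.
hdrs <- add_headers(Accept = "application/json", `X-Example` = "demo")

# Shortcuts combine with c() like any other config.
json_both_ways <- c(accept_json(), content_type_json())

# res <- HEAD("http://www.google.com/search", hdrs)
```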

authenticate

Set authentication details for a resource

authenticate() for basic username/password authentication

authenticate(user = "foo", password = "bar")
#> Config: 
#> List of 2
#>  $ httpauth:1
#>   ..- attr(*, "names")="basic"
#>  $ userpwd :"foo:bar"

How you pass an API key depends on the data provider. They may ask for it in a header (which can be done in several different ways)

HEAD("http://www.google.com/search", add_headers(Authorization = "Bearer 234kqhrlj2342"))
# or
HEAD("http://www.google.com/search", add_headers("token" = "234kqhrlj2342"))

or as a query parameter (which is passed in the URL string)

HEAD("http://www.google.com/search", query = list(api_key = "<your key>"))
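
Whichever form the provider wants, you may prefer not to type the key into your scripts. One common approach, sketched here with a made-up environment variable name (MY_API_KEY) and a made-up helper (key_header), is to read it from the environment:

```r
library("httr")

# Hypothetical helper: build an Authorization header from an environment
# variable. MY_API_KEY is a made-up name; use whatever you have set,
# for example in ~/.Renviron.
key_header <- function(var = "MY_API_KEY") {
  key <- Sys.getenv(var)
  if (!nzchar(key)) {
    stop("Environment variable ", var, " is not set", call. = FALSE)
  }
  add_headers(Authorization = paste("Bearer", key))
}

# HEAD("http://www.google.com/search", key_header())
```

The "Bearer" prefix is only one convention; use whatever scheme your data provider documents.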

Another authentication option is an OAuth workflow. OAuth2 is probably more commonly used than OAuth1.

endpts <- oauth_endpoint(authorize = "authorize", access = "access_token", base_url = "https://github.com/login/oauth")
myapp <- oauth_app(appname = "github", key = "<key>", secret = "<secret>")
github_token <- oauth2.0_token(endpts, myapp)
gtoken <- config(token = github_token)
req <- GET("https://api.github.com/rate_limit", gtoken)
content(req)

cookies

Set or get cookies.

Set cookies

GET("http://httpbin.org/cookies", set_cookies(a = 1, b = 2))
#> Response [http://httpbin.org/cookies]
#>   Date: 2014-12-17 07:54
#>   Status: 200
#>   Content-Type: application/json
#>   Size: 50 B
#> {
#>   "cookies": {
#>     "a": "1", 
#>     "b": "2"
#>   }

If there are cookies in a response, you can access them easily with cookies()

res <- GET("http://httpbin.org/cookies/set", query = list(a = 1, b = 2))
cookies(res)
#> $b
#> [1] 2
#> 
#> $a
#> [1] 1

progress

Print curl progress

res <- GET("http://httpbin.org", progress())
#> |==================================| 100%

proxies

When behind a proxy, give the authentication details for your proxy.

GET("http://www.google.com/search", use_proxy(url = "125.39.66.66", port = 80, username = "username", password = "password"))

user agent

Some resources require a user-agent string.

Get the default user agent that httr sets

GET("http://httpbin.org/user-agent")
#> Response [http://httpbin.org/user-agent]
#>   Date: 2014-12-17 07:54
#>   Status: 200
#>   Content-Type: application/json
#>   Size: 59 B
#> {
#>   "user-agent": "curl/7.37.1 Rcurl/1.95.4.5 httr/0.6.0"

Set a user agent string

GET("http://httpbin.org/user-agent", user_agent("its me!"))
#> Response [http://httpbin.org/user-agent]
#>   Date: 2014-12-17 07:54
#>   Status: 200
#>   Content-Type: application/json
#>   Size: 29 B
#> {
#>   "user-agent": "its me!"
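
In practice a more descriptive string is often more helpful to the data provider than a short one. A sketch, with a made-up package name and contact details:

```r
library("httr")

# Made-up package name, URL and email address, for illustration only.
ua <- user_agent("mypkg/0.1 (https://example.org/mypkg; me@example.org)")

# Create the object once and reuse it in every call:
# GET("http://httpbin.org/user-agent", ua)
```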

Questions?

Let us know if you have any questions. To a curl newbie, it may seem a bit overwhelming, but we're here to help.
