Curling – exploring web request options
rOpenSci specializes in creating R libraries for accessing data resources on the web from R. Most times you request data from the web in R with our packages, you should have no problem. However, you will eventually run into problems. In addition, there are more advanced things you can do by modifying the requests you send to web resources.
Underlying almost all of our packages are requests to web resources served over the HTTP protocol via curl. curl is a command line tool and library for transferring data with URL syntax, supporting lots of protocols. curl has many options that you may not know about.
I'll go over some of the common and less commonly used curl options, and try to explain why you may want to use some of them.
Discover curl options
You can go to the source, that is, the curl manual page at http://curl.haxx.se/docs/manpage.html. In R, RCurl::listCurlOptions() lists the available curl options, and the equivalent call in httr is httr::httr_options(). httr::httr_options() gives more information for each curl option, including the libcurl variable name (e.g., CURLOPT_CERTINFO) and the type of variable (e.g., logical).
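As a rough sketch of how the listing can be explored, assuming an installed version of httr that provides httr_options(): the function returns a data frame, and its optional matches argument takes a regular expression to narrow the listing. The search term "timeout" is only an illustration.

```r
library(httr)

# All curl options known to httr, as a data frame with the columns
# httr (name used in httr), libcurl (CURLOPT_* name), and type.
opts <- httr_options()
head(opts)

# Restrict the listing with a regular expression: here, any option
# whose httr or libcurl name contains "timeout" (case-insensitive).
timeout_opts <- httr_options("timeout")
timeout_opts
```

Searching this way is often quicker than scanning the full manual page when you only remember part of an option name.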
Other ways to use curl besides R
Perhaps the canonical way to use curl is on the command line. You can get curl for your operating system at http://curl.haxx.se/download.html, though hopefully you already have curl. Once you have curl, you can have lots of fun. For example, get the contents of the Google landing page:
curl https://www.google.com
- If you like that you may also like httpie, a Python command line tool that is a little more convenient than curl (e.g., JSON output is automatically parsed and colorized).
- A lot of data from the web is in JSON format. A great command line tool to pair with curl is jq.
Note: if you are on Windows, you may need extra setup to use curl on the command line. OS X and Linux have it by default. On Windows 8, installing the latest version from http://curl.haxx.se/download.html#Win64 worked for me.
Install httr
Note: RCurl is a dependency, so you'll get it when you install httr
install.packages("httr")
There are some new features in the httr dev version that you may want. If so, do:
install.packages("devtools")
devtools::install_github("hadley/httr")
Load httr
library("httr")
General option setting
With httr you can either set options globally for an R session, like
set_config(timeout(seconds = 2))
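A minimal sketch of the global approach, assuming you also want to undo the settings afterwards. set_config() adds to whatever is already configured for the session, and reset_config() clears it again; combining the timeout with verbose() here is just an illustration.

```r
library(httr)

# Apply a 2-second timeout to every request made later in this session.
set_config(timeout(seconds = 2))

# A second call adds to the existing global configuration
# rather than replacing it.
set_config(verbose())

# Clear all global settings once they are no longer wanted.
reset_config()
```

Global settings are convenient in an interactive session, but they affect every later request, so resetting them avoids surprises further down a script.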
Or use with_config()
with_config(verbose(), { GET("http://www.google.com/search") })
Or use one of the with_* extensions, such as with_verbose() for verbose output:
with_verbose(
  GET("http://www.google.com/search")
)
#> Response [http://www.google.com/webhp]
#> Date: 2014-12-17 07:54
#> Status: 200
#> Content-Type: text/html; charset=ISO-8859-1
#> Size: 19.3 kB
#> function _gjh(){!_gjuc()&&window.google&&google.x&&google.x({id:"GJH"},f...
#> if (!iesg){document.f&&document.f.q.focus();document.gbqf&&document.gbqf...
#> }
#> })();Search a.i.Z,window.gbar.elr&&a.i.$(window.gbar.elr()),window.gbar.elc&&window....
#> });})(); ...
#> });})();
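To illustrate how a scoped setting might be wrapped for reuse, here is a small sketch built on with_config(). The helper name get_with_timeout() and its default of 2 seconds are hypothetical, not part of httr; the timeout applies only while the wrapped request runs, and a failed request returns NULL instead of stopping the script.

```r
library(httr)

# Hypothetical helper (not part of httr): run one GET with a temporary
# timeout and return NULL if the request errors, e.g. on a timeout.
get_with_timeout <- function(url, seconds = 2) {
  with_config(timeout(seconds = seconds), {
    tryCatch(GET(url), error = function(e) NULL)
  })
}

res <- get_with_timeout("http://www.google.com/search")

# Only inspect the response if the request succeeded.
if (!is.null(res)) status_code(res)
```

Because the configuration is restored when with_config() exits, requests made elsewhere in the session are unaffected by the timeout used here.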