New package: curl. High performance http(s) streaming in R.

November 21, 2014

(This article was first published on OpenCPU, and kindly contributed to R-bloggers)


A while ago I blogged about new streaming features in jsonlite:

library(jsonlite)
diamonds2 <- stream_in(url("http://jeroenooms.github.io/data/diamonds.json"))

In the same blog post it was also mentioned that R currently does not support https connections. The RCurl package does support https, but does not have a connection interface. This bothered me, so I decided to write one. The result is the new curl package.

Encryption, compression and more

From the package description:

The curl() function provides a drop-in replacement for base url() with better performance and support for http 2.0, ssl (https, ftps), gzip, deflate and other libcurl goodies. This interface is implemented using the RConnection API in order to support incremental processing of both binary and text streams.

What this means is that curl() should be able to do anything that url() does, but better. The same example as above, but now with https:

library(curl)
library(jsonlite)
diamonds2 <- stream_in(curl("https://jeroenooms.github.io/data/diamonds.json"))

That was easy. Switching to curl has other benefits as well. For example, it automatically recognizes and decompresses gzipped or deflated responses, as indicated by the Content-Encoding response header:

readLines(curl("http://httpbin.org/gzip"), warn = FALSE)
readLines(curl("http://httpbin.org/deflate"), warn = FALSE)
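
Because curl() returns an ordinary connection object, it can also be read in pieces rather than all at once. The following is a minimal sketch and not part of the curl package: read_chunks and its arguments are hypothetical names made up for illustration. The demo runs on a local textConnection so that it works offline; passing a curl(...) connection instead should behave the same way, assuming the connection is read in blocking mode.

```r
# Hypothetical helper (not from the curl package): read a text connection
# in chunks of `chunk_size` lines and pass each chunk to `handler`.
read_chunks <- function(con, chunk_size = 1000, handler = function(lines) NULL) {
  if (!isOpen(con)) {
    open(con, "r")
    on.exit(close(con))  # only close connections that we opened ourselves
  }
  total <- 0L
  repeat {
    lines <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(lines) == 0L) break  # nothing left to read
    handler(lines)
    total <- total + length(lines)
  }
  invisible(total)
}

# Offline demo on a text connection; a curl("https://...") connection
# could be passed in the same way.
demo_con <- textConnection(c("a", "b", "c", "d", "e"))
n_read <- read_chunks(demo_con, chunk_size = 2,
                      handler = function(lines) cat(length(lines), "lines\n"))
close(demo_con)
```

The handler is where per-chunk work would go, for example parsing each batch of lines before the next one is fetched.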

Support for compression can make a huge difference when streaming large data. Text based formats such as json are popular because they are human readable, but the main downside of plain text is that it stores numbers inefficiently. However, when gzipped, json payloads are often comparable in size to binary formats, giving you the best of both worlds.
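
To get a feel for the size difference, the sketch below builds some made-up newline-delimited json in base R and compares its size as plain text, gzipped in memory with memCompress, and as a plain binary dump of the same values. The data and the field names (id, price) are invented for illustration, and the actual ratios will depend heavily on the data.

```r
# Invented example data: 10,000 records with an id and a price.
set.seed(1)
price <- round(runif(10000, min = 300, max = 19000), 2)

# Build newline-delimited json by hand, using base R only.
json_lines <- sprintf('{"id":%d,"price":%.2f}', seq_along(price), price)
json_raw <- charToRaw(paste(json_lines, collapse = "\n"))

# Compress the text in memory with gzip.
json_gz <- memCompress(json_raw, type = "gzip")

# Plain binary dump for comparison: 4-byte integers plus 8-byte doubles.
bin_raw <- c(writeBin(seq_along(price), raw()), writeBin(price, raw()))

# Sizes in bytes.
sizes <- c(json = length(json_raw),
           json_gzip = length(json_gz),
           binary = length(bin_raw))
print(sizes)
```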

Performance

One thing that did surprise me a bit is the difference in performance. In particular, the implementation of readLines for url connections seems to be inefficient in base R.

con2 <- curl("http://jeroenooms.github.io/data/diamonds.json")
system.time(readLines(con2))
#   user  system elapsed
#  0.238   0.096   0.334

con1 <- url("http://jeroenooms.github.io/data/diamonds.json")
system.time(readLines(con1))
#   user  system elapsed
#  0.236   0.113   3.858

I’m not quite sure why this is. Maybe the base R version does some additional character recoding that I am not aware of, although I have not observed such behavior. Also, measuring performance is tricky in this case because it depends on the connection bandwidth, caching settings, etc.
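
One way to reduce that noise is to repeat the measurement a few times on fresh connections and look at the spread. The helper below is only a sketch: time_reads and its arguments are hypothetical names, and the demo uses a local textConnection so that it runs offline. For a real comparison one would pass a function that creates a curl() or url() connection to the same address.

```r
# Hypothetical helper: time readLines() on a fresh connection, `reps` times.
# `make_con` is a function that returns a new connection on each call.
time_reads <- function(make_con, reps = 5) {
  vapply(seq_len(reps), function(i) {
    con <- make_con()
    on.exit(close(con))  # destroy the connection after each run
    system.time(readLines(con, warn = FALSE))[["elapsed"]]
  }, numeric(1))
}

# Offline demo; with network access one could instead try, for example:
#   time_reads(function() curl::curl("http://jeroenooms.github.io/data/diamonds.json"))
#   time_reads(function() url("http://jeroenooms.github.io/data/diamonds.json"))
elapsed <- time_reads(function() textConnection(rep("x", 1000)), reps = 3)
print(summary(elapsed))
```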

Comparison with RCurl and httr

The curl package is not intended as an alternative to RCurl or httr. The latter packages also use libcurl, but provide a more flexible client for performing http requests in R. The purpose of the curl package is mainly to reimplement functionality already found in base R, in a way that (in a parallel universe) would allow r-core to adopt these changes and start supporting https in url, download.file, etc.

Do note that this is an initial release and the RConnection API is a bit experimental, so there might be bugs 🙂 In fact I’ve already made quite a few changes since the CRAN release. If you report a bug, please make sure to replicate it with the latest dev version from github:

library(devtools)
install_github("jeroenooms/curl")

For some more fun examples, see the curl manual page.
