High performance JSON streaming in R: Part 1

November 5, 2014
(This article was first published on OpenCPU, and kindly contributed to R-bloggers)

The jsonlite stream_in and stream_out functions implement line-by-line processing of JSON data over a connection, such as a socket, URL, file or pipe. This lets us construct a data processing pipeline that handles large (or unbounded) amounts of data with limited memory. This post walks through some examples from the help pages.

The json streaming format

Because parsing huge JSON strings is difficult and inefficient, JSON streaming is done using lines of minified JSON records. This is pretty standard: JSON databases such as MongoDB use the same format to import/export large datasets. Note that this means that the total stream combined is not valid JSON itself; only the individual lines are.

library(jsonlite)
x <- iris[1:3,]
stream_out(x, con = stdout())
# {"Sepal.Length":5.1,"Sepal.Width":3.5,"Petal.Length":1.4,"Petal.Width":0.2,"Species":"setosa"}
# {"Sepal.Length":4.9,"Sepal.Width":3,"Petal.Length":1.4,"Petal.Width":0.2,"Species":"setosa"}
# {"Sepal.Length":4.7,"Sepal.Width":3.2,"Petal.Length":1.3,"Petal.Width":0.2,"Species":"setosa"}

Also note that because line-breaks are used as separators, prettified JSON is not permitted: the JSON lines must be minified. In this respect, the format is a bit different from fromJSON and toJSON where all lines are part of a single JSON structure with optional line breaks.
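
To make this concrete, here is a minimal sketch that checks the claim with the validate function from jsonlite. The temporary file and the variable names are only illustrative assumptions; the idea is simply that every line validates on its own while the lines pasted together do not.

```r
library(jsonlite)

x <- iris[1:3, ]

# Write the records to a temporary file (illustrative path) and read the raw lines back.
tmp <- tempfile(fileext = ".json")
stream_out(x, con = file(tmp), verbose = FALSE)
json_lines <- readLines(tmp)

# Each line should be a complete, valid JSON object on its own.
each_valid <- vapply(json_lines, function(line) isTRUE(validate(line)), logical(1))

# The lines pasted back together should not form a single valid JSON document.
whole_valid <- isTRUE(validate(paste(json_lines, collapse = "\n")))

print(all(each_valid))
print(whole_valid)
```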

Streaming to/from a file

The nycflights13 package contains a dataset with about 5 million values. To stream this to a file:

library(nycflights13)
stream_out(flights, con = file("~/flights.json"))

Running this code will open the file connection, write json to the connection in batches of 500 rows, and afterwards close the connection. Status messages will be printed to the console while writing output. The entire process should take a few seconds and generate a json file of about 7MB.

We use the same file to illustrate how to stream the json back into R. The following code will stream-parse the json in batches of 500 lines. Afterward we verify that the output is indeed identical to the original one:

flights2 <- stream_in(file("~/flights.json"))
all.equal(flights2, as.data.frame(flights))
# [1] TRUE

Because the data is read in small batches, this requires much less memory than parsing a huge JSON blob all at once. The pagesize argument in stream_in and stream_out can be used to specify the number of rows that will be read/written per iteration.
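
As a rough sketch of the pagesize argument, the following round-trips a small made-up data frame in batches of 10 rows instead of the default 500. The data frame and the temporary file are assumptions for illustration; with a dataset this small the batch size makes no practical difference, it only shows where the argument goes.

```r
library(jsonlite)

# A small illustrative data frame: 25 rows, no factors or row names.
df <- data.frame(id = 1:25, value = (1:25) / 4)

tmp <- tempfile(fileext = ".json")

# Write 10 rows per iteration.
stream_out(df, con = file(tmp), pagesize = 10, verbose = FALSE)

# Read 10 lines per iteration; the pages are combined into one data frame.
df2 <- stream_in(file(tmp), pagesize = 10, verbose = FALSE)

print(nrow(df2))
print(all.equal(df2$value, df$value))
```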

Streaming from a URL

We can use the standard url function in R to stream from an HTTP connection.

diamonds2 <- stream_in(url("http://jeroenooms.github.io/data/diamonds.json"))

If the data source is gzipped, simply wrap the connection in gzcon.

flights3 <- stream_in(gzcon(url("http://jeroenooms.github.io/data/nycflights13.json.gz")))
all.equal(flights3, as.data.frame(flights))
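
The same gzcon pattern can be tried without a network connection. The sketch below is an assumption-laden local variant: it writes a small made-up data frame through base R's gzfile and reads it back by wrapping a plain binary file connection in gzcon, in the same way as the URL example above.

```r
library(jsonlite)

# Illustrative data; any data frame without factors would do.
df <- data.frame(id = 1:5, name = c("a", "b", "c", "d", "e"),
                 stringsAsFactors = FALSE)

tmp <- tempfile(fileext = ".json.gz")

# gzfile compresses on the fly while stream_out writes the lines.
stream_out(df, con = gzfile(tmp), verbose = FALSE)

# Wrap a binary file connection in gzcon to decompress while reading.
con <- gzcon(file(tmp, open = "rb"))
df2 <- stream_in(con, verbose = FALSE)
close(con)

print(df2)
```

Because the connection is already open when it is passed in, it is closed explicitly afterwards.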

Because R's built-in url connections do not currently support SSL, we use a curl pipe to stream over HTTPS:

flights4 <- stream_in(gzcon(pipe("curl https://jeroenooms.github.io/data/nycflights13.json.gz")))
all.equal(flights4, as.data.frame(flights))

For this to work, the curl executable needs to be installed and available in the search path, which requires cygwin on Windows. Unfortunately the RCurl package does not seem to support binary streaming at this point.

Next up

These examples illustrate basic line-by-line json streaming of data frames from/to a connection, which allows for importing/exporting large json datasets.

In the next blog post we will make the step to full JSON IO streaming by defining a custom handler function. This allows for constructing a json data processing pipeline in R that can handle an infinite data stream. Impatient readers can have a look at the examples in the stream_in help page.
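
As a small preview, here is a sketch of what a handler can look like, loosely modelled on the stream_in help page. The data, the file names and the filter condition are made up for illustration. The handler is called once per page, and here it writes the rows it wants to keep to a second connection rather than collecting everything in memory.

```r
library(jsonlite)

# Illustrative input: 20 records written as line-delimited JSON.
infile <- tempfile(fileext = ".json")
stream_out(data.frame(id = 1:20, score = (1:20) * 5),
           con = file(infile), verbose = FALSE)

# Output connection, opened once and reused by every handler call.
outfile <- tempfile(fileext = ".json")
out_con <- file(outfile, open = "wb")

# Process the input 5 lines at a time, keeping only rows with score > 50.
stream_in(file(infile), handler = function(df) {
  keep <- df[df$score > 50, , drop = FALSE]
  if (nrow(keep) > 0) stream_out(keep, out_con, verbose = FALSE)
}, pagesize = 5, verbose = FALSE)
close(out_con)

# Read the filtered stream back to inspect it.
filtered <- stream_in(file(outfile), verbose = FALSE)
print(nrow(filtered))
```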
