Publishing dynamic data on ocpu.io

February 16, 2014

(This article was first published on OpenCPU, and kindly contributed to R-bloggers)


Suppose you would like to publish some data, for example to accompany a journal article. One way would be to put a CSV file on your website and share the URL with your colleagues. However, CSV has many limitations: it only works for tabular structures, has limited type safety (pretty much everything gets coerced into strings), and leads to loss of numeric precision.
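The loss of type information and precision is easy to see with a small round trip. The sketch below uses only base R; the data frame and its column names are made up for illustration.

```r
# A small, made-up data frame with a Date column and high-precision numbers.
df <- data.frame(
  when  = as.Date(c("2014-02-16", "2014-02-17")),
  value = c(pi, 1 / 3)
)

# Round trip through a CSV file.
csv_file <- tempfile(fileext = ".csv")
write.csv(df, csv_file, row.names = FALSE)
back <- read.csv(csv_file)

# The Date class is gone: the column comes back as plain text
# (character or factor, depending on the R version).
class(df$when)
class(back$when)

# write.csv keeps at most 15 significant digits, so the doubles
# are close to the originals but no longer bit-for-bit identical.
identical(df$value, back$value)
all.equal(df$value, back$value)
```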

There are many alternative data interchange formats, each with its own benefits and limitations. For example, JSON is widely supported and can be parsed in almost any language, but it can be verbose and slow. A binary format such as Protocol Buffers is more efficient, but many users might not know how to parse it. You could even use save or saveRDS in R to share the native R structures, but this limits your audience to R users.
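For comparison, a native R format preserves the object exactly, at the cost of being readable only from R. A minimal sketch with a made-up data frame:

```r
# A made-up data frame with a Date column and high-precision numbers.
df <- data.frame(
  when  = as.Date(c("2014-02-16", "2014-02-17")),
  value = c(pi, 1 / 3)
)

# Round trip through the native RDS format.
rds_file <- tempfile(fileext = ".rds")
saveRDS(df, rds_file)
restored <- readRDS(rds_file)

# Classes, attributes and full numeric precision all survive.
identical(df, restored)
```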

Retrieving dynamic data

What we really need is a method to publish the data itself rather than some representation of the data in a particular format. With OpenCPU you can publish R objects (including datasets) in a way that lets the clients select the format and formatting options for retrieving the dataset. This is implemented using native R functionality to include arbitrary data/objects in packages, and standard R functions for exporting these data. For example, the CRAN package MASS includes a dataset called bacteria:

library(MASS)
data(bacteria)
print(bacteria)

Via OpenCPU, the dataset can be downloaded by anyone, using one of many formats:

Format            Export Function          URL (short)
text              print                    cran.ocpu.io/MASS/data/bacteria/print
CSV               write.csv                cran.ocpu.io/MASS/data/bacteria/csv
TSV               write.table              cran.ocpu.io/MASS/data/bacteria/tab
JSON              jsonlite::asJSON         cran.ocpu.io/MASS/data/bacteria/json
Protocol Buffers  RProtoBuf::serialize_pb  cran.ocpu.io/MASS/data/bacteria/pb
RData             save                     cran.ocpu.io/MASS/data/bacteria/rda
RDS               saveRDS                  cran.ocpu.io/MASS/data/bacteria/rds
ascii R           dput                     cran.ocpu.io/MASS/data/bacteria/ascii
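From the client side, choosing a format comes down to choosing the last path segment of the URL. The sketch below builds such URLs with a hypothetical helper, ocpu_data_url, which is not part of any package. The https scheme is an assumption, since the short URLs above omit it, and the actual download is left commented out because it needs network access and a reachable server.

```r
# Hypothetical helper: build the URL of a dataset in a given format.
# The "https" default is an assumption; the short URLs omit the scheme.
ocpu_data_url <- function(pkg, object, format,
                          host = "cran.ocpu.io", scheme = "https") {
  paste0(scheme, "://", host, "/", pkg, "/data/", object, "/", format)
}

csv_url  <- ocpu_data_url("MASS", "bacteria", "csv")
json_url <- ocpu_data_url("MASS", "bacteria", "json")
print(csv_url)
print(json_url)

# With network access, the CSV variant can be read directly by base R:
# bacteria_remote <- read.csv(csv_url)
```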

The client can also control formatting options by passing HTTP parameters. These parameters map directly to function arguments for the respective export function in the table above. Some random examples:

Output Format                                   Equivalent URL on Public OpenCPU Server
write.csv(bacteria, row.names=TRUE)             cran.ocpu.io/MASS/data/bacteria/csv?row.names=true
jsonlite::asJSON(Boston, digits=4)              cran.ocpu.io/MASS/data/Boston/json?digits=4
jsonlite::asJSON(Boston, dataframe="columns")   cran.ocpu.io/MASS/data/Boston/json?dataframe=columns
jsonlite::asJSON(Boston, pretty=FALSE)          cran.ocpu.io/MASS/data/Boston/json?pretty=false
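The mapping from function arguments to HTTP parameters can be sketched as a small query-string builder. The helper ocpu_query below is hypothetical and only mirrors the patterns in the table: logicals become lowercase true/false and other values are passed through as text. It assumes the values need no URL escaping.

```r
# Hypothetical helper: turn a named list of R arguments into a query string.
# Assumes simple values that need no URL escaping.
ocpu_query <- function(args) {
  if (length(args) == 0) return("")
  encode <- function(x) {
    # TRUE/FALSE become "true"/"false", as in the URLs in the table.
    if (is.logical(x)) tolower(as.character(x)) else as.character(x)
  }
  values <- vapply(args, encode, character(1))
  paste0("?", paste0(names(args), "=", values, collapse = "&"))
}

base_url <- "cran.ocpu.io/MASS/data/Boston/json"

# Equivalent of asking for JSON with digits=4.
paste0(base_url, ocpu_query(list(digits = 4)))

# Several arguments are joined with "&".
paste0(base_url, ocpu_query(list(dataframe = "columns", pretty = FALSE)))
```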

Creating a data package

To start publishing your own dynamic data, you need to put your data objects in an R package, following the standard guidelines documented in section 1.1.6 of Writing R Extensions. This might sound cumbersome, but once you get the hang of it, it only takes a few seconds. You’ll realize that packages are actually a beautiful, standardized and well-tested container format for R objects and much more. Have a look at the data folder in the opencpu/appdemo package for some examples.
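As a rough illustration, a data-only package needs little more than a DESCRIPTION file and a data directory. The sketch below creates such a skeleton in a temporary directory using base R. The names mypackage and myobject match the ones used in the test snippet that follows; every DESCRIPTION value and the data frame contents are placeholders.

```r
# Build a minimal data package skeleton in a temporary directory.
pkg_dir <- file.path(tempdir(), "mypackage")
dir.create(file.path(pkg_dir, "data"), recursive = TRUE, showWarnings = FALSE)

# DESCRIPTION with placeholder values; adjust these for a real package.
writeLines(c(
  "Package: mypackage",
  "Type: Package",
  "Title: Example Data Package",
  "Version: 0.1",
  "Author: Your Name",
  "Maintainer: Your Name <you@example.com>",
  "Description: Illustrative package that only ships a dataset.",
  "License: CC0",
  "LazyData: true"
), file.path(pkg_dir, "DESCRIPTION"))

# An empty NAMESPACE file; a data-only package exports no functions.
file.create(file.path(pkg_dir, "NAMESPACE"))

# The name of the saved object becomes the name of the dataset.
myobject <- data.frame(id = 1:3, score = c(2.5, 3.7, 1.2))
save(myobject, file = file.path(pkg_dir, "data", "myobject.rda"))

list.files(pkg_dir, recursive = TRUE)

# To install the skeleton from source on the local machine:
# install.packages(pkg_dir, repos = NULL, type = "source")
```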

After creating and installing your package on your local R, test it using the OpenCPU single user server:

library(opencpu)
opencpu$browse("/library/mypackage/data")
opencpu$browse("/library/mypackage/data/myobject")

Publishing dynamic data on ocpu.io

To make your data available through the public OpenCPU server and ocpu.io, all you need to do is put your package up on GitHub. OpenCPU requires the name of the GitHub repository to match the name of the R package it contains. Use devtools to test whether your package installs correctly:

library(devtools)
install_github("username/pkgname")

If this succeeds, you’re good to go. Navigate to username.ocpu.io/pkgname/data, where username is your GitHub login. By default, the OpenCPU public server updates packages installed from GitHub every 24 hours. However, the GitHub webhook can be used to update the package immediately every time a commit is pushed to GitHub.

Publishing dynamic data on your own server

OpenCPU does not lock you into some commercial hosting service. Your data is stored on GitHub in a standard format under your control. The ocpu.io public server is there for your convenience. You can also install your own OpenCPU cloud server to publish data at e.g. http://opencpu.yourserver.com/ocpu/library/pkgname/data/myobject. In that case there is no need to put anything on GitHub: just install the package in R on the server.
