# Getting numpy data into R — Take Two

July 10, 2012

A couple of days ago, I posted a short Python script to convert numpy files
into a simple binary format which R can read quickly. Nice, but it still
needed an extra file. Shortly thereafter, I found Carl Rogers' cnpy library,
which makes reading and writing numpy files from C++ a breeze, and I quickly
wrapped it up into a new package, RcppCNPy, which was released a few days ago.

This post will show a quick example, also summarized in the
short pdf vignette
describing the package, and provided as a demo within the package.

```
R> library(RcppCNPy)
R> library(rbenchmark)
R>
R> n <- 1e5
R> k <- 50
R>
R> M <- matrix(seq(1.0, n*k, by=1.0), n, k)
R>
R> txtfile <- tempfile(fileext=".txt")
R> write.table(M, file=txtfile)
R>
R> pyfile <- tempfile(fileext=".npy")
R> npySave(pyfile, M)
R>
R> pygzfile <- tempfile(fileext=".npy")
R> npySave(pygzfile, M)
R> system(paste("gzip -9", pygzfile))
R> pygzfile <- paste(pygzfile, ".gz", sep="")
R>
R>
```

We first load the new package (as well as the
rbenchmark package used for the
benchmarking example) into R. We then create a large matrix of 100,000 rows and 50
columns. Not quite big data by any stretch, but large enough for
ascii reading to be painfully slow. We also write two npy files and compress the
second one.
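To make concrete what ends up in such a file, here is a minimal sketch that writes a tiny matrix with `npySave` and then peeks at the start of the file with base R. It assumes the version 1.x npy layout (a six-byte magic string, two version bytes, a two-byte little-endian header length, then an ASCII header); later format versions use a four-byte length, so treat this as an illustration rather than a general parser. The variable names are made up for the example.

```r
library(RcppCNPy)

# Small stand-in matrix; the benchmark above uses a much larger one.
M <- matrix(seq(1.0, 6.0, by = 1.0), 2, 3)
f <- tempfile(fileext = ".npy")
npySave(f, M)

con <- file(f, "rb")
# Magic string: byte 0x93 followed by the ASCII characters "NUMPY".
magic   <- readBin(con, what = "raw", n = 6)
# Major and minor format version, one byte each.
version <- readBin(con, what = "integer", n = 2, size = 1)
# Header length (assumes format version 1.x: unsigned 16-bit, little-endian).
hlen    <- readBin(con, what = "integer", n = 1, size = 2,
                   signed = FALSE, endian = "little")
# The header itself is an ASCII dictionary describing dtype, order and shape.
header  <- rawToChar(readBin(con, what = "raw", n = hlen))
close(con)

cat("magic:  ", rawToChar(magic[-1]), "\n")
cat("version:", version, "\n")
cat("header: ", header, "\n")

unlink(f)
```

Everything after the header is the raw array data, which is why a reader can pull it in with a single binary read.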

Next, we use the `benchmark` function to time the three
approaches:

```
R> res <- benchmark(read.table(txtfile),
+                  npyLoad(pyfile),
+                  npyLoad(pygzfile),
+                  order="relative",
+                  columns=c("test", "replications", "elapsed", "relative"),
+                  replications=10)
R> print(res)
test replications elapsed relative
R>
```

As shown by this example, loading a numpy file directly beats the pants off
reading the data from ascii. Reading a
compressed file is somewhat slower as the data stream has to be passed through the
uncompressor provided by the zlib library. So instead of reading a binary blob
in one go (once the file header has been parsed) we have to operate
piecemeal, which is bound to be slower. It does however save storage
space (and users can make this tradeoff between speed and size) and is still
orders of magnitude faster than parsing the ascii file. Finally, and
not shown here, we unlink the temporary files.
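The speed versus size tradeoff can be illustrated with base R alone. The sketch below is not how RcppCNPy reads files internally; it merely stores the same vector of doubles once as a plain binary dump and once through a `gzfile` connection (which uses zlib), reads both back, and compares the sizes on disk. The file and variable names are invented for the example, and it ends with the kind of `unlink` cleanup mentioned above.

```r
# Stand-in payload: plain doubles, not an actual npy file.
x <- as.numeric(seq_len(1e5))

rawfile <- tempfile(fileext = ".bin")
gzf     <- tempfile(fileext = ".bin.gz")

# Plain binary dump: 8 bytes per double, no compression.
writeBin(x, rawfile)

# Same bytes, but passed through zlib on the way to disk.
con <- gzfile(gzf, "wb")
writeBin(x, con)
close(con)

# Uncompressed read: a single call pulls in the whole blob.
y_raw <- readBin(rawfile, what = "numeric", n = length(x))

# Compressed read: the connection inflates the stream as it is consumed.
con <- gzfile(gzf, "rb")
y_gz <- readBin(con, what = "numeric", n = length(x))
close(con)

# Both routes recover the same data; only the size on disk differs.
sizes <- file.info(c(rawfile, gzf))$size
names(sizes) <- c("raw", "gzip")
print(sizes)
print(identical(y_raw, y_gz))

# Clean up the temporary files.
unlink(c(rawfile, gzf))
```

Wrapping the two reads in `system.time()` on a larger vector is one way to see the cost of decompression on a given machine.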

Summing up, this post demonstrated how the
RcppCNPy package
can be useful for accessing data in numpy files (which may even be
compressed). Data can also be written from R to be accessed later by numpy.
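As a small sanity check of the write direction, the following sketch saves a matrix with `npySave` and reads it straight back with `npyLoad`, using the default arguments of both functions. The matrix and file name are arbitrary choices for the example. On the Python side, the resulting file would typically be opened with `numpy.load`.

```r
library(RcppCNPy)

# A small numeric matrix to write out.
M <- matrix(seq(1.0, 12.0, by = 1.0), 3, 4)

f <- tempfile(fileext = ".npy")
npySave(f, M)          # write from R in npy format

M2 <- npyLoad(f)       # read it back, as numeric by default

# The round trip should preserve both shape and values.
print(dim(M2))
print(isTRUE(all.equal(M, M2)))

unlink(f)
```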
