Getting numpy data into R — Take Two

Posted on July 10, 2012 by Thinking inside the box in Uncategorized | 0 Comments

[This article was first published on Thinking inside the box , and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A couple of days ago, I had posted a short Python script to convert numpy files into a simple binary format which R can read quickly. Nice, but still needing an extra file. Shortly thereafter, I found Carl Rogers cnpy library which makes reading and writing numpy files from C++ a breeze, and I quickly wrapped this up into a new package RcppCNPy which was released a few days ago.

This post will show a quick example, also summarized in the short pdf vignette describing the package, and provided as a demo within the package.

R> library(RcppCNPy)
Loading required package: Rcpp
R> library(rbenchmark)
R> 
R> n <- 1e5
R> k <- 50
R> 
R> M <- matrix(seq(1.0, n*k, by=1.0), n, k)
R> 
R> txtfile <- tempfile(fileext=".txt")
R> write.table(M, file=txtfile)
R> 
R> pyfile <- tempfile(fileext=".py")
R> npySave(pyfile, M)
R> 
R> pygzfile <- tempfile(fileext=".py")
R> npySave(pygzfile, M)
R> system(paste("gzip -9", pygzfile))
R> pygzfile <- paste(pygzfile, ".gz", sep="")
R> 
R>

We first load the new package (as well as the rbenchmark package used for the benchmarking example) into R. We then create a large matrix of 100,000 rows and 50 columns. Not quite big data by any stretch, but large enough for ascii reading to be painfully slow. We also write two npy files and compress the second one.

Next, we use the benchmark function to time the three approaches:

R> res <- benchmark(read.table(txtfile),
+                  npyLoad(pyfile),
+                  npyLoad(pygzfile),
+                  order="relative",
+                  columns=c("test", "replications", "elapsed", "relative"),
+                  replications=10)
R> print(res)
                 test replications elapsed relative
2     npyLoad(pyfile)           10   1.241  1.00000
3   npyLoad(pygzfile)           10   3.098  2.49637
1 read.table(txtfile)           10  96.744 77.95649
R>

As shown by this example, loading a numpy file directly beats the pants off reading the data from ascii: it is about 78 times faster. Reading a compressed file is somewhat slower as the data stream has to be passed through the uncompressor provide by the zlib library. So instead of reading a binary blob in one go (once the file header has been parsed) we have to operate piecemeal—which is bound to be slower. It does however save in storage space (and users can make this tradeoff between speed and size) and is still orders of magnitude faster than parsing the ascii file. Finally, and not shown here, we unlink the temporary files.

Summing up, this post demonstrated how the RcppCNPy package can be a useful to access data in numpy files (which may even be compressed). Data can also be written from R to be accessed later by numpy.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box .

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

Getting numpy data into R — Take Two

Related

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)