(This article was first published on **Thinking inside the box**, and kindly contributed to R-bloggers)

A couple of days ago, I posted a short Python script to convert numpy files into a simple binary format which R can read quickly. Nice, but still needing an extra file. Shortly thereafter, I found Carl Rogers' cnpy library, which makes reading and writing numpy files from C++ a breeze, and I quickly wrapped this up into a new package, RcppCNPy, which was released a few days ago.

This post will show a quick example, also summarized in the short pdf vignette describing the package, and provided as a demo within the package.

    R> library(RcppCNPy)
    Loading required package: Rcpp
    R> library(rbenchmark)
    R>
    R> n <- 1e5
    R> k <- 50
    R>
    R> M <- matrix(seq(1.0, n*k, by=1.0), n, k)
    R>
    R> txtfile <- tempfile(fileext=".txt")
    R> write.table(M, file=txtfile)
    R>
    R> pyfile <- tempfile(fileext=".py")
    R> npySave(pyfile, M)
    R>
    R> pygzfile <- tempfile(fileext=".py")
    R> npySave(pygzfile, M)
    R> system(paste("gzip -9", pygzfile))
    R> pygzfile <- paste(pygzfile, ".gz", sep="")
    R>

We first load the new package (as well as the rbenchmark package used for the benchmarking example) into R. We then create a large matrix of 100,000 rows and 50 columns. Not quite *big data* by any stretch, but large enough for ASCII reading to be painfully slow. We also write the matrix to a text file and to two npy files, and compress the second npy file with gzip.

Next, we use the `benchmark` function to time the three approaches:

    R> res <- benchmark(read.table(txtfile),
    +                  npyLoad(pyfile),
    +                  npyLoad(pygzfile),
    +                  order="relative",
    +                  columns=c("test", "replications", "elapsed", "relative"),
    +                  replications=10)
    R> print(res)
                     test replications elapsed relative
    2     npyLoad(pyfile)           10   1.241  1.00000
    3   npyLoad(pygzfile)           10   3.098  2.49637
    1 read.table(txtfile)           10  96.744 77.95649
    R>

As shown by this example, loading a numpy file directly beats the pants off reading the data from ASCII: it is about 78 times faster. Reading a compressed file is somewhat slower (about 2.5 times here) as the data stream has to be passed through the decompressor provided by the zlib library. So instead of reading a binary blob in one go (once the file header has been parsed) we have to operate piecemeal, which is bound to be slower. It does however save storage space (users can make this tradeoff between speed and size) and is still about 31 times faster than parsing the ASCII file. Finally, and not shown here, we unlink the temporary files.
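The size side of that tradeoff, and the final cleanup step, can be sketched in base R alone. This is only an illustration under stated assumptions: it uses `writeBin` and a `gzfile` connection on a plain vector of doubles as a stand-in for the npy files, so it does not reproduce what `npySave` followed by `gzip -9` writes, and the file names are made up for the sketch.

```r
# Base R stand-in for the npy files: raw doubles, plain and gzip-compressed.
# (Assumption: this mimics the idea only, not the npy format itself.)
x <- seq(1.0, 1e5, by = 1.0)

rawfile <- tempfile(fileext = ".bin")      # hypothetical file names
gzpath  <- tempfile(fileext = ".bin.gz")

writeBin(x, rawfile)                       # 8 bytes per double, no header

con <- gzfile(gzpath, "wb")                # same data, passed through the compressor
writeBin(x, con)
close(con)

sizes <- file.size(c(rawfile, gzpath))
names(sizes) <- c("raw", "gzip")
print(sizes)

# Reading the compressed copy goes through the decompressor.
con <- gzfile(gzpath, "rb")
y <- readBin(con, what = "double", n = length(x))
close(con)
stopifnot(identical(x, y))

# Cleanup: unlink() returns 0 on success.
status <- unlink(c(rawfile, gzpath))
gone <- !any(file.exists(c(rawfile, gzpath)))
```

In the benchmark session itself, the equivalent cleanup would be a single `unlink()` call on `txtfile`, `pyfile` and `pygzfile`.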

Summing up, this post demonstrated how the RcppCNPy package can be useful for accessing data in numpy files (which may even be compressed). Data can also be written from R to be accessed later by numpy.
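A minimal round-trip sketch of writing from R and reading back, assuming RcppCNPy is installed. It uses only `npySave` and `npyLoad` as in the session above; the small matrix and the `.npy` file name are made up for illustration.

```r
library(RcppCNPy)

# Small numeric matrix, a scaled-down version of the one benchmarked above.
M <- matrix(seq(1.0, 20, by = 1.0), 5, 4)

npyfile <- tempfile(fileext = ".npy")   # hypothetical file name for this sketch
npySave(npyfile, M)                     # write from R in numpy's npy format

M2 <- npyLoad(npyfile)                  # read back; numeric is the default type
roundtrip_ok <- isTRUE(all.equal(M, M2))

unlink(npyfile)                         # remove the temporary file
```

Before the `unlink()` call, the same file should also be readable on the Python side with `numpy.load`.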
