Getting numpy data into R

June 30, 2012

(This article was first published on Thinking inside the box, and kindly contributed to R-bloggers)

The other day, I found myself confronted with a large number of large
files, which were presented both in (gzip-)compressed ascii format (which R reads
directly via gzfile() connections) and as (compressed)
numpy files.

The numpy files can be read very efficiently into Python. We can do the same in R
via save() and load(), of course. But the trouble
is that you need to read the data into R first. And reading hundreds of megabytes from ascii is
slow, no matter which language you use. Concerning R, I poked around scan(),
played with the colClasses argument of read.table(), and looked at the recent LaF package written just for
this purpose. And all these solutions were still orders of magnitude slower than
reading numpy files, which is no surprise as it is really hard to beat binary
formats when you have to parse countless ascii tokens.
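For illustration (this snippet is mine rather than part of the original workflow, and the file name and column count are invented), reading such a compressed ascii file through a gzfile() connection with colClasses supplied looks roughly like this:

# illustrative sketch only: file name and column count are made up
infile <- "/tmp/data.txt.gz"
dat <- read.table(gzfile(infile), colClasses = rep("numeric", 5))  # fix column types up front
mat <- as.matrix(dat)

Even with the column types fixed, every token still has to be parsed from text, which is where the time goes.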

So the obvious next idea was to read the numpy file in Python, and to write a simple
binary format. One helpful feature with this data set was that it contained
only regular (rectangular) matrices of floats. So we could just store two integers for
the dimensions, followed by the total data in either one large binary blob,
or a sequence of column vectors.

But one minor trouble was that the Intertubes led to no easy solution for
unpacking the numpy format. StackOverflow
had plenty of questions around this topic concerned with, say, how to
serialize in a language-independent way. But no converters. And nobody
local knew how to undo the “pickle” format underlying numpy.

But a remote friend did: Laurent,
well-known for his Rpy2
package, pointed me to the struct module and steered
me towards the solution shown below. So a shameless plug: if you need a very
experienced Python or R consultant for science work, consider
his consulting firm.

Finally, to round out this post, let’s show the simple solution we crafted so that the
next guy searching the Intertubes will have an easier time. Let us start with a
minimal Python program writing numpy data to disk:

#!/usr/bin/env python
# simple example for creating numpy data to demonstrate converter

import numpy as np

# simple float array
a = np.arange(15).reshape(3,5) * 1.1

outfile = "/tmp/data.npy"
np.save(outfile, a)

Next, the simple Python converter creates a binary file containing two
integers for the row and column dimensions, followed by rows times columns of doubles:

# read a numpy file, and write a simple binary file containing
#   two integers 'n' and 'k' for rows and columns
#   n times k floats with the actual matrix
# which can be read by any application or language that can read binary
import struct
import numpy as np

inputfile = "/tmp/data.npy"
outputfile = "/tmp/data.bin"

# load from the file
mat = np.load(inputfile)

# create a binary file
binfile = open(outputfile, 'wb')
# and write out two integers with the row and column dimension
header = struct.pack('2I', mat.shape[0], mat.shape[1])
binfile.write(header)
# then loop over columns and write each as a block of doubles
for i in range(mat.shape[1]):
    data = struct.pack('%id' % mat.shape[0], *mat[:,i])
    binfile.write(data)
binfile.close()

Lastly, a quick littler script
showing how R can read the data in a handful of lines:


infile <- "/tmp/data.bin"
con <- file(infile, "rb")
# two integers with the row and column dimension, then the data column by column
dim <- readBin(con, "integer", 2)
Mat <- matrix(readBin(con, "numeric", prod(dim)), dim[1], dim[2])
close(con)
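One caveat, added here as a side note rather than something from the original setup: struct.pack() and readBin() both default to the native byte order and type sizes of the machine they run on, which is fine as long as the converter and R run on the same box. If the binary file ever moves to a machine with a different architecture, the layout can be spelled out explicitly on the R side, for example assuming the file was written on a little-endian machine with 4-byte ints and 8-byte doubles:

con <- file(infile, "rb")
dim <- readBin(con, "integer", 2, size = 4, endian = "little")
Mat <- matrix(readBin(con, "numeric", prod(dim), size = 8, endian = "little"), dim[1], dim[2])
close(con)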


That did the job, and I have already used the converter to read a few weeks' worth
of data for further analysis in R. This obviously isn’t the last word on
possible solutions as the additional temporary file can be wasteful (unless
it forms a cache for data read multiple times). If someone has nice
solutions, please don’t hold back and contact me. Thanks again to Laurent
for the winning suggestion concerning struct, and help in
getting the examples shown here to work.
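One possible direction, sketched here as a rough cut of my own rather than anything from the workflow above: the temporary file can be skipped entirely by parsing the .npy header in R itself. The sketch below assumes a version 1.0 .npy file containing a two-dimensional array of little-endian doubles (descr '<f8'), which is what the first Python script above produces:

# rough sketch: read a simple 2-d float64 .npy file directly into an R matrix
# assumes .npy format version 1.0 and descr '<f8' (little-endian doubles)
read_npy_matrix <- function(path) {
    con <- file(path, "rb")
    on.exit(close(con))
    magic <- readBin(con, "raw", 6)                       # magic string "\x93NUMPY"
    stopifnot(identical(magic, c(as.raw(0x93), charToRaw("NUMPY"))))
    readBin(con, "raw", 2)                                # major/minor format version, ignored here
    hlen <- readBin(con, "integer", 1, size = 2, endian = "little", signed = FALSE)
    header <- readChar(con, hlen, useBytes = TRUE)        # e.g. "{'descr': '<f8', 'fortran_order': False, 'shape': (3, 5), }"
    stopifnot(grepl("'<f8'", header, fixed = TRUE))
    shp <- as.integer(strsplit(sub(".*'shape': *\\(([0-9, ]*)\\).*", "\\1", header), ",")[[1]])
    stopifnot(length(shp) == 2)
    fortran <- grepl("'fortran_order': True", header, fixed = TRUE)
    dat <- readBin(con, "numeric", prod(shp), size = 8, endian = "little")
    matrix(dat, shp[1], shp[2], byrow = !fortran)         # C order is row-major, hence byrow
}

m <- read_npy_matrix("/tmp/data.npy")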
