The other day, I found myself confronted with a large number of large files. Which were presented in (gzip-)compressed ascii format—which R reads directly via
gzfile()connections—as well as (compressed) numpy files.
The numpy can be read very efficiently into Python. We can do the same in R
load(), of course. But the trouble
is that you need to read them first. And reading hundreds of megabytes from ascii is
slow, no matter which language you use. Concerning R, I poked aound
played with the
colClasses argument and looked at the recent LaF package written just for
this purpose. And all these solutions were still orders of magnitude slower than
reading numpy. Which is no surprise as it is really hard to beat binary
formats when you have to parse countless ascii tokens.
So the obvious next idea was to read the numpy file in Python, and to write a simple binary format. One helpful feature with this data set was that it contained only regular (rectangular) matrices of floats. So we could just store two integers for the dimensions, followed by the total data in either one large binary blob, or a sequence of column vectors.
But one minor trouble was that the Intertubes lead to no easy solution to unpack the numpy format. StackOverflow had plenty of question around this topic converned with, say, how to serialize in language-independent way. But no converters. And nobody local knew how to undo the “pickle” format underlying numpy.
But a remote friend did:
well-known for his Rpy2
package, pointed me towards using the
struct module and steered
me towards the solution shown below. So a shameless plug: if you need a very
experienced Python or R consultant for sciece work, consider
his consulting firm.
Finally, to round out this post, let’s show the simple solution we crafted so that the next guy searching the Intertubes will have an easier. Let us start with a minimal Python program writing numpy data to disk:
#!/usr/bin/env python # # simple example for creating numpy data to demonstrate converter import numpy as np # simple float array a = np.arange(15).reshape(3,5) * 1.1 outfile = "/tmp/data.npy" np.save(outfile, a)
Next, the simple Python converter to create a binary file containing two integers for row and column dimension, followed by row times columns of floats:
#!/usr/bin/python # # read a numpy file, and write a simple binary file containing # two integers 'n' and 'k' for rows and columns # n times k floats with the actual matrix # which can be read by any application or language that can read binary import struct import numpy as np inputfile = "/tmp/data.npy" outputfile = "/tmp/data.bin" # load from the file mat = np.load(inputfile) # create a binary file binfile = file(outputfile, 'wb') # and write out two integers with the row and column dimension header = struct.pack('2I', mat.shape, mat.shape) binfile.write(header) # then loop over columns and write each for i in range(mat.shape): data = struct.pack('%id' % mat.shape, *mat[:,i]) binfile.write(data) binfile.close()
Lastly, a quick littler script showing how R can read the data in a handful of lines:
That did the job---and I already used to converter to read a few weeks worth of data for further analysis in R. This obviously isn't the last word on possible solutions as the additional temporary file can be wasteful (unless it forms a cache for data read multiple times). If someone has nice solutions, please don't hold back and contact me. Thanks again to Laurent for the winning suggestion concerning#!/usr/bin/r infile <- "/tmp/data.bin" con <- file(infile, "rb") dim <- readBin(con, "integer", 2) Mat <- matrix( readBin(con, "numeric", prod(dim)), dim, dim) close(con) print(Mat)
struct, and help in getting the examples shown here to work.