**Thinking inside the box**, and kindly contributed to R-bloggers)

The other day, I found myself confronted with a large number of large

files. Which were presented in (gzip-)compressed ascii format—which R reads

directly via `gzfile()`

connections—as well as (compressed)

numpy files.

The numpy can be read very efficiently into Python. We can do the same in R

via `save()`

and `load()`

, of course. But the trouble

is that you need to read them first. And reading hundreds of megabytes from ascii is

slow, no matter which language you use. Concerning R, I poked aound `scan()`

,

played with the `colClasses`

argument and looked at the recent LaF package written just for

this purpose. And all these solutions were still orders of magnitude slower than

reading numpy. Which is no surprise as it is *really hard* to beat binary

formats when you have to parse countless ascii tokens.

So the obvious next idea was to read the numpy file in Python, and to write a simple

binary format. One helpful feature with this data set was that it contained

only regular (rectangular) matrices of floats. So we could just store two integers for

the dimensions, followed by the total data in either one large binary blob,

or a sequence of column vectors.

But one minor trouble was that the Intertubes lead to no easy solution to

unpack the numpy format. StackOverflow

had plenty of question around this topic converned with, say, how to

serialize in language-independent way. But no converters. And nobody

local knew how to undo the “pickle” format underlying numpy.

But a remote friend did:

Laurent,

well-known for his Rpy2

package, pointed me towards using the `struct`

module and steered

me towards the solution shown below. So a shameless plug: if you need a very

experienced Python or R consultant for sciece work, consider

his consulting firm.

Finally, to round out this post, let’s show the simple solution we crafted so that the

next guy searching the Intertubes will have an easier. Let us start with a

minimal Python program writing numpy data to disk:

#!/usr/bin/env python # # simple example for creating numpy data to demonstrate converter import numpy as np # simple float array a = np.arange(15).reshape(3,5) * 1.1 outfile = "/tmp/data.npy" np.save(outfile, a)

Next, the simple Python converter to create a binary file containing two

integers for row and column dimension, followed by row times columns of

floats:

#!/usr/bin/python # # read a numpy file, and write a simple binary file containing # two integers 'n' and 'k' for rows and columns # n times k floats with the actual matrix # which can be read by any application or language that can read binary import struct import numpy as np inputfile = "/tmp/data.npy" outputfile = "/tmp/data.bin" # load from the file mat = np.load(inputfile) # create a binary file binfile = file(outputfile, 'wb') # and write out two integers with the row and column dimension header = struct.pack('2I', mat.shape[0], mat.shape[1]) binfile.write(header) # then loop over columns and write each for i in range(mat.shape[1]): data = struct.pack('%id' % mat.shape[0], *mat[:,i]) binfile.write(data) binfile.close()

Lastly, a quick littler script

showing how R can read the data in a handful of lines:

#!/usr/bin/r infile <- "/tmp/data.bin" con <- file(infile, "rb") dim <- readBin(con, "integer", 2) Mat <- matrix( readBin(con, "numeric", prod(dim)), dim[1], dim[2]) close(con) print(Mat)

That did the job—and I already used to converter to read a few weeks worth

of data for further analysis in R. This obviously isn’t the last word on

possible solutions as the additional temporary file can be wasteful (unless

it forms a cache for data read multiple times). If someone has nice

solutions, please don’t hold back and contact me. Thanks again to Laurent

for the winning suggestion concerning `struct`

, and help in

getting the examples shown here to work.

**leave a comment**for the author, please follow the link and comment on their blog:

**Thinking inside the box**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...