Hitting the Big Data Ceiling in R

May 16, 2010
By

(This article was first published on Byte Mining » R, and kindly contributed to R-bloggers)

As a true R fan, I like to believe that R can do anything, no matter how big, how small or how complicated: there is some way to do it in R. I decided to approach my large, sparse matrix problem with this attitude. But here I sit a broken man.

There is no “native” big data support built into R, even if using the 64bit build of R. Before venturing on this endeavor, I consulted with my advisor who reassured me that R uses the state of the art for sparse matrices. That was enough for me.

My Problem

For part of my Masters thesis, I wrote code to extract all of the friends and followers out to network degree 2 to construct a “small-world” snapshot of a user via their relationships. In a graph, nodes and edges grow exponentially as the degree increases. The number of nodes was on the order of 300,000. The number of edges I predict will be around 900,000. The code is still running. This means that a dense matrix would have size $300000^2 \approx 9 \times 10^{10}$. Some of you already know how this story is going to end…

The matrix is very sparse.

$\frac{900000}{9 \times 10^{10}} \approx 10^{-6}$

Very sparse.

The raw data graph.log consists of an edgelist with Twitter usernames separated by a comma. Another file names.row contains the node order from NetworkX.

R has some packages for working with large datasets. Let’s take a look.

Bigmemory

Bigmemory is a high-level interface for creating and manipulating very large C++ matrices in shared memory or on disk. Storing matrices in shared memory allows several instances of R to work on the same dataset, particularly useful in a multicore system or an analytics server that needs to have several instances of R operating on the same data.

Even better, Bigmemory supports filebacked matrices. A filebacked matrix resides on disk and also comes with a descriptor file that essentially tells R where the data is. Once a big.matrix is constructed, R simply retains a pointer to the matrix on disk. Wonderful! So how did I try to use it?

the.names < - read.table("names.row")[,1] dims <- length(the.names) graph <- read.table("graph.log",sep=",",header=FALSE) graph[,1] <- levels(graph[,1])[graph[,1]] graph[,2] <- levels(graph[,2])[graph[,2]] indices <- cbind(match(graph[,1], the.names), match(graph[,2], the.names))   library(bigmemory) X <- filebacked.big.matrix(dims,dims, type="char", init=0, separated=FALSE, backingfile="incidence_matrix.bin", descriptor="incidence_matrix.desc") for (row in 1:nrow(indices)) { X[indices[row,1], indices[row,2]] <- '1' }

The good news is that it worked. It would be nice if the constructor though had some type of progress indicator because the filesystem does not report the filesize of the matrix while it is being written on disk. As a workaround, I used separated=TRUE so that each column was written to its own file and I could see which files have been created. This worked, but it caused the filesystem to choke due to the number of files created. Trying to do anything yielded the following nightmare:

Argument list too long.

So when I had to delete this mess of files, I had to use some shell magic.

find . -type f -name 'incidence_matrix.bin_column_*' -exec rm {} \;

Now that I knew how long it took to write all 300,000 columns/files to disk (about 25 mins), I knew it would take about the same amount of time with separated=FALSE.

The bad news? The file was 117GB! Of course, this is because Bigmemory apparently stores matrices as dense matrices. There does not seem to be a sparse option with Bigmemory.

ff

Then there’s ff. What does it stand for? Fun files? Fancy files? Flat files? ff and its sister package bit have very fine-grained controls for matrices on disk, but unfortunately, R was not up to my challenge.

library(ff) ffMatrix < - ff(vmode="logical", dim=c(300000,300000))

yielded

Error in if (length < 0 || length > .Machine$integer.max) stop("length must be between 1 and .Machine$integer.max") :
missing value where TRUE/FALSE needed
In ff(vmode = "logical", dim = c(3e+05, 3e+05)) :
NAs introduced by coercion

So what did I do wrong? Well, nothing. What did the package do wrong? There must be something wrong to yield such a type error, right? Again, nothing is wrong. This is an R problem. I have a machine with 64bit CPUs, a 64bit capable OS (Snow Leopard) running in 64bit mode (hold keys 6 4 at bootup), running 64bit R, 64bit ff and all prerequisite libraries built for 64bit. So, what gives?

The missing piece of the puzzle is that although R is 64bit capable, the ramifications mostly involve memory and the amount of memory that can be addressed. The kicker is that R does not have an int64, or long long type. R will not recognize integers that are larger than .Machine\$max.integer for anything. Thanks to Hadley Wickham for pointing this out.

sparseMatrix() in Matrix

The Matrix package contains a handy function sparseMatrix that takes in a vector of row coordinates, a vector of column coordinates and a vector of values for location (row, col), and some other arguments. Constructing the matrix was fast, but there was a major problem: there is no way to modify the sparsity structure of the matrix. The best I could do for the full matrix (much bigger than the one used for bigmemory and ff) was to create sparse matrices based on one million elements at a time, and accumulate by adding the individual matrices together. This worked fine for a while, but the system started to drastically slow down around 65 million entries. Thanks to David Smith for suggesting sparseMatrix as it looks powerful for smaller problems.

So, What to Do?

The common response is, "just read in what you need." That's fine, but not all matrix factorizations can be treated this way, and all of these algorithms would need to be implemented by the user as nothing convenient seems to exist. If you know of any piecemeal filewise packages, comment!

If what I just mentioned is not an option, I suggest using another language for the time being. Python's NumPy and SciPy modules are awesome with huge, sparse matrices, but still some issues exist when using a matrix of my size requirement. Clojure has become another good option due to its pure functional credentials. Right now, I just sound like a Clojure fanboy because I really have no explanation as to why Clojure may be better than R and Python in this regard, but it seems to be better for parallel operations necessary for large data.

The conclusion? R is great, but really falls flat with large and sparse matrices. If SciPy can do it, R will do it...eventually.