In a previous post, I went over the basics of linking up bigmemory and the eigen C++ library via RcppEigen. In this post I’ll take this a bit further by creating a version of the fastLm() function of RcppEigen that can accept bigmemory objects. By doing so, we will create a fast way to fit linear models using data which is too big to fit in RAM. With RcppEigen, fitting linear models using out-of-memory computation doesn’t have to be slow. The code for this is all on github in the bigFastlm package.
Before we even start, most of the work is already done as we’ll just need to change a few lines of the core C++ code of the fastLm() function so that it we can map the bigmemory pointer to data on disk to an eigen matrix object.
The core code of fastLm can be found here. The data object which is being loaded into C++ from R is mapped to an eigen matrix object at line 208 of fastLm.cpp.
We need to change the above code to
The above modification first takes Xs as an Rcpp external pointer object (XPtr) and then checks to make sure it’s a double type (for now I’m ignoring all other data types (int, etc) for simplicity). Now that X is a mapped eigen matix object which points to data on disk, what else do we need to do? Well, not much! We just need to make sure that the correct object types are defined for the R-callable function. To do this, we need to change
infastLm.cpp. // [[Rcpp::export]] had some trouble doing this automatically, so the above is just what that should have created.
So now with the proper R functions to call this, we’re basically done. I had to create a few utility functions to make everything work nicely, but the main work is just the above.
One important detail: for now, we can only use the LLT and LDLT methods for computation, as the other decompositions create objects which scale with the size of the data, so for now I’m ignoring the more robust decompositions like QR. Perhaps someone else can figure out how to perform these in a memory-conscious manner.
Comparison with biglm
Now we’ll run a (perhaps not-so-fair) comparison with the biglm function of biglm. Specifically, we’ll use the biglm.big.matrix function provided by the biganalytics which interfaces bigmemory and biglm. The following code creates a bigmemory object on disk (actually, two because biglm requires the matrix object to contain the response, whereas bigFastlm requires that the response be an R vector. It’s hard to say which is a better design choice, but I’m sticking with the approach which doesn’t allow an R formula expression).
Now let’s see how biglm and bigLm (eigen + bigmemory) stack up. Note that this is an unfair comparison because biglm requires a formula argument and bigLm assumes you’re passing in a design matrix already and biglm uses the QR decomposition, which is slower than LLT or LDLT.
bigLm seems to be quite a bit faster than biglm, but how fast is bigLm compared with fastLm (which requires the data to be loaded into memory)? It turns out it’s pretty close on my computer and I don’t even have anything fancy like a solid state drive.
Future work would be to try to figure out how to make the QR decomposition memory-feasible and also to write a function for generalized linear models.
To leave a comment for the author, please follow the link and comment on their blog: jared huling.