(This article was first published on **jared huling**, and kindly contributed to R-bloggers)

In a previous post, I went over the basics of linking up bigmemory and the eigen C++ library via RcppEigen. In this post I’ll take this a bit further by creating a version of the `fastLm()` function of RcppEigen that can accept bigmemory objects. By doing so, we will create a fast way to fit linear models using data which is too big to fit in RAM. With RcppEigen, fitting linear models using out-of-memory computation doesn’t have to be slow. The code for this is all on github in the bigFastlm package.

Before we even start, most of the work is already done: we just need to change a few lines of the core C++ code of the `fastLm()` function so that we can map the bigmemory pointer to data on disk to an eigen matrix object.

The core code of `fastLm` can be found here. The data object which is being loaded into C++ from R is mapped to an eigen matrix object at line 208 of `fastLm.cpp`.

We need to change the above code to

The above modification first takes Xs as an Rcpp external pointer object (`XPtr`) and then checks to make sure it’s a double type (for now I’m ignoring all other data types, such as int, for simplicity). Now that X is a mapped eigen matrix object which points to data on disk, what else do we need to do? Well, not much! We just need to make sure that the correct object types are defined for the R-callable function. To do this, we need to change

To

in `fastLm.cpp`. The `// [[Rcpp::export]]` attribute had some trouble doing this automatically, so the above is just what it *should* have generated.

So now with the proper R functions to call this, we’re basically done. I had to create a few utility functions to make everything work nicely, but the main work is just the above.
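To make the idea of "mapping" concrete, here is a minimal, dependency-free C++ sketch of a non-owning view over a column-major buffer of doubles. It is only an illustration of the role a mapped matrix plays (wrapping existing memory instead of copying it), not code from RcppEigen or bigFastlm; the names `ColMajorView` and `column_sum` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical non-owning view over a column-major block of doubles.
// It never copies or frees the data; it only remembers where the data
// lives and how it is laid out, which is the essence of "mapping".
class ColMajorView {
public:
    ColMajorView(const double* data, std::size_t nrow, std::size_t ncol)
        : data_(data), nrow_(nrow), ncol_(ncol) {
        if (data == nullptr) throw std::invalid_argument("null data pointer");
    }

    // Element (i, j) of a column-major matrix sits at offset j * nrow + i.
    double operator()(std::size_t i, std::size_t j) const {
        assert(i < nrow_ && j < ncol_);
        return data_[j * nrow_ + i];
    }

    std::size_t rows() const { return nrow_; }
    std::size_t cols() const { return ncol_; }
    const double* data() const { return data_; }

private:
    const double* data_;
    std::size_t nrow_;
    std::size_t ncol_;
};

// Example consumer: sums one column by walking the underlying buffer in place.
double column_sum(const ColMajorView& x, std::size_t j) {
    double total = 0.0;
    for (std::size_t i = 0; i < x.rows(); ++i) total += x(i, j);
    return total;
}
```

Because the view holds only a pointer and two dimensions, it costs the same whether the buffer lives in ordinary RAM or in a memory-mapped file; any paging from disk happens as elements are touched.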

One important detail: for now, we can only use the LLT and LDLT methods for computation, as the other decompositions create objects which scale with the size of the data, so I’m ignoring the more robust decompositions like QR. Perhaps someone else can figure out how to perform these in a memory-conscious manner.
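The reason a Cholesky-based fit can stay small is that it can work from the normal equations: the only things that need to be held and factored are the p × p matrix X'X and the length-p vector X'y, no matter how many rows the data has. The following is a rough, dependency-free C++ sketch of that idea under stated assumptions. It is not the package's code: the names `NormalEquations`, `accumulate_normal_equations`, and `cholesky_solve` are hypothetical, and a raw column-major pointer stands in for the mapped matrix.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Holds X'X (p x p, entry (i, j) stored at i * p + j) and X'y (length p).
struct NormalEquations {
    std::size_t p;
    std::vector<double> xtx;
    std::vector<double> xty;
};

// Builds X'X and X'y from a column-major n x p buffer x and a response y.
// Only O(p^2) extra memory is allocated, independent of n.
NormalEquations accumulate_normal_equations(const double* x, const double* y,
                                            std::size_t n, std::size_t p) {
    assert(x != nullptr && y != nullptr);
    NormalEquations ne{p, std::vector<double>(p * p, 0.0),
                       std::vector<double>(p, 0.0)};
    for (std::size_t i = 0; i < p; ++i) {
        const double* xi = x + i * n;  // start of column i
        for (std::size_t j = 0; j <= i; ++j) {
            const double* xj = x + j * n;  // start of column j
            double s = 0.0;
            for (std::size_t r = 0; r < n; ++r) s += xi[r] * xj[r];
            ne.xtx[i * p + j] = s;  // fill both halves: X'X is symmetric
            ne.xtx[j * p + i] = s;
        }
        double t = 0.0;
        for (std::size_t r = 0; r < n; ++r) t += xi[r] * y[r];
        ne.xty[i] = t;
    }
    return ne;
}

// Solves (X'X) beta = X'y via the Cholesky factorization X'X = L L'.
std::vector<double> cholesky_solve(NormalEquations ne) {
    const std::size_t p = ne.p;
    std::vector<double>& a = ne.xtx;  // lower triangle is overwritten with L
    for (std::size_t j = 0; j < p; ++j) {
        double d = a[j * p + j];
        for (std::size_t k = 0; k < j; ++k) d -= a[j * p + k] * a[j * p + k];
        if (d <= 0.0) throw std::runtime_error("X'X is not positive definite");
        a[j * p + j] = std::sqrt(d);
        for (std::size_t i = j + 1; i < p; ++i) {
            double s = a[i * p + j];
            for (std::size_t k = 0; k < j; ++k) s -= a[i * p + k] * a[j * p + k];
            a[i * p + j] = s / a[j * p + j];
        }
    }
    std::vector<double> b = ne.xty;
    // Forward substitution: solve L z = X'y.
    for (std::size_t i = 0; i < p; ++i) {
        for (std::size_t k = 0; k < i; ++k) b[i] -= a[i * p + k] * b[k];
        b[i] /= a[i * p + i];
    }
    // Back substitution: solve L' beta = z.
    for (std::size_t i = p; i-- > 0;) {
        for (std::size_t k = i + 1; k < p; ++k) b[i] -= a[k * p + i] * b[k];
        b[i] /= a[i * p + i];
    }
    return b;
}
```

The trade-off is numerical: forming X'X squares the condition number of the problem, which is part of why QR is regarded as the more robust choice when it is affordable.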

### Comparison with biglm

Now we’ll run a (perhaps not-so-fair) comparison with the `biglm` function of biglm. Specifically, we’ll use the `biglm.big.matrix` function provided by the biganalytics package, which interfaces bigmemory and biglm. The following code creates a bigmemory object on disk (actually two, because biglm requires the matrix object to contain the response, whereas bigFastlm requires that the response be an R vector). It’s hard to say which is the better design choice, but I’m sticking with the approach which doesn’t allow an R formula expression.

Now let’s see how `biglm` and `bigLm` (eigen + bigmemory) stack up. Note that this is an unfair comparison for two reasons: `biglm` requires a formula argument whereas `bigLm` assumes you’re passing in a design matrix already, and biglm uses the QR decomposition, which is slower than LLT or LDLT.

`bigLm` seems to be quite a bit faster than `biglm`, but how fast is `bigLm` compared with `fastLm` (which requires the data to be loaded into memory)? It turns out it’s pretty close on my computer, and I don’t even have anything fancy like a solid state drive.

Future work would be to try to figure out how to make the QR decomposition memory-feasible and also to write a function for generalized linear models.
