**Thinking inside the box**, and kindly contributed to R-bloggers)

Linear regression models are a major component of every applied researcher’s

toolbox. Obtaining results more quickly is therefore of central importance,

particularly when many such models have to be fit. Common examples in this

context are Monte Carlo simulation or bootstrapping.

My talks introducing High Performance Computing with R (see e.g.

these

slides from a five-hour workshop at the ISM in Tokyo) frequently feature an example

of how to extend R with dedicated

compiled code for linear regressions. Romain and I also frequently use this

a motivating examples with our

Rcpp package for

seamless R and C++ integration. In fact, the examples directory for Rcpp still

contains an earlier version of a benchmark for `fastLm()`, a faster

alternative for R’s `lm()` and `lm.fit()` functions. We have also

extended this with the

RcppArmadillo

package which brings Conrad Sanderson’s excellent

Armadillo library with templated C++ code

for linear algebra to R, as well as a simple integration to the GNU GSL via our

RcppGSL

package. The Rcpp

section on my blog contains several posts about `fastLm` benchmarks.

Doug Bates has been a key Rcpp contributor, helping particularly with the

initial Armadillo integration. His research, however, also requires highly

performing sparse matrix operations which Armadillo does not yet offer. So

Doug has started to explore the

Eigen project—a free C++ template

math library mainly focused on vectors, matrices, and linear algebra (note

that we will refer to the Eigen, Eigen2 and Eigen3 APIs as just ‘Eigen’ here,

focusing on the latest version, Eigen3). Better

still, Doug went to work and pretty much single-handedly wrote a new package

RcppEigen which

integrates the templated C++ library

Eigen with

R using

Rcpp.

RcppEigen also provides a `fastLm` implementation and benchmark

script. In fact, it contains a full six

different implementations as Doug is keenly interested in rank-revealing

decompositions which can guard against ill-conditioned model matrices. Some more

background information on this is also available in Doug’s article on *Least
Squares Calculations in R* in

R News 4(1).

Doug’s implementation also uses an elegant design. It comprises a base class with common

functionality, and six subclasses which specialize accordingly for these

six different decompositions approaches:

- a column-pivoting QR,
- a standard QR,
- an LL,
- an LDL,
- an SVD, and
- a standard symmetric one.

On my server, the result of running the included benchmark script

`lmBenchmark` is as follows:

lm benchmark for n = 100000 and p = 40: nrep = 20 test relative elapsed user.self sys.self 7 LLt 1.000000 0.918 0.91 0.00 3 LDLt 1.002179 0.920 0.92 0.00 5 SymmEig 3.021786 2.774 2.19 0.57 6 QR 5.136166 4.715 4.24 0.48 2 PivQR 5.303922 4.869 4.27 0.58 8 arma 6.592593 6.052 6.03 0.02 1 lm.fit 9.386710 8.617 7.14 1.45 4 SVD 33.858388 31.082 30.19 0.84 9 GSL 114.972767 105.545 104.79 0.63

From this first set of results, the preferred method may be ‘PivQR’, the

pivoted QR. Strictly-speaking, it is the only one we can compare to

`lm.fit()` which also uses a pivoting scheme. In the case of a degenerated model matrix,

*all* the other methods, including the four fastest approaches, are

susceptible to producing incorrect estimates. Doug plans to

make SVD and SymmEig rank-revealing too.

As for pure speed, the LL and LDL decomposition have almost identical

performance, and are clearly faster than the other approaches. Compared to

`lm.fit()`, which is the best one could do with just R, we see an

**improvement by a factor of eight** which is quite impressive (albeit

not robust to rank-deficient model matrices). Apart from the SVD, all

approaches using Eigen are faster than the one using Armadillo, which itself

is still faster than R’s `lm.fit()`. Doug and I were very surprised

by the poor performance of the GNU GSL (which also uses SVD) via RcppGSL.

Now, Eigen uses its own code for all linear algebra operations, bypassing

BLAS and LAPACK libraries. The results above were achieved with the current

Atlas package in Ubuntu. If we take advantage of the BLAS / LAPACK plug-in

architecture offered on Debian / Ubuntu systems (see the vignette in my

gcbd package for more)

and use Goto BLAS which provide tuning as well as parallelism on

multi-core machines, the results are as follow:

lm benchmark for n = 100000 and p = 40: nrep = 20 test relative elapsed user.self sys.self 3 LDLt 1.000000 0.907 0.90 0.00 7 LLt 1.000000 0.907 0.91 0.00 5 SymmEig 2.981257 2.704 2.14 0.56 6 QR 5.004410 4.539 4.03 0.50 8 arma 5.072767 4.601 15.30 3.05 2 PivQR 5.307607 4.814 4.27 0.55 1 lm.fit 8.302095 7.530 9.55 12.25 4 SVD 33.015436 29.945 29.06 0.85 9 GSL 195.413451 177.240 244.64 319.89

We see that the BLAS-using Armadillo approach improves a little and moves

just slightly ahead of the pivoted QR. On the other hand, `lm.fit()`, which also

uses a pivoting scheme and hence only level 1 BLAS operations, changes less.

GSL performs even worse (and it is unclear why).

Doug’s post announcing RcppEigen on the Eigen list has a few more sets of results.

This post has illustrated some of the performance gains that can be obtained

from using Eigen via RcppEigen. When not

using rank-revealing methods, computing time can be reduced by up to eight

times relative to `lm.fit()`. Rank-revealing method can still improve

by almost a factor of two. The main disadvantage of Eigen may be one of the

reasons behind its impressive performance: its heavily templated code does

not use BLAS, and the resulting object code (as e.g. in RcppEigen) becomes

enormous. As one illustration, the shared library for RcppEigen on my Ubuntu

64-bit system has a size of 24.6 mb whereas RcppArmadillo comes in at a mere

0.78 mb.

The performance of Eigen is certainly intriguiging, and its API is rather

complete. It seems safe to say that we may see more R projects going to make

use of Eigen thanks to the RcppEigen wrapper.

**leave a comment**for the author, please follow the link and comment on their blog:

**Thinking inside the box**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...