A Speed Comparison Between Flexible Linear Regression Alternatives in R



Everybody loves speed comparisons! Is R faster than Python? Is dplyr faster than data.table? Is Stan faster than JAGS? It has been said that speed comparisons are utterly meaningless, and in general I agree, especially when you are comparing apples and oranges, which is what I'm going to do here. I'm going to compare a couple of alternatives to lm() that can be used to run linear regressions in R but that are more general than lm(). One reason for doing this was to see how much performance you'd lose by using one of these tools to run a linear regression (even when you could have used lm()). But as speed comparisons are utterly meaningless, my main reason for blogging about this is just to highlight a couple of tools you can use when you've grown out of lm(). The speed comparison was just to lure you in. Let's run!

The Contenders

Below are the seven different methods that I'm going to compare by running the same linear regression with each of them. If you are only interested in the speed comparison, just scroll to the bottom of the post. And if you are actually interested in running standard linear regressions as fast as possible in R, Dirk Eddelbuettel has a nice post that covers just that.

lm()

This is the baseline, the “default” method for running linear regressions in R. If we have a data.frame d with the following layout:

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%">head(d)
</pre></div>

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #408080; font-style: italic">##         y      x1      x2</span>
<span style="color: #408080; font-style: italic">## 1 -64.579 -1.8088 -1.9685</span>
<span style="color: #408080; font-style: italic">## 2 -19.907 -1.3988 -0.2482</span>
<span style="color: #408080; font-style: italic">## 3  -4.971  0.8366 -0.5930</span>
<span style="color: #408080; font-style: italic">## 4  19.425  1.3621  0.4180</span>
<span style="color: #408080; font-style: italic">## 5  -1.124 -0.7355  0.4770</span>
<span style="color: #408080; font-style: italic">## 6 -12.123 -0.9050 -0.1259</span>
</pre></div>
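The post doesn't show how d was generated, but something along these lines would produce a data frame with the same layout. The coefficients and noise level below are guesses based on the estimates further down, not the author's actual script:

```r
# Hypothetical data generation, roughly matching the estimates below;
# the real script behind the post is not shown, so treat this as a guess.
set.seed(123)
n  <- 30
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- -0.3 + 10 * x1 + 21 * x2 + rnorm(n, sd = 3)
d  <- data.frame(y, x1, x2)
```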

Then this would run a linear regression with y as the outcome variable and x1 and x2 as the predictors:

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%">lm(y <span style="color: #666666">~</span> <span style="color: #666666">1</span> <span style="color: #666666">+</span> x1 <span style="color: #666666">+</span> x2, data<span style="color: #666666">=</span>d)
</pre></div>

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #408080; font-style: italic">## </span>
<span style="color: #408080; font-style: italic">## Call:</span>
<span style="color: #408080; font-style: italic">## lm(formula = y ~ 1 + x1 + x2, data = d)</span>
<span style="color: #408080; font-style: italic">## </span>
<span style="color: #408080; font-style: italic">## Coefficients:</span>
<span style="color: #408080; font-style: italic">## (Intercept)           x1           x2  </span>
<span style="color: #408080; font-style: italic">##      -0.293       10.364       21.225</span>
</pre></div>
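As an aside, if you keep the fitted object around, the usual extractor functions give you more than the printed coefficients:

```r
fit <- lm(y ~ 1 + x1 + x2, data = d)
coef(fit)           # point estimates
confint(fit)        # 95% confidence intervals
summary(fit)$sigma  # residual standard deviation
```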

glm()

This is a generalization of lm() that allows you to assume a number of different distributions for the outcome variable, not just the normal distribution you are stuck with when using lm(). However, if you don't specify any distribution, glm() defaults to a normal distribution and produces output identical to lm():

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%">glm(y <span style="color: #666666">~</span> <span style="color: #666666">1</span> <span style="color: #666666">+</span> x1 <span style="color: #666666">+</span> x2, data<span style="color: #666666">=</span>d)
</pre></div>

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #408080; font-style: italic">## </span>
<span style="color: #408080; font-style: italic">## Call:  glm(formula = y ~ 1 + x1 + x2, data = d)</span>
<span style="color: #408080; font-style: italic">## </span>
<span style="color: #408080; font-style: italic">## Coefficients:</span>
<span style="color: #408080; font-style: italic">## (Intercept)           x1           x2  </span>
<span style="color: #408080; font-style: italic">##      -0.293       10.364       21.225  </span>
<span style="color: #408080; font-style: italic">## </span>
<span style="color: #408080; font-style: italic">## Degrees of Freedom: 29 Total (i.e. Null);  27 Residual</span>
<span style="color: #408080; font-style: italic">## Null Deviance:	    13200 </span>
<span style="color: #408080; font-style: italic">## Residual Deviance: 241 	AIC: 156</span>
</pre></div>
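The printout matches lm() because the default family is gaussian(). Swapping in another family is how glm() generalizes; a quick sketch with a made-up binary outcome (y_bin is invented for illustration, not part of the example data):

```r
# A hypothetical logistic regression on an invented binary outcome.
d$y_bin <- as.numeric(d$y > 0)
glm(y_bin ~ 1 + x1 + x2, data = d, family = binomial())
```

(With a binary outcome this cleanly determined by the predictors, you may even run into the separation problem that bayesglm(), below, is good at handling.)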

bayesglm()

Found in the arm package, this is a modification of glm() that allows you to assume custom prior distributions over the coefficients (instead of the implicit flat priors of glm()). This can be super useful, for example, when you have to deal with perfect separation in logistic regression or when you want to include prior information in the analysis. While there is "bayes" in the function name, note that bayesglm() does not give you the whole posterior distribution, only point estimates. This is how to run a linear regression with flat priors, which should give results similar to lm():

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%">library(arm)
bayesglm(y <span style="color: #666666">~</span> <span style="color: #666666">1</span> <span style="color: #666666">+</span> x1 <span style="color: #666666">+</span> x2, data <span style="color: #666666">=</span> d, prior.scale<span style="color: #666666">=</span><span style="color: #008000; font-weight: bold">Inf</span>, prior.df<span style="color: #666666">=</span><span style="color: #008000; font-weight: bold">Inf</span>)
</pre></div>

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #408080; font-style: italic">## </span>
<span style="color: #408080; font-style: italic">## Call:  bayesglm(formula = y ~ 1 + x1 + x2, data = d, prior.scale = Inf, </span>
<span style="color: #408080; font-style: italic">##     prior.df = Inf)</span>
<span style="color: #408080; font-style: italic">## </span>
<span style="color: #408080; font-style: italic">## Coefficients:</span>
<span style="color: #408080; font-style: italic">## (Intercept)           x1           x2  </span>
<span style="color: #408080; font-style: italic">##      -0.293       10.364       21.225  </span>
<span style="color: #408080; font-style: italic">## </span>
<span style="color: #408080; font-style: italic">## Degrees of Freedom: 29 Total (i.e. Null);  30 Residual</span>
<span style="color: #408080; font-style: italic">## Null Deviance:	    13200 </span>
<span style="color: #408080; font-style: italic">## Residual Deviance: 241 	AIC: 156</span>
</pre></div>
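Instead of the flat priors above, you can also ask for informative priors. A sketch with weakly informative Cauchy priors on the coefficients (prior.df = 1 gives a Cauchy; the scale of 2.5 here is an assumption for illustration, not a recommendation from the post):

```r
# Hypothetical weakly informative priors; a t distribution with
# df = 1 is a Cauchy.
bayesglm(y ~ 1 + x1 + x2, data = d,
         prior.mean = 0, prior.scale = 2.5, prior.df = 1)
```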

nls()

While lm() can only fit linear models, nls() can also be used to fit non-linear models by least squares. For example, you could fit a sine curve to a data set with the following call: nls(y ~ par1 + par2 * sin(par3 + par4 * x)). Notice that the syntax is a little different from lm(), as you have to write out both the variables and the parameters. Here is how to run the linear regression:

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%">nls(y <span style="color: #666666">~</span> intercept <span style="color: #666666">+</span> x1 <span style="color: #666666">*</span> beta1 <span style="color: #666666">+</span> x2 <span style="color: #666666">*</span> beta2, data <span style="color: #666666">=</span> d)
</pre></div>

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #408080; font-style: italic">## Nonlinear regression model</span>
<span style="color: #408080; font-style: italic">##   model: y ~ intercept + x1 * beta1 + x2 * beta2</span>
<span style="color: #408080; font-style: italic">##    data: d</span>
<span style="color: #408080; font-style: italic">## intercept     beta1     beta2 </span>
<span style="color: #408080; font-style: italic">##    -0.293    10.364    21.225 </span>
<span style="color: #408080; font-style: italic">##  residual sum-of-squares: 241</span>
<span style="color: #408080; font-style: italic">## </span>
<span style="color: #408080; font-style: italic">## Number of iterations to convergence: 1 </span>
<span style="color: #408080; font-style: italic">## Achieved convergence tolerance: 3.05e-08</span>
</pre></div>
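The sine-curve call mentioned above needs explicit starting values in practice, and for periodic models they have to be reasonably close to the truth. A sketch on made-up data (x, y_sine, and the true parameters are all invented for illustration):

```r
# Hypothetical non-linear example; nls() may fail to converge if the
# starting values are far from the true parameters.
x <- seq(0, 10, length.out = 100)
y_sine <- 1 + 2 * sin(0.5 + 1.5 * x) + rnorm(100, sd = 0.3)
nls(y_sine ~ par1 + par2 * sin(par3 + par4 * x),
    start = list(par1 = 0.5, par2 = 1.5, par3 = 0.3, par4 = 1.4))
```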

mle2()

In the bbmle package we find mle2(), a function for general maximum likelihood estimation. While mle2() can be used to maximize a handcrafted likelihood function, it also has a formula interface that is simple to use, yet powerful, and that plays nicely with R's built-in distributions. Here is how to roll a linear regression:

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%">library(bbmle)
inits <span style="color: #666666"><-</span> list(log_sigma <span style="color: #666666">=</span> rnorm(<span style="color: #666666">1</span>), intercept <span style="color: #666666">=</span> rnorm(<span style="color: #666666">1</span>),
              beta1 <span style="color: #666666">=</span> rnorm(<span style="color: #666666">1</span>), beta2 <span style="color: #666666">=</span> rnorm(<span style="color: #666666">1</span>))
mle2(y <span style="color: #666666">~</span> dnorm(mean <span style="color: #666666">=</span> intercept <span style="color: #666666">+</span> x1 <span style="color: #666666">*</span> beta1 <span style="color: #666666">+</span> x2 <span style="color: #666666">*</span> beta2, sd <span style="color: #666666">=</span> exp(log_sigma)),
     start <span style="color: #666666">=</span> inits, data <span style="color: #666666">=</span> d)
</pre></div>

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #408080; font-style: italic">## </span>
<span style="color: #408080; font-style: italic">## Call:</span>
<span style="color: #408080; font-style: italic">## mle2(minuslogl = y ~ dnorm(mean = intercept + x1 * beta1 + x2 * </span>
<span style="color: #408080; font-style: italic">##     beta2, sd = exp(log_sigma)), start = inits, data = d)</span>
<span style="color: #408080; font-style: italic">## </span>
<span style="color: #408080; font-style: italic">## Coefficients:</span>
<span style="color: #408080; font-style: italic">## log_sigma intercept     beta1     beta2 </span>
<span style="color: #408080; font-style: italic">##    1.0414   -0.2928   10.3641   21.2248 </span>
<span style="color: #408080; font-style: italic">## </span>
<span style="color: #408080; font-style: italic">## Log-likelihood: -73.81</span>
</pre></div>

Note that we need to explicitly initialize the parameters before the maximization, and that we now also need a parameter for the standard deviation. For an even more versatile use of the formula interface for building statistical models, check out the very cool rethinking package by Richard McElreath.
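And the formula interface is not limited to dnorm(): any of R's built-in distributions will slot in. A sketch of a Poisson regression on an invented count outcome (y_count and its coefficients are made up for illustration):

```r
# Hypothetical count outcome; dpois() slots into the formula just
# like dnorm() did above.
d$y_count <- rpois(nrow(d), lambda = exp(1 + 0.5 * d$x1))
mle2(y_count ~ dpois(lambda = exp(intercept + x1 * beta1)),
     start = list(intercept = 0, beta1 = 0), data = d)
```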

optim()

Of course, if we want to be really versatile, we can craft our own log-likelihood function to be maximized using optim(), also part of base R. This gives us complete freedom, but there are also more things that can go wrong: we might make mistakes in the model specification, and if the search for the optimal parameters is not initialized well, the model might not converge at all! A linear regression log-likelihood could look like this:

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%">log_like_fn <span style="color: #666666"><-</span> <span style="color: #008000; font-weight: bold">function</span>(par, d) {
  sigma <span style="color: #666666"><-</span> exp(par[<span style="color: #666666">1</span>])
  intercept <span style="color: #666666"><-</span> par[<span style="color: #666666">2</span>]
  beta1 <span style="color: #666666"><-</span> par[<span style="color: #666666">3</span>]
  beta2 <span style="color: #666666"><-</span> par[<span style="color: #666666">4</span>]
  mu <span style="color: #666666"><-</span> intercept <span style="color: #666666">+</span> d<span style="color: #666666">$</span>x1 <span style="color: #666666">*</span> beta1 <span style="color: #666666">+</span> d<span style="color: #666666">$</span>x2 <span style="color: #666666">*</span> beta2
  sum(dnorm(d<span style="color: #666666">$</span>y, mu, sigma, log<span style="color: #666666">=</span><span style="color: #008000; font-weight: bold">TRUE</span>))
}

inits <span style="color: #666666"><-</span> rnorm(<span style="color: #666666">4</span>)
optim(par <span style="color: #666666">=</span> inits, fn <span style="color: #666666">=</span> log_like_fn, control <span style="color: #666666">=</span> list(fnscale <span style="color: #666666">=</span> <span style="color: #666666">-1</span>), d <span style="color: #666666">=</span> d)
</pre></div>

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #408080; font-style: italic">## $par</span>
<span style="color: #408080; font-style: italic">## [1]  1.0399 -0.2964 10.3637 21.2139</span>
<span style="color: #408080; font-style: italic">## </span>
<span style="color: #408080; font-style: italic">## $value</span>
<span style="color: #408080; font-style: italic">## [1] -73.81</span>
<span style="color: #408080; font-style: italic">## </span>
<span style="color: #408080; font-style: italic">## $counts</span>
<span style="color: #408080; font-style: italic">## function gradient </span>
<span style="color: #408080; font-style: italic">##      431       NA </span>
<span style="color: #408080; font-style: italic">## </span>
<span style="color: #408080; font-style: italic">## $convergence</span>
<span style="color: #408080; font-style: italic">## [1] 0</span>
<span style="color: #408080; font-style: italic">## </span>
<span style="color: #408080; font-style: italic">## $message</span>
<span style="color: #408080; font-style: italic">## NULL</span>
</pre></div>

As the convergence code returned 0, it hopefully worked fine (a 1 indicates non-convergence). The control = list(fnscale = -1) argument is just there to make optim() do maximum likelihood estimation rather than minimum likelihood estimation (which must surely be the worst estimation method ever).
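If you also want standard errors, a standard trick (not from the post) is to minimize the negative log-likelihood instead, which sidesteps the sign conventions around fnscale, and then invert the Hessian at the optimum:

```r
# Minimize the negative log-likelihood and ask optim() for the Hessian;
# its inverse approximates the variance-covariance matrix of the MLEs.
neg_log_like_fn <- function(par, d) -log_like_fn(par, d)
fit <- optim(par = inits, fn = neg_log_like_fn, d = d, hessian = TRUE)
sqrt(diag(solve(fit$hessian)))  # approximate standard errors
```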

Stan’s optimizing()

Stan is a stand-alone program that plays well with R and that allows you to specify a model in Stan's language, which compiles down to very efficient C++ code. Stan was originally built for doing Hamiltonian Monte Carlo, but it now also includes an optimizing() function that, like R's optim(), allows you to do maximum likelihood estimation (or maximum a posteriori estimation, if you explicitly include priors in the model definition). Here we need to do a fair bit of work before we can fit a linear regression, but what we gain is extreme flexibility in extending the model, should we need to. We have come a long way from lm()…

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%">library(rstan)
</pre></div>

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #408080; font-style: italic">## Loading required package: inline</span>
<span style="color: #408080; font-style: italic">## </span>
<span style="color: #408080; font-style: italic">## Attaching package: 'inline'</span>
<span style="color: #408080; font-style: italic">## </span>
<span style="color: #408080; font-style: italic">## The following object is masked from 'package:Rcpp':</span>
<span style="color: #408080; font-style: italic">## </span>
<span style="color: #408080; font-style: italic">##     registerPlugin</span>
<span style="color: #408080; font-style: italic">## </span>
<span style="color: #408080; font-style: italic">## rstan (Version 2.6.0, packaged: 2015-02-06 21:02:34 UTC, GitRev: 198082f07a60)</span>
<span style="color: #408080; font-style: italic">## </span>
<span style="color: #408080; font-style: italic">## Attaching package: 'rstan'</span>
<span style="color: #408080; font-style: italic">## </span>
<span style="color: #408080; font-style: italic">## The following object is masked from 'package:arm':</span>
<span style="color: #408080; font-style: italic">## </span>
<span style="color: #408080; font-style: italic">##     traceplot</span>
</pre></div>

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%">model_string <span style="color: #666666"><-</span> <span style="color: #BA2121">"</span>
<span style="color: #BA2121">data {</span>
<span style="color: #BA2121">  int n;</span>
<span style="color: #BA2121">  vector[n] y;</span>
<span style="color: #BA2121">  vector[n] x1;</span>
<span style="color: #BA2121">  vector[n] x2;</span>
<span style="color: #BA2121">}</span>

<span style="color: #BA2121">parameters {</span>
<span style="color: #BA2121">  real intercept;</span>
<span style="color: #BA2121">  real beta1;</span>
<span style="color: #BA2121">  real beta2;</span>
<span style="color: #BA2121">  real<lower=0> sigma;</span>
<span style="color: #BA2121">}</span>

<span style="color: #BA2121">model {</span>
<span style="color: #BA2121">  vector[n] mu;</span>
<span style="color: #BA2121">  mu <- intercept + x1 * beta1 + x2 * beta2;</span>
<span style="color: #BA2121">  y ~ normal(mu, sigma);</span>
<span style="color: #BA2121">}</span>
<span style="color: #BA2121">"</span>

data_list <span style="color: #666666"><-</span> list(n <span style="color: #666666">=</span> nrow(d), y <span style="color: #666666">=</span> d<span style="color: #666666">$</span>y, x1 <span style="color: #666666">=</span> d<span style="color: #666666">$</span>x1, x2 <span style="color: #666666">=</span> d<span style="color: #666666">$</span>x2)
model <span style="color: #666666"><-</span> stan_model(model_code <span style="color: #666666">=</span> model_string)
fit <span style="color: #666666"><-</span> optimizing(model, data_list)
fit
</pre></div>

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #408080; font-style: italic">## $par</span>
<span style="color: #408080; font-style: italic">## intercept     beta1     beta2     sigma </span>
<span style="color: #408080; font-style: italic">##   -0.2929   10.3642   21.2248    2.8331 </span>
<span style="color: #408080; font-style: italic">## </span>
<span style="color: #408080; font-style: italic">## $value</span>
<span style="color: #408080; font-style: italic">## [1] -46.24</span>
</pre></div>
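And since the model is already compiled, the full posterior, rather than a point estimate, is one call away (standard rstan, not part of the timings below):

```r
# Hamiltonian Monte Carlo with the same compiled model and data.
posterior <- sampling(model, data = data_list)
posterior
```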

An Utterly Meaningless Speed Comparison

So, just for fun, here is the speed comparison, first for running a linear regression with 1000 data points and 5 predictors:

[Plot: timings for a linear regression with 1000 data points and 5 predictors]

This should be taken with a huge heap of salt (which is not too good for your health!). While all these methods produce results equivalent to a linear regression, they do it in different ways, and not necessarily in equally good ways; for example, my homemade optim() routine does not converge correctly when fitting a model with too many predictors. As I have used the default settings throughout, there are surely many ways in which each of these methods could be made faster. Anyway, here is what happens if we vary the number of predictors and the number of data points:

[Plot: timings as the number of predictors and the number of data points vary]

To make these speed comparisons I used the microbenchmark package; the full script replicating the plots above can be found here. This speed comparison was made on my laptop running R version 3.1.2 on 32-bit Ubuntu 12.04, with an average amount of RAM and a processor that is starting to get a bit tired.
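For reference, a minimal sketch of this kind of benchmark (times = 100 is an assumption; see the linked script for the actual setup):

```r
library(microbenchmark)
microbenchmark(
  lm  = lm(y ~ 1 + x1 + x2, data = d),
  glm = glm(y ~ 1 + x1 + x2, data = d),
  times = 100  # number of runs per expression
)
```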
