A Speed Comparison Between Flexible Linear Regression Alternatives in R

[This article was first published on Publishable Stuff, and kindly contributed to R-bloggers.]

Everybody loves speed comparisons! Is R faster than Python? Is dplyr faster than data.table? Is Stan faster than JAGS? It has been said that speed comparisons are utterly meaningless, and in general I agree, especially when you are comparing apples and oranges, which is what I'm going to do here. I'm going to compare a couple of alternatives to lm() that can be used to run linear regressions in R, but that are more general than lm(). One reason for doing this was to see how much performance you'd lose if you used one of these tools to run a linear regression (even if you could have used lm()). But as speed comparisons are utterly meaningless, my main reason for blogging about this is just to highlight a couple of tools you can use when you've grown out of lm(). The speed comparison was just to lure you in. Let's run!

The Contenders

Below are the seven different methods that I'm going to compare by using each of them to run the same linear regression. If you are only interested in the speed comparison, just scroll to the bottom of the post. And if you are actually interested in running standard linear regressions as fast as possible in R, then Dirk Eddelbuettel has a nice post that covers just that.

lm()

This is the baseline, the “default” method for running linear regressions in R. If we have a data.frame d with the following layout:

head(d)
##         y      x1      x2
## 1 -64.579 -1.8088 -1.9685
## 2 -19.907 -1.3988 -0.2482
## 3  -4.971  0.8366 -0.5930
## 4  19.425  1.3621  0.4180
## 5  -1.124 -0.7355  0.4770
## 6 -12.123 -0.9050 -0.1259

Then this would run a linear regression with y as the outcome variable and x1 and x2 as the predictors:

lm(y ~ 1 + x1 + x2, data=d)
## 
## Call:
## lm(formula = y ~ 1 + x1 + x2, data = d)
## 
## Coefficients:
## (Intercept)           x1           x2
##      -0.293       10.364       21.225

glm()

This is a generalization of lm() that allows you to assume a number of different distributions for the outcome variable, not just the normal distribution that you are stuck with when using lm(). However, if you don't specify any distribution, glm() will default to a normal distribution and will produce output identical to that of lm():

glm(y ~ 1 + x1 + x2, data=d)
## 
## Call:  glm(formula = y ~ 1 + x1 + x2, data = d)
## 
## Coefficients:
## (Intercept)           x1           x2
##      -0.293       10.364       21.225
## 
## Degrees of Freedom: 29 Total (i.e. Null);  27 Residual
## Null Deviance:     13200
## Residual Deviance: 241   AIC: 156
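
Just to illustrate that generality, here is a sketch of what a non-normal model could look like. The data frame counts_d and its count outcome n_events are made up purely for illustration; the point is the family argument, which is what selects the outcome distribution:

# Hypothetical Poisson regression on a made-up data frame counts_d
# with a count outcome n_events.
glm(n_events ~ 1 + x1 + x2, family = poisson(link = "log"), data = counts_d)

# The default family is gaussian(), so spelling it out gives the same
# fit as the glm() call above:
glm(y ~ 1 + x1 + x2, family = gaussian(), data = d)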

bayesglm()

Found in the arm package, this is a modification of glm() that allows you to assume custom prior distributions over the coefficients (instead of the implicit flat priors of glm()). This can be super useful, for example, when you have to deal with perfect separation in logistic regression or when you want to include prior information in the analysis. Despite the "bayes" in the function name, note that bayesglm() does not give you the whole posterior distribution, only point estimates. This is how to run a linear regression with flat priors, which should give results similar to those of lm():

library(arm)
bayesglm(y ~ 1 + x1 + x2, data = d, prior.scale=Inf, prior.df=Inf)
## 
## Call:  bayesglm(formula = y ~ 1 + x1 + x2, data = d, prior.scale = Inf, 
##     prior.df = Inf)
## 
## Coefficients:
## (Intercept)           x1           x2
##      -0.293       10.364       21.225
## 
## Degrees of Freedom: 29 Total (i.e. Null);  30 Residual
## Null Deviance:     13200
## Residual Deviance: 241   AIC: 156
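
As a sketch of what the priors buy you, here is the same regression with weakly informative t priors on the coefficients. The prior.mean, prior.scale, and prior.df arguments are part of bayesglm(), but the particular values below are just picked for illustration:

# Independent t priors on the coefficients: mean 0, scale 2.5,
# and 7 degrees of freedom (values chosen only for illustration).
bayesglm(y ~ 1 + x1 + x2, data = d,
         prior.mean = 0, prior.scale = 2.5, prior.df = 7)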

nls()

While lm() can only fit linear models, nls() can also be used to fit non-linear models by least squares. For example, you could fit a sine curve to a data set with a call like nls(y ~ par1 + par2 * sin(par3 + par4 * x)). Notice that the syntax is a little bit different from lm(), as you have to write out both the variables and the parameters. Here is how to run the linear regression:

nls(y ~ intercept + x1 * beta1 + x2 * beta2, data = d)
## Nonlinear regression model
##   model: y ~ intercept + x1 * beta1 + x2 * beta2
##    data: d
## intercept     beta1     beta2 
##    -0.293    10.364    21.225 
##  residual sum-of-squares: 241
## 
## Number of iterations to convergence: 1 
## Achieved convergence tolerance: 3.05e-08
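
And returning to the sine-curve call mentioned above, here is a rough sketch on simulated data. Both the data and the starting values are made up, and for a genuinely non-linear model like this nls() needs starting values reasonably close to the truth, otherwise it can easily fail to converge:

# Simulated data for the sine-curve example (true parameters 2, 3, 1 and 0.8).
x <- seq(0, 10, length.out = 100)
sine_d <- data.frame(x = x, y = 2 + 3 * sin(1 + 0.8 * x) + rnorm(100, sd = 0.5))
# Starting values deliberately chosen close to the truth; poor starting
# values can easily make this fit blow up.
nls(y ~ par1 + par2 * sin(par3 + par4 * x), data = sine_d,
    start = list(par1 = 1, par2 = 1, par3 = 1, par4 = 0.8))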

mle2()

In the bbmle package we find mle2(), a function for general maximum likelihood estimation. While mle2() can be used to maximize a hand-crafted likelihood function, it also has a formula interface that is simple to use but powerful, and that plays nicely with R's built-in distributions. Here is how to roll a linear regression:

library(bbmle)
inits <- list(log_sigma = rnorm(1), intercept = rnorm(1),
              beta1 = rnorm(1), beta2 = rnorm(1))
mle2(y ~ dnorm(mean = intercept + x1 * beta1 + x2 * beta2, sd = exp(log_sigma)),
     start = inits, data = d)
## 
## Call:
## mle2(minuslogl = y ~ dnorm(mean = intercept + x1 * beta1 + x2 * 
##     beta2, sd = exp(log_sigma)), start = inits, data = d)
## 
## Coefficients:
## log_sigma intercept     beta1     beta2 
##    1.0414   -0.2928   10.3641   21.2248 
## 
## Log-likelihood: -73.81

Note that we need to explicitly initialize the parameters before the maximization and that we now also need a parameter for the standard deviation. For an even more versatile use of the formula interface for building statistical models, check out the very cool rethinking package by Richard McElreath.
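
For comparison, here is a sketch of the hand-crafted route mentioned above: mle2() can also be given a function that computes the negative log-likelihood, with one argument per parameter. The function and the starting values below are mine, not from the formula example:

# Negative log-likelihood of the same linear regression, written by hand;
# mle2() minimizes this over the named arguments.
neg_log_lik <- function(intercept, beta1, beta2, log_sigma) {
  -sum(dnorm(d$y, mean = intercept + d$x1 * beta1 + d$x2 * beta2,
             sd = exp(log_sigma), log = TRUE))
}
mle2(neg_log_lik, start = list(intercept = 0, beta1 = 0, beta2 = 0, log_sigma = 0))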

optim()

Of course, if we want to be really versatile, we can craft our own log-likelihood function to be maximized using optim(), also part of base R. This gives us all the options, but there are also more things that can go wrong: we might make mistakes in the model specification, and if the search for the optimal parameters is not initialized well, the model might not converge at all! A linear regression log-likelihood could look like this:

log_like_fn <- function(par, d) {
  sigma <- exp(par[1])
  intercept <- par[2]
  beta1 <- par[3]
  beta2 <- par[4]
  mu <- intercept + d$x1 * beta1 + d$x2 * beta2
  sum(dnorm(d$y, mu, sigma, log=TRUE))
}

inits <- rnorm(4)
optim(par = inits, fn = log_like_fn, control = list(fnscale = -1), d = d)
## $par
## [1]  1.0399 -0.2964 10.3637 21.2139
## 
## $value
## [1] -73.81
## 
## $counts
## function gradient 
##      431       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL

As the convergence value returned was 0, it hopefully worked fine (a 1 indicates non-convergence). The control = list(fnscale = -1) argument is just there to make optim() do maximum likelihood estimation rather than minimum likelihood estimation (which must surely be the worst estimation method ever).

Stan’s optimizing()

Stan is a standalone program that plays well with R and that allows you to specify a model in Stan's language, which will compile down to very efficient C++ code. Stan was originally built for doing Hamiltonian Monte Carlo, but it now also includes an optimizing() function that, like R's optim(), allows you to do maximum likelihood estimation (or maximum a posteriori estimation, if you explicitly include priors in the model definition). Here we need to do a fair bit of work before we can fit a linear regression, but what we gain is extreme flexibility in extending this model, should we need to. We have come a long way from lm().

library(rstan)
## Loading required package: inline
## 
## Attaching package: ‘inline’
## 
## The following object is masked from ‘package:Rcpp’:
## 
##     registerPlugin
## 
## rstan (Version 2.6.0, packaged: 2015-02-06 21:02:34 UTC, GitRev: 198082f07a60)
## 
## Attaching package: ‘rstan’
## 
## The following object is masked from ‘package:arm’:
## 
##     traceplot

model_string <- "
data {
  int n;
  vector[n] y;
  vector[n] x1;
  vector[n] x2;
}

parameters {
  real intercept;
  real beta1;
  real beta2;
  real sigma;
}

model {
  vector[n] mu;
  mu <- intercept + x1 * beta1 + x2 * beta2;
  y ~ normal(mu, sigma);
}
"

data_list <- list(n = nrow(d), y = d$y, x1 = d$x1, x2 = d$x2)
model <- stan_model(model_code = model_string)
fit <- optimizing(model, data_list)
fit
## $par
## intercept     beta1     beta2     sigma 
##   -0.2929   10.3642   21.2248    2.8331 
## 
## $value
## [1] -46.24
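
A nice consequence of having the model written in Stan's language is that, once it is compiled, getting the full posterior instead of a point estimate should just be a matter of swapping optimizing() for sampling(), rstan's Hamiltonian Monte Carlo interface. A minimal sketch, which will of course take longer to run:

# Same compiled model, but sampled with Hamiltonian Monte Carlo instead of
# optimized; this returns draws from the full posterior, not point estimates.
fit_mcmc <- sampling(model, data = data_list)
fit_mcmc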

An Utterly Meaningless Speed Comparison

So, just for fun, here is the speed comparison, first for running a linear regression with 1000 data points and 5 predictors:

[Plot: run times for a linear regression with 1000 data points and 5 predictors]

This should be taken with a huge heap of salt (which is not too good for your health!). While all these methods produce results equivalent to a linear regression, they do it in different ways, and not necessarily in equally good ways; for example, my homemade optim() routine does not converge correctly when trying to fit a model with too many predictors. As I have used the default settings, there are surely many ways in which any of these methods could be made faster. Anyway, here is what happens if we vary the number of predictors and the number of data points:

[Plot: run times when varying the number of predictors and the number of data points]

To make these speed comparisons I used the microbenchmark package; the full script replicating the plots above can be found here. This speed comparison was made on my laptop running R version 3.1.2 on 32-bit Ubuntu 12.04, with an average amount of RAM and a processor that is starting to get a bit tired.
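
The linked script is the one that actually produced the plots, but as a rough sketch of what such a timing looks like, it is something along these lines; the subset of methods and the number of repetitions below are just for illustration:

library(microbenchmark)
# Time a few of the contenders on the same small data set;
# times sets how many repetitions each expression gets.
microbenchmark(
  lm  = lm(y ~ 1 + x1 + x2, data = d),
  glm = glm(y ~ 1 + x1 + x2, data = d),
  nls = nls(y ~ intercept + x1 * beta1 + x2 * beta2, data = d),
  times = 100
)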
