How long could it take to run a regression

April 6, 2016
By

(This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers)

This afternoon, while I was discussing with Montserrat (aka @mguillen_estany) we were wondering how long it might take to run a regression model. More specifically, how long it might take if we use a Bayesian approach. My guess was that the time should probably be linear in , the number of observations. But I thought I would be good to check.

Let us generate a big dataset, with one million rows,

> n=1e6
> X=runif(n)
> Y=2+5*X+rnorm(n)
> B=data.frame(X,Y)

Consider as a benchmark the standard linear regression,

> lm_freq = function(n){
+   idx = sample(1:1e6,size=n)
+   reg = lm(Y~X,data=B[idx,])
+   summary(reg)
+ }

Here the regression is a subset of smaller size. We can do the same with a Bayesian approach, using stan,

> stan_lm ="
+ data {
+ int N;
+ vector[N] x;
+ vector[N] y;
+ }
+ parameters {
+ real alpha;
+ real beta;
+ real tau;
+ }
+ transformed parameters {
+ real sigma;
+ sigma <- 1 / sqrt(tau);
+ }
+ model{
+ y ~ normal(alpha + beta * x, sigma);
+ alpha ~ normal(0, 10);
+ beta ~ normal(0, 10);
+ tau ~ gamma(0.001, 0.001);
+ }
+ "

Define then the model

> library(rstan)
> system.time( 
  stanmodel <<- stan_model(model_code = stan_lm))
utilisateur     système      écoulé 
      0.043       0.000       0.043

We want to see how long it might take to run a regression,

> lm_bayes = function(n){
+   idx = sample(1:1e6,size=n)
+   fit = sampling(stanmodel,
+       data = list(N=n,
+                   x=X[idx],
+                   y=Y[idx]),
+       iter = 1000, warmup=200)
+   summary(fit)
+ }

We use the following package to see how long it takes

> library(microbenchmark)
> time_lm = function(n){
+  M = microbenchmark(lm_freq(n),
+      lm_bayes(n),times=50)
+  return(apply( matrix(M$time,nrow=2),1,mean))
+ }

We can now compare the time it took with ten, one hundred, on thousand, and ten thousand observations,

> vN = c(10,100,1000,10000)
> T = Vectorize(time_lm)(vN)

we can then plot it

> plot(vN,T[2,]/1e6,log="xy",col="red",type="b",
+      xlab="Number of Observations",ylab="Time")
> lines(vN,T[1,]/1e6,col="blue",type="b")

It looks like (if we forget about the very small sample) that the time it takes to run a regression is linear, with the two techniques (the frequentist and the Bayesian ones).

And actually, the same story olds for logistic regressions. Consider the following dataset

> n=1e6
> X=runif(n)
> S=-3+2*X+rnorm(n)
> Y=rbinom(n,size=1,prob=exp(S)/(1+exp(S)))
> B=data.frame(X,Y)

The frequentist version of the logistic regression is

> glm_freq = function(n){
+   idx = sample(1:1e6,size=n)
+   reg = glm(Y~X,data=B[idx,],family=binomial)
+   summary(reg)
+ }

and the Bayesian one, using stan,

> stan_glm = "
+ data {
+ int N;
+ vector[N] x;
+ int y[N];
+ }
+ parameters {
+ real alpha;
+ real beta;
+ }
+ model {
+ alpha ~ normal(0, 10);
+ beta ~ normal(0, 10);
+ y ~ bernoulli_logit(alpha + beta * x);
+ }
+ "
> stanmodel = stan_model(model_code = stan_glm) )
> glm_bayes = function(n){
+   idx = sample(1:1e6,size=n)
+   fit = sampling(stanmodel,
+        data = list(N=n,
+        x = X[idx],
+        y = Y[idx]),
+        iter = 1000, warmup=200)
+   summary(fit)
+ }

Again, we can see how long it takes to run those regression models

> time_gl m= function(n){
+   M = microbenchmark(glm_freq(n),
+   glm_bayes(n),times=50)
+   return(apply( matrix(M$time,nrow=2),1,mean))
+ }

 

To leave a comment for the author, please follow the link and comment on their blog: R-english – Freakonometrics.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)