# Speeding up model bootstrapping in GNU R

[This article was first published on

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

After my last post I have recurringly received two questions: (a) is it worthwhile to analyze GNU R speed in simulations and (b) how would simulation speed compare between GNU R and Python. In this post I want to address the former question and next time I am going to tackle the latter.**R snippets**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

An area in which I use simulation in GNU R on a daily basis is bootstrapping. Therefore I have decided to check out how much speedup one can expect with tuning of standard model estimation procedures.

I started with linear regression: a simple model with two explanatory variables and 100 observations. I wanted to compare speed of generation of 10000 bootstrap replications using standard lm function and bare matrix calculus formula for

**b**solving the equation: X’X

**b**=X’y. Here is the simulation:

set.seed

**(**1**)**n

**<-**100x1

**<-**runif**(**n**)**x2

**<-**runif**(**n**)**y

**<-**x1**+**x2**+**1**+**rexp**(**n**)**# add non-normal error distributiondata.m

**<-**cbind**(**“(Intercept)”**=**1, x1, x2, y**)**data.f

**<-**data.frame**(**x1, x2, y**)**boot.bare

**<-****function****()****{** dm

**<-**data.m**[**sample.int**(**n, replace**=**T**)**,**]** X

**<-**dm**[**, 1**:**3**]** tX

**<-**t**(**X**)** y

**<-**dm**[**, 4**]** solve

**(**tX %*% X, tX %*% y**)****}**

boot.lm

**<-****function****()****{** lm

**(**y**~**., data**=**data.f**[**sample.int**(**n, replace**=**T**)**,**])$**coef**}**

set.seed

**(**2**)**system.time

**(**rb**<-**replicate**(**10000, boot.bare**())[**, 1,**])**# 0.73 seconds

set.seed

**(**2**)**system.time

**(**rl**<-**replicate**(**10000, boot.lm**()))**# 16.00 seconds

# check if both procedures produce the same results

range

# OK difference ranges from -5.373479e-14 to 4.751755e-14**(**rb**–**rl**)**Both formulas yield practically identical results but using lm produces over 20 times slower execution. Of course lm does much more work than simple calculation of parameter estimates – boot for bootstrapping I need only the parameters.

So next I thought to check out a more complex model. Therefore I switched to logistic regression.

I compared glm estimation with two alternatives: brute-force optimization of log-likelihood using nlm and direct computation of estimates using iteratively reweighted least squares (IRLS, the method actually used by glm internally). Here is the code:

set.seed

**(**1**)**n

**<-**100x1

**<-**runif**(**n**)**x2

**<-**runif**(**n**)**y

**<-**x1**+**x2**+**runif**(**n**)****<**1.5 # add non-logistic error distributiondata.m

**<-**cbind**(**1, x1, x2, y**)**data.f

**<-**data.frame**(**x1, x2, y**)**boot.nlm

**<-****function****()****{** llik

**<-****function****(**a**)****{** prob

**<-**1**/****(**1**+**exp**(-**X %*% a**))** prob

**[**y**==**F**]****<-**1**–**prob**[**y**==**F**]****–**sum

**(**log

**(**prob

**))**

**}**

dm

**<-**data.m**[**sample.int**(**n, replace**=**T**)**,**]** X

**<-**dm**[**, 1**:**3**]** tX

**<-**t**(**X**)** y

**<-**dm**[**, 4**]** nlm

**(**llik, c**(**1, 2, 3**))$**estimate**}**

boot.irls

**<-****function****()****{** dm

**<-**data.m**[**sample.int**(**n, replace**=**T**)**,**]** X

**<-**dm**[**, 1**:**3**]** tX

**<-**t**(**X**)** y

**<-**dm**[**, 4**]** b

**<-**rep**(**0, 3**)****while**

**(**

**TRUE**

**)**

**{**

prob

**<-**1**/****(**1**+**exp**(-**X %*% b**))** pp

**<-**as.vector**(**prob*******(**1**–**prob**))** pp

**<-**rbind**(**pp, pp, pp**)** grad

**<-**tX %*%**(**y**–**prob**)** b

**<-**b**+**solve**((**tX*****pp**)**%*% X, grad**)****if**

**(**sum

**(**abs

**(**grad

**))**

**<**1e

**–**6

**)**

**{**

return

**(**b**)****}**

**}**

**}**

boot.glm

**<-****function****()****{** x

**<-**data.f**[**sample.int**(**n, replace**=**T**)**,**]** glm

**(**y**~**., data**=**x, family**=**“binomial”**)$**coef**}**

set.seed

**(**2**)**system.time

**(**rn**<-**replicate**(**10000, boot.nlm**()))**# 35.48 seconds

# warnings related to procedure convergence produced

set.seed

**(**2**)**system.time

**(**ri**<-**replicate**(**10000, boot.irls**())[**, 1,**])**# 5.49 seconds

set.seed

**(**2**)**system.time

**(**rg**<-**replicate**(**10000, boot.glm**()))**# 34.68 seconds

# check if all procedures produce the same results

range

**(**rn**–**ri**)**# -0.0002572192 0.0002470589range

range**(**rn**–**rg**)**# -0.0002572191 0.0002470588**(**ri

**–**rg

**)**# -4.196647e-07 3.602730e-07

We can see that using nlm is the slowest and produces unstable results and that IRLS and glm give identical results but IRLS is 6 times faster.

What are the conclusions? If you really need it is is possible to substantially reduce computational time even for “core” functions.

However, simplest means are not always guaranteed to be efficient (like nlm in logistic regression) and there is a question if the speedup pays of, as there is a substantial additional development effort needed.

To

**leave a comment**for the author, please follow the link and comment on their blog:**R snippets**.R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.