# Speeding up model bootstrapping in GNU R

December 2, 2013
By

(This article was first published on R snippets, and kindly contributed to R-bloggers)

After my last post I have recurringly received two questions: (a) is it worthwhile to analyze GNU R speed in simulations and (b) how would simulation speed compare between GNU R and Python. In this post I want to address the former question and next time I am going to tackle the latter.

An area in which I use simulation in GNU R on a daily basis is bootstrapping. Therefore I have decided to check out how much speedup one can expect with tuning of standard model estimation procedures.

I started with linear regression: a simple model with two explanatory variables and 100 observations. I wanted to compare speed of generation of 10000 bootstrap replications using standard lm function and bare matrix calculus formula for b solving the equation: X'Xb=X'y. Here is the simulation:

set.seed(1)
n <- 100
x1 <- runif(n)
x2 <- runif(n)
y <- x1 + x2 + 1 + rexp(n) # add non-normal error distribution
data.m <- cbind("(Intercept)" = 1, x1, x2, y)
data.f <- data.frame(x1, x2, y)

boot.bare <- function() {
dm <- data.m[sample.int(n, replace = T),]
X <- dm[, 1:3]
tX <- t(X)
y <- dm[, 4]
solve(tX %*% X, tX %*% y)
}

boot.lm <- function() {
lm(y~., data=data.f[sample.int(n, replace = T),])$coef } set.seed(2) system.time(rb <- replicate(10000, boot.bare())[, 1,]) # 0.73 seconds set.seed(2) system.time(rl <- replicate(10000, boot.lm())) # 16.00 seconds # check if both procedures produce the same results range(rb - rl) # OK difference ranges from -5.373479e-14 to 4.751755e-14 Both formulas yield practically identical results but using lm produces over 20 times slower execution. Of course lm does much more work than simple calculation of parameter estimates - boot for bootstrapping I need only the parameters. So next I thought to check out a more complex model. Therefore I switched to logistic regression. I compared glm estimation with two alternatives: brute-force optimization of log-likelihood using nlm and direct computation of estimates using iteratively reweighted least squares (IRLS, the method actually used by glm internally). Here is the code: set.seed(1) n <- 100 x1 <- runif(n) x2 <- runif(n) y <- x1 + x2 + runif(n) < 1.5 # add non-logistic error distribution data.m <- cbind(1, x1, x2, y) data.f <- data.frame(x1, x2, y) boot.nlm <- function() { llik <- function(a) { prob <- 1 / (1 + exp(-X %*% a)) prob[y == F] <- 1 - prob[y == F] -sum(log(prob)) } dm <- data.m[sample.int(n, replace = T),] X <- dm[, 1:3] tX <- t(X) y <- dm[, 4] nlm(llik, c(1, 2, 3))$estimate
}

boot.irls <- function() {
dm <- data.m[sample.int(n, replace = T),]
X <- dm[, 1:3]
tX <- t(X)
y <- dm[, 4]
b <- rep(0, 3)
while(TRUE) {
prob <- 1 / (1 + exp(-X %*% b))
pp <- as.vector(prob * (1 - prob))
pp <- rbind(pp, pp, pp)
grad <- tX %*% (y - prob)
b <- b + solve((tX * pp) %*% X, grad)
return(b)
}
}
}

boot.glm <- function() {
x <- data.f[sample.int(n, replace = T),]
glm(y~., data=x, family = "binomial")\$coef
}

set.seed(2)
system.time(rn <- replicate(10000, boot.nlm()))
# 35.48 seconds
# warnings related to procedure convergence produced

set.seed(2)
system.time(ri <- replicate(10000, boot.irls())[, 1,])
# 5.49 seconds

set.seed(2)
system.time(rg <- replicate(10000, boot.glm()))
# 34.68 seconds

# check if all procedures produce the same results
range(rn - ri) # -0.0002572192  0.0002470589
range(rn - rg) # -0.0002572191  0.0002470588
range(ri - rg)# -4.196647e-07  3.602730e-07

We can see that using nlm is the slowest and produces unstable results and that IRLS and glm give identical results but IRLS is 6 times faster.

What are the conclusions? If you really need it is is possible to substantially reduce computational time even for "core" functions.
However, simplest means are not always guaranteed to be efficient (like nlm in logistic regression) and there is a question if the speedup pays of, as there is a substantial additional development effort needed.