Bagging, the perfect solution for model instability

Posted on May 22, 2017 by insightr in R bloggers | 0 Comments

[This article was first published on R – insightR, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

By Gabriel Vasconcelos

Motivation

The name bagging comes from boostrap aggregating. It is a machine learning technique proposed by Breiman (1996) to increase stability in potentially unstable estimators. For example, suppose you want to run a regression with a few variables in two steps. First, you run the regression with all the variables in your data and select the significant ones. Second, you run a new regression using only the selected variables and compute the predictions.

This procedure is not wrong if your problem is forecasting. However, this two step estimation may result in highly unstable models. If many variables are important but individually their importance is small, you will probably leave some of them out, and small perturbations on the data may drastically change the results.

Bagging

The Bagging solves instability problem by giving each variable several chances to be in the model. The same procedure described in the previous section is repeated in many bootstrap samples and the final forecast will be the average forecast across all these samples.

In this post I will show an example of Bagging with regression in the selection context described above. But you should keep in mind that this technique can be used for any unstable model. The most successful case is to use Bagging on regression trees, resulting in the famous random forest. The Bagging is also very robust to over fitting and produces forecasts with less variance.

Application

I am going to show an application of Bagging from the scratch using the boot package for the bootstrap replications. We start by generating some data with the following dgp:

$y_i = \sum_{j=1}^K \beta_j x_{i,j} + \varepsilon_i,~~~~~~i=1,\dots,N; ~j=1,\dots,K$
$x_{i,j} = A_j f_i+\epsilon_{i,j}~~~~~~i=1,\dots,N; ~j=1,\dots,K$

The first equation shows how the variable we want to predict, $y_i$ , is generated. The second equation shows that the independent variables are generated by a common factor, which will induce a correlation structure. Since we need instability, the data will be generated to have an $R^2$ around $0.33$ and $K=40$ in order to have many variables with small individual importance.

N = 300 # = Number of obs. = #
K = 40 # = Number of Variables = #
set.seed(123) # = Seed for replication = #
f = rnorm(N) # = Common factor = #
A = runif(K,-1,1) # = Coefficients from the second equation = #
X = t(A%*%t(f)) + matrix(rnorm(N*K), N, K) # = Generate xij = #
beta = c((-1)^(1:(K-10)) ,rep(0, 10)) # = Coefficients from the first equation, the first 30 are equal 1 or -1 and the last 10 are 0 = #

# = R2 setup = #
aux = var(X%*%beta)
erro = rnorm(N, 0, sqrt(2*aux)) # = The variance of the error will be twice the variance of X%*%beta = #

y = X %*% beta+erro # = Generate y = #
1-var(erro)/var(y) # = R2 = #

##           [,1]
## [1,] 0.2925866

# = Break data into in-sample and out-of-sample = #
y.in = y[1:(N-100)]
X.in = X[1:(N-100),]
y.out = y[-c(1:(N-100))]
X.out = X[-c(1:(N-100)),]

Now we must define a function to be called by the boostrap. This function must receive the data and an argument for indexes that will tell R where the sampling must be made. This function will run a linear regression with all variables, select those with t-statistics bigger than $1.96$ , run a new regression with the selected variables and store the coefficients.

bagg = function(data,ind){
  sample = data[ind,]
  y = sample[, 1]
  X = sample[, 2:ncol(data)]
  model1 = lm(y ~ X)
  t.stat = summary(model1)$coefficients[-1, 3]
  selected = which(abs(t.stat)>=1.96)
  model2 = lm(y ~ X[, selected])
  coefficients = coef(model2)
  cons = coefficients[1]
  betas = coefficients[-1]
  aux = rep(0, ncol(X))
  aux[selected] = betas
  res = c(cons, aux)
  return(res)
}

Now we are ready to use the boot function to run the Bagging. The code below does precisely that with $500$ boostrap replications. It also runs the simple OLS procedure for us to compare. If you type selected after running this code you will see that many variables were left out and some variables with $\beta=0$ were selected.

library(boot)
# = Bagging = #
bagging = boot(data = cbind(y.in, X.in), statistic=bagg, R=500)
bagg.coef = bagging$t
y.pred.bagg = rowMeans(cbind(1, X.out)%*%t(bagg.coef))

# = OLS = #
ols0 = lm(y.in ~ X.in)
selected = which(abs(summary(ols0)$coefficients[-1, 3])>=1.96)
ols1 = lm(y.in ~ X.in[, selected])
y.pred = cbind(1, X.out[, selected])%*%coef(ols1)

# = Forecasting RMSE = #
sqrt(mean((y.pred-y.out)^2)) # = OLS = #

## [1] 10.43338

sqrt(mean((y.pred.bagg-y.out)^2)) # = Bagging = #

## [1] 9.725554

The results showed that the Bagging has a RMSE $7\%$ smaller than the OLS. It is also interesting to see what happens to the coefficients. The plot below shows that the bagging performs some shrinkage on the coefficients towards zero.

barplot(coef(ols0)[-1])
barplot(colMeans(bagg.coef)[-1], add=TRUE, col="red")

Finally, what would happen if we had no instability? If you run the same code again changing the $R^2$ to around $99\%$ you will find the plot below. To change the $R^2$ to $99\%$ just replace $2$ by $0.01$ in the code erro = rnorm(N, 0, sqrt(2*aux)) in the dgp. As you can see, the bagging and the simple procedure are the same in the absence of instability.