# Complete Subset Regressions, simple and powerful

May 31, 2017

(This article was first published on R – insightR, and kindly contributed to R-bloggers)

By Gabriel Vasconcelos

Complete subset regressions (CSR) is a forecasting method proposed by Elliott, Gargano and Timmermann in 2013. It is a very simple but powerful technique. Suppose you have a set of variables and you want to forecast one of them using information from the others. If your variables are highly correlated and the variable you want to predict is noisy, you will have collinearity problems and in-sample overfitting, because the model will try to fit the noise.

These problems may be solved by estimating a smaller model that uses only a subset of the explanatory variables. However, if you do not know which variables are important, you may lose information. What if we estimate models from many different subsets and combine their forecasts? Even better, what if we estimate models for all possible combinations of variables? This would be a good solution, but with only 20 variables the number of regressions would already be more than 1 million ($2^{20} = 1{,}048{,}576$ subsets). The CSR is a compromise between using only one subset and using all possible subsets. Instead of estimating every possible combination of variables, we fix a value $k$ and estimate all possible models with exactly $k$ variables. Then we average the forecasts from all these models to get our final result. Naturally, $k$ must be smaller than the number of variables you have. Let us review the steps:

• Suppose you have $K$ explanatory variables. Estimate all possible models with $k$ variables,
• Compute the forecasts for all the models,
• Compute the average of all the forecasts to have the final result.
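The three steps above can be sketched as a small reusable function. This is only an illustration under stated assumptions: `csr_forecast` and its argument names are hypothetical (they do not come from the original paper or from any package), each model is plain OLS with an intercept, and the forecasts are combined with equal weights.

```r
# Minimal CSR sketch (hypothetical helper, not from the original post):
# average the forecasts of every OLS model that uses exactly k of the K regressors.
csr_forecast <- function(y, X, X_new, k) {
  K <- ncol(X)
  stopifnot(k >= 1, k <= K, ncol(X_new) == K)
  subsets <- combn(K, k)                 # each column holds one subset of k indices
  total <- numeric(nrow(X_new))
  for (i in seq_len(ncol(subsets))) {
    idx <- subsets[, i]
    # Step 1: estimate the model with this subset (intercept added by hand)
    fit <- lm.fit(cbind(1, X[, idx, drop = FALSE]), y)
    # Step 2: compute its forecast and accumulate it
    total <- total + drop(cbind(1, X_new[, idx, drop = FALSE]) %*% fit$coefficients)
  }
  total / ncol(subsets)                  # Step 3: equal-weight average
}

# Usage on simulated data (the data-generating values are arbitrary assumptions)
set.seed(42)
X <- matrix(rnorm(100 * 5), 100, 5)
y <- drop(X %*% c(0.5, -0.3, 0, 0, 0.2)) + rnorm(100)
f <- csr_forecast(y[1:80], X[1:80, ], X[81:100, ], k = 2)
f_full <- csr_forecast(y[1:80], X[1:80, ], X[81:100, ], k = 5)
```

With `k` equal to the number of columns there is only one subset, so `f_full` is simply the forecast of the full OLS model, which makes a convenient sanity check.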

## Application

We are going to generate data from a linear model where the explanatory variables, $X$, are drawn from a multivariate normal distribution. The coefficients are generated from a normal distribution and multiplied by a parameter $\alpha$ to ensure the dependent variable $y$ has a significant random component. The value for $K$ is 10 and the CSR is estimated with $k=4$.
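Written out, the simulation design in the code that follows is, as far as it can be read from that code (the symbols $P$, $D$, $z$ and $u_t$ are notation introduced here for illustration):

$$
\begin{aligned}
X_t &\sim N(0, \Sigma), \qquad \Sigma = P' D P, \qquad D = 0.1\, I_K, \\
\beta &= \alpha z, \qquad z \sim N(0, I_K), \qquad \alpha = 0.1, \\
y_t &= X_t' \beta + u_t, \qquad u_t \sim N(0, 1),
\end{aligned}
$$

where $P$ is a $K \times K$ matrix of independent standard normal draws. Because $\alpha$ shrinks the coefficients while the error variance stays at 1, most of the variation in $y$ is noise.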

set.seed(1) # = Seed for replication = #
K = 10 # = Number of Variables = #
T = 300 # = Number of Observations = #

# = Generate covariance matrix = #
library(mvtnorm)
D = diag(0.1, K)
P = matrix(rnorm(K * K), K, K)
sigma = t(P)%*%D%*%P

alpha = 0.1
beta = rnorm(K) * alpha # = coefficients = #
X = rmvnorm(T, sigma = sigma) # = Explanatory Variables = #
u = rnorm(T) # = Error = #
y = X%*%beta + u # = Generate y = #

# = Break data into in-sample and out-of-sample = #
y.in = y[1:200]
y.out = y[-c(1:200)]
X.in = X[1:200, ]
X.out = X[-c(1:200), ]

# = Estimate model by OLS to compare = #
model = lm(y.in ~ X.in)
pred = cbind(1, X.out)%*%coef(model)

# = CSR = #
k = 4
models = combn(K, k) # = Gives all combinations with 4 variables = #
csr.fitted = rep(0, length(y.in)) # = Store in-sample fitted values = #
csr.pred = rep(0, length(y.out)) # = Store forecast = #
for(i in 1:ncol(models)){
  m = lm(y.in ~ X.in[ ,models[ ,i]])
  csr.fitted = csr.fitted + fitted(m)
  csr.pred = csr.pred + cbind(1, X.out[ ,models[ ,i]])%*%coef(m)
}

R2ols = 1 - var(y.in - fitted(model))/var(y.in) # = R2 OLS = #
R2csr = 1 - var(y.in - csr.fitted/ncol(models))/var(y.in) # = R2 CSR = #
c(R2ols, R2csr)

## [1] 0.1815733 0.1461342

# = In-sample fit = #
plot(y.in, type="l")
lines(fitted(model), col=2)
lines(csr.fitted/ncol(models), col=4)


# = Out-of-sample fit = #
plot(y.out, type="l")
lines(pred, col = 2)
lines(csr.pred/ncol(models), col = 4)


# = MAE (mean absolute error) = #
MAEols=mean(abs(y.out - pred))
MAEcsr=mean(abs(y.out - csr.pred/ncol(models)))
c(MAEols, MAEcsr)

## [1] 0.8820019 0.8446682


The main conclusion from the results is that the CSR gives up some in-sample performance (an $R^2$ of 0.146 against 0.182 for OLS) to improve the forecasts (an out-of-sample forecast error of 0.845 against 0.882). In fact, the CSR is very robust to overfitting and you should definitely add it to your collection to use when you believe that you are having this type of problem. Most modern forecasting techniques share the same idea of accepting some bias in-sample to obtain a more parsimonious or stable model out-of-sample.
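The example fixes $k=4$, but the same trade-off can be inspected for every value of $k$. The following is a rough sketch, not part of the original analysis: it simulates its own small data set (all data-generating values are arbitrary assumptions), runs the CSR for each $k$ from 1 to $K$, and records the out-of-sample mean absolute error. The names `mae_by_k` and `in_idx` are made up for this illustration.

```r
# Sketch: out-of-sample mean absolute error of the CSR for every subset size k.
# The simulated data below are an assumption made only for this illustration.
set.seed(1)
K <- 6
n <- 300
X <- matrix(rnorm(n * K), n, K) %*% matrix(rnorm(K * K), K, K)  # correlated regressors
beta <- rnorm(K) * 0.1                                          # small coefficients
y <- drop(X %*% beta) + rnorm(n)                                # noisy dependent variable
in_idx <- 1:200                                                 # in-sample rows
n_out <- n - length(in_idx)

mae_by_k <- sapply(1:K, function(k) {
  subsets <- combn(K, k)
  # One column of forecasts per subset of k regressors
  preds <- vapply(seq_len(ncol(subsets)), function(i) {
    idx <- subsets[, i]
    fit <- lm.fit(cbind(1, X[in_idx, idx, drop = FALSE]), y[in_idx])
    drop(cbind(1, X[-in_idx, idx, drop = FALSE]) %*% fit$coefficients)
  }, numeric(n_out))
  mean(abs(y[-in_idx] - rowMeans(preds)))   # average the forecasts, then score them
})
mae_by_k   # the last entry (k = K) is the full OLS model
```

Which $k$ comes out best depends on the data, so in practice the comparison would have to be made on a validation sample rather than on the final test set.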

This is the simplest implementation possible for the CSR. In some cases you may have fixed controls that you want to include in all regressions, such as lags of the dependent variable. The computational cost of the CSR grows quickly with the number of variables, because the number of models is the number of ways to choose $k$ variables out of $K$. If you are in a high-dimensional framework you may need to do some type of pre-selection of the variables to reduce the problem’s size.
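Both remarks can be made concrete. The sketch below first counts the regressions with `choose()`, then shows one possible way to keep a block of fixed controls in every regression. The function `csr_forecast_fixed` and its arguments `Z` and `Z_new` are hypothetical names chosen for this example, and the simulated data are arbitrary.

```r
# How many regressions does the CSR need? choose(K, k) for a given k.
K <- 20
n_all <- sum(choose(K, 0:K))   # every subset, empty one included: 2^20 = 1048576
n_k4  <- choose(K, 4)          # CSR with k = 4: 4845 models
n_k10 <- choose(K, 10)         # largest count for K = 20: 184756 models

# CSR with fixed controls (hypothetical helper): the columns of Z enter every
# regression, and only the columns of X are rotated through subsets of size k.
csr_forecast_fixed <- function(y, X, Z, X_new, Z_new, k) {
  subsets <- combn(ncol(X), k)
  preds <- vapply(seq_len(ncol(subsets)), function(i) {
    idx <- subsets[, i]
    fit <- lm.fit(cbind(1, Z, X[, idx, drop = FALSE]), y)
    drop(cbind(1, Z_new, X_new[, idx, drop = FALSE]) %*% fit$coefficients)
  }, numeric(nrow(X_new)))
  preds <- matrix(preds, nrow = nrow(X_new))  # keep a matrix even with one forecast row
  rowMeans(preds)                             # equal-weight average across subsets
}

# Usage on simulated data (values are assumptions for illustration only)
set.seed(7)
X <- matrix(rnorm(120 * 5), 120, 5)
Z <- matrix(rnorm(120 * 2), 120, 2)           # stand-in for controls such as lags of y
y <- drop(X %*% rep(0.1, 5)) + drop(Z %*% c(0.5, -0.5)) + rnorm(120)
f_fixed <- csr_forecast_fixed(y[1:100], X[1:100, ], Z[1:100, ],
                              X[101:120, ], Z[101:120, ], k = 2)
```

Fixed controls do not change the number of regressions, which is still driven by the number of columns in `X` and by `k`.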

## References

Elliott, Graham, Antonio Gargano, and Allan Timmermann. “Complete subset regressions.” Journal of Econometrics 177.2 (2013): 357-373.
