Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Finding the best subset of variables for a regression is a very common task in statistics and machine learning. There are statistical methods based on asymptotic normal theory that can help you decide whether to add or remove a variable at a time. The problem with this is that it is a greedy approach and you can easily get stuck in a local mode. Another approach is to look at all possible subsets of the variables and see which one maximizes an objective function (accuracy on a test set, for example).

Heuristics are required if the number of variables, p, gets large (>40) because the number of possible subsets is equal to 2^p, excluding interactions. One method that works well is called tabu search (TS). It is a more general optimization method, but in the case of finding the best subset, it is similar to stepwise methods. At any point, it will try to improve the objective function the most by adding or removing one variable. One difference is that TS will not add or remove variables that have been recently added or removed. This avoids the optimization from getting stuck at a local maximum. There are more details to the algorithm that you can read up on if you would like.

There is a tabuSearch package in R that implements this algorithm. I wanted to find a best subset regression using generalized additive models (GAMs) of the package mgcv. An issue arose, because you need to specify which terms are smooth and which are not in a formula to use mgcv. The general form of the formula looks like:
gam(y ~ s(x1) + s(x2) + x3, data=train)
where x1 and x2 are smooth terms and x3 is treated like it is in a linear model.

My data set has a combination of continuous variables, which I want to treat as smooth, and categorical variables, which I want to treat as factors. For each variable in the subset, I need to identify whether it is a factor or not, and then creating a string of the variable with or without s().

For example, th is a vector of 0’s and 1’s which indicate whether each variable (column) is in the regression. I used the housing data set from UCI ML repository. With 13 explanatory variables, there are 8192 possible main effects models. I am predicting MEDV, so I start the formula string with “MEDV ~”. Then I loop through each element of th. If it is 1 then I want to add it to the formula. I check if it is a factor and if so I just add the name of the variable plus a “+”. If it is continuous, I add the column name with “s()” around it. Finally I convert the string to a formula using formula(). I can plug in this formula into the gam function.

num.cols=sum(th)
fo.str="MEDV ~"
cum.cols=0
for (i in 1:length(th)) {
if (th[i]>0) {
if (is.factor(train[,i])) {
fo.str=paste(fo.str,colnames(train)[i],sep=" ")
} else {
fo.str=paste(fo.str," s(",colnames(train)[i],")",sep="")
}
cum.cols=cum.cols+1
if (cum.cols<num.cols) {
fo.str=paste(fo.str,"+")
}
}
}
fo=as.formula(fo.str)

For evaluating the subsets, I split the data into training, validation, and testing. I trained the subset on the training data set and measured the R-squared on the prediction of the validation set. The full code can be found below.

library(mgcv)
library(tabuSearch)
# http://archive.ics.uci.edu/ml/datasets/Housing
colnames(housing)=c("CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV")

housing$CHAS=as.factor(housing$CHAS)
housing$RAD=as.factor(housing$RAD) # Changed to factor bc only 9 unique values
summary(housing)

set.seed(20120823)
cv=sample(nrow(housing))
train=housing[cv[1:300],]
valid=housing[cv[301:400],]
test=housing[cv[401:nrow(housing)],]

ssto=sum((valid$MEDV-mean(valid$MEDV))^2)
evaluate <- function(th){
num.cols=sum(th)
if (num.cols == 0) return(0)
fo.str="MEDV ~"
cum.cols=0
for (i in 1:length(th)) {
if (th[i]>0) {
if (is.factor(train[,i])) {
fo.str=paste(fo.str,colnames(train)[i],sep=" ")
} else {
fo.str=paste(fo.str," s(",colnames(train)[i],")",sep="")
}
cum.cols=cum.cols+1
if (cum.cols<num.cols) {
fo.str=paste(fo.str,"+")
}
}
}
#   colnames(train)[c(th,0)==1]
fo=as.formula(fo.str)
gam1 <- gam(fo,data=train)
pred1 <- predict(gam1,valid,se=FALSE)
sse <- sum((pred1-valid$MEDV)^2,na.rm=TRUE) return(1-sse/ssto) } res <- tabuSearch(size = ncol(train)-1, iters = 20, objFunc = evaluate, listSize = 5, config = rbinom(ncol(train)-1,1,.5), nRestarts = 4,verbose=TRUE) It was able to find a subset with a 0.8678 R-squared on the validation set (and 0.8349 on the test set). The formula found was: MEDV ~ s(CRIM) + s(INDUS) + s(NOX) + s(RM) + s(DIS) + RAD + s(TAX) + s(PTRATIO) + s(LSTAT). ### Visualizing the results The tabu search gave me a subset which it thinks is best. But I would like to get a better handle on how it derived it, or if there were lots of models with similar quality, but different variables. Or if there were variables that were always in the top performing models. I created a heat plot that showed whether or not a variable was in the regression or not at each iteration. Below you can see which variables were included in each iteration. There are a few variables that seem to be more important because they are included in almost every iteration, like LSAT, PTRATIO,and RM. But this doesn't tell me which iterations were the best. In the chart below, I shaded each region by the ranking of model in that iteration. The higher ranking mean it did better. It is not easy, but we can glean a little more information from this chart. The models with RAD and DIS do significantly better than the models without them, even though they are not in every iteration. Further, the models with AGE seem a bit worse than those without it. The R code to make these plots is below. library(reshape) library(ggplot2); theme_set(theme_bw()) tabu.df=data.frame(res$configKeep)
colnames(tabu.df)=colnames(train)[1:(ncol(train)-1)]
tabu.df$Iteration=1:nrow(tabu.df) tabu.df$RSquared=res$eUtilityKeep tabu.df$Rank=rank(tabu.df$RSquared) tabu.melt=melt(tabu.df,id=c("Iteration","RSquared","Rank")) tabu.melt$RSquared=ifelse(tabu.melt$value==1,tabu.melt$RSquared,0)
tabu.melt$Rank=ifelse(tabu.melt$value==1,tabu.melt\$Rank,0)
(pHeat01 <- ggplot(tabu.melt, aes(Iteration,variable)) + geom_tile(aes(fill = value)) +
scale_fill_gradient(low = "white", high = "steelblue",guide=FALSE))
(pHeatRank <- ggplot(tabu.melt, aes(Iteration,variable)) + geom_tile(aes(fill = Rank)) +
scale_fill_gradient(low = "white", high = "steelblue"))