One of the techniques that has caused the most excitement in the machine learning community is boosting, which in essence is a process of iteratively refining, e.g. by reweighting, of estimated regression and classification functions (though it has primarily been applied to the latter), in order to improve predictive ability.
Much has been made of the remark by the late statistician Leo Breiman that boosting is “the best off-the-shelf classifier in the world,” his term off-the-shelf meaning that the given method can be used by nonspecialist users without special tweaking. Many analysts have indeed reported good results from the method.
In this post I will
- Briefly introduce the topic.
- Give a view of boosting that may not be well known.
- Give a surprising example.
As with some of my recent posts, this will be based on material from the book I’m writing on regression and classification.
The key point, almost always missed in technical discussions, is that boosting is really about bias reduction. Take the linear model, our example in this posting.
A linear model is rarely if ever exactly correct. Thus use of a linear model will result in bias; in some regions of the predictor vector X, the model will overestimate the true regression function, while in others it will underestimate — no matter how large our sample is. It thus may be profitable to try to reduce bias in regions in which our unweighted predictions are very bad, at the hopefully small sacrifice of some prediction accuracy in places where our unweighted analysis is doing well. (In the classification setting, a small loss in accuracy in estimating the conditional probability function won’t hurt our predictions at all, since our predictions won’t change.) The reweighting (or other iterative) process is aimed at achieving a positive tradeoff of that nature.
To motivate the notion of boosting, following is an adaptation of some material in Richard Berk’s book, Statistical Learning from a Regression Perspective, which by the way is one of the most thoughtful, analytic books on statistics I’ve ever seen.
Say we have fit an OLS model to our data, and through various means (see my book for some old and new methods) suspect that we have substantial model bias. Berk’s algorithm, which he points out is not boosting but is similar in spirit, is roughly as follows (I’ve made some changes):
- Fit the OLS model, naming the resulting coefficient vector b0. Calculate the residuals and their sum of squares, and set i =0.
- Update i to i+1. Fit a weighted least-squares model, using as weights the absolute residuals, naming the result bi. Calculate the new residuals and sum of squares.
- If i is less than the desired number of iterations k, go to Step 2.
In the end, take your final coefficient vector to be a weighted average of all the bi, with the weights being inversely proportional to the sums of squared residuals.
Again, we do all this in the hope of reducing model bias where it is most important. If all goes well, our ability to predict future observations will be enhanced.
Choice of weapons:
R offers a nice variety of packages for boosting. We’ll use the mboost package here, because it is largely geared toward parametric models such as the linear. In particular, it provides us with revised coefficients, rather than just outputting a “black box” prediction machine.
Of course, like any self-respecting R package, mboost offers a bewildering set of arguments in its functions. But Leo Breiman was a really smart guy, extraordinarily insightful. Based on his “off-the-shelf” remark, we will simply use the default values of the arguments.
Fong and Ouliaris (1995) do an analysis of relations between currency rates for Canada, Germany, France, the UK and Japan (pre-European Union days). Do they move together? Let’s look at predicting the Japanese yen from the others.
This is time series data, and the authors of the above paper do a very sophisticated analysis along those lines. But we’ll just do straight linear modeling here.
After applying OLS (not shown here), we find a pretty good fit, with an adjusted R-squared value of 0.89. However, there are odd patterns in the residuals, and something disturbing occurs when we take a k-Nearest Neighbors approach.
R-squared, whether a population value or the sample estimate reported by lm(), is the squared correlation between Y and its predicted value. Thus R-squared can be calculated for any method of regression function estimation, not just the linear model. In particular, we can apply the concept to kNN.
We’ll use the functions from my regtools package.
> xdata <- preprocessx(curr1[,-5],25,xval=TRUE) > kout <- knnest(curr1[,5],xdata,25) > ypredknn <- knnpred(kout,xdata$x) > cor(ypredknn,curr1[,5])^2  0.9817131
This is rather troubling. It had seemed that our OLS fit was very nice, but apparently we are “leaving money on the table” — we can do substantially better than that simple linear model.
So, let’s give boosting a try. Let’s split the data into training and test sets, and compare boosting to OLS.
> trnidxs <- sample(1:761,500)
> predidxs <- setdiff(1:761,trnidxs)
> mbout <- glmboost(Yen ~ .,data=curr1[trnidxs,])
> lmout <- lm(Yen ~ .,data=curr1[trnidxs,])
> mbpred <- predict(mbout,curr1[predidxs,])
> lmpred <- predict(lmout,curr1[predidxs,])
> predy <- curr1[predidxs,]$Yen
Well, lo and behold, boosting actually did worse than OLS! Clearly we can’t generalize from this, and as mentioned, many analysts have reported big gains from boosting. And though Breiman was one of the giants of this field, the example here shows that boosting is not ready for off-the-shelf usage. On the contrary, there are also numerous reports of boosting having problems, such as bizarre cases in which the iterations of boosting seemed to be converging, only to have them suddenly diverge.
Once again: There are no “silver bullets” in this field.