Under the Hood of AdaBoost

[This article was first published on Stories by Holly Emblem on Medium, and kindly contributed to R-bloggers.]

A short introduction to the AdaBoost algorithm

An Ensemble Orchestra: Photo by Kael Bloom on Unsplash

In this post, we will give a very brief introduction to boosting algorithms and delve under the hood of one popular boosting algorithm, AdaBoost. The purpose is to provide a gentle introduction to the key concepts of boosting and AdaBoost. This isn’t a definitive pros-and-cons comparison of AdaBoost versus Gradient Boosting and its other relatives, but rather a summary of the theory needed to understand the algorithm.

Real World Applications for AdaBoost

AdaBoost can be used to solve a variety of real-world problems, such as predicting customer churn and classifying the topics customers call or write in about. The algorithm is heavily used for classification problems, given its relative ease of implementation in languages such as R and Python.
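As an illustration of that ease of use, here is a minimal sketch of fitting AdaBoost with scikit-learn, which the article cites later; the synthetic dataset is a stand-in for real customer data such as churn labels.

```python
# Minimal sketch: AdaBoost on a synthetic binary classification task.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real customer dataset (e.g. churn vs. no churn)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit a sequence of weak learners (decision stumps by default)
clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)  # fraction of test samples classified correctly
```

With only a handful of lines, the ensemble of weak learners is trained and evaluated; the heavy lifting of reweighting and combining classifiers happens inside `fit`.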

What are Boosting Algorithms?

Boosting algorithms fall within the broader family of ensemble modelling. Broadly speaking, there are two key approaches to model building within data science: building a single model, and building an ensemble of models. Boosting falls within the latter approach. In AdaBoost, the models are constructed as follows: at each iteration, a new weak learner is introduced sequentially and aims to compensate for the “shortcomings” of the prior models, with the goal of creating a strong classifier. The overall aim of this exercise is to consecutively fit new models that provide increasingly accurate estimations of our response variable.

Boosting works from the assumption that each weak hypothesis, or model, has a higher accuracy than random guessing; this assumption is known as the “weak learning condition”.

What is AdaBoost?

The AdaBoost algorithm was developed by Freund and Schapire in 1996 and is still heavily used in various industries. AdaBoost reaches its end goal of a strong classifier by sequentially introducing new models that compensate for the “shortcomings” of prior models. Scikit-learn summarises AdaBoost’s core principle as follows: it “fits a sequence of weak learners on repeatedly modified versions of the data.” This definition will allow us to understand and expand upon AdaBoost’s processes.

Getting Started

To begin with, a weak classifier is trained, with all of the training samples given an equal weight. Once this initial classifier is trained, two things happen. First, a weight is calculated for the classifier itself, with more accurate classifiers given a higher weight and less accurate ones a lower weight. This weight is calculated from the classifier’s error rate: the number of misclassified samples in the training set divided by the training set size. This per-model output weight is known as the “alpha”.

Calculating the Alpha

Each classifier has a weight calculated for it, based on the classifier’s error rate.

For each iteration, the alpha of the classifier is calculated as:

alpha = 0.5 * ln((1 − error) / error)

The lower the error rate, the higher the alpha. This relationship is visualised as follows:

Image from Chris McCormick’s excellent AdaBoost tutorial
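As a sketch, the alpha calculation can be written as a small Python helper (a hypothetical function for illustration, not part of any library):

```python
import math

def classifier_alpha(error_rate):
    # alpha = 0.5 * ln((1 - err) / err): the lower the error, the higher the alpha.
    # An error rate of exactly 0.5 (no better than coin-flipping) yields alpha = 0.
    return 0.5 * math.log((1 - error_rate) / error_rate)

classifier_alpha(0.5)  # → 0.0: a random-guessing classifier gets zero weight
```

Note that a classifier worse than random (error above 0.5) receives a negative alpha, so its vote is effectively inverted.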

Intuitively, there is also a relationship between the weight of a training example and the alpha. If a classifier with a high alpha misclassifies a training example, that example will be given more weight than an example misclassified by a weaker classifier. This is intuitive if we think of a classifier with a higher alpha as a more reliable witness: when it misclassifies something, we want to investigate that example further.

Understanding Weights for Training Samples

Secondly, the AdaBoost algorithm directs its attention to the data examples misclassified by our first weak classifier. It does this by assigning a weight to each data sample, whose value depends on whether the classifier classified that sample correctly or incorrectly.

We can break down a visualisation of weights per example, below:

Step 1: Our first model, where w_i = 1/N

In this instance, we can see that each training example has an equal weight, and that the model has classified some examples correctly and others incorrectly. After each iteration, the sample weights are modified: examples with higher weights (those that have been incorrectly classified) are more likely to be emphasised in the next training round. When a sample is correctly classified, it is given less weight in the next step of model building.

An example of a later model, where the weights have been changed:

The formula for this weight update is:

w_i ← w_i * exp(−α_t * y_i * h_t(x_i)), followed by renormalising so that the weights sum to one.

Because y_i and h_t(x_i) are both in {−1, +1}, misclassified examples are scaled by exp(+α_t) and so have their weight increased, while correctly classified examples are scaled by exp(−α_t) and have their weight decreased.
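A minimal sketch of this weight update in Python, assuming labels encoded as −1/+1 (the helper name is ours, not a library function):

```python
import numpy as np

def update_weights(weights, alpha, y_true, y_pred):
    # With labels in {-1, +1}, y_true * y_pred is +1 when correct and -1 when wrong,
    # so misclassified samples are multiplied by exp(+alpha) and correct ones by
    # exp(-alpha).
    new_w = weights * np.exp(-alpha * y_true * y_pred)
    # Renormalise so the weights form a distribution summing to one.
    return new_w / new_w.sum()
```

After the update, the misclassified examples carry more of the total weight, so the next weak learner is pushed to focus on them.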

Building the Final Classifier

Once all of the iterations have been completed, the weak learners are combined with their weights to form a strong classifier, as expressed in the equation below:

H(x) = sign( Σ_t α_t * h_t(x) ), for t = 1 … T

The final classifier is therefore built from T weak classifiers, where h_t(x) is the output of weak classifier t and α_t is the weight applied to that classifier. The final output is a weighted combination of all of the classifiers.
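This final weighted vote can be sketched in Python as follows (an illustrative helper rather than a library API; each weak learner is any function mapping an input to −1 or +1):

```python
def strong_classify(x, weak_learners, alphas):
    # H(x) = sign(sum_t alpha_t * h_t(x)): a weighted vote of the weak learners.
    total = sum(a * h(x) for h, a in zip(weak_learners, alphas))
    return 1 if total >= 0 else -1

# Two toy decision stumps on a single feature, with the first weighted more heavily
stumps = [lambda x: 1 if x > 0 else -1, lambda x: 1 if x > 2 else -1]
alphas = [1.0, 0.4]
strong_classify(1.0, stumps, alphas)  # → 1: the higher-alpha stump wins the vote
```

The high-alpha stump dominates the disagreement here, which is exactly the “reliable witness” intuition from earlier: more accurate weak learners get a louder voice in the final vote.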

This is a whistle-stop tour of the theory of AdaBoost, and should be seen as an introductory exploration of the boosting algorithm. For further reading, I recommend the following resources:

Recommended Further Reading

Natekin, A. and Knoll, A. (2018). Gradient boosting machines, a tutorial. Frontiers in Neurorobotics. [online] Available at: https://core.ac.uk/download/pdf/82873637.pdf

Li, C. (2018). A Gentle Introduction to Gradient Boosting. [online] Available at: http://www.ccs.neu.edu/home/vip/teach/MLcourse/4_boosting/slides/gradient_boosting.pdf

Schapire, R. (2018). Explaining AdaBoost. [online] Available at: https://math.arizona.edu/~hzhang/math574m/Read/explaining-adaboost.pdf

YouTube. (2018). Extending Machine Learning Algorithms — AdaBoost Classifier | packtpub.com. [online] Available at: https://www.youtube.com/watch?v=BoGNyWW9-mE

Scikit-learn.org. (2018). 1.11. Ensemble methods — scikit-learn 0.20.2 documentation. [online] Available at: https://scikit-learn.org/stable/modules/ensemble.html#AdaBoost

McCormick, C. (2013). AdaBoost Tutorial. [online] Available at: http://mccormickml.com/2013/12/13/adaboost-tutorial/

Mohri, M. (2018). Foundations of Machine Learning [online] Available at: https://cs.nyu.edu/~mohri/mls/ml_boosting.pdf

Under the Hood of AdaBoost was originally published in HackerNoon.com on Medium.
