Adaboosting is proven to be one of the most effective class prediction algorithms. It mainly consists of an ensemble simpler models (known as “weak learners”) that, although not very effective individually, are very performant combined.
The process by which these weak learners are combined is though more complex than simply averaging results. Very briefly, Adaboosting training process could be described as follows:
For each weak learner:
1) Train weak learner so that the weighted error sum of squares is minimised
2) Update weights, so that correctly classified cases have their weight reduced and misclassified cases have their weights increased.
3) Determine weak learner’s weight, i.e., the total contribution of the weak learner’s result to the overall score. This is known as alpha and is calculated as 0.5 * ln((1- error.rate)/error.rate))
As the weight is updated on each iteration, each weak learner will tend to focus more on those cases that were misclassified on previous instances.
For further information about Adaboosting Algorithm, this Schapire’s article provides a very useful high-level guidance http://rob.schapire.net/papers/explaining-adaboost.pdf
Decision stumps as weak learners
The most common weak learner used in Adaboosting is known as Decision Stump and consists basically on a decision tree of depth 1, i.e., a model that returns an output based on a single condition, which could be summarised as “If (condition) then A else B”.
ada package in R
Although the implementation provides very good results in terms of model performance, “ada” package has two main problems:
- It creates too big objects: Even with not so big datasets (around 500k x 50) that the final model object can be extremely big (5 or 6 Gb) and consequently too expensive to keep in memory. Of course, this is needed to perform any kind of prediction with the model. This is because the ada object is an ensemble of rpart objects, which holds a bunch of other information which is actually not needed when you just want to train an adaBoost model with Stumps and then predict with it.
- Its execution time is too high: As both ada and rpart objects have a lot of other uses, they are not optimal for Stumps in terms of timings
Considering that each stump consists just of a condition, two probability estimates (one when the condition is TRUE and the other when it’s FALSE) and a multiplier (alpha), I thought it could be much better, both in terms of size and timings, even just using R.
For that reason I developed adaStump, which is a package that takes advantage of ada’s excellent training process implementation, but creates a smaller output object and generates faster predictions.
Before running the example, you need to download the package from my Github repo. In order to do that, you need to have devtools installed
After the package is installed, let’s start with the example! In this case, the letters dataset is loaded, its variable names assigned and the variable isA is created, which gets a value 1 if true and 0 if false. That will be the class we will predict.
library(adaStump) #Load Letters Dataset letters.data <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/letter-recognition/letter-recognition.data", header = F) names(letters.data) <- c("lettr", "xbox", "ybox", "width", "high", "onpix", "xbar", "ybar", "x2bar", "y2bar", "xybar", "x2ybr", "xy2br", "xege", "xegvy", "yege", "yegvx") #Create a binary class letters.data$isA <- ifelse(letters.data$lettr == "A","Yes","No") #Create model formula. Exclude letter var for obvious reasons antec <- paste0(setdiff(names(letters.data),c("isA","lettr")), collapse = "+") fla <- as.formula(paste("isA", antec, sep = "~"))
Now we are going to create two identical models. As AdaBoosting is usually ran with bagging, the bag.frac parameter is set to 1 so that all the cases are included. As what we want to demonstrate is that we can get the same results with much less memory usage and in less time, we need to ensure that both models are identical.
#ada model fit passing the corresponding control function to create Stumps fit_with.ada <- ada(formula = fla,data = letters.data, type = "real", control = rpart.control(maxdepth=1,cp=-1,minsplit=0,xval=0) , iter = 40, nu = 0.05, bag.frac = 1) #adaStump fit_with.adaStump <- adaStump(formula = fla,data = letters.data, type = "real", iter = 40, nu = 0.05, bag.frac = 1)
Let us take a look now at the object sizes:
format(object.size(fit_with.ada), units = "KB")  "56006.3 Kb" format(object.size(fit_with.adaStump), units = "KB")  "4.7 Kb"
Taking a look at these numbers we can definitely say that the first problem is solved. Besides, the size of ada objects depend on the bag fraction, the sample size and the amount of iterations while adaStump objects only depend on the amount of iterations. There is almost no scenario where the size of the adaStump object can turn into a resource problem even on old personal computers. Now, let’s see if the new model can also predict quicker:
> system.time(predictions_ada <- predict(fit_with.ada, letters.data, type = "prob")) user system elapsed 1.100 0.000 1.114 > system.time(predictions_adaStump <- predict(fit_with.adaStump, letters.data)) user system elapsed 0.42 0.00 0.42
Indeed 🙂 . Let us check if the results are the same. Contrary to ada, adaStump prediction method only returns the probability estimate of the class outcome (i.e., the probability that the class variable is 1). Its complement can be easily calculated by doing 1 – the vector obtained.
> summary(abs(predictions_ada[,2] - predictions_adaStump)) Min. 1st Qu. Median Mean 3rd Qu. Max. 0.000e+00 0.000e+00 0.000e+00 8.391e-18 1.041e-17 1.110e-16
Although there is some difference in some cases, it tends to be minimal. We can say now that we can produce almost the same outcome using 1% of the memory and more than 2.5 times faster. However, we can do better
Usually, AdaBoosting generates stumps based on the same condition. In this case, for example, we can see by accessing to the model frame (fit_with.adaStump$model), that the condition x2ybr >= 2.5 is repeated constantly. Of course, all these rows can be collapsed so that the condition is evaluated only once. For that purpose you can use pruneTree()
> fit_with.prunedadaStump <- pruneTree(fit_with.adaStump) > system.time(predictions_prunedadaStump <- predict(fit_with.prunedadaStump, letters.data)) user system elapsed 0.124 0.000 0.121
Now it is more than 9 times faster. As a final check, let us see if the results are the same
> summary(abs(predictions_ada[,2] - predictions_prunedadaStump)) Min. 1st Qu. Median Mean 3rd Qu. Max. 6.489e-11 3.317e-10 7.411e-10 1.194e-09 8.965e-10 9.125e-09
Although the difference between both methods is much greater in this case (there is a pre-calculation step where some decimals are discarded) it is still acceptable for most use cases. So, choosing to do the reduction or not is a decision based on the level of precision needed.
You can download the full script used in this post here
Some observations before closing this post. Consider this is a first release. I will be completing and crossing out items of this to-do list:
- So far, the function only works with binary numeric classes. That is 0 for “No” and 1 for “Yes”
- From the 3 prediction types in “ada”, this package only covers type “prob”
- The function name pruneTree() is awful, but changing the name implies re-building and re-uploading. I will include that with other necessary changes
- Although faster, execution times (and their relationship) can be slightly different from computer to computer and from problem to problem. So, don’t stick to adaStump is X times higher.
- If you find any bug, please write me. I have done several checks but of course problems can arise everywhere.
As usual, feel free to drop me a line with comments, observations, critics, etc