Ensemble Packages in R

Posted on April 8, 2014 by Joseph Rickert in R bloggers

[This article was first published on Revolutions, and kindly contributed to R-bloggers.]


by Mike Bowles

Mike Bowles is a machine learning expert and serial entrepreneur. This is the second post in what is envisioned as a four-part series that began with Mike's Thumbnail History of Ensemble Models.

One of the main reasons for using R is the vast array of high-quality statistical algorithms available in it. Ensemble methods provide a prime example. As statistics researchers have advanced the forefront of statistical learning, they have produced R packages that incorporate their latest techniques. The table below illustrates this by comparing several of the ensemble packages available in R.

Name	Author	Algorithms	First Version Date	Last Update
ipred	Peters et al	Bagging	3/29/2002	9/3/2013
adabag	Alfaro et al	AdaBoost and Bagging	6/6/2006	7/5/2012
ada	Culp et al	AdaBoost + Friedman’s mods	9/29/2006	7/30/2010
randomForest	Breiman et al	Random Forest	4/1/2002	1/6/2012
gbm	Ridgeway et al	Stochastic Gradient Boosting	2/21/2003	1/18/2013
party	Hothorn	RF with faster tree growing	6/24/2005	1/17/2014
mboost	Hothorn	Boosting applied to glm, gam	6/16/2006	2/8/2013

Table 1. Ensemble packages available in R

The table gives the package name, the lead author and the basic contents of the package. The dates in the two rightmost columns are the date of the first version of the package and the date of the latest version. The first-version dates more or less track the development of these methods and the publication of the corresponding papers in the area. The date of the last package update is provided to indicate how actively some of these packages are maintained and how active the field remains.

A number of these packages are worth a look, even though the methods they implement have been subsumed by newer methods. For example, ipred does bagging, which has been incorporated into both Random Forest and Gradient Boosting. But the ipred package can incorporate more than one type of base learner. One of the examples in the package documentation uses Linear Discriminant Analysis in addition to binary decision trees. It is hard to find ensemble methods that use base learners other than binary decision trees, and simultaneously using two (or more) different base learners is unique to this package.
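As a rough illustration of the bagging that ipred provides, here is a minimal sketch. It assumes the ipred package is installed and uses the built-in iris data as a stand-in; it shows only plain bagging of trees, not the combination with Linear Discriminant Analysis mentioned above.

```r
# Assumes the ipred package is installed: install.packages("ipred")
library(ipred)

set.seed(1)  # bootstrap sampling is random; fix the seed for repeatability

# Bag 25 classification trees on the built-in iris data (a stand-in data set).
# coob = TRUE requests an out-of-bag estimate of the misclassification error.
bag_fit <- bagging(Species ~ ., data = iris, nbagg = 25, coob = TRUE)

bag_fit$err  # out-of-bag misclassification rate

# Predicted classes are aggregated across the bagged trees.
bag_pred <- predict(bag_fit, newdata = iris)
table(predicted = bag_pred, actual = iris$Species)
```

The out-of-bag error is usually a fairer guide than the table at the end, because that table is computed on the same rows used for training.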

The Random Forest algorithm frequently wins machine learning competitions, and the randomForest R package is built on code written by the late Professor Leo Breiman of Berkeley. It contains the functionality that Prof Breiman describes in his papers. It solves regression and classification problems, has an unsupervised mode, produces marginal plots of prediction versus individual attributes, and ranks attributes by importance. It also produces a similarity matrix measuring how frequently two rows from the input wind up in the same leaf node together. That gives a measure of how close the two rows are in their effect on the trained model.
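The features listed above can be tried with a short sketch like the following. It assumes the randomForest package is installed; the iris data and the parameter values are illustrative choices, not taken from the article.

```r
# Assumes the randomForest package is installed: install.packages("randomForest")
library(randomForest)

set.seed(1)

# Classification forest on the built-in iris data (a stand-in data set).
# importance = TRUE stores permutation importance measures;
# proximity  = TRUE stores the row-by-row similarity matrix.
rf_fit <- randomForest(Species ~ ., data = iris, ntree = 500,
                       importance = TRUE, proximity = TRUE)

print(rf_fit)        # out-of-bag error estimate and confusion matrix
importance(rf_fit)   # one row per attribute, ranking them by importance

# Marginal plot of the prediction versus a single attribute.
partialPlot(rf_fit, pred.data = iris, x.var = "Petal.Width",
            which.class = "versicolor")

# Similarity (proximity) matrix: entry [i, j] is the fraction of trees in
# which rows i and j end up in the same leaf node.
prox <- rf_fit$proximity
dim(prox)
```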

The gbm package is heavily used and commercially important. It’s written by Greg Ridgeway and contributors. The package incorporates the methods outlined in Professor Jerome Friedman’s papers. Those include regression under mean square and mean absolute loss, binary classification under the AdaBoost penalty and Bernoulli loss, and multiclass classification. The package includes a number of extensions (Cox proportional hazards and pairwise ranking, for example). The gbm package includes visualization tools similar to those in randomForest. It will draw 2-D or 3-D plots showing marginal predicted values versus 1 or 2 of the attributes, and it gives a table ranking attributes by importance as a guide for feature engineering. (After loading the package, type example(gbm) at the console.)
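To make the choice of loss function concrete, here is a minimal sketch assuming the gbm package is installed. The data are simulated purely for illustration, and the tree count, depth and shrinkage are arbitrary choices.

```r
# Assumes the gbm package is installed: install.packages("gbm")
library(gbm)

set.seed(1)

# Simulated stand-in data (not from the article): y depends on x1 and x2,
# while x3 is pure noise.
n   <- 500
sim <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
sim$y <- 10 * sin(pi * sim$x1) + 5 * sim$x2 + rnorm(n)

# The distribution argument selects the loss function:
#   "gaussian" = squared error, "laplace" = absolute error,
#   "bernoulli" and "adaboost" = binary classification,
#   "coxph" and "pairwise" = the survival and ranking extensions.
fit_sq  <- gbm(y ~ ., data = sim, distribution = "gaussian",
               n.trees = 500, interaction.depth = 2, shrinkage = 0.05)
fit_abs <- gbm(y ~ ., data = sim, distribution = "laplace",
               n.trees = 500, interaction.depth = 2, shrinkage = 0.05)

# Table ranking the attributes by relative influence (normalized to sum to 100).
rel_inf <- summary(fit_sq, n.trees = 500, plotit = FALSE)
print(rel_inf)

# Marginal effect of one attribute; return.grid = TRUE returns the values
# that would be plotted as a data frame instead of drawing them.
pd_x1 <- plot(fit_sq, i.var = "x1", n.trees = 500, return.grid = TRUE)
head(pd_x1)
```

Passing two names to i.var, for example c("x1", "x2"), gives the marginal effect over a pair of attributes.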

The R packages party and mboost reflect continued development of ensemble methods. The party package uses an alternative method for training binary decision trees, called conditional inference trees. The package authors describe in their associated paper how conditional inference trees¹ reduce bias and reduce training time. In the party package, the authors use Breiman’s Random Forest procedure with conditional inference trees as base learners.
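A small sketch of both pieces, a single conditional inference tree and a forest of them, might look like this. It assumes the party package is installed; the iris data and the ntree and mtry values are illustrative assumptions.

```r
# Assumes the party package is installed: install.packages("party")
library(party)

set.seed(1)

# A single conditional inference tree on the built-in iris data.
ct <- ctree(Species ~ ., data = iris)
print(ct)

# A random forest whose base learners are conditional inference trees.
cf <- cforest(Species ~ ., data = iris,
              controls = cforest_unbiased(ntree = 100, mtry = 2))

# Out-of-bag predictions give an honest estimate of the error rate.
oob_pred <- predict(cf, OOB = TRUE)
mean(oob_pred != iris$Species)

# Permutation-based variable importance, one value per attribute.
vi <- varimp(cf)
sort(vi, decreasing = TRUE)
```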

The mboost package approaches generalized linear models and generalized additive models as boosting problems. The connection between boosting and linear models is described in Elements of Statistical Learning², Algorithm 16.1. When used for least squares regression, boosting with single attributes as base learners corresponds to Efron’s Least Angle Regression³ or Tibshirani’s Lasso regression⁴. The package authors extend the method to generalized linear models and generalized additive models.
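The idea can be tried with a sketch along these lines, assuming the mboost package is installed. The built-in mtcars and cars data sets and the control settings are illustrative choices only.

```r
# Assumes the mboost package is installed: install.packages("mboost")
library(mboost)

# Boosted linear model: each boosting step updates the coefficient of the
# single attribute that best fits the current residuals.
lin_fit <- glmboost(mpg ~ wt + hp + disp + qsec, data = mtcars,
                    family = Gaussian(),
                    control = boost_control(mstop = 200, nu = 0.1))

# Only attributes selected in at least one step get a coefficient,
# which is where the lasso-like behaviour comes from.
coef(lin_fit)

# Boosted additive model: smooth base learners replace the linear ones.
gam_fit <- gamboost(dist ~ speed, data = cars, dfbase = 4,
                    control = boost_control(mstop = 50))
head(predict(gam_fit))
```

Swapping the family argument (for example Binomial() or Poisson()) is how the same machinery is applied to other generalized linear models.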

Here’s an example of the sort of results these methods will produce. These results are for predicting the compressive strength of concrete based on the ingredients in the concrete (water, cement, coarse aggregate, fine aggregate, etc.). The data set comes from the UC Irvine Machine Learning Repository. The results come from the gbm package (3000 trees, 10-fold cross-validation, shrinkage=0.003). In Figure 1, going clockwise from the upper left are plots of the progress of training (the green line is out-of-sample performance and the black line is in-sample performance), the relative importance of the various ingredients in predicting compressive strength, and the marginal changes in predicted strength as functions of fine aggregate and water. As the figures show, modern ensemble methods are far from being black boxes. Besides delivering predictions, they deliver a significant amount of information about the character of their predictions.
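The settings quoted above (3000 trees, 10-fold cross-validation, shrinkage of 0.003) can be tried with a sketch like the one below. It is not the author's code: it assumes the gbm package is installed, uses the built-in quakes data as a stand-in for the concrete data, and picks an interaction depth that the article does not state.

```r
# Assumes the gbm package is installed: install.packages("gbm")
library(gbm)

set.seed(42)

# Stand-in data: the built-in quakes data set (1000 rows), not the UCI
# concrete data. The response is the number of reporting stations.
cv_fit <- gbm(stations ~ lat + long + depth + mag, data = quakes,
              distribution = "gaussian",
              n.trees = 3000, shrinkage = 0.003, cv.folds = 10,
              interaction.depth = 3,  # assumption: not stated in the article
              n.cores = 1)            # run the folds sequentially

# Number of trees that minimizes the cross-validated error. With
# plot.it = TRUE, gbm.perf also draws the training and CV error curves.
best_iter <- gbm.perf(cv_fit, method = "cv", plot.it = FALSE)
best_iter

# Relative importance of each attribute at the selected number of trees.
rel_inf <- summary(cv_fit, n.trees = best_iter, plotit = FALSE)
print(rel_inf)

# Marginal change in the prediction as a function of one attribute.
pd_mag <- plot(cv_fit, i.var = "mag", n.trees = best_iter,
               return.grid = TRUE)
head(pd_mag)
```

With a shrinkage this small, the selected number of trees may sit at or near the 3000-tree limit, which suggests that more trees or a larger shrinkage would be worth trying.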

Figure 1 – Outputs from gbm Model for UCI Compressive Strength of Concrete

References

1. Hothorn, Hornik and Zeileis (2006), http://statmath.wu-wien.ac.at/~zeileis/papers/Hothorn+Hornik+Zeileis-2006.pdf
2. Hastie, Tibshirani and Friedman, Elements of Statistical Learning, 2nd edition, Springer, 2009
3. Efron et al., Least Angle Regression, http://www.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf
4. Tibshirani, The Lasso, http://statweb.stanford.edu/~tibs/lasso.html
