How to build Stacked Ensemble Models in R

[This article was first published on R – Predictive Hacks, and kindly contributed to R-bloggers].

In this post, we show how to build Stacked Ensemble Models in R using the H2O package. These models can handle both classification and regression problems. For this example, we work on a classification problem using the Breast Cancer Wisconsin dataset.

Description of the Stacked Ensemble Models

The steps below describe the individual tasks involved in training and testing a Super Learner ensemble. H2O automates most of the steps below so that you can quickly and easily build ensembles of H2O models.

  1. Set up the ensemble.
    1. Specify a list of L base algorithms (with a specific set of model parameters).
    2. Specify a metalearning algorithm.
  2. Train the ensemble.
    1. Train each of the L base algorithms on the training set.
    2. Perform k-fold cross-validation on each of these learners and collect the cross-validated predicted values from each of the L algorithms.
    3. The N cross-validated predicted values from each of the L algorithms can be combined to form a new N x L matrix. This matrix, along with the original response vector, is called the “level-one” data. (N = number of rows in the training set.)
    4. Train the metalearning algorithm on the level-one data. The “ensemble model” consists of the L base learning models and the metalearning model, which can then be used to generate predictions on a test set.
  3. Predict on new data.
    1. To generate ensemble predictions, first generate predictions from the base learners.
    2. Feed those predictions into the metalearner to generate the ensemble prediction.
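
The steps above can be sketched in base R without H2O. This is a minimal illustration on simulated data, not H2O's implementation: the two logistic-regression base learners, the logistic metalearner, and all variable names are assumptions chosen for brevity.

```r
# Minimal base-R sketch of the Super Learner steps (no H2O required).
# The simulated data and every name below are illustrative assumptions.
set.seed(1)

# Simulated binary-classification data: N rows, two predictors
N <- 200
dat <- data.frame(x1 = rnorm(N), x2 = rnorm(N))
dat$y <- rbinom(N, 1, plogis(1.5 * dat$x1 - dat$x2))

# 1. Set up the ensemble: L = 2 base learners, each a fit-then-predict function
base_learners <- list(
  glm_x1 = function(train, new) {
    fit <- glm(y ~ x1, data = train, family = binomial)
    predict(fit, newdata = new, type = "response")
  },
  glm_x2 = function(train, new) {
    fit <- glm(y ~ x2, data = train, family = binomial)
    predict(fit, newdata = new, type = "response")
  }
)
L <- length(base_learners)

# 2. k-fold cross-validation: collect out-of-fold predictions per learner
k <- 5
fold_id <- sample(rep(seq_len(k), length.out = N))
level_one <- matrix(NA_real_, nrow = N, ncol = L,
                    dimnames = list(NULL, names(base_learners)))
for (fold in seq_len(k)) {
  holdout <- fold_id == fold
  for (j in seq_len(L)) {
    level_one[holdout, j] <- base_learners[[j]](dat[!holdout, ], dat[holdout, ])
  }
}

# 3. Train the metalearner on the N x L level-one data plus the response
level_one_df <- data.frame(level_one, y = dat$y)
metalearner <- glm(y ~ ., data = level_one_df, family = binomial)

# 4. Predict on new data: base-learner predictions first, then the metalearner.
#    The base learners are refit on the full training set for this step.
new_dat <- data.frame(x1 = rnorm(10), x2 = rnorm(10))
new_level_one <- as.data.frame(
  sapply(base_learners, function(learner) learner(dat, new_dat))
)
ensemble_pred <- predict(metalearner, newdata = new_level_one, type = "response")
```

Here `level_one` has one row per training observation and one column per base learner, which is the N x L matrix described in step 2.3. Every entry is an out-of-fold prediction, so the metalearner never sees a prediction made on data the base learner was trained on.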


Example of the Stacked Ensemble Model

We will build a Stacked Ensemble Model by applying the following steps:

  • Split the dataset into Train (75%) and Test (25%) datasets.
  • Train 3 base models (Gradient Boost, Random Forest, and Logistic Regression) using 5-fold Cross-Validation.
  • Stack the 3 base models using Random Forest as the metalearner. Its X features are the predicted values of the 3 models obtained from the Cross-Validation.
  • Compare the AUC score of each of the 3 base models and the Stacked one on the Test dataset.

library(h2o)

df <- read.csv("breast_cancer.csv", stringsAsFactors = TRUE)

# remove the id_number from the features
df$id_number <- NULL

# Split the data frame into Train and Test dataset
## 75% of the sample size
smp_size <- floor(0.75 * nrow(df))

## set the seed to make your partition reproducible
## (the seed value here is an assumption; the original value is not shown)
set.seed(5)
train_ind <- sample(seq_len(nrow(df)), size = smp_size)

train_df <- df[train_ind, ]
test_df <- df[-train_ind, ]

# initialize the h2o
h2o.init()

# create the train and test h2o data frames
train_df_h2o <- as.h2o(train_df)
test_df_h2o <- as.h2o(test_df)

# Identify predictors and response
y <- "diagnosis"
x <- setdiff(names(train_df_h2o), y)

# Number of CV folds (to generate level-one data for stacking)
nfolds <- 5

# 1. Generate a 3-model ensemble (GBM + RF + Logistic)

# Train & Cross-validate a GBM
my_gbm <- h2o.gbm(x = x,
                  y = y,
                  training_frame = train_df_h2o,
                  nfolds = nfolds,
                  keep_cross_validation_predictions = TRUE,
                  seed = 5)

# Train & Cross-validate a RF
my_rf <- h2o.randomForest(x = x,
                          y = y,
                          training_frame = train_df_h2o,
                          nfolds = nfolds,
                          keep_cross_validation_predictions = TRUE,
                          seed = 5)

# Train & Cross-validate a LR
my_lr <- h2o.glm(x = x,
                 y = y,
                 training_frame = train_df_h2o,
                 family = "binomial",
                 nfolds = nfolds,
                 keep_cross_validation_predictions = TRUE,
                 seed = 5)

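# Train a stacked ensemble using the GBM, RF and LR above.
# metalearner_algorithm = "drf" makes the metalearner a random forest;
# without it, H2O falls back to its default metalearner (a GLM).
ensemble <- h2o.stackedEnsemble(x = x,
                                y = y,
                                training_frame = train_df_h2o,
                                metalearner_algorithm = "drf",
                                base_models = list(my_gbm, my_rf, my_lr))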

# Eval ensemble performance on a test set
perf <- h2o.performance(ensemble, newdata = test_df_h2o)

# Compare to base learner performance on the test set
perf_gbm_test <- h2o.performance(my_gbm, newdata = test_df_h2o)
perf_rf_test <- h2o.performance(my_rf, newdata = test_df_h2o)
perf_lr_test <- h2o.performance(my_lr, newdata = test_df_h2o)
baselearner_best_auc_test <- max(h2o.auc(perf_gbm_test), h2o.auc(perf_rf_test), h2o.auc(perf_lr_test))
ensemble_auc_test <- h2o.auc(perf)
print(sprintf("Best Base-learner Test AUC:  %s", baselearner_best_auc_test))
print(sprintf("Ensemble Test AUC:  %s", ensemble_auc_test))

Running the above block of code we get the following results:

Model                 AUC
Gradient Boost        0.9978
Random Forest         0.9939
Logistic Regression   0.9880
Stacked               0.9982

As we can see, all the models performed very well, but the Stacked one achieved the highest AUC score (0.9982, versus 0.9978 for Gradient Boost, the best base model). Whenever you compare different models, it is worth trying a Stacked Ensemble Model as well.
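
For readers who want to see what the AUC score measures, here is a small base-R sketch. It uses the rank-based (Mann-Whitney) formula, under which AUC is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The helper `auc_rank` and the toy data are hypothetical and are not part of H2O, which computes AUC internally via `h2o.auc`.

```r
# Rank-based AUC for 0/1 labels and numeric scores.
# `auc_rank` is a hypothetical helper written for illustration only.
auc_rank <- function(labels, scores) {
  stopifnot(length(labels) == length(scores), all(labels %in% c(0, 1)))
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  stopifnot(n_pos > 0, n_neg > 0)
  r <- rank(scores)  # average ranks, so tied scores count as half a win
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example: 2 negatives and 2 positives
labels <- c(0, 0, 1, 1)
scores <- c(0.1, 0.4, 0.35, 0.8)
auc_rank(labels, scores)  # 0.75: 3 of the 4 positive/negative pairs are ordered correctly
```

An AUC of 1 means every positive case is scored above every negative case, so the differences among the four models above are small differences in how a few pairs of test cases are ordered.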
