How to build Stacked Ensemble Models in R

[This article was first published on R – Predictive Hacks, and kindly contributed to R-bloggers.]

In this post, we show how to apply Stacked Ensemble Models in R using the H2O package. These models can handle both classification and regression problems. For this example, we work through a classification problem using the Breast Cancer Wisconsin dataset.

Description of the Stacked Ensemble Models

The steps below describe the individual tasks involved in training and testing a Super Learner ensemble. H2O automates most of the steps below so that you can quickly and easily build ensembles of H2O models.

  1. Set up the ensemble.
    1. Specify a list of L base algorithms (with a specific set of model parameters).
    2. Specify a metalearning algorithm.
  2. Train the ensemble.
    1. Train each of the L base algorithms on the training set.
    2. Perform k-fold cross-validation on each of these learners and collect the cross-validated predicted values from each of the L algorithms.
    3. The N cross-validated predicted values from each of the L algorithms can be combined to form a new N x L matrix. This matrix, along with the original response vector, is called the “level-one” data. (N = number of rows in the training set.)
    4. Train the metalearning algorithm on the level-one data. The “ensemble model” consists of the L base learning models and the metalearning model, which can then be used to generate predictions on a test set.
  3. Predict on new data.
    1. To generate ensemble predictions, first generate predictions from the base learners.
    2. Feed those predictions into the metalearner to generate the ensemble prediction.
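
To make the steps above concrete, here is a minimal sketch of the same procedure in base R, without H2O. It is an illustration under stated assumptions, not H2O's implementation: the data are simulated, the two base learners are simple logistic regressions on one feature each, and all names (base_learners, level_one, predict_ensemble) are made up for this example.

```r
set.seed(1)

# Simulated binary classification data (hypothetical stand-in for a training set)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(1.5 * dat$x1 - dat$x2))

# 1. Set up the ensemble: L = 2 base learners.
#    Each learner takes training data and returns a prediction function.
base_learners <- list(
  glm_x1 = function(d) {
    fit <- glm(y ~ x1, data = d, family = binomial)
    function(newd) as.numeric(predict(fit, newd, type = "response"))
  },
  glm_x2 = function(d) {
    fit <- glm(y ~ x2, data = d, family = binomial)
    function(newd) as.numeric(predict(fit, newd, type = "response"))
  }
)

# 2.1 Train each base learner on the full training set
final_base <- lapply(base_learners, function(fit_fn) fit_fn(dat))

# 2.2-2.3 k-fold cross-validation: collect the out-of-fold predictions
#         into the N x L level-one matrix
k <- 5
fold_id <- sample(rep(seq_len(k), length.out = n))
level_one <- matrix(NA_real_, nrow = n, ncol = length(base_learners),
                    dimnames = list(NULL, names(base_learners)))
for (f in seq_len(k)) {
  held_out <- fold_id == f
  for (l in names(base_learners)) {
    predict_fn <- base_learners[[l]](dat[!held_out, ])
    level_one[held_out, l] <- predict_fn(dat[held_out, ])
  }
}

# 2.4 Train the metalearner on the level-one data plus the original response
level_one_df <- data.frame(level_one, y = dat$y)
metalearner <- glm(y ~ ., data = level_one_df, family = binomial)

# 3. Predict on new data: base learner predictions first, then the metalearner
predict_ensemble <- function(newd) {
  base_preds <- as.data.frame(lapply(final_base, function(p) p(newd)))
  as.numeric(predict(metalearner, base_preds, type = "response"))
}

new_data <- data.frame(x1 = c(-1, 0, 1), x2 = c(1, 0, -1))
ensemble_pred <- predict_ensemble(new_data)
print(ensemble_pred)
```

Note that the metalearner only ever sees out-of-fold predictions during training. This is why the base models in H2O need to keep their cross-validation predictions.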

Example of the Stacked Ensemble Model

We will build a Stacked Ensemble Model by applying the following steps:

  • Split the dataset into Train (75%) and Test (25%) datasets.
  • Train 3 base models (Gradient Boost, Random Forest, and Logistic Regression) using 5-fold Cross-Validation.
  • Stack the 3 base models by training a Random Forest metalearner. Its X features are the predicted values of the 3 base models obtained from the Cross-Validation.
  • Compare the AUC score of each of the 3 base models and the Stacked one on the Test dataset.

library(tidyverse)
library(h2o)
df<-read.csv("breast_cancer.csv", stringsAsFactors = TRUE)

# remove the id_number from the features
df<-df%>%select(-id_number)

# Split the data frame into Train and Test dataset
## 75% of the sample size
smp_size <- floor(0.75 * nrow(df))

## set the seed to make your partition reproducible
set.seed(5)
train_ind <- sample(seq_len(nrow(df)), size = smp_size)

train_df <- df[train_ind, ]
test_df <- df[-train_ind, ]


# initialize the h2o
h2o.init()

# create the train and test h2o data frames

train_df_h2o<-as.h2o(train_df)
test_df_h2o<-as.h2o(test_df)

# Identify predictors and response
y <- "diagnosis"
x <- setdiff(names(train_df_h2o), y)

# Number of CV folds (to generate level-one data for stacking)
nfolds <- 5

# 1. Generate a 3-model ensemble (GBM + RF + Logistic)

# Train & Cross-validate a GBM
my_gbm <- h2o.gbm(x = x,
                  y = y,
                  training_frame = train_df_h2o,
                  nfolds = nfolds,
                  keep_cross_validation_predictions = TRUE,
                  seed = 5)

# Train & Cross-validate a RF
my_rf <- h2o.randomForest(x = x,
                          y = y,
                          training_frame = train_df_h2o,
                          nfolds = nfolds,
                          keep_cross_validation_predictions = TRUE,
                          seed = 5)


# Train & Cross-validate a LR
my_lr <- h2o.glm(x = x,
                 y = y,
                 training_frame = train_df_h2o,
                 family = c("binomial"),
                 nfolds = nfolds,
                 keep_cross_validation_predictions = TRUE,
                 seed = 5)



# Train a stacked random forest ensemble using the GBM, RF and LR above
ensemble <- h2o.stackedEnsemble(x = x,
                                y = y,
                                metalearner_algorithm="drf",
                                training_frame = train_df_h2o,
                                base_models = list(my_gbm, my_rf, my_lr))


# Eval ensemble performance on a test set
perf <- h2o.performance(ensemble, newdata = test_df_h2o)


# Compare to base learner performance on the test set
perf_gbm_test <- h2o.performance(my_gbm, newdata = test_df_h2o)
perf_rf_test <- h2o.performance(my_rf, newdata = test_df_h2o)
perf_lr_test <- h2o.performance(my_lr, newdata = test_df_h2o)
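# Collect the test AUC of each base learner so they can be compared individually
base_auc_test <- c(GBM = h2o.auc(perf_gbm_test),
                   RF  = h2o.auc(perf_rf_test),
                   LR  = h2o.auc(perf_lr_test))
baselearner_best_auc_test <- max(base_auc_test)
ensemble_auc_test <- h2o.auc(perf)
print(base_auc_test)
print(sprintf("Best Base-learner Test AUC:  %s", baselearner_best_auc_test))
print(sprintf("Ensemble Test AUC:  %s", ensemble_auc_test))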
 

Running the above block of code we get the following results:

Model                  AUC
Gradient Boost         0.9978
Random Forest          0.9939
Logistic Regression    0.9880
Stacked                0.9982

As we can see, all the models performed very well, but the Stacked one achieved the highest test AUC (0.9982, versus 0.9978 for Gradient Boost, the best base model). The margin is small here, but whenever you compare different models it is worth trying a Stacked Ensemble Model as well.
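
For readers who want to see what the AUC comparison measures, here is a small base-R sketch. It computes AUC from ranks, as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The function name auc_rank and the toy data are made up for this illustration, and H2O's own AUC computation may differ in detail.

```r
# Rank-based AUC (equivalent to the normalized Mann-Whitney U statistic).
# scores: predicted probabilities or scores; labels: 0/1 outcomes.
auc_rank <- function(scores, labels) {
  stopifnot(length(scores) == length(labels), all(labels %in% c(0, 1)))
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  r <- rank(scores)  # average ranks, so tied scores count as half
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example: 3 of the 4 positive/negative pairs are ordered correctly
labels <- c(0, 0, 1, 1)
scores <- c(0.1, 0.4, 0.35, 0.8)
print(auc_rank(scores, labels))  # 0.75
```

An AUC of 1 means every positive case is scored above every negative case, while 0.5 corresponds to random ordering.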
