Machine Learning Basics – Gradient Boosting & XGBoost

Posted on November 28, 2018 by Dr. Shirin Glander in R bloggers

[This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers.]

In a recent video, I covered Random Forests and Neural Nets as part of the codecentric.ai Bootcamp.

In the most recent video, I covered Gradient Boosting and XGBoost.

You can find the video on YouTube and the slides on slides.com. Both are again in German with code examples in Python.

Below, you will find the English version of the content, plus code examples in R for caret, xgboost and h2o. 🙂

Like Random Forest, Gradient Boosting is another technique for performing supervised machine learning tasks, like classification and regression. Implementations of this technique go by different names; the ones you will most commonly encounter are Gradient Boosting Machines (abbreviated GBM) and XGBoost. XGBoost is particularly popular because it has been the winning algorithm in a number of recent Kaggle competitions.

Similar to Random Forests, Gradient Boosting is an ensemble learner. This means it will create a final model based on a collection of individual models. On its own, each of these individual models has only weak predictive power, but combining many such weak models in an ensemble leads to a much improved overall result. In Gradient Boosting Machines, the most common type of weak model used is the decision tree – another parallel to Random Forests.

How Gradient Boosting works

Let’s look at how Gradient Boosting works. Most of the magic is described in the name: “Gradient” plus “Boosting”.

Boosting builds models from individual so-called “weak learners” in an iterative way. In the Random Forests part, I had already discussed the differences between Bagging and Boosting as tree ensemble methods. In boosting, the individual models are not built on completely random subsets of data and features but sequentially, by putting more weight on instances with wrong predictions and high errors. The general idea behind this is that instances that are hard to predict correctly (“difficult” cases) will be focused on during learning, so that the model learns from past mistakes. When we train each individual model on a random subset of the training set, we also call this Stochastic Gradient Boosting, which can help improve the generalizability of our model.

The gradient is used to minimize a loss function, similar to how Neural Nets utilize gradient descent to optimize (“learn”) weights. In each round of training, a weak learner is built and its predictions are compared to the correct outcome that we expect. The distance between prediction and truth represents the error of our model. These errors can now be used to calculate the gradient. The gradient is nothing fancy: it is basically the partial derivative of our loss function, so it describes the steepness of our error function. The gradient can be used to find the direction in which to change the model parameters in order to (maximally) reduce the error in the next round of training by “descending the gradient”.
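
To make the link between errors and gradients a bit more concrete, here is a minimal sketch in base R. It assumes a squared-error loss (my choice for illustration, not something the slides prescribe) and checks numerically that the negative gradient with respect to the prediction is just the residual. All names and numbers are made up.

```r
# Squared-error loss for each prediction (assumed loss, for illustration only)
loss <- function(pred, y) 0.5 * (y - pred)^2

# Analytical gradient of that loss with respect to the prediction
grad <- function(pred, y) pred - y

y    <- c(3, -1, 2)     # made-up true values
pred <- c(2.5, 0, 2)    # made-up current predictions

# Numerical check of the gradient by central finite differences
eps      <- 1e-6
num_grad <- (loss(pred + eps, y) - loss(pred - eps, y)) / (2 * eps)

residuals <- y - pred

all.equal(grad(pred, y), num_grad, tolerance = 1e-6)  # gradient formula is correct
all.equal(-grad(pred, y), residuals)                  # negative gradient = residual
```

For other loss functions the negative gradient is no longer the plain residual, but it plays the same role: it tells us in which direction each prediction should move.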

In Neural Nets, gradient descent is used to look for the minimum of the loss function, i.e. learning the model parameters (e.g. weights) for which the prediction error is lowest in a single model. In Gradient Boosting we are combining the predictions of multiple models, so we are not optimizing the model parameters directly but the boosted model predictions. Therefore, the gradients are fed back into the running training process: the next tree is fit to these values (for a squared-error loss, the negative gradients are simply the residuals), and its predictions are added to those of the trees before it.
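
The whole procedure can be sketched in a few lines of base R. This is a toy version under simplifying assumptions (one numeric feature, squared-error loss, one-split trees as weak learners, simulated data), not how gbm or xgboost are implemented. The function names fit_stump, predict_stump, boost and predict_boost are hypothetical helpers I made up for this sketch.

```r
# Fit a one-split regression tree (a "stump") to target r using feature x
fit_stump <- function(x, r) {
  xs     <- sort(unique(x))
  splits <- (head(xs, -1) + tail(xs, -1)) / 2   # midpoints between neighbours
  sse <- sapply(splits, function(s) {
    left  <- r[x <= s]
    right <- r[x > s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  s <- splits[which.min(sse)]                   # split with the smallest error
  list(split = s, left = mean(r[x <= s]), right = mean(r[x > s]))
}

predict_stump <- function(stump, x) {
  ifelse(x <= stump$split, stump$left, stump$right)
}

boost <- function(x, y, n_trees = 100, learning_rate = 0.1) {
  init   <- mean(y)                 # start from a constant prediction
  pred   <- rep(init, length(y))
  stumps <- vector("list", n_trees)
  for (m in seq_len(n_trees)) {
    residuals   <- y - pred         # negative gradient of 0.5 * (y - pred)^2
    stumps[[m]] <- fit_stump(x, residuals)
    # take a small step in the direction suggested by the new stump
    pred <- pred + learning_rate * predict_stump(stumps[[m]], x)
  }
  list(init = init, stumps = stumps, learning_rate = learning_rate)
}

predict_boost <- function(model, x) {
  pred <- rep(model$init, length(x))
  for (stump in model$stumps) {
    pred <- pred + model$learning_rate * predict_stump(stump, x)
  }
  pred
}

# Simulated toy data: a noisy sine curve
set.seed(42)
x <- runif(200, 0, 2 * pi)
y <- sin(x) + rnorm(200, sd = 0.2)

model     <- boost(x, y, n_trees = 100, learning_rate = 0.1)
mse_start <- mean((y - mean(y))^2)                 # constant model
mse_boost <- mean((y - predict_boost(model, x))^2) # boosted ensemble
c(constant = mse_start, boosted = mse_boost)
```

Each stump on its own is a very poor model of a sine curve; the training error only drops because every new stump is fit to what the previous ones still get wrong. Note that this is training error, so it says nothing about generalization.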

Because we apply gradient descent, we will find the learning rate (the “step size” with which we descend the gradient; several implementations call it shrinkage) and the loss function as hyperparameters in Gradient Boosting models – just as with Neural Nets. Other hyperparameters of Gradient Boosting are similar to those of Random Forests:

the number of iterations (i.e. the number of trees to ensemble),
the number of observations in each leaf,
tree complexity and depth,
the proportion of samples and
the proportion of features on which to train.
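
As a rough orientation, this is how the hyperparameters above map onto xgboost's tree booster parameter names (the same names caret lists for xgbTree). The values are hypothetical starting points I picked for illustration, not recommendations, and min_child_weight is only a loose counterpart of "observations per leaf".

```r
# Hypothetical starting values; tune them for your own data
xgb_params <- list(
  eta              = 0.1,  # learning rate / shrinkage
  max_depth        = 3,    # tree complexity and depth
  min_child_weight = 1,    # minimum sum of instance weights (hessian) in a child
  subsample        = 0.8,  # proportion of samples used per tree
  colsample_bytree = 0.8   # proportion of features used per tree
)
n_rounds <- 100            # number of boosting iterations (trees)

str(xgb_params)
```

A list like this is what gets passed to the params argument of xgb.train(), while the number of iterations goes to nrounds.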

Gradient Boosting Machines vs. XGBoost

XGBoost stands for Extreme Gradient Boosting; it is a specific implementation of the Gradient Boosting method which uses more accurate approximations to find the best tree model. It employs a number of nifty tricks that make it exceptionally successful, particularly with structured data. The most important are:

1.) computing second-order gradients, i.e. second partial derivatives of the loss function (similar to Newton’s method), which provide more information about the direction of the gradients and how to get to the minimum of our loss function. While regular gradient boosting uses the loss function of our base model (e.g. decision tree) as a proxy for minimizing the error of the overall model, XGBoost uses the second-order derivative as an approximation.

2.) advanced regularization (L1 & L2), which improves model generalization.
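
Both tricks meet in the way a leaf value is computed. The sketch below follows the formulation usually given for XGBoost with an L2 penalty, where the optimal leaf weight is minus the sum of gradients divided by the sum of second derivatives plus lambda. It assumes a logistic loss, four made-up instances in one leaf, and lambda = 1 (an assumed value); it is a hand calculation, not a call into the xgboost package.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Made-up labels and current raw scores (log-odds) of the instances in one leaf
y     <- c(1, 1, 0, 1)
score <- c(0, 0, 0, 0)

p <- sigmoid(score)   # current predicted probabilities (all 0.5)
g <- p - y            # first derivative of the log loss w.r.t. the score
h <- p * (1 - p)      # second derivative of the log loss w.r.t. the score

lambda <- 1           # L2 regularization strength (assumed value)

leaf_weight       <- -sum(g) / (sum(h) + lambda)  # regularized Newton step
leaf_weight_noreg <- -sum(g) / sum(h)             # same step without the penalty

c(regularized = leaf_weight, unregularized = leaf_weight_noreg)  # 0.5 and 1
```

The second derivatives scale the step (as in Newton's method), and the penalty shrinks it towards zero, which is what makes individual leaves less aggressive.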

XGBoost has additional advantages: training is very fast and can be parallelized / distributed across clusters.

Code in R

Here is a very quick run-through of how to train Gradient Boosting and XGBoost models in R with caret, xgboost and h2o.

Data

First, data: I’ll be using the ISLR package, which contains a number of datasets; one of them is College.

Statistics for a large number of US Colleges from the 1995 issue of US News and World Report.

library(tidyverse)
library(ISLR)

ml_data <- College
ml_data %>%
  glimpse()
## Observations: 777
## Variables: 18
## $ Private     <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, ...
## $ Apps        <dbl> 1660, 2186, 1428, 417, 193, 587, 353, 1899, 1038, ...
## $ Accept      <dbl> 1232, 1924, 1097, 349, 146, 479, 340, 1720, 839, 4...
## $ Enroll      <dbl> 721, 512, 336, 137, 55, 158, 103, 489, 227, 172, 4...
## $ Top10perc   <dbl> 23, 16, 22, 60, 16, 38, 17, 37, 30, 21, 37, 44, 38...
## $ Top25perc   <dbl> 52, 29, 50, 89, 44, 62, 45, 68, 63, 44, 75, 77, 64...
## $ F.Undergrad <dbl> 2885, 2683, 1036, 510, 249, 678, 416, 1594, 973, 7...
## $ P.Undergrad <dbl> 537, 1227, 99, 63, 869, 41, 230, 32, 306, 78, 110,...
## $ Outstate    <dbl> 7440, 12280, 11250, 12960, 7560, 13500, 13290, 138...
## $ Room.Board  <dbl> 3300, 6450, 3750, 5450, 4120, 3335, 5720, 4826, 44...
## $ Books       <dbl> 450, 750, 400, 450, 800, 500, 500, 450, 300, 660, ...
## $ Personal    <dbl> 2200, 1500, 1165, 875, 1500, 675, 1500, 850, 500, ...
## $ PhD         <dbl> 70, 29, 53, 92, 76, 67, 90, 89, 79, 40, 82, 73, 60...
## $ Terminal    <dbl> 78, 30, 66, 97, 72, 73, 93, 100, 84, 41, 88, 91, 8...
## $ S.F.Ratio   <dbl> 18.1, 12.2, 12.9, 7.7, 11.9, 9.4, 11.5, 13.7, 11.3...
## $ perc.alumni <dbl> 12, 16, 30, 37, 2, 11, 26, 37, 23, 15, 31, 41, 21,...
## $ Expend      <dbl> 7041, 10527, 8735, 19016, 10922, 9727, 8861, 11487...
## $ Grad.Rate   <dbl> 60, 56, 54, 59, 15, 55, 63, 73, 80, 52, 73, 76, 74...

Gradient Boosting in caret

The most flexible R package for machine learning is caret. If you go to the Available Models section in the online documentation and search for “Gradient Boosting”, this is what you’ll find:

Model	method Value	Type	Libraries	Tuning Parameters
eXtreme Gradient Boosting	xgbDART	Classification, Regression	xgboost, plyr	nrounds, max_depth, eta, gamma, subsample, colsample_bytree, rate_drop, skip_drop, min_child_weight
eXtreme Gradient Boosting	xgbLinear	Classification, Regression	xgboost	nrounds, lambda, alpha, eta
eXtreme Gradient Boosting	xgbTree	Classification, Regression	xgboost, plyr	nrounds, max_depth, eta, gamma, colsample_bytree, min_child_weight, subsample
Gradient Boosting Machines	gbm_h2o	Classification, Regression	h2o	ntrees, max_depth, min_rows, learn_rate, col_sample_rate
Stochastic Gradient Boosting	gbm	Classification, Regression	gbm, plyr	n.trees, interaction.depth, shrinkage, n.minobsinnode

The table above lists the different Gradient Boosting implementations you can use with caret. Here I’ll show a very simple Stochastic Gradient Boosting example:

library(caret)

# Partition into training and test data
set.seed(42)
index <- createDataPartition(ml_data$Private, p = 0.7, list = FALSE)
train_data <- ml_data[index, ]
test_data  <- ml_data[-index, ]

# Train model with preprocessing & repeated cv
model_gbm <- caret::train(Private ~ .,
                          data = train_data,
                          method = "gbm",
                          preProcess = c("scale", "center"),
                          trControl = trainControl(method = "repeatedcv", 
                                                  number = 5, 
                                                  repeats = 3, 
                                                  verboseIter = FALSE),
                          verbose = 0)
model_gbm
## Stochastic Gradient Boosting 
## 
## 545 samples
##  17 predictor
##   2 classes: 'No', 'Yes' 
## 
## Pre-processing: scaled (17), centered (17) 
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 437, 436, 435, 436, 436, 436, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa    
##   1                   50      0.9217830  0.7940197
##   1                  100      0.9327980  0.8264864
##   1                  150      0.9370795  0.8389860
##   2                   50      0.9352501  0.8321982
##   2                  100      0.9358337  0.8356107
##   2                  150      0.9333816  0.8301596
##   3                   50      0.9364511  0.8357210
##   3                  100      0.9400927  0.8463975
##   3                  150      0.9346048  0.8330068
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 100,
##  interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
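
By default, caret only varied n.trees and interaction.depth and held the other two parameters constant. If you want to tune all four, one option is to hand caret your own grid via the tuneGrid argument of train(). The sketch below only builds such a grid with base R; the candidate values are arbitrary choices of mine, and the train() call is shown as a comment because it needs the data and packages from above.

```r
# Candidate values are arbitrary examples, not recommendations
gbm_grid <- expand.grid(
  n.trees           = c(50, 100, 150),
  interaction.depth = 1:3,
  shrinkage         = c(0.05, 0.1),
  n.minobsinnode    = 10
)

nrow(gbm_grid)  # 18 candidate combinations
head(gbm_grid)

# Would be passed on like this (not run here):
# caret::train(Private ~ ., data = train_data, method = "gbm",
#              tuneGrid = gbm_grid, trControl = ..., verbose = 0)
```

Keep in mind that every row of the grid is fit once per resample, so the grid size multiplies the training time.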

With predict(), we can use this model to make predictions on test data. Here, I’ll be feeding this directly to the confusionMatrix function:

caret::confusionMatrix(
  data = predict(model_gbm, test_data),
  reference = test_data$Private
  )
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No   57   6
##        Yes   6 163
##                                          
##                Accuracy : 0.9483         
##                  95% CI : (0.9114, 0.973)
##     No Information Rate : 0.7284         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.8693         
##  Mcnemar's Test P-Value : 1              
##                                          
##             Sensitivity : 0.9048         
##             Specificity : 0.9645         
##          Pos Pred Value : 0.9048         
##          Neg Pred Value : 0.9645         
##              Prevalence : 0.2716         
##          Detection Rate : 0.2457         
##    Detection Prevalence : 0.2716         
##       Balanced Accuracy : 0.9346         
##                                          
##        'Positive' Class : No             
##
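
If you are wondering where these statistics come from, most of them can be recomputed by hand from the four counts in the confusion matrix. The sketch below copies the counts from the output above into a base R matrix; the variable names are my own.

```r
# Counts copied from the confusion matrix above
# (rows: prediction, columns: reference)
cm <- matrix(c(57, 6, 6, 163), nrow = 2,
             dimnames = list(Prediction = c("No", "Yes"),
                             Reference  = c("No", "Yes")))

accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["No", "No"] / sum(cm[, "No"])      # positive class is "No"
specificity <- cm["Yes", "Yes"] / sum(cm[, "Yes"])
balanced    <- (sensitivity + specificity) / 2

# Cohen's kappa: agreement beyond what is expected by chance
expected    <- sum(rowSums(cm) * colSums(cm)) / sum(cm)^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, balanced = balanced,
        kappa = cohen_kappa), 4)
```

Rounded to four digits, this reproduces the Accuracy, Sensitivity, Specificity, Balanced Accuracy and Kappa values reported by confusionMatrix().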

The xgboost library

We can also directly work with the xgboost package in R. It’s a bit more involved but also includes advanced possibilities.

The easiest way to work with xgboost is with the xgboost() function. The four most important arguments to give are:

data: a matrix of the training data
label: the response variable in numeric format (for binary classification 0 & 1)
objective: defines what learning task should be trained, here binary classification
nrounds: number of boosting iterations
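
The label needs a little care, because Private is a factor. Converting a factor with as.numeric() returns the internal level codes, which start at 1, so subtracting 1 gives the 0/1 coding. Here is a tiny base R illustration with a made-up factor:

```r
private <- factor(c("Yes", "No", "Yes", "Yes"))   # made-up example values

levels(private)       # "No" "Yes" (levels are sorted alphabetically)
as.numeric(private)   # 2 1 2 2 (internal level codes)

label <- as.numeric(private) - 1
label                 # 1 0 1 1, i.e. "No" = 0 and "Yes" = 1
```

This only works as intended for a two-level factor, and which class ends up as 1 depends on the level order.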

library(xgboost)

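xgboost_model <- xgboost(data = as.matrix(train_data[, -1]),        # all columns except Private
                         label = as.numeric(train_data$Private)-1,  # No = 0, Yes = 1
                         max_depth = 3, 
                         objective = "binary:logistic", 
                         nrounds = 10, 
                         verbose = FALSE,
                         # prediction is an xgb.cv() argument; here it is simply
                         # passed on as a parameter, as the printout below shows
                         prediction = TRUE)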
xgboost_model
## ##### xgb.Booster
## raw: 6.7 Kb 
## call:
##   xgb.train(params = params, data = dtrain, nrounds = nrounds, 
##     watchlist = watchlist, verbose = verbose, print_every_n = print_every_n, 
##     early_stopping_rounds = early_stopping_rounds, maximize = maximize, 
##     save_period = save_period, save_name = save_name, xgb_model = xgb_model, 
##     callbacks = callbacks, max_depth = 3, objective = "binary:logistic", 
##     prediction = TRUE)
## params (as set within xgb.train):
##   max_depth = "3", objective = "binary:logistic", prediction = "TRUE", silent = "1"
## xgb.attributes:
##   niter
## callbacks:
##   cb.evaluation.log()
## # of features: 17 
## niter: 10
## nfeatures : 17 
## evaluation_log:
##     iter train_error
##        1    0.064220
##        2    0.051376
## ---                 
##        9    0.036697
##       10    0.033028

We can again use predict(). Because we get predicted probabilities here, we need to convert them into labels before comparing them with the true class:

predict(xgboost_model, 
        as.matrix(test_data[, -1])) %>%
  tibble(value = .) %>%   # as.tibble() is deprecated; wrap the probabilities explicitly
  mutate(prediction = round(value),
         label = as.numeric(test_data$Private)-1) %>%
  count(prediction, label)
## # A tibble: 4 x 3
##   prediction label     n
##        <dbl> <dbl> <int>
## 1          0     0    56
## 2          0     1     6
## 3          1     0     7
## 4          1     1   163
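
Using round() on the probabilities amounts to a classification threshold of 0.5. That is a choice, not a given: if one kind of mistake is more costly than the other, a different cut-off may be preferable. The sketch below uses made-up probabilities and labels to show how the counts shift when the threshold moves.

```r
probs <- c(0.08, 0.35, 0.62, 0.97)   # made-up predicted probabilities
truth <- c(0, 1, 1, 1)               # made-up true labels

pred_default <- as.numeric(probs > 0.5)   # same result as round(probs) here
pred_strict  <- as.numeric(probs > 0.7)   # stricter cut-off for class 1

table(prediction = pred_default, label = truth)
table(prediction = pred_strict,  label = truth)
```

With the stricter cut-off, one more true 1 is predicted as 0, which is the usual trade-off between the two types of error.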

Alternatively, we can use xgb.train(), which is more flexible and allows for more advanced settings compared to xgboost(). Here, we first need to create a so-called DMatrix from the data. Optionally, we can define a watchlist for evaluating model performance during the training run. I am also creating a parameter set as a list object, which I am feeding to the params argument.

dtrain <- xgb.DMatrix(as.matrix(train_data[, -1]), 
                      label = as.numeric(train_data$Private)-1)
dtest <- xgb.DMatrix(as.matrix(test_data[, -1]), 
                      label = as.numeric(test_data$Private)-1)

params <- list(max_depth = 3, 
               objective = "binary:logistic",
               silent = 0)

watchlist <- list(train = dtrain, eval = dtest)

bst_model <- xgb.train(params = params, 
                       data = dtrain, 
                       nrounds = 10, 
                       watchlist = watchlist,   # evaluated after every round
                       verbose = FALSE,
                       # as before, prediction is only passed on as a parameter
                       prediction = TRUE)
bst_model
## ##### xgb.Booster
## raw: 6.7 Kb 
## call:
##   xgb.train(params = params, data = dtrain, nrounds = 10, watchlist = watchlist, 
##     verbose = FALSE, prediction = TRUE)
## params (as set within xgb.train):
##   max_depth = "3", objective = "binary:logistic", silent = "0", prediction = "TRUE", silent = "1"
## xgb.attributes:
##   niter
## callbacks:
##   cb.evaluation.log()
## # of features: 17 
## niter: 10
## nfeatures : 17 
## evaluation_log:
##     iter train_error eval_error
##        1    0.064220   0.099138
##        2    0.051376   0.077586
## ---                            
##        9    0.036697   0.060345
##       10    0.033028   0.056034

The model can be used just as before:

predict(bst_model, 
        as.matrix(test_data[, -1])) %>%
  tibble(value = .) %>%   # as.tibble() is deprecated; wrap the probabilities explicitly
  mutate(prediction = round(value),
         label = as.numeric(test_data$Private)-1) %>%
  count(prediction, label)
## # A tibble: 4 x 3
##   prediction label     n
##        <dbl> <dbl> <int>
## 1          0     0    56
## 2          0     1     6
## 3          1     0     7
## 4          1     1   163

The third option is to use xgb.cv(), which performs cross-validation. This function does not return a model; rather, it is used to find optimal hyperparameters, particularly nrounds.

cv_model <- xgb.cv(params = params,
                   data = dtrain, 
                   nrounds = 100, 
                   watchlist = watchlist,
                   nfold = 5,
                   verbose = FALSE,
                   prediction = TRUE) # prediction of cv folds

Here, we can see after how many rounds we achieved the smallest test error:

cv_model$evaluation_log %>%
  filter(test_error_mean == min(test_error_mean))
##   iter train_error_mean train_error_std test_error_mean test_error_std
## 1   17        0.0082568     0.002338999       0.0550458     0.01160461
## 2   25        0.0018350     0.001716352       0.0550458     0.01004998
## 3   29        0.0009176     0.001123826       0.0550458     0.01421269
## 4   32        0.0009176     0.001123826       0.0550458     0.01535140
## 5   33        0.0004588     0.000917600       0.0550458     0.01535140
## 6   80        0.0000000     0.000000000       0.0550458     0.01004998
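
Several iterations tie for the smallest test error here. One simple convention (my suggestion, not the only sensible rule) is to take the earliest of them, since fewer trees mean a simpler and faster model. The sketch below rebuilds the tied rows from the output above as a plain data frame so that it runs without xgboost.

```r
# Rows copied from the evaluation log above
eval_log <- data.frame(
  iter            = c(17, 25, 29, 32, 33, 80),
  test_error_mean = rep(0.0550458, 6),
  test_error_std  = c(0.01160461, 0.01004998, 0.01421269,
                      0.01535140, 0.01535140, 0.01004998)
)

# which.min() returns the position of the first minimum,
# i.e. the smallest number of rounds among the tied iterations
best_nrounds <- eval_log$iter[which.min(eval_log$test_error_mean)]
best_nrounds  # 17
```

On the real object, the same idea would be applied to cv_model$evaluation_log, and the result could then be used as nrounds in xgb.train().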

H2O

H2O is another popular package for machine learning in R. We will first set up the session and create training and test data:

library(h2o)
h2o.init(nthreads = -1)
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         2 hours 50 minutes 
##     H2O cluster timezone:       Europe/Berlin 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.20.0.8 
##     H2O cluster version age:    2 months and 8 days  
##     H2O cluster name:           H2O_started_from_R_shiringlander_lci733 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.31 GB 
##     H2O cluster total cores:    8 
##     H2O cluster allowed cores:  8 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         XGBoost, Algos, AutoML, Core V3, Core V4 
##     R Version:                  R version 3.5.1 (2018-07-02)
h2o.no_progress()

data_hf <- as.h2o(ml_data)

splits <- h2o.splitFrame(data_hf, 
                         ratios = 0.75, 
                         seed = 1)

train <- splits[[1]]
test <- splits[[2]]

response <- "Private"
features <- setdiff(colnames(train), response)

Gradient Boosting

The Gradient Boosting implementation can be used as follows:

h2o_gbm <- h2o.gbm(x = features, 
                   y = response, 
                   training_frame = train,
                   nfolds = 3) # cross-validation
h2o_gbm
## Model Details:
## ==============
## 
## H2OBinomialModel: gbm
## Model ID:  GBM_model_R_1543499512871_1815 
## Model Summary: 
##   number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1              50                       50               13001         5
##   max_depth mean_depth min_leaves max_leaves mean_leaves
## 1         5    5.00000          8         21    15.74000
## 
## 
## H2OBinomialMetrics: gbm
## ** Reported on training data. **
## 
## MSE:  0.00244139
## RMSE:  0.04941043
## LogLoss:  0.02582422
## Mean Per-Class Error:  0
## AUC:  1
## Gini:  1
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##         No Yes    Error    Rate
## No     160   0 0.000000  =0/160
## Yes      0 419 0.000000  =0/419
## Totals 160 419 0.000000  =0/579
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.671121 1.000000 246
## 2                       max f2  0.671121 1.000000 246
## 3                 max f0point5  0.671121 1.000000 246
## 4                 max accuracy  0.671121 1.000000 246
## 5                max precision  0.996764 1.000000   0
## 6                   max recall  0.671121 1.000000 246
## 7              max specificity  0.996764 1.000000   0
## 8             max absolute_mcc  0.671121 1.000000 246
## 9   max min_per_class_accuracy  0.671121 1.000000 246
## 10 max mean_per_class_accuracy  0.671121 1.000000 246
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## 
## H2OBinomialMetrics: gbm
## ** Reported on cross-validation data. **
## ** 3-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
## 
## MSE:  0.05688845
## RMSE:  0.238513
## LogLoss:  0.2007733
## Mean Per-Class Error:  0.09630817
## AUC:  0.9668929
## Gini:  0.9337858
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##         No Yes    Error     Rate
## No     133  27 0.168750  =27/160
## Yes     10 409 0.023866  =10/419
## Totals 143 436 0.063903  =37/579
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.400785 0.956725 265
## 2                       max f2  0.132011 0.972352 287
## 3                 max f0point5  0.725883 0.953442 229
## 4                 max accuracy  0.400785 0.936097 265
## 5                max precision  0.997925 1.000000   0
## 6                   max recall  0.009298 1.000000 381
## 7              max specificity  0.997925 1.000000   0
## 8             max absolute_mcc  0.400785 0.837212 265
## 9   max min_per_class_accuracy  0.811928 0.906250 224
## 10 max mean_per_class_accuracy  0.725883 0.912552 229
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## Cross-Validation Metrics Summary: 
##                                mean           sd  cv_1_valid  cv_2_valid
## accuracy                   0.939574 6.4933195E-4   0.9390863   0.9408602
## auc                       0.9701875  0.007612803   0.9708713  0.98301804
## err                     0.060425993 6.4933195E-4 0.060913704 0.059139784
## err_count                 11.666667   0.33333334        12.0        11.0
## f0point5                 0.95418453  0.006589541   0.9537167  0.96582466
## f1                       0.95859224  4.803105E-4   0.9577465   0.9594096
## f2                       0.96321476  0.006296414  0.96181047  0.95307916
## lift_top_group            1.3816328  0.012157884   1.3971632   1.3576642
## logloss                  0.20019953  0.016917419   0.2080731  0.16776533
## max_per_class_error      0.12948361  0.029007828       0.125  0.08163265
## mcc                      0.84875494  0.001501441  0.84894496  0.85125524
## mean_per_class_accuracy   0.9184681  0.009156114   0.9197695  0.93363625
## mean_per_class_error     0.08153185  0.009156114   0.0802305 0.066363774
## mse                     0.056778133 0.0035938106  0.06340453  0.05105359
## precision                0.95136136  0.010758136    0.951049   0.9701493
## r2                        0.7161539  0.014445015  0.68836546   0.7368911
## recall                   0.96641994  0.010696565    0.964539   0.9489051
## rmse                     0.23804487  0.007509063  0.25180256  0.22595042
## specificity              0.87051636  0.029007828       0.875   0.9183673
##                          cv_3_valid
## accuracy                 0.93877554
## auc                       0.9566731
## err                      0.06122449
## err_count                      12.0
## f0point5                 0.94301224
## f1                       0.95862067
## f2                        0.9747546
## lift_top_group            1.3900709
## logloss                  0.22476016
## max_per_class_error      0.18181819
## mcc                      0.84606457
## mean_per_class_accuracy   0.9019987
## mean_per_class_error     0.09800129
## mse                     0.055876285
## precision                 0.9328859
## r2                       0.72320527
## recall                    0.9858156
## rmse                     0.23638165
## specificity               0.8181818

We can calculate performance on test data with h2o.performance():

h2o.performance(h2o_gbm, test)
## H2OBinomialMetrics: gbm
## 
## MSE:  0.03509102
## RMSE:  0.187326
## LogLoss:  0.1350709
## Mean Per-Class Error:  0.05216017
## AUC:  0.9770811
## Gini:  0.9541623
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##        No Yes    Error    Rate
## No     48   4 0.076923   =4/52
## Yes     4 142 0.027397  =4/146
## Totals 52 146 0.040404  =8/198
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.580377 0.972603 136
## 2                       max f2  0.214459 0.979730 146
## 3                 max f0point5  0.907699 0.979827 127
## 4                 max accuracy  0.580377 0.959596 136
## 5                max precision  0.997449 1.000000   0
## 6                   max recall  0.006710 1.000000 187
## 7              max specificity  0.997449 1.000000   0
## 8             max absolute_mcc  0.580377 0.895680 136
## 9   max min_per_class_accuracy  0.821398 0.952055 131
## 10 max mean_per_class_accuracy  0.821398 0.956797 131
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`

XGBoost

Alternatively, we can also use the XGBoost implementation of H2O:

h2o_xgb <- h2o.xgboost(x = features, 
                       y = response, 
                       training_frame = train,
                       nfolds = 3)
h2o_xgb
## Model Details:
## ==============
## 
## H2OBinomialModel: xgboost
## Model ID:  XGBoost_model_R_1543499512871_2178 
## Model Summary: 
##   number_of_trees
## 1              50
## 
## 
## H2OBinomialMetrics: xgboost
## ** Reported on training data. **
## 
## MSE:  0.25
## RMSE:  0.5
## LogLoss:  0.6931472
## Mean Per-Class Error:  0.5
## AUC:  0.5
## Gini:  0
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##        No Yes    Error      Rate
## No      0 160 1.000000  =160/160
## Yes     0 419 0.000000    =0/419
## Totals  0 579 0.276339  =160/579
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.500000 0.839679   0
## 2                       max f2  0.500000 0.929047   0
## 3                 max f0point5  0.500000 0.765996   0
## 4                 max accuracy  0.500000 0.723661   0
## 5                max precision  0.500000 0.723661   0
## 6                   max recall  0.500000 1.000000   0
## 7              max specificity  0.500000 0.000000   0
## 8             max absolute_mcc  0.500000 0.000000   0
## 9   max min_per_class_accuracy  0.500000 0.000000   0
## 10 max mean_per_class_accuracy  0.500000 0.500000   0
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## 
## H2OBinomialMetrics: xgboost
## ** Reported on cross-validation data. **
## ** 3-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
## 
## MSE:  0.25
## RMSE:  0.5
## LogLoss:  0.6931472
## Mean Per-Class Error:  0.5
## AUC:  0.5
## Gini:  0
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##        No Yes    Error      Rate
## No      0 160 1.000000  =160/160
## Yes     0 419 0.000000    =0/419
## Totals  0 579 0.276339  =160/579
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.500000 0.839679   0
## 2                       max f2  0.500000 0.929047   0
## 3                 max f0point5  0.500000 0.765996   0
## 4                 max accuracy  0.500000 0.723661   0
## 5                max precision  0.500000 0.723661   0
## 6                   max recall  0.500000 1.000000   0
## 7              max specificity  0.500000 0.000000   0
## 8             max absolute_mcc  0.500000 0.000000   0
## 9   max min_per_class_accuracy  0.500000 0.000000   0
## 10 max mean_per_class_accuracy  0.500000 0.500000   0
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## Cross-Validation Metrics Summary: 
##                               mean            sd  cv_1_valid cv_2_valid
## accuracy                  0.720723   0.032527234  0.77294683 0.72820514
## auc                            0.5           0.0         0.5        0.5
## err                     0.27927703   0.032527234  0.22705314  0.2717949
## err_count                53.333332      3.756476        47.0       53.0
## f0point5                 0.7629575   0.029264713   0.8097166 0.77006507
## f1                      0.83686095   0.022139339   0.8719346    0.84273
## f2                       0.9273414   0.010952134  0.94451004 0.93053734
## lift_top_group                 1.0           0.0         1.0        1.0
## logloss                  0.6931472 4.8956235E-17   0.6931472  0.6931472
## max_per_class_error            1.0           0.0         1.0        1.0
## mcc                            0.0           NaN         NaN        NaN
## mean_per_class_accuracy        0.5           0.0         0.5        0.5
## mean_per_class_error           0.5           0.0         0.5        0.5
## mse                           0.25           0.0        0.25       0.25
## precision                 0.720723   0.032527234  0.77294683 0.72820514
## r2                      -0.2677759    0.08917216 -0.42450133 -0.2631212
## recall                         1.0           0.0         1.0        1.0
## rmse                           0.5           0.0         0.5        0.5
## specificity                    0.0           0.0         0.0        0.0
##                           cv_3_valid
## accuracy                  0.66101694
## auc                              0.5
## err                       0.33898306
## err_count                       60.0
## f0point5                   0.7090909
## f1                        0.79591835
## f2                        0.90697676
## lift_top_group                   1.0
## logloss                    0.6931472
## max_per_class_error              1.0
## mcc                              NaN
## mean_per_class_accuracy          0.5
## mean_per_class_error             0.5
## mse                             0.25
## precision                 0.66101694
## r2                      -0.115705125
## recall                           1.0
## rmse                             0.5
## specificity                      0.0

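We can evaluate it on the test data just as before. Note that in this run, the H2O XGBoost model performs at chance level (AUC of 0.5, with every college predicted as private), in contrast to the GBM above: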

h2o.performance(h2o_xgb, test)
## H2OBinomialMetrics: xgboost
## 
## MSE:  0.25
## RMSE:  0.5
## LogLoss:  0.6931472
## Mean Per-Class Error:  0.5
## AUC:  0.5
## Gini:  0
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##        No Yes    Error     Rate
## No      0  52 1.000000   =52/52
## Yes     0 146 0.000000   =0/146
## Totals  0 198 0.262626  =52/198
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.500000 0.848837   0
## 2                       max f2  0.500000 0.933504   0
## 3                 max f0point5  0.500000 0.778252   0
## 4                 max accuracy  0.500000 0.737374   0
## 5                max precision  0.500000 0.737374   0
## 6                   max recall  0.500000 1.000000   0
## 7              max specificity  0.500000 0.000000   0
## 8             max absolute_mcc  0.500000 0.000000   0
## 9   max min_per_class_accuracy  0.500000 0.000000   0
## 10 max mean_per_class_accuracy  0.500000 0.500000   0
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
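
The metrics of this XGBoost run are suspiciously round. As a quick sanity check, the base R sketch below computes what MSE, RMSE and LogLoss would be if a model returned a probability of exactly 0.5 for every row. The constant prediction is my assumption; the class counts are copied from the test confusion matrix above.

```r
# Class counts copied from the test confusion matrix above: 52 "No", 146 "Yes"
y <- c(rep(0, 52), rep(1, 146))
p <- rep(0.5, length(y))   # assumption: the model outputs 0.5 for every row

mse     <- mean((y - p)^2)
rmse    <- sqrt(mse)
logloss <- -mean(y * log(p) + (1 - y) * log(1 - p))
acc_all_yes <- mean(y == 1)   # accuracy when every row is predicted as "Yes"

round(c(mse = mse, rmse = rmse, logloss = logloss, accuracy = acc_all_yes), 7)
```

These values match the reported MSE of 0.25, RMSE of 0.5, LogLoss of 0.6931472 and maximum accuracy of 0.737374, so the output is at least consistent with a model that has not learned anything. The sketch does not tell us why that happened in this session.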

sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS  10.14.1
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] h2o_3.20.0.8    bindrcpp_0.2.2  xgboost_0.71.2  caret_6.0-80   
##  [5] lattice_0.20-38 ISLR_1.2        forcats_0.3.0   stringr_1.3.1  
##  [9] dplyr_0.7.7     purrr_0.2.5     readr_1.1.1     tidyr_0.8.2    
## [13] tibble_1.4.2    ggplot2_3.1.0   tidyverse_1.2.1
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-137       bitops_1.0-6       lubridate_1.7.4   
##  [4] dimRed_0.1.0       httr_1.3.1         rprojroot_1.3-2   
##  [7] tools_3.5.1        backports_1.1.2    utf8_1.1.4        
## [10] R6_2.3.0           rpart_4.1-13       lazyeval_0.2.1    
## [13] colorspace_1.3-2   nnet_7.3-12        withr_2.1.2       
## [16] gbm_2.1.4          gridExtra_2.3      tidyselect_0.2.5  
## [19] compiler_3.5.1     cli_1.0.1          rvest_0.3.2       
## [22] xml2_1.2.0         bookdown_0.7       scales_1.0.0      
## [25] sfsmisc_1.1-2      DEoptimR_1.0-8     robustbase_0.93-3 
## [28] digest_0.6.18      rmarkdown_1.10     pkgconfig_2.0.2   
## [31] htmltools_0.3.6    rlang_0.3.0.1      readxl_1.1.0      
## [34] ddalpha_1.3.4      rstudioapi_0.8     bindr_0.1.1       
## [37] jsonlite_1.5       ModelMetrics_1.2.2 RCurl_1.95-4.11   
## [40] magrittr_1.5       Matrix_1.2-15      fansi_0.4.0       
## [43] Rcpp_0.12.19       munsell_0.5.0      abind_1.4-5       
## [46] stringi_1.2.4      yaml_2.2.0         MASS_7.3-51.1     
## [49] plyr_1.8.4         recipes_0.1.3      grid_3.5.1        
## [52] pls_2.7-0          crayon_1.3.4       haven_1.1.2       
## [55] splines_3.5.1      hms_0.4.2          knitr_1.20        
## [58] pillar_1.3.0       reshape2_1.4.3     codetools_0.2-15  
## [61] stats4_3.5.1       CVST_0.2-2         magic_1.5-9       
## [64] glue_1.3.0         evaluate_0.12      blogdown_0.9      
## [67] data.table_1.11.8  modelr_0.1.2       foreach_1.4.4     
## [70] cellranger_1.1.0   gtable_0.2.0       kernlab_0.9-27    
## [73] assertthat_0.2.0   DRR_0.0.3          xfun_0.4          
## [76] gower_0.1.2        prodlim_2018.04.18 broom_0.5.0       
## [79] e1071_1.7-0        class_7.3-14       survival_2.43-1   
## [82] geometry_0.3-6     timeDate_3043.102  RcppRoll_0.3.0    
## [85] iterators_1.0.10   lava_1.6.3         ipred_0.9-8
