
My first post in 2022! A very happy new year to anyone reading this. 😄

I was looking for a simple and effective way to tune xgboost models in R and came across the ParBayesianOptimization package. Here’s a quick tutorial on how to use it to tune an xgboost model.

# Pacman is a package management tool
install.packages("pacman")

library(pacman)

# p_load automatically installs packages if needed
p_load(xgboost, ParBayesianOptimization, mlbench, dplyr, skimr, recipes, rsample)


Data prep

# Load up some data
data("BostonHousing2")

# Data summary
skim(BostonHousing2)


Table 1: Data summary

Name BostonHousing2
Number of rows 506
Number of columns 19
_______________________
Column type frequency:
factor 2
numeric 17
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
town 0 1 FALSE 92 Cam: 30, Bos: 23, Lyn: 22, Bos: 19
chas 0 1 FALSE 2 0: 471, 1: 35

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
tract 0 1 2700.36 1380.04 1.00 1303.25 3393.50 3739.75 5082.00 ▅▂▂▇▂
lon 0 1 -71.06 0.08 -71.29 -71.09 -71.05 -71.02 -70.81 ▁▂▇▂▁
lat 0 1 42.22 0.06 42.03 42.18 42.22 42.25 42.38 ▁▃▇▃▁
medv 0 1 22.53 9.20 5.00 17.02 21.20 25.00 50.00 ▂▇▅▁▁
cmedv 0 1 22.53 9.18 5.00 17.02 21.20 25.00 50.00 ▂▇▅▁▁
crim 0 1 3.61 8.60 0.01 0.08 0.26 3.68 88.98 ▇▁▁▁▁
zn 0 1 11.36 23.32 0.00 0.00 0.00 12.50 100.00 ▇▁▁▁▁
indus 0 1 11.14 6.86 0.46 5.19 9.69 18.10 27.74 ▇▆▁▇▁
nox 0 1 0.55 0.12 0.38 0.45 0.54 0.62 0.87 ▇▇▆▅▁
rm 0 1 6.28 0.70 3.56 5.89 6.21 6.62 8.78 ▁▂▇▂▁
age 0 1 68.57 28.15 2.90 45.02 77.50 94.07 100.00 ▂▂▂▃▇
dis 0 1 3.80 2.11 1.13 2.10 3.21 5.19 12.13 ▇▅▂▁▁
rad 0 1 9.55 8.71 1.00 4.00 5.00 24.00 24.00 ▇▂▁▁▃
tax 0 1 408.24 168.54 187.00 279.00 330.00 666.00 711.00 ▇▇▃▁▇
ptratio 0 1 18.46 2.16 12.60 17.40 19.05 20.20 22.00 ▁▃▅▅▇
b 0 1 356.67 91.29 0.32 375.38 391.44 396.22 396.90 ▁▁▁▁▇
lstat 0 1 12.65 7.14 1.73 6.95 11.36 16.96 37.97 ▇▇▅▂▁

Looks like there are two factor variables. We’ll need to convert them into numeric variables before we proceed. I’ll use the recipes package to one-hot encode them.

# Predicting median house prices
rec <- recipe(cmedv ~ ., data = BostonHousing2) %>%

  # Collapse categories where population is < 3%
  step_other(town, chas, threshold = .03, other = "Other") %>%

  # Create dummy variables for all factor variables
  step_dummy(all_nominal_predictors())

# Train the recipe on the data set
prep <- prep(rec, training = BostonHousing2)

# Create the final model matrix
model_df <- bake(prep, new_data = BostonHousing2)

# All levels have been one hot encoded and separate columns have been appended to the model matrix
colnames(model_df)
##  [1] "tract"                  "lon"                    "lat"
##  [4] "medv"                   "crim"                   "zn"
##  [7] "indus"                  "nox"                    "rm"
## [10] "age"                    "dis"                    "rad"
## [13] "tax"                    "ptratio"                "b"
## [16] "lstat"                  "cmedv"                  "town_Boston.Savin.Hill"
## [19] "town_Cambridge"         "town_Lynn"              "town_Newton"
## [22] "town_Other"             "chas_X1"


Next, we can use the rsample package to create train/test splits.

splits <- rsample::initial_split(model_df, prop = 0.7)

# Training set
train_df <- rsample::training(splits)

# Test set
test_df <- rsample::testing(splits)

dim(train_df)
## [1] 354  23

dim(test_df)
## [1] 152  23


Finding optimal parameters

Now we can start to run some optimisations using the ParBayesianOptimization package.

# The xgboost interface accepts matrices
X <- train_df %>%
  # Remove the target variables
  select(-medv, -cmedv) %>%
  as.matrix()

# Get the target variable
y <- train_df %>%
  pull(cmedv)

# Cross validation folds
folds <- list(fold1 = as.integer(seq(1, nrow(X), by = 5)),
              fold2 = as.integer(seq(2, nrow(X), by = 5)))

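As an aside, the folds argument to xgb.cv() expects a list where each element holds the test-set row indices for one fold. The manual seq() approach above works, but if you prefer randomised k-fold splits, caret’s createFolds() returns the same structure. A minimal sketch, assuming the caret package is installed and y is the target vector created above:

```r
# Hypothetical alternative: randomised 5-fold CV indices via caret.
# createFolds() returns a named list of held-out row indices,
# which is the structure that xgb.cv()'s folds argument expects.
p_load(caret)

set.seed(1234)
folds <- createFolds(y, k = 5)
```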

We’ll need an objective function which can be fed to the optimiser. We’ll use the value of the evaluation metric from xgb.cv() as the value that needs to be optimised.

# Function must take the hyper-parameters as inputs
obj_func <- function(eta, max_depth, min_child_weight, subsample, lambda, alpha) {

  param <- list(

    # Hyper-parameters
    eta = eta,
    max_depth = max_depth,
    min_child_weight = min_child_weight,
    subsample = subsample,
    lambda = lambda,
    alpha = alpha,

    # Tree model
    booster = "gbtree",

    # Regression problem
    objective = "reg:squarederror",

    # Use the Mean Absolute Percentage Error
    eval_metric = "mape")

  xgbcv <- xgb.cv(params = param,
                  data = X,
                  label = y,
                  nround = 50,
                  folds = folds,
                  prediction = TRUE,
                  early_stopping_rounds = 5,
                  verbose = 0,
                  maximize = FALSE)

  lst <- list(

    # First argument must be named as "Score"
    # Function finds maxima so inverting the output
    Score = -min(xgbcv$evaluation_log$test_mape_mean),

    # Get number of trees for the best performing model
    nrounds = xgbcv$best_iteration
  )

  return(lst)
}


Once we have the objective function, we’ll need to define some bounds for the optimiser to search within.

bounds <- list(eta = c(0.001, 0.2),
               max_depth = c(1L, 10L),
               min_child_weight = c(1, 50),
               subsample = c(0.1, 1),
               lambda = c(1, 10),
               alpha = c(1, 10))


We can now run the optimiser to find a set of optimal hyper-parameters.

set.seed(1234)
bayes_out <- bayesOpt(FUN = obj_func,
                      bounds = bounds,
                      initPoints = length(bounds) + 2,
                      iters.n = 3)

# Show relevant columns from the summary object
bayes_out$scoreSummary[1:5, c(3:8, 13)]
##           eta max_depth min_child_weight subsample   lambda    alpha      Score
## 1: 0.13392137         8         4.913332 0.2105925 4.721124 3.887629 -0.0902970
## 2: 0.19400811         2        25.454160 0.9594105 9.329695 3.173695 -0.1402720
## 3: 0.16079775         2        14.035652 0.5118349 1.229953 5.093530 -0.1475580
## 4: 0.08957707         4        12.534842 0.3844404 4.358837 1.788342 -0.1410245
## 5: 0.02876388         4        36.586761 0.8107181 6.137100 6.039125 -0.3061535

# Get best parameters
data.frame(getBestPars(bayes_out))
##         eta max_depth min_child_weight subsample lambda alpha
## 1 0.1905414         8         1.541476 0.8729207      1     1
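Three iterations is a fairly short search. If the best score still seems to be improving, the optimiser can be resumed from where it left off rather than rerun from scratch; ParBayesianOptimization exposes addIterations() for this. A quick sketch, assuming the bayes_out object from above:

```r
# Resume the existing Bayesian optimisation for a few more iterations,
# reusing everything already learned about the score surface
bayes_out <- addIterations(bayes_out, iters.n = 2)

# Check whether the best parameters have changed
data.frame(getBestPars(bayes_out))
```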


Fitting the model

We can now fit a model and check how well these parameters work.

# Combine best params with base params
opt_params <- append(list(booster = "gbtree",
                          objective = "reg:squarederror",
                          eval_metric = "mae"),
                     getBestPars(bayes_out))

# Run cross validation
xgbcv <- xgb.cv(params = opt_params,
                data = X,
                label = y,
                nround = 100,
                folds = folds,
                prediction = TRUE,
                early_stopping_rounds = 5,
                verbose = 0,
                maximize = FALSE)

# Get optimal number of rounds
nrounds = xgbcv$best_iteration

# Fit an xgb model
mdl <- xgboost(data = X,
               label = y,
               params = opt_params,
               maximize = FALSE,
               early_stopping_rounds = 5,
               nrounds = nrounds,
               verbose = 0)

# Evaluate performance
actuals <- test_df$cmedv
predicted <- test_df %>%
  select_at(mdl$feature_names) %>%
  as.matrix %>%
  predict(mdl, newdata = .)

# Compute MAPE
mean(abs(actuals - predicted)/actuals)
## [1] 0.008391282


Comparison with grid search

To see how this compares with a traditional grid search, we can set up a grid of hyper-parameters and fit a model for each combination.

grd <- expand.grid(
  eta = seq(0.001, 0.2, length.out = 5),
  max_depth = seq(2L, 10L, by = 1),
  min_child_weight = seq(1, 25, length.out = 3),
  subsample = c(0.25, 0.5, 0.75, 1),
  lambda = c(1, 5, 10),
  alpha = c(1, 5, 10))

dim(grd)

grd_out <- apply(grd, 1, function(par){

  par <- append(par, list(booster = "gbtree",
                          objective = "reg:squarederror",
                          eval_metric = "mae"))

  mdl <- xgboost(data = X,
                 label = y,
                 params = par,
                 nrounds = 50,
                 early_stopping_rounds = 5,
                 maximize = FALSE,
                 verbose = 0)

  lst <- data.frame(par, score = mdl$best_score)

  return(lst)
})

grd_out <- do.call(rbind, grd_out)

best_par <- grd_out %>%
  data.frame() %>%
  arrange(score) %>%
  .[1,]

# Fit final model
params <- as.list(best_par[-length(best_par)])
xgbcv <- xgb.cv(params = params,
                data = X,
                label = y,
                nround = 100,
                folds = folds,
                prediction = TRUE,
                early_stopping_rounds = 5,
                verbose = 0,
                maximize = FALSE)

nrounds = xgbcv$best_iteration

mdl <- xgboost(data = X,
               label = y,
               params = params,
               maximize = FALSE,
               early_stopping_rounds = 5,
               nrounds = nrounds,
               verbose = 0)

# Evaluate on test set
act <- test_df$cmedv
pred <- test_df %>%
  select_at(mdl$feature_names) %>%
  as.matrix %>%
  predict(mdl, newdata = .)

mean(abs(act - pred)/act)
## [1] 0.009407723


While both methods arrive at similar final results, the Bayesian optimiser completed its search in under a minute, whereas the grid search took over seven minutes. In my experience, Bayesian optimisation also lets you search a much larger parameter space far more quickly than a traditional grid search.
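If you want to reproduce the timing comparison on your own machine, system.time() is the simplest way to measure it. A minimal sketch follows; the Sys.sleep() call is only a stand-in for the bayesOpt() or grid-search call you would actually wrap:

```r
# system.time() reports user/system/elapsed seconds for any expression
timing <- system.time({
  Sys.sleep(1)  # replace with bayesOpt(...) or the apply() grid search
})

# Wall-clock time in seconds
timing["elapsed"]
```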