# Using bayesian optimisation to tune a XGBOOST model in R

**R'tichoke**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

My first post in 2022! A very happy new year to anyone reading this. ?

I was looking for a simple and effective way to tune `xgboost`

models in `R`

and came across this package called ParBayesianOptimization. Here’s a quick tutorial on how to use it to tune a `xgboost`

model.

# Pacman is a package management tool install.packages("pacman") library(pacman) # p_load automatically installs packages if needed p_load(xgboost, ParBayesianOptimization, mlbench, dplyr, skimr, recipes, resample)

## Data prep

# Load up some data data("BostonHousing2") # Data summary skim(BostonHousing2)

Table: Table 1: Data summary

Name | BostonHousing2 |

Number of rows | 506 |

Number of columns | 19 |

_______________________ | |

Column type frequency: | |

factor | 2 |

numeric | 17 |

________________________ | |

Group variables | None |

**Variable type: factor**

skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|

town | 0 | 1 | FALSE | 92 | Cam: 30, Bos: 23, Lyn: 22, Bos: 19 |

chas | 0 | 1 | FALSE | 2 | 0: 471, 1: 35 |

**Variable type: numeric**

skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|

tract | 0 | 1 | 2700.36 | 1380.04 | 1.00 | 1303.25 | 3393.50 | 3739.75 | 5082.00 | ▅▂▂▇▂ |

lon | 0 | 1 | -71.06 | 0.08 | -71.29 | -71.09 | -71.05 | -71.02 | -70.81 | ▁▂▇▂▁ |

lat | 0 | 1 | 42.22 | 0.06 | 42.03 | 42.18 | 42.22 | 42.25 | 42.38 | ▁▃▇▃▁ |

medv | 0 | 1 | 22.53 | 9.20 | 5.00 | 17.02 | 21.20 | 25.00 | 50.00 | ▂▇▅▁▁ |

cmedv | 0 | 1 | 22.53 | 9.18 | 5.00 | 17.02 | 21.20 | 25.00 | 50.00 | ▂▇▅▁▁ |

crim | 0 | 1 | 3.61 | 8.60 | 0.01 | 0.08 | 0.26 | 3.68 | 88.98 | ▇▁▁▁▁ |

zn | 0 | 1 | 11.36 | 23.32 | 0.00 | 0.00 | 0.00 | 12.50 | 100.00 | ▇▁▁▁▁ |

indus | 0 | 1 | 11.14 | 6.86 | 0.46 | 5.19 | 9.69 | 18.10 | 27.74 | ▇▆▁▇▁ |

nox | 0 | 1 | 0.55 | 0.12 | 0.38 | 0.45 | 0.54 | 0.62 | 0.87 | ▇▇▆▅▁ |

rm | 0 | 1 | 6.28 | 0.70 | 3.56 | 5.89 | 6.21 | 6.62 | 8.78 | ▁▂▇▂▁ |

age | 0 | 1 | 68.57 | 28.15 | 2.90 | 45.02 | 77.50 | 94.07 | 100.00 | ▂▂▂▃▇ |

dis | 0 | 1 | 3.80 | 2.11 | 1.13 | 2.10 | 3.21 | 5.19 | 12.13 | ▇▅▂▁▁ |

rad | 0 | 1 | 9.55 | 8.71 | 1.00 | 4.00 | 5.00 | 24.00 | 24.00 | ▇▂▁▁▃ |

tax | 0 | 1 | 408.24 | 168.54 | 187.00 | 279.00 | 330.00 | 666.00 | 711.00 | ▇▇▃▁▇ |

ptratio | 0 | 1 | 18.46 | 2.16 | 12.60 | 17.40 | 19.05 | 20.20 | 22.00 | ▁▃▅▅▇ |

b | 0 | 1 | 356.67 | 91.29 | 0.32 | 375.38 | 391.44 | 396.22 | 396.90 | ▁▁▁▁▇ |

lstat | 0 | 1 | 12.65 | 7.14 | 1.73 | 6.95 | 11.36 | 16.96 | 37.97 | ▇▇▅▂▁ |

Looks like there is are two factor variables. We’ll need to convert them into numeric variables before we proceed. I’ll use the `recipes`

package to one-hot encode them.

# Predicting median house prices rec <- recipe(cmedv ~ ., data = BostonHousing2) %>% # Collapse categories where population is < 3% step_other(town, chas, threshold = .03, other = "Other") %>% # Create dummy variables for all factor variables step_dummy(all_nominal_predictors()) # Train the recipe on the data set prep <- prep(rec, training = BostonHousing2) # Create the final model matrix model_df <- bake(prep, new_data = BostonHousing2) # All levels have been one hot encoded and separate columns have been appended to the model matrix colnames(model_df) ## [1] "tract" "lon" "lat" ## [4] "medv" "crim" "zn" ## [7] "indus" "nox" "rm" ## [10] "age" "dis" "rad" ## [13] "tax" "ptratio" "b" ## [16] "lstat" "cmedv" "town_Boston.Savin.Hill" ## [19] "town_Cambridge" "town_Lynn" "town_Newton" ## [22] "town_Other" "chas_X1"

Next, we can use the `resample`

package to create test/train splits.

splits <- rsample::initial_split(model_df, prop = 0.7) # Training set train_df <- rsample::training(splits) # Test set test_df <- rsample::testing(splits) dim(train_df) ## [1] 354 23 dim(test_df) ## [1] 152 23

## Finding optimal parameters

Now we can start to run some optimisations using the `ParBayesianOptimization`

package.

# The xgboost interface accepts matrices X <- train_df %>% # Remove the target variable select(!medv, !cmedv) %>% as.matrix() # Get the target variable y <- train_df %>% pull(cmedv) # Cross validation folds folds <- list(fold1 = as.integer(seq(1, nrow(X), by = 5)), fold2 = as.integer(seq(2, nrow(X), by = 5)))

We’ll need an objective function which can be fed to the optimiser. We’ll use the value of the evaluation metric from `xgb.cv()`

as the value that needs to be optimised.

# Function must take the hyper-parameters as inputs obj_func <- function(eta, max_depth, min_child_weight, subsample, lambda, alpha) { param <- list( # Hyter parameters eta = eta, max_depth = max_depth, min_child_weight = min_child_weight, subsample = subsample, lambda = lambda, alpha = alpha, # Tree model booster = "gbtree", # Regression problem objective = "reg:squarederror", # Use the Mean Absolute Percentage Error eval_metric = "mape") xgbcv <- xgb.cv(params = param, data = X, label = y, nround = 50, folds = folds, prediction = TRUE, early_stopping_rounds = 5, verbose = 0, maximize = F) lst <- list( # First argument must be named as "Score" # Function finds maxima so inverting the output Score = -min(xgbcv$evaluation_log$test_mape_mean), # Get number of trees for the best performing model nrounds = xgbcv$best_iteration ) return(lst) }

Once we have the objective function, we’ll need to define some bounds for the optimiser to search within.

bounds <- list(eta = c(0.001, 0.2), max_depth = c(1L, 10L), min_child_weight = c(1, 50), subsample = c(0.1, 1), lambda = c(1, 10), alpha = c(1, 10))

We can now run the optimiser to find a set of optimal hyper-parameters.

set.seed(1234) bayes_out <- bayesOpt(FUN = obj_func, bounds = bounds, initPoints = length(bounds) + 2, iters.n = 3) # Show relevant columns from the summary object bayes_out$scoreSummary[1:5, c(3:8, 13)] ## eta max_depth min_child_weight subsample lambda alpha Score ## 1: 0.13392137 8 4.913332 0.2105925 4.721124 3.887629 -0.0902970 ## 2: 0.19400811 2 25.454160 0.9594105 9.329695 3.173695 -0.1402720 ## 3: 0.16079775 2 14.035652 0.5118349 1.229953 5.093530 -0.1475580 ## 4: 0.08957707 4 12.534842 0.3844404 4.358837 1.788342 -0.1410245 ## 5: 0.02876388 4 36.586761 0.8107181 6.137100 6.039125 -0.3061535 # Get best parameters data.frame(getBestPars(bayes_out)) ## eta max_depth min_child_weight subsample lambda alpha ## 1 0.1905414 8 1.541476 0.8729207 1 1

## Fitting the model

We can now fit a model and check how well these parameters work.

# Combine best params with base params opt_params <- append(list(booster = "gbtree", objective = "reg:squarederror", eval_metric = "mae"), getBestPars(bayes_out)) # Run cross validation xgbcv <- xgb.cv(params = opt_params, data = X, label = y, nround = 100, folds = folds, prediction = TRUE, early_stopping_rounds = 5, verbose = 0, maximize = F) # Get optimal number of rounds nrounds = xgbcv$best_iteration # Fit a xgb model mdl <- xgboost(data = X, label = y, params = opt_params, maximize = F, early_stopping_rounds = 5, nrounds = nrounds, verbose = 0) # Evaluate performance actuals <- test_df$cmedv predicted <- test_df %>% select_at(mdl$feature_names) %>% as.matrix %>% predict(mdl, newdata = .) # Compute MAPE mean(abs(actuals - predicted)/actuals) ## [1] 0.008391282

## Compare with grid search

grd <- expand.grid( eta = seq(0.001, 0.2, length.out = 5), max_depth = seq(2L, 10L, by = 1), min_child_weight = seq(1, 25, length.out = 3), subsample = c(0.25, 0.5, 0.75, 1), lambda = c(1, 5, 10), alpha = c(1, 5, 10)) dim(grd) grd_out <- apply(grd, 1, function(par){ par <- append(par, list(booster = "gbtree",objective = "reg:squarederror",eval_metric = "mae")) mdl <- xgboost(data = X, label = y, params = par, nrounds = 50, early_stopping_rounds = 5, maximize = F, verbose = 0) lst <- data.frame(par, score = mdl$best_score) return(lst) }) grd_out <- do.call(rbind, grd_out) best_par <- grd_out %>% data.frame() %>% arrange(score) %>% .[1,] # Fit final model params <- as.list(best_par[-length(best_par)]) xgbcv <- xgb.cv(params = params, data = X, label = y, nround = 100, folds = folds, prediction = TRUE, early_stopping_rounds = 5, verbose = 0, maximize = F) nrounds = xgbcv$best_iteration mdl <- xgboost(data = X, label = y, params = params, maximize = F, early_stopping_rounds = 5, nrounds = nrounds, verbose = 0) # Evaluate on test set act <- test_df$medv pred <- test_df %>% select_at(mdl$feature_names) %>% as.matrix %>% predict(mdl, newdata = .) mean(abs(act - pred)/act) ## [1] 0.009407723

While both the methods offer similar final results, the bayesian optimiser completed its search in less than a minute where as the grid search took over seven minutes. Also, I find that I can use bayesian optimisation to search a larger parameter space more quickly than a traditional grid search.

*Thoughts? Comments? Helpful? Not helpful? Like to see anything else added in here? Let me know!*

**leave a comment**for the author, please follow the link and comment on their blog:

**R'tichoke**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.