Tuning xgboost in R: Part II

[This article was first published on R – insightR, and kindly contributed to R-bloggers.]

By Gabriel Vasconcelos

In a previous post I discussed some of the parameters we have to tune to estimate a boosting model using the xgboost package. In this post I will discuss the two parameters that were left out of Part I: gamma and min_child_weight. These two parameters are much less intuitive, but they can significantly change the results. Unfortunately, the best way to set them changes from dataset to dataset, and we have to test a few values to select the best model. Note that there are many other parameters in the xgboost package; I am only showing the ones I use most.


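  • min_child_weight: This parameter puts a lower bound on the sum of instance weights (the sum of the hessians) in a terminal node. For squared-error regression each hessian equals 1, so the bound is simply the number of instances in the node. Bigger values require more instances in terminal nodes, which makes each tree in the boosting smaller and the algorithm more conservative.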

  • gamma: Controls the minimum reduction in the loss function required to grow a new node in a tree. This parameter is sensitive to the scale of the loss function, which is linked to the scale of your response variable. The xgboost documentation defines general loss functions as L(\theta) = \sum_i \ell(y_i, \hat{y}_i). In the quadratic loss, for example, we have \ell(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2. Because the loss is a sum over observations rather than an average, it equals N times the MSE, where N is the sample size. If we want a split to reduce the MSE by at least m, gamma must therefore be on the order of m*N.
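
To make the two definitions concrete, here is a minimal base R sketch of how a single candidate split would be accepted or rejected under squared-error loss. It follows the gain formula from the xgboost "Introduction to Boosted Trees" tutorial; the function names `split_gain` and `allow_split` and the toy data are hypothetical, and the actual implementation may differ in constant factors and other details.

```r
# Gain of one candidate split for the loss 1/2 * (y - pred)^2, following the
# formula in the xgboost tutorial. `left` is a logical vector marking the
# observations sent to the left child. (Hypothetical helper, not xgboost API.)
# lambda = 0 switches off the L2 penalty to keep the arithmetic simple
# (xgboost's own default is lambda = 1).
split_gain = function(y, pred, left, lambda = 0) {
  g = pred - y             # gradient of the loss for each observation
  h = rep(1, length(y))    # hessian is 1 for every observation
  score = function(idx) sum(g[idx])^2 / (sum(h[idx]) + lambda)
  all_obs = rep(TRUE, length(y))
  0.5 * (score(left) + score(!left) - score(all_obs))
}

# A split is kept only if both children are heavy enough (min_child_weight)
# and the gain exceeds gamma.
allow_split = function(y, pred, left, min_child_weight = 1, gamma = 0,
                       lambda = 0) {
  child_weight = min(sum(left), sum(!left))  # sum of hessians = counts here
  child_weight >= min_child_weight &&
    split_gain(y, pred, left, lambda) > gamma
}

# Toy data: two clearly separated groups, current prediction is the mean.
y = c(1, 1, 1, 5, 5, 5)
pred = rep(3, 6)
left = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)

gain_small = split_gain(y, pred, left)            # 12
gain_large = split_gain(10 * y, 10 * pred, left)  # 1200

ok_split   = allow_split(y, pred, left, min_child_weight = 3, gamma = 10)
too_light  = allow_split(y, pred, left, min_child_weight = 4, gamma = 10)
too_costly = allow_split(y, pred, left, min_child_weight = 3, gamma = 15)
```

Multiplying the response by 10 multiplies the gain by 100, which is why the scale of the response matters so much when choosing gamma. The last two calls are rejected for different reasons: one because each child holds only 3 observations, the other because the gain of 12 does not exceed a gamma of 15.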


I will use the Housing dataset from the Ecdat package, which includes 546 observations of sold houses with their prices and characteristics. The objective is to change the two parameters while keeping everything else constant and see what happens.


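library(xgboost)
library(ggplot2)   # ggplot
library(reshape2)  # melt
library(Ecdat)     # Housing data

data = Housing
N = nrow(data)

# = divide price by 10000 to use smaller values of gamma = #
y = data$price/10000
x = data[,-1]
# = Transform categorical (yes/no) variables into dummies = #
for(i in 1:ncol(x)){
  if(is.factor(x[,i])){
    x[,i] = ifelse(x[,i] == "yes",1,0)
  }
}
x = as.matrix(x)

# = select train and test indexes = #
# NOTE: the original split is missing from this listing; the seed and the
# 80/20 split below are assumptions, so the RMSE values will differ.
set.seed(1)
train = sample(1:N, round(0.8*N))
test = setdiff(1:N, train)

# = min_child_weights candidates (values taken from the results below) = #
mcw = c(1, 10, 100, 400)
# = gamma candidates (values taken from the results below) = #
gamma = c(0.1, 1, 10, 100)

# = train and test data = #
xtrain = x[train,]
ytrain = y[train]
xtest = x[test,]
ytest = y[test]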


The code below runs the boosting for the four min_child_weight values in the mcw vector. You can see the convergence and the test RMSE for each setup below the code. The results show a significant change in the model as we vary min_child_weight. Values of 1 and 10 seem to overfit the data a little, while a value of 400 ignores a lot of information and returns a poor predictive model. The best solution in this case was min_child_weight = 100.

# = training RMSE per iteration and test predictions for each candidate = #
conv_mcw = matrix(NA,500,length(mcw))
pred_mcw = matrix(NA,length(test), length(mcw))
colnames(conv_mcw) = colnames(pred_mcw) = mcw
for(i in 1:length(mcw)){
  params = list(eta = 0.1, colsample_bylevel=2/3,
              subsample = 1, max_depth = 6,
              min_child_weight = mcw[i], gamma = 0)
  xgb = xgboost(xtrain, label = ytrain, nrounds = 500, params = params)
  conv_mcw[,i] = xgb$evaluation_log$train_rmse
  pred_mcw[,i] = predict(xgb, xtest)
}

conv_mcw = data.frame(iter=1:500, conv_mcw)
conv_mcw = melt(conv_mcw, id.vars = "iter")
ggplot(data = conv_mcw) + geom_line(aes(x = iter, y = value, color = variable))

[Figure: training RMSE by boosting iteration, one line per min_child_weight value]

(RMSE_mcw = sqrt(colMeans((ytest-pred_mcw)^2)))

##        1       10      100      400
## 1.744679 1.751487 1.690150 2.607713
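
Picking the winner from a vector of test errors like the one above takes one line. The sketch below re-enters the printed RMSE values by hand so that it runs on its own; the variable names `rmse_by_mcw`, `best_mcw` and `rel_to_best` are only illustrative.

```r
# Test RMSE values copied from the output printed above.
rmse_by_mcw = c("1" = 1.744679, "10" = 1.751487,
                "100" = 1.690150, "400" = 2.607713)

# Candidate with the lowest test RMSE.
best_mcw = as.numeric(names(which.min(rmse_by_mcw)))

# Each candidate's RMSE relative to the best one (1 = best).
rel_to_best = round(rmse_by_mcw / min(rmse_by_mcw), 3)
```

Looking at the errors relative to the best candidate makes the pattern easier to read: values of 1 and 10 are within about 4% of the best model, while 400 is roughly 54% worse.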


The results for gamma are similar to those for min_child_weight. The models in the middle (gamma = 1 and gamma = 10) are superior in terms of predictive accuracy. Unfortunately, the convergence plot does not give us any clue as to which model is the best, since it only tracks the training error. We have to test the models in a test sample or in a cross-validation scheme to select the most accurate. Both min_child_weight and gamma regularize the trees by limiting their size, but they look at different measures to do so.

# = training RMSE per iteration and test predictions for each candidate = #
conv_gamma = matrix(NA,500,length(gamma))
pred_gamma = matrix(NA,length(test), length(gamma))
colnames(conv_gamma) = colnames(pred_gamma) = gamma
for(i in 1:length(gamma)){
  params = list(eta = 0.1, colsample_bylevel=2/3,
              subsample = 1, max_depth = 6, min_child_weight = 1,
              gamma = gamma[i])
  xgb = xgboost(xtrain, label = ytrain, nrounds = 500, params = params)
  conv_gamma[,i] = xgb$evaluation_log$train_rmse
  pred_gamma[,i] = predict(xgb, xtest)
}

conv_gamma = data.frame(iter=1:500, conv_gamma)
conv_gamma = melt(conv_gamma, id.vars = "iter")
ggplot(data = conv_gamma) + geom_line(aes(x = iter, y = value, color = variable))

[Figure: training RMSE by boosting iteration, one line per gamma value]

(RMSE_gamma = sqrt(colMeans((ytest-pred_gamma)^2)))

##      0.1        1       10      100
## 1.810786 1.737987 1.739655 1.909785
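
The cross-validation route mentioned earlier could look roughly like the sketch below, which uses `xgb.cv` from the xgboost package on simulated data so that it runs without the housing dataset. The simulated data, the candidate grid, and the reduced number of rounds are all assumptions made for illustration, and argument handling differs somewhat between xgboost versions, so treat it as a starting point rather than a recipe.

```r
library(xgboost)

set.seed(123)
# Simulated regression data (hypothetical, only to keep the sketch self-contained).
n = 300
x_sim = matrix(rnorm(n * 5), n, 5)
y_sim = x_sim[, 1]^2 + 2 * x_sim[, 2] + rnorm(n)
dtrain = xgb.DMatrix(data = x_sim, label = y_sim)

mcw_grid = c(1, 10, 100)   # illustrative min_child_weight candidates
cv_rmse = numeric(length(mcw_grid))
names(cv_rmse) = mcw_grid

for (i in seq_along(mcw_grid)) {
  params = list(objective = "reg:squarederror", eta = 0.1, max_depth = 6,
                min_child_weight = mcw_grid[i], gamma = 0)
  # 5-fold cross-validation; evaluation_log holds the mean out-of-fold RMSE
  # for every boosting iteration.
  cv = xgb.cv(params = params, data = dtrain, nrounds = 100,
              nfold = 5, verbose = FALSE)
  # Best mean out-of-fold RMSE over the boosting iterations.
  cv_rmse[i] = min(cv$evaluation_log$test_rmse_mean)
}

# Candidate with the lowest cross-validated RMSE.
best_mcw_cv = mcw_grid[which.min(cv_rmse)]
```

The same loop works for gamma by holding min_child_weight fixed and iterating over gamma candidates instead. Taking the minimum over iterations also tunes the number of rounds implicitly, which makes the reported error slightly optimistic.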
