Machine Learning with R: A Complete Guide to Gradient Boosting and XGBoost

Gradient Boosting with R

Gradient boosting is one of the most effective techniques for building machine learning models. It is based on the idea of improving weak learners (learners with insufficient predictive power).

Do you want to learn more about machine learning with R? Check our complete guide to decision trees.

Navigate to a section:

  • Introduction to Gradient Boosting
  • Introduction to XGBoost
  • Dataset Loading and Preparation
  • Modeling
  • Predictions and Evaluations
  • Conclusion

Introduction to Gradient Boosting

The general idea behind gradient boosting is to combine weak learners to produce a more accurate model. These “weak learners” are typically shallow decision trees, and gradient boosting combines many of them, with each new tree correcting the errors made by the ensemble so far.

The term “boosting” was first applied successfully in AdaBoost (Adaptive Boosting). This algorithm combines multiple single-split decision trees (decision stumps). AdaBoost puts more emphasis on observations that are difficult to classify by adding new weak learners where needed.

In a nutshell, gradient boosting comprises only three elements:

  • Weak Learners – simple decision trees constructed based on purity scores (e.g., Gini impurity).
  • Loss Function – a differentiable function you want to minimize. In regression, this could be the mean squared error; in classification, it could be the log loss.
  • Additive Model – new trees are added where needed, and a functional gradient descent procedure is used to minimize the loss when adding them (see the toy sketch after this list).
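
To make the additive idea concrete, here is a toy sketch (not XGBoost itself) that boosts shallow rpart trees on simulated regression data. With squared-error loss, the negative gradient is simply the current residual, so each new tree is fit to the residuals and its shrunken predictions are added to the ensemble:

```r
library(rpart)

set.seed(42)
df <- data.frame(x = runif(200, 0, 10))
df$y <- sin(df$x) + rnorm(200, sd = 0.3)

learning_rate <- 0.1
pred <- rep(mean(df$y), nrow(df))  # start from a constant prediction

for (i in 1:50) {
  df$residual <- df$y - pred                                 # negative gradient of squared error
  weak_tree <- rpart(residual ~ x, data = df, maxdepth = 2)  # shallow weak learner
  pred <- pred + learning_rate * predict(weak_tree, df)      # additive update
}

mean((df$y - pred)^2)  # training MSE shrinks as more trees are added
```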

You now know the basics of gradient boosting. The following section will introduce the most popular gradient boosting algorithm – XGBoost.

Introduction to XGBoost

XGBoost stands for eXtreme Gradient Boosting and is the algorithm behind many winning Kaggle competition entries. It was specifically designed to deliver state-of-the-art results fast.

XGBoost is a go-to algorithm for both regression and classification. As the name suggests, it uses the gradient boosting technique to achieve enviable results – adding more and more weak learners until no further improvement can be made.

Today you’ll learn how to use the XGBoost algorithm with R by modeling one of the most trivial datasets – the Iris dataset – starting from the next section.

Dataset Loading and Preparation

As mentioned earlier, the Iris dataset will be used to demonstrate how the XGBoost algorithm works. Let’s start simple with a necessary first step – library and dataset imports. You’ll need only a few, and the dataset is built into R:
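
A minimal sketch of the imports, assuming the xgboost package for modeling and caTools for the train/test split used in the next step (any splitting approach would work just as well):

```r
# Assumed packages: xgboost for the model, caTools for the train/test split
library(xgboost)
library(caTools)

# The Iris dataset ships with base R
data(iris)
head(iris)
```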

Here’s what the first couple of rows look like:

Image 1 – The first six rows of the Iris dataset

There’s no point in exploring the dataset further, as anyone working with data already knows it inside out.

The next step is splitting the dataset into training and testing subsets. The following code snippet splits the dataset in a 70:30 ratio and then separates each subset into features (X) and the target (y). This step is necessary for the training process:
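
A sketch of the split, again assuming caTools::sample.split; the seed is arbitrary, and column 5 holds the Species target:

```r
# Reproducible 70:30 split, stratified by species
set.seed(42)
sample_split <- sample.split(Y = iris$Species, SplitRatio = 0.7)
train_set <- subset(iris, sample_split == TRUE)
test_set <- subset(iris, sample_split == FALSE)

# Separate features (X) from the target (y) for both subsets
X_train <- data.matrix(train_set[, -5])
y_train <- train_set[, 5]
X_test <- data.matrix(test_set[, -5])
y_test <- test_set[, 5]
```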

You now have everything needed to start with the training process. Let’s do that in the next section.

Modeling

XGBoost uses something known as a DMatrix to store data. A DMatrix is simply a data structure that stores data in a way optimized for both memory efficiency and training speed.

Besides the DMatrix, you’ll also have to specify the parameters for the XGBoost model. You can learn more about all the available parameters here, but we’ll stick to a subset of the most basic ones.

The following snippet shows you how to construct DMatrix data structures for both training and testing subsets and how to build a list of parameters:
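
A sketch of the DMatrix construction and a basic parameter list; the hyperparameter values below are illustrative rather than tuned:

```r
# xgboost expects zero-based numeric class labels, so shift the factor codes
xgb_train <- xgb.DMatrix(data = X_train, label = as.integer(y_train) - 1)
xgb_test <- xgb.DMatrix(data = X_test, label = as.integer(y_test) - 1)

# Basic parameters for multi-class classification
xgb_params <- list(
  booster = "gbtree",
  eta = 0.01,
  max_depth = 8,
  gamma = 4,
  subsample = 0.75,
  colsample_bytree = 1,
  objective = "multi:softprob",
  eval_metric = "mlogloss",
  num_class = length(levels(iris$Species))
)
```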

Now you have everything needed to build a model. Here’s how:
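
A sketch of the training call, using the DMatrix and parameter list from above; the number of boosting rounds is illustrative:

```r
# Train the model for a fixed number of boosting rounds
xgb_model <- xgb.train(
  params = xgb_params,
  data = xgb_train,
  nrounds = 5000,
  verbose = 1
)
xgb_model
```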

The results of calling xgb_model are displayed below:

Image 2 – XGBoost model after training

And that’s all you have to do to train your first gradient boosting model! You’ll learn how to evaluate it in the next section.

Predictions and Evaluations

You can use the predict() function to make predictions with the XGBoost model, just as with any other model. The next step is to convert the predictions to a data frame and assign column names, as the predictions are returned in the form of probabilities:
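
A sketch of the prediction step, assuming the multi:softprob objective set earlier so that predict() returns one probability per class:

```r
# Predict class probabilities on the test set and tidy them up
xgb_preds <- predict(xgb_model, X_test, reshape = TRUE)
xgb_preds <- as.data.frame(xgb_preds)
colnames(xgb_preds) <- levels(iris$Species)
head(xgb_preds)
```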

Here’s what the above code snippet produces:

Image 3 – Prediction probabilities for every flower species

As you would imagine, these probabilities add up to 1 for a single row. The column with the highest probability is the flower species predicted by the model.

Still, it would be nice to have two additional columns. The first represents the predicted class (the class with the highest predicted probability). The other represents the actual class, so we can estimate how well the model performs on previously unseen data.

The following snippet does just that:
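
A sketch of adding both columns, assuming the xgb_preds data frame and the y_test vector from the previous steps:

```r
# Add the predicted class (column with the highest probability) and the actual class
xgb_preds$PredictedClass <- levels(iris$Species)[apply(xgb_preds, 1, which.max)]
xgb_preds$ActualClass <- y_test
head(xgb_preds)
```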

The results are displayed in the following figure:

Image 4 – Predicted class vs. actual class on the test set

Things look promising, to say the least, but that’s no reason to jump to conclusions. Next, we can calculate the overall accuracy score as the number of instances where the predicted and actual classes match, divided by the total number of rows:
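
A sketch of the accuracy calculation, assuming the PredictedClass and ActualClass columns added above:

```r
# Accuracy: share of test rows where the predicted class matches the actual one
accuracy <- sum(xgb_preds$PredictedClass == xgb_preds$ActualClass) / nrow(xgb_preds)
accuracy
```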

Executing the above code prints out 0.9333 to the console, indicating the model is roughly 93% accurate on previously unseen data.

While we’re here, we can also print the confusion matrix to see exactly what the model misclassified:
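
A sketch using a simple base-R cross-tabulation of predicted vs. actual classes:

```r
# Cross-tabulate predicted vs. actual classes
confusion_matrix <- table(
  Predicted = xgb_preds$PredictedClass,
  Actual = xgb_preds$ActualClass
)
confusion_matrix
```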

The results are shown below:

Image 5 – Confusion matrix for XGBoost model on the test set

As you can see, only three virginica flowers were misclassified as versicolor, and there were no misclassifications among the setosa flowers.

And that’s how you can train and evaluate XGBoost models with R. Let’s wrap things up in the next section.

Conclusion

XGBoost is a complex, state-of-the-art algorithm for both classification and regression – thankfully, with a simple R API. Entire books have been written on this single algorithm, so cramming everything into a single article isn’t possible.

You’ve still learned a lot – from the basic theory and intuition to implementation and evaluation in R. If you want to learn more, please stay tuned to the Appsilon blog. More guides on the topic are expected in the following weeks.

If you want to implement machine learning in your organization, you can always reach out to Appsilon for help.

Learn More

Appsilon is hiring for remote roles! See our Careers page for all open positions, including R Shiny Developers, Fullstack Engineers, Frontend Engineers, a Senior Infrastructure Engineer, and a Community Manager. Join Appsilon and work on groundbreaking projects with the world’s most influential Fortune 500 companies.
