
Being successful on Kaggle using `mlr`


Achieving a good score on a Kaggle competition is typically quite difficult. This blog post outlines 7 tips for beginners to improve their ranking on the Kaggle leaderboards. For this purpose, I also created a Kernel for the Kaggle bike sharing competition that shows how the R package mlr can be used to tune an xgboost model with random search in parallel (using 16 cores). The R script scores rank 90 (of 3251) on the Kaggle leaderboard.

7 Rules

  1. Use good software
  2. Understand the objective
  3. Create and select features
  4. Tune your model
  5. Validate your model
  6. Ensemble different models
  7. Track your progress

1. Use good software

Whether you choose R, Python or another language to work on Kaggle, you will most likely need to leverage quite a few packages to follow best practices in machine learning. To save time, you should use ‘software’ that offers a standardized and well-tested interface for the important steps in your workflow, such as resampling, preprocessing, feature selection, hyperparameter tuning, model training, prediction and evaluation.

The mlr package is one example of such ‘software’ for R: it wraps a large number of learners behind a single interface and implements the steps above and more.
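Below is a minimal sketch of this kind of standardized interface in mlr, using the built-in BostonHousing data (from the mlbench package) as a stand-in for a Kaggle training set:

```r
library(mlr)
data(BostonHousing, package = "mlbench")

# Define the task (data + target) and a learner; the same calls work for
# classification tasks and for any of mlr's other learners.
task = makeRegrTask(data = BostonHousing, target = "medv")
lrn  = makeLearner("regr.rpart")

# Estimate performance with 5-fold cross-validation and RMSE
res = resample(lrn, task, resampling = cv5, measures = rmse)
res$aggr
```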

2. Understand the objective

To develop a good understanding of the Kaggle challenge, you should read the competition description carefully and, in particular, find out exactly how submissions are evaluated.

Make sure you choose an approach that directly optimizes the measure of interest! Example:
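The bike sharing competition is scored with the root mean squared logarithmic error (RMSLE), so a model selected or tuned against the default mean squared error may not give the best submission. A small sketch (not taken from the kernel, reusing the BostonHousing stand-in from above) of evaluating against mlr's built-in rmsle measure:

```r
library(mlr)
data(BostonHousing, package = "mlbench")
task = makeRegrTask(data = BostonHousing, target = "medv")

listMeasures(task)   # all performance measures applicable to this task

# Evaluate against the competition metric (rmsle) instead of the default mse;
# custom metrics can be defined with makeMeasure() if nothing built-in fits.
res = resample(makeLearner("regr.rpart"), task, resampling = cv5,
               measures = list(rmsle, rmse))
res$aggr
```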

3. Create and select features

In many Kaggle competitions, finding a “magic feature” can dramatically increase your ranking. Sometimes, better data beats better algorithms! You should therefore try to introduce new features containing valuable information that the model cannot extract on its own, or remove noisy features that can decrease model performance. A small sketch of both ideas is shown below.
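For the bike sharing data, one obvious source of new features is the raw timestamp. The sketch below derives hour and weekday features and then ranks features with one of mlr's built-in filters; the column names (`datetime`, `count`, `casual`, `registered`) are those of the Kaggle bike sharing data, and the filter choice and cutoff are illustrative:

```r
library(mlr)

train = read.csv("train.csv", stringsAsFactors = FALSE)  # Kaggle bike sharing training data

# Derive new features from the raw timestamp
train$hour    = as.integer(format(as.POSIXct(train$datetime), "%H"))
train$weekday = as.integer(format(as.POSIXct(train$datetime), "%u"))
train$datetime = NULL

# casual + registered sum to the target and are not available at prediction time
train$casual = NULL
train$registered = NULL

task = makeRegrTask(data = train, target = "count")

# Rank features with a simple filter and keep, e.g., the top 80%
fv = generateFilterValuesData(task, method = "variance")
plotFilterValues(fv)
task.filtered = filterFeatures(task, fval = fv, perc = 0.8)
```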

4. Tune your model

Typically you can focus on a single model (e.g. xgboost) and tune its hyperparameters for optimal performance.
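A condensed sketch of what this can look like in mlr: random search over a few xgboost hyperparameters, evaluated in parallel with parallelMap. The parameter ranges, budget and core count are illustrative and not taken from the kernel; BostonHousing again stands in for the competition data:

```r
library(mlr)
library(parallelMap)

data(BostonHousing, package = "mlbench")
BostonHousing$chas = as.numeric(BostonHousing$chas)  # xgboost needs numeric features
task = makeRegrTask(data = BostonHousing, target = "medv")
lrn  = makeLearner("regr.xgboost", nrounds = 300)

# Search space for the random search
ps = makeParamSet(
  makeNumericParam("eta",              lower = 0.01, upper = 0.3),
  makeIntegerParam("max_depth",        lower = 3,    upper = 10),
  makeNumericParam("subsample",        lower = 0.5,  upper = 1),
  makeNumericParam("colsample_bytree", lower = 0.5,  upper = 1)
)
ctrl = makeTuneControlRandom(maxit = 100)   # evaluate 100 random configurations

parallelStartSocket(16)                     # spread the evaluations over 16 cores
tune.res = tuneParams(lrn, task, resampling = cv5, measures = rmse,
                      par.set = ps, control = ctrl)
parallelStop()

# Refit the learner with the best configuration on the full training data
lrn.tuned = setHyperPars(lrn, par.vals = tune.res$x)
model = train(lrn.tuned, task)
```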

5. Validate your model

Good machine learning models not only work on the data they were trained on, but also on unseen (test) data that was not used for training the model. When you use training data to make any kind of decision (like feature or model selection, hyperparameter tuning, …), the data becomes less valuable for generalization to unseen data. So if you just use the public leaderboard for testing, you might overfit to the public leaderboard and lose many ranks once the private leaderboard is revealed. A better approach is to use a local validation scheme, such as cross-validation, to get an estimate of performance on unseen data:
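A small sketch of such a local validation loop in mlr (again on the BostonHousing stand-in); decisions that are themselves made on the training data, like tuning, can additionally be wrapped into the resampling with makeTuneWrapper so the estimate stays honest:

```r
library(mlr)
data(BostonHousing, package = "mlbench")
task = makeRegrTask(data = BostonHousing, target = "medv")
lrn  = makeLearner("regr.rpart")

# Repeated 5-fold cross-validation gives a more stable local estimate
rdesc = makeResampleDesc("RepCV", folds = 5, reps = 3)
res = resample(lrn, task, resampling = rdesc, measures = rmse)

# Compare this number across prototypes instead of chasing the public leaderboard
res$aggr
```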

6. Ensemble different models

After training many different models, you might want to ensemble them into one strong model, for example by simple (weighted) averaging of their predictions, by majority voting in classification settings, or by stacking, where an additional model learns how to combine the base models’ predictions:
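A sketch of stacking with mlr's makeStackedLearner; the base learners and the "stack.cv" method are illustrative choices, again on the BostonHousing stand-in:

```r
library(mlr)
data(BostonHousing, package = "mlbench")
task = makeRegrTask(data = BostonHousing, target = "medv")

# Two simple base learners; method = "average" would just average their
# predictions, "stack.cv" trains the super learner on cross-validated
# base-learner predictions.
base = list(makeLearner("regr.rpart"), makeLearner("regr.lm"))
stack = makeStackedLearner(base.learners = base,
                           super.learner = "regr.lm",
                           method = "stack.cv")

res = resample(stack, task, resampling = cv5, measures = rmse)
res$aggr
```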

7. Track your progress

A Kaggle project might get quite messy very quickly, because you might try and prototype many different ideas. To avoid getting lost, make sure to keep track of which ideas you tried, which features and hyperparameters each prototype used, and the validation and leaderboard scores it achieved; a minimal way to do this is sketched after the next paragraph.

If you do not want to use a tool like git, at least make sure you create subfolders for each prototype. This way you can later analyse which models you might want to ensemble or use for your final submissions to the competition.
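A minimal sketch of such bookkeeping in R; the file name, the columns and the logged values are illustrative, not part of the original kernel:

```r
# Append a one-row summary of the current prototype to a running CSV log
log_run = function(id, features, params, cv.score, lb.score,
                   file = "experiments.csv") {
  entry = data.frame(
    timestamp = format(Sys.time(), "%Y-%m-%d %H:%M:%S"),
    id        = id,
    features  = paste(features, collapse = ";"),
    params    = paste(names(params), unlist(params), sep = "=", collapse = ";"),
    cv.score  = cv.score,
    lb.score  = lb.score
  )
  exists = file.exists(file)
  write.table(entry, file, sep = ",", row.names = FALSE,
              col.names = !exists, append = exists)
}

# Example: record a tuned xgboost prototype (values are placeholders)
log_run(id = "xgb_randomsearch_v1",
        features = c("hour", "weekday", "temp", "humidity"),
        params   = list(eta = 0.1, max_depth = 6),
        cv.score = 0.43, lb.score = NA)
```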
