How to train and tune machine learning algorithms in a unified way?
With the mlr R package

I am currently keen on automated machine learning, especially hyperparameter optimization, so recently I have mainly been exploring frameworks for unified model training. In this post, I will show how to train ML algorithms and tune them using random search. I am going to cover only the basics, but the mlr package has more sophisticated features; I strongly encourage you to visit the mlr webpage and explore all the tutorials.

# Data set

We will use the BreastCancer data set from the mlbench package and perform binary classification. The aim of the model is to predict whether a cancer is benign or malignant (variable Class). It is worth removing the first column, which contains the patient id, as it is redundant for modeling. To read more about the data set, see the documentation (?BreastCancer).

library("mlbench")
data("BreastCancer")
bc <- na.omit(BreastCancer[, -1])
head(bc)

##   Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size Bare.nuclei
## 1            5         1          1             1            2           1
## 2            5         4          4             5            7          10
## 3            3         1          1             1            2           2
## 4            6         8          8             1            3           4
## 5            4         1          1             3            2           1
## 6            8        10         10             8            7          10
##   Bl.cromatin Normal.nucleoli Mitoses     Class
## 1           3               1       1    benign
## 2           3               2       1    benign
## 3           3               1       1    benign
## 4           3               7       1    benign
## 5           3               1       1    benign
## 6           9               7       1 malignant
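Before modeling, it can also be useful to glance at the class balance. A quick sketch using base R (the counts match the task summary shown later in this post):

```r
library("mlbench")
data("BreastCancer")
bc <- na.omit(BreastCancer[, -1])

# Class counts and proportions: 444 benign vs 239 malignant
table(bc$Class)
prop.table(table(bc$Class))
```

With roughly a 65/35 split, the classes are imbalanced but not severely, so plain accuracy is still a reasonable first measure.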

# Installation

First of all, make sure that you have installed mlr. It is on CRAN, so you can simply use the install.packages() function.

install.packages("mlr")

After installation, load mlr and set a seed to make the results reproducible.

library(mlr)
set.seed(1)

# Modeling

## Fitting a model

First, you need to define a task. A task is the definition of a machine learning problem. Our problem is classification, therefore we use the makeClassifTask() function. For regression it would be makeRegrTask(), and for clustering makeClusterTask().

The parameter id defines the name of the task, data is the data the model will be trained on, and target indicates the target variable.

classif_task = makeClassifTask(id = "bc", data = bc, target = "Class")
classif_task
## Type: classif
## Target: Class
## Observations: 683
## Features:
##    numerics     factors     ordered functionals
##           0           4           5           0
## Missings: FALSE
## Has weights: FALSE
## Has blocking: FALSE
## Has coordinates: FALSE
## Classes: 2
##    benign malignant
##       444       239
## Positive class: benign
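Once a task exists, mlr provides accessor functions for inspecting it. A small sketch using standard mlr helpers on the task created above:

```r
library(mlr)

# Assumes classif_task was created as shown above
getTaskSize(classif_task)          # number of observations
getTaskFeatureNames(classif_task)  # names of the feature columns
getTaskTargetNames(classif_task)   # the target variable ("Class")
```

These helpers are handy when you build tasks programmatically and want to verify that the right columns ended up as features.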

The second step is defining a learner. Note that we do not train a model yet; we only create an object that describes our algorithm.

classif_lrn = makeLearner("classif.randomForest", par.vals = list(ntree = 200))

In the example above, we have created an object that defines a classification random forest with 200 trees. To get the names of the hyperparameters, their ranges, and their default values, use the getParamSet() function.

getParamSet(classif_lrn)
##                      Type  len   Def   Constr Req Tunable Trafo
## ntree             integer    -   500 1 to Inf   -    TRUE     -
## mtry              integer    -     - 1 to Inf   -    TRUE     -
## replace           logical    -  TRUE        -   -    TRUE     -
## classwt     numericvector <NA>     - 0 to Inf   -    TRUE     -
## cutoff      numericvector <NA>     -   0 to 1   -    TRUE     -
## strata            untyped    -     -        -   -   FALSE     -
## sampsize    integervector <NA>     - 1 to Inf   -    TRUE     -
## nodesize          integer    -     1 1 to Inf   -    TRUE     -
## maxnodes          integer    -     - 1 to Inf   -    TRUE     -
## importance        logical    - FALSE        -   -    TRUE     -
## localImp          logical    - FALSE        -   -    TRUE     -
## proximity         logical    - FALSE        -   -   FALSE     -
## oob.prox          logical    -     -        -   Y   FALSE     -
## norm.votes        logical    -  TRUE        -   -   FALSE     -
## do.trace          logical    - FALSE        -   -   FALSE     -
## keep.forest       logical    -  TRUE        -   -   FALSE     -
## keep.inbag        logical    - FALSE        -   -   FALSE     -

Now, we are ready to fit a model. We simply use the train() function with the specified learner and task.

model = train(classif_lrn, classif_task)
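With a fitted model, we can generate predictions and evaluate them using mlr's predict() and performance() functions. A minimal sketch, predicting on the training data purely for illustration (in practice you would predict on held-out data or use resampling):

```r
library(mlr)

# Assumes model and classif_task exist as created above
pred = predict(model, task = classif_task)
head(as.data.frame(pred))             # truth vs response per observation
performance(pred, measures = list(acc, mmce))  # accuracy and error rate
```

Note that training-set performance is optimistic for a random forest; the cross-validated accuracy reported in the tuning section below is a more honest estimate.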

## Tuning a model

To tune hyperparameters, we need to specify a search space. To define the space for integer parameters we use the makeIntegerParam() function. Everything is pinned together with the makeParamSet() function.

params = makeParamSet(
  makeIntegerParam("mtry", lower = 1L, upper = 100L),
  makeIntegerParam("ntree", lower = 1L, upper = 500L)
)
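Integer parameters are not the only option. mlr also provides constructors for numeric, logical, and discrete parameters; a hedged sketch of a mixed search space for the same random forest learner (the parameter names come from the getParamSet() output above):

```r
library(mlr)

# A mixed parameter set: integer, logical, and discrete dimensions
params_mixed = makeParamSet(
  makeIntegerParam("ntree", lower = 50L, upper = 500L),
  makeLogicalParam("replace"),
  makeDiscreteParam("nodesize", values = c(1L, 5L, 10L))
)
```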

Now, we use the makeTuneControlRandom() function to create an object that defines a random search; the parameter maxit sets the number of iterations. The makeResampleDesc() function creates an object for a resampling strategy, in this case cross-validation. Finally, we can combine all of the previous pieces with the tuneParams() function and tune the random forest.

ctrl = makeTuneControlRandom(maxit = 10L)
rdesc = makeResampleDesc("CV", iters = 3L)

res = tuneParams(classif_lrn,
                 task = classif_task,
                 resampling = rdesc,
                 par.set = params,
                 control = ctrl,
                 measures = list(acc),
                 show.info = FALSE)
res
## Tune result:
## Op. pars: mtry=49; ntree=464
## acc.test.mean=0.9707409

As a result of tuning, we have obtained the hyperparameters mtry = 49 and ntree = 464. (Note that randomForest caps mtry at the number of features, here 9, so any sampled value above 9 behaves like mtry = 9; a tighter upper bound would make the search more efficient.)
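The tuned values can be plugged back into the learner to fit a final model. A short sketch using setHyperPars() with the optimal parameters stored in res$x:

```r
library(mlr)

# Assumes res, classif_lrn, and classif_task exist as created above;
# res$x is the list of optimal hyperparameters found by tuneParams()
tuned_lrn = setHyperPars(classif_lrn, par.vals = res$x)
tuned_model = train(tuned_lrn, classif_task)
```

From here, tuned_model can be used with predict() just like the untuned model fitted earlier.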