Goal
You will learn how to estimate model performance with mlr3 using resampling techniques such as 5-fold cross-validation. Additionally, you will compare a k-NN model against a logistic regression model.
German Credit Data
We work with the German credit data. You can either manually create the corresponding mlr3 task as we did before or use a pre-defined task that is already included in the mlr3 package (look at the output of as.data.table(mlr_tasks) to see which other pre-defined tasks are available to play around with).
library(mlr3verse)
Loading required package: mlr3
task = tsk("german_credit")
task
<TaskClassif:german_credit> (1000 x 21): German Credit
* Target: credit_risk
* Properties: twoclass
* Features (20):
  - fct (14): credit_history, employment_duration, foreign_worker, housing, job, other_debtors, other_installment_plans, people_liable, personal_status_sex, property, purpose, savings, status, telephone
  - int (3): age, amount, duration
  - ord (3): installment_rate, number_credits, present_residence
task$positive # (check the positive class)
[1] "good"
Exercise: Fairly evaluate the performance of two learners
We first create two mlr3 learners: a logistic regression learner and a k-NN learner. We then compare their performance via resampling.
Create the learners
Create a logistic regression learner (store it as an R object called log_reg) and a k-NN learner (store it as an R object called knn).
Show Hint 1:
Look at as.data.table(mlr_learners) to find the appropriate learner.
Show Hint 2:
Make sure to have the kknn
package installed.
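If you want to check your work, here is a minimal sketch of one possible solution. It assumes the learner keys "classif.log_reg" and "classif.kknn" from the mlr3learners package (loaded via mlr3verse):
log_reg = lrn("classif.log_reg")  # logistic regression learner
knn = lrn("classif.kknn")         # k-NN learner, requires the kknn package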
< section id="set-up-a-resampling-instance" class="level2">
Set up a resampling instance
Use mlr3 to set up a resampling instance and store it as an R object called cv5. Here, we aim for 5-fold cross-validation. A table of the resampling techniques implemented in mlr3 can be shown by looking at as.data.table(mlr_resamplings).
Show Hint 1:
Look at as.data.table(mlr_resamplings) and use the rsmp function to set up a 5-fold cross-validation instance. Store the result of the rsmp function in an R object called cv5.
Show Hint 2:
rsmp("cv")
by default sets up a 10-fold cross-validation instance. The number of folds can be set using an additional argument (see the params
column from as.data.table(mlr_resamplings)
).
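A minimal sketch of this step, following the hints above:
cv5 = rsmp("cv", folds = 5)  # 5-fold CV; rsmp("cv") alone would give 10 folds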
< section id="run-the-resampling" class="level2">
Run the resampling
After having created a resampling instance, use it to apply the chosen resampling technique to both previously created learners.
Show Hint 1:
You need to supply the task, the learner and the previously created resampling instance as arguments to the resample function. See ?resample for further details and examples.
Show Hint 2:
The key ingredients for resample()
are a task (created by tsk()
), a learner (created by lrn()
) and a resampling strategy (created by rsmp()
), e.g.,
resample(task = task, learner = log_reg, resampling = cv5)
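A sketch of one way to run this for both learners (res_log_reg and res_knn are object names chosen here, not prescribed by the exercise). Instantiating cv5 on the task first fixes the folds, so both learners are evaluated on identical splits:
cv5$instantiate(task)  # fix the folds so both learners see the same splits
res_log_reg = resample(task = task, learner = log_reg, resampling = cv5)
res_knn = resample(task = task, learner = knn, resampling = cv5)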
Evaluation
Compute the cross-validated classification accuracy of both models. Which learner performed better?
Show Hint 1:
Use msr("classif.acc") and the aggregate method of the resample result object.
Show Hint 2:
Use res_knn$aggregate(msr(...)) to obtain the classification accuracy averaged across all folds.
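A sketch of the final step, reusing the object names from the previous sketch:
res_log_reg$aggregate(msr("classif.acc"))  # mean accuracy across the 5 folds
res_knn$aggregate(msr("classif.acc"))
The learner with the higher aggregated accuracy performed better on this data.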
< section id="summary" class="level1">
Summary
We can now apply different resampling methods to estimate the performance of different learners and fairly compare them. Compared to a simple train-test split, resampling yields a better (lower-variance) estimate of model performance.