
Feature Selection

[This article was first published on mlr-org, and kindly contributed to R-bloggers.]

Goal

After this exercise, you should understand and be able to perform feature selection using wrapper methods with mlr3fselect. You should also be able to optimize multiple performance measures at once and estimate the generalization error with nested resampling.


Wrapper Methods

In addition to filtering, wrapper methods are another way of selecting features. While filtering ranks features by criteria computed from their values, wrapper methods fit the learner on different subsets of the feature set and compare the resulting performance. Because a model must be refitted for every candidate subset, this approach is computationally expensive.

For wrapper methods, we need the package mlr3fselect, at whose heart the following R6 classes are: FSelectInstanceSingleCrit and FSelectInstanceMultiCrit, which describe the feature selection problem, FSelector, which implements the optimization algorithm, and AutoFSelector, which wraps a learner for nested resampling. All of them appear in the exercises below.


Prerequisites

We load the most important packages and use a fixed seed for reproducibility.

library(mlr3verse)
library(data.table)
library(mlr3fselect)
set.seed(7891)

In this exercise, we will use the german_credit data and the learner classif.ranger:

task_gc = tsk("german_credit")
lrn_ranger = lrn("classif.ranger")

1 Basic Application


1.1 Create the Framework

Create an FSelectInstanceSingleCrit object using fsi(). The instance should use 3-fold cross-validation as the resampling strategy, classification accuracy as the measure, and a terminator that stops after 20 evaluations. For simplicity, only consider the features age, amount, credit_history and duration.

Hint 1:
task_gc$select(...)

instance = fsi(
  task = ...,
  learner = ...,
  resampling = ...,
  measure = ...,
  terminator = ...
)
Solution
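A possible solution sketch, using the standard mlr3 sugar functions rsmp(), msr() and trm() (the fsi() argument names follow the hint above; they may differ slightly across mlr3fselect versions):

```r
library(mlr3verse)
library(mlr3fselect)
set.seed(7891)

# Restrict the task to the four features under consideration
task_gc = tsk("german_credit")
task_gc$select(c("age", "amount", "credit_history", "duration"))

instance = fsi(
  task = task_gc,
  learner = lrn("classif.ranger"),
  resampling = rsmp("cv", folds = 3),      # 3-fold cross-validation
  measure = msr("classif.acc"),            # classification accuracy
  terminator = trm("evals", n_evals = 20)  # stop after 20 evaluations
)
```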

1.2 Start the Feature Selection

Start the feature selection by creating a sequential FSelector via fs() and passing the FSelectInstanceSingleCrit object to the $optimize() method of the initialized FSelector object.

Hint 1:
fselector = fs(...)
Hint 2:
fselector = fs(...)
fselector$optimize(...)
Solution
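A possible solution sketch, assuming the instance from the previous step exists; "sequential" is the key for sequential forward selection in mlr3fselect:

```r
# Sequential forward selection: starts from the empty set and
# greedily adds the feature that improves accuracy the most
fselector = fs("sequential")
fselector$optimize(instance)
```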

1.3 Evaluate

View the four feature indicator columns and the accuracy from the instance archive for each of the first two batches.

Hint 1:
instance$archive$data[...]
Hint 2:
instance$archive$data[batch_nr == ..., ...]
Solution
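A possible solution sketch, following the hint's instance$archive$data syntax (newer mlr3fselect versions may prefer as.data.table(instance$archive)); the archive is a data.table with one logical column per feature:

```r
# Feature indicator columns plus accuracy for batches 1 and 2
instance$archive$data[
  batch_nr %in% c(1, 2),
  list(age, amount, credit_history, duration, classif.acc, batch_nr)
]
```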

1.4 Model Training

Which feature(s) should be selected? Train the model.

Hint 1:

Compare the accuracy values for the different feature combinations and select the feature(s) accordingly.

Hint 2:
task_gc = ...
task_gc$select(...)
lrn_ranger$train(...)
Solution
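A possible solution sketch: the optimized instance stores the best-performing feature subset in $result_feature_set, so we restrict a fresh task to it and train the learner (which subset wins depends on the seed and resampling splits):

```r
# Restrict the task to the best subset found during optimization
task_gc = tsk("german_credit")
task_gc$select(instance$result_feature_set)

# Train the random forest on the selected features
lrn_ranger$train(task_gc)
```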

2 Multiple Performance Measures

To optimize multiple performance measures, follow the same steps as above, except that multiple measures are passed. Create an instance object as above, this time considering the measures classif.tpr and classif.tnr. In the second step, use random search; in the third, take a look at the results.

We again use the german_credit data:

task_gc = tsk("german_credit")
Hint 1:
instance = fsi(...)
fselector = fs(...)
fselector$...(...)
features = unlist(lapply(...))
cbind(features,...)
Solution
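A possible solution sketch: passing a list of measures to fsi() yields a multi-criteria instance, whose $result holds the Pareto-optimal feature sets (the measures argument is plural here; exact naming may vary with the mlr3fselect version):

```r
task_gc = tsk("german_credit")

instance = fsi(
  task = task_gc,
  learner = lrn("classif.ranger"),
  resampling = rsmp("cv", folds = 3),
  measures = msrs(c("classif.tpr", "classif.tnr")),  # two measures
  terminator = trm("evals", n_evals = 20)
)

fselector = fs("random_search")
fselector$optimize(instance)

# One row per Pareto-optimal feature set: no set dominates another
# on both true positive rate and true negative rate
features = unlist(lapply(instance$result$features, paste, collapse = ", "))
cbind(features, instance$result[, list(classif.tpr, classif.tnr)])
```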

3 Nested Resampling

Nested resampling enables unbiased performance estimates for the selected feature sets. In mlr3 this is possible with the class AutoFSelector, an instance of which can be created with the function auto_fselector().


3.1 Create an AutoFSelector Instance

Implement an AutoFSelector object that uses random search to find a feature selection that gives the highest accuracy for a logistic regression with holdout resampling. It should terminate after 10 evaluations.

Hint 1:
afs = auto_fselector(
  fselector = ...,
  learner = ...,
  resampling = ...,
  measure = ...,
  terminator = ...
)
Solution
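A possible solution sketch, assuming classif.log_reg for the logistic regression; the holdout resampling here is the inner loop used during feature selection:

```r
afs = auto_fselector(
  fselector = fs("random_search"),
  learner = lrn("classif.log_reg"),
  resampling = rsmp("holdout"),            # inner resampling
  measure = msr("classif.acc"),
  terminator = trm("evals", n_evals = 10)  # stop after 10 evaluations
)
```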

3.2 Benchmark

Compare the AutoFSelector with a plain logistic regression using 3-fold cross-validation.

Hint 1:

The AutoFSelector inherits from the Learner base class, which is why it can be used like any other learner.

Hint 2:

Implement a benchmark grid and aggregate the result.

Solution
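A possible solution sketch: since AutoFSelector inherits from Learner, it can be benchmarked like any other learner; the 3-fold cross-validation here is the outer loop of the nested resampling:

```r
design = benchmark_grid(
  tasks = tsk("german_credit"),
  learners = list(afs, lrn("classif.log_reg")),
  resamplings = rsmp("cv", folds = 3)  # outer resampling
)

bmr = benchmark(design)
bmr$aggregate(msr("classif.acc"))
```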

Summary
