Filter

< section id="goal" class="level1">

Goal

Learn how to rank the features of a supervised task by their importance, i.e., the strength of their relationship with the target variable, using a feature filter method.

< section id="german-credit-dataset" class="level1">

German Credit Dataset

We create the same task as in the resampling exercise: the German Credit data set.

library("mlr3verse")
library("data.table")
task = tsk("german_credit")
< section id="exercises" class="level1">

Exercises

Within the mlr3 ecosystem, feature filters are implemented in the mlr3filters package and are typically used in combination with mlr3pipelines so that the whole preprocessing step can be included in a pipeline. In Exercises 1 to 3, we apply feature filtering to preprocess the data of a task without using a pipeline. In Exercise 4, we set up a pipeline that combines a learner with feature filtering as a preprocessing step.

< section id="exercise-1-find-a-suitable-feature-filter" class="level2">

Exercise 1: Find a suitable Feature Filter


Make yourself familiar with the mlr3filters package (https://mlr3filters.mlr-org.com). Which Filters are applicable to all feature types of the task we created above?

Hint:

Some filters are only applicable to classification or regression tasks, or only to numeric or categorical features. Therefore, we are looking for a Filter that is applicable to our classification task and that can be computed for integer and factor features (as these feature types are present in the task; see task$feature_types).

The website linked above includes a table that provides detailed information for each Filter.
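
As a starting point, the filter dictionary can also be inspected directly from R. A minimal sketch (the table on the website contains the same information):

library(mlr3filters)
# Overview of all filters, including supported task and feature types
as.data.table(mlr_filters)
# Feature types present in our task
task$feature_types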

< section id="exercise-2-information-gain-filter" class="level2">

Exercise 2: Information Gain Filter

We now want to use the information_gain filter, which requires the FSelectorRcpp package to be installed. This filter quantifies the gain in information by considering the following difference: H(Target) + H(Feature) - H(Target, Feature). Here, H(X) is the Shannon entropy of variable X and H(X, Y) is the joint Shannon entropy of the variables X and Y.
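
To make the formula concrete, here is a small illustrative sketch in base R (the helpers shannon_entropy and info_gain are hypothetical, not part of mlr3filters, and FSelectorRcpp may differ in details such as the logarithm base):

# Hypothetical helpers illustrating the information gain formula
shannon_entropy = function(x) {
  p = prop.table(table(x))
  p = p[p > 0] # drop empty levels to avoid 0 * log(0)
  -sum(p * log2(p))
}
info_gain = function(feature, target) {
  # H(Target) + H(Feature) - H(Target, Feature)
  shannon_entropy(target) + shannon_entropy(feature) -
    shannon_entropy(interaction(feature, target, drop = TRUE))
}
dat = task$data()
info_gain(dat$housing, dat$credit_risk)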

Create an information gain filter and compute the information gain for each feature.

Visualize the score for each feature and decide how many and which features to include.

Hint 1:

Use flt("information_gain") to create an information_gain filter and calculate the filter scores of the features. See ?mlr_filters_information_gain (or equivalently flt("information_gain")$help()) for more details on how to use a filter. If it does not work, you can use e.g. flt("importance", learner = lrn("classif.rpart")) which uses the feature importance of a classif.rpart decision tree to rank the features for the feature filter.

For visualization, you can, for example, create a scree plot (similar to the one in principal component analysis) that plots the filter score of each feature on the y-axis and the features on the x-axis.

Using a rule of thumb, e.g., the "elbow rule", you can determine the number of features to include.

Hint 2:
library(mlr3filters)
library(mlr3viz)
library(FSelectorRcpp)
filter = flt(...)
filter$calculate(task)
autoplot(...)
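
For orientation, a sketch of how this template might be completed (one possibility, not necessarily the intended solution):

filter = flt("information_gain")
filter$calculate(task)
as.data.table(filter) # filter scores, sorted in decreasing order
autoplot(filter)      # bar plot of the scores, usable as a scree plot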
< section id="exercise-3-create-and-apply-a-pipeopfilter-to-a-task" class="level2">

Exercise 3: Create and Apply a PipeOpFilter to a Task

Since the k-NN learner suffers from the curse of dimensionality, we want to set up a preprocessing PipeOp to subset our set of features to the 5 most important ones according to the information gain filter (see flt("information_gain")$help()). In general, you can see a list of other possible filters by looking at the dictionary as.data.table(mlr_filters). You can construct a PipeOp object with the po() function from the mlr3pipelines package. See mlr_pipeops$keys() for possible choices. Create a PipeOp that filters the features of the german_credit task and creates a new task containing only the 5 most important ones according to the information gain filter.

Hint:
library(mlr3pipelines)
# Set the filter.nfeat parameter directly when constructing the PipeOp:
pofilter = po("...",
  filter = flt(...),
   ... = list(filter.nfeat = ...))

# Alternative (first create the filter PipeOp and then set the parameter):
pofilter = po("...", filter = flt(...))
pofilter$...$filter.nfeat = ...

# Train the PipeOpFilter on the task
filtered_task = pofilter$train(input = list(...))[[1]]
filtered_task
task
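
A sketch of one possible completion (again, not necessarily the intended solution):

library(mlr3pipelines)
pofilter = po("filter",
  filter = flt("information_gain"),
  param_vals = list(filter.nfeat = 5))
filtered_task = pofilter$train(input = list(task))[[1]]
filtered_task$feature_names # the 5 selected features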
< section id="exercise-4-combine-pipeopfilter-with-a-learner" class="level2">

Exercise 4: Combine PipeOpFilter with a Learner

Do the following tasks:

  1. Combine the PipeOpFilter from the previous exercise with a k-NN learner to create a so-called Graph (it can contain multiple preprocessing steps) using the %>>% operator.
  2. Convert the Graph to a GraphLearner so that it behaves like a new learner that first filters the features and then trains a model on the filtered data. Run the resample() function to estimate the performance of the GraphLearner with a 5-fold cross-validation.
  3. Change the value of the filter.nfeat parameter (which was set to 5 in the previous exercise) and run resample() again.
Hint:
library(mlr3learners)
graph = ... %>>% lrn("...")
glrn = as_learner(...)
rr = resample(task = ..., learner = ..., resampling = ...)
rr$aggregate()

# Change `filter.nfeat` and run resampling again using the same train-test splits
...
rr2 = resample(task = ..., learner = ..., resampling = rr$resampling)
rr2$aggregate()
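
A sketch of one possible completion (the learner key classif.kknn and the prefixed parameter id information_gain.filter.nfeat are assumptions based on mlr3 defaults, not confirmed by the exercise):

library(mlr3learners)
graph = pofilter %>>% lrn("classif.kknn")
glrn = as_learner(graph)
rr = resample(task = task, learner = glrn, resampling = rsmp("cv", folds = 5))
rr$aggregate()

# Inside the GraphLearner, the filter parameter is prefixed with the PipeOp id
# (assumed to default to the filter id, "information_gain"):
glrn$param_set$values$information_gain.filter.nfeat = 10
rr2 = resample(task = task, learner = glrn, resampling = rr$resampling)
rr2$aggregate()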
< section id="summary" class="level1">

Summary

We learned how to use feature filters to rank the features of a supervised task by their relationship with the target variable and how to subset the task accordingly.

Ideally, feature filtering is directly incorporated into the learning procedure by making use of a pipeline so that performance estimation after feature filtering is not biased.
