[This article was first published on Machine Learning in R, and kindly contributed to R-bloggers.]

We just released mlr v2.15.0 to CRAN. This version includes some breaking changes and the usual bug fixes from the last three months.

We made good progress on our goal of cleaning up the GitHub repo: we processed nearly all open pull requests (around 40). In the coming months we will focus on cleaning up the issue tracker, although most of our time will go into improving the successor package mlr3 and its extension packages.

Unless there are active contributions from the user side, we do not expect many feature additions in the next version(s) of mlr.

# Changes to benchmark()

The benchmark() function no longer stores the tuning results (the \$extract slot) by default. This change was made to prevent BenchmarkResult (BMR) objects from growing huge (~ GB) when multiple models are compared with extensive tuning. Unless you want to do an analysis of the tuning effects, you do not need the tuning results to compare the performance of the algorithms. Huge BMR objects can cause various troubles. One of them (the initial motivation for this change) appears when benchmarking is done on an HPC cluster with multiple workers: each worker has a limited amount of memory, and expecting a huge BMR can limit the number of workers that can be spawned. In addition, loading the large resulting BMR into the global environment (or merging it using mergeBenchmarkResults()) for post-analysis becomes a pain. To save users from all of these troubles in the first place, we decided to change the default.

To store the tuning results, you now have to actively set keep.extract = TRUE. Not storing the tuning results was already implicitly the default in resample(), since the user had to set the extract argument manually to save certain results (tuning, feature importance). This change makes the package more consistent.
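For example, a tuning result can still be stored in the BMR by opting in explicitly. A minimal sketch; the tune wrapper and grid below are illustrative, not from the release notes:

```r
library(mlr)

# Wrap a learner with a small tuning grid (illustrative setup)
ps = makeParamSet(makeDiscreteParam("cp", values = c(0.01, 0.1)))
lrn = makeTuneWrapper("classif.rpart", resampling = cv2,
  par.set = ps, control = makeTuneControlGrid())

# As of v2.15.0, tuning results are dropped by default;
# opt in via keep.extract = TRUE to fill the $extract slot
bmr = benchmark(lrn, iris.task, resamplings = cv2, keep.extract = TRUE)
getBMRTuneResults(bmr)
```

Without keep.extract = TRUE, getBMRTuneResults() would return empty results under the new default.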

# Changes to Filters

## New ensemble filters

With this release it is possible to calculate ensemble filters with mlr (Seijo-Pardo et al. 2017). “Ensemble filters” are similar to ensemble models in that multiple filters are used to generate the ranking of features. Multiple aggregation functions are supported (min(), mean(), median(), “Borda”), with “Borda” being the most commonly used in the literature at the time of writing.

To our knowledge there is currently no other package/framework in R that supports ensemble filters in a similar way to mlr. Since mlr makes it possible to use filters from a variety of different packages, the user is able to create powerful ensemble filters. Note, however, that you currently cannot tune the selection of simple filters, since tuning a character vector param is not supported by ParamHelpers. See this discussion for more information.

Here is a simple toy example of how to create ensemble filters in mlr, taken from ?filterFeatures():

library(mlr)
filterFeatures(iris.task, method = "E-min",
  base.methods = c("FSelectorRcpp_gain.ratio",
    "FSelectorRcpp_information.gain"), abs = 2)
## Type: classif
## Target: Species
## Observations: 150
## Features:
##    numerics     factors     ordered functionals
##           2           0           0           0
## Missings: FALSE
## Has weights: FALSE
## Has blocking: FALSE
## Has coordinates: FALSE
## Classes: 3
##     setosa versicolor  virginica
##         50         50         50
## Positive class: NA

## New return structure for filter values

With the added support for ensemble filters we also changed the return structure of calculated filter values.

The new structure makes it easier to apply post-analysis tasks like grouping and filtering. The “method” of each row is now grouped into one column and the filter values are stored in a separate one. We also sort the results by “value” within each “method” by default.

Below is a comparison of the old and new output:

# new
generateFilterValuesData(iris.task,
  method = c("FSelectorRcpp_gain.ratio", "FSelectorRcpp_information.gain"))
## FilterValues:
##           name    type                         method     value
## 4  Petal.Width numeric       FSelectorRcpp_gain.ratio 0.8713692
## 3 Petal.Length numeric       FSelectorRcpp_gain.ratio 0.8584937
## 1 Sepal.Length numeric       FSelectorRcpp_gain.ratio 0.4196464
## 2  Sepal.Width numeric       FSelectorRcpp_gain.ratio 0.2472972
## 8  Petal.Width numeric FSelectorRcpp_information.gain 0.9554360
## 7 Petal.Length numeric FSelectorRcpp_information.gain 0.9402853
## 5 Sepal.Length numeric FSelectorRcpp_information.gain 0.4521286
## 6  Sepal.Width numeric FSelectorRcpp_information.gain 0.2672750
# old
generateFilterValuesData(iris.task,
  method = c("gain.ratio", "information.gain"))
## FilterValues:
##           name    type gain.ratio information.gain
## 1 Sepal.Length numeric  0.4196464        0.4521286
## 2  Sepal.Width numeric  0.2472972        0.2672750
## 3 Petal.Length numeric  0.8584937        0.9402853
## 4  Petal.Width numeric  0.8713692        0.9554360
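Because the new format is “long” (one row per feature/method pair), standard data-frame operations apply directly to the \$data slot. A small hypothetical post-processing sketch, using base R:

```r
library(mlr)

fv = generateFilterValuesData(iris.task,
  method = c("FSelectorRcpp_gain.ratio", "FSelectorRcpp_information.gain"))

# filter to a single method
subset(fv$data, method == "FSelectorRcpp_gain.ratio")

# group by method, e.g. the highest filter value per method
aggregate(value ~ method, data = fv$data, FUN = max)
```

With the old wide format, each method was a separate column, so such grouped operations required reshaping first.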

# Learners

Besides the integration of new learners and some added options for existing ones (check the NEWS file), we fixed a bug that caused an incorrect aggregation of probabilities in certain cases. This bug went undetected for quite some time and was only revealed by a change in data.table’s rbindlist() function. Thankfully @danielhorn reported this issue and we were able to fix it within a few days.

Another notable change is that the commonly used e1071::svm() learner now only uses the formula interface internally if factors are present in the data. This aims to prevent the “stack overflow” problems that some users encountered with large datasets.

With PR #1784 we added more support for estimating standard errors using the internal methods of the “Random Forest” algorithm. Please check the NEWS file for more detailed information about the implemented RF learners.
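In mlr, such standard errors are requested via a learner with predict.type = "se". A sketch under assumptions: the se.method value shown is illustrative, and which estimators are available depends on the specific RF learner (see the NEWS file):

```r
library(mlr)

# Request standard errors from a random-forest regression learner;
# se.method = "jackknife" is one internal estimator (assumed available
# for this learner; check its documentation)
lrn = makeLearner("regr.randomForest", predict.type = "se",
  se.method = "jackknife")

mod = train(lrn, bh.task)       # bh.task: built-in Boston housing task
pred = predict(mod, bh.task)
head(getPredictionSE(pred))     # per-observation standard errors
```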

# References

Seijo-Pardo, B., I. Porto-Díaz, V. Bolón-Canedo, and A. Alonso-Betanzos. 2017. “Ensemble Feature Selection: Homogeneous and Heterogeneous Approaches.” Knowledge-Based Systems 118 (February): 124–39. https://doi.org/10/f9qgrv.