forester: what makes the package special?

In this blog, we’d like to delve deeper into the package’s features than we did in the previous post introducing the new version of the forester package. We will highlight what makes the package special among other AutoML solutions in R.

The forester pipeline

Now, let’s examine what the general AutoML pipeline described in the previous post looks like inside the forester package. The graph below shows that the first two steps are identical, while the main part of the pipeline is hidden inside the train() function.

The forester pipeline.

At the beginning of the forester pipeline, we conduct a data quality check, an innovative feature that presents the user with possible problems in the dataset, such as highly correlated columns, unbalanced classes, or missing values.

The next step is data preparation, which consists of preprocessing, feature engineering and data format adjustments for different model engines.

After that, the model training begins, and the user can choose which tuning paths will be executed. The recommended and most effective, but also most costly, method is Bayesian optimization; however, we can also choose to train() the models with default parameters or via a random search algorithm.

Ultimately, we evaluate the models and provide the outcomes as a ranked list. The list includes all trained models, sorted from best to worst according to a user-chosen metric. The train() function returns a compound object which can be used with the other functions from the package.
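To make this concrete, below is a minimal, hedged sketch of a training call on the lisbon dataset that ships with the package. The bayes_iter and random_evals arguments are assumptions used to illustrate selecting the tuning paths described above; the exact parameter names may differ in the version you have installed.

```r
library(forester)

# The lisbon dataset (house prices) is bundled with the package.
data("lisbon")

# Train tree-based models to predict the Price column.
# bayes_iter and random_evals are assumed parameter names that
# illustrate controlling the Bayesian optimization and random
# search tuning paths.
train_output <- train(
  data         = lisbon,
  y            = "Price",
  bayes_iter   = 10,
  random_evals = 5
)

# Inspect the components of the returned compound object
# (ranked list of models, processed datasets, and so on).
names(train_output)
```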

Package structure and user interface

The graph presented below briefly summarizes the processes inside the train() function described above and adds information about additional features of the package.

The forester package structure.

The explain() function is a connector to the DALEX package and creates an explainer for selected models. Explainable artificial intelligence (XAI) methods are an important part of model evaluation, and we couldn’t omit them from the forester package. With the further use of DALEX, the user is able to create various explanations of the provided models, such as the feature importance plot.
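A hedged sketch of this workflow is shown below. The way forester::explain() is called here is an assumption (it may expect the models and test data as separate arguments), while model_parts() and plot() are standard DALEX calls for permutation-based feature importance.

```r
library(DALEX)

# Create a DALEX explainer for a model selected by train().
# The exact arguments of forester::explain() are an assumption;
# consult the package documentation for the actual signature.
exp <- forester::explain(train_output)

# Standard DALEX workflow: permutation-based feature importance.
fi <- DALEX::model_parts(exp)
plot(fi)
```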

The save() function enables the user to save the object returned by train() in the .RData format. The procedure saves not only the models but also all the datasets used in the training process (from the raw dataset to the split and preprocessed ones). This encourages the data scientist to continue model training on their own after getting a baseline model from the forester.
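A minimal sketch, assuming save() can be called on the train() output with default arguments (the exact options, such as the file name and path, are not shown here and may differ):

```r
# Persist the whole training output (models plus every dataset used
# in the process) as an .RData file. Defaults are assumed here.
# The file can later be restored with base R's load() and the
# training continued from the forester baseline.
forester::save(train_output)
```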

The report() function creates an automatically generated report describing the training process. There are different document structures for regression and binary classification tasks; however, the general structure of the report is as follows:

  1. Information about the date, current package version, task type, and basic task description.
  2. The ranked list of best models.
  3. The group of plots comparing the best models.
  4. The group of plots describing the best model, including its explanations.
  5. The data check report.
  6. Details about the best model architecture.

More details about the report will be available in one of the future blog posts.
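As a rough sketch, generating the document could look like the call below; the output_file argument name is an assumption, so consult the report() documentation for the exact interface.

```r
# Render the automatic report for the training run.
# The output_file argument name is an assumption; the file name
# used here is purely illustrative.
report(train_output, output_file = "lisbon_regression_report.pdf")
```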

An example report for the regression task conducted on the lisbon dataset.

The last feature available to the user is data_check(), which is not shown on the graph as it is also a part of the train() function. This feature gained lots of positive feedback during a workshop conducted with a group of ML experts from MI².AI. The function provides the user with an abundance of information about the dataset, especially warnings about possible issues. An example for the lisbon dataset is presented below.

The data check report.
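For illustration, a standalone check could be run as sketched below; the argument names are assumed to mirror train(), and depending on the package version the function may be exported as check_data() instead.

```r
# Run the data quality check on its own, before training.
# It warns about issues such as highly correlated columns,
# unbalanced classes, or missing values. Argument names are
# assumed to mirror train().
data_check(data = lisbon, y = "Price")
```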

Existing solutions

AutoML solutions are definitely more common in Python; however, the R language also has its own packages. The biggest and best known are H2O and mlr3. During the forester development process, we kept in mind that in order to succeed we had to stay at a similar level to the aforementioned tools and add something extra. Our main goal was to keep the package easy to use, which is not always the case for other solutions. To achieve this, we decided to limit ourselves to 5 tree-based engines, which is fewer than in H2O and mlr3, but we were able to provide more features. The most innovative ones are the data check mentioned in the previous section and the automatic report generation, which will get its own blog post.

The comparison of AutoML frameworks in R.
In the next blog post, we will present a package usage scenario with a real-life story in the background. It will also include code examples, outcome analysis, and comments.
If you are interested in other posts about explainable, fair and responsible ML, follow #ResponsibleML on Medium.