In this blog post, we’d like to introduce you to the brand-new, reorganised and restructured version of the forester R package.
Responsible ML readers might already be familiar with the package’s name and may wonder why we are describing it again. The previous version of the forester was introduced about 1.5 years ago (by Anna Kozak, Szymon Szmajdziński, and Thien Hoang Ly) and was followed by two blog posts: ‘forester: An AutoML R package for Tree-based Models’ and ‘Guide through jungle of models! What’s more about the forester R package?’. Unfortunately, because of other responsibilities and new opportunities, the authors weren’t able to maintain the package, which led to the point where a reanimation of the tool was needed. A new scientific team (Anna Kozak, Hubert Ruczyński, Adrianna Grudzień, Patryk Słowakiewicz) took over the project and rebuilt it from scratch, learning from its predecessors’ mistakes.
What is the forester?
The forester is an AutoML tool in R for regression and binary classification tasks on tabular data. It wraps the entire machine learning process into a single train() function, which includes:
- rendering a brief data check report,
- preprocessing the initial dataset so that models can be trained,
- training five tree-based models (decision tree, random forest, xgboost, catboost, lightgbm) with default parameters, random search, and Bayesian optimisation,
- evaluating them and providing a ranked list.
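As a sketch of how this looks in practice, consider the snippet below. The lisbon housing dataset and the ‘Price’ target come from the package’s earlier blog posts; the exact argument names and the fields of the returned object are assumptions and may differ between versions.

```r
library(forester)

# 'lisbon' is a housing dataset shipped with the package (used in its docs);
# 'Price' is the target column. The task type is detected automatically.
train_output <- train(data = lisbon, y = 'Price')

# Printing the output shows, among other things, the ranked list of models
# (field names below are illustrative assumptions).
print(train_output)
```

A single call like this covers the data check, preprocessing, training of all five engines, tuning, and evaluation described above.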
However, that’s not everything that the forester has to offer. Via additional functions, the user can easily explain the created models using DALEX or generate one of the predefined reports.
The package’s main goal is to keep the user interface as simple as possible, so that everyone can benefit from its capabilities. It is specifically designed for:
- Beginners in ML to train their first models and start their modelling career.
- Researchers from other scientific fields to easily add ML solutions and analyses to their theses.
- ML experts to easily conduct dataset analysis, create baseline models, and quickly explore new tasks they are facing.
AutoML and forester pipelines
In order to fully understand what the forester package offers, we first need to provide some brief background on machine learning (ML) and automated machine learning (AutoML) pipelines.
The classical ML pipeline starts with two pre-modelling steps: task identification and data collection. They are undoubtedly important; however, we will focus on the steps highlighted in green, because they are the heart of the whole process.
During the preprocessing stage, data scientists focus on proper data preparation, so that the models can later be trained. Typical actions performed here are missing value imputation, data encoding, or the removal of static columns. The feature engineering process consists of more advanced methods, and its goal is to select the most important columns of the dataset for model training. It includes, for example, the removal of highly correlated columns, or selection via lasso or ridge methods for regression tasks. The most time-consuming step is model training. At this point, the data scientist has to select the model engines and tune plenty of hyperparameters manually in order to achieve the best results. Finally comes post-processing, which includes evaluating the models with different metrics and comparing them to one another to choose the best one.
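To make the preprocessing stage concrete, here is a minimal base-R sketch of the kind of manual work described above (the toy data frame is purely illustrative; real pipelines are far more involved):

```r
# A toy data frame with a missing value, a character column, and a static column
df <- data.frame(
  age   = c(25, NA, 47, 31),
  city  = c("Lisbon", "Porto", "Lisbon", "Porto"),
  const = c(1, 1, 1, 1)   # static column carrying no information
)

# Missing value imputation: replace NAs in a numeric column with the median
df$age[is.na(df$age)] <- median(df$age, na.rm = TRUE)

# Data encoding: turn character columns into factors (or one-hot encode them)
df$city <- as.factor(df$city)

# Removal of static columns: drop columns with a single unique value
df <- df[, sapply(df, function(col) length(unique(col)) > 1)]
```

The forester performs steps of this kind automatically inside train(), which is exactly the repetitive work an AutoML tool is meant to take off the data scientist’s hands.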
As one can see, model training is an iterative process that consists of highly repetitive steps, and ends up being incredibly time-consuming. The best way to fight that is to use an AutoML tool. As shown below, such solutions automate the ML pipeline, so data scientists can deal with more important matters.
Why tree-based models?
Some users might be surprised that all models used inside the package come from the tree-based family and wonder if there are any particular reasons for doing so. There definitely are, and the most prominent ones are:
- Tree-based models are extremely popular amongst winners in Kaggle competitions, which shows their versatility.
- On tabular data they often outperform deep neural networks, which tend to be biased towards overly smooth solutions.
- Bagging and boosting ensembles address their main drawback, which is overfitting.
- Tree-based models are easy to understand for users without an ML background and have already earned a good reputation among, for example, medical practitioners.
For further reading and a more in-depth analysis of tree-based models’ performance, we recommend the paper by Léo Grinsztajn et al., ‘Why do tree-based models still outperform deep learning on tabular data?’. The visualisations below come from the aforementioned publication.
Package structure and user interface
The graph presented below briefly summarises the processes inside the main train() function and adds information about the package’s additional features. The explain() function creates an explainable artificial intelligence (XAI) explainer from the DALEX package, the save() function lets the user save the final object, and report() creates an automatically generated report of the training process. One can also use the check_data() function, which is also run inside the preprocessing step.
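A hedged sketch of these follow-up functions is shown below. Only the function names come from the package; the argument lists and the report file name are illustrative assumptions and may differ between versions.

```r
library(forester)

# Assume 'train_output' is the object returned by a previous train() call.

# explain(): wrap a trained model into a DALEX explainer for XAI analyses
# (argument names here are assumptions, not the documented signature)
exp <- explain(train_output)

# report(): render the automatically generated report of the training process
report(train_output, 'forester_report.pdf')

# save(): persist the final train() output for later reuse
save(train_output)
```

Together these functions cover the post-modelling needs mentioned earlier: explanation, documentation, and persistence of the trained models.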
In the next blog post, we will describe all forester features in detail and underline what makes the package special among other AutoML solutions in R.
If you are interested in other posts about explainable, fair and responsible ML, follow #ResponsibleML on Medium.
In order to see more R-related content, visit https://www.r-bloggers.com
forester: an R package for automated building of tree-based models was originally published in ResponsibleML on Medium.