Welcome to the second part of the forester blog. In the previous part, we explained the main idea of the forester package, the motivations behind it, its advantages, and the innovations it brings to the ML world. You should definitely check it out! In this part, we will focus on the wide range of possibilities the forester package offers and what you can achieve with it. We will present the main functions of the package with their parameters and show how to use them on your own problems.
The forester package encapsulates the important steps of the ML pipeline. We discussed each step in the previous part using the graph below.
Now we will explain exactly how our package works and what happens between the first and the last step of the process. The basic scheme of the package's functions is presented in the graph below. Notice how the colors below match particular steps in the graph above. The only thing the user has to do to create a model is run one function. First, the data preprocessing is performed. In the next step, the forester package creates models and then tunes them. Finally, the models are evaluated and compared so that the best one can be chosen. The forester package returns a DALEX explainer object, so the user can easily create various plots to explain the model’s predictions. We will now look closer at the particular functions of our pipeline and show what you can achieve with them.
The make functions are arguably the core of our package. Their goal is to simplify the process of creating models. Using those functions, you can create basic tree-based models in just a few seconds. And simply by choosing the right arguments, you can also perform simple data preprocessing and even tune your model. Let’s see how it works.
First, we’ll need some data.
# Loading libraries
library(DALEX)
library(forester)

# Creating train and test set
data_shuffled <- fifa[sample(1:5000, 5000), ]
data_train <- head(data_shuffled, 4000)
data_test <- data_shuffled[4001:5000, ]
Then, with a single line of code, you can create a basic model. We will focus on ranger models, but the make functions for the other models work the same way.
basic_model <- make_ranger(data = data_train, target = "overall", type = "regression", label = "Basic Ranger")
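To give a feel for what such a call saves you from writing, here is a rough manual counterpart: fit a random forest with the ranger package and wrap it in a DALEX explainer yourself. This is only a sketch under our own assumptions (the 80/20 split, `num.trees = 100`, and the names `manual_ranger` and `manual_explainer` are ours), not a description of what make_ranger does internally.

```r
library(DALEX)
library(ranger)

set.seed(123)

# Shuffle the fifa data shipped with DALEX and split it 80/20
rows <- sample(nrow(fifa))
n_train <- floor(0.8 * nrow(fifa))
train <- fifa[rows[seq_len(n_train)], ]
test <- fifa[rows[-seq_len(n_train)], ]

# Fit a plain random forest (num.trees = 100 is an assumption, chosen for speed)
manual_ranger <- ranger(overall ~ ., data = train, num.trees = 100)

# Wrap the fitted model in a DALEX explainer, using the test set for evaluation
manual_explainer <- DALEX::explain(
  manual_ranger,
  data = test[, names(test) != "overall"],
  y = test$overall,
  label = "Manual Ranger",
  verbose = FALSE
)

# Basic regression metrics computed on the test set
model_performance(manual_explainer)
```

The make function replaces these steps with a single call and hands back an object that DALEX can use directly.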
You can also perform simple data preprocessing by selecting the right parameters. You can decide what to do with missing data: delete it or fill it. Moreover, if you specify how many features you want to use in training, the most important ones will be chosen with the Boruta package. In this example, we will use the ten most important features and fill the NA values.
prep_model <- make_ranger(data = data_train, target = "overall", type = "regression", fill_na = TRUE, num_features = 10, label = "Prep Ranger")
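For readers curious about what these two preprocessing steps could look like by hand, below is a minimal sketch on toy data. The imputation rule (median for numeric columns, most frequent level for factors), the helper `fill_na_simple`, and the ranking by Boruta's mean importance are our assumptions for illustration; the package may fill values and pick features differently.

```r
library(Boruta)

set.seed(1)

# Toy data (hypothetical): y depends on x1 and x2 only
toy <- data.frame(
  x1 = rnorm(200),
  x2 = rnorm(200),
  noise = rnorm(200),
  grp = factor(sample(c("a", "b"), 200, replace = TRUE))
)
toy$y <- 3 * toy$x1 - 2 * toy$x2 + rnorm(200, sd = 0.5)

# Introduce a few missing values
toy$x1[c(5, 50)] <- NA
toy$grp[10] <- NA

# Fill NAs: median for numeric columns, most frequent level for factors
fill_na_simple <- function(df) {
  for (col in names(df)) {
    missing <- is.na(df[[col]])
    if (!any(missing)) next
    if (is.numeric(df[[col]])) {
      df[[col]][missing] <- median(df[[col]], na.rm = TRUE)
    } else {
      counts <- table(df[[col]])
      df[[col]][missing] <- names(counts)[which.max(counts)]
    }
  }
  df
}
toy_filled <- fill_na_simple(toy)

# Run Boruta and keep the k features with the highest mean importance
boruta_fit <- Boruta(y ~ ., data = toy_filled, maxRuns = 30)
stats <- attStats(boruta_fit)
k <- 2
top_features <- rownames(stats)[order(stats$meanImp, decreasing = TRUE)][seq_len(k)]
top_features
```

With fill_na and num_features, the same kind of preparation is requested through two arguments instead of being written out.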
Moreover, you can easily tune your models by setting tune = TRUE. You can (but don’t have to) choose the metric on which the model will be evaluated during the tuning process.
tuned_model <- make_ranger(data = data_train, target = "overall", type = "regression", fill_na = TRUE, num_features = 10, tune = TRUE, label = "Tuned Ranger")
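To illustrate the general idea behind tuning, here is a small grid search over two ranger hyperparameters, scored by the out-of-bag error. It is a sketch only: the grid values, the use of the built-in mtcars data, and the choice of out-of-bag RMSE as the criterion are our assumptions, and the search that tune = TRUE triggers may work quite differently.

```r
library(ranger)

set.seed(42)

# Hypothetical small grid; a real tuner would explore a wider space
grid <- expand.grid(mtry = c(2, 4, 6), min.node.size = c(5, 20))
grid$oob_rmse <- NA_real_

for (i in seq_len(nrow(grid))) {
  fit <- ranger(
    mpg ~ .,
    data = mtcars,
    num.trees = 200,
    mtry = grid$mtry[i],
    min.node.size = grid$min.node.size[i]
  )
  # For regression forests, prediction.error is the out-of-bag MSE
  grid$oob_rmse[i] <- sqrt(fit$prediction.error)
}

# Keep the combination with the lowest out-of-bag RMSE
best <- grid[which.min(grid$oob_rmse), ]
best
```

Scoring by out-of-bag error avoids setting aside a separate validation set, which is convenient for a quick comparison like this one.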
After creating several different models, you may want to compare them and see which one is the best. forester provides an evaluate function that evaluates all models, creates a table with different metrics, and selects the best model based on the chosen metric. We can use this function to see which of our three models is the best. We have to provide the test set, of course. We can see that both preprocessing and tuning improved our model’s score.
evaluation <- evaluate(basic_model, prep_model, tuned_model, data_test = data_test, target = "overall")
best_model <- evaluation$best_model
The evaluate function returns a list that contains the best model and a data frame with the results of the evaluation, so to use best_model we have to extract it first.
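The comparison itself comes down to computing the same metrics for every model on the same test set and picking a winner. The base R sketch below shows that idea on made-up predictions. The helpers `regression_metrics` and `compare_predictions`, the element name `results`, and the choice of metrics are hypothetical and only mirror the shape of the output described above; here `best_model` holds the name of the winning model rather than the model object.

```r
# Regression metrics for one vector of predictions
regression_metrics <- function(actual, predicted) {
  residuals <- actual - predicted
  c(
    rmse = sqrt(mean(residuals^2)),
    mae = mean(abs(residuals)),
    r2 = 1 - sum(residuals^2) / sum((actual - mean(actual))^2)
  )
}

# Hypothetical helper: score named prediction vectors against the same target
compare_predictions <- function(actual, predictions, metric = "rmse") {
  scores <- as.data.frame(t(sapply(predictions, regression_metrics, actual = actual)))
  # Higher is better for r2, lower is better for the error metrics
  best <- if (metric == "r2") which.max(scores[[metric]]) else which.min(scores[[metric]])
  list(results = scores, best_model = rownames(scores)[best])
}

# Toy usage with made-up numbers
actual <- c(70, 75, 80, 85)
result <- compare_predictions(
  actual,
  list(model_a = c(68, 77, 79, 88), model_b = c(70, 74, 81, 85))
)
result$results
result$best_model  # "model_b"
```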
Now that we know which model is the best, we can use explainable artificial intelligence (XAI) methods to explain it. Because forester integrates well with the DALEX package, we can do this easily with DALEX functions. In this example, we will create a Break Down profile and a Feature Importance plot. To read more about those methods, I recommend this blog about XAI methods.
### Feature importance
best_model <- evaluation$best_model
mp <- model_parts(best_model)
plot(mp, max_vars = 5)
### Break Down profile
nobs <- data_test[150, , drop = FALSE]
pp <- predict_parts(best_model, new_observation = nobs, type = "break_down")
plot(pp, max_vars = 5)
If you don’t want to choose a model on your own, you can use the forester function, which runs the whole pipeline for all available models in a single call and then chooses the best one. Its arguments are very similar to those of the make functions described earlier.
f_model <- forester(data = data_train, target = "overall", type = "regression")
As we can see, using the forester package is very simple. In fact, it takes only one line of code to create a tree-based model without digging into the data and its processing, and the final object is easy to use with the DALEX package. You can read more about our package in the GitHub repository.
If you are interested in other posts about explainable, fair, and responsible ML, follow #ResponsibleML on Medium.
For more R-related content, visit https://www.r-bloggers.com
Guide through jungle of models! What’s more about the forester R package? was originally published in ResponsibleML on Medium, where people are continuing the conversation by highlighting and responding to this story.