This post was kindly contributed by R – SmarterPoland.pl; go there to comment and to read the full post.
I love the pkgdown package. With a single line of code you can create a complete website with examples, vignettes and documentation for your package. Brilliant!
So what about a website generator for predictive models?
Imagine that you can take a set of predictive models (generated with caret, mlr, glm, xgboost or randomForest, anything) and automagically generate a website with exploration and documentation for these models: documentation with archivist hooks to the models, with tables and graphs for model performance explainers, conditional model response explainers or explainers for particular predictions.
During the summer semester, three students from Warsaw University of Technology (Kamil Romaszko, Magda Tatarynowicz, Mateusz Urbański) developed the modelDown package for R as a team project assignment. You can find the package here. Visit an example website created with this package for four example models (instructions). And read more about this package at its pkgdown website or below.
Getting started with modelDown
by Kamil Romaszko, Magda Tatarynowicz, Mateusz Urbański
Did you ever want to have one place where you can find information explaining your model? Or maybe you were missing a tool that can show the differences between multiple models for the same dataset? Well, here comes the modelDown package. Using the DALEX package, it creates a single HTML page with plots and information related to the model(s) you want to analyze.
If you want to see an example website generated with modelDown, check out this link (along with the script that was used to create the HTML). Read on to see how to use the package for your own models and what features it provides.
The examples presented here were generated for the HR_data dataset from the breakDown package (available on CRAN). The dataset contains various information about employees (for example, their satisfaction with work or their salary). The value we predict is whether they left the company.
First things first: how can you use this package? Install it from GitHub.
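A minimal sketch of the installation step, assuming the devtools package is used for GitHub installs. The repository path `MI2DataLab/modelDown` is an assumption that is not stated in the text above; replace it with the path from the package link if it differs.

```r
# Install devtools first if it is not already available.
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

# Assumption: the repository path below; adjust it to the package's actual GitHub location.
devtools::install_github("MI2DataLab/modelDown")
```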
When you have the package successfully installed, you need to create DALEX explainers for your models. Here is a simple example. Please refer to the DALEX package documentation to learn more.
# assuming you have two models: glm_model and ranger_model for HR_data
explainer_glm <- DALEX::explain(glm_model, data = HR_data, y = HR_data$left)
explainer_ranger <- DALEX::explain(ranger_model, data = HR_data, y = HR_data$left)
Next, just pass all the created explainers to the modelDown function.
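A minimal end-to-end sketch, assuming the DALEX, breakDown and modelDown packages are installed. For brevity it fits only a logistic regression; the ranger explainer would be passed as a further argument in the same way. Any options beyond the explainers themselves are not shown here and should be checked in the package documentation.

```r
library(DALEX)

# HR_data ships with the breakDown package; `left` is the 0/1 outcome we predict.
data(HR_data, package = "breakDown")

# A logistic regression and its DALEX explainer (same explain() call as in the example above).
glm_model <- glm(left ~ ., data = HR_data, family = "binomial")
explainer_glm <- DALEX::explain(glm_model, data = HR_data, y = HR_data$left)

# Pass one or more explainers; the HTML page is generated with default options.
modelDown::modelDown(explainer_glm)
```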
That’s it! Now you should have your HTML page generated with default options.
Let’s quickly describe the sections of your page. If you want to know more about how the plots are generated, again, check out the DALEX package documentation.
Always know your data before you analyze the model – the index page helps you do exactly that.
You can see basic information about your data, such as its dimensions and a summary of all variables. For numerical variables, summary statistics are presented; for categorical ones, you see how many observations fall into each category.
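The same kind of overview can be reproduced by hand in base R. This is a small sketch on a made-up data frame (the values and the `employees` name are illustrative, not taken from HR_data), and it is not the code modelDown itself runs.

```r
# Small made-up data frame standing in for HR_data (values are illustrative only).
employees <- data.frame(
  satisfaction_level = c(0.38, 0.80, 0.11, 0.72, 0.37, 0.41),
  salary = factor(c("low", "medium", "medium", "low", "low", "high")),
  left = c(1, 1, 1, 0, 0, 0)
)

# Dimensions: number of observations and number of variables.
data_dim <- dim(employees)

# Numerical variable: minimum, quartiles, mean and maximum.
satisfaction_summary <- summary(employees$satisfaction_level)

# Categorical variable: number of observations in each category.
salary_counts <- table(employees$salary)

print(data_dim)
print(satisfaction_summary)
print(salary_counts)
```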
The model performance section gives the most general information about how correct the predictions were.
For our two models, the ranger model clearly has lower residual values, which suggests that it performs better on this dataset.
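To make the residual comparison concrete, here is a hedged base R sketch on synthetic data. The two `lm` models and the `abs_residuals` helper are hypothetical stand-ins, not the glm and ranger models from the post; the point is only that a residual is the observed value minus the predicted one, and that smaller absolute residuals indicate a better fit.

```r
set.seed(1)

# Synthetic data with a curved relationship between x and y.
x <- runif(200, -2, 2)
d <- data.frame(x = x, y = x^2 + rnorm(200, sd = 0.2))

# Two competing models: a straight line and a quadratic curve.
linear_model <- lm(y ~ x, data = d)
quadratic_model <- lm(y ~ poly(x, 2), data = d)

# Absolute residual: |observed - predicted| for every observation.
abs_residuals <- function(model, data, y) {
  abs(y - predict(model, newdata = data))
}

res_linear <- abs_residuals(linear_model, d, d$y)
res_quadratic <- abs_residuals(quadratic_model, d, d$y)

# The model whose residuals are concentrated closer to zero fits this data better.
print(summary(res_linear))
print(summary(res_quadratic))
```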
The variable importance plot is extremely useful when you want to see how removing a single variable impacts the response, that is, how important each variable is.
Here, it is clear that for the linear model there are two most important variables: number_project and satisfaction_level. For the ranger model, there are four most important variables. Also, each model picked a different variable as the most important one.
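One common way to approximate "removing" a variable without refitting the model is to shuffle its values and measure how much the loss grows. The sketch below illustrates that permutation idea in base R on synthetic data; it is an assumption about the general technique, not DALEX's or modelDown's actual code, and the variable names are made up.

```r
set.seed(42)

# Synthetic data: x1 matters a lot, x2 a little, noise not at all.
n <- 500
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
d$y <- 3 * d$x1 + 0.5 * d$x2 + rnorm(n, sd = 0.5)

model <- lm(y ~ x1 + x2 + noise, data = d)

# Loss function: root mean squared error.
rmse <- function(y, y_hat) sqrt(mean((y - y_hat)^2))
baseline_loss <- rmse(d$y, predict(model, newdata = d))

# For each variable: shuffle its column, predict again, record the loss increase.
permutation_importance <- sapply(c("x1", "x2", "noise"), function(variable) {
  shuffled <- d
  shuffled[[variable]] <- sample(shuffled[[variable]])
  rmse(d$y, predict(model, newdata = shuffled)) - baseline_loss
})

# Larger values mean the model relies more heavily on that variable.
print(sort(permutation_importance, decreasing = TRUE))
```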
In the variable response plot you can see how a single variable impacts the response.
For example, for the variable average_monthly_hours and the glm model, there is a linear dependency: the more hours someone works, the greater the chance that they will leave the company. For the ranger model, this is not so clear: the chance of leaving increases drastically for people working more than 270 hours a month. By default, the plots are generated for every variable, so you can draw similar conclusions for all variables in the model.
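A response curve of this kind can be sketched as a partial dependence profile: set the variable of interest to a fixed value for every row, average the model's predictions, and repeat over a grid of values. The example below does this in base R on synthetic data; the data-generating rule, the column names and the `partial_dependence` helper are all hypothetical, and the plots on the generated page may be computed differently.

```r
set.seed(7)

# Synthetic data: leaving becomes more likely with more hours and lower satisfaction.
n <- 300
d <- data.frame(hours = runif(n, 100, 310), satisfaction = runif(n))
d$left <- rbinom(n, 1, plogis(-4 + 0.02 * d$hours - 2 * d$satisfaction))

model <- glm(left ~ hours + satisfaction, data = d, family = "binomial")

# Partial dependence: fix one variable at each grid value, average the predictions.
partial_dependence <- function(model, data, variable, grid) {
  sapply(grid, function(value) {
    modified <- data
    modified[[variable]] <- value
    mean(predict(model, newdata = modified, type = "response"))
  })
}

hours_grid <- seq(100, 300, by = 50)
pd_hours <- partial_dependence(model, d, "hours", hours_grid)

print(data.frame(hours = hours_grid, mean_predicted_left = pd_hours))
```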
Prediction breakdown shows detailed information for particular observations in a model. By default, one observation with the worst predicted value is presented for each model.
In the example, for the ranger model the value of satisfaction_level contributed the most to the final response. So even though this particular employee’s satisfaction level was below the midpoint of the measurement scale, he still didn’t leave the company. The model prediction was not correct in this case.
Prediction breakdown makes it easier to understand how the model behaved. It can be useful for tuning your model and improving its capabilities.
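For a plain linear model the breakdown idea can be worked out by hand: each variable contributes its coefficient times the observation's deviation from that variable's mean, and the contributions add up to the gap between the average prediction and this observation's prediction. The sketch below shows only this simple linear case on made-up data; the model-agnostic method used for models such as ranger is more involved and is not reproduced here.

```r
set.seed(3)

# Synthetic data with a linear relationship (column names are illustrative).
n <- 200
d <- data.frame(satisfaction = runif(n), projects = sample(2:7, n, replace = TRUE))
d$score <- 1 - 2 * d$satisfaction + 0.3 * d$projects + rnorm(n, sd = 0.1)

model <- lm(score ~ satisfaction + projects, data = d)
observation <- d[1, ]

# Contribution of each variable: coefficient times deviation from the variable's mean.
variables <- c("satisfaction", "projects")
contributions <- sapply(variables, function(v) {
  coef(model)[[v]] * (observation[[v]] - mean(d[[v]]))
})

average_prediction <- mean(predict(model, newdata = d))
prediction <- unname(predict(model, newdata = observation))

# Average prediction plus all contributions reproduces the observation's prediction.
print(round(c(average = average_prediction, contributions, prediction = prediction), 3))
```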
The idea of the package was to help you understand your models in a condensed and easy way. We hope that using this package will make models’ performance clear to you. Feel free to use it and provide your feedback.