Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Machine learning is a hot topic nowadays, thus there is no need to convince anyone about its usefulness. ML models are being successfully applied in biology, medicine, finance, and so on. Thanks to modern software, it is easy to train even a complex model that fits the training data and results in high accuracy on the test set. The problem arises when poorly verified model fails confronted with real-world data.

In this post, we would like to describe auditor package for visual auditing of residuals of machine learning models. A residual is the difference between the observed value and the value predicted by a model. The auditor provides methods for verification and validation of models. It helps in finding answers to questions that may be crucial in deeper analyses of models.

• Does the model fit the data? Is it not missing any information?
• Which model has better performance?
• How similar are models?

# Motivation

Before we start our journey with the auditor, let us focus on linear models. Such models have a very simple structure and do not require high computational power, therefore, there are many tools that validate different aspects of these models. Function plot() from the stats package generates six types of diagnostic plots for “lm” and “glm” objects.

However, this function can generate plots only for linear models and some of these plots are not extendable to other models. At the same time, many algorithms, such as random forests or neural networks are often treated as black boxes and there are few or no methods for errors analysis. The auditor comes as a solution to the problems. It is a toolbox with model-agnostic validation plots, which means that they can be used regardless of the expected distribution of residuals. The auditor provides flexible and consistent grammar for validation of any model class.

# Data

Use case – predicting a length of life

To illustrate application of auditor we will use dataset “dragons” available in the DALEX package. The dataset contains characteristics of fictional creatures (dragons), like year of birth, height, weight, etc (see below). The goal is to predict the length of life of dragons (a regression problem).

library(DALEX)
data(dragons)
head(dragons)

##   year_of_birth   height   weight scars colour year_of_discovery
## 1         -1291 59.40365 15.32391     7    red              1700
## 2          1589 46.21374 11.80819     5    red              1700
## 3          1528 49.17233 13.34482     6    red              1700
## 4          1645 48.29177 13.27427     5  green              1700
## 5            -8 49.99679 13.08757     1    red              1700
## 6           915 45.40876 11.48717     2    red              1700
##   number_of_lost_teeth life_length
## 1                   25   1368.4331
## 2                   28   1377.0474
## 3                   38   1603.9632
## 4                   33   1434.4222
## 5                   18    985.4905
## 6                   20    969.5682

## Models

First, we need models to compare. We selected linear regression and random forest because of their different structures. Linear regression model linear relationships between target response and independent variables, while random forest should be able to capture also non-linear relationships between variables.

# Linear regression
lm_model <- lm(life_length ~ ., data = dragons)

# Random forest
library(randomForest)
set.seed(59)
rf_model <- randomForest(life_length ~ ., data = dragons)

## Preparation for residual (error) analysis

Analysis begins with creation of an explainer object with explain function from DALEX package. Explainer wraps a model with its meta-data, such as dataset that was used for training or observed response.

lm_exp <- DALEX::explain(lm_model, label = "lm", data = dragons, y = dragons$life_length) rf_exp <- DALEX::explain(rf_model, label = "rf", data = dragons, y = dragons$life_length)

Next step requires creation of model_residual objects of each explainer. From this step on, only auditor functions will be used.

library(auditor)
lm_mr <- model_residual(lm_exp)
rf_mr <- model_residual(rf_exp)

## Plots

In the following section, we show individual plotting functions which demonstrate different aspects of residual analysis. We devote more attention to selected functions, but usage of each function is more or less similar.

### Observed vs predicted

First plot is a basic plot comparising predicted versus observed values. The red line corresponds to the y = x function. The patterns for both models are non-random around the diagonal line. The points corresponding to a random forest (darker dots) show the tendency to underprediction for large values of observed response. Points for linear model (lighter dots) are located more or less around diagonal line which means that this model predicts quite well.

plot(rf_mr, lm_mr, type = "prediction", abline = TRUE)

# alternatives:
# plot_prediction(rf_mr, lm_mr, abline = TRUE)
# plot_prediction(rf_mr, lm_mr, variable = "life_length")

Function plot_prediction presents observed values on the x-axis. However, on the x-axis there may be values of any model variable or observations ordered by index (variable = NULL).

plot(rf_mr, lm_mr, variable = "height", type = "prediction")
plot(rf_mr, lm_mr, variable = "scars", type = "prediction")

As you can notice, on above plots, there is no relationship for variable height and predicted values while for increasing number of scars model predictions also increase of life length. This means that that model captured monotonic relationship between number of scars and length of life of dragon.

### Residuals vs observed, fitted or variable values

Next function (plot_residual()) shows residuals versus observed values. This plot is used to detect dependence of errors, unequal error variances, and outliers. For appropriate model, residuals should not show any functional dependency. Expected mean value should be equal to 0, regardless of $$\hat{y}$$ values, so any structured arrangement of points suggests a problem with the model. It is worth looking at the observations that clearly differ from the others. If points on the plot are not randomly dispersed around the horizontal axis, it may be presumed that model is not appropriate for the data.

plot(lm_mr, rf_mr, type = "residual")

# alternative:
# plot_residual(lm_mr, rf_mr)

Values (residuals) may also be ordered by target variable, fitted values, any other variable or may be presented unordered.

plot(rf_mr, lm_mr, type = "residual", variable = "_y_hat_")
plot(rf_mr, lm_mr, type = "residual", variable = "scars")

# alternative:
# plot_residual(rf_mr, lm_mr, variable = "_y_hat_")
# plot_residual(rf_mr, lm_mr, variable = "scars")

In all examples above, we can see that linear model is better fitted for the data than random forest, because for the latter one greater values of selected variables residuals are also geater. Additionaly, we can identify most outlying observations:

plot_residual(rf_mr, variable = "_y_hat_", nlabel = 10)

### Density of residuals

Residual density plot (plot_residual_density()) detects the incorrect behavior of residuals. The funcion returns plot with estimated densities of residuals. Their values are displayed as marks along the x axis. For some models, the expected shape of density could be derived from the model assumptions. For example, simple linear model residuals should be normally distributed. However, even if the model does not have an assumption about the distribution of residuals residual density plot may be informative. If most of the residuals are not concentrated around zero, it is likely that the model predictions are biased.

plot(rf_mr, lm_mr, type = "residual_density")

# alternative
# plot_residual_density(rf_mr, lm_mr)

Resuduals may be also divided by values of a choosen variable (median of a numeric variable or levels of a factor).

plot_residual_density(rf_mr, lm_mr, variable = "colour")

### Boxplot of residuals

Residual boxplot (plotResidualBoxplot()) shows the distribution of the absolute values of residuals. Boxplot usually presents following values:

• box width which corresponds to the second and third quartile,
• vertical line which reflects median,
• the whiskers which extend to the smallest and largest values, no further than 1.5 of interquartile.

auditor adds another component to the boxplot which is the root mean square error (RMSE) measure, shown as X. For the appropriate model, box should lay near zero. A large spread of values indicates problems with a model. Comparing our two models we can see that random forest model is much more spreaded (worse) than linear one.

plot(lm_mr, rf_mr, type = "residual_boxplot")

# alternative
# plot_residual_boxplot(lm_mr, rf_mr)

## More

Plots presented in this post are only a small part of the auditor’s capacities. Among other ploting functions focusing on residuals, there are also model preformance and model evaluation functions. What is more, most of the ploting functions are available in an interactive version (D3.js). We encourage you to visit the following sources:

The upcoming chance to learn more about the auditor is going to be at a hands-on workshop during the Why R? Conference held in Warsaw, 26-29 September 2019.
If you have not done it yet, we strongly encourage you to register on Why R?, it is a remarkable occasion to gain knowledge about R, enhance data analysis and processing skills, and meet other members of European R communit.