Explaining models with triplot, part 1
tl;dr Explaining black-box models built on correlated features can be difficult and can produce misleading results. The R package triplot, part of the DrWhy.AI project, makes it easier to explain the importance of whole groups of variables, which addresses the problem of correlated features.
Calculating the importance of explanatory variables is one of the main tasks of explainable artificial intelligence (XAI). There are many tools at our disposal that help us with that, such as Feature Importance or Shapley values, to name a few. All these methods calculate importance for each variable separately. The problem arises when the features used in the model are correlated.
Triplot is a visual tool that allows us to assess variable importance while taking the correlation structure into account. Triplots work for both global and local explanations. Additionally, the package provides an instance-level explainer called predict_aspects, which can explain the contribution of whole groups of explanatory variables for an arbitrary model (to be described in the next post).
Global explanations with triplot
Using a dataset with highly correlated variables, we will show why it is necessary to consider the importance of groups of variables. For this example, we use the FIFA 20 dataset from the Kaggle website, available in the DALEX package. The model we are explaining predicts a player's value in euros based on their characteristics.
library("DALEX")
data(fifa)
fifa$value_eur <- fifa$value_eur / 10^6
fifa[, c("nationality", "overall", "potential", "wage_eur")] <- NULL
The FIFA 20 dataset contains many correlated variables. We can get a glimpse of that by looking at a subset of the dataset: the correlations between predictors that describe goalkeeping and ball-handling skills.
library("dplyr")
fifa_subset <- fifa %>% select(matches('goalkeeping|skill'))
library("corrplot")
corrplot(cor(fifa_subset), method = "color", type = "upper",
         order = "hclust", addCoef.col = "black",
         number.cex = .7, diag = FALSE)
Looking at the produced corrplot (function from the package with the same name), we can notice groups of highly correlated variables like goalkeeping skills or dribbling and ball control.
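To make the idea of "groups of highly correlated variables" concrete, here is a minimal base-R sketch that turns a correlation matrix into a hierarchical clustering. It uses synthetic data with made-up column names rather than the FIFA dataset. The dissimilarity `1 - abs(cor)` and the `"complete"` linkage are assumptions chosen for illustration, not necessarily what triplot does internally.

```r
# Synthetic stand-in for correlated skill ratings (hypothetical columns,
# not the real FIFA data): two latent abilities, each driving two features.
set.seed(1)
n <- 500
base_a <- rnorm(n)
base_b <- rnorm(n)
skills <- data.frame(
  dribbling    = base_a + rnorm(n, sd = 0.2),
  ball_control = base_a + rnorm(n, sd = 0.2),
  gk_diving    = base_b + rnorm(n, sd = 0.2),
  gk_reflexes  = base_b + rnorm(n, sd = 0.2)
)

# Turn correlation into a dissimilarity: strongly correlated pairs
# (positive or negative) end up close to 0.
cor_mat <- cor(skills, method = "pearson")
dissim  <- as.dist(1 - abs(cor_mat))

# Hierarchical clustering on that dissimilarity (linkage is an assumption).
hc <- hclust(dissim, method = "complete")

# Cut the tree into two groups and inspect which features fall together.
groups <- cutree(hc, k = 2)
print(groups)
```

Calling `plot(hc)` on the result draws the dendrogram, which is the same kind of picture that a clustering-based view of the correlation structure relies on.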
Now, let’s build a predictive model and see how triplot can help us explore the importance of those correlated features. For this purpose, we build a Random Forest model (from the ranger package) and wrap it in a DALEX explainer.
library("ranger")
set.seed(2020)
fifa_model <- ranger(value_eur ~ ., data = fifa)
fifa_explainer <- DALEX::explain(fifa_model,
                                 data = fifa[, -1],
                                 y = fifa$value_eur,
                                 label = "Random Forest")
Finally, it’s time to create a triplot for our model. Using the explainer we just created, we build a global triplot with the model_triplot function and plot it.
library("triplot")
fifa_triplot_global <- model_triplot(fifa_explainer,
                                     B = 1,
                                     N = 5000,
                                     cor_method = "pearson")
plot(fifa_triplot_global, margin_mid = 0)
Triplot shows, in one place:
- the global importance of every single feature (the left panel),
- correlation structure visualized by hierarchical clustering (the right panel),
- the importance of groups of variables determined by the hierarchical clustering (the middle panel).
Importance in the left and middle panels is measured by permutation-based feature importance, provided by the model_parts function available in DALEX.
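The idea behind permutation-based importance for a group can be sketched in a few lines of base R: shuffle all columns of the group with the same row permutation and measure how much the loss grows. The snippet below is an illustrative sketch on synthetic data with a linear model and RMSE as the loss. The helper names `rmse` and `perm_importance` are hypothetical, and this is not the DALEX implementation.

```r
# Synthetic data: x1 and x2 are strongly correlated, x3 is independent.
set.seed(2020)
n <- 2000
z <- rnorm(n)
df <- data.frame(
  x1 = z + rnorm(n, sd = 0.3),
  x2 = z + rnorm(n, sd = 0.3),
  x3 = rnorm(n)
)
df$y <- df$x1 + df$x2 + 0.5 * df$x3 + rnorm(n, sd = 0.5)

model <- lm(y ~ x1 + x2 + x3, data = df)

# Loss function: root mean squared error of the model on a data frame.
rmse <- function(model, data) {
  sqrt(mean((data$y - predict(model, newdata = data))^2))
}

# Permute the listed columns jointly (one row shuffle shared by all of
# them), repeat B times, and return the mean increase in RMSE over the
# unpermuted baseline.
perm_importance <- function(model, data, vars, B = 10) {
  baseline <- rmse(model, data)
  losses <- replicate(B, {
    shuffled <- data
    idx <- sample(nrow(data))
    shuffled[, vars] <- data[idx, vars]
    rmse(model, shuffled)
  })
  mean(losses) - baseline
}

imp <- c(
  x1    = perm_importance(model, df, "x1"),
  x2    = perm_importance(model, df, "x2"),
  x3    = perm_importance(model, df, "x3"),
  x1_x2 = perm_importance(model, df, c("x1", "x2"))
)
print(round(imp, 2))
```

Shuffling the group with a single shared permutation keeps the relationship between x1 and x2 intact while breaking their link to the target, which is why the group score is not simply the sum of the two single-variable scores.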
As we’ve already seen thanks to corrplot, ball_control and dribbling are strongly correlated (0.95); individually, their importance is around 2.9 and 2, respectively. The triplot shows that the group formed from these two variables has an importance of 3.75. When we add skill_curve and attacking_crossing to this group (both scored low in single-feature importance but are correlated with ball_control and dribbling), the importance of the group rises only slightly, to 3.95.
On the other hand, variables connected with goalkeeping abilities (positioning, diving, etc.) are also highly correlated, but as a whole group their importance is only 3. What’s more, extending the goalkeeping group with the goalkeeping_kicking skill doesn’t increase the group’s importance at all.
We can also observe that sliding and standing tackles (defending skills) are highly correlated (0.97), but grouping them together raises their importance only from 1.9 and 1.7 individually to 2.4.
By investigating the triplot further in this manner, we can gain insights that help us understand how the model treats correlated variables and that improve our feature selection efforts.
Final note: since triplot is based on correlation, it can be used only to explain numeric features.
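In practice, this means keeping only the numeric columns before computing correlations. A minimal base-R sketch follows, using a small hypothetical data frame (`players` and its columns are made up for illustration):

```r
# Hypothetical mixed-type data frame standing in for a real dataset.
players <- data.frame(
  age            = c(23, 31, 27),
  dribbling      = c(88, 70, 79),
  nationality    = c("Poland", "Spain", "Brazil"),
  preferred_foot = factor(c("left", "right", "right"))
)

# Flag numeric columns; vapply guarantees one logical value per column.
is_num <- vapply(players, is.numeric, logical(1))

# Keep only numeric features and record what was dropped.
players_numeric <- players[, is_num, drop = FALSE]
dropped <- names(players)[!is_num]

print(names(players_numeric))
print(dropped)
```

Recording the dropped columns makes it explicit which features are left out of a correlation-based explanation.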