SHAP Analysis in 9 Lines

[This article was first published on Blog – Michael's and Christian's Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Hello ML world

Recently, together with Yang Liu, we have been investing some time to extend the R package SHAPforxgboost. This package is designed to make beautiful SHAP plots for XGBoost models, using the native treeshap implementation shipped with XGBoost.

Some of the new features of SHAPforxgboost

  • Added support for LightGBM models, using the native treeshap implementation for LightGBM. So don’t get tricked by the package name “SHAPforxgboost” :-).
  • The function shap.plot.dependence() has received the option to select the heuristically strongest interacting feature on the color scale, see last section for details.
  • shap.plot.dependence() now allows jitter and alpha transparency.
  • The new function shap.importance() returns SHAP importances without plotting them.
  • Added vignette with basic workflow to CRAN.
  • Added logo:
logo.png

An interesting alternative to calculate and plot SHAP values for different tree-based models is the treeshap package by Szymon Maksymiuk et al. Keep an eye on this one – it is actively being developed!

What is SHAP?

A couple of years ago, the concept of Shapely values from game theory from the 1950ies was discovered e.g. by Scott Lundberg as an interesting approach to explain predictions of ML models.

The basic idea is to decompose a prediction in a fair way into additive contributions of features. Repeating the process for many predictions provides a brilliant way to investigate the model as a whole.

The main resource on the topic is Scott Lundberg’s site. Besides this, I’d recomment to go through these two fantastic blog posts, even if you already know what SHAP values are:

Illustration

As an example, we will try to model log house prices of 20’000 sold houses in Kings County. The dataset is available e.g. on OpenML.org under ID 42092.

Some rows and columns from the Kings County house dataset.

Fetch and prepare data

We start by downloading the data and preparing it for modelling.

library(farff)
library(OpenML)
library(dplyr)
library(xgboost)
library(ggplot2)
library(SHAPforxgboost)

# Load King Country house prices dataset on OpenML
# ID 42092, https://www.openml.org/d/42092
df <- getOMLDataSet(data.id = 42092)$data
head(df)

# Prepare
df <- df %>%
  mutate(
    log_price = log(price),
    log_sqft_lot = log(sqft_lot),
    year = as.numeric(substr(date, 1, 4)),
    building_age = year - yr_built,
    zipcode = as.integer(as.character(zipcode))
  )

# Define response and features
y <- "log_price"
x <- c("grade", "year", "building_age", "sqft_living",
       "log_sqft_lot", "bedrooms", "bathrooms", "floors", "zipcode",
       "lat", "long", "condition", "waterfront")

# random split
set.seed(83454)
ix <- sample(nrow(df), 0.8 * nrow(df))

Fit XGBoost model

Next, we fit a manually tuned XGBoost model to the data.

dtrain <- xgb.DMatrix(data.matrix(df[ix, x]),
                      label = df[ix, y])
dvalid <- xgb.DMatrix(data.matrix(df[-ix, x]),
                      label = df[-ix, y])

params <- list(
  objective = "reg:squarederror",
  learning_rate = 0.05,
  subsample = 0.9,
  colsample_bynode = 1,
  reg_lambda = 2,
  max_depth = 5
)

fit_xgb <- xgb.train(
  params,
  data = dtrain,
  watchlist = list(valid = dvalid),
  early_stopping_rounds = 20,
  print_every_n = 100,
  nrounds = 10000 # early stopping
)

The resulting model consists of about 600 trees and reaches a validation RMSE of 0.16. This means that about 2/3 of the predictions are within 16% of the observed price, using the empirical rule.

Compact SHAP analysis

ML models are rarely of any use without interpreting its results, so let's use SHAP to peak into the model.

The analysis includes a first plot with SHAP importances. Then, with decreasing importance, dependence plots are shown to get an impression on the effects of each feature.

# Step 1: Select some observations
X <- data.matrix(df[sample(nrow(df), 1000), x])

# Step 2: Crunch SHAP values
shap <- shap.prep(fit_xgb, X_train = X)

# Step 3: SHAP importance
shap.plot.summary(shap)

# Step 4: Loop over dependence plots in decreasing importance
for (v in shap.importance(shap, names_only = TRUE)) {
  p <- shap.plot.dependence(shap, v, color_feature = "auto", 
                            alpha = 0.5, jitter_width = 0.1) +
    ggtitle(v)
  print(p)
}

Some of the plots are shown below. The code actually produces all plots, see the corresponding html output on github.

Figure 1: SHAP importance for XGBoost model. The results make intuitive sense. Location and size are among the strongest predictors.
Figure 2: SHAP dependence for the second strongest predictor. The larger the living area, the higher the log price. There is not much vertical scatter, indicating that living area acts quite additively on the predictions on the log scale.
Figure 3: SHAP dependence for a less important predictor. The effect of "condition" 4 vs 3 seems to depend on the zipcode (see the color). For some zipcodes, the condition does not have a big effect on the price, while for other zipcodes, the effect is clearly higher.

Same workflow for LightGBM

Let's try out the SHAPforxgboost package with LightGBM.

Note: LightGBM Version 3.2.1 on CRAN is not working properly under Windows. This will be fixed in the next release of LightGBM. As a temporary solution, you need to build it from the current master branch.

library(lightgbm)

dtrain <- lgb.Dataset(data.matrix(df[ix, x]),
                      label = df[ix, y])
dvalid <- lgb.Dataset(data.matrix(df[-ix, x]),
                      label = df[-ix, y])

params <- list(
  objective = "regression",
  learning_rate = 0.05,
  subsample = 0.9,
  reg_lambda = 2,
  num_leaves = 15
)

fit_lgb <- lgb.train(
  params,
  data = dtrain,
  valids = list(valid = dvalid),
  early_stopping_rounds = 20,
  eval_freq = 100,
  eval = "rmse",
  nrounds = 10000
)

Early stopping on the validation data selects about 900 trees as being optimal and results in a validation RMSE of also 0.16.

SHAP analysis

We use exactly the same short snippet to analyze the model by SHAP.

X <- data.matrix(df[sample(nrow(df), 1000), x])
shap <- shap.prep(fit_lgb, X_train = X)
shap.plot.summary(shap)

for (v in shap.importance(shap, names_only = TRUE)) {
  p <- shap.plot.dependence(shap, v, color_feature = "auto", 
                            alpha = 0.5, jitter_width = 0.1) +
    ggtitle(v)
  print(p)
}

Again, we only show some of the output and refer to the html of the corresponding rmarkdown. Overall, the model seems to be very similar to the one obtained by XGBoost.

Figure 4: SHAP importance for LightGBM. By chance, the order of importance is the same as for XGBoost.
Figure 5: The dependence plot for the living area also looks identical in shape than for the XGBoost model.

How does the dependence plot selects the color variable?

By default, Scott's shap package for Python uses a statistical heuristic to colorize the points in the dependence plot by the variable with possibly strongest interaction. The heuristic used by SHAPforxgboost is slightly different and directly uses conditional variances. More specifically, the variable X on the x-axis as well as each other feature Z_k is binned into categories. Then, for each Z_k, the conditional variance across binned X and Z_k is calculated. The Z_k with the highest conditional variance is selected as the color variable.

Note that the heuristic does not depend on "shap interaction values" in order to save time (and because these would not be available for LightGBM).

The following simple example shows how/that it is working. First, a dataset is created and a model with three features and strong interaction between x1 and x2 is being fitted. Then, we look at the dependence plots to see if it is consistent with the model/data situation.

n <- 1000

set.seed(334)

df <- data.frame(
  x1 = runif(n),
  x2 = runif(n),
  x3 = runif(n)
  ) %>% 
  mutate(
    y = x1 * x2 + x3 + runif(n)
  )
x <- c("x1", "x2", "x3")
dtrain <- lgb.Dataset(data.matrix(df[, x]),
                      label = df[, "y"])

params <- list(
  objective = "regression",
  learning_rate = 0.05,
  subsample = 0.9,
  reg_lambda = 2,
  num_leaves = 15
)

fit_lgb <- lgb.train(
  params,
  data = dtrain,
  eval = "rmse",
  nrounds = 100
)

shap <- shap.prep(fit_lgb, X_train = data.matrix(df[, x]))
shap.plot.summary(shap)

shap.plot.dependence(shap, "x1", color_feature = "auto")
shap.plot.dependence(shap, "x2", color_feature = "auto")
shap.plot.dependence(shap, "x3", color_feature = "auto")

Here the dependence plots for x1 and x3.

Figure 6: The dependence plots for x1 shows a clear interaction effect with the color variable x2. This is as simulated in the data.
Figure 7: The dependence plots for x3 does not show clear interaction effects, consistent with the data situation.

The full R script and rmarkdown file of this post can be found on github.

To leave a comment for the author, please follow the link and comment on their blog: Blog – Michael's and Christian's Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)