Be Aware of Bias in RF Variable Importance Metrics

[This article was first published on Rstats on pi: predict/infer, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Random forests are typically used as “black box” models for prediction, but they can return relative importance metrics associated with each feature in the model. These can be used to help interpretability and give a sense of which features are powering the predictions. Importance metrics can also assist in feature selection in high dimensional data. Careful attention should be paid to the data you are working with and when it is appropriate to use and interpret the different variable importance metrics from random forests.

Words of caution

A recent blog post from a team at the University of San Francisco shows that default importance strategies in both R (randomForest) and Python (scikit) are unreliable in many data scenarios. Particularly, mean decrease in impurity importance metrics are biased when potential predictor variables vary in their scale of measurement or their number of categories.

It is also known that importance metrics are biased when predictor variables are highly correlated, leading to suboptimal predictor variables being artificially preferred. This has actually been known for over ten years (Strobl et al, 2007 and Strobl et al, 2008), but it can be easy to assume the default importances of popular packages will fit your unique datasets.

The papers and blog post demonstrate how continuous and high cardinality variables are preferred in mean decrease in impurity importance rankings, even if they are equally uninformative compared to variables with less categories. The authors suggest using permutation importance instead of the default in these cases. If the predictor variables in your model are highly correlated, conditional permutation importance is suggested.

Mean decrease in impurity (Gini) importance

The mean decrease in impurity (Gini) importance metric describes the improvement in the “Gini gain” splitting criterion (for classification only), which incorporates a weighted mean of the individual trees’ improvement in the splitting criterion produced by each variable The gini impurity index is defined as:

\[G = \sum_{i=1}^{n_c} p_i(1 – p_i) = 1 – \sum_{i=1}^{n_c} p_i^2\]

where \(n_c\) is the number of classes in the target variable and \(p_i\) is the ratio of this class. In other words, it measures the disorder of a set of elements. It is calculated as the probability of mislabeling an element assuming that the element is randomly labeled according to the distribution of all the classes in the set. For regression, the analagous metric to the Gini index would be the RSS (residual sum of squares).

To see an example of how the Gini index is calculated, let’s use the iris data set. For the purposes of this post, we’ll convert the Petal.Length and Petal.Width features to factors, rounding to the nearest integer to decrease the number of “categories” in each variable.


iris <- iris %>% 
  as_tibble() %>% 
  mutate_at(vars(starts_with("Petal")), funs(as.factor(round(., digits = 0))))

iris %>% 
## Skim summary statistics
##  n obs: 150 
##  n variables: 5 
## Variable type: factor 
##      variable missing complete   n n_unique
##  Petal.Length       0      150 150        7
##   Petal.Width       0      150 150        3
##       Species       0      150 150        3
##                        top_counts ordered
##        5: 35, 4: 34, 2: 26, 1: 24   FALSE
##        2: 64, 0: 49, 1: 37, NA: 0   FALSE
##  set: 50, ver: 50, vir: 50, NA: 0   FALSE
## Variable type: numeric 
##      variable missing complete   n mean   sd  p0 p25 p50 p75 p100     hist
##  Sepal.Length       0      150 150 5.84 0.83 4.3 5.1 5.8 6.4  7.9 ▂▇▅▇▆▅▂▂
##   Sepal.Width       0      150 150 3.06 0.44 2   2.8 3   3.3  4.4 ▁▂▅▇▃▂▁▁

There are now 3 unique categories for Petal.Width, and 7 unique categories for Petal.Length. We leave Sepal.Length and Sepal.Width as continuous variables.

Here is an example of calculating the Gini index at a couple of randomly chosen splits using the iris dataset:

The decrease in impurity, \(I\), is then defined as:

\[I = G_{parent} – P_{split1}G_{split1} – P_{split2}G_{split2}\] where \(G\) is the Gini index for each node and \(P\) is the proportion of the data each split takes up relative to the parent node.

Continuing the above example, the decrease in impurity for the first split on Petal.Length is:

\[I = 0.667 – (87/150)*0.4983 – (63/150)*0.3457 = 0.232\]

and for the second split on Sepal.Width:

\[I = 0.3457 – (41/63)*0.3426 – (22/63)*0.3512 = 0.00009\]

These are only the calculations for one example tree, not including all potential splits. To calculate the actual mean decrease in impurity importance metric for each variable, these values would be averaged over all splits in the forest involving the variable in question.

Doing these calculations by hand would take time, but this metric is not very computationally expensive for a machine. Although it can be calculated fast, it has been shown to be biased in favor of continuous and high cardinality variables.

Why is impurity importance biased?

Each time a break point is selected in a variable, every level of the variable is tested to find the best break point. Continuous or high cardinality variables will have many more split points, which results in the “multiple testing” problem. That is, there is a higher probability that by chance that variable happens to predict the outcome well, since variables where more splits are tried will appear more often in the tree.

We’ll continue with the iris example and use the following plotting function from the authors of the blog post at USF to create the plots showing the bias in importance rankings:


create_rfplot <- function(rf, type){
  imp <- importance(rf, type = type, scale = F)
  featureImportance <- data.frame(Feature = row.names(imp), Importance = imp[,1])
  p <- ggplot(featureImportance, aes(x = reorder(Feature, Importance), y = Importance)) +
       geom_bar(stat = "identity", fill = "#53cfff", width = 0.65) +
       coord_flip() + 
       theme_light(base_size = 20) +
       theme(axis.title.x = element_text(size = 15, color = "black"),
             axis.title.y = element_blank(),
             axis.text.x  = element_text(size = 15, color = "black"),
             axis.text.y  = element_text(size = 15, color = "black")) 

The function plots the importance metrics on the x-axis and the variables on the y-axis, visualizing the relative importance rankings of each variable.

In the randomForest package, type = 2 is the default, reporting the mean decrease in impurity importance metrics. The equivalent argument in ranger is type = "impurity".

What are the mean decrease in impurity importance rankings of these features?


rf1 <- randomForest(
  Species ~ .,  
  ntree = 40,
  data = iris,
  nodesize = 1, 
  replace = FALSE,
  importance = TRUE

create_rfplot(rf1, type = 2)

We see the petal features are ranked the highest, and sepal width is the lowest.

Note: Always check whether you get the same results with a different random seed before interpreting any importance rankings. If the results change, increase the number of trees with ntree.

Now let’s create a continuous “random” column, which should be last in importance. How does the mean decrease in impurity rank this random variable?

iris_random <- iris %>% 
  mutate(random = sample(100, size = nrow(iris), replace = TRUE))

rf2 <- randomForest(
  Species ~ .,  
  ntree = 40,
  data = iris_random, # with random column
  nodesize = 1, 
  replace = FALSE,
  importance = TRUE

create_rfplot(rf2, type = 2)

The importance rankings show this random column is more important to the model than Sepal.Width! This doesn’t line up with what we would expect. The mean decrease in impurity importance is favoring the random column since it has many more splits to try than Sepal.Width:

iris_random %>% 
  summarize_at(vars(random, Sepal.Width), n_distinct)
## # A tibble: 1 x 2
##   random Sepal.Width
##    <int>       <int>
## 1     79          23

This shows that we can’t trust the mean decrease in impurity importance if our predictor variables are of different types or differ in there number of categories.

Permutation importance

As an alternative, we can use permutation importance. The basic idea is to consider a variable important if it has a positive effect on the prediction accuracy (classification), or MSE (regression). This metric is applicable to any model, not just random forests. The following steps are taken to calculate permutation importance. In a random forest modeling approach, the individual models would be each tree that is grown in the forest.

  1. Build a first model, and calculate the prediction accuracy in the OOB observations
  2. Any association between the variable of interest, \(X_i\), and the outcome is broken by permuting the values of all observations for \(X_i\), and the prediction accuracy is computed again.
  3. The difference between the two accuracy values is the permutation importance for \(X_i\) from a single model.
  4. The average of all single model importance values in a set of models then gives the permutation importance of this variable.
  5. Repeat for all variables of interest.

Permutation importance is more reliable, albeit more computationally expensive than mean decrease in impurity. It does not require retraining the model after permuting each column, you just have to re-run the permuted sample through the already-trained model. The risk of using this metric is a potential bias towards collinear predictive variables. To calculate the permutation importance metric in R, use type = 1 in randomForest, or type = "permutation" in ranger.

How does permutation importance rank the original iris variables?

create_rfplot(rf1, type = 1) # no random column

How do these rankings change after adding a continuous random column?

create_rfplot(rf2, type = 1) # with random column

Using permutation importance here lines up with what we would expect: the random column is not important at all to the model compared to the actual features.

Dealing with collinear features - Conditional permutation importance

The mean decrease in impurity and permutation importance computed from random forest models spread importance across collinear variables. For example, if you duplicate a feature and re-evaluate importance, the duplicated feature pulls down the importance of the original, so they are close to equal in importance.

The conditional permutation importance method is the suggested importance metric to use in the case of having collinear features. The steps to compute are the following:

  1. In each tree, compute the OOB-prediction accuracy before the permutation.
  2. Determine \(Z\), the number of variables to be conditioned on. It is suggested to include only variables whose correlation with the variable of interest \(X_j\) exceeds a threshold (i.e. 0.2).
  3. For all variables \(Z\) to be conditioned on, take the cutpoints that split this variable in the current tree and create a grid by bisecting the sample space at each cutpoint.
  4. Within this grid permute the values of \(X_j\) and compute the OOB-prediction accuracy after permutation
  5. The difference between the prediction accuracy before and after the permutation gives the importance of \(X_j\) for one tree. The importance of \(X_j\) for the forest is again computed as an average over all trees.

Note that this type of conditional importance metric is only applicable to tree-based models, as the conditioning is taking place on the partitions of the specific variables in each tree. It will also be much more computationally expensive than unconditional permutation importance. If you don’t have collinearity issues and have small enough data (or very powerful computational resources and no time constraints), it still might be worth using conditional permutation importance to get the most reliable importance metrics.

The party::cforest() function can be used to fit the random forest model and party::varimp(conditional = TRUE) will compute importance metrics. The default threshold in party::varimp() is 0.2. More details about conditional importance computations can be found in the Strobl et al, 2008 paper.

Let’s look at an example using the readingSkills dataset. The dataset contains the hypothetical score on a test of reading skills for 200 school children. Potential predictor variables in the data set are age, whether the child is a native speaker of the test language, and the shoe size of the child.


readingSkills <- readingSkills %>%

readingSkills %>% 
## Skim summary statistics
##  n obs: 200 
##  n variables: 4 
## Variable type: factor 
##       variable missing complete   n n_unique               top_counts
##  nativeSpeaker       0      200 200        2 no: 100, yes: 100, NA: 0
##  ordered
##    FALSE
## Variable type: integer 
##  variable missing complete   n mean   sd p0 p25 p50  p75 p100     hist
##       age       0      200 200 7.92 1.89  5   6   8 9.25   11 ▆▆▇▇▁▆▆▅
## Variable type: numeric 
##  variable missing complete   n  mean   sd    p0   p25   p50   p75  p100
##     score       0      200 200 40.66 8.53 25.26 33.94 40.33 47.57 56.71
##  shoeSize       0      200 200 27.87 2.04 23.17 26.23 27.85 29.49 32.33
##      hist
##  ▃▆▇▇▇▅▅▃
##  ▁▃▆▇▇▆▅▂
readingSkills %>% 
  mutate(nativeSpeaker = ifelse(nativeSpeaker == "yes", 1, 0)) %>% 
##               nativeSpeaker       age  shoeSize     score
## nativeSpeaker     1.0000000 0.1355429 0.0958855 0.4715154
## age               0.1355429 1.0000000 0.8969335 0.9355271
## shoeSize          0.0958855 0.8969335 1.0000000 0.8315745
## score             0.4715154 0.9355271 0.8315745 1.0000000

The shoe size is not a sensible predictor of reading skills, but is highly correlated with the outcome, test scores. This spurious correlation is only due to the fact that both shoe size and test score are associated with the underlying age variable.

First, let’s see how permutation importance ranks these variables:

rf3 <- cforest(
  score ~ .,
  data = readingSkills,
  control = cforest_unbiased(mtry = 2, ntree = 500)

create_crfplot <- function(rf, conditional = TRUE){
  imp <- rf %>%
    varimp(conditional = conditional) %>% 
    as_tibble() %>% 
    rownames_to_column("Feature") %>% 
    rename(Importance = value)
  p <- ggplot(imp, aes(x = reorder(Feature, Importance), y = Importance)) +
       geom_bar(stat = "identity", fill = "#53cfff", width = 0.65) +
       coord_flip() + 
       theme_light(base_size = 20) +
       theme(axis.title.x = element_text(size = 15, color = "black"),
             axis.title.y = element_blank(),
             axis.text.x  = element_text(size = 15, color = "black"),
             axis.text.y  = element_text(size = 15, color = "black")) 

# not conditional
create_crfplot(rf3, conditional = FALSE)

Oddly, shoeSize is ranked higher than nativeSpeaker. This is likely because shoe size is highly correlated with age. We can eliminate this effect with conditional permutation importance.

# conditional
create_crfplot(rf3, conditional = TRUE)

With the conditional permutation importance metrics, the spurious correlation between shoe size and reading skills is uncovered. Only with conditional importance do we see that the native speaker variable is actually more important for predicting the test score than shoe size.

Other Considerations


The Strobl et al (2007) paper also demonstrates how subsampling with and without replacement affects conditional permutation importance rankings. Bootstrap sampling with replacement is traditionally employed in random forests, but when computing conditional permutation importance it affects variable selection frequencies and artificially induces an association between variables. When bootstrap sampling without replacement, the bias of conditional permutation importance towards continuous/categorical variables with a high number of categories is minimized. Bootstrap sampling with replacement does not affect the relative importance rankings resulting from mean decrease in impurity (Gini) or unconditional permutation importance. Bootstrap sampling without replacement should only be used for the evaluation of conditional variable importance purposes only, not for building the predictive model. Note the default argument in cforest is replace = FALSE and in randomForest is replace = TRUE, so no adjustments overriding the defaults are necessary when computing importance metrics.


Does scaling (dividing by SD) the importance metrics help in certain scenarios? In most cases, we don’t care what the values are per se, we only care about the relative strengths. Scaling the importance metrics will only scale the values, but does not alter the ranks. The Strobl et al (2008) paper also argues that raw, unscaled importance has better statistical properties. Note that the scale argument in randomForest::importance() defaults to TRUE and the scale.permutation.importance argument in ranger defaults to FALSE. Although it shouldn’t affect the variable importance rankings, set scale = FALSE in randomForest::importance() to output the raw, unscaled importance measures.

Drop-column importance

Another potential model-agnostic approach to decrease bias in importance metrics with datasets containing collinear features is drop-column importance. The basic idea is to get a baseline performance score as with permutation importance, but then drop a column entirely, retrain the model, and recompute the performance score. The importance value of a feature is then the difference between the baseline and the score from the model missing that feature. The downside of drop-column importance is that it is quite computationally expensive, since it requires repeatedly retraining the model.


In order to be able to reliably interpret the variable importance measures of a random forest, the forest must be built from unbiased classification trees, and sampling should be conducted without replacement.

The following table shows the least computationally expensive importance metric for different data scenarios:

Variable Types Collinearity Cheapest Unbiased Importance Metric Package/Function for cheapest unbiased importance metric
All continuous or all categorical with the same number of categories No mean decrease in impurity randomForest::importance(type = 2)
Mix of categorical and continuous No permutation importance randomForest::importance(type = 1)
Mix of categorical and continuous Yes conditional permutation importance party::cforest() and party::varimp(conditional = TRUE)

To leave a comment for the author, please follow the link and comment on their blog: Rstats on pi: predict/infer. offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)