Variable Importance with Correlated Features

Posted on November 6, 2015 by Arthur Charpentier in R bloggers

[This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers].

Variable importance graphs are a great tool to see which variables matter in a model. Since we usually use them with random forests, they seem to work well with (very) large datasets. The problem with large datasets is that a lot of features are ‘correlated’, and in that case, the values in variable importance plots are hard to interpret and to compare. Consider for instance a very simple linear model (the ‘true’ model, used to generate data)

$Y=\beta_0+\beta_1 X_{1}+\beta_3 X_{3}+\varepsilon$

Here, we use a random forest to model the relationship between the features and $Y$, but we also consider another feature, not used to generate the data, $\color{blue}{X_2}$, that is correlated with $\color{black}{X_1}$. So we fit a random forest on those three features, $\widehat{Y}=\text{\sffamily rf}(X_{1},\color{blue}{X_2},\color{black}{X_{3}})$.
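To make the setup concrete, here is a minimal sketch of a single dataset of that kind. It is an illustration only: it builds the correlated pair in base R (rather than with a multivariate normal generator), and the names `r`, `Z` and `db`, as well as the value 0.9 and the seed, are assumptions chosen for the example.

```r
# Sketch: one dataset where X2 is correlated with X1 but plays no role in Y.
set.seed(123)            # arbitrary seed, only for reproducibility
n <- 1000
r <- 0.9                 # assumed correlation between X1 and X2

X1 <- rnorm(n)
Z  <- rnorm(n)           # independent noise used to build X2
X2 <- r * X1 + sqrt(1 - r^2) * Z   # unit variance, cor(X1, X2) = r in theory
X3 <- rnorm(n)

# 'True' model: Y depends on X1 and X3 only
Y <- 1 + 2 * X1 - 2 * X3 + rnorm(n)
db <- data.frame(Y = Y, X1 = X1, X2 = X2, X3 = X3)

print(round(cor(db[, c("X1", "X2", "X3")]), 2))
```

The empirical correlation matrix should show a value close to `r` for the pair (X1, X2) and values close to 0 for the pairs involving X3.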

In order to get more robust results, I generate 100 datasets, each of size 1,000.

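library(mnormt)
library(randomForest)

# Average random forest importance of (X1, X2, X3) when cor(X1, X2) = r
impact_correl=function(r=.9){
  nsim=100   # number of simulated datasets (the 100 mentioned above)
  IMP=matrix(NA,3,nsim)
  n=1000     # size of each dataset
  R=matrix(c(1,r,r,1),2,2)
  for(s in 1:nsim){
    X1=rmnorm(n,varcov=R)        # two columns: X1 and the correlated X2
    X3=rnorm(n)
    Y=1+2*X1[,1]-2*X3+rnorm(n)   # X2 is not used to generate Y
    db=data.frame(Y=Y,X1=X1[,1],X2=X1[,2],X3=X3)
    RF=randomForest(Y~.,data=db)
    IMP[,s]=importance(RF)
  }
  apply(IMP,1,mean)}

C=c(seq(0,.6,by=.1),seq(.65,.9,by=.05),.99,.999)
VI=matrix(NA,3,length(C))
for(i in 1:length(C)){VI[,i]=impact_correl(C[i])}

# ylim covers all three curves, not only the first one drawn
plot(C,VI[1,],type="l",col="red",ylim=range(VI))
lines(C,VI[2,],col="blue")
lines(C,VI[3,],col="purple")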

The purple line on top is the variable importance of $X_{3}$, which is rather stable (almost constant, as a first-order approximation). The red line is the variable importance of $\color{black}{X_1}$, while the blue line is the variable importance of $\color{blue}{X_2}$. For instance, take the variable importance plot obtained with two very correlated variables.

It looks like $X_{3}$ is much more important than the other two, which is, somehow, not the case. It is just that the model cannot choose between $\color{black}{X_1}$ and $\color{blue}{X_2}$: sometimes $\color{black}{X_1}$ is selected, and sometimes it is $\color{blue}{X_2}$. I find that graph confusing because I would probably expect the importance of $\color{black}{X_1}$ to be constant. It looks like we have a plot of the importance of each variable, given the existence of all the other variables.
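One way to see this "given all the other variables" effect is to fit the forest twice on the same data, once with $\color{blue}{X_2}$ and once without it, and compare the importance of $\color{black}{X_1}$. The following is a sketch under assumptions, not the computation behind the graph above: it uses a single dataset, a correlation of 0.9, an arbitrary seed, and a base R construction of the correlated pair. With the default settings used here, `importance()` returns the node impurity measure (column `IncNodePurity`).

```r
library(randomForest)

set.seed(1)              # arbitrary seed, only for reproducibility
n <- 1000
r <- 0.9                 # assumed correlation between X1 and X2

X1 <- rnorm(n)
X2 <- r * X1 + sqrt(1 - r^2) * rnorm(n)
X3 <- rnorm(n)
Y  <- 1 + 2 * X1 - 2 * X3 + rnorm(n)
db <- data.frame(Y = Y, X1 = X1, X2 = X2, X3 = X3)

# Same data, two feature sets: with and without the redundant X2
rf_with    <- randomForest(Y ~ X1 + X2 + X3, data = db)
rf_without <- randomForest(Y ~ X1 + X3, data = db)

imp_with    <- importance(rf_with)[, "IncNodePurity"]
imp_without <- importance(rf_without)[, "IncNodePurity"]

print(round(imp_with))
print(round(imp_without))
```

If the reading above is right, the importance of X1 should be noticeably smaller in the first forest, where X2 can take some of its splits, than in the second one.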

Actually, what I have in mind is what we get with a stepwise procedure, when we remove each variable in turn from the set of features:

library(mnormt)
# Average AIC of four linear models when cor(X1, X2) = r
impact_correl=function(r=.9){
  nsim=100
  IMP=matrix(NA,4,nsim)
  n=1000
  R=matrix(c(1,r,r,1),2,2)
  for(s in 1:nsim){
    X1=rmnorm(n,varcov=R)
    X3=rnorm(n)
    Y=1+2*X1[,1]-2*X3+rnorm(n)
    db=data.frame(Y=Y,X1=X1[,1],X2=X1[,2],X3=X3)
    IMP[1,s]=AIC(lm(Y~X1+X2+X3,data=db))   # full model
    IMP[2,s]=AIC(lm(Y~X2+X3,data=db))      # X1 removed
    IMP[3,s]=AIC(lm(Y~X1+X3,data=db))      # X2 removed
    IMP[4,s]=AIC(lm(Y~X1+X2,data=db))      # X3 removed
  }
  apply(IMP,1,mean)}

Here, we get the following graph

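VI2=matrix(NA,4,length(C))   # one row per model, one column per correlation
for(i in 1:length(C)){VI2[,i]=impact_correl(C[i])}

plot(C,VI2[2,],type="l",col="red",ylim=range(VI2[2:4,]))
lines(C,VI2[3,],col="blue")
lines(C,VI2[4,],col="purple")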

The purple line is obtained when we remove $X_{3}$: it is the worst model (the highest AIC). When we keep $\color{black}{X_1}$ and $X_{3}$, we get the blue line. And this line is constant: the quality of the model does not depend on $\color{blue}{X_2}$ (this is what puzzled me in the previous graph, that having $\color{blue}{X_2}$ does have an impact on the importance of $\color{black}{X_1}$). The red line is what we get when we remove $\color{black}{X_1}$. With 0 correlation, it is the same as the purple line, and we get a poor model. With a correlation close to 1, it is the same as having $\color{black}{X_1}$, and we get the same as the blue line.
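The two end points of the red line can be checked with a short calculation. With standard Gaussian features and correlation $r$ between $\color{black}{X_1}$ and $\color{blue}{X_2}$, the part of $\color{black}{X_1}$ that $\color{blue}{X_2}$ cannot explain has variance $1-r^2$, so the model without $\color{black}{X_1}$ should have a residual variance close to $1+4(1-r^2)$: that is 5 when $r=0$ (the same as removing $X_{3}$) and 1 when $r$ is close to 1 (the same as the full model). The sketch below compares that formula with simulated data; the sample size, the seed, the grid of correlations and the helper name `resid_var_without_X1` are assumptions made for the illustration.

```r
set.seed(1)              # arbitrary seed, only for reproducibility
n <- 100000              # large sample, so estimates are close to theory

# Residual variance of the linear model that drops X1 but keeps X2 and X3
resid_var_without_X1 <- function(r) {
  X1 <- rnorm(n)
  X2 <- r * X1 + sqrt(1 - r^2) * rnorm(n)   # cor(X1, X2) = r in theory
  X3 <- rnorm(n)
  Y  <- 1 + 2 * X1 - 2 * X3 + rnorm(n)
  sigma(lm(Y ~ X2 + X3))^2
}

rs <- c(0, 0.5, 0.9, 0.999)
empirical   <- sapply(rs, resid_var_without_X1)
theoretical <- 1 + 4 * (1 - rs^2)

print(round(cbind(r = rs, empirical, theoretical), 2))
```

Since the AIC of a Gaussian linear model grows with the logarithm of the residual variance, this is consistent with a red line that starts at the level of the purple one and ends at the level of the blue one.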

Nevertheless, discussing the importance of features when we have a lot of correlated features is not that intuitive…
