Variable importance graphs are a great tool to see which variables matter in a model. Since we usually use them with random forests, they seem to work well even on (very) large datasets. The problem with large datasets is that many features are 'correlated', and in that case the values on a variable importance plot become hard to interpret and to compare. Consider for instance a very simple linear model (the 'true' model, used to generate the data): Y = 1 + 2*X1 - 2*X3 + ε, with standard Gaussian noise ε.
Here, we use a random forest to model the relationship between Y and the features, but actually, we consider another feature, X2 - not used to generate the data - which is correlated with X1 (with correlation r).
In order to get some more robust results, I generate 100 datasets, of size 1,000, for each value of the correlation r.
library(mnormt)
library(randomForest)
impact_correl=function(r=.9){
  nsim=100
  IMP=matrix(NA,3,nsim)
  n=1000
  R=matrix(c(1,r,r,1),2,2)
  for(s in 1:nsim){
    X1=rmnorm(n,varcov=R)        # (X1,X2) Gaussian pair with correlation r
    X3=rnorm(n)
    Y=1+2*X1[,1]-2*X3+rnorm(n)   # X2 plays no role in the true model
    db=data.frame(Y=Y,X1=X1[,1],X2=X1[,2],X3=X3)
    RF=randomForest(Y~.,data=db)
    IMP[,s]=importance(RF)       # one importance value per feature
  }
  apply(IMP,1,mean)              # average over the nsim datasets
}
C=c(seq(0,.6,by=.1),seq(.65,.9,by=.05),.99,.999)   # correlations considered
VI=matrix(NA,3,length(C))
for(i in 1:length(C)){VI[,i]=impact_correl(C[i])}
plot(C,VI[1,],type="l",col="red")   # importance of X1
lines(C,VI[2,],col="blue")          # importance of X2
lines(C,VI[3,],col="purple")        # importance of X3
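As an aside, rmnorm(n, varcov=R) simply draws Gaussian pairs with the prescribed correlation; the same thing can be done in base R with a Cholesky factor of the correlation matrix (a minimal sketch, no mnormt needed):

```r
set.seed(1)
n <- 1000
r <- 0.9
R <- matrix(c(1, r, r, 1), 2, 2)   # target correlation matrix
U <- chol(R)                       # upper-triangular factor, t(U) %*% U == R
Z <- matrix(rnorm(2 * n), n, 2)    # independent standard Gaussians
X <- Z %*% U                       # correlated pair, like rmnorm(n, varcov = R)
cor(X[, 1], X[, 2])                # close to 0.9
```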
The purple line on top is the variable importance of X3, the feature which is not correlated with the others. The red line is the importance of X1: as the correlation between X1 and X2 increases, it decreases, while the importance of X2 (the blue line, the feature not used to generate the data) increases.
It looks like the importance is shared between the two correlated features, so neither value reflects the role the variable actually plays in the true model.
Actually, what I have in mind is closer to a stepwise procedure: remove each variable in turn from the set of features, and look at the impact on the quality of the fit, here measured by the AIC,
library(mnormt)
impact_correl=function(r=.9){   # redefined: AIC-based importance
  nsim=100
  IMP=matrix(NA,4,nsim)
  n=1000
  R=matrix(c(1,r,r,1),2,2)
  for(s in 1:nsim){
    X1=rmnorm(n,varcov=R)
    X3=rnorm(n)
    Y=1+2*X1[,1]-2*X3+rnorm(n)
    db=data.frame(Y=Y,X1=X1[,1],X2=X1[,2],X3=X3)
    IMP[1,s]=AIC(lm(Y~X1+X2+X3,data=db))   # full model (benchmark)
    IMP[2,s]=AIC(lm(Y~X2+X3,data=db))      # X1 removed
    IMP[3,s]=AIC(lm(Y~X1+X3,data=db))      # X2 removed
    IMP[4,s]=AIC(lm(Y~X1+X2,data=db))      # X3 removed
  }
  apply(IMP,1,mean)}
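As a quick sanity check on this AIC-based measure, removing a feature that really enters the true model should increase the AIC sharply. A minimal base-R sketch, independent of the simulation above (the exact values depend on the draw):

```r
set.seed(42)
n  <- 1000
x1 <- rnorm(n)
x3 <- rnorm(n)
y  <- 1 + 2 * x1 - 2 * x3 + rnorm(n)   # same true model, no correlated proxy
aic_full    <- AIC(lm(y ~ x1 + x3))    # benchmark
aic_drop_x1 <- AIC(lm(y ~ x3))         # x1 removed
aic_drop_x1 - aic_full                 # large and positive: x1 matters
```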
Here, we get the following graph (the higher the AIC once a variable is removed, the more that variable matters)
VI2=matrix(NA,4,length(C))
for(i in 1:length(C)){VI2[,i]=impact_correl(C[i])}
plot(C,VI2[2,],type="l",col="red")   # AIC without X1
lines(C,VI2[3,],col="blue")          # AIC without X2
lines(C,VI2[4,],col="purple")        # AIC without X3
The purple line is obtained when we remove X3: whatever the correlation between X1 and X2, removing X3 clearly degrades the fit. Removing X2 barely changes the AIC, since X2 carries no information beyond X1, while removing X1 matters less and less as the correlation grows, because X2 then becomes an almost perfect substitute.
Nevertheless, discussing the importance of features when we have a lot of correlated features is not that intuitive…
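To make that last point concrete, here is a minimal sketch (base R, using a Cholesky factor instead of mnormt): when the correlation is close to 1, replacing X1 by X2 in the regression gives almost the same AIC, even though X2 plays no role in the true model.

```r
set.seed(123)
n <- 1000
r <- 0.99
U <- chol(matrix(c(1, r, r, 1), 2, 2))
Z <- matrix(rnorm(2 * n), n, 2) %*% U  # (x1, x2) with correlation 0.99
x1 <- Z[, 1]; x2 <- Z[, 2]
x3 <- rnorm(n)
y  <- 1 + 2 * x1 - 2 * x3 + rnorm(n)
AIC(lm(y ~ x1 + x3))   # the true model
AIC(lm(y ~ x2 + x3))   # x2 as a stand-in for x1: almost as good
AIC(lm(y ~ x3))        # neither: much worse
```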
