By Gabriel Vasconcelos
When the LASSO fails?
The LASSO has two important uses, the first is forecasting and the second is variable selection. We are going to talk about the second. The variable selection objective is to recover the correct set of variables that generate the data or at least the best approximation given the candidate variables. The LASSO has attracted a lot of attention lately because it allows us to estimate a linear regression with thousands of variables and the model select the right ones for us. However, what many people ignore is when the LASSO fails.
Like any model, the LASSO also rely on assumptions in order to work. The first is sparsity, i.e. only a small number of variables may actually be relevant. If this assumption does not hold there is no hope to use the LASSO for variable selection. Another assumption is that the irrepresentable condition must hold, this condition may look very technical but it only says that the relevant variable may not be very correlated with the irrelevant variables.
Suppose your candidate variables are represented by the matrix , where each column is a variable and each line is an observation. We can calculate the covariance matrix , which is a symmetric matrix. This matrix may be broken into four pieces:
The first piece, , shows the covariances between only the important variables, is the covariance matrix of the irrelevant variables and and shows the covariances between relevant and irrelevant variables. With that in mind, the irrepresentable condition is:
The inequality above must hold for all elements. is for positive values of , for negative values and if .
For this example we are going to generate two covariance matrices, one that satisfies the irrepresentable condition and one that violates it. Our design will be very simple: only 10 candidate variables where five of them are relevant.
library(mvtnorm) library(corrplot) library(glmnet) library(clusterGeneration)
k=10 # = Number of Candidate Variables p=5 # = Number of Relevant Variables N=500 # = Number of observations betas=(-1)^(1:p) # = Values for beta set.seed(12345) # = Seed for replication sigma1=genPositiveDefMat(k,"unifcorrmat")$Sigma # = Sigma1 violates the irc sigma2=sigma1 # = Sigma2 satisfies the irc sigma2[(p+1):k,1:p]=0 sigma2[1:p,(p+1):k]=0 # = Verify the irrepresentable condition irc1=sort(abs(sigma1[(p+1):k,1:p]%*%solve(sigma1[1:p,1:p])%*%sign(betas))) irc2=sort(abs(sigma2[(p+1):k,1:p]%*%solve(sigma2[1:p,1:p])%*%sign(betas))) c(max(irc1),max(irc2))
##  3.222599 0.000000
# = Have a look at the correlation matrices par(mfrow=c(1,2)) corrplot(cov2cor(sigma1)) corrplot(cov2cor(sigma2))
As you can see,
irc1 violates the irrepresentable condition and
irc2 does not. The correlation matrix that satisfies the irrepresentable condition is block diagonal and the relevant variables have no correlation with the irrelevant ones. This is an extreme case, you may have a small correlation and still satisfy the condition.
Now let us check how the LASSO works for both covariance matrices. First we need do understand what is the regularization path. The LASSO objective function penalizes the size of the coefficients and this penalization is controlled by a hyper-parameter . We can find the exact that is sufficiently big to shrink all variables to zero and for any value smaller than some variable will be included. As we decrease the size of more variables are included until we have a model with all variables (or the biggest identified model when we have more variables than observations). This path between the model with all variables and the model with no variables is the regularization path. The code below generates data from multivariate normal distributions for the covariance matrix that violates the irrepresentable condition and the covariance matrix that satisfies it. Then I estimate the regularization path for both case and summarize the information in plots.
X1=rmvnorm(N,sigma = sigma1) # = Variables for the design that violates the IRC = # X2=rmvnorm(N,sigma = sigma2) # = Variables for the design that satisfies the IRC = # e=rnorm(N) # = Error = # y1=X1[,1:p]%*%betas+e # = Generate y for design 1 = # y2=X2[,1:p]%*%betas+e # = Generate y for design 2 = # lasso1=glmnet(X1,y1,nlambda = 100) # = Estimation for design 1 = # lasso2=glmnet(X2,y2,nlambda = 100) # = Estimation for design 2 = # ## == Regularization path == ## par(mfrow=c(1,2)) l1=log(lasso1$lambda) matplot(as.matrix(l1),t(coef(lasso1)[-1,]),type="l",lty=1,col=c(rep(1,9),2),ylab="coef",xlab="log(lambda)",main="Violates IRC") l2=log(lasso2$lambda) matplot(as.matrix(l2),t(coef(lasso2)[-1,]),type="l",lty=1,col=c(rep(1,9),2),ylab="coef",xlab="log(lambda)",main="Satisfies IRC")
The plot on the left shows the results when the irrepresentable condition is violated and the plot on the right is the case when it is satisfied. The five black lines that slowly converge to zero are the five relevant variables and the red line is an irrelevant variable. As you can see, when the IRC is satisfied all relevant variables shrink very fast to zero as we increase lambda. However, when the IRC is violated one irrelevant variable starts with a very small coefficient that slowly increases before decreasing to zero in the very end of the path. This variable is selected through the entire path, it is virtually impossible to recover the correct set of variables in this case unless you apply a different penalty to each variable. This is precisely what the adaptive LASSO does. Does that mean that the adaLASSO is free from the irrepresentable condition??? The answer is: partially. The adaptive LASSO requires a less restrictive condition called weighted irrepresentable condition, which is much easier to satisfy. The two plots below show the regularization path for the LASSO and the adaLASSO in the case the IRC is violated. As you can see, the adaLASSO selects the correct set of variables in all the path.
lasso1.1=cv.glmnet(X1,y1) w.=(abs(coef(lasso1.1)[-1])+1/N)^(-1) adalasso1=glmnet(X1,y1,penalty.factor = w.) par(mfrow=c(1,2)) l1=log(lasso1$lambda) matplot(as.matrix(l1),t(coef(lasso1)[-1,]),type="l",lty=1,col=c(rep(1,9),2),ylab="coef",xlab="log(lambda)",main="LASSO") l2=log(adalasso1$lambda) matplot(as.matrix(l2),t(coef(adalasso1)[-1,]),type="l",lty=1,col=c(rep(1,9),2),ylab="coef",xlab="log(lambda)",main="adaLASSO")
The biggest problem is that the irrepresentable condition and its less restricted weighted version are not testable in the real world because we need the populational covariance matrix and the true betas that generate the data. The solution is to study your data as much as possible to at least have an idea of the situation.
Some articles on the topic
Zhao, Peng, and Bin Yu. “On model selection consistency of Lasso.” Journal of Machine learning research 7.Nov (2006): 2541-2563.
Meinshausen, Nicolai, and Bin Yu. “Lasso-type recovery of sparse representations for high-dimensional data.” The Annals of Statistics (2009): 246-270.