Solutions for Multicollinearity in Regression (1)

[This article was first published on Chen-ang Statistics » R, and kindly contributed to R-bloggers.]

In multiple regression analysis, multicollinearity is a common phenomenon in which two or more predictor variables are highly correlated. If there is an exact linear relationship (perfect multicollinearity) among the independent variables, the rank of X is less than k+1 (where k is the number of predictor variables), and the matrix X^TX is not invertible. Even without an exact relationship, strong correlations cause computational instability, and although the OLS estimates remain unbiased, their variances become very large, so OLS no longer performs well in practice.

There are several common ways to measure multicollinearity, for instance the VIF (variance inflation factor) and the condition number. The VIF for the j-th predictor is defined as

VIF_j=\frac{1}{1-R_j^2}

where R_j^2 is the coefficient of determination obtained by regressing the j-th predictor on all the other predictors. The condition number is given by

\kappa=\sqrt{\frac{\lambda_{\max}}{\lambda_{\min}}}

where \lambda_{\max} and \lambda_{\min} are the largest and smallest eigenvalues of X^TX. According to common rules of thumb, multicollinearity is considered present if a VIF exceeds 5 or 10, or if the condition number exceeds 15 (with 30 indicating severe multicollinearity).
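To see both diagnostics in practice, here is a minimal R sketch (mine, not part of the original post) computing the VIFs and the condition number for the longley data; it assumes the car package, which provides vif().

library(car);                          # provides vif(); an extra dependency not used in this post
data(longley);
fit<-lm(GNP.deflator~.,data=longley);  # plain OLS fit on the longley data
vif(fit);                              # VIF_j = 1/(1 - R_j^2) for each predictor
X<-scale(as.matrix(longley[,-1]));     # standardized predictor matrix
kappa(X,exact=TRUE);                   # sqrt(lambda_max/lambda_min), the condition number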

There are two main approaches to dealing with this problem. First, we can replace OLS (ordinary least squares) with a biased, regularized estimator such as ridge regression, lasso regression or principal component regression. Second, statistical learning methods also work well, for example regression trees, bagging, random forest regression, neural networks and SVR (support vector regression).
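As a taste of the second approach, the following sketch (my own addition, not shown in the original post) fits a random forest regression to the same data; it assumes the randomForest package.

library(randomForest);                 # assumed extra dependency
data(longley);
set.seed(1);                           # for a reproducible forest
rf<-randomForest(GNP.deflator~.,data=longley,ntree=500);  # no matrix inversion, so collinearity causes no numerical trouble
rf;                                    # prints the out-of-bag mean of squared residuals

Tree ensembles sidestep the rank problem entirely, although strongly correlated predictors can still share (and dilute) variable importance.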

1 Ridge Regression

Ridge regression addresses the problem by estimating regression coefficients using

\hat{\beta}=(X^TX+kI)^{-1}X^Ty

where k is the ridge parameter and I is the identity matrix. Small positive values of k improve the conditioning of the problem and reduce the variance of the estimates. Although the ridge estimates are biased, their reduced variance often results in a smaller mean squared error than that of the least-squares estimates.
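As a sanity check on the formula, the ridge estimator can be computed directly (a sketch under my own assumptions about standardizing the data, not the author's code):

X<-scale(as.matrix(longley[,-1]));            # centred and scaled predictors
y<-longley[,1]-mean(longley[,1]);             # centred response
k<-0.01;                                      # an arbitrary small ridge parameter
beta_ridge<-solve(t(X)%*%X+k*diag(ncol(X)),t(X)%*%y);  # (X^T X + kI)^{-1} X^T y
beta_ridge;

Like the coef component returned by lm.ridge() below, these coefficients are on the standardized scale, so they are roughly comparable to that output rather than to the raw OLS coefficients.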

Obviously, the question is how to determine the parameter k. The ridge trace, generalized cross-validation (GCV) and Mallows' Cp are widely used for this. In R, the function lm.ridge() in the MASS package fits linear ridge regression models. Sample code and output follow:

> library(MASS);
> names(longley)[1]<-"y"; 
> lm.ridge(y~.,longley); #OLS
                        GNP    Unemployed  Armed.Forces    Population 
2946.85636017    0.26352725    0.03648291    0.01116105   -1.73702984 
         Year      Employed 
  -1.41879853    0.23128785 
> r<-lm.ridge(y~.,data=longley,lambda=seq(0,0.1,0.001),model=TRUE); # ridge fits over a grid of lambda values
> r$lambda[which.min(r$GCV)];  # lambda minimizing GCV
  [1] 0.006
> r$coef[,which.min(r$GCV)];   # (scaled) coefficients at the GCV-optimal lambda
         GNP   Unemployed Armed.Forces   Population         Year 
  16.9874524    1.7527228    0.4423901   -8.9474628    1.1782609 
    Employed 
  -0.1976319

According to this result, 0.006 is an appropriate value for the ridge parameter. The ridge trace curve leads to a similar conclusion.

[Figure: ridge trace of the coefficients against lambda]

The R code for the ridge trace plot is as follows:

library(ggplot2);                                        # needed for qplot()
coefficients<-matrix(t(r$coef));                         # stack the six coefficient paths into one column
lambda<-matrix(rep(seq(0,0.1,length=101),6));            # matching lambda value for each coefficient
variable<-matrix(t(matrix(rep(colnames(longley[,2:7]),101),nrow=6)));  # matching variable name
data<-data.frame(coefficients,lambda,variable);
qplot(lambda,coefficients,data=data,colour=variable,
geom="line")+geom_line(size=1);

Furthermore, the package ridge provides a function called linearRidge(), which also fits a linear ridge regression model; optionally, the ridge regression parameter is chosen automatically using the method proposed by Cule. For example

> library(ridge);
> data(longley);
> names(longley)[1]<-"y"; 
> mod<-linearRidge(y~.-1,data=longley,lambda="automatic"); # no intercept; ridge parameter chosen automatically
> summary(mod);

In this case, the function chooses 0.01 as the ridge parameter, so the result differs slightly from the output of lm.ridge().

In addition, if you are a MATLAB user, the Statistics Toolbox function ridge carries out ridge regression.