In multiple regression analysis, multicollinearity is a common phenomenon in which two or more predictor variables are highly correlated. If there is an exact linear relationship (perfect multicollinearity) among the independent variables, the rank of the design matrix X is less than k+1 (where k is the number of predictor variables) and $X^{\mathrm{T}}X$ is not invertible. Even without an exact relationship, strong correlations cause computational instability, and although the OLS estimator remains unbiased, its variance becomes very large, so it is no longer a reliable estimator in practice.
Several common measures of multicollinearity are available, for instance the VIF (variance inflation factor) and the condition number. The VIF of the j-th predictor is defined as
\[ \mathrm{VIF}_j = \frac{1}{1 - R_j^2}, \]
where $R_j^2$ is the coefficient of determination from regressing the j-th predictor on the remaining predictors, and the condition number is given by
\[ \kappa = \sqrt{\lambda_{\max}/\lambda_{\min}}, \]
where $\lambda_{\max}$ and $\lambda_{\min}$ are the largest and smallest eigenvalues of $X^{\mathrm{T}}X$.
According to common rules of thumb in the literature, multicollinearity is considered present when a VIF exceeds 5 or 10, or when the condition number exceeds 15 (values above 30 indicate serious collinearity).
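As a quick illustration, both diagnostics can be computed in R (a minimal sketch using the car package's vif() and base R's kappa(); the built-in longley data, with GNP.deflator taken as the response as in the examples later in this section, and the object names are only illustrative):

library(car)                                   # provides vif()

fit <- lm(GNP.deflator ~ ., data = longley)    # ordinary least-squares fit
vif(fit)                                       # VIF for each predictor

X <- scale(as.matrix(longley[, -1]))           # standardized predictor matrix
kappa(X, exact = TRUE)                         # condition number of X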
There are two main approaches to dealing with this problem. First, we can replace OLS (ordinary least squares) with a regularized, biased estimator such as ridge regression, lasso regression, or principal component regression. Alternatively, statistical learning methods can be used, such as regression trees, bagging, random forest regression, neural networks, and SVR (support vector regression).
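As a brief illustration of one of these alternatives (a minimal sketch of the lasso using the glmnet package on R's built-in longley data; the package choice and object names are assumptions, not taken from the original text):

library(glmnet)

x <- as.matrix(longley[, -1])                    # predictor matrix
y <- longley$GNP.deflator                        # response
cvfit <- cv.glmnet(x, y, alpha = 1, nfolds = 5)  # alpha = 1 gives the lasso penalty; few folds for a small data set
coef(cvfit, s = "lambda.min")                    # coefficients at the cross-validated lambda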
1 Ridge Regression
Ridge regression addresses the problem by estimating the regression coefficients as
\[ \hat{\beta}(k) = (X^{\mathrm{T}}X + kI)^{-1}X^{\mathrm{T}}y, \]
where k is the ridge parameter and I is the identity matrix. Small positive values of k improve the conditioning of the problem and reduce the variance of the estimates. Although the ridge estimates are biased, their reduced variance often yields a smaller mean squared error than that of the least-squares estimates.
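To make the formula concrete, the ridge estimator can be computed directly with matrix operations (a minimal sketch on R's built-in longley data, which the examples below also use; the value of k and the object names are purely illustrative):

X <- scale(as.matrix(longley[, -1]))                   # standardized predictors
y <- longley$GNP.deflator - mean(longley$GNP.deflator) # centred response
k <- 0.01                                              # a small, illustrative ridge parameter
Id <- diag(ncol(X))                                    # identity matrix (I in the formula)
beta_ridge <- solve(t(X) %*% X + k * Id, t(X) %*% y)   # (X'X + kI)^(-1) X'y
beta_ridge

With k = 0 this reduces to the OLS estimator; increasing k shrinks the coefficients toward zero.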
The obvious question is how to choose the parameter k. The ridge trace, generalized cross-validation (GCV), and Mallows' Cp are widely used for this purpose. In R, the function lm.ridge() in the MASS package fits a linear ridge regression model. Sample code and output follow:
> library(MASS)
> names(longley)[1] <- "y"
> lm.ridge(y ~ ., longley)    # OLS (lambda defaults to 0)
                        GNP   Unemployed Armed.Forces   Population 
2946.85636017    0.26352725   0.03648291   0.01116105  -1.73702984 
         Year     Employed 
  -1.41879853   0.23128785 
> r <- lm.ridge(y ~ ., data = longley, lambda = seq(0, 0.1, 0.001), model = TRUE)
> r$lambda[which.min(r$GCV)]
[1] 0.006
> r$coef[, which.min(r$GCV)]
         GNP   Unemployed Armed.Forces   Population         Year 
  16.9874524    1.7527228    0.4423901   -8.9474628    1.1782609 
    Employed 
  -0.1976319 
According to this result, 0.006 is an appropriate value for the ridge parameter. The ridge trace curve leads to a similar conclusion.
The R code is as follows:
library(ggplot2)
coefficients <- as.vector(t(r$coef))                   # the 6 coefficient paths, stacked into one vector
lambda <- rep(seq(0, 0.1, length = 101), times = 6)    # the lambda grid, repeated for each path
variable <- rep(colnames(longley)[2:7], each = 101)    # variable name attached to each point
data <- data.frame(coefficients, lambda, variable)
qplot(lambda, coefficients, data = data, colour = variable, geom = "line") +
  geom_line(size = 1)                                  # ridge trace plot
Furthermore, the ridge package provides a function called linearRidge(), which also fits a linear ridge regression model; optionally, the ridge parameter is chosen automatically using the method proposed by Cule. For example:
> library(ridge)
> data(longley)
> names(longley)[1] <- "y"
> mod <- linearRidge(y ~ . - 1, data = longley, lambda = "automatic")
> summary(mod)
In this case, the function chooses 0.01 as the ridge parameter, so the result is slightly different from the output of lm.ridge().
In addition, MATLAB users can carry out ridge regression with the Statistics Toolbox™ function ridge.