Ridge Regression and the Lasso

May 23, 2017
By

(This article was first published on R – Real Data, and kindly contributed to R-bloggers)

In my last post Which linear model is best? I wrote about using
stepwise selection as a method for selecting linear models, which turns
out to have some issues (see this article, and Wikipedia).
This post will be about two methods that slightly modify ordinary least
squares (OLS) regression – ridge regression and the lasso.

Ridge regression and the lasso are closely related, but only the Lasso
has the ability to select predictors. Like OLS, ridge attempts to
minimize residual sum of squares of predictors in a given model.
However, ridge regression includes an additional ‘shrinkage’ term – the
square of the coefficient estimate – which shrinks the estimate of the
coefficients towards zero. The impact of this term is controlled by
another term, lambda (determined seperately). Two interesting
implications of this design are the facts that when λ = 0 the OLS
coefficients are returned and when λ = ∞, coefficients will approach
zero.

To take a look at this, setup a model matrix (removing the intercept
column), store the independent variable as y, and create a vector of
lambda values.

swiss <- datasets::swiss
x <- model.matrix(Fertility~., swiss)[,-1]
y <- swiss$Fertility
lambda <- 10^seq(10, -2, length = 100)

First, let's prove the fact that when λ = 0 we get the same coefficients
as the OLS model.

#create test and training sets
library(glmnet)
## Loading required package: Matrix

## Loading required package: foreach

## Loaded glmnet 2.0-10
set.seed(489)
train = sample(1:nrow(x), nrow(x)/2)
test = (-train)
ytest = y[test]

Fit your models.

#OLS
swisslm <- lm(Fertility~., data = swiss)
coef(swisslm)
##      (Intercept)      Agriculture      Examination        Education 
##       66.9151817       -0.1721140       -0.2580082       -0.8709401 
##         Catholic Infant.Mortality 
##        0.1041153        1.0770481
#ridge
ridge.mod <- glmnet(x, y, alpha = 0, lambda = lambda)
predict(ridge.mod, s = 0, exact = T, type = 'coefficients')[1:6,]
##      (Intercept)      Agriculture      Examination        Education 
##       66.9365901       -0.1721983       -0.2590771       -0.8705300 
##         Catholic Infant.Mortality 
##        0.1040307        1.0770215

The differences here are nominal. Let's see if we can use ridge to
improve on the OLS estimate.

swisslm <- lm(Fertility~., data = swiss, subset = train)
ridge.mod <- glmnet(x[train,], y[train], alpha = 0, lambda = lambda)
#find the best lambda from our list via cross-validation
cv.out <- cv.glmnet(x[train,], y[train], alpha = 0)
## Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations
## per fold
bestlam <- cv.out$lambda.min
#make predictions
ridge.pred <- predict(ridge.mod, s = bestlam, newx = x[test,])
s.pred <- predict(swisslm, newdata = swiss[test,])
#check MSE
mean((s.pred-ytest)^2)
## [1] 106.0087
mean((ridge.pred-ytest)^2)
## [1] 93.02157

Ridge performs better for this data according to the MSE.

#a look at the coefficients
out = glmnet(x[train,],y[train],alpha = 0)
predict(ridge.mod, type = "coefficients", s = bestlam)[1:6,]
##      (Intercept)      Agriculture      Examination        Education 
##      64.90631178      -0.16557837      -0.59425090      -0.35814759 
##         Catholic Infant.Mortality 
##       0.06545382       1.30375306

As expected, most of the coefficient estimates are more conservative.

Let's have a look at the lasso. The big difference here is in the
shrinkage term – the lasso takes the absolute value of the coefficient
estimates.

lasso.mod <- glmnet(x[train,], y[train], alpha = 1, lambda = lambda)
lasso.pred <- predict(lasso.mod, s = bestlam, newx = x[test,])
mean((lasso.pred-ytest)^2)
## [1] 124.1039

The MSE is a bit higher for the lasso estimate. Let's check out the
coefficients.

lasso.coef  <- predict(lasso.mod, type = 'coefficients', s = bestlam)[1:6,]

Looks like the lasso places high importance on Education,
Examination, and Infant.Mortality. From this we also gain some
evidence that Catholic and Agriculture are not useful predictors for
this model. It is likely that Catholic and Agriculture do have some
effect on Fertility, though, since pushing those coefficients to zero
hurt the model.

There is plenty more to delve into here, but I'll leave the details to
the experts. I am always happy to have your take on the topics I write
about, so please feel free to leave a comment or contact me. Oftentimes
I learn just as much from you all as I do in researching the topic I
write about.

I think the next post will be about more GIS stuff – maybe on rasters or
point pattern analysis.

Thank you for reading!

Kiefer

To leave a comment for the author, please follow the link and comment on their blog: R – Real Data.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)