In my last post, Which linear model is best?, I wrote about using
stepwise selection as a method for selecting linear models. This post
will be about two methods that slightly modify ordinary least squares
(OLS) regression: ridge regression and the lasso.

Ridge regression and the lasso are closely related, but only the lasso
has the ability to select predictors. Like OLS, ridge regression
attempts to minimize the residual sum of squares of a given model.
However, ridge regression adds a 'shrinkage' penalty, λ times the sum
of the squared coefficient estimates (the intercept is not penalized),
which shrinks the coefficient estimates towards zero. The impact of
this penalty is controlled by the tuning parameter lambda (λ), which is
chosen separately, typically by cross-validation. Two interesting
implications of this design are that when λ = 0 the OLS coefficients
are returned, and as λ → ∞ the coefficients approach zero.
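To make those two limiting cases concrete, here is a minimal sketch using the closed-form ridge solution in base R. It assumes standardized predictors and a centered response, and `ridge_coef` is a hypothetical helper written just for this illustration. Its λ is not on the same scale as the one `glmnet` uses, so only the qualitative behaviour carries over.

```r
# Closed-form ridge on the swiss data (base R only; illustrative, not glmnet's parameterization)
swiss <- datasets::swiss
X  <- scale(as.matrix(swiss[, -1]))              # standardized predictors (Fertility is column 1)
yc <- swiss$Fertility - mean(swiss$Fertility)    # centered response, so no intercept is needed

# hypothetical helper: solves (X'X + lambda * I) b = X'y
ridge_coef <- function(X, y, lambda) {
  p <- ncol(X)
  drop(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))
}

b_ols  <- coef(lm(yc ~ X - 1))      # OLS on the same standardized data
b_zero <- ridge_coef(X, yc, 0)      # lambda = 0: should match OLS
b_big  <- ridge_coef(X, yc, 1e6)    # very large lambda: coefficients near zero

round(cbind(ols = unname(b_ols), ridge_0 = b_zero, ridge_1e6 = b_big), 4)
```

With λ = 0 the penalty vanishes and the ridge column should agree with the OLS column; with a very large λ every coefficient is pulled close to zero without reaching it exactly.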

To take a look at this, set up a model matrix (removing the intercept
column), store the dependent variable as `y`, and create a vector of
lambda values.

```r
swiss <- datasets::swiss
x <- model.matrix(Fertility~., swiss)[,-1]
y <- swiss$Fertility
lambda <- 10^seq(10, -2, length = 100)
```
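As a quick sanity check on this setup, the sketch below inspects the objects just created. It only uses base R and repeats the definitions so that it runs on its own.

```r
# Rebuild the model matrix and lambda grid, then inspect them
swiss  <- datasets::swiss
x      <- model.matrix(Fertility ~ ., swiss)[, -1]
lambda <- 10^seq(10, -2, length = 100)

dim(x)          # one row per observation, one column per predictor
colnames(x)     # the "(Intercept)" column has been dropped
range(lambda)   # smallest and largest penalty on the grid
head(lambda, 3) # the grid is log-spaced and runs from largest to smallest
```

The grid deliberately spans many orders of magnitude, from a penalty so large that the fit is essentially intercept-only down to one small enough to be close to OLS.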

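First, let's check that when λ = 0 we get essentially the same
coefficients as the OLS model. We start by loading `glmnet` and
splitting the data into training and test sets, which we will need
later.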

```r
#create test and training sets
library(glmnet)

set.seed(489)
train = sample(1:nrow(x), nrow(x)/2)  #random half of the row indices
test = (-train)                       #negative indices select the remaining rows
ytest = y[test]
```

```r
#OLS
swisslm <- lm(Fertility~., data = swiss)
coef(swisslm)

##      (Intercept)      Agriculture      Examination        Education
##       66.9151817       -0.1721140       -0.2580082       -0.8709401
##         Catholic Infant.Mortality
##        0.1041153        1.0770481

#ridge
ridge.mod <- glmnet(x, y, alpha = 0, lambda = lambda)
#recent glmnet versions require the original x and y when exact = TRUE
predict(ridge.mod, s = 0, exact = TRUE, type = 'coefficients', x = x, y = y)[1:6,]

##      (Intercept)      Agriculture      Examination        Education
##       66.9365901       -0.1721983       -0.2590771       -0.8705300
##         Catholic Infant.Mortality
##        0.1040307        1.0770215
```

The differences here are negligible. Let's see if we can use ridge to
improve on the OLS estimate.

```r
swisslm <- lm(Fertility~., data = swiss, subset = train)
ridge.mod <- glmnet(x[train,], y[train], alpha = 0, lambda = lambda)
#find the best lambda via cross-validation
#(no lambda argument is passed, so cv.glmnet builds its own lambda sequence)
cv.out <- cv.glmnet(x[train,], y[train], alpha = 0)

## Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations
## per fold

bestlam <- cv.out$lambda.min

#make predictions
ridge.pred <- predict(ridge.mod, s = bestlam, newx = x[test,])
s.pred <- predict(swisslm, newdata = swiss[test,])
#check MSE
mean((s.pred-ytest)^2)

## [1] 106.0087

mean((ridge.pred-ytest)^2)

## [1] 93.02157
```

Ridge performs better on this test set according to the MSE (93.0
versus 106.0 for OLS).
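For intuition about what the cross-validation step is doing, here is a rough hand-rolled version in base R. It is only a sketch under several assumptions: `ridge_fit` is a hypothetical helper using the closed-form ridge solution, the predictors are standardized once on the full data for simplicity, the fold count and lambda grid are arbitrary choices, and the penalty is not scaled the way `glmnet` scales it, so the selected value is not comparable to `bestlam`.

```r
# Hand-rolled k-fold cross-validation for ridge (illustrative only)
swiss <- datasets::swiss
X <- scale(as.matrix(swiss[, -1]))   # simplification: scaled once on all rows
y <- swiss$Fertility

# hypothetical helper: closed-form ridge with an unpenalized intercept
ridge_fit <- function(X, y, lambda) {
  ybar <- mean(y)
  xbar <- colMeans(X)
  Xc   <- sweep(X, 2, xbar)          # center predictors within the training fold
  beta <- drop(solve(crossprod(Xc) + lambda * diag(ncol(X)),
                     crossprod(Xc, y - ybar)))
  list(beta = beta, intercept = ybar - sum(xbar * beta))
}

set.seed(1)
k    <- 5
fold <- sample(rep(seq_len(k), length.out = nrow(X)))  # random fold labels
grid <- 10^seq(3, -2, length.out = 30)                 # candidate lambdas

# average held-out MSE across folds for each candidate lambda
cv_mse <- sapply(grid, function(lam) {
  mean(sapply(seq_len(k), function(i) {
    fit  <- ridge_fit(X[fold != i, , drop = FALSE], y[fold != i], lam)
    pred <- fit$intercept + drop(X[fold == i, , drop = FALSE] %*% fit$beta)
    mean((y[fold == i] - pred)^2)
  }))
})

best_lambda <- grid[which.min(cv_mse)]
best_lambda
```

The idea is the same as in `cv.glmnet`: each candidate λ is scored on data it was not fit to, and the value with the lowest average held-out error is kept.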

```r
#a look at the coefficients
predict(ridge.mod, type = "coefficients", s = bestlam)[1:6,]

##      (Intercept)      Agriculture      Examination        Education
##      64.90631178      -0.16557837      -0.59425090      -0.35814759
##         Catholic Infant.Mortality
##       0.06545382       1.30375306
```

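As expected, most of the coefficient estimates are more conservative:
compared with the full-data OLS fit shown earlier, `Agriculture`,
`Education`, and `Catholic` move towards zero, while `Examination` and
`Infant.Mortality` do not (note that the ridge model here was fit on
the training half only).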

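Let's have a look at the lasso. The big difference here is in the
shrinkage term: the lasso penalizes the sum of the absolute values of
the coefficient estimates rather than their squares, which allows it to
shrink some coefficients exactly to zero.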
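The effect of the two penalties is easiest to see in a simplified setting. The sketch below assumes an orthonormal design (so each coefficient can be treated on its own), uses made-up OLS coefficients, and assumes the objectives are half the residual sum of squares plus `lam` times the absolute value (lasso) or plus `lam / 2` times the square (ridge). Under those assumptions the lasso solution is a soft-threshold of the OLS estimate and the ridge solution is a proportional rescaling.

```r
# Lasso vs ridge shrinkage for an orthonormal design (illustrative assumptions)
soft_threshold <- function(b, lam) sign(b) * pmax(abs(b) - lam, 0)  # lasso
ridge_shrink   <- function(b, lam) b / (1 + lam)                    # ridge

b_ols <- c(-0.9, -0.3, 0.1, 1.1)   # hypothetical OLS coefficients
lam   <- 0.25

lasso_like <- soft_threshold(b_ols, lam)
ridge_like <- ridge_shrink(b_ols, lam)

cbind(ols = b_ols, lasso = lasso_like, ridge = ridge_like)
```

The lasso subtracts a fixed amount from every coefficient and sets any that would cross zero to exactly zero, which is where its variable selection comes from. Ridge only rescales, so small coefficients get smaller but never vanish.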

```r
lasso.mod <- glmnet(x[train,], y[train], alpha = 1, lambda = lambda)
#note: bestlam is the lambda selected by cross-validating the ridge model above
lasso.pred <- predict(lasso.mod, s = bestlam, newx = x[test,])
mean((lasso.pred-ytest)^2)

## [1] 124.1039
```

The MSE is a bit higher for the lasso estimate (124.1 versus 93.0 for
ridge). Let's check out the coefficients.

```r
lasso.coef  <- predict(lasso.mod, type = 'coefficients', s = bestlam)[1:6,]
lasso.coef  #print the coefficients
```

Looks like the lasso places high importance on `Education`,
`Examination`, and `Infant.Mortality`. From this we also gain some
evidence that `Catholic` and `Agriculture` are not useful predictors for
this model. It is likely that `Catholic` and `Agriculture` do have some
effect on `Fertility`, though, since pushing those coefficients to zero
hurt the model.
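One caveat is that `bestlam` was chosen by cross-validating the ridge model, so the lasso above was evaluated at a lambda tuned for a different penalty. A possible variation, sketched here and not part of the original analysis, is to cross-validate the lasso itself with `cv.glmnet(..., alpha = 1)` and then see which coefficients it sets to zero. The object names `cv.lasso`, `lasso.cv.coef`, and `lasso.cv.mse` are my own, and the results will depend on the random fold assignment.

```r
library(glmnet)

# same data and split as in the post
swiss <- datasets::swiss
x <- model.matrix(Fertility ~ ., swiss)[, -1]
y <- swiss$Fertility

set.seed(489)
train <- sample(1:nrow(x), nrow(x) / 2)

# cross-validate the lasso (alpha = 1) instead of reusing the ridge lambda
cv.lasso <- cv.glmnet(x[train, ], y[train], alpha = 1)

# coefficients at the lasso's own best lambda, as a named numeric vector
lasso.cv.coef <- coef(cv.lasso, s = "lambda.min")[, 1]
lasso.cv.coef
names(lasso.cv.coef)[lasso.cv.coef == 0]   # predictors dropped by the lasso, if any

# test-set MSE at that lambda
lasso.cv.pred <- predict(cv.lasso, newx = x[-train, ], s = "lambda.min")
lasso.cv.mse  <- mean((lasso.cv.pred - y[-train])^2)
lasso.cv.mse
```

Comparing this MSE with the ridge and OLS figures above gives a fairer picture of the lasso on this split, since each method is then tuned on its own terms.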

There is plenty more to delve into here, but I'll leave the details to
the experts. I am always happy to have your take on the topics I write
about. I learn just as much from you all as I do in researching the
topics I cover.

I think the next post will be about more GIS stuff – maybe on rasters or
point pattern analysis.