Twelve Days 2013: LASSO Regression


Day Eight: LASSO Regression

TL/DR

LASSO regression (least absolute shrinkage and selection operator) is a modified form of least squares regression that penalizes model complexity via a regularization parameter. It does so by adding a term proportional to $||\beta||_{l_1}$ to the objective function, which shrinks coefficients towards zero and can even eliminate them entirely. In that light, LASSO is a form of feature selection/dimensionality reduction. Unlike other forms of regularization such as ridge regression, LASSO will actually eliminate predictors. It's a simple, useful technique that performs quite well on many data sets.

Regularization

Regularization refers to the process of adding constraints to a problem to avoid overfitting. ML techniques such as neural networks can generate models of arbitrary complexity that will fit in-sample data one-for-one. As we recently saw in the post on Reed-Solomon FEC codes, the same applies to regression. We definitely have a problem anytime there are more regressors than data points, but any excessively complex model will generalize horribly and do you zero good out of sample.

Why LASSO?

There's a litany of regularization techniques for regression, ranging from heuristic, hands-on ones like stepwise regression to full-blown dimensionality reduction. They all have their place, but I like LASSO because it works very well and it's simpler than most dimensionality reduction/ML techniques. And, despite being a non-linear method, as of 2008 it has a relatively efficient solution via coordinate descent: each full sweep over the coordinates costs $O(n\cdot p)$ operations, where $n$ is the length of the data set and $p$ is the number of regressors.
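
To make the coordinate descent idea concrete, here's a minimal sketch of the update it cycles through, written against the penalized objective defined in the next section. It's illustrative only (in practice you'd use lars or glmnet, which add standardization, warm starts, and convergence checks), and the function and variable names are my own.

[coordinate_descent_sketch.r]

# Soft-thresholding operator: shrinks z towards zero by lambda, snapping to
# exactly zero when |z| <= lambda. This is what produces exact zeros in the fit.
soft.threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

# Naive coordinate descent for 1/2 * ||y - x %*% beta||^2 + lambda * ||beta||_1,
# assuming y and the columns of x are centred. Each full sweep over the p
# coordinates costs O(n * p) operations.
lasso.cd <- function(x, y, lambda, n.sweeps = 100) {
  p <- ncol(x)
  beta <- rep(0, p)
  for (sweep in seq_len(n.sweeps)) {
    for (j in seq_len(p)) {
      # Partial residual with predictor j left out of the current fit.
      r <- y - x[, -j, drop = FALSE] %*% beta[-j]
      # The one-dimensional LASSO problem in beta_j has a closed-form solution.
      beta[j] <- soft.threshold(sum(x[, j] * r), lambda) / sum(x[, j] ^ 2)
    }
  }
  beta
}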

An Example

Our objective function has the form:

$$\hat{\beta} = \underset{\beta}{\operatorname{argmin}} \left\{ \frac{1}{2} \lVert y - X\beta \rVert_{l_2}^2 + \lambda \lVert \beta \rVert_{l_1} \right\}$$

where $\lambda \geq 0$. The first term is just the standard objective function for least squares regression (up to a constant factor). The second term penalizes the regression coefficients under the $l_1$ norm. The parameter $\lambda$ determines how important the penalty on coefficient weights is.
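
If it helps to see the objective as code, here's a small helper that evaluates it for a given coefficient vector; the function name is mine, and nothing in lars or glmnet exposes this directly.

[lasso_objective.r]

# Penalized residual sum of squares that LASSO minimizes: half the RSS plus
# lambda times the l1 norm of the coefficients.
lasso.objective <- function(x, y, beta, lambda) {
  0.5 * sum((y - x %*% beta) ^ 2) + lambda * sum(abs(beta))
}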

There are two R packages that I know of for LASSO: lars (short for least angle regression, a superset of LASSO) and glmnet. Glmnet includes solvers for more general models (including elastic net, a hybrid of LASSO and ridge that can handle categorical variables). Lars is simpler to work with, but the documentation isn't great. As such, here are a few points worth noting:

  1. The primary lars function generates an object that's subsequently used to generate the fit that you actually want. There's a computational motivation behind this approach. The LARS technique works by solving for a series of "knot points" with associated, monotonically decreasing values of $\lambda$. The knot points are subsequently used to compute the LASSO regression for any value of $\lambda$ using only matrix math. This makes procedures such as cross validation, where we need to try lots of different values of $\lambda$, computationally tractable. Without it, we would have to recompute an expensive non-linear optimization each time $\lambda$ changed.
  2. There's a saturation point at which $\lambda$ is high enough that the null model is optimal. On the other end of the spectrum, when $\lambda = 0$, we're left with least squares. The final value of $\lambda$ on the path, right before we end up with least squares, will correspond to the largest coefficient norm. Let's call these coefficients $\beta_\text{thresh}$, and denote $\Delta = || \beta_\text{thresh} ||_{l_1}$. When the lars package does cross validation, it does so by computing the MSE for models where the second term in the objective function is fixed at $x \cdot \Delta,\ x \in [0, 1]$. This works from a calculation standpoint (and computationally it makes things pretty), but it's counterintuitive if you're interested in the actual value of $\lambda$ and not just trying to get the regression coefficients. You could easily write your own cross validation routine to use $\lambda$ directly (a sketch of one follows this list).
  3. The residual sum of squared errors will increase monotonically with $\lambda$. This makes sense as we’re trading off between minimizing RSS and the model’s complexity. As such, the smallest RSS will always correspond to the smallest value of $\lambda$, and not necessarily the optimal one.
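
Touching on point 2: here's a minimal sketch of what a cross validation routine over $\lambda$ itself might look like. It assumes that predict.lars() accepts mode = "lambda" (recent versions of lars do); the fold count, helper name, and grid of $\lambda$ values are all arbitrary choices of mine.

[lasso_cv_lambda_sketch.r]

require(lars)

# K-fold cross validation of LASSO over a grid of lambda values, rather than
# over fractions of the final l1 norm as cv.lars() does. Returns the mean
# out-of-fold MSE for each value in lambda.grid.
cv.lasso.lambda <- function(x, y, lambda.grid, k = 10) {
  folds <- sample(rep(seq_len(k), length.out = nrow(x)))
  mse <- matrix(NA, nrow = k, ncol = length(lambda.grid))
  for (i in seq_len(k)) {
    train <- folds != i
    path <- lars(x[train, ], y[train], type = "lasso")
    # Clamp the grid to the lambdas visited on this fold's path; anything at
    # or above the largest knot corresponds to the all-zero model anyway.
    s.fold <- pmin(lambda.grid, max(path$lambda))
    # Predictions on the held-out fold at every (clamped) lambda in the grid.
    preds <- predict(path, newx = x[!train, ], s = s.fold,
                     type = "fit", mode = "lambda")$fit
    mse[i, ] <- colMeans((preds - y[!train]) ^ 2)
  }
  colMeans(mse)
}

# Example usage with the diabetes data: take the grid from the knot-point
# lambdas of a full fit, then pick the lambda with the smallest CV MSE.
data(diabetes)
lambda.grid <- seq(0, max(lars(diabetes$x, diabetes$y, type = "lasso")$lambda),
                   length.out = 50)
cv.mse <- cv.lasso.lambda(diabetes$x, diabetes$y, lambda.grid)
lambda.grid[which.min(cv.mse)]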

Here’s a simple example using data from the lars package. We’ll follow a common heuristic that recommends choosing $\lambda$ one SD of MSE away from the minimum. Personally I prefer to examine the CV L-curve and pick a value right on the elbow, but this works.

[lasso_regression.r]

require(lars)
data(diabetes)

# Compute MSEs for a range of coefficient penalties, expressed as a fraction
# of the final L1 norm on the interval [0, 1].
cv.res <- cv.lars(diabetes$x, diabetes$y, type = "lasso",
                  mode = "fraction", plot = FALSE)

# Choose an "optimal" value one standard deviation away from the
# minimum MSE.
opt.frac <- min(cv.res$cv) + sd(cv.res$cv)
opt.frac <- cv.res$index[which(cv.res$cv < opt.frac)[1]]

# Compute the LARS path.
lasso.path <- lars(diabetes$x, diabetes$y, type = "lasso")

# Compute a fit given the LARS path that we precomputed, and our optimal
# fraction of the final L1 norm.
lasso.fit <- predict.lars(lasso.path, type = "coefficients",
                          mode = "fraction", s = opt.frac)

# Extract the final vector of regression coefficients.
coef(lasso.fit)
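
Incidentally, the knot points from point 1 are easy to inspect once you have the path object; this snippet just assumes the lasso.path object defined in the listing above.

[inspect_knots.r]

# The lambda value at each knot on the path (monotonically decreasing), and
# the coefficient vectors along the path, one row per step.
lasso.path$lambda
coef(lasso.path)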

Final Notes

LASSO is a biased estimator whose bias increases with $\lambda$. It's not meant to provide the "best" fit in the Gauss-Markov sense; LASSO aims to find models that generalize well. Feature selection is a hard problem, and the best that we can do is a combination of common sense and model inference. However, no technique will save you from the worst case scenario: two very highly correlated variables, one of which is a good predictor, the other of which is spurious. It's a crapshoot as to which predictor a feature selection algorithm would penalize in that case. LASSO has a few technical issues as well. Omitted variable bias is still an issue, as it is in other forms of regression, and because of its non-linear solution, LASSO isn't invariant under transformations (such as rescaling) of the original data matrix.
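
That last point is easy to demonstrate. Here's a minimal sketch using the diabetes data: rescaling a single column changes the fit by more than a simple change of units would under least squares. It relies on turning off the package's standardization via normalize = FALSE; with the default normalize = TRUE, lars rescales the predictors for you, which is usually what you want.

[lasso_scaling_sketch.r]

require(lars)
data(diabetes)

# Rescale one predictor by a factor of 1000.
x.scaled <- diabetes$x
x.scaled[, "bmi"] <- x.scaled[, "bmi"] * 1000

fit.raw <- lars(diabetes$x, diabetes$y, type = "lasso", normalize = FALSE)
fit.scaled <- lars(x.scaled, diabetes$y, type = "lasso", normalize = FALSE)

# Under OLS the bmi coefficient would simply shrink by a factor of 1000; under
# LASSO the l1 penalty is now far cheaper for bmi, so the coefficient path, and
# typically the set of selected variables, changes.
coef(predict.lars(fit.raw, type = "coefficients", mode = "fraction", s = 0.5))
coef(predict.lars(fit.scaled, type = "coefficients", mode = "fraction", s = 0.5))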
