[This article was first published on isomorphismes, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Tibshirani’s original paper on the lasso.

• Breiman’s Garotte — 1993
• Tibshirani lasso paper submitted — 1994
• Tibshirani lasso paper revised — 1995
• Tibshirani lasso paper accepted — 1996

This is one of those papers that I’m so excited about, I feel like “You should just read the whole thing! It’s all good!” But I realise that’s less than reasonable.

Here is a bit of summary, feel free to request other information and I’ll do my best to adapt it.

The basic question is: I have some data and I want the computer to generate (regress) a linear model of it for me. What procedure should I tell the computer to do to get a good | better | best model?

The first technique, by Abraham de Moivre, applied fruitfully by Gauss in the late 1800’s (so, ok, no computers then — but nowadays we just run `lm` in `R`), was to minimise the sum of squared error (Euclidean distance)
$\large \dpi{200} \bg_white \sqrt{\blacksquare^2 + \blacksquare^2 + \blacksquare^2 + \blacksquare^2 + \blacksquare^2 + \blacksquare^2 + \ldots }$
of a given affine model of the data. (Affine being `linear + one more parameter` for a variable origin, to account for the average value of the data ex observable parameters. For example to model incomes in the USA when the only observed parameters are age, race, and zip code—you would want to include the average baseline US income level, and that would be accomplished mathematically by shifting the origin, a.k.a. the alpha; or autonomous or “vector of ones” regression-model parameter, a.k.a. the affine addition to an otherwise linear model.)

It was noticed by several someones at various points in time that whilst de Moivre’s least-squares (OLS) method is provably (calculus of variations) the optimal linear model given well-behaved data, real data does not always behave.

In the presence of correlation, missing data, wrong data, and other problems, the “optimal” OLS solution is overfit, meaning that the model it makes for you picks up on too many of the problems. Is there a way to pick up on more signal and less noise? More gold and less dross? More of the real stuff and fewer of the impurities?

I can think of 2 ways people have tried to scrape off the corrosion without flaying as well too much of the underlying good-material:

1. Assume simpler models are better. This is the approach taken by ridge regression (a.k.a. Tikhonov regularisation a.k.a. penalisation), the lasso, and the garotte
2. Compare ensembles of models, then choose one in the “middle”. Robust methods, for example, use statistical functions that vary less in theory from flawed situation to flawed situation, than do other statistical functions. Subset selection, hierarchical methods, and  generate a lot of models on the real data and
` `

That’s the backstory. Now on to what Tibshirani actually says. His original lasso paper contrasts 3 ways of penalising complicated models, plus regression on subsets.

The three formulae:

• penalisation & restriction to subsets

• garotte
• lasso

look superficially quite similar. Tibshirani discusses the pro’s, con’s, when’s, and wherefore’s of the different approaches.

` `

(In reading this paper I learned a new symbol, the operator ƒ(x) = (x)⁺. It means
$\large \dpi{200} \bg_white (x)^+ \equiv \begin{cases} x & \text{if } x>0 \\ 0 & \text{otherwise} \end{cases}$
and looks like

). In `R` code, `ifelse(x`. Like absolute value but not exactly.)

` `

Back to the lasso. How does such a small change to the penalty function, change the estimated linear model we get as output?