[This article was first published on **Mad (Data) Scientist**, and kindly contributed to R-bloggers.]


I’m one of many who bemoan the fact that statistics is typically thought of as — alas, even *taught* as — a set of formula plugging methods. One enters one’s data, turns the key, and the proper answers pop out. This of course is not the case at all, and arguably statistics is as much an art as a science. Or as I like to put it, you can’t be an effective number cruncher unless you know what the crunching means.

One of the worst ways in which statistics can produce bad analysis is the use of significance testing. For sheer colorfulness, I like Professor Paul Meehl’s quote, “Sir Ronald [Fisher] has befuddled us, mesmerized us, and led us down the primrose path.” But nothing beats concrete examples, and I’ll give a couple here.

First, a quick one: I’m active in the Bay Area R Users Group, and a couple of years ago we had an (otherwise-) excellent speaker from one of the major social network firms. He mentioned that he had been startled to find that, with the large data sets he works with, “Everything is significant.” Granted, he came from an engineering background rather than statistics, but even basic courses in the latter should pound into the students the fact that, with large n, even tiny departures from H_{0} will likely be declared “significant.”
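The large-n effect is easy to reproduce. Here is a minimal sketch using simulated data (not the speaker's data; the true mean of 0.01 and the sample sizes are values I made up for illustration). The departure from H_{0} is tiny and fixed, and only n changes.

```r
# Illustration only: simulated data with a tiny, fixed departure from H0: mean = 0.
set.seed(1)
true_mean <- 0.01                      # assumed effect size, practically negligible
sizes <- c(100L, 10000L, 1000000L)     # assumed sample sizes

# For each n, draw N(true_mean, 1) data and test H0: mean = 0
pvals <- sapply(sizes, function(n) {
  x <- rnorm(n, mean = true_mean)
  t.test(x, mu = 0)$p.value
})

print(data.frame(n = sizes, p_value = signif(pvals, 3)))
```

The effect size never changes; only the sample size does. A confidence interval for the mean would make the practical unimportance of the effect visible, which the p-value alone does not.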

The problem is compounded by the *simultaneous inference problem*, which points out, in essence, that when we perform a large number of significance tests, with H_{0} true in all of them, we are still likely to find some of them “significant.” (Of course, this problem also extends to confidence intervals, the typical alternative that I and others recommend.)
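To see the simultaneous inference problem in action, here is a small simulation sketch. Everything in it is an assumption for illustration: 1000 independent tests, 50 observations each, and H_{0} true in every one.

```r
# Illustration only: many tests, H0 true in all of them.
set.seed(2)
n_tests <- 1000   # assumed number of tests
n_obs   <- 50     # assumed sample size per test

# Each sample is N(0, 1), so H0: mean = 0 holds every time
pvals <- replicate(n_tests, t.test(rnorm(n_obs), mu = 0)$p.value)
n_sig <- sum(pvals < 0.05)
cat(n_sig, "of", n_tests, "true-null tests were 'significant' at the 5% level\n")

# Chance of at least one false positive among k independent 5%-level tests
k <- c(1, 5, 25, 100)
print(data.frame(k = k, prob_at_least_one = 1 - 0.95^k))
```

The second table uses the independence assumption; with correlated tests the exact numbers differ, but the qualitative point stands.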

My favorite example of this is a Wharton study in which the authors deliberately added fake variables to a real data set. And guess what! In the resulting regression analysis, all of the fake variables were found to be “significant” predictors of the response.

Let’s try our own experiment along these lines, using R. We’ll do model selection first by running **lm()** and checking which variables were found “significant.” This is a common, if unrefined, method for model selection. We’ll see that it too leads us astray. Another method for variable selection, much more sophisticated, is the LASSO, so we’ll try that one too, with similarly misleading results, actually worse.

For convenience, I’ll use the data from my last post. This is Census data on programmers and engineers in Silicon Valley. The only extra operation I’ve done here (not shown) is to center and scale the data, using **scale()**, in order to make the fake variables comparable to the real ones in size. My data set, **pg2n**, includes 5 real predictors and 25 fake ones, generated by

    > pg2n <- cbind(pg2,matrix(rnorm(25*nrow(pg2)),ncol=25))

Applying R’s **lm()** function as usual,

    > summary(lm(pg2n[,3] ~ pg2n[,-3]))

we find (output too voluminous to show here) that 4 of the 5 real predictors are found significant, but also 2 of the fake ones are significant (and a third has a p-value just slightly above 0.05). Not quite as dramatic as the Wharton data, which had more predictors than observations, but of a similar nature.
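Readers without the Census data can try the same kind of experiment on simulated data. The sketch below is not the analysis above: the sample size, the coefficients, and the names `real1`, `fake1`, etc. are all hypothetical. It builds 5 "real" predictors (4 of which actually affect the response) plus 25 pure-noise columns, then lists whatever **lm()** flags at the 5% level.

```r
# Illustration only: simulated stand-in for the experiment, not the Census data.
set.seed(3)
n      <- 2000   # assumed number of observations
n_real <- 5
n_fake <- 25

real <- matrix(rnorm(n * n_real), ncol = n_real)
colnames(real) <- paste0("real", 1:n_real)
# Hypothetical truth: only the first 4 real predictors matter
y <- drop(real[, 1:4] %*% c(0.5, -0.4, 0.3, 0.2)) + rnorm(n)

fake <- matrix(rnorm(n * n_fake), ncol = n_fake)   # pure noise, unrelated to y
colnames(fake) <- paste0("fake", 1:n_fake)

dat <- data.frame(y = y, real, fake)
fit <- lm(y ~ ., data = dat)

# p-values for all 30 predictors (intercept dropped)
pv  <- summary(fit)$coefficients[-1, "Pr(>|t|)"]
sig <- names(pv)[pv < 0.05]
print(sig)
```

With 25 noise variables each tested at the 5% level, we should expect about 25 × 0.05 = 1.25 of them to be flagged on average, so finding one or two "significant" fake predictors is the norm rather than bad luck.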

Let’s also try the LASSO. This method, favored by some in machine learning circles, aims to reduce sampling variance by constraining the sum of the absolute values of the estimated coefficients to be at most a certain limit. The details are beyond our scope here, but the salient aspect is that the LASSO estimation process will typically come up with exact 0s for some of the estimated coefficients. In essence, then, LASSO can be used as a method for variable selection.
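For reference, the usual textbook statement of the LASSO in its constrained form is below. The notation (n observations, p predictors, bound t) is mine, not taken from the **lars** documentation.

```latex
\hat{\beta}^{\text{lasso}}
  = \arg\min_{\beta_0,\,\beta}
    \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\,\beta_j \Bigr)^{2}
  \quad \text{subject to} \quad
  \sum_{j=1}^{p} \lvert \beta_j \rvert \le t
```

Shrinking the bound t forces more of the estimated coefficients to be exactly 0, which is what makes the method usable for variable selection.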

Let’s use the **lars** package from CRAN for this:

    > larsout <- lars(pg2n[, -3],pg2n[, 3],trace=T)
    > summary(larsout)
    LARS/LASSO
    Call: lars(x = pg2n[, -3], y = pg2n[, 3],trace=T)
       Df   Rss       Cp
    0   1 12818 745.4102
    1   2 12696 617.9765
    2   3 12526 440.5684
    3   4 12146  40.7705
    4   5 12134  29.1536
    5   6 12121  17.7603
    6   7 12119  17.4313
    7   8 12111  11.5295
    8   9 12109  11.3575
    9  10 12106  10.6294
    10 11 12099   4.9085
    11 12 12099   6.2894
    12 13 12098   8.1549
    13 14 12098   9.0533
    ...

Again I’ve omitted some of the voluminous output, but here we see enough. LASSO determines that the best model under Mallows’ C_{p} criterion would include the same 4 variables identified by **lm()** — AND 6 of the fake variables!
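For readers who want to check how the count of 10 variables falls out of the table, here is a small sketch that retypes the rows shown above and picks the step with the smallest C_{p}. Two caveats: it covers only the displayed rows (the rest of the output was omitted), and it assumes the Df column counts the intercept, which is consistent with Df = 1 at step 0. The names `cp_tab`, `best` and `n_predictors` are mine.

```r
# Rows copied by hand from the summary(larsout) output shown above (steps 0-13 only)
cp_tab <- data.frame(
  step = 0:13,
  Df   = 1:14,
  Rss  = c(12818, 12696, 12526, 12146, 12134, 12121, 12119,
           12111, 12109, 12106, 12099, 12099, 12098, 12098),
  Cp   = c(745.4102, 617.9765, 440.5684, 40.7705, 29.1536, 17.7603, 17.4313,
           11.5295, 11.3575, 10.6294, 4.9085, 6.2894, 8.1549, 9.0533)
)

best <- cp_tab[which.min(cp_tab$Cp), ]   # row with the smallest Cp
n_predictors <- best$Df - 1              # assumption: Df includes the intercept
print(best)
print(n_predictors)
```

Among the displayed rows the minimum is at step 10, giving the 4 real plus 6 fake variables mentioned above.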

Undoubtedly some readers will have good suggestions, along the lines of “Why not try such-and-such on this data?” But my point is that all this goes to show, as promised, that effective application of statistics is far from automatic. Indeed, in the 2002 edition of his book, *Subset Selection in Regression*, Alan Miller laments that “very little progress has been made” in this field since his first edition came out in 1990.

Statistics is indeed as much an art as a science.

I aim for approximately one posting per week to this blog. I may not be able to get one out next week, but will have one within two weeks for sure. The topic: Progress on **Rth**, my parallel computation package for R, with my collaborator Drew Schmidt of **pbdR** fame.
