The following post by Norm Matloff originally appeared on his blog, Mad(Data)Scientist, on September 15th. We rarely republish posts that have appeared on other blogs; however, the questions Norm raises, both about the teaching of statistics and about his assertion that “R's statistical procedures are centered far too much on significance testing,” deserve a second look. Moreover, Norm's post elicited quite a few comments, many of which are at a high level of discourse. At the bottom of this post we have included excerpts from exchanges with statistician Mervyn Thomas and with philosopher of science Deborah Mayo. It is well worth reading the full threads of these exchanges, as well as those associated with a number of other comments. Norm has been a contributor to the Revolutions Blog in the past. We thank him for permission to republish his post. (Guest post editor, Joseph Rickert).
by Norm Matloff
My posting about the statistics profession losing ground to computer science drew many comments, not only here in Mad (Data) Scientist, but also in the co-posting at Revolution Analytics, and in Slashdot. One of the themes in those comments was that Statistics Departments are out of touch and have failed to modernize their curricula. Though I may disagree with the commenters’ definitions of “modern,” I have in fact long felt that there are indeed serious problems in statistics curricula.
I must clarify before continuing that I do NOT advocate that, to paraphrase Shakespeare, “First thing we do, we kill all the theoreticians.” A precise mathematical understanding of the concepts is crucial to good applications. But stat curricula are not realistic.
I’ll use Student t-tests to illustrate. (This is material from my open-source book on probability and statistics.) The t-test exemplifies the curricular ills in three separate senses:
Significance testing has long been known to be under-informative at best, and highly misleading at worst. Yet it is the core of almost any applied stat course. Why are we still teaching — actually highlighting — a method that is recognized to be harmful?
We prescribe the use of the t-test in situations in which the sampled population has an exact normal distribution — when we know full well that there is no such animal. All real-life random variables are bounded (as opposed to the infinite-support normal distributions) and discrete (unlike the continuous normal family). [Clarification, added 9/17: I advocate skipping the t-distribution, and going directly to inference based on the Central Limit Theorem. Same for regression. See my book.]
Going hand-in-hand with the t-test is the sample variance. The classic quantity s² is an unbiased estimate of the population variance σ², with s² defined as 1/(n-1) times the sum of squares of our data relative to the sample mean. The concept of unbiasedness does have a place, yes, but in this case there really is no point to dividing by n-1 rather than n. Indeed, even if we do divide by n-1, it is easily shown that the quantity that we actually need, s rather than s², is a BIASED (downward) estimate of σ. So that n-1 factor is much ado about nothing.
Right from the beginning, then, in the very first course a student takes in statistics, the star of the show, the t-test, has three major problems.
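Two of these three points lend themselves to a quick numerical check. The Python sketch below is an illustration added here (the post itself shows no code); the sample sizes, seeds and distributions are assumptions chosen for the demonstration. It first averages s² and s over many normal samples of size n = 5: s² should average close to σ² = 1, while s should average close to 0.940σ, the known value of E[s] for n = 5. The downward bias does not depend on normality: by Jensen's inequality, E[s] ≤ √E[s²] = σ, with strict inequality whenever s is not constant. The sketch then builds the CLT-based interval x̄ ± z·s/√n, using a standard normal quantile rather than a t quantile, and estimates its coverage when sampling from the bounded, discrete uniform distribution on 1, 2, …, 10.

```python
import math
import random
from statistics import NormalDist, fmean, stdev


def sum_sq_dev(xs):
    """Sum of squared deviations of the data from the sample mean."""
    m = fmean(xs)
    return sum((x - m) ** 2 for x in xs)


def bias_of_s(n=5, reps=100_000, seed=1):
    """Average s^2 and average s over many N(0, 1) samples, so sigma = 1."""
    rng = random.Random(seed)
    total_s2 = total_s = 0.0
    for _ in range(reps):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        s2 = sum_sq_dev(xs) / (n - 1)  # the classic 1/(n-1) estimate
        total_s2 += s2
        total_s += math.sqrt(s2)
    return total_s2 / reps, total_s / reps


def clt_interval(xs, level=0.95):
    """CLT-based interval for the mean: xbar +/- z * s / sqrt(n), where z is a
    standard normal quantile; no t-distribution is involved."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half = z * stdev(xs) / math.sqrt(len(xs))
    m = fmean(xs)
    return m - half, m + half


def clt_coverage(n=50, reps=5_000, seed=2):
    """Fraction of CLT intervals covering the true mean (5.5) when sampling
    from the bounded, discrete uniform distribution on 1..10."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        lo, hi = clt_interval([rng.randint(1, 10) for _ in range(n)])
        hits += lo <= 5.5 <= hi
    return hits / reps


mean_s2, mean_s = bias_of_s()
cov = clt_coverage()
print(f"average s^2 = {mean_s2:.3f}  (sigma^2 = 1)")
print(f"average s   = {mean_s:.3f}  (sigma = 1)")
print(f"CLT interval coverage = {cov:.3f}  (nominal 0.95)")
```

With n = 50, the estimated coverage should come out close to the nominal 95%, even though the sampled population is neither normal nor continuous.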
Sadly, the R language largely caters to this old-fashioned, unwarranted thinking. The var() and sd() functions use that 1/(n-1) factor, for example — a bit of a shock to unwary students who wish to find the variance of a random variable uniformly distributed on, say, 1,2,…,10.
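To make the example concrete: a random variable uniformly distributed on 1, 2, …, 10 has variance (10² − 1)/12 = 8.25. Applying the 1/(n − 1) factor to the ten support points, as R's var(1:10) does, gives 55/6 ≈ 9.17 instead. The minimal Python sketch below shows the contrast using the standard library's statistics module, where pvariance divides by n and variance divides by n − 1.

```python
import statistics

# Support of a random variable uniformly distributed on {1, 2, ..., 10}.
support = list(range(1, 11))
k = len(support)

# Variance of the random variable itself: (k^2 - 1) / 12 for a discrete
# uniform on 1..k, i.e. the mean squared deviation with a 1/k factor.
true_var = (k ** 2 - 1) / 12

print(true_var)                       # 8.25
print(statistics.pvariance(support))  # divides by k: matches true_var
print(statistics.variance(support))   # divides by k - 1, like R's var(): 55/6
```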
Much more importantly, R’s statistical procedures are centered far too much on significance testing. Take ks.test(), for instance; all one can do is a significance test, when it would be nice to be able to obtain a confidence band for the true cdf. Or consider log-linear models: The loglin() function is so centered on testing that the user must proactively request parameter estimates, never mind standard errors. (One can get the latter by using glm() as a workaround, but one shouldn’t have to do this.)
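As an illustration of what a confidence band for the true cdf could look like, here is a sketch (in Python, not R) of one standard construction, based on the Dvoretzky–Kiefer–Wolfowitz inequality. With Massart's constant, P(supₓ |Fₙ(x) − F(x)| > ε) ≤ 2e^(−2nε²), so choosing ε = √(ln(2/α)/(2n)) gives a simultaneous 1 − α band around the empirical cdf Fₙ. This is not something ks.test() returns, and the sample below is simulated purely for illustration.

```python
import bisect
import math
import random


def ecdf_with_band(xs, alpha=0.05):
    """Empirical cdf plus a simultaneous (1 - alpha) confidence band for the
    true cdf, from the DKW inequality with Massart's constant:
    P(sup_x |F_n(x) - F(x)| > eps) <= 2 * exp(-2 * n * eps**2)."""
    data = sorted(xs)
    n = len(data)
    eps = math.sqrt(math.log(2 / alpha) / (2 * n))

    def band(x):
        fhat = bisect.bisect_right(data, x) / n  # fraction of observations <= x
        # Clip the band to [0, 1], since a cdf cannot leave that range.
        return max(fhat - eps, 0.0), fhat, min(fhat + eps, 1.0)

    return band, eps


rng = random.Random(3)
sample = [rng.random() for _ in range(200)]  # illustrative Uniform(0, 1) data
band, eps = ecdf_with_band(sample)
for x in (0.25, 0.5, 0.75):
    lo, fhat, hi = band(x)
    print(f"x = {x:.2f}: {lo:.3f} <= F(x) <= {hi:.3f} "
          f"(estimate {fhat:.3f}, true value {x:.2f})")
```

The band holds simultaneously for all x, so it is wider than a pointwise interval at any single x; its half-width shrinks like 1/√n.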
I loved the suggestion by Frank Harrell in r-devel to at least remove the “star system” (asterisks of varying numbers for different p-values) from R output. A Quixotic action on Frank’s part (so of course I chimed in, in support of his point); sadly, no way would such a change be made. To be sure, R in fact is modern in many ways, but there are some problems nevertheless.
In my blog posting cited above, I was especially worried that the stat field is not attracting enough of the “best and brightest” students. Well, any thoughtful student can see the folly of claiming the t-test to be “exact.” And if a sharp student looks closely, he/she will notice the hypocrisy of using the 1/(n-1) factor in estimating variance for comparing two general means, but NOT doing so when comparing two proportions. If unbiasedness is so vital, why not use 1/(n-1) in the proportions case, a skeptical student might ask?
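The asymmetry a skeptical student would notice can be made concrete. The usual (Wald) standard error for a difference of two sample proportions uses p̂(1 − p̂)/n, and p̂(1 − p̂) is exactly the divide-by-n variance of the data coded as 0s and 1s. The usual standard error for a difference of two sample means instead uses s², with its 1/(n − 1) factor. The Python sketch below, with made-up counts, checks this identity and shows that for equal group sizes n the two formulas, applied to the same 0/1 data, differ by exactly the factor √(n/(n − 1)).

```python
import math
from statistics import pvariance, variance


def se_diff_means(xs, ys):
    """Standard error for xbar - ybar, using the usual s^2 (divisor n - 1)."""
    return math.sqrt(variance(xs) / len(xs) + variance(ys) / len(ys))


def se_diff_props(x1, n1, x2, n2):
    """Wald standard error for p1_hat - p2_hat, using p_hat * (1 - p_hat)."""
    p1, p2 = x1 / n1, x2 / n2
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)


# Coding successes as 1 and failures as 0 turns a proportion into a mean.
a = [1] * 7 + [0] * 13   # 7 successes out of 20 (illustrative data)
b = [1] * 12 + [0] * 8   # 12 successes out of 20 (illustrative data)

# p_hat * (1 - p_hat) equals the divide-by-n variance of the 0/1 data ...
print(pvariance(a), (7 / 20) * (13 / 20))
# ... so the two standard errors differ only through the divisor.
ratio = se_diff_means(a, b) / se_diff_props(7, 20, 12, 20)
print(ratio, math.sqrt(20 / 19))
```

With n = 20 in each group, the ratio is √(20/19), about 1.026.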
Some years ago, an Israeli statistician, upon hearing me kvetch like this, said I would enjoy a book written by one of his countrymen, titled What’s Not What in Statistics. Unfortunately, I’ve never been able to find it. But a good cleanup along those lines of the way statistics is taught is long overdue.
SEPTEMBER 16, 2014 AT 4:15 PM
I have run statistics operations in quite large public and private sector organisations, and directly supervised many masters and PhD level statisticians. The biggest problem I had with new statisticians was helping them to understand that nobody else cares about the statistics.
Of course the statistics is important, but only in so far as it helps produce solid and reliable answers to problems – or reveals that no such answers are available with current data. Nearly everybody is focussed on their own problems. The trick is producing results and reports which address those problems in a rigorous and defensible way.
In a sense, I see applied statistics as more of an engineering discipline – but one that makes careful use of rigorous analysis.
I believe that statistics departments have largely missed the boat with data science (except for a few standout examples like Stanford), and that the reason is that many academic statisticians have failed to engage with other disciplines properly. Of course, there are very significant exceptions to that – Terry Speed for example.
One of the most telling examples of that for me is the number of times academic statisticians have asked if I or my life science collaborators could provide them with data to test an approach, without actually wanting to engage with the problem that generated the data.
Relevance comes from engagement, not from rarefied brilliance. There is no better example of that than Fisher.
Does it matter? Yes because I see other disciplines reinventing the statistical wheel – and doing it badly.
SEPTEMBER 16, 2014 AT 5:01 PM
Very interesting comments. I largely agree.
Sadly, my own campus, the University of California at Davis, illustrates your point. To me, a big issue is joint academic appointments, and to my knowledge the Statistics Dept. has none. This is especially surprising in light of the longtime (several decades) commitment of UCD to interdisciplinary research. The Stat. Dept. has even gone in the opposite direction: The Stat grad program used to be administered by a Graduate Group, a unique UCD entity in which faculty from many departments run the graduate program in a given field; yet a few years ago, the Stat. Dept. disbanded its Graduate Group. I must hasten to add that there IS good interdisciplinary work being done by Stat faculty with researchers in other fields, but still the structure is too narrow, in my view.
(My own department, Computer Science, has several appointments with other disciplines, and more important, has actually expanded the membership of its Graduate Group.)
I would say, though, that I think the biggest reason Stat (in general, not just UCD) has been losing ground to CS and other fields is not because of disinterest in applications, but rather a failure to tackle the complex, large-scale, “messy” problems that the Machine Learning crowd addresses routinely.
SEPTEMBER 16, 2014 AT 5:19 PM
“a failure to tackle the complex, large-scale, “messy” problems that the Machine Learning crowd addresses routinely.” Good point! I have often struggled with junior statisticians wanting to know whether or not an analysis is ‘right’ rather than fit for purpose. That’s a strange preoccupation, because in 40 years as a professional statistician I have never done a ‘correct’ analysis. Everything is predicated on assumptions which are approximations at best.
SEPTEMBER 27, 2014 AT 9:27 PM
I reviewed the part in your book on tests vs CIs. It was quite as extreme as I’d remembered it. I’m so used to interpreting significance levels and p-values in terms of discrepancies warranted or not that I automatically have those (severity) interpretations in mind when I consider tests. Fallacies of rejection and acceptance, relativity to sample size–all dealt with, and the issues about CIs requiring testing supplements remain (especially in one-sided testing which is common). This paper covers 13 central problems with hypothesis tests, and how error statistics deals with them.
I remember many of the things I like A LOT about Matloff’s book. I’m glad he sees CIs as the way to go for variable choice (on prediction grounds) because it means that severity is relevant there too.
SEPTEMBER 28, 2014 AT 11:46 PM
Looks like a very interesting paper, Deborah (as I would have expected). I look forward to reading it. Just skimming through, though, it looks like I’ll probably have comments similar to the ones I made on Mervyn’s points.
Going back to my original post, do you at least agree that CIs are more informative than tests?