The June 2020 issue of JASA features a highly insightful essay by Brad Efron, dean of the world’s statisticians. The article is accompanied by commentary by a number of statistical luminaries. Important, indeed central questions are raised. I would hope JASA runs more pieces of this nature.
The essay may be viewed as an update of the classic 2001 article by Leo Breiman, regarding two largely separate cultures, statistics and machine learning (ML), updating not only in the sense of progress made since 2001, but also in the divide between Prediction and Description (“Attribution”) goals of regression analysis. (I of course use the latter term to mean estimation of E(Y | X), whether parametrically or not.)
I’ll provide my own commentary here, and, this being a statistics/R blog, bring in some related comments on R. I’ll have two main themes:
- The gap between the two cultures is as wide today as ever.
- There are fundamental problems with both cultural approaches, with a lot of “the emperor has no clothes” issues.
Parametric vs. Nonparametric Methods
Note carefully that the cultural difference is NOT one of parametric vs. nonparametric methods. For instance, statisticians were working on k-Nearest Neighbors as far back as the 1950s, if not earlier. Tree-based methods, notably random forests, were developed largely by statisticians in the 1970s, and arguably the best implementations are in R, the statisticians’ language, rather than Python, the MLers’ language.
Instead, I believe the widest gap between the two cultures involves sampling variation. It forms the very core of statistics, while ML mostly ignores it. I’ve observed this over the years in writings and statements by ML people, and in conversation with them.
This was exemplified in a recent Twitter exchange I had with a prominent ML researcher. I’d lamented that ML people don’t care about sampling variation, and the ML person responded, “What about causal models? Sampling variation is not an issue there.” But of course it IS an issue, a major one. Is the causal effect one seems to have found real, or just a sampling accident?
By the way, this extends to the research level, resulting in statistical work often focusing on asymptotics, while in ML the focus is on proving finite inequalities about the data itself, no connection to a population.
Frankly, I find that ML people just don’t seem to find statistics relevant to their work, and often don’t know much about stat. For example, in a book written by another prominent MLer, he recommends standardizing data to mean-0, variance-1 form before performing neural network analysis. Indeed, for reasons that seem to be unknown, this is mandatory, as the method may not converge otherwise. But I was shocked to see the MLer write that standardization presumes a Gaussian distribution, which is certainly not true.
R vs. Python (and R vs. R)
As noted, statisticians tend to use R, while ML people almost universally use Python. This in turn stems from the fact that ML has basically become a computer science field. Python is common, virtually standard in CS coursework and much of the CS research, so it is natural for MLers to gravitate to that language. R, of course, has been the lingua franca of stat.
That’s all fine, but it is adding to the cultural gap, to the detriment of both camps. Years ago I wrote, “R is written by statisticians, for statisticians” and that R is Statistically Correct. Sadly, in my view the Python ML software doesn’t meet that standard, which again stems from the general lack of statistical expertise in ML. On the other hand, there are a number of “full pipeline” packages in Python, notably for image classification, with nothing comparable in R.
To their credit, JJ Allaire of RStudio and Wes McKinney of the Python world have worked on tools to “translate” between the two languages. But I’m told by a prominent R developer that there is resistance on the Python side. Hopefully this will change.
Meanwhile, R itself is undergoing its own cultural split(s). On the language level, there is the base-R vs. Tidyverse split. (I consider the latter to be taking R in the wrong direction). But beyond that, a large and growing segment of R users see the language only as a tool for data wrangling (e.g. case filtering) rather than statistics. It should be viewed as both, and the lack of such a view will be problematic for stat users in the coming years, I believe.
“The Emperor Has No Clothes”
There are serious problems within each of the two cultures. So far, analysts have succeeded in blithely ignoring them, but I worry that this can’t continue.
Heavy use of cross-validation:
Efron notes, “A crucial ingredient of modern prediction methods is the training/test set paradigm…” This is combined with grid search to find the “best” combination of tuning parameters. (Called hyperparameters in the ML field. ML has its own terminology, e.g. inference for what stat people call prediction. Yet another cultural divide.)
For some ML methods, the number of tuning parameters can be large, and when combined in Cartesian product with many values for each one, we have a very large search space. That in turn raises a simultaneous inference problem, colloquially known today as p-hacking. The parameter combination that appears best may be far from best.
The fineTuning() function in my regtools package computes Bonferroni (Dunn) intervals for the various combinations as a way to partially deal with the problem. It also features a graphical exploration tool.
Similarly, some of those dazzling results from ML methods in competitions need to be taken with a grain of salt. There presumably is a p-hacking problem here as well; different contestants keep trying various parameter values and algorithm tweaks until they get a better value — whether by genuinely better technique or by random accident. A probability record-values analysis would be interesting here. Also see the impressive empirical findings in Rebecca Roelofs’ dissertation.
To me, the most interesting, thought-provoking part of Efron’s essay is the material on smoothness. But it brings to mind another “emperor has no clothes” issue.
The fact is that in real life, there is no such thing as smoothness, since all measurements are discrete rather than continuous. At best, human height is measured, say, only to 0.2 centimeter and air temperature to, say, 0.1 degree. Hence there are not even first derivatives, let alone higher-order ones. This calls into question stat theory (“Assume a continuous second derivative…”), and more importantly, any formulaic (as opposed to aesthetic) method for choosing the smoothness of a fitted curve.
Classic parametric theory has results along the lines of needing p = o(sqrt(n)) to obtain consistent estimates. There are various nonparametric results of this nature. There are more recent results regarding p > n, but with conditions that are typically unverifiable.
On the ML side, adherents have shown a number of successes in which p >> n. A typical “optimal” neural network may, after all the dust clears, have tens of millions of parameters (hidden-layer weights) with n only in the tens of thousands. Some NN specialists have even said overfitting is actually desirable. Yet, to my knowledge, there is no theoretical justification for this. (Efron notes this too.) Is it true? Or is there something else going on?
Neural networks as black boxes:
For that matter, what makes NNs work anyway? Efron describes them as “essentially elaborate logistic regression programs.” With a sigmoid activation function, and with the understanding that logits are composed together rather than, say added, that is true. However, in our work we show something more basic: NNs are essentially performing good old-fashioned polynomial regression. This has practical implications, e.g. that multicollinearity among the layer outputs increases from layer to layer. (We also have a corresponding R package, polyreg.)
In recent years, it has been found that for most applications, NNs will not do especially well, unlike their succeses with image classification for instance. Out of this has come Conventional Wisdom statements like, “For rectangular data, you may do better with non-NN methods.”
But upon closer inspection, one sees this particular “emperor” to be especially unclothed. Consider the famous MNIST data, consisting of 65,000 28×28 images. One can store this is a 65000 x 784 matrix (2ix28 = 784). In other words, n = 65000 and p = 784. Isn’t that rectangular? One prominent MLer uses the term perceptive data instead. That is not a satisfying explanation at all, nor, as noted, is the “rectangular” one.
So, are image and natural language data, putative areas of excellence for NNs, really different from all other areas? This seems unlikely. Instead, the likely explanation is that enormous efforts were expended in these areas, and NNs happened to be the tool that was used. Indeed, some researchers have taken the outputs of convolutional layers (which are just classic image operations), and run them through SVMs, obtaining results equal to or better than NNs.
So, what do we have? This discussion brings to mind an old paper, delightfully titled “What’s Not What What in Statistics.” There is so much more than could be added today.