Recently saw a really fun article making the rounds: The prevalence of statistical reporting errors in psychology (1985–2013) Nuijten, M.B., Hartgerink, C.H.J., van Assen, M.A.L.M. et al. Behav Res (2015). doi:10.3758/s13428-015-0664-2. The authors built an R package to check psychology papers for statistical errors. Please read on for how that is possible, some tools, and commentary.
Early automated analysis:
Trial model of a part of the Analytical Engine, built by Babbage, as displayed at the Science Museum (London) (Wikipedia).
From the abstract of Nuijten et.al. paper we have:
This study documents reporting errors in a sample of over 250,000 p-values reported in eight major psychology journals from 1985 until 2013, using the new R package “statcheck.” statcheck retrieved null-hypothesis significance testing (NHST) results from over half of the articles from this period. In line with earlier research, we found that half of all published psychology papers that use NHST contained at least one p-value that was inconsistent with its test statistic and degrees of freedom. One in eight papers contained a grossly inconsistent p-value that may have affected the statistical conclusion.
How did they do that? Has science been so systematized it is finally mechanically reproducible? Did they get access to one of the new open information extraction systems (please see Open Information Extraction: the Second Generation Oren Etzioni, Anthony Fader, Janara Christensen, Stephen Soderland, and Mausam for some discussion)?
No, they used the fact that the American Psychological Association defines a formal style for reporting statistical significances, just like they define a formal style for citations. Roughly it looks for text like the following:
The results of the regression indicated the two predictors explained 35.8% of the variance (R2=.38, F(2,55)=5.56, p < .01).
(From a derived style guide found at the University of Connecticut.)
The software looks for fragments like: “(R2=.38, F(2,55)=5.56, p < .01)”. So really we are looking at statistics in psychology papers because they have standards clear enough to facilitate inspection.
These statistical summaries are often put into research papers by cutting and pasting from multiple sources as not all stat packages report all these pieces in one contiguous string. So there are many chances for human error and therefore there is a very high chance they eventually get out of sync. Think of a researcher using Microsoft Word, Microsoft Excel, and some complicated graphical interface driven software again and again as data and treatment change throughout a study. Eventually something gets out of sync. We can try to check for inconsistency as both the reported p-value and R-squared are derivable from the
In fact the cited example has errors. The “explained 35.8% of the variance” should likely be 38% (to match the R2 / coefficient of determination) and the “
F(2,55)=5.56” bit would entail an R-squared closer to the following: F Test summary: (R2=0.17, F(2,55)=5.56, p≤0.00632) (we chose to show the actual p-value, but cutting off at a sensible limit is part of the guidelines). Likely this is a notional example itself built by copying and pasting to show the format (so we have no intent of mocking it). We derived this result by writing our own R function that takes the F-summaries and re-calculates the R-squared and p-value. In our case we performed the calculation by pasting the following into R: “
formatAPAR2fromCite(numdf=2,dendf=55,FValue=5.56)” which performs the calculation and formats the result close to APA style.
Really this helps point out why scientists should strongly prefer workflows that support reproducible research (a topic we teach using R, RStudio, knitr, Sweave, and optionally Latex). It would be better to have correct conclusions automatically transcribed into reports, instead of hoping to catch some fraction wrong ones later. This is one reason Charles Babbage specified a printer on both his Difference Engine 2 and Analytical Engine (circa 1847)- to avoid errors!
That being said we recommend reading the original paper. The ability to detect errors gives the ability to collect statistics on errors over time, so there
are a number of interesting observations to be made. For more work in this spirit we suggest An empirical study of FORTRAN programs Knuth, Donald E., Software: Practice and Experience, Vol. 1, No. 2, 1971, doi: 10.1002/spe.4380010203.
We can even trying running statcheck on the guide; it confirms the relation between the F-value and p-value and doesn’t seem to check the R-squared (probably not part of the intended check):
|Raw||F(2,55)=5.56, p < .01|
Our R code demonstrating how to automatically produce ready to go APA style F-summaries can be found here.