Simulating p curves and detecting dodgy stats

February 16, 2012

(This article was first published on Psychological Statistics, and kindly contributed to R-bloggers)

Psych Your Mind has an interesting blog post on using p curves to detect dodgy stats in a volume of published work (e.g., for a researcher or journal). The idea apparently comes from Uri Simonsohn (one of the authors of a recent paper on dodgy stats). The author (Michael W. Kraus) bravely plotted and published his own p curve, which looks reasonably ‘healthy’. However, he makes an interesting point: we don’t know how useful these curves are in practice, which depends among other things on the variability inherent in the profile of p values.

I quickly threw together a simulation in R to address this. It is pretty limited (as I don’t have much time right now), but potentially interesting. It simulates p values from independent t tests where the samples are drawn from independent normal distributions with equal variances but different means (and n = 25 per group). The population standardized effect size is fixed at d = 0.5 (psychology research generally reports median effect sizes around this value). Fixing the parameters is unrealistic, but is perhaps OK for a quick simulation.
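Something along these lines reproduces the setup (a minimal sketch under the assumptions above; the function name `sim_p` and the use of `replicate` are my own choices, not the original post’s code):

```r
# Simulate a p value from an independent-samples t test where both
# groups are drawn from normal distributions with equal variances
# (SD = 1), n = 25 per group, and a population effect size of d = 0.5.
sim_p <- function(n = 25, d = 0.5) {
  x <- rnorm(n, mean = 0, sd = 1)
  y <- rnorm(n, mean = d, sd = 1)
  t.test(x, y, var.equal = TRUE)$p.value
}

# A 'researcher' with 50 published p values:
p_vals <- replicate(50, sim_p())
```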

I ran this several times and plotted p curves (really just histograms with bins collecting p values at relevant intervals). First I plotted curves for an early-career researcher with just a few publications reporting 50 p values. I then repeated the exercise for more experienced researchers with n = 100 or n = 500 published p values.
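One way to draw such a grid of histograms (again a sketch: the 3 × 5 layout and the .01-wide bins up to p = .10 are my assumptions about what ‘relevant intervals’ means):

```r
# Draw 15 simulated p curves, each based on 50 p values, as histograms
# with .01-wide bins over the interval [0, .10].
par(mfrow = c(3, 5))
for (i in 1:15) {
  p_vals <- replicate(50, sim_p())
  hist(p_vals[p_vals <= .10], breaks = seq(0, .10, by = .01),
       main = paste("Sample", i), xlab = "p value")
}
```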

Here are the 15 random plots for 50 p values:

At least one of the plots has a suspicious spike between p = .04 and .05 (exactly where dodgy practices would tend to push the p values).

What about 100 p values?

Here the plots are still variable (but closer to the theoretical ideal plotted on Kraus’ blog).

You can see this pattern even more clearly with 500 p values:

Some quick conclusions … The method is too unreliable for use with early-career researchers. You need a few hundred p values to be pretty confident of a nice flat pattern between p = .01 and p = .06. Varying the effect size and other parameters might well inject further noise (as would adding in null effects, which have a uniform distribution of p values and are thus probably rather noisy).
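For instance, null effects can be mixed in by setting d = 0 for a random subset of tests (a quick extension of the sketch above, not something the original simulation did):

```r
# Mix null effects (d = 0) with real effects (d = 0.5) in equal
# proportion; the null p values are uniform on [0, 1], flattening
# (and adding noise to) the p curve.
p_mixed <- replicate(500, sim_p(d = sample(c(0, 0.5), 1)))
```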
I’m also skeptical that this is useful for detecting fraud (presumably deliberate fraud will tend to go for ‘impressive’ p values such as p < .0001). Also, going forward, fraudsters will be able to generate results that circumvent tools such as p curves (if they are known to be in use).
