Simulating p curves and detecting dodgy stats

[This article was first published on Psychological Statistics, and kindly contributed to R-bloggers.]

Psych your mind has an interesting blog post on using p curves to detect dodgy stats in a volume of published work (e.g., for a researcher or journal). The idea apparently comes from Uri Simonsohn (one of the authors of a recent paper on dodgy stats). The author (Michael W. Kraus) bravely plotted and published his own p curve, which looks reasonably ‘healthy’. However, he makes an interesting point: we don’t know how useful these curves are in practice, and that depends, among other things, on the variability inherent in the profile of p values.

I quickly threw together a simulation in R to address this. It is pretty limited (as I don’t have much time right now), but potentially interesting. It simulates p values from independent t tests, where the two samples are drawn from independent normal distributions with equal variances but different means (n = 25 per group). The population standardized effect size is fixed at d = 0.5 (as psychology research generally reports median effect sizes around this value). Fixing the parameters is unrealistic, but is perhaps OK for a quick simulation.
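The setup described above can be sketched in a few lines of R. This is a minimal reconstruction from the description, not the original script: the function name `sim_p` and its argument names are hypothetical, and the seed is arbitrary. Because both populations have SD = 1, a mean difference of 0.5 corresponds to d = 0.5.

```r
# Simulate p values from independent two-sample t tests.
# n_p:         number of p values to generate (hypothetical argument name)
# n_per_group: sample size in each group
# d:           population standardized effect size
sim_p <- function(n_p, n_per_group = 25, d = 0.5) {
  replicate(n_p, {
    x <- rnorm(n_per_group, mean = 0, sd = 1)  # first group
    y <- rnorm(n_per_group, mean = d, sd = 1)  # second group, shifted by d SDs
    t.test(x, y, var.equal = TRUE)$p.value     # equal-variance (Student) t test
  })
}

set.seed(1)       # arbitrary seed, only for reproducibility
p50 <- sim_p(50)  # e.g., a researcher with 50 published p values
mean(p50 < .05)   # share of significant results in this one sample
```

Note that this generates every p value, significant or not; a p curve built from published work would typically see only a filtered subset.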

I ran this several times and plotted p curves (really just histograms with bins collecting p values at relevant intervals). First I plotted for an early career researcher with just a few publications reporting 50 p values. I then repeated this for more experienced researchers with 100 or 500 published p values.
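A rough sketch of how such p curves could be drawn is shown here. The binning (width .01 from 0 to .10, with larger p values dropped) is an assumption, as the exact intervals used for the plots are not stated; `sim_p` and `p_curve` are hypothetical helper names.

```r
# Simulate p values from equal-variance t tests (n = 25 per group, d = 0.5)
sim_p <- function(n_p, n_per_group = 25, d = 0.5) {
  replicate(n_p, t.test(rnorm(n_per_group, 0, 1),
                        rnorm(n_per_group, d, 1),
                        var.equal = TRUE)$p.value)
}

# Count p values in bins of width .01 between 0 and .10 (assumed binning);
# p values above the last break are dropped from the curve
p_curve <- function(p, breaks = seq(0, 0.10, by = 0.01)) {
  kept <- p[p <= max(breaks)]
  table(cut(kept, breaks = breaks, include.lowest = TRUE))
}

set.seed(2)                                      # arbitrary seed
op <- par(mfrow = c(3, 5), mar = c(4, 2, 2, 1))  # 3 x 5 grid = 15 panels
for (i in 1:15) {
  barplot(p_curve(sim_p(50)), las = 2, cex.names = 0.6,
          main = paste("Sample", i))
}
par(op)                                          # restore graphics settings
```

Changing `sim_p(50)` to `sim_p(100)` or `sim_p(500)` gives the corresponding plots for the larger sets of p values.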

Here are 15 random plots for 50 p values:

At least one of the plots has a suspicious spike between p = .04 and .05 (exactly where dodgy practices would tend to push the p values).

What about 100 p values?

Here the plots are still variable (but closer to the theoretical ideal plotted on Kraus’ blog).

You can see this pattern even more clearly with 500 p values:

Some quick conclusions … The method is too unreliable for use with early career researchers. You need a few hundred p values to be pretty confident of a nice flat pattern between p = .01 and p = .06. Varying the effect size and other parameters might well inject further noise (as would adding in null effects, which have a uniform distribution of p values and are thus probably rather noisy).
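One way to put a number on this unreliability is to look at how much the share of p values landing between .04 and .05 varies from one simulated researcher to the next. The sketch below is an illustration under the same assumptions as the simulation (n = 25 per group, d = 0.5); the helper names and the choice of 100 replications are arbitrary and not taken from the original analysis.

```r
# Simulate p values from equal-variance t tests (n = 25 per group, d = 0.5)
sim_p <- function(n_p, n_per_group = 25, d = 0.5) {
  replicate(n_p, t.test(rnorm(n_per_group, 0, 1),
                        rnorm(n_per_group, d, 1),
                        var.equal = TRUE)$p.value)
}

# Proportion of one researcher's p values that fall in (.04, .05]
share_04_05 <- function(n_p) {
  p <- sim_p(n_p)
  mean(p > 0.04 & p <= 0.05)
}

set.seed(3)     # arbitrary seed
n_reps <- 100   # simulated researchers per condition (arbitrary choice)
sizes  <- c(50, 100, 500)

# SD of the (.04, .05] share across simulated researchers, per sample size
spread <- sapply(sizes, function(n_p) sd(replicate(n_reps, share_04_05(n_p))))
names(spread) <- sizes
print(round(spread, 3))
```

Since the share is a simple proportion, its spread should shrink roughly in line with the square root of the number of p values, which is consistent with the plots only settling down at a few hundred p values.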

I’m also skeptical that this is useful for detecting fraud (as presumably deliberate fraud will tend to go for ‘impressive’ p values such as p < .0001). Also (going forward) fraudsters will be able to generate results to circumvent tools such as p curves (if they are known to be in use).
