Psych Your Mind has an interesting blog post on using p curves to detect dodgy stats in a volume of published work (e.g., for a researcher or journal). The idea apparently comes from Uri Simonsohn (one of the authors of a recent paper on dodgy stats). The author (Michael W. Kraus) bravely plotted and published his own p curve, which looks reasonably ‘healthy’. However, he makes an interesting point: we don’t know how useful these curves are in practice, and that depends among other things on the variability inherent in the profile of p values.
I quickly threw together a simulation in R to address this. It is pretty limited (as I don’t have much time right now), but potentially interesting. It simulates p values from independent-samples t tests where the two samples are drawn from independent normal distributions with equal variances but different means (and n = 25 per group). The population standardized effect size is fixed at d = 0.5 (as psychology research generally reports median effect sizes around this value). Fixing the parameters is unrealistic, but is perhaps OK for a quick simulation.
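For readers who want to try something similar, here is a minimal sketch of that setup in Python rather than R. It is not the original simulation code; the function names (`two_sided_p`, `simulate_p_values`) and the seed are made up for illustration. To keep it free of dependencies, the p value is approximated by integrating the t density numerically, where in practice you would probably call a library routine such as R's `t.test` or SciPy's `scipy.stats.ttest_ind`.

```python
import math
import random
import statistics


def two_sided_p(t, df, steps=200):
    """Approximate two-sided p value for a t statistic with df degrees of freedom.

    Integrates the t density from 0 to |t| with Simpson's rule (steps must be
    even). This stands in for a library t test to keep the sketch self-contained.
    """
    x = abs(t)
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(u):
        # Density of the central t distribution at u.
        return math.exp(log_c - (df + 1) / 2 * math.log1p(u * u / df))

    h = x / steps
    total = pdf(0.0) + pdf(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    central_area = total * h / 3  # P(0 < T < |t|)
    return min(1.0, max(0.0, 1.0 - 2.0 * central_area))


def simulate_p_values(n_p, n_per_group=25, d=0.5, rng=None):
    """Simulate n_p independent equal-variance t tests with true effect size d."""
    rng = rng or random.Random()
    df = 2 * n_per_group - 2
    p_values = []
    for _ in range(n_p):
        a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        b = [rng.gauss(d, 1.0) for _ in range(n_per_group)]
        # With equal group sizes the pooled variance is the mean of the two variances.
        pooled_var = (statistics.variance(a) + statistics.variance(b)) / 2
        se = math.sqrt(pooled_var * 2 / n_per_group)
        t = (statistics.mean(b) - statistics.mean(a)) / se
        p_values.append(two_sided_p(t, df))
    return p_values


rng = random.Random(2012)  # arbitrary seed, only for reproducibility
p_values = simulate_p_values(50, rng=rng)
print(sum(p < 0.05 for p in p_values), "of", len(p_values), "p values fall below .05")
```

Changing `d` or `n_per_group` is the obvious way to explore how sensitive the picture is to the fixed parameters.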
I ran this several times and plotted p curves (really just histograms with bins collecting p values at relevant intervals). First I plotted for an early career researcher with just a few publications reporting 50 p values. I then repeated for more experienced researchers with 100 or 500 published p values.
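The binning step is simple enough to sketch. The version below is an assumption about what a p curve histogram might look like, not the code behind the plots: the bin width of .01, the upper limit of .10, and the name `p_curve` are all choices made for illustration, and the example p values are invented.

```python
def p_curve(p_values, width=0.01, upper=0.10):
    """Count p values in bins [0, width), [width, 2*width), ... below `upper`.

    p values at or above `upper` are ignored, since the region of interest
    is the area around the conventional .05 threshold.
    """
    n_bins = round(upper / width)
    counts = [0] * n_bins
    for p in p_values:
        idx = int(p / width)
        if 0 <= idx < n_bins:
            counts[idx] += 1
    return counts


# Invented p values, purely to show the output format.
example = [0.003, 0.012, 0.018, 0.044, 0.047, 0.049, 0.2]
for i, count in enumerate(p_curve(example)):
    # Crude text histogram: one '#' per p value in the bin.
    print(f"{i * 0.01:.2f}-{(i + 1) * 0.01:.2f} {'#' * count}")
```

In this toy example three of the seven values land in the .04 to .05 bin, which is the kind of spike the method is meant to flag.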
Here are 15 random plots for 50 p values:
At least one of the plots has a suspicious spike between p = .04 and .05 (exactly where dodgy practices would tend to push the p values).
What about 100 p values?
Here the plots are still variable (but closer to the theoretical ideal plotted on Kraus’ blog).
You can see this pattern even more clearly with 500 p values:
Some quick conclusions … The method is too unreliable for use with early career researchers. You need a few hundred p values to be pretty confident of a nice flat pattern between p = .01 and p = .06. Varying the effect size and other parameters might well inject further noise (as would adding in null effects which have a uniform distribution of p values and are thus probably rather noisy).
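A rough back-of-the-envelope check on why small collections are so noisy: if the p values are independent, the count in any one bin is binomial, so its standard deviation relative to its mean shrinks only with the square root of the number of p values. The sketch below assumes a hypothetical probability of q = 0.03 for a p value landing in a single .01-wide bin; that figure is an illustrative guess, not a value taken from the simulation.

```python
import math


def bin_count_noise(n_p, q):
    """Mean and standard deviation of a binomial bin count.

    n_p: number of independent p values; q: probability that a single
    p value falls in the bin (hypothetical, supplied by the caller).
    """
    mean = n_p * q
    sd = math.sqrt(n_p * q * (1 - q))
    return mean, sd


for n_p in (50, 100, 500):
    mean, sd = bin_count_noise(n_p, q=0.03)  # q = 0.03 is an assumption
    print(f"n = {n_p}: expected count {mean:.1f}, SD {sd:.2f}, SD/mean {sd / mean:.2f}")
```

With only 50 p values the expected count in such a bin is between one and two, so a bin holding three or four values is unremarkable; with 500 the same bin is far more stable.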
I’m also skeptical that this is useful for detecting fraud (as presumably deliberate fraud will tend to go for ‘impressive’ p values such as p < .0001). Also (going forward) fraudsters will be able to generate results to circumvent tools such as p curves (if they are known to be in use).