I finally found some time to take a closer look at p curves. I haven’t had a chance to follow up my simulations (and probably won’t for a few weeks, if not months), but I have had time to think through the ideas the p curve approach raises, based on some of the comments I’ve received and a brief exchange with Uri Simonsohn (who has answered a few of my questions).
First, I got a couple of things at least partly wrong.
i) how p curves work
ii) the potential for correlated p values
How p curves work
I made the (I think) reasonable assumption that p curve analysis involved focusing on a bump just under the p = .05 threshold. Other work (Wicherts et al., 2011) has shown that there is indeed some distortion around this value. My crude simulation suggested that p curves could maybe be used to detect this kind of bump – but that the method was noisy and required large N.
All good so far except my assumption was completely wrong. This isn’t what Simonsohn and colleagues are proposing at all. They are focusing on the whole of the distribution between p = 0 and p = .05. This is a very different kind of analysis because it uses all the available p value information about ‘p hacking’ (if you accept the highly plausible premise that p hacking is concentrated on statistically significant p values).
Null effects will therefore produce a flat p curve (because the distribution of p under the null is uniform). Simonsohn argues that non-null effects should produce downward sloping p curves. He and his colleagues have simulated p curves under various ranges of effect size to confirm this – and there is also an analytic proof for the normal case (Hung et al., 1997).* I also (inadvertently) confirmed this in my original simulations – which show the downward sloping trend (but note that I include p values up to p = .10 in my plots).
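To make the flat versus downward sloping contrast concrete, here is a minimal sketch in Python (standard library only). It is not the simulation from my earlier post or the one Simonsohn and colleagues ran. It assumes a two-sided one-sample z test with known SD of 1, n = 20 per study, and a standardized effect of 0 or 0.5; the function names, bin count and seed are all illustrative choices.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def simulate_p_values(effect, n=20, n_studies=20000, seed=1):
    """Two-sided z-test p values for n_studies studies of size n (SD assumed known = 1)."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_studies):
        mean = sum(rng.gauss(effect, 1.0) for _ in range(n)) / n
        z = mean * n ** 0.5
        p_values.append(2.0 * (1.0 - STD_NORMAL.cdf(abs(z))))
    return p_values

def p_curve(p_values, n_bins=5, upper=0.05):
    """Share of the significant p values that fall in each equal-width bin of (0, upper)."""
    significant = [p for p in p_values if p < upper]
    counts = [0] * n_bins
    for p in significant:
        counts[min(int(p / upper * n_bins), n_bins - 1)] += 1
    return [c / len(significant) for c in counts]

# Bins are (0, .01), (.01, .02), ..., (.04, .05).
null_curve = p_curve(simulate_p_values(effect=0.0))
true_curve = p_curve(simulate_p_values(effect=0.5))
print("null effect:", [round(s, 2) for s in null_curve])
print("d = 0.5    :", [round(s, 2) for s in true_curve])
```

Under these assumptions the null curve should sit close to 0.2 in every bin, while the non-null curve should pile up in the lowest bin and fall away towards .05.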
However, mixing p hacked studies into a flat curve will produce an upward sloping curve – the feature that Simonsohn and his colleagues are focusing on. I haven’t simulated this directly – but it seems sensible because p hacking is (in essence) a flavour of optional stopping (adding data or iterating analyses until you squeeze a statistically significant effect out). Certainly, an upward sloping curve would be a signal of something weird going on.
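One way to check the intuition is a rough optional stopping sketch. The setup is entirely an assumption for illustration, not a description of how p hacking actually happens in any given literature: a null effect, a z test with known SD of 1, a first look at n = 10, then a further look after every 5 extra observations up to n = 50, stopping as soon as p < .05.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def two_sided_p(total, n):
    """Two-sided z-test p value from a running sum, assuming known SD = 1."""
    z = total / n ** 0.5
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))

def hacked_study(rng, start_n=10, step=5, max_n=50):
    """Null study that peeks after each batch and stops at the first p < .05."""
    total, n = 0.0, 0
    while n < max_n:
        batch = start_n if n == 0 else step
        total += sum(rng.gauss(0.0, 1.0) for _ in range(batch))
        n += batch
        p = two_sided_p(total, n)
        if p < 0.05:
            break
    return p  # p value at the look where the study stopped

rng = random.Random(2)
n_studies = 20000
significant = [p for p in (hacked_study(rng) for _ in range(n_studies)) if p < 0.05]
lower = sum(p < 0.025 for p in significant)   # p in (0, .025)
upper = len(significant) - lower              # p in [.025, .05)
print(f"significant: {len(significant)} of {n_studies}")
print(f"p < .025: {lower}   .025 <= p < .05: {upper}")
```

If the reasoning above is right, two things should show up: more than 5% of these null studies reach significance, and the significant p values cluster in the upper half of the (0, .05) range, which is the upward slope.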
This approach uses more information than my mistaken ‘p bump’ approach and so should be much more stable.
* It is far from unreasonable to treat the distribution of effects as approximately normal – as is common in meta-analysis (and see also Gillett, 1994), but I don’t think the pattern depends strongly on this assumption.
Correlated p values
It is well known that p values are inherently extremely noisy ‘statistics’ – they jump around all over the place for identical replications. Geoff Cumming and colleagues have published some good work on this (e.g., Cumming & Fidler, 2009). Thus the same effect in different studies, or different effects of similar sizes, will in general not tend to have correlated p values. However, the noise that causes this jumping around will be crystallized if you use the same data to re-calculate the p value. This could cause correlated p values where data are re-used or where variables are very highly correlated. For example, this could happen if you add a covariate that is a modest predictor of Y and uncorrelated with X, and report p values with and without the covariate. It could also happen if you report essentially the same analysis twice with a very similar variable (e.g., X correlated with children’s age or X correlated with years of schooling).
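The age versus schooling example can be sketched as a small simulation. Everything here is assumed for illustration: n = 50 per study, a true slope of 0.3 for X on age, 'schooling' built as age plus a little noise, and an approximate p value for a correlation via the Fisher z transformation rather than the exact t test.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def correlation_p(xs, ys):
    """Approximate two-sided p value for r using the Fisher z transformation."""
    z = math.atanh(pearson_r(xs, ys)) * math.sqrt(len(xs) - 3)
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))

def one_dataset(rng, n=50, effect=0.3):
    """Hypothetical study: X depends on age; schooling is a near-duplicate of age."""
    age = [rng.gauss(0, 1) for _ in range(n)]
    schooling = [a + 0.2 * rng.gauss(0, 1) for a in age]
    x = [effect * a + rng.gauss(0, 1) for a in age]
    return x, age, schooling

rng = random.Random(3)
p_age, p_schooling, p_replication = [], [], []
for _ in range(2000):
    x, age, schooling = one_dataset(rng)
    p_age.append(correlation_p(x, age))              # analysis 1
    p_schooling.append(correlation_p(x, schooling))  # analysis 2, same data
    x2, age2, _ = one_dataset(rng)                   # fresh, independent replication
    p_replication.append(correlation_p(x2, age2))

reuse_corr = pearson_r(p_age, p_schooling)
replication_corr = pearson_r(p_age, p_replication)
print(f"same data, similar variable: r = {reuse_corr:.2f}")
print(f"independent replication:     r = {replication_corr:.2f}")
```

Under these assumptions the p values from the two analyses of the same data should be strongly correlated, while p values from independent replications of the same effect should be essentially uncorrelated.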
There are two main solutions here: a) just filter out p values that re-use data or use highly-correlated data, or b) model the correlations in some way by accounting for within-study clustering – as you might in a multilevel model and some forms of meta-analysis (itself a form of multilevel model).
In summary, I think the p curve approach looks very interesting, and I’d certainly like to see more work on it (and hope to see the full version published some time soon).
Cumming, G., & Fidler, F. (2009). Confidence Intervals. Zeitschrift für Psychologie / Journal of Psychology, 217(1), 15-26.
Gillett, R. (1994). Post hoc power analysis. Journal of Applied Psychology, 79(5), 783-785. doi:10.1037//0021-9010.79.5.783
Hung, H. M., O’Neill, R. T., Bauer, P., & Köhne, K. (1997). The behavior of the P-value when the alternative hypothesis is true. Biometrics, 53(1), 11-22.
Wicherts, J. M., Bakker, M., & Molenaar, D. (2011). Willingness to share research data is related to the strength of the evidence and the quality of reporting of statistical results. PLoS ONE, 6(11), e26828. doi:10.1371/journal.pone.0026828