
# Penalizing P Values

Ioannidis' paper
suggesting that most published results in medical research are not true is now
high-profile enough that even my dad, an artist who wouldn't know a test
statistic if it hit him in the face, knows about it. It has even shown up
recently in the Economist
as a cover article and plays directly into the “decline effect” discussed in
a cover story
in the New Yorker from 2010. Something is seriously wrong with science if only
a small fraction of papers can actually be replicated.

But, placed in the context of the “decline effect,” this result makes sense. And
it is a fundamental aspect and potential flaw in the way frequentist inference
treats hypotheses.

Using a wild example, suppose I return from a walk in the woods and report an
encounter with bigfoot. Now, while it is
possible that bigfoot is real, it seems unlikely.
But I have some blurry pictures and video of something moving off in the bush.
I claim that this is evidence that bigfoot is real.

I show you my evidence of bigfoot and tell you about my encounter. You know me
and know that I am fairly sane and always wear my glasses. You think it is
unlikely that I would make this up or mistake a deer for bigfoot. You think
there is less than a 5% chance that I would produce evidence this convincing,
or more so, given that bigfoot is not real. The evidence would therefore
suggest that you reject the null that bigfoot is not real.

Hopefully, you don't think that is reasonable. But that is exactly how
frequentist inference treats evidence for or against the null. The p value
is simply $$P(\theta \geq \hat{\theta} | H_0)$$. The claim that bigfoot is
real is given as much credibility as the claim that smoking causes cancer
(RA Fisher might think that is reasonable but the rest of us have reason for
concern). We would probably conclude that it was much more likely that I saw
a deer or a hoax than that I saw an actual bigfoot.

This becomes a problem for a few reasons:

1. We notice things more when they are unexpected
2. We report things more when they are unexpected
3. Many things that are unexpected are unexpected for a reason

This problem is especially serious when people “throw statistics” at data with
the goal of making causal inference without using a priori theory as a guide.
They find that there is a relationship between X and Y that is significant at
$$\alpha = 0.05$$ and publish.

The field of Bayesian statistics provides a radically different form of
inference that can potentially be used to address this question, but a simple
back of the envelope penalty term may work just as well. Consider the simple
cases of Bayes theorem,

$P(A | B) = \frac{P(B | A) P(A)}{P(B | A) P(A) + P(B | A^c) P(A^c)}$

Take $$A$$ to be the event that the null hypothesis is true and $$B$$ the
event of observing a significant result. Then $$P(B | A)$$ plays the same role
as the p value, and $$P(A)$$ is our a priori estimate of how likely the null
hypothesis is to be true. What is the probability of rejecting the null when
the null is not true? That is simply the power of the test with the given
parameters, $$P(B | A^c)$$. Suppose we set $$P(B | A)$$ to some constant value
(e.g., 0.05) and label anything with $$p$$ less than that value significant
and anything greater non-significant, i.e., $$P(B | A) = \alpha$$. We can then
calculate the rate of “false positive” results for that value of $$\alpha$$
and power with

$P(H_0 | \hat{\theta}) = \frac{P(\theta \geq \hat{\theta} | H_0) P(H_0)}{P(\theta \geq \hat{\theta} | H_0) P(H_0) + P(\theta \geq \hat{\theta} | H_0^c) (1 - P(H_0))}$
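To make the arithmetic concrete, here is a minimal sketch of that calculation
(illustrative code only; the function name is my own shorthand, not anything
from the app below):

```python
def p_null_given_significant(alpha, power, prior_null):
    """Posterior probability that the null is true given a significant result,
    via Bayes' theorem: P(sig | H0) = alpha, P(sig | not H0) = power,
    P(H0) = prior_null."""
    return (alpha * prior_null) / (
        alpha * prior_null + power * (1 - prior_null)
    )

# With alpha = 0.05, power = 0.80, and even prior odds on the null:
print(p_null_given_significant(0.05, 0.80, 0.50))  # 0.025 / 0.425, about 0.06
```

Note that this treats every significant test as sitting exactly at
$$p = \alpha$$, which is the constant-threshold simplification described
above.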

I wanted to get a feel for what this would look like and how these different
parameters would interact. Also I needed an excuse to learn Shiny. You can see how this comes together and play with the
values in the dynamic graph below.

I would encourage you to play around with it and see how the different values
affect the probability that the alternative is true. In the default
case, where we place equal weight on the null being true or false and have
well-powered studies, we do pretty well for ourselves. But as soon as you lower
the power to a more plausible 0.35, the probability of the results being
spurious roughly doubles. If you set the power back to 0.80 but set the
probability of the null being true at 90%, as Ioannidis suggests, the
probability of a false positive at $$\alpha = 0.05$$ is now roughly 35%! If you
combine the low power and the unlikeliness of the tested claims, the
probability of false conclusions is well over 50% using a standard $$\alpha$$.
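Those four scenarios are easy to check directly (a quick sketch; the helper
function is my own shorthand for the Bayes formula above, not part of the
Shiny app):

```python
def p_null_given_significant(alpha, power, prior_null):
    # Bayes' theorem with P(sig | H0) = alpha and P(sig | not H0) = power.
    return (alpha * prior_null) / (alpha * prior_null + power * (1 - prior_null))

scenarios = [
    ("well powered, even prior", 0.05, 0.80, 0.50),  # ~0.06
    ("low power, even prior",    0.05, 0.35, 0.50),  # ~0.12, roughly double
    ("well powered, 90% prior",  0.05, 0.80, 0.90),  # 0.36
    ("low power, 90% prior",     0.05, 0.35, 0.90),  # ~0.56, over half
]
for name, alpha, power, prior in scenarios:
    print(f"{name}: {p_null_given_significant(alpha, power, prior):.3f}")
```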

As exciting as it would be to be known as the guy who found bigfoot, odds are
that it was just some high schoolers out to play games with you. The null
should be treated differently when we are evaluating surprising and
unexpected results. Even a simple sanity check like the one described here may
reduce the surprisingly and unsustainably large number of findings that are
later falsified or fail to replicate. It certainly explains one process by
which they may occur.