
# Penalizing P Values

Ioannidis' paper suggesting that most published results in medical research are not true is now high profile enough that even my dad, an artist who wouldn't know a test statistic if it hit him in the face, knows about it. It has even shown up recently in the Economist as a cover article and plays directly into the “decline effect” discussed in a cover story in the New Yorker from 2010. Something is seriously wrong with science if only a small fraction of papers can actually be replicated.

But, placed in the context of the “decline effect,” this result makes sense. And it is a fundamental aspect and potential flaw in the way frequentist inference treats hypotheses.

Using a wild example, suppose I return from a walk in the woods and report an encounter with bigfoot. Now, while it is possible that bigfoot is real, it seems unlikely. But I have some blurry pictures and video of something moving off in the bush. I claim that this is evidence that bigfoot is real.

I show you my evidence of bigfoot and tell you about my encounter. You know me and know that I am fairly sane and always wear my glasses. You think it is unlikely that I would make this up or mistake a deer for bigfoot. You think there is less than a 5% chance that I would produce evidence this convincing, or more convincing, given that bigfoot is not real. Therefore, the evidence would suggest that you reject the null that bigfoot is not real.

Hopefully, you don't think that is reasonable. But that is exactly how frequentist inference treats evidence for or against the null. The p value is simply \( P(\theta \geq \hat{\theta} | H_0) \). The claim that bigfoot is real is given as much credibility as the claim that smoking causes cancer (RA Fisher might think that is reasonable, but the rest of us have reason for concern). We would probably conclude that it was much more likely that I saw a deer or a hoax than an actual bigfoot.

This becomes a problem for a few reasons:

- We notice things more when they are unexpected
- We report things more when they are unexpected
- Many things that are unexpected are unexpected for a reason

This problem is especially serious when people “throw statistics” at data with the goal of making causal inference without using a priori theory as a guide. They find that there is a relationship between X and Y that is significant at \( \alpha = 0.05 \) and publish.

The field of Bayesian statistics provides a radically different form of inference that can potentially be used to address this question, but a simple back-of-the-envelope penalty term may work just as well. Consider the simple case of Bayes' theorem,

\[ P(A | B) = \frac{P(B | A) P(A)}{P(B | A) P(A) + P(B | A^c) P(A^c)} \]

Take \( P(B | A) \) to be the same as the p value and \( P(A) \) to be our a priori estimate of how likely the null hypothesis is to be true. What is the probability of rejecting the null when the null is not true? That is simply the power of the test with the given parameters; in other words, \( P(B | A^c) \) is the power. Suppose we set \( P(B | A) \) to some constant value (e.g., 0.05) and label anything with \( p \) less than that value as significant and anything greater as non-significant, i.e., \( P(B | A) = \alpha \). We can then calculate the rate of “false positive” results for that value of \( \alpha \) and power with

\[ P(H_0 | \theta \geq \hat{\theta}) = \frac{P(\theta \geq \hat{\theta} | H_0) P(H_0)}{P(\theta \geq \hat{\theta} | H_0) P(H_0) + P(\theta \geq \hat{\theta} | H_0^c) (1 - P(H_0))} \]

I wanted to get a feel for what this would look like and how these different parameters would interact. Also, I needed an excuse to learn Shiny. You can see how this comes together and play with the values in the dynamic graph below.
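If you just want the numbers, a minimal R sketch of the same calculation looks something like this (the function name and defaults are my own choices for illustration, not taken from the Shiny app):

```r
# Probability that the null is true given a "significant" result,
# following the Bayes theorem expression above:
#   alpha  = P(reject | H0)       -- the significance threshold
#   power  = P(reject | not H0)   -- the power of the test
#   p_null = P(H0)                -- prior probability the null is true
false_positive_prob <- function(alpha = 0.05, power = 0.80, p_null = 0.5) {
  (alpha * p_null) / (alpha * p_null + power * (1 - p_null))
}

# Sweep over a range of priors and plot, roughly what the dynamic graph shows
p_null <- seq(0.01, 0.99, by = 0.01)
plot(p_null, false_positive_prob(p_null = p_null), type = "l",
     xlab = "Prior probability the null is true",
     ylab = "P(null is true | significant result)")
```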

I would encourage you to play around with it and see how the different values affect the probability that the alternative is true. You can see that in the default case, where we place equal weight on the null being true or false and have well-powered studies, we do pretty well for ourselves. But as soon as you lower the power to a plausible 0.35, the probability of the results being spurious doubles. If you set the power back at 0.80 but set the probability of the null being true at 90%, as Ioannidis suggests, we see the probability of a false positive at \( \alpha = 0.05 \) is now roughly 35%! If you combine the low power and unlikeliness of the tested claims, the probability of false conclusions is well over 50% using a standard \( \alpha \).
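Plugged into the sketch function above, those scenarios come out roughly as quoted:

```r
false_positive_prob()                            # ~0.06: equal prior weight, well powered
false_positive_prob(power = 0.35)                # ~0.13: low power roughly doubles it
false_positive_prob(power = 0.80, p_null = 0.9)  # ~0.36: unlikely hypotheses
false_positive_prob(power = 0.35, p_null = 0.9)  # ~0.56: both, well over 50%
```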

As exciting as it would be to be known as the guy who found bigfoot, odds are that it was just some high schoolers out to play games with you. The null should be treated differently when we are reporting surprising and unexpected results. Even a simple sanity check like the one described here may reduce the surprisingly and unsustainably large number of later falsified or unreproduced findings. It certainly explains one process by which they may occur.
