My note on multiple testing

[This article was first published on One Tip Per Day, and kindly contributed to R-bloggers.]

There is no shame in writing a note on something that (probably) everyone knows, and that you thought you knew but are actually not 100% sure about. Multiple testing is one such piece of my knowledge map.

Some terms first:
– Type I error (false positive) and Type II error (false negative): 
When we do a hypothesis test, the outcome falls into one cell of the following 2×2 table:

Table of error types

                          H0 is true               H0 is false
Reject H0                 Type I error             Correct inference
                          (False Positive)         (True Positive)
Fail to reject H0         Correct inference        Type II error
                          (True Negative)          (False Negative)

A Type I error means you reject something that is true. The true thing here is the null hypothesis (H0), which usually states the default assumption (e.g. no difference, no effect). Rejecting it amounts to claiming there is a difference when there is none, i.e. a false positive. The same logic applies to a Type II error: you fail to reject a false H0, i.e. a false negative.

Also note that people use the Greek letter α for the Type I error rate and β for the Type II error rate. α is also the significance level of a test, e.g. 5%. So when we reject H0 whenever a single test gives a p-value ≤ 0.05, we accept a 5% chance of a false positive when H0 is actually true. (This is not the same as saying that 5% of the results we call significant are false; that proportion is what the false discovery rate, discussed below, describes.) β is related to the power of a test: power = 1 − β, the ability to detect true positives among all real positive cases.
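To see what α means in practice, here is a small simulation sketch. It assumes a two-sample t-test on data where H0 is true by construction (both groups drawn from the same normal distribution); the sample size of 20, the seed and the number of repetitions are arbitrary choices for illustration.

```r
# Simulate many tests where H0 is TRUE and count how often we reject at 0.05.
set.seed(42)
n_sim <- 2000

p_values <- replicate(n_sim, {
  x <- rnorm(20)  # group 1, mean 0
  y <- rnorm(20)  # group 2, also mean 0, so there is no real difference
  t.test(x, y)$p.value
})

# Fraction of true nulls that were (wrongly) rejected: the Type I error rate
type1_rate <- mean(p_values <= 0.05)
type1_rate
```

The observed rate should land close to 0.05, because under H0 the p-value is (approximately) uniformly distributed between 0 and 1.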

– Sensitivity and Specificity
With m tests in total, count the outcomes as follows:

                          H0 is true    H0 is false
Reject (R)                V             S
Fail to reject            U             T

Sensitivity = S / (S+T)  = power = 1-β
Specificity = U / (U+V) = 1-α
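As a quick worked example of the two formulas above, the sketch below plugs in made-up counts for V, S, U and T. The numbers are hypothetical and only there to show the arithmetic.

```r
# Hypothetical counts for m = 1000 tests (illustration only)
V <- 5    # true nulls rejected      (false positives)
S <- 80   # false nulls rejected     (true positives)
U <- 895  # true nulls not rejected  (true negatives)
T_count <- 20  # false nulls not rejected (false negatives); 'T' is avoided
               # as a name because T is shorthand for TRUE in R

sensitivity <- S / (S + T_count)  # power, 1 - beta
specificity <- U / (U + V)        # 1 - alpha

c(sensitivity = sensitivity, specificity = specificity)
```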

– Why does multiple testing matter?
It matters because we usually perform the same hypothesis test not just once, but many, many times. If the chance of making a Type I error in a single test is α, and the m tests are independent, then the chance of making one or more errors across the m tests is
Pr(at least one error) = 1 − (1 − α)^m
So when m is large, the chance gets close to 100%. For example, with α = 0.05 and m = 100 it is already about 99.4%. That's why we need to adjust the p-values for the number of hypothesis tests performed, in other words to control the Type I error rate.
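The formula is easy to evaluate in R. This sketch assumes independent tests and α = 0.05; the values of m are arbitrary.

```r
# Probability of at least one Type I error across m independent tests
alpha <- 0.05
m <- c(1, 10, 100, 1000)

p_any_error <- 1 - (1 - alpha)^m

# Show m next to the corresponding probability
data.frame(m = m, p_any_error = round(p_any_error, 4))
```

With a single test the probability is just α, and it climbs quickly as m grows.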

– How to control the Type I error rate in multiple testing?
There are several different error rates one can choose to control. Using V for the number of Type I errors and R for the number of rejected hypotheses among m tests:
Per-comparison error rate (PCER): the expected number of Type I errors divided by the number of hypotheses, PCER = E(V)/m
Per-family error rate (PFER): the expected number of Type I errors, PFER = E(V)
Family-wise error rate (FWER): the probability of at least one Type I error, FWER = P(V ≥ 1)
False discovery rate (FDR): the expected proportion of Type I errors among the rejected hypotheses (taken as 0 when nothing is rejected), FDR = E(V/R | R > 0) P(R > 0)
Positive false discovery rate (pFDR): the expected proportion of false discoveries given that at least one discovery is made, pFDR = E(V/R | R > 0)
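The definitions above can be made concrete with a simulation. The sketch below is only an illustration under assumed settings: 100 z-tests per experiment, of which 90 are true nulls and 10 have a mean shift of 3, each tested at an unadjusted α = 0.05. All of these numbers are made up for the example.

```r
set.seed(1)
m <- 100       # tests per experiment
m0 <- 90       # number of true null hypotheses (assumed)
alpha <- 0.05  # unadjusted per-test level
n_sim <- 2000  # number of simulated experiments

is_null <- c(rep(TRUE, m0), rep(FALSE, m - m0))
shift <- ifelse(is_null, 0, 3)  # effect size 3 for the non-null tests

V <- integer(n_sim)  # Type I errors per experiment
R <- integer(n_sim)  # rejections per experiment
for (s in seq_len(n_sim)) {
  z <- rnorm(m, mean = shift)
  p <- 2 * pnorm(-abs(z))       # two-sided p-values
  reject <- p <= alpha
  V[s] <- sum(reject & is_null)
  R[s] <- sum(reject)
}

pcer <- mean(V) / m                      # E(V)/m
pfer <- mean(V)                          # E(V)
fwer <- mean(V >= 1)                     # P(V >= 1)
fdr  <- mean(ifelse(R > 0, V / R, 0))    # E(V/R | R > 0) * P(R > 0)
pfdr <- mean((V / R)[R > 0])             # E(V/R | R > 0)

round(c(PCER = pcer, PFER = pfer, FWER = fwer, FDR = fdr, pFDR = pfdr), 3)
```

Without any adjustment, the per-comparison rate stays near α, while the family-wise rate is close to 1, which is the problem the corrections address.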

– Controlling Family-Wise Error Rate
Many procedures have been developed to control the family-wise error rate P(V ≥ 1), including Bonferroni, Holm (1979), Hochberg (1988), and Sidak. They fall into two types: single-step (e.g. Bonferroni) and sequential adjustment (e.g. Holm or Hochberg). The Bonferroni correction controls the FWER at level α whether or not the tests are independent. It rejects any hypothesis with p-value ≤ α/m. So, to do the correction, simply multiply each nominal p-value by m (capping the result at 1) to get the adjusted p-value, and compare it to α. In R, it's the following function
p.adjust(p, method = "bonferroni")
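As a sanity check on the multiply-by-m rule, this sketch compares a hand-computed Bonferroni adjustment with p.adjust. The eight p-values are made up for illustration.

```r
# Made-up p-values (illustration only)
p <- c(0.041, 0.001, 0.205, 0.008, 0.074, 0.039, 0.060, 0.042)
m <- length(p)

# Bonferroni by hand: multiply by m, cap at 1
bonf_manual <- pmin(1, p * m)
bonf_builtin <- p.adjust(p, method = "bonferroni")

all.equal(bonf_manual, bonf_builtin)  # the two agree
n_sig_bonf <- sum(bonf_builtin <= 0.05)
n_sig_bonf                            # number of tests still significant
```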
The sequential corrections are slightly more powerful than the Bonferroni test. The Holm step-down procedure is the easiest to understand. Suppose you have one thousand p-values, one per gene. First, sort them from low to high. Multiply the smallest p-value by one thousand. If that adjusted p-value is less than 0.05, then that gene shows evidence of differential expression. For this gene there is no difference from the Bonferroni test. Then for the 2nd one, multiply its p-value by 999 (not one thousand) and see if it is less than 0.05. Multiply the third smallest p-value by 998, the fourth smallest by 997, etc., and compare each of these adjusted p-values to 0.05. Finally, ensure that every adjusted p-value is at least as large as all preceding adjusted p-values: if it is not, set it equal to the largest of the preceding adjusted p-values (and cap everything at 1). This is the algorithm of the Holm step-down procedure. In R, it's
p.adjust(p, method = "holm")
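The Holm steps described above can be sketched by hand and compared with p.adjust. The p-values are made up for illustration, and the variable names are my own.

```r
# Made-up p-values (illustration only)
p <- c(0.041, 0.001, 0.205, 0.008, 0.074, 0.039, 0.060, 0.042)
m <- length(p)

o <- order(p)                       # positions of p-values from low to high
multipliers <- m - seq_len(m) + 1   # m, m-1, ..., 1
raw_adj <- p[o] * multipliers       # step-wise multiplication

# Enforce monotonicity (running maximum) and cap at 1
adj_sorted <- pmin(1, cummax(raw_adj))

# Put adjusted values back in the original order
holm_manual <- numeric(m)
holm_manual[o] <- adj_sorted

holm_builtin <- p.adjust(p, method = "holm")
all.equal(holm_manual, holm_builtin)
```

Holm's adjusted p-values are never larger than the Bonferroni ones, which is where the extra power comes from.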

– Controlling FDR
FWER is appropriate when you want to guard against ANY false positives. However, in many cases (particularly in genomics) we can live with a certain number of false positives. In these cases, the more relevant quantity to control is the false discovery rate (FDR): the expected proportion of false positives (V) among the rejected hypotheses (R). FDR control has generated a lot of interest because it offers a more balanced trade-off between error rate control and power than traditional family-wise error rate control.

Procedures controlling FDR include Benjamini & Hochberg (1995), Benjamini & Yekutieli (2001), Benjamini & Hochberg (2000), and the two-stage Benjamini & Hochberg procedure (2006).

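Here are the steps for the Benjamini & Hochberg FDR procedure:
1. Sort the nominal p-values from small to big: p1 ≤ p2 ≤ … ≤ pm
2. Find the highest rank j with pj ≤ (j/m) × δ, where δ is the FDR level to be controlled.
3. Declare the tests of rank 1, 2, …, j as significant. The adjusted p-value for rank i is pi × m/i, replaced by the smallest such value among ranks i, i+1, …, m if that is lower, so that adjusted p-values never decrease with rank (and capped at 1).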
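The three steps can be sketched by hand and checked against p.adjust(p, method = "BH"). The p-values and δ = 0.05 are made up for illustration.

```r
# Made-up p-values (illustration only)
p <- c(0.041, 0.001, 0.205, 0.008, 0.074, 0.039, 0.060, 0.042)
m <- length(p)
delta <- 0.05                 # FDR level to control

# Step 1: sort from small to big
o <- order(p)
p_sorted <- p[o]
ranks <- seq_len(m)

# Step 2: highest rank j with p_(j) <= (j / m) * delta (0 if none)
passing <- which(p_sorted <= ranks / m * delta)
j <- if (length(passing) > 0) max(passing) else 0

# Step 3: adjusted p-values, p_(i) * m / i, made monotone from the
# largest rank downwards (running minimum) and capped at 1
raw_adj <- p_sorted * m / ranks
adj_sorted <- pmin(1, rev(cummin(rev(raw_adj))))

# Put adjusted values back in the original order
bh_manual <- numeric(m)
bh_manual[o] <- adj_sorted

bh_builtin <- p.adjust(p, method = "BH")
all.equal(bh_manual, bh_builtin)

# Rejecting adjusted p-values <= delta gives the same j tests as step 2
n_sig_bh <- sum(bh_builtin <= delta)
c(j = j, n_sig_bh = n_sig_bh)
```

On these numbers BH declares two tests significant, one more than Bonferroni or Holm would at the same 0.05 level.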

