
Today’s announcement at CERN of the latest research on the Higgs boson was truly extraordinary. Not only was the scientific achievement remarkable, but the media’s reporting of 5-sigma as a measure of “certainty” was also remarkable. For instance, the science editor at the Swedish newspaper Dagens Nyheter reported that a sigma of 4.9 equals a certainty of 99.99994 %, which isn’t true, simply because p( D | H0 ) is not the same as p( H0 | D ). In plain English, a p-value is the conditional probability of getting data at least as extreme as those observed, given that the null hypothesis is true. Nothing more. It certainly doesn’t give the probability that the alternative hypothesis is true, i.e. the “certainty” that something has been found that’s not a random fluctuation.
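
To see why the two conditional probabilities differ, here is a minimal sketch using Bayes’ theorem. Every number except the 5-sigma tail probability is a made-up assumption for illustration: the prior p( H0 ) and the probability p_d_given_h1 of such a result if the signal is real are hypothetical, not CERN figures, and posterior_h0 is just a helper name introduced here.

```r
# Tail probability of a 5-sigma result under the null: p( D | H0 )
p_d_given_h0 <- pnorm(5, lower.tail = FALSE)

# Hypothetical assumption: chance of such a result if the effect is real
p_d_given_h1 <- 0.5

# Bayes' theorem: p( H0 | D ) for a given prior p( H0 )
posterior_h0 <- function(p_h0) {
  p_h1 <- 1 - p_h0
  (p_d_given_h0 * p_h0) / (p_d_given_h0 * p_h0 + p_d_given_h1 * p_h1)
}

# Same data, two hypothetical priors, two very different answers
post_even    <- posterior_h0(0.5)       # null and alternative equally likely
post_sceptic <- posterior_h0(0.999999)  # null considered almost certain
print(c(even = post_even, sceptic = post_sceptic))
```

The p-value is identical in both calls, yet p( H0 | D ) changes with the prior, which is why a p-value on its own cannot be read as a “certainty”.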

So what do physicists mean when they report 5-sigma? Well, it’s just another convention for reporting alpha levels. Sigma refers to the population standard deviation, and 5-sigma means that they accept a result as statistically significant if it falls more than 5 standard deviations away from the mean expected under the null hypothesis. Here the null hypothesis is that the result is simply due to random noise or fluctuations. You can get the p-value for 5-sigma by first taking the area under the standard normal curve that’s to the left of +5 sigma.

> pnorm(5)
[1] 0.9999997


Then take 1 – 0.9999997 to get the p-value, which is 0.0000003; only one tail is counted because the CERN researchers performed a one-tailed test. I imagine physicists say 5-sigma because saying “point zero zero zero zero zero zero three” might become quite tiresome, so it’s quite ironic that journalists all over the world seem to be converting sigma back to percent.
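
As a small sketch of the same conversion, pnorm() can return the upper tail directly via lower.tail = FALSE, which avoids the subtraction, and qnorm() maps a one-tailed p-value back to sigma. The variable names below are only illustrative.

```r
sigma <- 1:5

# One-tailed p-value for each sigma level: area to the right of +sigma
p_one_tailed <- pnorm(sigma, lower.tail = FALSE)
print(data.frame(sigma = sigma, p_one_tailed = p_one_tailed))

# Going the other way: from a one-tailed p-value back to sigma
sigma_back <- qnorm(p_one_tailed, lower.tail = FALSE)
print(sigma_back)
```

For 5 sigma this gives a p-value of roughly 2.87e-07, which rounds to the 0.0000003 above.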

If we want, we can also use R and ggplot2 to illustrate 5-sigma by plotting the normal distribution and superimposing a line at 5 sigma:

library(ggplot2)
x <- seq(-6, 6, length.out = 200)  # sigmas
y <- dnorm(x)                      # standard normal density (the curve)

df <- data.frame(sigma = x, y = y) # create data frame

# annotation text shown inside the plot
text_block <- "A confidence level = 5-sigma represents \nthe probability of getting a result from your \nexperiment, simply from random fluctuations \nalone, equal to the area under the curve \nthat’s to the right of the dotted line. That’s an \nexceptionally rare event. However, the area to \nthe left of 5-sigma does not represent the \nprobability or certainty that the Higgs boson \nhas been found."
# plot (linewidth needs ggplot2 >= 3.4.0; use size = 1 on older versions)
ggplot(df, aes(sigma, y)) +
  geom_line(linewidth = 1) +
  annotate("text", x = 1.7, y = 0.2, label = text_block, size = 4, hjust = 0) +
  annotate("segment", x = 5, xend = 5, y = 0, yend = 0.05, linetype = "dashed") +
  annotate("text", x = 5, y = 0.05, label = "5-sigma", vjust = -0.5)


The area under the curve that’s to the right of the dotted line represents the p-value for 5-sigma. We see that observations in that area are highly unlikely to occur if we assume that the null hypothesis is true.
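
As a rough check that this area really is the p-value, one can approximate it numerically. The sketch below uses a simple trapezoid rule on a fine grid from 5 to 10 sigma and compares the result with pnorm(). The cut-off at 10 and the step size are arbitrary choices made for this example; the tail beyond 10 sigma is negligible.

```r
# Fine grid over the right tail, from the dotted line outwards
step   <- 1e-4
x_tail <- seq(5, 10, by = step)
y_tail <- dnorm(x_tail)

# Trapezoid rule: average of neighbouring heights times the step width
n         <- length(y_tail)
area_tail <- sum((y_tail[-n] + y_tail[-1]) / 2) * step

# Exact upper-tail probability for comparison
p_value <- pnorm(5, lower.tail = FALSE)
print(c(area = area_tail, p_value = p_value))
```

The two numbers should agree closely, which is all that “the p-value is the area to the right of the line” means.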