
This post spawned from a discussion I had the other day. Confidence intervals are a notoriously difficult topic for those unfamiliar with statistics. I can’t really think of another statistical topic that is so widely published in newspaper articles, on television, and elsewhere, yet that so few people really understand. It has been this way since the moment Jerzy Neyman proposed the idea (in an appendix, no less) in 1937.

What the Confidence Interval is Not

There are a lot of things that the confidence interval is not. Unfortunately, many of them are often used to define the confidence interval.

• It is not the probability that the true value is in the confidence interval.
• It is not that we will obtain the true value 95% of the time.
• It is not that we are 95% sure the true value lies within the interval.
• It is not the probability that we are correct.
• It does not say anything about how accurate the current estimate is.
• It does not mean that if we calculate a 95% confidence interval, then the true value is, with certainty, contained within that one interval.

The Confidence Interval

There are several core assumptions that need to be met in order to use confidence intervals, often including random selection and independent and identically distributed (IID) data, among others. A 95% confidence interval is a statement about the procedure, not about any single computed interval: if we keep computing these confidence intervals in the same way, then in the long run 95% of those intervals will contain the true value.

The other 5%

When we have a 95% confidence interval, it means that if we repeatedly conducted the survey using the exact same procedures, then in the long run 95% of the resulting intervals would contain the actual “true value.” But that leaves a remaining 5%. Where did it go? This gets into hypothesis testing: rejecting the null hypothesis (H0) and concluding the alternative (Ha). That 5% is the probability of making a Type I error, is identified by the Greek letter $\alpha$, and is often called the significance level. In other words, the probability of rejecting the null hypothesis when the null hypothesis is in fact true is 5%.
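The link between the significance level and the critical value used to build the interval can be seen directly in R (a quick sketch; the alpha values are just illustrative):

```r
# Critical z-values for common significance levels
alpha = c(.10, .05, .01)
z = qnorm(1 - alpha/2)
round(z, 3)
# 90%, 95%, and 99% confidence correspond to z-values of
# about 1.645, 1.960, and 2.576
```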

The Population

Simply looking at the formulas used to calculate a confidence interval, we can see that it is a function of the data (the mean and variance). Unless the finite population correction (FPC) is used, it is otherwise unrelated to the population size. Whether we have a population of one hundred thousand or one hundred million, the confidence interval will be the same. With populations of that size the FPC is so minuscule that it won’t really change anything anyway.
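To see how little the population size matters, here is a small sketch of the FPC, $\sqrt{(N-n)/(N-1)}$, for a sample of 1,000 drawn from two very different population sizes (the numbers are illustrative):

```r
# Finite population correction for a fixed sample size
n = 1000
N = c(1e5, 1e8)  # one hundred thousand vs. one hundred million
fpc = sqrt((N - n) / (N - 1))
round(fpc, 5)
# both values are very close to 1, so multiplying the standard
# error by the FPC barely changes the interval
```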

The Margin of Error

A direct component of the confidence interval is the margin of error. This is the number most widely seen in the news, whether in print, on TV, or otherwise. Often, however, the confidence level is excluded from these articles. Most of the time one can assume a 95% confidence level, but the margin of error could be based on a 90% confidence level, making the margin of error smaller and giving an artificial impression of the survey’s accuracy.  The graph below shows the margin of error for a given sample size.  It is based on the conservative 50% proportion: because $p(1-p)$ is maximized at $p = .5$, any other proportion yields a smaller margin of error.  In other words, .5*.5 maximizes the margin of error, and any other combination of numbers decreases it.  Often the “magic number” for sample size seems to be in the neighborhood of 1000 respondents (with, according to Pew, a 9% response rate).
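As a rough check on that “magic number,” the margin of error at a 95% confidence level and the conservative 50% proportion can be computed directly (a sketch; the sample sizes here are illustrative):

```r
# Margin of error at the conservative 50% proportion, 95% confidence
n = c(500, 1000, 1500)
moe = sqrt(.5 * (1 - .5) / n) * qnorm(1 - .05/2)
round(moe, 3)
# near n = 1000 the margin of error lands close to the
# familiar +/- 3% reported in most polls
```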

The Other Error

The margin of error isn’t the only error.  Keep in mind that the word error should not be confused with there being a mistake in the research.  Error simply means random variation due to sampling.  So when a survey or other study indicates a margin of error of +/- 3%, that is simply the error (variation) due to random sampling.  All sorts of other types of error can work their way into the research including, but not limited to, differential response, question wording on surveys, and weather; the list could go on.  Many books have been written on this topic.

Some Examples

# Simulation parameters
alpha = .01        # significance level (99% confidence)
reps = 100000      # number of simulated samples
true.mean = 0
true.var = 1

true.prop = .25

# Each column is one sample of 100 draws; note that rnorm() takes the
# standard deviation, not the variance
raw = replicate(reps, rnorm(100, true.mean, sqrt(true.var)))

# Calculate the mean and standard error for each of the replicates
raw.mean = apply(raw, 2, mean)
raw.se = apply(raw, 2, sd)/sqrt( nrow(raw) )

# Calculate the margin of error
raw.moe = raw.se * qnorm(1-alpha/2)

# Set up upper and lower bound matrix. This format is useful for the graphs
raw.moe.mat = rbind(raw.mean+raw.moe, raw.mean-raw.moe)
row.names(raw.moe.mat) = c(alpha/2, 1-alpha/2)

# Calculate the confidence level
( raw.CI = (1-sum(
as.numeric( apply(raw.moe.mat, 2, min) > 0 | apply(raw.moe.mat, 2, max) < 0 )
)/reps)*100 )
# Try some binomial distribution data
raw.bin.mean = rbinom(reps,50, prob=true.prop)/50

raw.bin.moe = sqrt(raw.bin.mean*(1-raw.bin.mean)/50)*qnorm(1-alpha/2)
raw.bin.moe.mat = rbind(raw.bin.mean+raw.bin.moe, raw.bin.mean-raw.bin.moe)
row.names(raw.bin.moe.mat) = c(alpha/2, 1-alpha/2)

( raw.bin.CI = (1-sum(
as.numeric( apply(raw.bin.moe.mat, 2, min) > true.prop | apply(raw.bin.moe.mat, 2, max) < true.prop )
)/reps)*100 )

par(mfrow=c(1,1))
ind = 1:100
ind.odd = seq(1,100, by=2)
ind.even = seq(2,100, by=2)
matplot(rbind(ind,ind),raw.moe.mat[,1:100],type="l",lty=1,col=1,
xlab="Sample Identifier",ylab="Response Value",
main=expression(paste("Confidence Intervals with ",alpha,"=.01")),
sub=paste("Simulated Confidence Level: ",raw.CI,"%", sep="")
, xaxt='n')

axis(side=1, at=ind.odd, tcl = -1.0, lty = 1, lwd = 0.5, labels=ind.odd, cex.axis=.75)
axis(side=1, at=ind.even, tcl = -0.7, lty = 1, lwd = 0.5, labels=rep("",length(ind.even)), cex.axis=.75)
points(ind,raw.mean[1:100],pch=19, cex=.4)

abline(h=0, col="#0000FF")
size.seq = seq(0, 10000, by=500)[-1]

moe.seq = sqrt( (.5*(1-.5))/size.seq ) * qnorm(1-alpha/2)

plot(size.seq, moe.seq, xaxt='n', yaxt='n',
main='Margin of Error and Sample Size',
ylab='Margin of Error', xlab='Sample Size',
sub='Based on 50% Proportion')
lines(size.seq, moe.seq)
axis(side=1, at=size.seq, tcl = -1.0, lty = 1, lwd = 0.5, labels=size.seq, cex.axis=.75)
axis(side=2, at=seq(0, .15, by=.005), tcl = -0.7, lty = 1, lwd = 0.5, labels=seq(0, .15, by=.005), cex.axis=.75)
abline(h=seq(0, .15, by=.005), col='#CCCCCC')
abline(v=size.seq, col='#CCCCCC')

# Margin of error as a function of the proportion, with n fixed at 1000
prop.seq = seq(0, 1, by=.01)

moe.prop = sqrt( (prop.seq*(1-prop.seq))/1000 ) * qnorm(1-alpha/2)

plot(prop.seq, moe.prop, xaxt='n', yaxt='n',
main='Margin of Error and Proportion',
ylab='Margin of Error', xlab='Proportion',
sub='Based on a Sample Size of 1000')
lines(prop.seq, moe.prop)
axis(side=1, at=seq(0, 1, by=.1), tcl = -1.0, lty = 1, lwd = 0.5, labels=seq(0, 1, by=.1), cex.axis=.75)
axis(side=2, at=seq(0, .15, by=.005), tcl = -0.7, lty = 1, lwd = 0.5, labels=seq(0, .15, by=.005), cex.axis=.75)
abline(h=seq(0, .15, by=.005), col='#CCCCCC')
abline(v=.5, col="#CCCCCC")