[This article was first published on Back Side Smack » R Stuff, and kindly contributed to R-bloggers].

Talking a bit with my friend Jarrod about math stats and econometrics, we both came to the conclusion that the standard presentation for basic inference is lacking. In an intro or intermediate applied statistics course you learn about first and second moments for distributions of test statistics, then apply that knowledge to attempt to infer information about a population from a sample. For many people, the most common example of inference is the “margin of error” seen in polls for politicians or legislation. A pollster calls a few hundred people and shows their responses along with a somewhat uninformative statement about the margin of error for the poll. For applied economists and statisticians, the margin of error is simply a special case of a confidence interval. Trouble is, there doesn’t seem to be a consensus view on how to talk about confidence intervals. For example:

• With some confidence level α, the population mean will lie within the confidence intervals
• If we sample repeatedly, α% of the sample means will lie within the confidence intervals
• ρ% of the sampled points will lie within the confidence intervals

All purport to describe a confidence interval. They are each wrong in their own way, with the first probably being the least wrong. But another problem arises. We are now defining confidence intervals recursively through our definition of a confidence level. That isn’t an existential problem; the proper way to think about confidence intervals is to treat them as dependent upon the choice of confidence level. But it makes for a pedagogical headache.

Now we jump right in and add another element of inference, the null hypothesis (or more generally, hypothesis testing). Hypothesis testing compares two or more hypotheses (octopodes IMO), determines the appropriate test statistic and then compares sample data to the hypotheses. If the test statistic can offer a conclusive test (some cannot, see the Durbin-Watson test), a researcher can compare the observed value of the statistic to its distribution under the null hypothesis and see whether there is enough evidence to reject it. Again we run into the issue of recursive definition, this time not with confidence intervals but with p-values. The sample mean is a stochastic variable, so in order to make statements about where it may or may not be we need to formulate them in stochastic terms. Where the confidence level was determined by the statistician, the p-value is derived from the distribution of the test statistic and the choice of null hypothesis. This can be a bit confusing as we often encounter researcher-defined threshold p-values like 5%, 1%, and so on.
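To make the contrast between a chosen confidence level and a derived p-value concrete, here is a minimal sketch in R. The sample size, true mean, standard deviation and seed are all made-up values for illustration; nothing here comes from real data. The interval changes with each confidence level we pick, while the p-value falls out of the data and the null hypothesis of a zero mean.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical sample: 50 draws with true mean 0.3 and sd 1 (assumed values).
x <- rnorm(50, mean = 0.3, sd = 1)
n <- length(x)
se <- sd(x) / sqrt(n)  # standard error of the sample mean

# The analyst chooses the confidence level; the interval follows from it.
levels <- c(0.90, 0.95, 0.99)
t_crit <- qt(1 - (1 - levels) / 2, df = n - 1)
intervals <- data.frame(
  level = levels,
  lower = mean(x) - t_crit * se,
  upper = mean(x) + t_crit * se
)
print(intervals)

# The p-value is not chosen: it comes from the data and the null (mean = 0).
t_stat <- mean(x) / se
p_value <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)
print(p_value)
```

The interval should widen as the confidence level rises, and the hand-computed p-value should match what `t.test(x, mu = 0)` reports.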

[Figure: a normal distribution with a p-value of 2.5% marked. From Simple regression]

The above figure just shows a p-value of 2.5% picked right out of a hat, but each point along the normal distribution corresponds to a different possible p-value, and stating the value in order to work backwards is not how we should be thinking about it. Instead we imagine a null hypothesis. In our case, that the true mean of x is 0. Then we say, given that the mean is zero, what is the probability that we will see a value of 1.959 or higher? In our case that probability is about 2.5%. Is a probability of 2.5% low enough that we might accept the alternative hypothesis that the true mean is something else should we sample from x and get something like 2.0? Maybe. That’s really up to you.
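The tail probability described above can be checked directly with `pnorm()` and `qnorm()`. This sketch assumes the null distribution is a standard normal (mean 0, standard deviation 1), which is what the 1.959 cutoff implies; the other z values are arbitrary points chosen to show that every observation has its own p-value.

```r
# One-sided p-value: probability of a standard normal draw at or above 1.959
# when the null hypothesis (mean 0) is true.
p_upper <- pnorm(1.959, mean = 0, sd = 1, lower.tail = FALSE)
print(p_upper)

# Working backwards instead: the cutoff that leaves 2.5% in the upper tail.
cutoff <- qnorm(0.975)
print(cutoff)

# Each possible observation maps to a different p-value (arbitrary z values).
z <- c(0.5, 1.0, 1.5, 1.959, 2.0, 2.5)
p_values <- pnorm(z, lower.tail = FALSE)
print(round(data.frame(z = z, p = p_values), 4))
```

Whether any of those probabilities is small enough to abandon the null remains a judgment call; the code only reports them.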

So we’ve covered null hypotheses, confidence levels and confidence intervals, but we aren’t much closer to seeing how they are all connected apart from the fact that they are all taught to students at about the same time. To see how they might be related, try a simple exercise. Create fake samples for linear regression in R, varying the standard deviation passed to rnorm() for the noise term. For each of the different results, compute your own confidence intervals (you can use qnorm() for this, or qt() if your test statistic follows a t-distribution) and recover p-values from lm(). If you calibrate the fake data such that some samples yield a significant slope and some do not, you should find an interesting relationship between the range of the confidence intervals you set at a given confidence level and the point at which you are unable to reject the null hypothesis (of a 0 slope). Give it a try.
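One possible way to set up the exercise is sketched below. The sample size, intercept, true slope, noise levels and seed are all hypothetical choices, not values from the original post, and you may need to adjust them to get a mix of significant and non-significant slopes. The interval is built by hand with `qt()` and the p-value is read from `summary(lm(...))`.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

n <- 40
x <- runif(n, 0, 10)
true_slope <- 0.5                  # hypothetical slope for the fake data
noise_sds <- c(1, 5, 10, 20, 40)   # hypothetical noise standard deviations
level <- 0.95                      # confidence level chosen by the analyst

results <- do.call(rbind, lapply(noise_sds, function(s) {
  # Fake data: a linear signal plus normal noise with standard deviation s.
  y <- 2 + true_slope * x + rnorm(n, mean = 0, sd = s)
  fit <- lm(y ~ x)
  coefs <- summary(fit)$coefficients
  est <- coefs["x", "Estimate"]
  se <- coefs["x", "Std. Error"]
  # Hand-built interval for the slope using the t-distribution.
  t_crit <- qt(1 - (1 - level) / 2, df = fit$df.residual)
  data.frame(
    noise_sd = s,
    slope = est,
    lower = est - t_crit * se,
    upper = est + t_crit * se,
    p_value = coefs["x", "Pr(>|t|)"]
  )
}))

# Compare the two views of the same test.
results$ci_excludes_zero <- results$lower > 0 | results$upper < 0
results$reject_null <- results$p_value < 1 - level
print(results)
```

The last two columns should agree in every row: an interval at confidence level 1 − α excludes zero exactly when the two-sided p-value for the slope is below α. Changing `level` moves both at once.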