Does the Student-based confidence interval have any interest in practice?

[This article was first published on Freakonometrics - Tag - R-english, and kindly contributed to R-bloggers].

On Friday, in the statistics course, we started the section on confidence intervals and, as always, I got a bit confused about the degrees of freedom of the Student distribution (should it be n or n-1?) and about which empirical variance to use (the one where we divide by n or the one where we divide by n-1?).
And each time I start to get confused, the students obviously see it, and start to ask tricky questions… So let us make it clear now. The correct formula is the following: let

$$S^2=\frac{1}{n-1}\sum_{i=1}^n\left(X_i-\overline{X}\right)^2$$
then
$$\left[\overline{X}\pm t_{n-1,1-\alpha/2}\frac{S}{\sqrt{n}}\right]$$
is a confidence interval of level 1-α for the mean of a Gaussian i.i.d. sample.
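As a minimal sketch of the formula above, the interval can be computed by hand and checked against `t.test`, which is part of base R. The helper name `student_ci` and the seed are assumptions made for this illustration; they do not come from the original post.

```r
# Hypothetical helper (not from the original post): two-sided Student
# confidence interval for the mean, at level 1 - alpha.
student_ci <- function(x, alpha = 0.10) {
  n <- length(x)
  s <- sd(x)                           # sd() divides by n - 1
  q <- qt(1 - alpha / 2, df = n - 1)   # Student quantile, n - 1 degrees of freedom
  mean(x) + c(-1, 1) * q * s / sqrt(n)
}

set.seed(1)                            # arbitrary seed, for reproducibility only
x <- rnorm(20)                         # a truly Gaussian sample
ci_manual <- student_ci(x, alpha = 0.10)
ci_ttest  <- as.numeric(t.test(x, conf.level = 0.90)$conf.int)
all.equal(ci_manual, ci_ttest)         # the two computations should agree
```

Since `t.test` also uses the empirical variance with n-1 in the denominator and n-1 degrees of freedom, the two intervals should match up to rounding.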
But the important thing is neither the n-1 that appears as degrees of freedom nor the n that appears in the estimation of the standard error. As always with a mathematical result, the most important part is not mentioned here: the observations have to be i.i.d. and normally distributed. And not “almost” normally distributed…
Consider the following case: we have n=20 observations that are almost normally distributed. Here, I draw them from a Student t distribution with 3 degrees of freedom,
n=20; X=rt(n,df=3)
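An Anderson-Darling normality test does not reject the normal distribution (at the 5% level) in about 2 cases out of 3.
library(nortest)   # provides ad.test()
pv=rep(NA,10000)   # storage for the p-values
for(s in 1:10000){
X=rt(n,df=3)
pv[s]=ad.test(X)$p.value
}
mean(pv>.05)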
[1] 0.6799
With a true normal distribution it would be 95% of the cases, so in some sense, I can pretend that I generate almost normal samples.
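To see the 95% benchmark, the same experiment can be run on truly Gaussian samples. The sketch below swaps in `shapiro.test` from base R for the Anderson-Darling test, only so that it runs without an extra package; the helper name `accept_rate`, the seed and the number of replications are assumptions, and the rates it gives will differ somewhat from the Anderson-Darling ones above.

```r
set.seed(123)          # arbitrary seed
n    <- 20
nsim <- 2000           # fewer replications than in the post, to keep it quick

# Share of samples for which normality is NOT rejected at the 5% level
accept_rate <- function(rgen) {
  pv <- replicate(nsim, shapiro.test(rgen(n))$p.value)
  mean(pv > 0.05)
}

rate_normal <- accept_rate(rnorm)                       # should be close to 0.95
rate_t3     <- accept_rate(function(k) rt(k, df = 3))   # should be clearly lower
c(normal = rate_normal, student3 = rate_t3)
```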
For those samples, we can look at bounds of the 90% confidence interval for the mean, with three different formulas,
$$\left[\overline{X}\pm t_{n-1,1-\alpha/2}\frac{S}{\sqrt{n}}\right]$$
i.e. the correct one, or the one where I considered n degrees of freedom instead of n-1,
$$\left[\overline{X}\pm t_{n,1-\alpha/2}\frac{S}{\sqrt{n}}\right]$$
and the one where we considered a Gaussian quantile instead of a Student t one,
$$\left[\overline{X}\pm u_{1-\alpha/2}\frac{S}{\sqrt{n}}\right]$$
(one might also think of looking at the biased estimator of the variance).
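m=IC1=IC2=IC3=rep(NA,10000)   # storage for the means and the lower bounds
for(s in 1:10000){
X=rt(n,df=3)
m[s]=mean(X)
sd=sqrt(var(X))   # empirical standard deviation (var divides by n-1)
IC1[s]=m[s]-qt(.95,df=n-1)*sd/sqrt(n)   # Student quantile, n-1 degrees of freedom
IC2[s]=m[s]-qt(.95,df=n)*sd/sqrt(n)   # Student quantile, n degrees of freedom
IC3[s]=m[s]-qnorm(.95)*sd/sqrt(n)   # Gaussian quantile
}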
On the graph below are plotted the distributions of the values obtained as lower bounds of the 90% confidence interval,

(the curves with n and n-1 degrees of freedom in the quantiles are almost the same, here).
The dotted vertical line is the true lower bound of the 90%-confidence interval, given the true distribution (which was not a Gaussian one).
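The near-coincidence of the curves for n and n-1 degrees of freedom comes from the quantiles themselves. A quick check, with n=20 as in the simulation (the vector names are only labels chosen for this illustration):

```r
n <- 20
q <- c(student_n_minus_1 = qt(0.95, df = n - 1),
       student_n         = qt(0.95, df = n),
       gaussian          = qnorm(0.95))
round(q, 3)
# student_n_minus_1 = 1.729, student_n = 1.725, gaussian = 1.645
```

The two Student quantiles differ by less than 0.005, far less than the gap between either of them and the Gaussian quantile.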
If I go back to the standard procedure in any statistics textbook, since the sample is almost Gaussian, the lower bound of the confidence interval should be, on average (using the Student t quantile)
mean(IC1)
[1] -0.605381
instead of
mean(IC3)
[1] -0.5759391
(obtained with a Gaussian distribution instead of a Student one). Actually, both of them are quite different from the correct one which was
quantile(m,.05)
       5% 
-0.623578 
As I mentioned in a previous post (here), an important issue is that if we do not know a parameter and substitute an estimator, there is usually a cost (which means usually that the confidence interval should be larger). And this is what we observe here. From a teacher’s point of view, it is an important issue that should be mentioned in statistical courses….
But another important point is that the confidence interval is valid only if the underlying distribution is Gaussian. And not almost Gaussian, but really Gaussian. So since with n=20 observations everything might look Gaussian, I was wondering what should be done in practice… Because in some sense, using a Student-quantile-based confidence interval on an almost Gaussian sample is as wrong as using a Gaussian-quantile-based confidence interval on a Gaussian sample…
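One way to quantify how wrong each interval is would be to estimate its actual coverage by simulation. The sketch below is an illustration only: the function name `coverage`, the seed and the number of replications are assumptions. It counts how often the two-sided 90% interval contains the true mean (0 for both distributions), with Student and Gaussian quantiles, on Gaussian samples and on Student t samples with 3 degrees of freedom.

```r
set.seed(2024)         # arbitrary seed
n    <- 20
nsim <- 10000

# Empirical coverage of the two-sided 90% interval when the true mean is 0
coverage <- function(rgen, q) {
  hits <- replicate(nsim, {
    x <- rgen(n)
    abs(mean(x)) <= q * sd(x) / sqrt(n)   # TRUE when 0 lies inside the interval
  })
  mean(hits)
}

q_student  <- qt(0.95, df = n - 1)
q_gaussian <- qnorm(0.95)
r_t3 <- function(k) rt(k, df = 3)

res <- rbind(
  gaussian_sample = c(student  = coverage(rnorm, q_student),
                      gaussian = coverage(rnorm, q_gaussian)),
  t3_sample       = c(student  = coverage(r_t3, q_student),
                      gaussian = coverage(r_t3, q_gaussian))
)
res
```

On Gaussian samples the Student-based interval should come out close to the nominal 90% and the Gaussian-based one somewhat below it; the second row shows what happens when the sample is only almost Gaussian.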
