Friday, in the statistics course, we started the section on confidence intervals and, as always, I got a bit confused about the degrees of freedom of the Student distribution (should it be $n$ or $n-1$?) and about which empirical variance to use (the one where we divide by $n$, or the one with $n-1$?).
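And each time I start to get confused, the students obviously see it and start asking tricky questions... So let us make it clear now. The correct formula is the following: let $S^2=\frac{1}{n-1}\sum_{i=1}^n (X_i-\overline{X})^2$ denote the empirical variance; then
$$\left[\overline{X}-t_{n-1,1-\alpha/2}\frac{S}{\sqrt{n}},\ \overline{X}+t_{n-1,1-\alpha/2}\frac{S}{\sqrt{n}}\right]$$
is a confidence interval for the mean with level $1-\alpha$, where $t_{n-1,1-\alpha/2}$ is the quantile of the Student t distribution with $n-1$ degrees of freedom.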
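As a minimal sketch of that interval (variance divided by $n-1$, Student quantile with $n-1$ degrees of freedom), here is a small Python function. The function name `mean_confidence_interval` and the toy data are mine, purely for illustration; the code relies on `numpy` and `scipy.stats`.

```python
import numpy as np
from scipy import stats


def mean_confidence_interval(x, level=0.90):
    """Two-sided Student t interval for the mean of an i.i.d. Gaussian sample."""
    x = np.asarray(x, dtype=float)
    n = x.size
    xbar = x.mean()
    s = x.std(ddof=1)  # ddof=1: the sum of squares is divided by n - 1
    q = stats.t.ppf(1 - (1 - level) / 2, df=n - 1)  # Student quantile, n - 1 df
    half_width = q * s / np.sqrt(n)
    return xbar - half_width, xbar + half_width


# Toy data, for illustration only
lower, upper = mean_confidence_interval([1.0, 2.0, 3.0, 4.0, 5.0], level=0.90)
print(lower, upper)
```

Note that `ddof=1` is what selects the "divide by $n-1$" variance; the default of `numpy` (`ddof=0`) divides by $n$.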
But the important thing is neither the $n-1$ that appears as degrees of freedom nor the divisor used in the estimate of the standard error. As always with a mathematical result, the most important part is the one that is not mentioned here: the observations have to be i.i.d. and normally distributed. And not "almost" normally distributed...
Consider the following case: we have $n=20$ observations that are almost normally distributed. Hence, I consider a Student t distribution
For those samples, we can look at the bounds of the 90% confidence interval for the mean, with three different formulas,
(the curves with $n$ and $n-1$ degrees of freedom in the quantiles are the same here).
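To make the comparison concrete, here is a hedged sketch that computes the lower bound of the 90% interval in three ways. I am assuming the three variants are a Student quantile with $n-1$ degrees of freedom, a Student quantile with $n$ degrees of freedom, and a Gaussian quantile; the function name `lower_bounds` and the value `df=5` used to simulate an "almost Gaussian" sample are hypothetical choices of mine.

```python
import numpy as np
from scipy import stats


def lower_bounds(x, level=0.90):
    """Lower bound of the two-sided interval for the mean, with three quantiles."""
    x = np.asarray(x, dtype=float)
    n = x.size
    xbar = x.mean()
    se = x.std(ddof=1) / np.sqrt(n)  # estimated standard error of the mean
    p = 1 - (1 - level) / 2
    quantiles = {
        "student_n_minus_1": stats.t.ppf(p, df=n - 1),
        "student_n": stats.t.ppf(p, df=n),
        "gaussian": stats.norm.ppf(p),
    }
    return {name: xbar - q * se for name, q in quantiles.items()}


rng = np.random.default_rng(0)
sample = rng.standard_t(df=5, size=20)  # df=5 is a hypothetical choice
bounds = lower_bounds(sample)
for name, value in bounds.items():
    print(name, value)
```

With $n=20$, the two Student quantiles differ only in the third decimal place, which is why the corresponding curves are hard to tell apart; the Gaussian quantile is noticeably smaller, so its lower bound is always the highest of the three.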
The dotted vertical line is the true lower bound of the 90%-confidence interval, given the true distribution (which was not a Gaussian one).
If I go back to the standard procedure found in any statistics textbook, since the sample is almost Gaussian, the lower bound of the confidence interval should be (since we have a Student t distribution)
But another important point is that this confidence interval is valid only if the underlying distribution is Gaussian. And not almost Gaussian, but really Gaussian. Since with $n=20$ observations everything might look Gaussian, I was wondering what should be done in practice... Because in some sense, using a Student-quantile-based confidence interval on an almost Gaussian sample is as wrong as using a Gaussian-quantile-based confidence interval on a Gaussian sample...
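One way to get a feeling for how wrong each choice is would be to estimate the actual coverage by simulation. The sketch below is only an illustration under my own assumptions: the "almost Gaussian" sample is drawn from a Student t distribution with a hypothetical `df=5`, the true mean is 0, and the helper name `coverage` is mine.

```python
import numpy as np
from scipy import stats


def coverage(sampler, quantile, true_mean=0.0, n=20, n_sim=20000, seed=1):
    """Monte Carlo estimate of how often xbar +/- quantile * se covers true_mean."""
    rng = np.random.default_rng(seed)
    x = sampler(rng, (n_sim, n))  # one simulated sample per row
    xbar = x.mean(axis=1)
    se = x.std(axis=1, ddof=1) / np.sqrt(n)
    covered = np.abs(xbar - true_mean) <= quantile * se
    return covered.mean()


def gaussian(rng, size):
    return rng.standard_normal(size)


def almost_gaussian(rng, size):
    return rng.standard_t(5, size)  # df=5 is a hypothetical choice


n, level = 20, 0.90
p = 1 - (1 - level) / 2
t_q = stats.t.ppf(p, df=n - 1)  # Student quantile
z_q = stats.norm.ppf(p)         # Gaussian quantile

results = {
    ("gaussian", "student"): coverage(gaussian, t_q, n=n),
    ("gaussian", "normal"): coverage(gaussian, z_q, n=n),
    ("almost_gaussian", "student"): coverage(almost_gaussian, t_q, n=n),
    ("almost_gaussian", "normal"): coverage(almost_gaussian, z_q, n=n),
}
for key, value in results.items():
    print(key, round(float(value), 4))
```

Since the same seed is used for each call, the Gaussian-quantile interval is always contained in the Student-quantile one, so its estimated coverage can only be lower. How far each one ends up from the nominal 90% depends on the distribution and on the (hypothetical) degrees of freedom chosen here.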