(This article was first published on **Freakonometrics - Tag - R-english**, and kindly contributed to R-bloggers)

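Friday, in the statistics course, we started the section on *confidence intervals*, and like always, I got a bit confused about the degrees of freedom of the Student *t* distribution (should it be *n* or *n-1*?) and about which empirical variance to use (the one where we divide by *n* or the one with *n-1*?). And each time I start to get confused, the students obviously see it and start to ask tricky questions… So let us make it clear now. The *correct* formula is the following: let

$$S_n^2=\frac{1}{n-1}\sum_{i=1}^n\left(X_i-\bar{X}_n\right)^2,\qquad \bar{X}_n=\frac{1}{n}\sum_{i=1}^n X_i,$$

then

$$\left[\bar{X}_n-t_{n-1,\,1-\alpha/2}\frac{S_n}{\sqrt{n}}\,;\ \bar{X}_n+t_{n-1,\,1-\alpha/2}\frac{S_n}{\sqrt{n}}\right]$$

is a confidence interval, with level $1-\alpha$, for the mean of a Gaussian i.i.d. sample, where $t_{n-1,\,p}$ denotes the quantile of level $p$ of the Student *t* distribution with *n-1* degrees of freedom.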
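As a minimal sketch of that formula (not code from the post), the interval can be computed by hand and checked against R's built-in `t.test`. The helper name `t_interval`, the seed and the simulated Gaussian sample are assumptions made for illustration only.

```r
# Two-sided Student t confidence interval for the mean, level 1 - alpha.
# `t_interval` is a hypothetical helper name used only for this sketch.
t_interval <- function(x, alpha = 0.10) {
  n <- length(x)
  s <- sqrt(var(x))                       # var() divides by n-1
  half <- qt(1 - alpha / 2, df = n - 1) * s / sqrt(n)
  c(lower = mean(x) - half, upper = mean(x) + half)
}

set.seed(1)                               # arbitrary seed, for reproducibility
x <- rnorm(20)                            # a truly Gaussian sample of size 20
ci <- t_interval(x, alpha = 0.10)         # 90% interval computed by hand
ref <- t.test(x, conf.level = 0.90)$conf.int  # same interval from stats::t.test
print(ci)
print(ref)
```

Both computations rely on *n-1* twice: once in `var()` and once in the degrees of freedom passed to `qt()`.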

But the important thing is neither the *n-1* that appears as degrees of freedom nor the denominator that appears in the estimate of the standard error. Like always with mathematical results, the most important part of that result is not mentioned here: observations have to be i.i.d. and normally distributed. And not “*almost*” normally distributed…

Consider the following case: we have *n*=20 observations that are *almost* normally distributed. More precisely, I consider a Student *t* distribution with 3 degrees of freedom (the `rt(n,df=3)` draw in the simulation code). An Anderson-Darling normality test accepts a normal distribution in 2 cases out of 3. With a *true* normal distribution it would be 95% of the cases, so in some sense, I can pretend that I generate *almost* normal samples.
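The acceptance rate quoted above can be estimated along the following lines. This is only a sketch under assumptions: the post does not show its own code for this step, the helper name `accept_rate`, the 5% level and the number of replications are choices made here, and the Anderson-Darling test is taken from the `nortest` package when it is installed, with base R's `shapiro.test` swapped in as a stand-in otherwise.

```r
# Share of simulated samples for which a normality test does not reject.
# `accept_rate` is a hypothetical helper; `rgen` draws one sample of size n.
accept_rate <- function(rgen, nsim = 2000, n = 20, level = 0.05) {
  use_ad <- requireNamespace("nortest", quietly = TRUE)
  pvals <- replicate(nsim, {
    x <- rgen(n)
    if (use_ad) {
      nortest::ad.test(x)$p.value         # Anderson-Darling (nortest package)
    } else {
      shapiro.test(x)$p.value             # stand-in: Shapiro-Wilk (base R)
    }
  })
  mean(pvals > level)                     # proportion of non-rejections
}

set.seed(1)                               # arbitrary seed
rate_t3   <- accept_rate(function(n) rt(n, df = 3))  # "almost" normal samples
rate_norm <- accept_rate(rnorm)                      # truly normal samples
print(c(t3 = rate_t3, normal = rate_norm))
```

With truly normal samples the second rate should sit near 95% by construction of the test; the first one measures how often the Student samples pass as Gaussian.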

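For those samples, we can look at the bounds of the 90% confidence interval for the mean, with three different formulas. Writing only the lower bound, as in the code that follows, with $S_n$ the empirical standard deviation (dividing by *n-1*), $t_{\nu,\,p}$ the quantile of level $p$ of the Student *t* distribution with $\nu$ degrees of freedom, and $z_p$ the Gaussian quantile, these are the *correct* one,

$$\bar{X}_n-t_{n-1,\,0.95}\frac{S_n}{\sqrt{n}}$$

or the one where I considered *n* degrees of freedom instead of *n-1*,

$$\bar{X}_n-t_{n,\,0.95}\frac{S_n}{\sqrt{n}}$$

and the one where we considered a Gaussian quantile instead of a Student *t* one,

$$\bar{X}_n-z_{0.95}\frac{S_n}{\sqrt{n}}$$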

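    n=20                             # sample size
    ns=10000                         # number of simulated samples
    m=IC1=IC2=IC3=rep(NA,ns)         # storage for the means and lower bounds
    for(s in 1:ns){
    X=rt(n,df=3)                     # "almost" Gaussian sample
    m[s]=mean(X)
    sd=sqrt(var(X))                  # empirical standard deviation (divides by n-1)
    IC1[s]=m[s]-qt(.95,df=n-1)*sd/sqrt(n)   # Student quantile, n-1 degrees of freedom
    IC2[s]=m[s]-qt(.95,df=n)*sd/sqrt(n)     # Student quantile, n degrees of freedom
    IC3[s]=m[s]-qnorm(.95)*sd/sqrt(n)       # Gaussian quantile
    }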

On the graph below are plotted the distributions of the values obtained as lower bound of the 90% confidence interval (the curves with *n* and *n-1* degrees of freedom in the quantiles are virtually the same here). The dotted vertical line is the *true* lower bound of the 90% confidence interval, given the *true* distribution (which was not a Gaussian one).
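A graph of this kind could be drawn along the following lines. This is a sketch, not the code behind the original figure: the seed, the variable names, the colours and the use of kernel density estimates are all assumptions, and the *true* lower bound is taken to be the empirical 5% quantile of the simulated sample means.

```r
set.seed(1)                               # arbitrary seed
n <- 20
nsim <- 10000
# For each simulated sample, keep the mean and the standard error of the mean.
sims <- replicate(nsim, {
  x <- rt(n, df = 3)
  c(m = mean(x), se = sqrt(var(x)) / sqrt(n))
})
m  <- sims["m", ]
se <- sims["se", ]

lower_t <- m - qt(0.95, df = n - 1) * se  # Student quantile, n-1 degrees of freedom
lower_z <- m - qnorm(0.95) * se           # Gaussian quantile
true_lower <- quantile(m, 0.05)           # 5% quantile of the simulated means

# Hypothetical plotting helper: densities of the lower bounds, plus the
# dotted vertical line at the simulated "true" lower bound.
plot_lower_bounds <- function() {
  plot(density(lower_t), main = "", xlab = "lower bound of the 90% confidence interval")
  lines(density(lower_z), col = "red")
  abline(v = true_lower, lty = 3)
}
```

Calling `plot_lower_bounds()` in an interactive session draws the two densities; the bound with *n* degrees of freedom is left out since it would sit almost on top of the one with *n-1*.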

If I go back to the standard procedure in any statistics textbook, since the sample is almost Gaussian, the lower bound of the confidence interval should be (with a Student *t* quantile)

    mean(IC1)
    [1] -0.605381

instead of

    mean(IC3)
    [1] -0.5759391

(obtained with a Gaussian quantile instead of a Student one). Actually, both of them are quite different from the correct one, the 5% quantile of the simulated sample means, which was

    quantile(m,.05)
    5%
    -0.623578

As I mentioned in a previous post (here), an important issue is that if we do not know a parameter and substitute an estimator, there is usually a cost (which usually means that the confidence interval should be wider). And this is what we observe here. From a teacher’s point of view, it is an important issue that should be mentioned in statistics courses…

But another important point is that the confidence interval is valid *only* if the underlying distribution is Gaussian. And not *almost* Gaussian, but really a Gaussian one. So since with *n*=20 observations everything might look Gaussian, I was wondering what should be done in practice… Because in some sense, using a confidence interval based on a Student quantile on an almost Gaussian sample is as wrong as using a confidence interval based on a Gaussian quantile on a Gaussian sample…
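To put the two situations of that last sentence side by side, here is a hedged sketch that reruns the same experiment for truly Gaussian samples and for the almost Gaussian ones. The helper name `summary_bounds`, the seed and the output layout are assumptions; the quantities are the same three as in the post (mean lower bound with a Student quantile, mean lower bound with a Gaussian quantile, 5% quantile of the simulated means).

```r
# Hypothetical helper: `rgen` draws one sample of size n.
summary_bounds <- function(rgen, n = 20, nsim = 10000) {
  sims <- replicate(nsim, {
    x <- rgen(n)
    c(mean(x), sqrt(var(x)) / sqrt(n))    # sample mean and its standard error
  })
  m  <- sims[1, ]
  se <- sims[2, ]
  c(student = mean(m - qt(0.95, df = n - 1) * se),  # Student quantile, n-1 df
    gauss   = mean(m - qnorm(0.95) * se),           # Gaussian quantile
    true    = unname(quantile(m, 0.05)))            # simulated "true" lower bound
}

set.seed(1)                               # arbitrary seed
res <- rbind(
  gaussian = summary_bounds(rnorm),                      # really Gaussian samples
  almost   = summary_bounds(function(n) rt(n, df = 3))   # almost Gaussian samples
)
print(round(res, 4))
```

Each row shows how far the two textbook bounds are, on average, from the simulated one, so the gap can be compared between the Gaussian and the almost Gaussian cases.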
