
This morning, in our mathematical statistics class, we saw how to use the chi-square test. The first application was a goodness-of-fit test for a multinomial distribution. Assume that $\boldsymbol{N}=(N_1,\cdots,N_k)\sim\mathcal{M}(n,\boldsymbol{p})$. In order to test $H_0:\boldsymbol{p}=\boldsymbol{p}_0$ against $H_1:\boldsymbol{p}\neq\boldsymbol{p}_0$, use the statistic

$Q=\sum_{j=1}^k \frac{[N_j-np_{0,j}]^2}{np_{0,j}}$

Under $H_0$, $Q$ approximately follows a $\chi^2(k-1)$ distribution when $n$ is large. For instance, consider the number of weddings per season in a large city,

`> n=c(301,356,413,262)`

We want to test whether weddings are celebrated uniformly over the year, i.e. $H_0:\boldsymbol{p}=\boldsymbol{1}/4$.

```> np=rep(sum(n)/4,4)
> cbind(n,np)
       n  np
[1,] 301 333
[2,] 356 333
[3,] 413 333
[4,] 262 333
> Q=sum( (n-np)^2/np  )
> Q
[1] 39.02102```

This quantity should be compared with the 95% quantile of the chi-square distribution with $k-1=3$ degrees of freedom,

```> qchisq(.95,df=4-1)
[1] 7.814728```

but it is also possible to compute the p-value,

```> 1-pchisq(Q,df=4-1)
[1] 1.717959e-08```

Since $Q\approx 39.02$ is far above the critical value $7.81$, we reject the assumption that weddings are celebrated uniformly over the year.
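As a cross-check, the same test can be run with R's built-in `chisq.test` function, passing the hypothesized probabilities through its `p` argument. This is only a sketch that repeats the computation above; the object name `fit` is an arbitrary choice, not something from the class.

```r
# Observed number of weddings per season (same data as above)
n <- c(301, 356, 413, 262)

# Built-in goodness-of-fit test against uniform probabilities
fit <- chisq.test(n, p = rep(1/4, 4))

fit$statistic   # Pearson statistic Q
fit$parameter   # degrees of freedom, k - 1
fit$p.value     # upper-tail probability

# The p-value can also be computed directly with lower.tail = FALSE,
# which is equivalent to 1 - pchisq(...) here
pchisq(unname(fit$statistic), df = 3, lower.tail = FALSE)
```

The statistic and the degrees of freedom returned in `fit` should agree with the values of `Q` and `4-1` obtained by hand.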

A second application is a goodness-of-fit test for a parametric distribution. Assume that $X\sim F_{\boldsymbol{\theta}}$ takes values in a discrete set, say $\{1,2,\cdots,k\}$, with $\mathbb{P}[X=j]=p_j(\boldsymbol{\theta})$. In order to test $H_0:X\sim F_{\boldsymbol{\theta}_0}$ against $H_1:X\not\sim F_{\boldsymbol{\theta}_0}$, use

$Q=\sum_{j=1}^k \frac{[N_j-np_{j}(\boldsymbol{\theta}_0)]^2}{np_{j}(\boldsymbol{\theta}_0)}$

One can prove that, when $\boldsymbol{\theta}_0$ is estimated from the data, $Q$ approximately follows a $\chi^2(k-1-\text{dim}(\boldsymbol{\theta}))$ distribution under $H_0$. For instance, consider the popular example of von Bortkiewicz’s horse-kick data.

```> n=c(109,65,22,3,1)
> sum(n*0:4)/sum(n)
[1] 0.61```

The sample mean is 0.61, which suggests a Poisson parameter close to 0.6. First, we should merge the cells 3 and 4+, in order to have enough observations in each cell of the table

`> n_correc=c(109,65,22,4)`

Now we can try a Poisson distribution, $\mathcal{P}(0.6)$,

```> np=200*c(dpois(0:2,lambda=.6),
+    1-ppois(2,lambda=.6))
> n_correc=c(109,65,22,4)
> cbind(n_correc,np)
     n_correc         np
[1,]      109 109.762327
[2,]       65  65.857396
[3,]       22  19.757219
[4,]        4   4.623058
> Q=sum( (n_correc-np)^2/np  )
> Q
[1] 0.3550214```

The 95% quantile of the chi-square distribution, with $4-1-1=2$ degrees of freedom since one parameter was estimated, is

```> qchisq(.95,df=4-1-1)
[1] 5.991465```

and the p-value is large, so we do not reject the Poisson assumption,

```> 1-pchisq(Q,df=4-1-1)
[1] 0.837352```
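The steps of this second test can be wrapped in a small function. The sketch below is only illustrative: `gof_chisq` and its argument `n_est` (the number of estimated parameters) are hypothetical names introduced here, not part of base R or of the original computation. Note that `chisq.test` does not reduce the degrees of freedom for estimated parameters, which is why the adjustment is done by hand.

```r
# Hypothetical helper (name and arguments are assumptions for illustration):
# Pearson goodness-of-fit test with n_est parameters estimated from the data.
gof_chisq <- function(observed, expected, n_est = 0) {
  stopifnot(length(observed) == length(expected))
  Q  <- sum((observed - expected)^2 / expected)   # Pearson statistic
  df <- length(observed) - 1 - n_est              # k - 1 - dim(theta)
  list(statistic = Q,
       df        = df,
       p.value   = pchisq(Q, df = df, lower.tail = FALSE))
}

# Horse-kick data with the cells 3 and 4+ merged, as above
n_correc <- c(109, 65, 22, 4)
lambda   <- 0.6

# Expected counts under a Poisson(0.6) distribution; the last cell is P[X >= 3]
np <- sum(n_correc) * c(dpois(0:2, lambda),
                        ppois(2, lambda, lower.tail = FALSE))

res <- gof_chisq(n_correc, np, n_est = 1)
res
```

With `n_est = 1`, the function reproduces the statistic, the two degrees of freedom, and the p-value computed step by step above.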

Finally, it is possible to use the chi-square test in order to test for independence. Consider two categorical variables $X$ and $Y$, e.g. hair color and eye color, and summarize the information in a contingency table

```> n=HairEyeColor[,,1]+HairEyeColor[,,2]
> n
       Eye
Hair    Brown Blue Hazel Green
  Black    68   20    15     5
  Brown   119   84    54    29
  Red      26   17    14    14
  Blond     7   94    10    16```

In that case, use

$Q=\sum_{i,j}\frac{[N_{i,j}-N_{i,j}^\perp]^2}{N_{i,j}^\perp}$

with

$N_{i,j}^\perp=\frac{N_{i,\cdot} N_{\cdot, j}}{n}$

where $N_{i,\cdot}$ and $N_{\cdot,j}$ denote respectively the number of observations per row and per column, and $n$ is the total number of observations.

```> ni=apply(n,1,sum)         # sum per row [hair]
> nj=apply(n,2,sum)         # sum per column [eye]
> n_ind= ni %*% t(nj)/sum(n)
> rownames(n_ind)=rownames(n)
> n_ind
          Brown      Blue    Hazel     Green
Black  40.13514  39.22297 16.96622 11.675676
Brown 106.28378 103.86824 44.92905 30.918919
Red    26.38514  25.78547 11.15372  7.675676
Blond  47.19595  46.12331 19.95101 13.729730```

Under $H_0:X\perp Y$,

$Q\sim\chi^2((I-1)(J-1))$

```> Q= sum( (n-n_ind)^2/n_ind )
> Q
[1] 138.2898```

Here the 95% quantile is

```> qchisq(.95,df=(4-1)*(4-1))
[1] 16.91898```

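and the p-value is far below 5%, so we reject independence; it is so small that the subtraction from 1 is rounded to 0 in floating-point arithmetic,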

```> 1-pchisq(Q,df=(4-1)*(4-1))
[1] 0```
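The independence test can also be cross-checked with the built-in `chisq.test` function, which accepts a contingency table directly. This is a sketch rather than the computation done in class; the object name `indep` is an arbitrary choice. Using `lower.tail = FALSE` in `pchisq` avoids the subtraction from 1 and returns a tiny but nonzero p-value.

```r
# Contingency table of hair color by eye color, summed over the two sexes
n <- HairEyeColor[, , 1] + HairEyeColor[, , 2]

# Expected counts under independence, computed from the margins
n_ind <- outer(rowSums(n), colSums(n)) / sum(n)

# Built-in Pearson test of independence
# (the continuity correction only applies to 2 x 2 tables)
indep <- chisq.test(n)

indep$statistic   # Pearson statistic Q
indep$parameter   # degrees of freedom, (I - 1) * (J - 1)
indep$expected    # should match n_ind

# Upper-tail probability computed directly, without 1 - pchisq(...)
pchisq(unname(indep$statistic), df = 9, lower.tail = FALSE)
```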