(This article was first published on **Freakonometrics » R-english**, and kindly contributed to R-bloggers)

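This morning, in our mathematical statistics class, we saw how to use the chi-square test. The first application is a goodness-of-fit test for a multinomial distribution. Assume that $\boldsymbol{N}=(N_1,\cdots,N_k)\sim\mathcal{M}(n,\boldsymbol{p})$. In order to test $H_0:\boldsymbol{p}=\boldsymbol{p}_0$ against $H_1:\boldsymbol{p}\neq\boldsymbol{p}_0$, use the statistic

$$Q=\sum_{i=1}^k\frac{(N_i-np_{0,i})^2}{np_{0,i}}$$

Under $H_0$, $Q\sim\chi^2(k-1)$ (asymptotically). For instance, we have the number of weddings, in a large city, per season,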

> n=c(301,356,413,262)

We want to test if weddings are celebrated uniformly over the year, i.e. $H_0:\boldsymbol{p}=(1/4,1/4,1/4,1/4)$.

> np=rep(sum(n)/4,4)
> cbind(n,np)
       n  np
[1,] 301 333
[2,] 356 333
[3,] 413 333
[4,] 262 333
> Q=sum( (n-np)^2/np )
> Q
[1] 39.02102

This quantity should be compared with the quantile of the chi-square distribution

> qchisq(.95,df=4-1)
[1] 7.814728

but it is also possible to compute the *p*-value,

> 1-pchisq(Q,df=4-1)
[1] 1.717959e-08

Here, we reject the assumption that weddings are celebrated uniformly over the year.
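As a cross-check, the same test can be run with R's built-in `chisq.test()` function. The sketch below is only an illustration and not part of the computation above; the object name `fit` is ours.

```r
# Number of weddings per season (same counts as above)
n <- c(301, 356, 413, 262)

# Goodness-of-fit test against the uniform distribution over the four seasons
fit <- chisq.test(n, p = rep(1/4, 4))

fit$statistic  # chi-square statistic Q
fit$parameter  # degrees of freedom, here 4 - 1 = 3
fit$p.value    # p-value
fit$expected   # expected counts under H0, here 333 in each season
```

The statistic and the *p*-value should agree with the values of `Q` and `1-pchisq(Q,df=4-1)` obtained by hand.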

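A second application is a goodness-of-fit test for some parametric distribution. Assume that $X$ takes discrete values, say $\{x_1,\cdots,x_k\}$. Here $p_i(\theta)=\mathbb{P}_\theta(X=x_i)$, and $N_i$ is the number of observations equal to $x_i$ in a sample of size $n$. In order to test $H_0:X\sim F_\theta$ for some $\theta$ against $H_1$: the distribution of $X$ does not belong to this family, use

$$Q=\sum_{i=1}^k\frac{(N_i-np_i(\widehat\theta))^2}{np_i(\widehat\theta)}$$

where $\widehat\theta$ is an estimate of $\theta$. One can prove that under $H_0$, $Q\sim\chi^2(k-1-d)$, where $d$ is the dimension of $\theta$. For instance, consider the popular example of von Bortkiewicz’s horsekicks data.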

> n=c(109,65,22,3,1)
> sum(n*0:4)/sum(n)
[1] 0.61

The first thing to do is to merge the cells 3 and 4+, in order to have enough observations in each cell of the table

> n_correc=c(109,65,22,4)

Now we can try a Poisson distribution (here with $\lambda=0.6$, close to the empirical mean computed above)

> np=200*c(dpois(0:2,lambda=.6),
+ 1-ppois(2,lambda=.6))
> n_correc=c(109,65,22,4)
> cbind(n_correc,np)
     n_correc         np
[1,]      109 109.762327
[2,]       65  65.857396
[3,]       22  19.757219
[4,]        4   4.623058
> Q=sum( (n_correc-np)^2/np )
> Q
[1] 0.3550214

The quantile of the chi-square distribution is

> qchisq(.95,df=4-1-1)
[1] 5.991465

and the *p*-value is

> 1-pchisq(Q,df=4-1-1)
[1] 0.837352
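The same computation can be sketched with `chisq.test()`, with one caveat: when probabilities are passed through `p`, the function uses $k-1$ degrees of freedom and knows nothing about the estimated parameter, so the *p*-value has to be recomputed by hand with one degree of freedom less. The names `lambda`, `p0`, `fit`, `df` and `p_value` below are ours.

```r
# Horsekick counts, with the cells 3 and 4+ merged (as above)
n_correc <- c(109, 65, 22, 4)

# Poisson parameter used in the text (the empirical mean is 0.61)
lambda <- 0.6

# Cell probabilities under the Poisson model: P(X=0), P(X=1), P(X=2), P(X>=3)
p0 <- c(dpois(0:2, lambda), ppois(2, lambda, lower.tail = FALSE))

# chisq.test() warns here because one expected count is below 5;
# we only use it to get the statistic, so the warning is silenced
fit <- suppressWarnings(chisq.test(n_correc, p = p0))
Q <- unname(fit$statistic)

# Degrees of freedom: 4 cells - 1 - 1 estimated parameter
df <- length(n_correc) - 1 - 1
p_value <- pchisq(Q, df = df, lower.tail = FALSE)
p_value
```

With a *p*-value around 0.84, the Poisson assumption is not rejected at the 5% level.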

Finally, it is possible to use the chi-square test in order to test for independence. Consider here two categorical variables $X$ and $Y$, e.g. the color of the hair and the color of the eyes, and summarize the information in a contingency table

> n=HairEyeColor[,,1]+HairEyeColor[,,2]
> n
       Eye
Hair    Brown Blue Hazel Green
  Black    68   20    15     5
  Brown   119   84    54    29
  Red      26   17    14    14
  Blond     7   94    10    16

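In that case, use

$$Q=\sum_{i,j}\frac{(N_{i,j}-N_{i,j}^{\perp})^2}{N_{i,j}^{\perp}}$$

with

$$N_{i,j}^{\perp}=\frac{N_{i,\cdot}N_{\cdot,j}}{n}$$

where $N_{i,\cdot}$ and $N_{\cdot,j}$ denote respectively the number of observations per row and per column, and $n$ is the total number of observations.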

> ni=apply(n,1,sum) # sum per row [hair]
> nj=apply(n,2,sum) # sum per column [eye]
> n_ind= ni %*% t(nj)/sum(n)
> rownames(n_ind)=rownames(n)
> n_ind
          Brown      Blue    Hazel     Green
Black  40.13514  39.22297 16.96622 11.675676
Brown 106.28378 103.86824 44.92905 30.918919
Red    26.38514  25.78547 11.15372  7.675676
Blond  47.19595  46.12331 19.95101 13.729730

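Under $H_0$ (the two variables are independent), $Q\sim\chi^2((I-1)(J-1))$, where $I$ and $J$ denote the number of rows and columns of the table (here $I=J=4$),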

> Q= sum( (n-n_ind)^2/n_ind )
> Q
[1] 138.2898

The quantile is here

> qchisq(.95,df=(4-1)*(4-1))
[1] 16.91898

and the *p*-value is far below 5% (so small that it is rounded to 0 here),

> 1-pchisq(Q,df=(4-1)*(4-1))
[1] 0
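The value 0 is a floating-point artifact: the true *p*-value is positive, but `pchisq()` returns a number so close to 1 that the subtraction cancels out. A minimal sketch of a workaround, assuming we are happy to rely on `chisq.test()` and on the `lower.tail` argument of `pchisq()` (the names `fit` and `p_value` are ours):

```r
# Contingency table hair x eye, summed over the two sexes (as above)
n <- HairEyeColor[, , 1] + HairEyeColor[, , 2]

# Chi-square test of independence on the 4 x 4 table
fit <- chisq.test(n)
fit$statistic  # chi-square statistic Q
fit$parameter  # degrees of freedom, (4 - 1) * (4 - 1) = 9
fit$expected   # expected counts under independence

# Upper tail computed directly, which avoids the cancellation in 1 - pchisq(...)
p_value <- pchisq(unname(fit$statistic), df = 9, lower.tail = FALSE)
p_value
```

The statistic and the expected counts should match `Q` and `n_ind` computed by hand, and the *p*-value is now a tiny but strictly positive number, so the conclusion is unchanged: independence between hair color and eye color is rejected.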
