Applications of Chi-Square Tests

November 3, 2015

(This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers)

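This morning, in our mathematical statistics class, we’ve seen the use of the chi-square test. The first application was the goodness of fit of a multinomial distribution. Assume that $\boldsymbol{N}=(N_1,\dots,N_k)\sim\mathcal{M}(n,\boldsymbol{p})$. In order to test $H_0:\boldsymbol{p}=\boldsymbol{p}_0$ against $H_1:\boldsymbol{p}\neq\boldsymbol{p}_0$, use the statistic

$$Q=\sum_{i=1}^k\frac{(N_i-np_{0,i})^2}{np_{0,i}}$$

Under $H_0$, $Q\sim\chi^2(k-1)$ (asymptotically). For instance, we have the number of weddings, in a large city, per season,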

> n=c(301,356,413,262)

We want to test if weddings are celebrated uniformly over the year, i.e. $H_0:\boldsymbol{p}=\boldsymbol{p}_0=(1/4,1/4,1/4,1/4)$.

> np=rep(sum(n)/4,4)
> cbind(n,np)
       n  np
[1,] 301 333
[2,] 356 333
[3,] 413 333
[4,] 262 333
> Q=sum( (n-np)^2/np  )
> Q
[1] 39.02102

This quantity should be compared with the 95% quantile of the chi-square distribution with $4-1=3$ degrees of freedom

> qchisq(.95,df=4-1)
[1] 7.814728

but it is also possible to compute the p-value,

> 1-pchisq(Q,df=4-1)
[1] 1.717959e-08

Here, we reject the assumption that weddings are celebrated uniformly over the year.
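As a cross-check, the same test can be run with chisq.test() from base R's stats package. The sketch below assumes the season counts given above; the object name fit is only illustrative.

```r
# Observed number of weddings per season (from the example above)
n <- c(301, 356, 413, 262)

# Goodness-of-fit test against the uniform distribution over the 4 seasons
fit <- chisq.test(n, p = rep(1/4, 4))

fit$statistic   # Pearson statistic, the quantity Q computed by hand above
fit$parameter   # degrees of freedom, 4 - 1 = 3
fit$p.value     # upper tail probability of the chi-square distribution
fit$expected    # expected counts under H0, i.e. the vector np
```

The components statistic, parameter and p.value should agree with the values obtained step by step above.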

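A second application is a goodness-of-fit test, for some parametric distribution. Assume that $X$ takes discrete values, say in $\{x_1,\dots,x_k\}$. Here $N_i$ denotes the number of observations equal to $x_i$, out of $n$, and $p_{\theta,i}=\mathbb{P}_\theta(X=x_i)$ the probability of cell $i$ under the parametric model. In order to test $H_0:\boldsymbol{p}=\boldsymbol{p}_\theta$ for some $\theta$ against $H_1:\boldsymbol{p}\neq\boldsymbol{p}_\theta$ for all $\theta$, use

$$Q=\sum_{i=1}^k\frac{(N_i-np_{\widehat{\theta},i})^2}{np_{\widehat{\theta},i}}$$

One can prove that $Q\sim\chi^2(k-1-d)$ (asymptotically) under $H_0$, where $d$ is the dimension of $\theta$. For instance, consider the popular example of von Bortkiewicz’s horsekicks data.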

> n=c(109,65,22,3,1)
> sum(n*0:4)/sum(n) 
[1] 0.61

The first thing is that we should merge the last two cells (3 and 4+) into a single one, in order to have enough observations in each cell of the table

> n_correc=c(109,65,22,4)

Now we can try a Poisson distribution $\mathcal{P}(\lambda)$, with $\lambda=0.6$ (the sample mean computed above, 0.61, rounded)

> np=200*c(dpois(0:2,lambda=.6),
+    1-ppois(2,lambda=.6))
> n_correc=c(109,65,22,4)
> cbind(n_correc,np)
     n_correc         np
[1,]      109 109.762327
[2,]       65  65.857396
[3,]       22  19.757219
[4,]        4   4.623058
> Q=sum( (n_correc-np)^2/np  )
> Q
[1] 0.3550214

The 95% quantile of the chi-square distribution, with $4-1-1=2$ degrees of freedom since one parameter was estimated, is

> qchisq(.95,df=4-1-1)
[1] 5.991465

and the p-value is

> 1-pchisq(Q,df=4-1-1)
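[1] 0.837352

Here, the statistic is below the quantile and the p-value is well above 5%, so we do not reject the assumption that the data are Poisson distributed.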
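The steps of this second test can be wrapped in a small helper. The function below, gof_poisson, is a hypothetical name (it is not part of any package), and the sketch assumes that the last cell collects all values above the largest one listed and that a single parameter, lambda, was estimated.

```r
# Hypothetical helper: chi-square goodness-of-fit test for a Poisson model.
# counts : observed counts for the values 0, 1, ..., k-2 and for "k-1 or more"
# lambda : Poisson parameter (assumed to have been estimated from the data)
gof_poisson <- function(counts, lambda) {
  k <- length(counts)
  # Cell probabilities: P(X=0), ..., P(X=k-2), then P(X>=k-1)
  p <- c(dpois(0:(k - 2), lambda), ppois(k - 2, lambda, lower.tail = FALSE))
  expected <- sum(counts) * p
  Q <- sum((counts - expected)^2 / expected)
  df <- k - 1 - 1   # one degree of freedom lost for the estimated lambda
  list(statistic = Q, df = df,
       p.value = pchisq(Q, df = df, lower.tail = FALSE),
       expected = expected)
}

# Horsekicks data, with the cells 3 and 4 merged, and lambda = 0.6 as above
res <- gof_poisson(c(109, 65, 22, 4), lambda = 0.6)
res$statistic
res$p.value
```

Calling the helper with lambda = 0.61, the unrounded sample mean, is a quick way to see how sensitive the statistic is to the rounding.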

Finally, it is possible to use the chi-square test in order to test for independence. Consider here two categorical variables $X$ and $Y$, e.g. the color of the hair and the color of the eyes, and summarize the information in a contingency table

> n=HairEyeColor[,,1]+HairEyeColor[,,2]
> n
       Eye
Hair    Brown Blue Hazel Green
  Black    68   20    15     5
  Brown   119   84    54    29
  Red      26   17    14    14
  Blond     7   94    10    16

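In that case, use

$$Q=\sum_{i,j}\frac{(N_{i,j}-N_{i,j}^{\perp})^2}{N_{i,j}^{\perp}}$$

with

$$N_{i,j}^{\perp}=\frac{N_{i,\cdot}N_{\cdot,j}}{n}$$

where $N_{i,\cdot}$ and $N_{\cdot,j}$ denote respectively the number of observations per row and per column, and $n$ the total number of observations.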

> ni=apply(n,1,sum)         # sum per row [hair]
> nj=apply(n,2,sum)         # sum per column [eye]
> n_ind= ni %*% t(nj)/sum(n)
> rownames(n_ind)=rownames(n)
> n_ind
          Brown      Blue    Hazel     Green
Black  40.13514  39.22297 16.96622 11.675676
Brown 106.28378 103.86824 44.92905 30.918919
Red    26.38514  25.78547 11.15372  7.675676
Blond  47.19595  46.12331 19.95101 13.729730
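The matrix product ni %*% t(nj) used above is the outer product of the two margins. An equivalent sketch, assuming the HairEyeColor dataset shipped with R, relies on rowSums(), colSums() and outer(), which carry both the row and the column names over to the result:

```r
# Hair x Eye contingency table, summing over the third dimension (Sex)
n <- apply(HairEyeColor, c(1, 2), sum)

ni <- rowSums(n)   # totals per hair color
nj <- colSums(n)   # totals per eye color

# Expected counts under independence: (row total) * (column total) / n
n_ind <- outer(ni, nj) / sum(n)
round(n_ind, 2)
```

Because outer() uses the names of both vectors, the extra call to rownames() is no longer needed.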

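Under $H_0$ (independence), $Q\sim\chi^2((I-1)(J-1))$, where $I$ and $J$ denote the number of rows and columns of the table (here 4 and 4),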

> Q= sum( (n-n_ind)^2/n_ind )
> Q
[1] 138.2898

The quantile is here

> qchisq(.95,df=(4-1)*(4-1))
[1] 16.91898

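and the p-value is far below 5% (it is smaller than machine precision, hence the 0 printed below), so we reject the independence between hair color and eye color,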

> 1-pchisq(Q,df=(4-1)*(4-1))
[1] 0
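The 0 printed above is a rounding artefact: pchisq(Q, df) is so close to 1 that the subtraction loses the p-value entirely. A sketch of two alternatives, assuming the same hair and eye color table: ask pchisq() for the upper tail directly, or let chisq.test() run the whole test.

```r
# Rebuild the contingency table and the expected counts under independence
n <- apply(HairEyeColor, c(1, 2), sum)
n_ind <- outer(rowSums(n), colSums(n)) / sum(n)
Q <- sum((n - n_ind)^2 / n_ind)

# Upper tail probability computed directly: tiny, but not exactly zero
p_value <- pchisq(Q, df = (4 - 1) * (4 - 1), lower.tail = FALSE)
p_value

# Built-in Pearson test of independence on the contingency table
test <- chisq.test(n)
test$statistic   # same statistic Q
test$parameter   # (4-1)*(4-1) = 9 degrees of freedom
test$p.value
```

The argument lower.tail = FALSE is generally preferable to 1 - pchisq() whenever the p-value is expected to be very small.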
