Chi-square tests

[This article was first published on Statistics & R, and kindly contributed to R-bloggers].

The chi-square test is a statistical method for analysing categorical data. There are two types:

1. Goodness of fit

Known as:
* Chi-square for 1 sample
* Chi-square for given proportions

Purpose: Compare the observed frequencies with the frequencies expected under a theoretical distribution specified in the null hypothesis.

2. Goodness of association

Known as:
* Chi-square test for Independence
* Chi-square test for Homogeneity

Purpose: Determine whether there is an association between the categories of two variables.

General formula for both types:

\[ X^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}} \]
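As a minimal sketch, the general formula can be written as a small R helper. The function name `chi_square_stat` is hypothetical (it is not part of base R), and the counts passed to it below are toy values chosen only for illustration.

```r
# Hypothetical helper (not part of base R): Pearson's chi-square statistic.
# `observed` and `expected` are assumed to be numeric vectors or matrices
# of the same size, with all expected counts greater than zero.
chi_square_stat <- function(observed, expected) {
  stopifnot(length(observed) == length(expected), all(expected > 0))
  sum((observed - expected)^2 / expected)
}

# Toy counts, for illustration only
chi_square_stat(observed = c(10, 20, 30), expected = c(20, 20, 20))
```

The same helper applies to both types of test; only the way the expected counts are obtained differs.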

Example 1:

  1. Goodness of fit:

Assume that we counted 132 planted trees in a certain forest. Forest officials claim that the ratio of Orange, Cedar and Banana trees is 1:2:1 (1+2+1 = 4). This means that the expected proportions are:

1/4 for Orange trees
2/4 (= 1/2) for Cedar trees
1/4 for Banana trees

The counts were as follows: 52 Orange trees, 60 Cedar trees and 20 Banana trees.

Trees_observed <- c(52, 60, 20)
theoretical_proportion <- c(1/4, 1/2, 1/4)
test <- chisq.test(Trees_observed, p=theoretical_proportion)
test
##
## Chi-squared test for given probabilities
##
## data: Trees_observed
## X-squared = 16.606, df = 2, p-value = 0.0002478

Since the p-value is below 0.05, we reject the null hypothesis that the trees follow the ratio 1:2:1 (1/2 for Cedar and 1/4 for each of the other two types). The forest therefore does not follow the theoretical ratio stated by the officials. In other words, the observed proportions are significantly different from the expected proportions.

If the claimed ratio were true, the expected counts of Orange, Cedar and Banana trees would be, respectively:

test$expected
## [1] 33 66 33

By hand, the expected counts are calculated as follows:

Total_trees_counted <- sum(Trees_observed)
Total_trees_counted
## [1] 132
1/4*Total_trees_counted # Orange Tree
## [1] 33
1/2*Total_trees_counted # Cedar Tree
## [1] 66
1/4*Total_trees_counted # Banana Tree
## [1] 33

Chi-square Equation:

\[ X^2=\frac{(52-33)^2}{33}+ \frac{(60-66)^2}{66}+ \frac{(20-33)^2}{33}=16.606 \]


To get the p-value for the calculated chi-square statistic of 16.606 with 2 degrees of freedom (df = n - 1, where n = 3 is the number of tree types), we can apply 1-pchisq().

pval = 1-pchisq(16.606,df=2)
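The whole by-hand calculation for Example 1 can also be reproduced in a few lines of R. This is only a sketch of the steps above: the variable names `expected`, `chi_sq` and `df` are introduced here for illustration, and `lower.tail = FALSE` is used as an equivalent of `1 - pchisq()`.

```r
# Observed counts and claimed proportions from Example 1
Trees_observed <- c(52, 60, 20)
theoretical_proportion <- c(1/4, 1/2, 1/4)

# Expected counts under the null hypothesis: total count times each proportion
expected <- sum(Trees_observed) * theoretical_proportion

# Pearson's chi-square statistic: sum of (O - E)^2 / E over the categories
chi_sq <- sum((Trees_observed - expected)^2 / expected)

# Degrees of freedom: number of categories minus one
df <- length(Trees_observed) - 1

# Upper-tail probability, equivalent to 1 - pchisq(chi_sq, df)
pval <- pchisq(chi_sq, df = df, lower.tail = FALSE)

chi_sq
pval
```

Both values should agree with the statistic and p-value reported by `chisq.test()` for the same data.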

Example 2:

  2. Goodness of association

Assume we took 10 Cedar plants and 10 Orange plants and planted them in a forest in Germany. After a year, we visited the planting site and counted the trees that had survived and the ones that had died.

##  Trees Survived Dead Total_trees
##  Cedar        7    3          10
## Orange        4    6          10
##  Total       11    9          20

Now we want to know whether there is an association between survival and tree type on the new land. If there is no association, the observed counts should be equal, or at least close, to the expected counts.
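The object `dat` used in the test is not defined in this post. One plausible way to build it, assuming it is a 2 x 2 matrix that holds only the Survived and Dead counts, is sketched below. The dimension label `Outcome` is an assumption made for this sketch. The Total row and column are deliberately left out, because `chisq.test()` would otherwise treat them as data.

```r
# Assumed construction of `dat`: counts only, without the totals
dat <- matrix(c(7, 3,
                4, 6),
              nrow = 2, byrow = TRUE,
              dimnames = list(Trees = c("Cedar", "Orange"),
                              Outcome = c("Survived", "Dead")))
dat

# Row and column totals can be added for display only
addmargins(dat)
```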

test <- chisq.test(dat, correct = FALSE)
test
##
## Pearson's Chi-squared test
##
## data: dat
## X-squared = 1.8182, df = 1, p-value = 0.1775

Since the p-value is above 0.05, we fail to reject the null hypothesis that there is no association between tree type and survival on the new land.

By hand:
To calculate expected counts:

\[ \text{Expected} = \frac{\text{column total} \times \text{row total}}{\text{total observations}} \]

dat2
##  Trees         Survived            Dead Total_trees
##  Cedar (11*10)/20 = 5.5 (9*10)/20 = 4.5          10
## Orange (11*10)/20 = 5.5 (9*10)/20 = 4.5          10
##  Total               11               9          20

Chi-square Equation:

\[ X^2=\frac{(7-5.5)^2}{5.5}+ \frac{(4-5.5)^2}{5.5}+ \frac{(3-4.5)^2}{4.5} + \frac{(6-4.5)^2}{4.5}=1.8182 \]

As in the previous example, to get the p-value for the calculated chi-square statistic of 1.8182 with 1 degree of freedom (df = (rows - 1) x (columns - 1) = 1), we can apply 1-pchisq().

pval = 1-pchisq(1.8182,df=1)
pval
## [1] 0.1775277
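The by-hand steps for Example 2 can be sketched in R as well. The names `observed`, `expected`, `chi_sq` and `df` are introduced here for illustration. `outer()` multiplies every row total by every column total, which gives the expected-count formula for all four cells at once.

```r
# Observed counts (rows: Cedar, Orange; columns: Survived, Dead)
observed <- matrix(c(7, 3,
                     4, 6),
                   nrow = 2, byrow = TRUE)

# Expected count per cell: (row total * column total) / total observations
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson's chi-square statistic without continuity correction
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom for a contingency table: (rows - 1) * (columns - 1)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability, equivalent to 1 - pchisq(chi_sq, df)
pval <- pchisq(chi_sq, df = df, lower.tail = FALSE)

expected
chi_sq
pval
```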

Note: We ran chisq.test(dat, correct = FALSE) with the correct argument set to FALSE. If we had left it out, the function would use its default, correct = TRUE, and apply the Yates continuity correction. chisq.test() applies this correction to every 2 x 2 table, not only when an observed value is less than 5. In this case the formula becomes:

\[ X^2 = \sum \frac{(|\text{Observed} - \text{Expected}| - 0.5)^2}{\text{Expected}} \]
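To illustrate the note, the corrected statistic can be computed by hand and compared with the default call. This is a sketch that assumes `dat` is the 2 x 2 matrix of counts from Example 2; the names `expected`, `yates_stat` and `default_test` are introduced here. `suppressWarnings()` is used because R may warn that the chi-squared approximation could be inaccurate when some expected counts are below 5, as is the case here (4.5).

```r
# Counts from Example 2 (assumed layout: rows Cedar/Orange, columns Survived/Dead)
dat <- matrix(c(7, 3,
                4, 6),
              nrow = 2, byrow = TRUE)

# Expected counts under independence
expected <- outer(rowSums(dat), colSums(dat)) / sum(dat)

# Yates-corrected statistic: subtract 0.5 from each |O - E| before squaring
yates_stat <- sum((abs(dat - expected) - 0.5)^2 / expected)

# Default call: correct = TRUE, so the correction is applied to the 2 x 2 table
default_test <- suppressWarnings(chisq.test(dat))

c(by_hand = yates_stat, chisq_test = unname(default_test$statistic))
```

The corrected statistic is smaller than the uncorrected 1.8182, so the test becomes more conservative.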
