Fisher’s exact test in R: independence test for a small sample
Introduction
After presenting the Chi-square test of independence by hand and in R, this article focuses on the Fisher’s exact test.
Independence tests are used to determine whether there is a significant relationship between two categorical variables. There are two types of independence test:
- the Chi-square test (the most common)
- the Fisher’s exact test
The Chi-square test is used when the sample is large enough (in this case the \(p\)-value is an approximation that becomes exact as the sample size tends to infinity, as is the case for many statistical tests). Fisher’s exact test is used when the sample is small (in this case the \(p\)-value is exact, not an approximation).
The literature indicates that the usual rule of thumb for deciding whether the \(\chi^2\) approximation is good enough is that the Chi-square test is not appropriate when the expected value in at least one cell of the contingency table is less than 5; in that case Fisher’s exact test is preferred (McCrum-Gardner 2008; Bower 2003).
Hypotheses
The hypotheses of Fisher’s exact test are the same as for the Chi-square test, that is:
- \(H_0\) : the variables are independent, there is no relationship between the two categorical variables. Knowing the value of one variable does not help to predict the value of the other variable
- \(H_1\) : the variables are dependent, there is a relationship between the two categorical variables. Knowing the value of one variable helps to predict the value of the other variable
Example
Data
For our example, we want to determine whether there is a statistically significant association between smoking and being a professional athlete. Smoking can only be “yes” or “no” and being a professional athlete can only be “yes” or “no”. The two variables of interest are qualitative variables and we collected data on 14 people (see footnote 1).
Observed frequencies
Our data are summarized in the contingency table below reporting the number of people in each subgroup:
| | Non-smoker | Smoker |
|---|---|---|
| Athlete | 7 | 2 |
| Non-athlete | 0 | 5 |
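The R code later in this article relies on an object named dat that holds this contingency table, but its construction is not shown. A minimal sketch, assuming a plain matrix is acceptable (the original may well have used a data frame), could look like this. The raw data frame and its column names athlete and smoker are hypothetical, included only to show how the same table arises from 14 individual observations:

```r
# Contingency table from the article, entered directly.
# matrix() fills column by column: first the Non-smoker column, then Smoker.
dat <- matrix(
  c(7, 0, 2, 5),
  nrow = 2,
  dimnames = list(
    c("Athlete", "Non-athlete"),
    c("Non-smoker", "Smoker")
  )
)
dat

# Hypothetical raw data: one row per person (14 in total).
raw <- data.frame(
  athlete = rep(c("Athlete", "Athlete", "Non-athlete"), times = c(7, 2, 5)),
  smoker  = rep(c("Non-smoker", "Smoker", "Smoker"), times = c(7, 2, 5))
)

# Cross-tabulating the raw data gives the same counts as dat.
tab <- table(raw$athlete, raw$smoker)
tab
```

Either object can be passed to chisq.test() or fisher.test(); the matrix form is simply the most compact way to type in a table that is already summarized.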
Expected frequencies
Remember that Fisher’s exact test is used when at least one cell in the table of expected frequencies is below 5. To retrieve the expected frequencies, use the chisq.test() function together with $expected:
chisq.test(dat)$expected
## Warning in chisq.test(dat): Chi-squared approximation may be incorrect
##             Non-smoker Smoker
## Athlete            4.5    4.5
## Non-athlete        2.5    2.5
The table of expected frequencies above confirms that we should use Fisher’s exact test instead of the Chi-square test because at least one cell is below 5 (here, in fact, all four cells are).
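Under independence, each expected frequency is the row total times the column total, divided by the overall total. As a quick check, the values can be recomputed by hand. This sketch assumes dat is the 2×2 matrix of observed counts shown earlier (it is rebuilt here so the block runs on its own):

```r
# Observed counts (assumed layout: rows = athlete status, columns = smoking status).
dat <- matrix(
  c(7, 0, 2, 5),
  nrow = 2,
  dimnames = list(c("Athlete", "Non-athlete"), c("Non-smoker", "Smoker"))
)

# Expected count in cell (i, j) = row total i * column total j / grand total.
expected_manual <- outer(rowSums(dat), colSums(dat)) / sum(dat)
expected_manual

# TRUE if at least one expected frequency is below 5.
any(expected_manual < 5)
```

For instance, the Athlete / Non-smoker cell is 9 × 7 / 14 = 4.5, which agrees with the output of chisq.test(dat)$expected.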
Tip: although it is good practice to check the expected frequencies before deciding between the Chi-square test and Fisher’s exact test, it is not a big issue if you forget. As you can see above, when doing the Chi-square test in R (with chisq.test()), a warning such as “Chi-squared approximation may be incorrect” appears. This warning means that the smallest expected frequency is lower than 5. So if you forgot to check the expected frequencies, R will still alert you that the Chi-square approximation is doubtful, which is the signal to use Fisher’s exact test instead.
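The rule of thumb can also be applied programmatically. The helper below, independence_test(), is a hypothetical name and not part of base R. It is only a sketch that assumes the threshold of 5 mentioned above and falls back to Fisher’s exact test when any expected frequency is below that threshold:

```r
# Hypothetical helper (not a base R function): choose the test from the
# expected frequencies, using the "below 5" rule of thumb as the default.
independence_test <- function(tab, threshold = 5) {
  # suppressWarnings() hides the "approximation may be incorrect" warning,
  # since the expected counts are inspected explicitly just below.
  expected <- suppressWarnings(chisq.test(tab)$expected)
  if (any(expected < threshold)) {
    fisher.test(tab)
  } else {
    chisq.test(tab)
  }
}

# Observed counts from the article (assumed to be stored as a matrix).
dat <- matrix(
  c(7, 0, 2, 5),
  nrow = 2,
  dimnames = list(c("Athlete", "Non-athlete"), c("Non-smoker", "Smoker"))
)

res <- independence_test(dat)
res$method   # name of the test that was actually run
res$p.value
```

Whether a hard threshold is the right criterion is a judgment call; the sketch just mechanizes the rule described in this article.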
Fisher’s exact test in R
To perform Fisher’s exact test in R, use the fisher.test() function as you would for the Chi-square test (see footnote 2):
test <- fisher.test(dat)
test
##
##  Fisher's Exact Test for Count Data
##
## data:  dat
## p-value = 0.02098
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  1.449481      Inf
## sample estimates:
## odds ratio
##        Inf
The most important element of the output is the \(p\)-value. You can also retrieve the \(p\)-value with:
test$p.value
## [1] 0.02097902
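To see where this exact \(p\)-value comes from: with the row and column totals held fixed, the count in the top-left cell follows a hypergeometric distribution under \(H_0\), and the two-sided \(p\)-value adds up the probabilities of all tables that are no more likely than the observed one. The sketch below reproduces this with dhyper(). It assumes dat is the 2×2 matrix of observed counts and uses a small tolerance for ties, so it is an illustration rather than a replacement for fisher.test():

```r
# Observed counts from the article (assumed to be stored as a matrix).
dat <- matrix(
  c(7, 0, 2, 5),
  nrow = 2,
  dimnames = list(c("Athlete", "Non-athlete"), c("Non-smoker", "Smoker"))
)

x <- dat[1, 1]        # observed count: athletes who do not smoke
m <- sum(dat[, 1])    # total number of non-smokers
n <- sum(dat[, 2])    # total number of smokers
k <- sum(dat[1, ])    # total number of athletes

# All values the top-left cell can take given the fixed margins.
support <- max(0, k - n):min(k, m)
probs <- dhyper(support, m, n, k)

# Probability of the observed table.
p_obs <- dhyper(x, m, n, k)

# Two-sided p-value: sum over tables at most as probable as the observed one.
p_manual <- sum(probs[probs <= p_obs * (1 + 1e-7)])
p_manual
```

Here only the two most extreme tables (top-left cell equal to 2 or 7) qualify, each with probability 36/3432, so p_manual equals 72/3432 ≈ 0.02098, in agreement with test$p.value.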
Conclusion and interpretation
From the output and from test$p.value we see that the \(p\)-value is less than the significance level of 5%. As with any other statistical test, if the \(p\)-value is less than the significance level, we can reject the null hypothesis. If you are not familiar with \(p\)-values, I invite you to read this section.
\(\Rightarrow\) In our context, rejecting the null hypothesis for the Fisher’s exact test of independence means that there is a significant relationship between the two categorical variables (smoking habits and being an athlete or not). Therefore, knowing the value of one variable helps to predict the value of the other variable.
Thanks for reading. I hope the article helped you to perform the Fisher’s exact test of independence in R and interpret its results. Learn more about the Chi-square test of independence by hand or in R.
As always, if you have a question or a suggestion related to the topic covered in this article, please add it as a comment so other readers can benefit from the discussion.
References
Bower, Keith M. 2003. “When to Use Fisher’s Exact Test.” American Society for Quality, Six Sigma Forum Magazine 2 (4): 35–37.
McCrum-Gardner, Evie. 2008. “Which Is the Correct Statistical Test to Use?” British Journal of Oral and Maxillofacial Surgery 46 (1): 38–41.
1. The data are the same as in the article covering the Chi-square test by hand, except that some observations have been removed to decrease the sample size.
2. Use fisher.test(table(dat$variable1, dat$variable2)) if dat represents the raw data and is not already presented as a contingency table.