The chi-squared test of independence is one of the most basic and common hypothesis tests in the statistical analysis of categorical data. Given 2 categorical random variables, $X$ and $Y$, the chi-squared test of independence tests whether there is a statistical dependence between them. Formally, it is a hypothesis test with the following null and alternative hypotheses:

$H_0$: $X$ and $Y$ are independent.

$H_a$: $X$ and $Y$ are dependent.
If you’re not familiar with probabilistic independence and how it manifests in categorical random variables, watch my video on calculating expected counts in contingency tables using joint and marginal probabilities. For your convenience, here is another video that gives a gentler and more practical understanding of calculating expected counts using marginal proportions and marginal totals.
The Example: Gender and Ice Cream Flavour Preferences
In this example, we seek to determine whether or not there is an association between gender and preference for ice cream flavour – these are the 2 categorical variables of interest.
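Gender has 2 categories:
- Male
- Female
Ice cream flavour has 3 categories:
- Chocolate
- Vanilla
- Strawberry
My data come from a hypothetical survey of 920 people that asks for their preference of 1 of the above 3 ice cream flavours. Here are the data:

| Gender | Chocolate | Vanilla | Strawberry | Total |
|--------|-----------|---------|------------|-------|
| Male   | 100       | 120     | 60         | 280   |
| Female | 350       | 200     | 90         | 640   |
| Total  | 450       | 320     | 150        | 920   |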
Calculating Expected Counts
Before performing a chi-squared test of independence, I encourage you to organize your data in a contingency table and calculate the expected counts. In my previous video tutorials, I have discussed the conceptual background of expected counts in great detail, so I encourage you to watch them first if you are not familiar with expected counts. I showed 2 different ways of calculating expected counts using the above data on gender and preferences of ice cream flavour:
- Video Tutorial – Calculating Expected Counts in Contingency Tables Using Marginal Proportions and Marginal Totals
- Video Tutorial – Calculating Expected Counts in a Contingency Table Using Joint Probabilities
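As a minimal sketch of the marginal-totals approach, the expected counts can also be computed directly in R from the survey counts used in this post. The object names (`observed`, `row.totals`, `col.totals`, `grand.total`, `expected`) are my own choices for illustration and do not come from the videos.

```r
# Observed counts from the survey (rows: gender, columns: flavour)
observed <- matrix(
  c(100, 120, 60,
    350, 200, 90),
  nrow = 2, byrow = TRUE,
  dimnames = list(gender  = c("Male", "Female"),
                  flavour = c("Chocolate", "Vanilla", "Strawberry"))
)

row.totals  <- rowSums(observed)   # marginal totals for gender
col.totals  <- colSums(observed)   # marginal totals for flavour
grand.total <- sum(observed)       # 920 respondents in total

# Under independence, the expected count of each cell is
# (row total * column total) / grand total
expected <- outer(row.totals, col.totals) / grand.total
print(round(expected, 2))
```

The expected counts sum to the same grand total as the observed counts, which is a quick sanity check on the calculation.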
The Chi-Squared Test Statistic
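The test statistic for the chi-squared test of independence is

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},$$

where
- $r$ is the number of rows.
- $c$ is the number of columns.
- $O_{ij}$ is the observed count of the cell in the $i$th row and the $j$th column.
- $E_{ij}$ is the expected count of the cell in the $i$th row and the $j$th column.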
This chi-squared test statistic has $(r - 1)(c - 1)$ degrees of freedom. If the observed counts are “close enough” to the expected counts, then the sum of all of their squared, scaled deviations (i.e. $\chi^2$) should be small. Otherwise, $\chi^2$ would be very big, suggesting that the original hypothesis of independence between the 2 random variables is not valid. In our example, there are $r = 2$ rows and $c = 3$ columns, so the number of degrees of freedom is $(2 - 1)(3 - 1) = 2$.
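To make the arithmetic concrete, here is a short sketch that computes the test statistic, the degrees of freedom and the p-value by hand in R for the gender and ice cream data. The variable names (`chi.sq`, `deg.free`, `p.value`) are illustrative assumptions of mine; the p-value is the upper-tail probability of the chi-squared distribution, obtained with `pchisq()`.

```r
# Observed counts: row 1 = men, row 2 = women;
# columns = chocolate, vanilla, strawberry
observed <- matrix(c(100, 120, 60,
                     350, 200, 90),
                   nrow = 2, byrow = TRUE)

# Expected counts under independence
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Sum of (observed - expected)^2 / expected over all cells
chi.sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
deg.free <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of a chi-squared distribution with deg.free degrees of freedom
p.value <- pchisq(chi.sq, df = deg.free, lower.tail = FALSE)

print(c(chi.sq = chi.sq, df = deg.free, p.value = p.value))
```

For these data the statistic comes out to about 28.36 on 2 degrees of freedom.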
SAS Code and Output
Here is the SAS code for entering the data in the DATA step and conducting the chi-squared test using PROC FREQ. If you are not familiar with my code in the beginning for clearing the log, the output window and the results window, read my earlier post about how it works.
    * Demonstrating the Chi-Squared Test of Independence;
    * By Eric Cai - The Chemical Statistician;

    dm 'cle log; cle out;';
    ods html close;
    ods html;
    dm 'odsresults; clear';
    ods listing close;
    ods listing;
    options noovp linesize = 105 formdlim = '-' pageno = min;

    title 'Are Gender and Ice Cream Flavour Preference Independent?';

    libname chisq 'INSERT YOUR DIRECTORY PATH FOR YOUR LIBRARY HERE!';

    * entering the survey data (i.e. the observed counts) in the DATA step;
    data chisq.icecream;
         * list input defaults to 8 characters, which would truncate "Strawberry";
         length gender $ 6 flavour $ 10;
         input gender $ flavour $ count;
         datalines;
    Male Chocolate 100
    Male Vanilla 120
    Male Strawberry 60
    Female Chocolate 350
    Female Vanilla 200
    Female Strawberry 90
    ;
    run;

    * conducting the chi-squared test of independence within PROC FREQ;
    proc freq data = chisq.icecream;
         * this tables statement generates the contingency table;
         * the chisq option requests the chi-squared test of independence;
         tables gender * flavour / chisq;
         * this weight statement tells SAS that the variable "count" contains the observed counts for each observation;
         weight count;
         * add a title to the PROC FREQ output;
         title2 'Chi-Squared Test of Independence: Gender and Ice Cream Flavour Preference';
    run;
Here is the output from PROC FREQ:
R Code and Output
Here is the same analysis done in R.
    ##### Demonstrating the Chi-Squared Test of Independence in R
    ##### By Eric Cai
    ##### The Chemical Statistician

    # Entering the data into vectors
    men = c(100, 120, 60)
    women = c(350, 200, 90)

    # combining the row vectors into a matrix, then converting the matrix into a data frame
    ice.cream.survey = as.data.frame(rbind(men, women))

    # assigning column names to this data frame
    names(ice.cream.survey) = c('chocolate', 'vanilla', 'strawberry')

    # conducting the chi-squared test of independence
    chisq.test(ice.cream.survey)
Here is the output from the chisq.test() function.
    > chisq.test(ice.cream.survey)

            Pearson's Chi-squared test

    data:  ice.cream.survey
    X-squared = 28.3621, df = 2, p-value = 6.938e-07
Notice that the chi-squared test statistic and the number of degrees of freedom are the same in both the R and the SAS output.
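If you want to work with these numbers rather than just read them off the printout, the object returned by `chisq.test()` can be stored and inspected. The sketch below assumes the same survey counts; the name `result` is my own, while `statistic`, `parameter`, `p.value` and `expected` are standard components of the returned object.

```r
# Same survey counts as above, stored as a matrix
ice.cream.counts <- rbind(men   = c(100, 120, 60),
                          women = c(350, 200, 90))
colnames(ice.cream.counts) <- c("chocolate", "vanilla", "strawberry")

# Store the test result instead of only printing it
result <- chisq.test(ice.cream.counts)

result$statistic   # the chi-squared test statistic
result$parameter   # the number of degrees of freedom
result$p.value     # the p-value
result$expected    # the expected counts under independence
```

Having `result$expected` available is convenient for comparing the observed and expected counts cell by cell.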
Interpreting the Results
I will articulate the interpretation of the very small p-value in 2 different ways:
- It provides strong evidence to suggest that gender and ice cream flavour preference are dependent or have some association. (This is a probabilistic interpretation, but it is not very clear what it means on a practical level.)
- It provides strong evidence to suggest that men and women tend to have different preferences for ice cream flavours. (This is a practical implication.)
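One way to see the practical side of this result is to compare the proportion of each gender that prefers each flavour. This is a small illustrative sketch using the same survey counts; the name `row.props` is my own, and `prop.table()` with `margin = 1` divides each cell by its row total.

```r
# Observed counts (rows: gender, columns: flavour)
observed <- matrix(
  c(100, 120, 60,
    350, 200, 90),
  nrow = 2, byrow = TRUE,
  dimnames = list(gender  = c("Male", "Female"),
                  flavour = c("Chocolate", "Vanilla", "Strawberry"))
)

# Proportion of each gender choosing each flavour (each row sums to 1)
row.props <- prop.table(observed, margin = 1)
print(round(row.props, 3))
```

In these data, about 55% of women chose chocolate compared with about 36% of men, while vanilla was the most common choice among men. If gender and flavour preference were independent, the two rows of proportions would be roughly the same.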