The Chi-Squared Test of Independence – An Example in Both R and SAS

(This article was first published on The Chemical Statistician » R programming, and kindly contributed to R-bloggers)

Introduction

The chi-squared test of independence is one of the most basic and common hypothesis tests in the statistical analysis of categorical data.  Given 2 categorical random variables, X and Y, the chi-squared test of independence assesses whether the data provide evidence of a statistical dependence between them.  Formally, it is a hypothesis test with the following null and alternative hypotheses:

H_0: X \perp Y \ \ \ \ \ \text{vs.} \ \ \ \ \ H_a: X \not \perp Y

If you’re not familiar with probabilistic independence and how it manifests in categorical random variables, watch my video on calculating expected counts in contingency tables using joint and marginal probabilities.  For your convenience, here is another video that gives a gentler and more practical understanding of calculating expected counts using marginal proportions and marginal totals.
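As a brief reminder of what the videos cover, independence of 2 categorical variables means that every joint probability factors into the product of its marginal probabilities, and the expected counts under the null hypothesis follow from that. The notation below (n for the total sample size, and "row i total" and "column k total" for the marginal totals) is my own shorthand for this summary.

```latex
X \perp Y \iff P(X = i, Y = k) = P(X = i) \, P(Y = k) \ \ \text{for all categories } i \text{ and } k

E_{ik} = n \cdot \frac{\text{row } i \text{ total}}{n} \cdot \frac{\text{column } k \text{ total}}{n}
       = \frac{(\text{row } i \text{ total}) \times (\text{column } k \text{ total})}{n}
```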

Today, I will continue from those 2 videos and illustrate how the chi-squared test of independence can be implemented in both R and SAS with the same example.

 

The Example: Gender and Ice Cream Flavour Preferences

In this example, we seek to determine whether or not there is an association between gender and preference for ice cream flavour – these are the 2 categorical variables of interest.

Gender has 2 categories:

  • Men
  • Women

Ice cream flavour has 3 categories:

  • Chocolate
  • Vanilla
  • Strawberry

My data come from a hypothetical survey of 920 people that asks for their preference among the above 3 ice cream flavours.  Here are the data:

                 Flavour
Gender    Chocolate   Vanilla   Strawberry   Total
Men             100       120           60     280
Women           350       200           90     640
Total           450       320          150     920

 

Calculating Expected Counts

Before performing a chi-squared test of independence, I encourage you to organize your data in a contingency table and calculate the expected counts.  In my previous video tutorials, I have discussed the conceptual background of expected counts in great detail, so I encourage you to watch them first if you are not familiar with expected counts.  I showed 2 different ways of calculating expected counts using the above data on gender and preferences of ice cream flavour:

  1. Video Tutorial – Calculating Expected Counts in Contingency Tables Using Marginal Proportions and Marginal Totals
  2. Video Tutorial – Calculating Expected Counts in a Contingency Table Using Joint Probabilities
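If you would rather check the expected counts in code, here is a minimal sketch in R of the marginal-totals approach, assuming the observed counts from the table above. The object names (observed, row.totals, col.totals, grand.total, expected) are my own choices for illustration.

```r
# Observed counts from the survey table (rows = gender, columns = flavour)
observed = matrix(c(100, 120, 60,
                    350, 200, 90),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(gender = c('men', 'women'),
                                  flavour = c('chocolate', 'vanilla', 'strawberry')))

# Marginal totals and the grand total
row.totals = rowSums(observed)
col.totals = colSums(observed)
grand.total = sum(observed)

# Expected count of each cell = (row total * column total) / grand total
expected = outer(row.totals, col.totals) / grand.total

print(round(expected, 2))
```

The expected counts always add up to the same grand total as the observed counts, which is a quick sanity check on the calculation.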

 

The Chi-Squared Test Statistic

The test statistic for the chi-squared test of independence is

\chi^2 = \sum_{i = 1}^{r} \sum_{k = 1}^{c} \frac{(O_{ik} - E_{ik})^2}{E_{ik}},

where

  • r is the number of rows.
  • c is the number of columns.
  • O_{ik} is the observed count of the cell in the i\textit{th} row and the k\textit{th} column.
  • E_{ik} is the expected count of the cell in the i\textit{th} row and the k\textit{th} column.

Under the null hypothesis, this test statistic approximately follows a chi-squared distribution with (r - 1)(c - 1) degrees of freedom.  If the observed counts are “close enough” to the expected counts, then the sum of their squared deviations, each scaled by its expected count (i.e. \chi^2), should not be much bigger than 0.  Otherwise, \chi^2 would be very big, suggesting that the null hypothesis of independence between the 2 random variables is not valid.  In our example, there are 2 rows and 3 columns, so the number of degrees of freedom is (2-1)(3-1) = 2.
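To make the formula concrete, here is a minimal sketch in R that computes the test statistic, the degrees of freedom and the p-value by hand for the ice cream data. The object names are my own, and the p-value comes from the upper tail of the chi-squared distribution via pchisq().

```r
# Observed counts (rows = men, women; columns = chocolate, vanilla, strawberry)
observed = matrix(c(100, 120, 60,
                    350, 200, 90),
                  nrow = 2, byrow = TRUE)

# Expected counts under independence: (row total * column total) / grand total
expected = outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-squared test statistic: sum over all cells of (O - E)^2 / E
chi.squared = sum((observed - expected)^2 / expected)

# Degrees of freedom: (r - 1)(c - 1)
deg.freedom = (nrow(observed) - 1) * (ncol(observed) - 1)

# p-value: probability of a statistic at least this large under the null hypothesis
p.value = pchisq(chi.squared, df = deg.freedom, lower.tail = FALSE)

print(c(chi.squared = chi.squared, deg.freedom = deg.freedom, p.value = p.value))
```

These hand-calculated values should agree with what PROC FREQ and chisq.test() report for the same data.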

 

SAS Code and Output

Here is the SAS code for entering the data in the DATA step and conducting the chi-squared test using PROC FREQ.  If you are not familiar with my code in the beginning for clearing the log, the output window and the results window, read my earlier post about how it works.

* Demonstrating the Chi-Squared Test of Independence;
* By Eric Cai - The Chemical Statistician;
dm 'cle log; cle out;';
ods html close; 
ods html;

dm 'odsresults; clear';
ods listing close;
ods listing;

options 
     noovp
     linesize = 105
     formdlim = '-'
     pageno = min
;

title 'Are Gender and Ice Cream Flavour Preference Independent?';
libname chisq 'INSERT YOUR DIRECTORY PATH FOR YOUR LIBRARY HERE!';

* entering the survey data (i.e. the observed counts) in the DATA step;
data chisq.icecream;
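     * declaring lengths so that "Chocolate" and "Strawberry" are not truncated to the default 8 characters;
     length gender $ 6 flavour $ 10;
     input gender$ flavour$ count;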
     datalines;
     Male Chocolate 100
     Male Vanilla 120
     Male Strawberry 60
     Female Chocolate 350
     Female Vanilla 200
     Female Strawberry 90
;
run;

* conducting the chi-squared test of independence within PROC FREQ;
proc freq 
     data = chisq.icecream;

     * this tables statement generates the contingency table;
     * the chisq option requests the chi-squared test of independence; 
     tables gender * flavour 
          / chisq;

     * this weight statement tells SAS that the variable "count" contains the observed counts for each observation;
     weight count;

     * add a title to the PROC FREQ output;
     title2 'Chi-Squared Test of Independence: Gender and Ice Cream Flavour Preference';
run;

 

Here is the output from PROC FREQ:

 

[Image: SAS output from PROC FREQ]

 

R Code and Output

Here is the same analysis done in R.

##### Demonstrating the Chi-Squared Test of Independence in R
##### By Eric Cai
##### The Chemical Statistician

# Entering the data into vectors
men = c(100, 120, 60)
women = c(350, 200, 90)

# combining the row vectors into a matrix, then converting the matrix into a data frame
ice.cream.survey = as.data.frame(rbind(men, women))

# assigning column names to this data frame
names(ice.cream.survey) = c('chocolate', 'vanilla', 'strawberry')

chisq.test(ice.cream.survey)

 

Here is the output from the chisq.test() function.

> chisq.test(ice.cream.survey)

 Pearson's Chi-squared test

data: ice.cream.survey
X-squared = 28.3621, df = 2, p-value = 6.938e-07

 

*Notice that the chi-squared test statistic and the number of degrees of freedom are the same in both the R and the SAS output!
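If you want more than the printed summary, the object returned by chisq.test() stores the individual pieces of the test. Here is a short sketch, rebuilding the same data frame so that it runs on its own; the name test.result is my own.

```r
# Rebuilding the survey data as in the code above
men = c(100, 120, 60)
women = c(350, 200, 90)
ice.cream.survey = as.data.frame(rbind(men, women))
names(ice.cream.survey) = c('chocolate', 'vanilla', 'strawberry')

# Storing the test result instead of only printing it
test.result = chisq.test(ice.cream.survey)

test.result$statistic   # chi-squared test statistic
test.result$parameter   # degrees of freedom
test.result$p.value     # p-value
test.result$expected    # expected counts under independence
test.result$residuals   # Pearson residuals: (observed - expected) / sqrt(expected)
```

The squared Pearson residuals add up to the test statistic, so they show which cells contribute most to the departure from independence.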

 

Interpreting the Results

I will articulate the interpretation of the very small p-value in 2 different ways:

  1. It provides strong evidence to suggest that gender and ice cream flavour preference are dependent or have some association.  (This is a probabilistic interpretation, but it is not very clear what it means on a practical level.)
  2. It provides strong evidence to suggest that men and women tend to have different preferences for ice cream flavours.  (This is a practical implication.)
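One way to see the practical implication is to compare the proportion of each gender that prefers each flavour. Here is a minimal sketch in R using prop.table() on the observed counts; the object names are my own, and this is a descriptive supplement to the test rather than part of it.

```r
# Observed counts (rows = gender, columns = flavour)
observed = matrix(c(100, 120, 60,
                    350, 200, 90),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(gender = c('men', 'women'),
                                  flavour = c('chocolate', 'vanilla', 'strawberry')))

# Row proportions: the share of each gender that prefers each flavour
flavour.by.gender = prop.table(observed, margin = 1)

print(round(flavour.by.gender, 3))
```

In these hypothetical data, roughly 55% of women chose chocolate compared with roughly 36% of men, while vanilla was the most popular flavour among men.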
