# Finding Patterns Amongst Binary Variables with the homals Package

February 10, 2013

(This article was first published on Data and Analysis with R, at Work, and kindly contributed to R-bloggers)

It’s survey analysis season for me at work! When analyzing survey data, the one kind of analysis I have realized I’m not used to doing is finding patterns in binary data. In other words, if I have a question to which multiple, non-mutually exclusive (checkbox) answers apply, how do I find the patterns in people’s responses to this question?

I tried applying PCA and factor analysis in turn, but neither seems well suited to data consisting only of binary columns (1s and 0s). In searching for something that works, I came across the homals package. While its main function is described as a “homogeneity analysis”, the one ability that interests me is called “non-linear PCA”. This is supposed to be able to reduce the dimensionality of your dataset even when the variables are all binary.

Well, here’s an example using some real survey data (with masked variable names).  First we start off with the purpose of the data and some simple summary stats:

It’s a group of 6 variables (answer choices) showing people’s checkbox responses to a question asking them why they donated to a particular charity. Following are the numbers of responses to each answer choice:

mapply(whydonate, FUN=sum, 1)
 V1  V2  V3  V4  V5  V6
201  79 183 117 288 199
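
A more direct way to get this kind of per-column tally is `colSums`. The sketch below uses a small made-up data frame (`toy` is a hypothetical stand-in; the real `whydonate` data is not reproduced here). One detail worth knowing about the `mapply` call above: the trailing `1` is recycled and handed to `sum` as a second argument, so each total it returns is one higher than the plain column sum.

```r
# Hypothetical toy data: 5 respondents, 3 checkbox answer choices (1 = ticked).
toy <- data.frame(
  V1 = c(1, 0, 1, 1, 0),
  V2 = c(0, 0, 1, 0, 0),
  V3 = c(1, 1, 1, 0, 1)
)

# Number of respondents who ticked each choice.
counts <- colSums(toy)
print(counts)

# The mapply form passes the extra 1 to sum(), so every total is inflated by one.
inflated <- mapply(toy, FUN = sum, 1)
print(inflated - counts)
```

If missing responses are possible, `colSums(toy, na.rm = TRUE)` would be the variant to reach for.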

With the possible exception of answer choice V2, there are some pretty healthy numbers in each answer choice.  Next, let’s load up the homals package and run our non-linear PCA on the data.

library(homals)
fit = homals(whydonate)

fit
Call: homals(data = whydonate)

Loss: 0.0003248596

Eigenvalues:
    D1     D2
0.0267 0.0156

           D1          D2
V1 0.28440348 -0.10010355
V2 0.07512143 -0.10188037
V3 0.09897585  0.32713745
V4 0.20464762  0.21866432
V5 0.26782837 -0.09600215
V6 0.33198532 -0.04843107

As you can see, it extracts 2 dimensions by default (it can be changed using the “ndim” argument in the function), and it gives you what looks very much like a regular PCA loadings table.
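
As a rough illustration of the `ndim` argument, here is a sketch that asks for three dimensions instead of two. The data frame `sim` is simulated purely for illustration (it is not the survey data), and the fit only runs if the homals package is installed.

```r
# Simulated stand-in for the survey data: 300 respondents, 6 binary columns.
set.seed(1)
sim <- as.data.frame(matrix(rbinom(300 * 6, size = 1, prob = 0.4), ncol = 6))

# Only run the fit if homals is available on this machine.
if (requireNamespace("homals", quietly = TRUE)) {
  fit3 <- homals::homals(sim, ndim = 3)  # request 3 dimensions instead of the default 2
  print(fit3)
}
```

Because the simulated columns are independent, no clear structure should be expected in the resulting loadings; the point is only to show where the argument goes.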

Reading it naively, the pattern I see in the first dimension goes something like this: people tended to answer affirmatively to answer choices 1, 4, 5, and 6 as a group (obviously not all the time, and not all together, though!), but those answers didn’t tend to be used alongside choices 2 and 3.
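
One way to make this kind of reading less of an eyeball exercise is to sort the answer choices by their loading on a dimension. The sketch below simply retypes the loadings from the printed output into a matrix (the object names `loadings` and `by_d1` are my own) and orders the rows by D1.

```r
# Loadings copied from the printed homals output above.
loadings <- matrix(
  c(0.28440348, -0.10010355,
    0.07512143, -0.10188037,
    0.09897585,  0.32713745,
    0.20464762,  0.21866432,
    0.26782837, -0.09600215,
    0.33198532, -0.04843107),
  ncol = 2, byrow = TRUE,
  dimnames = list(paste0("V", 1:6), c("D1", "D2"))
)

# Sort answer choices by their D1 loading, largest first.
by_d1 <- loadings[order(loadings[, "D1"], decreasing = TRUE), ]
print(round(by_d1, 3))
```

Swapping `"D1"` for `"D2"` in the `order` call gives the same view for the second dimension.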

In the second dimension I see: people tended to answer affirmatively to answer choices 3 and 4 as a group.

Okay, now as a simple check, let’s look at the correlation matrix for these binary variables:

cor(whydonate)

           V1            V2            V3         V4          V5         V6
V1 1.00000000  0.0943477325  0.0205241732 0.16409945 0.254854574 0.45612458
V2 0.09434773  1.0000000000 -0.0008474402 0.01941461 0.038161091 0.08661938
V3 0.02052417 -0.0008474402  1.0000000000 0.21479291 0.007465142 0.11416164
V4 0.16409945  0.0194146144  0.2147929137 1.00000000 0.158325383 0.22777471
V5 0.25485457  0.0381610906  0.0074651417 0.15832538 1.000000000 0.41749064
V6 0.45612458  0.0866193754  0.1141616374 0.22777471 0.417490642 1.00000000
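
A note on what `cor` measures here: for two 0/1 variables, the Pearson correlation equals the phi coefficient of their 2x2 table, so each entry in the matrix summarizes how often two choices were ticked together relative to what their individual frequencies would suggest. A minimal sketch with made-up vectors (`a` and `b` are hypothetical, not survey columns):

```r
# Hypothetical toy responses for two checkbox choices (1 = ticked).
a <- c(1, 1, 1, 0, 0, 0, 1, 0)
b <- c(1, 1, 0, 0, 0, 1, 1, 0)

# 2x2 table of counts: rows are values of a, columns are values of b.
tab <- table(a, b)
n00 <- tab["0", "0"]; n01 <- tab["0", "1"]
n10 <- tab["1", "0"]; n11 <- tab["1", "1"]

# Phi coefficient computed from the four cell counts.
phi <- (n11 * n00 - n10 * n01) /
  sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))

# Pearson correlation on the raw 0/1 vectors gives the same number.
print(c(phi = unname(phi), pearson = cor(a, b)))
```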

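The first dimension is easy to spot in the “V1” column of the correlation matrix: V1 correlates most strongly with V6 (0.46), V5 (0.25), and V4 (0.16), and only weakly with V2 (0.09) and V3 (0.02). The second dimension shows up in the “V3” column, where the largest correlation is with V4 (0.21). Both check out! I find that neat and easy. Does anyone use anything else to find patterns in binary data like this? Feel free to tell me in the comments!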
