It’s survey analysis season for me at work! One kind of analysis I have realized I’m not used to doing is finding patterns in binary data. In other words, if I have a question to which multiple, non-mutually exclusive (checkbox) answers apply, how do I find the patterns in people’s responses to this question?
I tried applying PCA and Factor Analysis in turn, but they really don’t seem well suited to data consisting of only binary columns (1s and 0s). In searching for something that works, I came across the homals package. While the main function is described as a “homogeneity analysis”, the one ability that interests me is called “non-linear PCA”. This is supposed to be able to reduce the dimensionality of your dataset even when the variables are all binary.
Well, here’s an example using some real survey data (with masked variable names). First we start off with the purpose of the data and some simple summary stats:
It’s a group of 6 variables (answer choices) showing people’s check-box responses to a question asking them why they donated to a particular charity. Following are the numbers of responses to each answer choice:
mapply(whydonate, FUN=sum, 1)

 V1  V2  V3  V4  V5  V6
201  79 183 117 288 199
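As a quick aside, the same kind of tally can be sketched with base R's `colSums()`. The `toy` data frame below is made up purely for illustration (it is not the survey data). One thing worth knowing: in the `mapply()` form, the trailing `1` is handed to `sum()` as an extra value, so as far as I can tell each total comes out one higher than the plain column sum.

```r
# Hypothetical 0/1 checkbox data, invented for illustration only
toy <- data.frame(
  V1 = c(1, 0, 1, 1),
  V2 = c(0, 0, 1, 0),
  V3 = c(1, 1, 0, 0)
)

# Number of respondents who ticked each answer choice
counts <- colSums(toy)

# The mapply() form passes the extra 1 to sum() for every column,
# so it adds one to each total
counts_plus_one <- mapply(toy, FUN = sum, 1)

print(counts)
print(counts_plus_one)
```

With counts in the hundreds, the difference doesn't change the picture, but `colSums()` is the simpler way to get exact totals.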
With the possible exception of answer choice V2, there are some pretty healthy numbers in each answer choice. Next, let’s load up the homals package and run our non-linear PCA on the data.
library(homals)
fit = homals(whydonate)
fit

Call: homals(data = whydonate)

Loss: 0.0003248596

Eigenvalues:
    D1     D2
0.0267 0.0156

Variable Loadings:
           D1          D2
V1 0.28440348 -0.10010355
V2 0.07512143 -0.10188037
V3 0.09897585  0.32713745
V4 0.20464762  0.21866432
V5 0.26782837 -0.09600215
V6 0.33198532 -0.04843107
As you can see, it extracts 2 dimensions by default (it can be changed using the “ndim” argument in the function), and it gives you what looks very much like a regular PCA loadings table.
Reading it naively, the pattern I see in the first dimension goes something like this: People tended to answer affirmatively to answer choices 1, 4, 5, and 6 as a group (obviously not all the time, and not always all together!), but those answers didn’t tend to be used alongside choices 2 and 3.
In the second dimension I see: People tended to answer affirmatively to answer choices 3 and 4 as a group.
Okay, now as a simple check, let’s look at the correlation matrix for these binary variables:
cor(whydonate)

           V1            V2            V3         V4          V5         V6
V1 1.00000000  0.0943477325  0.0205241732 0.16409945 0.254854574 0.45612458
V2 0.09434773  1.0000000000 -0.0008474402 0.01941461 0.038161091 0.08661938
V3 0.02052417 -0.0008474402  1.0000000000 0.21479291 0.007465142 0.11416164
V4 0.16409945  0.0194146144  0.2147929137 1.00000000 0.158325383 0.22777471
V5 0.25485457  0.0381610906  0.0074651417 0.15832538 1.000000000 0.41749064
V6 0.45612458  0.0866193754  0.1141616374 0.22777471 0.417490642 1.00000000
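A side note on what these numbers mean: for 0/1 columns, the Pearson correlation that `cor()` reports is the phi coefficient of the 2x2 table of the two answers. Here is a minimal sketch on made-up vectors (`a` and `b` are hypothetical, not survey columns) showing that the two agree, with `crossprod()` thrown in to get the raw co-occurrence counts.

```r
# Hypothetical 0/1 responses for two answer choices (illustration only)
a <- c(1, 1, 0, 1, 0, 0, 1, 0)
b <- c(1, 0, 0, 1, 0, 1, 1, 0)

# Cell counts of the 2x2 table
n11 <- sum(a == 1 & b == 1)  # both ticked
n10 <- sum(a == 1 & b == 0)  # only a ticked
n01 <- sum(a == 0 & b == 1)  # only b ticked
n00 <- sum(a == 0 & b == 0)  # neither ticked

# Phi coefficient computed from the cell counts
phi <- (n11 * n00 - n10 * n01) /
  sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))

# Pearson correlation on the same vectors
r <- cor(a, b)

# Co-occurrence counts: how often each pair of choices was ticked together
co <- crossprod(cbind(a, b))

print(c(phi = phi, pearson = r))
print(co)
```

So a correlation matrix on checkbox data is really a table of pairwise phi coefficients, which is why it works as a sanity check on the loadings.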
The first dimension is easy to spot in the “V1” column of the correlation matrix above. We can also see the second dimension in the “V3” column. Both check out! I find that neat and easy. Does anyone use anything else to find patterns in binary data like this? Feel free to tell me in the comments!