[This article was first published on R – Displayr, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

To measure the overlap or similarity between the data in two binary variables you can use a Jaccard coefficient. The coefficient ranges between 0 and 1, with 1 indicating that the two variables overlap completely, and 0 indicating that there are no selections in common. In this post I show you how to do the calculation in Displayr using R, by looking at overlaps between the devices people own, as indicated by their responses to a survey.

## The Jaccard coefficient

The Jaccard coefficient for two variables is defined as the number of cases where both variables are equal to 1 (called the “set intersection”), divided by the number of cases where either of the two variables is equal to 1 (called the “set union”). The formula for the Jaccard coefficient for two variables, A and B, is

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

The top part counts the number of cases for which both variables are 1, and the bottom part counts the cases for which either variable is 1.
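As a small illustration of this definition, the following R sketch computes the coefficient for two made-up binary vectors. The vectors `a` and `b` are hypothetical and are not taken from the survey discussed in this post.

```r
# Two hypothetical binary variables measured on the same five cases.
a <- c(1, 1, 0, 1, 0)
b <- c(1, 0, 0, 1, 1)

n.both   <- sum(a == 1 & b == 1)  # cases where both variables are 1
n.either <- sum(a == 1 | b == 1)  # cases where at least one variable is 1

jaccard.ab <- n.both / n.either
print(jaccard.ab)
```

Here two cases have both variables equal to 1 and four cases have at least one equal to 1, so the coefficient is 2 / 4.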

You can visualize the coefficient in terms of a Venn diagram. As a basic example, consider a survey question which asks respondents to select which devices (iPhone, Laptop, etc.) they own. We may want to know the overlap between people who said they own an iPhone and people who said they own an iPad.

The Venn diagram for these two variables, which you can create in Displayr by selecting Insert > Visualization > Venn Diagram, selecting your Variables, and clicking Automatic, looks like this:

There is a big overlap between iPhone owners and iPad owners in this sample. The Jaccard coefficient is the number of people in the overlapping area in the middle of the diagram, divided by the total number of people represented by the colored area. In this case the Jaccard coefficient is 0.53.

On the other hand, the Venn diagram for Samsung owners and iPhone owners is quite different:

The proportion of the total area represented by the overlapping segment is much smaller. The Jaccard coefficient is only 0.16.
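The link between the regions of a Venn diagram and the coefficient can be sketched in R with a cross-tabulation. The data below are hypothetical (they are not the survey responses behind the diagrams in this post), and the names `iphone` and `ipad` are chosen only for illustration.

```r
# Hypothetical ownership indicators for eight respondents (1 = owns the device).
iphone <- c(1, 1, 1, 0, 0, 1, 0, 0)
ipad   <- c(1, 1, 0, 0, 1, 1, 0, 0)

# Cross-tabulate; fixing the factor levels guarantees that both the
# "0" and "1" rows and columns exist even if a value never occurs.
regions <- table(iphone = factor(iphone, levels = 0:1),
                 ipad   = factor(ipad,   levels = 0:1))

both        <- regions["1", "1"]  # overlapping area of the Venn diagram
iphone.only <- regions["1", "0"]  # iPhone circle outside the overlap
ipad.only   <- regions["0", "1"]  # iPad circle outside the overlap

# Overlap divided by the whole colored area; the "0", "0" cell is ignored.
venn.jaccard <- as.numeric(both / (both + iphone.only + ipad.only))
print(venn.jaccard)
```

Respondents who own neither device sit outside both circles and do not affect the coefficient.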

## Data setup

The variables for the Jaccard calculation must be binary, having values of 0 and 1. They may also include missing values; any case with a missing value in either variable of a pair is excluded from the Jaccard coefficient for that pair.
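A minimal sketch of this setup in plain R is shown below, assuming missing responses are stored as `NA`. The vectors `x` and `y` are hypothetical, and dropping incomplete cases with `complete.cases` is one way of applying the pairwise exclusion, not necessarily how Displayr handles it internally.

```r
# Hypothetical data: NA marks a missing response.
x <- c(1, NA, 1, 0, 0)
y <- c(1, 1, 0, NA, 1)

# Check that each variable holds only 0, 1, or NA.
stopifnot(all(x %in% c(0, 1, NA)), all(y %in% c(0, 1, NA)))

# Keep only the cases that are observed on both variables.
keep  <- complete.cases(x, y)
x.obs <- x[keep]
y.obs <- y[keep]

jaccard.xy <- sum(x.obs == 1 & y.obs == 1) / sum(x.obs == 1 | y.obs == 1)
print(jaccard.xy)
```

Cases 2 and 4 are dropped because one of the two values is missing, leaving three cases for this pair.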

In Displayr, this means that your variables must come from a variable set which has a Structure of Numeric, Numeric – Multi, or Multiple categories (Binary – Multi). You can check and change the Structure of a variable set by selecting it under Data Sets in the bottom left, and then looking in the Structure drop-down menu under Properties > INPUTS in the Object Inspector on the right side of the window.

## Doing the calculation using R

To calculate Jaccard coefficients for a set of binary variables, you can use the following:

1. Select Insert > R Output.
2. Paste the code below into the R CODE section on the right.
3. Change line 8 of the code so that input.variables contains the Names of the variables you want to include. A variable's Name can be found by hovering over the variable in the Data Sets pane, or by selecting the variable and looking under Properties > GENERAL > Name.
4. Click Automatic.

The code for the Jaccard coefficients is:

Jaccard = function (x, y) {
    M.11 = sum(x == 1 & y == 1, na.rm = TRUE)  # both variables are 1
    M.10 = sum(x == 1 & y == 0, na.rm = TRUE)  # only x is 1
    M.01 = sum(x == 0 & y == 1, na.rm = TRUE)  # only y is 1
    return (M.11 / (M.11 + M.10 + M.01))
}

input.variables = data.frame(Q6_01, Q6_02, Q6_03, Q6_04, Q6_05, Q6_06, Q6_07, Q6_08, Q6_09)

m = matrix(data = NA, nrow = length(input.variables), ncol = length(input.variables))
for (r in 1:length(input.variables)) {
    for (c in 1:length(input.variables)) {
        if (c == r) {
            m[r,c] = 1
        } else if (c > r) {
            m[r,c] = Jaccard(input.variables[,r], input.variables[,c])
        }
    }
}

variable.names = sapply(input.variables, attr, "label")
colnames(m) = variable.names
rownames(m) = variable.names

jaccards = m


## In this code:

• I have defined a function called Jaccard. The function takes any two variables and calculates the Jaccard coefficient for those two variables. A function is a set of instructions that can be used elsewhere in the code. Particularly for more complicated blocks of code, writing a function like this can make your code more efficient and easier to read and check for mistakes.
• input.variables contains a data frame which has each of the variables you want to analyze as the columns.
• Initially, I created a matrix full of missing values as a place to store my calculations.
• I have used two for loops to go through and calculate the Jaccard coefficients and fill up the top half of the matrix.
• The bottom half of the matrix is left empty. In Displayr, missing values are displayed as empty cells. As the bottom half of the matrix would be identical to the top half, empty cells help us to read the results more easily.
• I have used the sapply function to obtain the label of each variable so that the labels can be displayed as the row labels (rownames) and column labels (colnames) of the table. In this case, sapply uses the attr function to obtain the label attribute of each variable. Because R does not natively store the same metadata for each variable as Displayr does, Displayr adds the metadata to the attributes of the variables so that it can be retrieved later if necessary.
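Outside Displayr, a plain R data frame will usually not carry a label attribute, in which case `attr` returns `NULL`. The sketch below shows one possible fallback to the column name. The data frame `df` and the label "iPhone" are hypothetical and only stand in for the survey variables.

```r
# Hypothetical data frame: the first column has a "label" attribute,
# the second does not.
df <- data.frame(Q6_01 = c(1, 0, 1), Q6_02 = c(0, 0, 1))
attr(df$Q6_01, "label") <- "iPhone"

# Use the label where one exists, otherwise fall back to the column name.
labels <- sapply(names(df), function(nm) {
    lab <- attr(df[[nm]], "label", exact = TRUE)
    if (is.null(lab)) nm else lab
})
print(unname(labels))
```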

The result is a table that contains all of the Jaccard coefficients for each pair of variables.
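For comparison, the same kind of table can be sketched without loops using matrix algebra. This is an alternative to the code above rather than the method used in this post, and it assumes a numeric 0/1 matrix with no missing values. The matrix `X` and its column names are hypothetical.

```r
# Hypothetical toy data: three binary variables, six cases, no missing values.
X <- cbind(a = c(1, 1, 0, 1, 0, 0),
           b = c(1, 0, 0, 1, 1, 0),
           c = c(0, 0, 1, 0, 1, 0))

both   <- crossprod(X)   # both[i, j]: cases where variables i and j are both 1
counts <- diag(both)     # number of 1s in each variable

# |A or B| = |A| + |B| - |A and B|
either <- outer(counts, counts, "+") - both

jaccard.matrix <- both / either
print(round(jaccard.matrix, 2))
```

Unlike the loop version, this fills both halves of the matrix. A variable with no 1s at all would give 0 / 0 on the diagonal, which R reports as `NaN`.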

## Visualize the results

A heatmap is an ideal way to visualize tables of coefficients like this. To create a heatmap for this data in Displayr:

1. Select Insert > Visualization > Heatmap.
2. Under Inputs > DATA SOURCE, click into Output in ‘Pages’ and select the output for the Jaccard coefficients that was created above.
3. Tick Automatic.

You’ll get a result that looks like the following. With the blue default color palette, the largest Jaccard coefficients will be the darkest blue. Looking for dark patches off the diagonal of the table allows you to locate the pairs of products which have the biggest overlap according to the Jaccard index. In this case we see strong overlaps between iPhone, iPod, and iPad owners in the top left, and between Samsung owners and people who own non-Mac computers over to the right.
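The same search for large off-diagonal values can also be done in code. The sketch below ranks the pairs in a small hypothetical matrix that has the same shape as the table produced earlier (top half filled, bottom half missing). The values and the names "A", "B", and "C" are made up for illustration.

```r
# Hypothetical 3 x 3 table of coefficients; matrix() fills column by column,
# so the values land in the top half and the bottom half stays NA.
jaccards <- matrix(c(1,   NA,   NA,
                     0.5, 1,    NA,
                     0.1, 0.25, 1),
                   nrow = 3,
                   dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

# Row and column positions of every cell above the diagonal.
idx <- which(upper.tri(jaccards), arr.ind = TRUE)

pairs <- data.frame(first   = rownames(jaccards)[idx[, "row"]],
                    second  = colnames(jaccards)[idx[, "col"]],
                    jaccard = jaccards[idx])

# Largest overlaps first.
pairs <- pairs[order(pairs$jaccard, decreasing = TRUE), ]
print(pairs)
```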

If you would like to know more about using R, check out the R in Displayr category!
