K is for Cohen’s Kappa


Last April, during the A to Z of Statistics, I blogged about Cohen’s kappa, a measure of interrater reliability. Cohen’s kappa is a way to assess whether two raters or judges are rating something the same way. And thanks to an R package called irr, it’s very easy to compute. But first, let’s talk about why you would use Cohen’s kappa and why it’s superior to a simpler measure of interrater reliability: interrater agreement.

I often do research that requires another person to observe the same thing and make their own ratings, using a codebook or similar method. Meta-analysis, in which information from studies on a topic is coded, frequently requires judgment calls. While some things may be very straightforward to code, such as pulling out a group sample size that is clearly stated, other things are not; the rater may need to make some decisions about the quality of the methods used or exactly what sampling approach was selected, because researchers may use different and/or vague language to describe things. Since the coded data is what ultimately gets analyzed, we need to make sure the coding is done in a way that is systematic and reproducible. Qualitative research, in which the things people say are coded, also requires a codebook that is clear, systematic, and reproducible, and once again, the best way to demonstrate that is to have another person use the same data and codebook and see whether they get the same results.

So you want to make sure the degree to which two coders agree on the coded results is high. A simple way of doing that is to look at interrater agreement: the number of times raters agree divided by the number of things being rated. The problem is that, when raters are working with a codebook with a limited number of categories to choose from, they’re likely to agree to a certain extent just by chance alone. Even a stopped clock is right twice a day, and even untrained raters coding things willy-nilly are going to agree with each other some of the time. In fact, a lot of things we want to happen in research will happen by chance alone. Being a good researcher means making certain that the things that happen in our research are unlikely to be due to chance. Cohen’s kappa corrects for that by taking into account how often the raters would agree if they were simply making decisions at random.
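To make the chance correction concrete, here’s a minimal sketch of the kappa calculation itself, using a hypothetical 2x2 table of counts for two raters and two categories (the irr package does all of this for you; this is just to show where the number comes from):

# hypothetical counts: rows = rater 1's codes, columns = rater 2's codes
tab <- matrix(c(10, 4,
                 3, 8), nrow = 2, byrow = TRUE)
n   <- sum(tab)
p_o <- sum(diag(tab)) / n                      # observed agreement
p_e <- sum(rowSums(tab) * colSums(tab)) / n^2  # agreement expected by chance
(p_o - p_e) / (1 - p_e)                        # Cohen's kappa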

You want to set up your data with each coder getting his/her own column. You can put all coded information in a single file, if you’d like, and simply reference the columns you need for your interrater reliability function. For the demonstration with real data (below), I just created two separate files, one for each variable I’m demonstrating.
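For instance, a file laid out like this (with made-up values and column names) lets you hand just the two rater columns to the irr functions:

# hypothetical layout: one row per rated item, one column per rater
ratings <- data.frame(study_id = 1:5,
                      rater1   = c(1, 2, 2, 3, 1),
                      rater2   = c(1, 2, 3, 3, 1))
# then reference only the rater columns, e.g. ratings[, 2:3]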

But first, let’s demonstrate with some randomly generated data. Pretend that I have two coins, and I’m going to flip each of them, one and then the other, 20 times. We would expect the resulting pairs of 20 coin flips to be the same at least some of the time. We can easily generate these data using the binomial distribution. I’ve assigned a theta (the probability of a certain outcome) of 0.5, to recreate a “fair” coin. Then I used the cbind (column bind) function to put the two sets of flips together into a two-column matrix.

theta <- 0.5   # probability of heads for a fair coin
N <- 20        # number of flips per coin
flips1 <- rbinom(n = N, size = 1, prob = theta)   # 20 flips of coin 1
flips2 <- rbinom(n = N, size = 1, prob = theta)   # 20 flips of coin 2
coins <- cbind(flips1, flips2)                    # one column per "rater"

Now we have a matrix called coins, which contains two columns: the flips for coin 1 and the flips for coin 2. The irr package will measure simple agreement for us.

install.packages("irr")

library(irr)

## Loading required package: lpSolve

agree(coins, tolerance=0)

##  Percentage agreement (Tolerance=0)
## 
##  Subjects = 20 
##    Raters = 2 
##   %-agree = 40

By setting tolerance to 0, I've forced the agree function to require both columns to have the exact same value for it to be considered agreement. If I were assessing agreement on a rating scale, I might want to allow a small margin of error, perhaps 1 point (see the sketch after this paragraph). As you can see, agreement is 40%, very close to what you would expect by chance alone. And this highlights the issue with using percent agreement: we would expect two raters coding something with 2 categories to agree with each other 50% of the time. This is about how much agreement you would see between two raters who are given a codebook with absolutely no training, though if the categories are even slightly well-defined, you'll see higher agreement just by chance. So training is important, but then, so is using a measure of reliability that takes into account the agreement you would see just by chance.
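Here is what that tolerance option might look like on a hypothetical 1-to-5 rating scale (made-up ratings, just to illustrate): with tolerance = 1, values within 1 point of each other count as agreement.

# made-up ratings from two raters on a 1-5 scale
ratings1 <- c(4, 3, 5, 2, 4, 1)
ratings2 <- c(5, 3, 4, 2, 2, 1)
agree(cbind(ratings1, ratings2), tolerance = 1)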

Now let's run Cohen's kappa on these data.

kappa2(coins)

##  Cohen's Kappa for 2 Raters (Weights: unweighted)
## 
##  Subjects = 20 
##    Raters = 2 
##     Kappa = -0.237 
## 
##         z = -1.08 
##   p-value = 0.279

Kappa is negative, showing that the raters are doing worse than chance - very poor interrater reliability. This would tell me - if I weren't using randomly generated data - that the codebook is poorly defined and doing little good for my raters, and/or that I may need to retrain my raters.

Now let's demonstrate interrater agreement and Cohen's kappa using some real data. For my meta-analysis, I had a fellow grad student go through and code studies with me. Since I coded many variables for my meta-analysis, and I want to keep this post as short as possible, I've selected 2 to use for this demonstration - 1 that showed poor agreement/kappa initially and 1 that showed high agreement/kappa. I adopted a consensus approach to coding, meaning that when my fellow coder and I disagreed, we met to discuss and come to a decision on how to deal with the discrepant code. Sometimes we changed the codebook as a result, sometimes one or both of us misunderstood the study (and found a better code after reexamining it together), and sometimes we simply had to compromise. We started this process early, getting together after we'd each coded a few studies solo and continuing to meet after coding 3-4 studies each. If we changed the codebook, we'd have to recode earlier studies, and of course, code all new studies with the updated codebook.

The first variable that showed disagreement surprised me: the number of studies in the article that were eligible for the meta-analysis. But after seeing her coded results, I realized that I had not been clear about how I wanted any subsamples divided up. That discussion led to a better codebook. I've created a tab-delimited file that includes a variable for study ID, then how rater1 and rater2 coded each study on that variable.

numstudies<-read.delim("num_studies.txt", header=TRUE)
agree(numstudies[,2:3], tolerance=0)

##  Percentage agreement (Tolerance=0)
## 
##  Subjects = 62 
##    Raters = 2 
##   %-agree = 79

kappa2(numstudies[,2:3])

##  Cohen's Kappa for 2 Raters (Weights: unweighted)
## 
##  Subjects = 62 
##    Raters = 2 
##     Kappa = 0.521 
## 
##         z = 6.22 
##   p-value = 5.12e-10

Our percent agreement is about 79%, but once you account for chance agreement, our Cohen's kappa is much lower: 0.52. You can see why a discussion, and a better codebook, was the right approach here. On the other hand, we showed much better agreement and a much better Cohen's kappa for a variable assessing the instructions received by the control group:

0 = Nothing
1 = A news article not about any kind of crime
2 = A news article about crime in general, but not the specific case
3 = A news article about the specific case that contained only neutral information

CGinstruct<-read.delim("CG_instruct.txt", header=TRUE)
agree(CGinstruct[,2:3], tolerance=0)

##  Percentage agreement (Tolerance=0)
## 
##  Subjects = 97 
##    Raters = 2 
##   %-agree = 96.9

kappa2(CGinstruct[,2:3])

##  Cohen's Kappa for 2 Raters (Weights: unweighted)
## 
##  Subjects = 97 
##    Raters = 2 
##     Kappa = 0.954 
## 
##         z = 14.5 
##   p-value = 0

As you can see, this showed much better results: 97% agreement and a Cohen's kappa of 0.95.
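Since this control group variable has ordered categories (0 through 3), a weighted kappa, which penalizes a near-miss (say, 2 vs. 3) less than a distant disagreement (0 vs. 3), could also be reported. The kappa2 function accepts a weight argument for this; a sketch, reusing the same columns:

# optional: weighted kappa for ordered categories
kappa2(CGinstruct[,2:3], weight = "squared")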

In a publication, you'd want to provide Cohen's kappa for each variable or, if there are a lot of variables, some summary statistics, such as the range and average. (But a lot of variables might also be a sign that you have too many and should select only the most important ones for analysis. I ended up dropping some variables, not only because of coding results, but because some key information was missing from most studies.) If any of your Cohen's kappa values fall below the 0.8 to 0.9 range, you probably want to consider updating your codebook and/or retraining your coders to make sure everyone is on the same page. You also want to come up with a game plan for how to handle disagreements, at the very least because you need to pick a final value to use in your analysis. I prefer the consensus approach myself, but some people will enlist a third coder as a tie breaker.
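If you do have many coded variables, a short loop can collect the kappa values for that kind of summary. A sketch, assuming a hypothetical data frame called coded in which each variable's two rater columns sit side by side:

# hypothetical layout: column pairs var1_r1, var1_r2, var2_r1, var2_r2, ...
pairs  <- seq(1, ncol(coded), by = 2)
kappas <- sapply(pairs, function(i) kappa2(coded[, c(i, i + 1)])$value)
range(kappas)
mean(kappas)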
