Examining Inter-Rater Reliability in a Reality Baking Show

[This article was first published on R on I Should Be Writing: The Musical, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The Format

In the blind auditions, candidates have 90 minutes to prepare a dessert, confection or other baked good, which is then sent to the show’s four judges for a blind taste test. The judges don’t see the candidate, and know nothing about him or her, until after they deliver their decision – A “pass” decision is signified by the judge granting the candidate a kitchen knife; a “fail” decision is signified by the judge not granting a knife. A candidate who receives a knife from at least three of the judges continues on to the next stage, the training camp.

The 4 judges and host, from left to right: Yossi Shitrit, Erez Komarovsky, Miri Bohadana (host), Assaf Granit, and Moshik Roth

Inter-Rater Reliability

I’ve watched all 4 episodes of the blind auditions (for purely academic reasons!), for a total of 31 candidates. For each dish, I recorded each of the four judges’ verdict (fail / pass).

Let’s load the data.

df <- read.csv('IJR.csv')
head(df)
##   Moshik Assaf Erez Yosi
## 1      1     1    1    1
## 2      1     0    0    0
## 3      1     1    1    0
## 4      1     0    0    0
## 5      0     1    1    1
## 6      1     1    1    1

We will need the following packages:

library(dplyr) # for manipulating the data
library(magrittr) # <3
library(psych) # for computing Choen's Kappa
library(corrr) # for manipulating matricies

We can now use the psych package to compute Cohen’s Kappa coefficient for inter-rater agreement for categorical items, such as the fail/pass categories we have here. \(\kappa\) ranges from -1 (full disagreement) through 0 (no pattern of agreement) to +1 (full agreement). Normally, \(\kappa\) is computed between two raters, and for the case of more than two raters, the mean \(\kappa\) across all pair-wise raters is used.

cohen.kappa(df)
## 
## Cohen Kappa (below the diagonal) and Weighted Kappa (above the diagonal) 
## For confidence intervals and detail print with all=TRUE
##        Moshik Assaf  Erez  Yosi
## Moshik   1.00  0.34  0.17  0.29
## Assaf    0.34  1.00  0.17  0.68
## Erez     0.17  0.17  1.00 -0.16
## Yosi     0.29  0.68 -0.16  1.00
## 
## Average Cohen kappa for all raters  0.25
## Average weighted kappa for all raters  0.25

We can see that overall \(\kappa=0.25\) - surprisingly low for what might be expected from a group of pristine, accomplished, professional chefs and bakers in a blind taste test.

When examining the pair-wise coefficients, we can also see that Erez seems to be in lowest agreement with each of the other judges (and even in a slight disagreement with Yossi!). This might be because Erez is new on the show (this is his first season as judge), but it might also be because of the four judges, he is the only one who is actually a confectioner (the other 3 are restaurant chefs).

For curiosity’s sake, let’s also look at the \(\kappa\) coefficient between each judge’s rating and the total fail/pass decision, based on whether a dish got a “pass” from at least 3 judges.

df <- df %>% 
  mutate(PASS = as.numeric(rowSums(.) >= 3))
head(df)
##   Moshik Assaf Erez Yosi PASS
## 1      1     1    1    1    1
## 2      1     0    0    0    0
## 3      1     1    1    0    1
## 4      1     0    0    0    0
## 5      0     1    1    1    1
## 6      1     1    1    1    1

We can now use the wonderful new corrr package, which is intended for exploring correlations, but can also generally be used to manipulate any symmetric matrix in a tidy-fashion.

cohen.kappa(df)$cohen.kappa %>%
  as_cordf() %>%
  focus(PASS)
## # A tibble: 4 x 2
##   rowname  PASS
##   <chr>   <dbl>
## 1 Moshik  0.619
## 2 Assaf   0.746
## 3 Erez    0.159
## 4 Yosi    0.676

Perhaps unsurprisingly (to people familiar with previous seasons of the show), it seems that Assaf’s judgment of a dish is a good indicator of whether or not a candidate will continue on to the next stage. Also, once again we see that Erez is barely aligned with the other judges’ total decision.

Conclusion

Every man to his taste…

Even among the experts there is little agreement on what is fine cuisine and what is not worth a doggy bag. Having said that, if you still have your heart set on competing in Game of Chefs, it seems that you should at least appeal to Assaf’s palate.

Bon Appetit!

To leave a comment for the author, please follow the link and comment on their blog: R on I Should Be Writing: The Musical.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)