Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.
I am happy to announce the release of ICC Bot, an app for computing
interrater reliability scores using the intraclass correlation coefficient
(ICC).
The ICC statistic appears in repeated measures or multilevel modeling
literature as a way to quantify the similarity (correlation) of data within
measurement units (intra classes). In a multilevel model example, the ICC
might estimate the similarity of test scores within classrooms (as opposed to
between classrooms). For interrater reliability checks, the ICC estimates how
similar raters’ scores are within each participant (hence, between raters).
The ICC ranges from 0 to 1, and it is usually interpreted as the proportion of
variance explained by betweenparticipant differences.
A score of 1 means that all of the variation in the scores is due to differences
between the participants; a score of 0 means that none of the variation is due
to differences between participants: The raters might as well have been scoring
completely different participants.
Using the app
This statistic, as implemented in ICC Bot, only works at the summary score
level (as opposed to the item score level). That is, the data should live in a
table with one column per rater and one row per participant and with the cells being
(total) scores. There should be one table cell per rater–participant pair.
There are a lot of ways to mix and match raters and participants, so there are
many different flavors of ICC for specific rating situations. The ICC Bot
supports six ICC types, so I try to provide detailed notes on the differences
between these options on the Bot Settings and ICC Interpretation tab.
R Users. Because I have (only) 25 monthly hours of free compute time on
ShinyApps.io, I ask that R users who want to try the app run it locally with:
# remotes::install_github("tjmahr/iccbot")
iccbot::run_app()
Everyone else. Because I have 25 monthly hours of free compute time on
ShinyApps.io, you can try the app using https://tristan.shinyapps.io/iccbot/.
The rest of this post discusses the origin of the project and some programming
details.
You can’t peek inside the scores (background)
“I feel like I’ve had an epiphany… It almost feels spiritual. All of the
complexity in the data. It doesn’t matter.”
That was me, reporting to my group how we would handle interrater reliability
for most of our tests. I had spent a few days in a rabbithole of interrater
statistics: agreement percentages, Cohen’s kappa, other [last name]’s [Greek
letter] statistics, “measurement equals true score plus error”, idemnotic
versus vaganotic measurement, etc.
We give lots of speech and language tests in our lab, and we wanted to know how
reliable our scores were for a particular test. The test is a fairly conventional
articulation inventory. Children name pictures, and listeners score whether the
child correctly produced certain target sounds. The words work
through the consonants, consonant clusters, and vowels of English in different
positions. For example, the sound /l/ is tested twice: in ladder
(wordinitially) and in ball (wordfinally). The test weights some sounds are
more than others. The vowel in knife is worth 3 points, the final /f/ .5 points. All told, there are 67 sounds tested and producing all of the sounds
correctly yields an overall score of 100 points.
We had 2 graduate students each score 130 administrations of the test. We are
talking abouts thousands of data points. The possible routes through the data
are numerous To name a few options:
 Overall agreement percentage.
 Overall agreement percentage, weighting by item scores.
 Compute the argreement percentages by child, report the range.
 Weighted version of the above.
 Logistic mixed effects model with varying effects for child, item, and rater.
These statistics all answer interesting and important questions, especially for assessing rating fidelity or calibrating raters, but they are not the right
questions. We are using the total scores from the test as a predictor of other
measures. We want the reliability for those numbers. That was the epiphany.
Report the reliability of the scores that are used in the analyses. It
sounds basic and obvious, but my Bayesian training leads me to think about and
model as much of the datagenerating process as possible, so such a simplifying
assumption felt like a revelation.
Because our field almost always uses summary score level measurements—that is,
one single score on a test per participant—as speech and language measures,
the ICC provides a general purpose statistic for computing interrater
reliability. So I wrote an RMarkdown script to automatically populate a
boilerplate paragraph with ICC statistics. Eventually, we came to report at
least one ICC statistic in each of our papers, so we decided that this task
should more automated and I turned my R script into a Shiny
app.
How much work could it take? (technical details)
Like a lot of programming projects, the app started out simple: just a single
file—a flexdashboard document,
to be precise. And it grew organically, and by organically, I mean each change
was easy and made sense but parts of it will look a little messy.
Initially, the app just translated the user input into arguments for
irr::icc()
. Seriously, the code below shows the original code. input
is
where userinput is stored in Shiny app, req()
checks that inputs exist, so I
literally am translating form options into arguments for a function.
run_icc < function() {
req(input$use_twoway_model)
req(input$single_or_average)
req(input$agreement_or_consistency)
model < ifelse(
input$use_twoway_model == "yes (twoway model)",
"twoway",
"oneway"
)
unit < ifelse(
input$single_or_average == "single rating",
"single",
"average"
)
type < ifelse(
input$agreement_or_consistency == "absolute agreement",
"agreement",
"consistency"
)
add_formatted_results_to_icc(
irr::icc(
getData(),
model = model,
unit = unit,
type = type
)
)
}
Almost all of the work on the app was spent researching and
writing the documentation to help users figure which inputs to use. I wrote an
extensive section on how to choose the app settings and how to interpret ICC
scores, and I checked my app’s results against the numbers in Shrout and
Fleiss (1979), a landmark survey on ICC scores.
But I also had a worry. In the past, I’ve come across situations where raters
and participants are not fully crossed. A typical situation is that there
are a handful of available raters, so they work through the data until all the
participants have two ratings. There is not a score for every combination of
rater–participant; there is missing data.
Consider the following with 2 ratings per participant but 3 judges.
d < iccbot::example_shrout_fleiss()
d[1:2, 3] < NA
d[1:2 + 2, 2] < NA
d[1:2 + 4, 1] < NA
d < d[4]
d
#> # A tibble: 6 x 3
#> Judge1 Judge2 Judge3
#>
#> 1 9 2 NA
#> 2 6 1 NA
#> 3 8 NA 6
#> 4 7 NA 2
#> 5 NA 5 6
#> 6 NA 2 4
irr::icc()
handles this situation by removing any rows with missing data,
so it deletes everything. Look at those NA
s.
irr::icc(d)
#> Warning in qf(1  alpha/2, ns  1, ns * (nr  1)): NaNs produced
#> Warning in qf(1  alpha/2, ns * (nr  1), ns  1): NaNs produced
#> Single Score Intraclass Correlation
#>
#> Model: oneway
#> Type : consistency
#>
#> Subjects = 0
#> Raters = 3
#> ICC(1) = NA
#>
#> FTest, H0: r0 = 0 ; H1: r0 > 0
#> F(1,0) = NA , p = NA
#>
#> 95%Confidence Interval for ICC Population Values:
#> NA < ICC < NA
But we can compute an ICC on this data. If we fit a mixed model with
byparticipant and byrater random effects, we get an estimate of the variances
of the byparticipant and byrater and residual effects.
library(tidyverse)
# Convert to long format
d_long < d %>%
tibble::rowid_to_column("participant") %>%
tidyr::gather(rater, score, participant)
m < lme4::lmer(
score ~ 1 + (1  rater) + (1  participant),
d_long
)
m
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: score ~ 1 + (1  rater) + (1  participant)
#> Data: d_long
#> REML criterion at convergence: 47.4455
#> Random effects:
#> Groups Name Std.Dev.
#> participant (Intercept) 1.3027
#> rater (Intercept) 2.7302
#> Residual 0.9907
#> Number of obs: 12, groups: participant, 6; rater, 3
#> Fixed Effects:
#> (Intercept)
#> 4.833
See the Std.Dev.
column? Compare the magnitude of the Residual
row (1.0) to
the participant
row (1.3) or the rater
(2.7) row. Not much of the variance
is coming from the participant
level.^{1} We can compute the ICC
as the proportion of the variance explained by the participant level directly
from these numbers.
variance_total < sum(c(1.3, 2.7, 1.0) ^ 2)
variance_participant < 1.3 ^ 2
variance_rater < 2.7 ^ 2
variance_participant / variance_total
#> [1] 0.1693387
Indeed, this is what the implementation inside psych::ICC()
does. It uses lme4
when it encounters missing data, so it gets the same answer. (Differences here
are due to rounding.)
psych::ICC(d)[["results"]]["Single_random_raters", ]
#> type ICC F df1 df2 p
#> Single_random_raters ICC2 0.1674744 6.187133 5 10 0.00725818
#> lower bound upper bound
#> Single_random_raters 0.01886028 0.6403929
And this is what ICC Bot does too. If there is no missing data, irr::icc()
runs
as normal. If there is missing data, the variance is computed with lme4 and then
the calculations in irr::icc()
take place. I use the same code as irr::icc()
so that the results get printed just like the results from irr::icc()
.
iccbot::run_icc(d, model = "twoway", type = "agreement")
#> Single Score Intraclass Correlation
#>
#> Model: twoway
#> Type : agreement
#>
#> Subjects = 6
#> Raters = 3
#> ICC(A,1) = 0.167
#>
#> FTest, H0: r0 = 0 ; H1: r0 > 0
#> F(5,3.42) = 6.19 , p = 0.0667
#>
#> 95%Confidence Interval for ICC Population Values:
#> 0.019 < ICC < 0.64
Because the software that calculated changed behind the scenes, the citations
inside the app change so that lme4 can properly credited.
If you do interrater reliability measures, let me know what works, what doesn’t
work, what’s wrong and what’s confusing on the Github
repository. Your feedback will help the app
continue its organic growth.

When they briefly cover the ICC, Gelman and Hill (2007) do
something very cools with the numbers from the lme4 printout. They interpret
the variances in terms of
partial pooling. They
are describing the example of repeated radon measurements in counties. They
have σ_{α (county)} = .33 and
σ_{y (residual)} = .76, so an ICC of .16.One way to interpret the variation between counties, σ_{α},
is to consider the variance ratio,
σ_{α}^{2}/σ_{y}^{2},
which inthis example is estimated at
0.33^{2}/0.76^{2} = 0.19 or about onefifth. Thus, the
standard deviation of the average radon levels
between counties is the same as the standard deviation of the average of
5 measurements within a county. […][…] The amount of information in this distribution is the same as that
in 5 measurements within a county. To put it another way, for a country
with a sample size of less than 5, there is more information in the
grouplevel model than in the county’s data; for a county, with more than
5 observations, the withincountry measurements are more informative
(in the sense of providing a lowervariance estiamte of the county’s
average radon level.) As a result, the multilevel regression line in a
county is closer to the completepooling estimate when sample size is
less than 5 and, closer to the nopooling estimate when the sample size
exceeds 5.
Rbloggers.com offers daily email updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/datascience job.
Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.