[This article was first published on Markov Chain Monte Carlo in Item Response Models, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In this post, we define the Two-Parameter Logistic (2PL) IRT model, derive the complete conditionals that will form the basis of the sampler, and discuss our choice of prior specification.

# 2PL IRT Model specification


Let the probability of person $p$ answering item $i$ be $pi_{pi}$. We map $pi_{pi}$ to a linear scale with a logit link function
begin{equation}
ln{frac{pi_{pi}}{1+pi_{pi}}} = a_i ( theta_p – b_i) label{eq:pipi}
end{equation}
where the three parameters $a_i$, $b_i$, and $theta_p$ are interpreted as item discrimination, item difficulty, and person ability parameters respectively. The observed response of person $p$ to item $i$ is modeled as a flip of a biased coin
begin{aligned}
U_{pi} & stackrel{indep}{sim} text{Bernoulli}(pi_{pi}) quad .
end{aligned}

This model becomes a Bayesian model when we place prior distributions on the item and person parameters. In the chapter and this supplement, we use the following prior specification
begin{aligned}
% U_{pi} & stackrel{indep}{sim} text{Bernoulli}(pi_{pi}) \
% ln frac{pi_{pi}}{1+pi_{pi}} & = a_i(theta_p – b_i) \
theta_p & stackrel{iid}{sim} text{Normal}(0,sigma_theta^2) \
sigma_theta^2 & sim text{Inverse-Gamma}(alpha_theta,beta_theta) \
a_i & stackrel{iid}{sim} text{Log-Normal}(mu_a,sigma^2_a) \
b_i & stackrel{iid}{sim} text{Normal}(0,sigma_b^2)
end{aligned}
where $alpha_theta$, $beta_theta$, $mu_a$, $sigma_a^2$ and $sigma_b^2$ are constants that we choose below in the prior specification section.

# Derivation of the complete conditionals

The idea of a Gibbs sampler is that we can sample from the joint-posterior by iteratively sampling from conditional posteriors. In this section, we derive the conditional posteriors, which are commonly referred to as complete conditionals because they are conditional on all other parameters and data in the model.

We begin with the likelihood function. Since each $U_{pi}$ is an iid Bernoulli, the likelihood function is:

where the bolded quantities $boldsymbol{U}$, $boldsymboltheta$, $boldsymbol a$, and $boldsymbol b$ are vectors or matrices, and $pi_{pi}$ is defined in Equation ref{eq:pipi}. Note that the likelihood is independent of $sigma^2_theta$.

By applying Bayes rule and using the conditional independence relationships in the model, we find the joint posterior is proportional to:
begin{equation}
f(boldsymboltheta, sigma^2_theta, boldsymbol a, boldsymbol b | boldsymbol{U} )
propto f(boldsymbol{U} | boldsymboltheta, boldsymbol a, boldsymbol b ) cdot
f(boldsymboltheta | sigma^2_theta ) cdot
f(sigma^2_theta ) cdot
f(boldsymbol a) cdot
f(boldsymbol b) label{eq:post}
end{equation}

The complete conditionals conditionals are then essentially read off of Equation ref{eq:post} by simply dropping any expression which does not contain the parameter of interest. To illustrate this technique, we derive the complete conditional for $boldsymboltheta$ formally:
begin{aligned}
f(boldsymboltheta | sigma^2_theta, boldsymbol a, boldsymbol b, boldsymbol{U} ) & propto
frac{ f(boldsymboltheta, sigma^2_theta, boldsymbol a, boldsymbol b | boldsymbol{U} ) }
{f(sigma^2_theta, boldsymbol a, boldsymbol b) } \
& propto frac{ f(boldsymbol{U} | boldsymboltheta, boldsymbol a, boldsymbol b ) cdot
f(boldsymboltheta | sigma^2_theta ) cdot
f(sigma^2_theta ) cdot
f(boldsymbol a) cdot
f(boldsymbol b) }{
f(sigma^2_theta ) cdot
f(boldsymbol a) cdot
f(boldsymbol b)
} \
& propto f(boldsymbol{U} | boldsymboltheta, boldsymbol a, boldsymbol b ) cdot f(boldsymboltheta | sigma^2_theta )
end{aligned}
which has the same result as the heuristic.

Repeating this process we find the other complete conditionals:
begin{aligned}
f(theta_p|text{rest}) & propto
prod_{i=1}^I
pi_{pi}^{u_{pi}} (1 – pi_{pi})^{1-u_{pi}}
f_text{Normal}(theta_p| 0,sigma_theta^2) ~, \
f(sigma^2_theta|text{rest}) & propto
left[ prod_{p=1}^P f_text{Normal}(theta_p| 0,sigma_theta^2) right]
f_text{Inverse-Gamma}left(sigma^2_theta left|
alpha_theta,
beta_theta right.
right)
\
& propto
f_text{Inverse-Gamma}left(sigma^2_theta left|
alpha_theta + frac{P}{2},
beta_theta + frac{1}{2} sum_{p=1}^P theta^2_p right.
right) \
f(a_i|text{rest}) & propto
prod_{p=1}^P
pi_{pi}^{u_{pi}} (1 – pi_{pi})^{1-u_{pi}}
f_text{Log-Normal}(a_i| mu_a,sigma_a^2) ~, \
f(b_i|text{rest}) & propto
prod_{p=1}^P
pi_{pi}^{u_{pi}} (1 – pi_{pi})^{1-u_{pi}}
f_text{Normal}(b_i| 0,sigma_b^2) ,
end{aligned}
where $text{rest}$ stands in for the conditioning variables, $pi_{pi}$ is defined in Equation ref{eq:pipi}, and $f_star(dagger|dots)$ represents the density of the random variable named $dagger$, which has a $star$ distribution.

# Details of prior specification

## Person ability

We model the ability of person $p$ as being an iid draw from a normal distribution with a zero mean and an unknown variance
begin{equation*}
theta_i stackrel{iid}{sim} text{Normal}left(0, sigma^2_thetaright) quad .
end{equation*}

Although this normal distribution fulfills the mathematical role of a prior for $boldsymboltheta$, its interpretation in IRT is as the population distribution of person ability. Therefore, we are interested in estimating the value of $sigma^2_theta$ from the data.

One approach to estimating $sigma^2_theta$ is to build a new layer of the model hierarchy so that our sampler will also give us output for $sigma^2_theta$ which we can then use to draw inferences. To do this, we give $sigma^2_theta$ a prior. A computationally convenient prior is the conjugate prior for the variance of a normal distribution, the inverse-gamma distribution:
begin{equation*}
sigma^2_theta sim text{Inverse-gamma}left(alpha_theta, beta_theta right) quad
end{equation*}
where the $alpha_theta$ and $beta_theta$ parameters are the inverse-gamma distribution’s shape and scale parameters respectively. Since the prior is conjugate, the posterior is also inverse-gamma: begin{equation*}
sigma^2_theta left| boldsymboltheta right. sim
text{Inverse-gamma}left(
alpha_theta + frac{P}{2},
beta_theta + frac{1}{2} sum_{p=1}^P theta^2_p
right)
end{equation*}
where $P$ is the total number of persons.

To set values for $alpha_theta$ and $beta_theta$, we consider the effect of those values on the posterior of $sigma^2_theta$. Since the posterior is a well known distribution, we can consult the Wikipedia page to find the formula for its mean. The posterior mean is:
begin{aligned}
Egiven{sigma^2_theta}{boldsymboltheta} %& = frac{beta}{alpha-1} \
& = frac{beta_theta + frac{1}{2} sum_{p=1}^P theta^2_p}{ alpha_theta + frac{P}{2} – 1}
end{aligned}

If $alpha_theta$ and $beta_theta$ are small compared to $P$, then the posterior mean will be driven almost entirely by the data (via the $theta_p$ estimates):
begin{aligned}
sum_{p=1}^P theta^2_p & approx sigma^2_theta P &
& text{and} &
Egiven{sigma^2_theta}{boldsymboltheta} & approx frac{ frac{sigma^2_theta P}{2} }{ frac{P}{2}} = sigma^2_theta
end{aligned}

so that the posterior mean will be approximately unbiased.

If $alpha_theta$ and $beta_theta$ are much larger than $P$, then the posterior mean will be approximately
begin{aligned}
Egiven{sigma^2_theta}{boldsymboltheta} & approx frac{ beta_theta }{ alpha_theta – 1} = E{sigma^2_theta}
end{aligned}
which is the prior mean.

If $alpha_theta$ and $beta_theta$ are of a similar size as $P$, then the posterior mean will be “shrunk” towards the prior mean. This can be useful for stabilizing estimation or incorporating prior information about the value of $sigma^2_theta$ from past studies. For example, if a past study had $n$ subjects and estimated $s$ as the value of $sigma^2_theta$, then setting
begin{aligned}
alpha_theta & = frac{n}{2} &
& text{and} &
beta_theta & = s cdot frac{n}{2}
end{aligned}
will incorporate the prior information on $sigma^2_theta$ into our study in a principled way.

For our purposes, we choose $alpha_theta = beta_theta = 1$. This leads to a “flat” prior for $sigma^2_theta$ so that the posterior mean will be approximately unbiased.

## Item difficulty

We model the difficulty of item $i$ as being an iid draw from a normal distribution with a zero mean and an unknown variance
begin{equation*}
b_i stackrel{iid}{sim} Nleft(0, sigma^2_bright) quad .
end{equation*}

Since items are often regarded as fixed, we are not usually interested in the variance of the item parameters. Therefore we choose a large value of $sigma^2_b$ so that the prior is flat over typical values.

For our purposes, we choose $sigma^2_b = 100$, which is flat because it is nearly two orders of magnitude larger than a typical item difficulty parameter.

## Item discrimination

We model the discrimination of item $i$ as being an iid draw from a log-normal distribution with an unknown mean and unknown variance
begin{equation*}
a_i stackrel{iid}{sim} text{Log-Normal}(mu_a,sigma^2_a) quad.
end{equation*}

Due to the multiplicative of nature of $a_i$ in Equation ref{eq:pipi}, we chose a log-normal distribution for its prior. This is because the log-normal distribution is the limiting distribution for random variables which are multiplied, in the same way that the normal distribution is the limiting distribution for random variables which are added together.

The analogue of a mean zero normal distribution is a mode one log-normal. The mode of a log normal is given by
begin{equation*}
text{mode}left[ a_i right] = e^{mu_a-sigma^2_a} quad,
end{equation*}
which is equal to 1 when $mu_a = sigma^2_a$. The variance of a mode one log normal is given by
begin{aligned}
Var{a_i} & = left( e^{sigma^2_a}-1 right) e^{2mu_a + sigma^2_a} \
& = left( e^{mu_a}-1 right) e^{3mu_a} quad
end{aligned}
which is zero when $mu_a = sigma^2_a = 0$ and can be arbitrarily large as $mu_a = sigma^2_a$ increases.

Due to the multiplicative and highly skewed nature of the log-normal distribution, it is somewhat difficult to gain an intuition for which values of $mu_a = sigma^2_a$ lead to an informative prior, and which values lead to a flat prior. Our approach is to approximately match the spread of the prior on the item difficulty parameter $b_i$, by matching the upper quantile of the log-normal to a truncated version of the prior for $b_i$.

We can find the appropriate values of $mu_a = sigma^2_a$ numerically in R as follows:

require(truncnorm)
## If the above command fails, you must install the
## truncnorm library. To do so, un-comment the install.packages
## line below, run it, and follow the directions it gives.
##
## install.packages('truncnorm')

## We use the quantile function to determine the critical
## value where 95% of the mass of a normal distribution
## truncated at zero (a=0), with a mean of 0 (mean=0) and
## standard deviation of 10 (sd=10) lies.
qtruncnorm(p=.95, mean=0, sd=10, a=0)
# [1] 19.59964

## We used a guess and check method to find the
## corresponding quantile for a mode 1 log-normal distribution
xx = 1.185 ;qlnorm(.95, xx, sqrt(xx))
# [1] 19.6004


As a visual check, below we compare the CDF of a log-normal with $mu_a = sigma^2_a = 1.185$, with the CDF of a truncated version of the prior for $b_i$ (a truncated normal):

Note from the zoomed in CDF that the log-normal puts much less mass on values near zero. This is expected because for a given constant $c$, we want to have values less than $frac{1}{c}$ to be rare in the same way that values greater than $c$ are rare. Further note that the 95% quantile of both distributions are equal.

The code to produce the graph is given below. Note that it requires you to have run the previous R code from this page.

## Set a 2 by 1 grid for the graphs
par(mfrow=c(1,2))

## Adjust the margins of the plot
par(mar=c(4,4,4,2))

## Plot the zoomed in CDF of the truncated normal
curve( ptruncnorm(q=x, a=0, mean=0, sd=sqrt(100)), from=0, to=1,
main="Zoomed in", lty='dotted', lwd=2,
xlab='quantile', ylab='probability')

## Add the CDF of the log-normal to the graph
curve( plnorm(q=x, mean=1.185, sd=sqrt(1.185)), from=0, to=1,
add=TRUE, lwd=2, col='blue')

## Draw an arrow approximately where the two CDFs cross
arrows( x0=0.325, y0=.038, x1=0.45,  y1=.038,
length=1/8)

## Draw the legend
legend( "topleft", c('truncated b_i prior', 'log-normal'),
col=c('black','blue'), lwd=2, lty=c('dotted', 'solid'))

## Draw the zoomed out truncated normal CDF in the second panel
## N.B. It draws in the second panel because add=TRUE is missing.
curve( ptruncnorm(q=x, a=0, mean=0, sd=sqrt(100)), from=0, to=60,
main="Zoomed out", lwd=2, lty='dotted',
xlab='quantile', ylab='probability')

## Add the the CDF of the log-normal to the graph
curve( plnorm(q=x, mean=1.185, sd=sqrt(1.185)), from=0, to=60,
add=TRUE, lwd=2, col='blue')

## Draw an arrow approximately where the two CDFs cross
arrows( x0=10, y0=.95, x1=16.6,  y1=.95,
length=1/8)

## Reset to a 1 by 1 grid for any future graphs
par(mfrow=c(1,1))


For our purposes, we choose choose $mu_a = sigma^2_a = 1.185$ for the prior of the item discrimination parameter because those values approximately match it to the prior for the item difficulty parameter.

To leave a comment for the author, please follow the link and comment on their blog: Markov Chain Monte Carlo in Item Response Models.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

# Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts.(You will not see this message again.)

Click here to close (This popup will not appear again)