Post 1: A Bayesian 2PL IRT model

October 4, 2013
By

(This article was first published on Markov Chain Monte Carlo in Item Response Models, and kindly contributed to R-bloggers)

In this post, we define the Two-Parameter Logistic (2PL) IRT model, derive the complete conditionals that will form the basis of the sampler, and discuss our choice of prior specification.

2PL IRT Model specification


newcommand{E}[1]{mathbb{E}left[#1right]}      
newcommand{Egiven}[2]{mathbb{E}left[ left. #1 right| #2right]}      
newcommand{Var}[1]{text{Var}left[#1right]}      
newcommand{Vargiven}[2]{text{Var}left[ left. #1 right| #2right]}

Let the probability of person p answering item i be pi_{pi}. We map pi_{pi} to a linear scale with a logit link function
begin{equation}
ln{frac{pi_{pi}}{1+pi_{pi}}} = a_i ( theta_p – b_i) label{eq:pipi}
end{equation}
where the three parameters a_i, b_i, and theta_p are interpreted as item discrimination, item difficulty, and person ability parameters respectively. The observed response of person p to item i is modeled as a flip of a biased coin
begin{aligned}
U_{pi} & stackrel{indep}{sim} text{Bernoulli}(pi_{pi}) quad .
end{aligned}

This model becomes a Bayesian model when we place prior distributions on the item and person parameters. In the chapter and this supplement, we use the following prior specification
begin{aligned}
% U_{pi} & stackrel{indep}{sim} text{Bernoulli}(pi_{pi}) \
% ln frac{pi_{pi}}{1+pi_{pi}} & = a_i(theta_p – b_i) \
theta_p & stackrel{iid}{sim} text{Normal}(0,sigma_theta^2) \
sigma_theta^2 & sim text{Inverse-Gamma}(alpha_theta,beta_theta) \
a_i & stackrel{iid}{sim} text{Log-Normal}(mu_a,sigma^2_a) \
b_i & stackrel{iid}{sim} text{Normal}(0,sigma_b^2)
end{aligned}
where alpha_theta, beta_theta, mu_a, sigma_a^2 and sigma_b^2 are constants that we choose below in the prior specification section.

Derivation of the complete conditionals

The idea of a Gibbs sampler is that we can sample from the joint-posterior by iteratively sampling from conditional posteriors. In this section, we derive the conditional posteriors, which are commonly referred to as complete conditionals because they are conditional on all other parameters and data in the model.

We begin with the likelihood function. Since each U_{pi} is an iid Bernoulli, the likelihood function is:

 
f(boldsymbol{U} | boldsymboltheta, boldsymbol a, boldsymbol b ) = prod_i prod_p (pi_{pi})^{U_{pi}}(1-pi_{pi})^{1-U_{pi}}

where the bolded quantities boldsymbol{U}, boldsymboltheta, boldsymbol a, and boldsymbol b are vectors or matrices, and pi_{pi} is defined in Equation ref{eq:pipi}. Note that the likelihood is independent of sigma^2_theta.

By applying Bayes rule and using the conditional independence relationships in the model, we find the joint posterior is proportional to:
begin{equation}
f(boldsymboltheta, sigma^2_theta, boldsymbol a, boldsymbol b | boldsymbol{U} )
propto f(boldsymbol{U} | boldsymboltheta, boldsymbol a, boldsymbol b ) cdot
f(boldsymboltheta | sigma^2_theta ) cdot
f(sigma^2_theta ) cdot
f(boldsymbol a) cdot
f(boldsymbol b) label{eq:post}
end{equation}

The complete conditionals conditionals are then essentially read off of Equation ref{eq:post} by simply dropping any expression which does not contain the parameter of interest. To illustrate this technique, we derive the complete conditional for boldsymboltheta formally:
begin{aligned}
f(boldsymboltheta | sigma^2_theta, boldsymbol a, boldsymbol b, boldsymbol{U} ) & propto
frac{ f(boldsymboltheta, sigma^2_theta, boldsymbol a, boldsymbol b | boldsymbol{U} ) }
{f(sigma^2_theta, boldsymbol a, boldsymbol b) } \
& propto frac{ f(boldsymbol{U} | boldsymboltheta, boldsymbol a, boldsymbol b ) cdot
f(boldsymboltheta | sigma^2_theta ) cdot
f(sigma^2_theta ) cdot
f(boldsymbol a) cdot
f(boldsymbol b) }{
f(sigma^2_theta ) cdot
f(boldsymbol a) cdot
f(boldsymbol b)
} \
& propto f(boldsymbol{U} | boldsymboltheta, boldsymbol a, boldsymbol b ) cdot f(boldsymboltheta | sigma^2_theta )
end{aligned}
which has the same result as the heuristic.

Repeating this process we find the other complete conditionals:
begin{aligned}
f(theta_p|text{rest}) & propto
prod_{i=1}^I
pi_{pi}^{u_{pi}} (1 – pi_{pi})^{1-u_{pi}}
f_text{Normal}(theta_p| 0,sigma_theta^2) ~, \
f(sigma^2_theta|text{rest}) & propto
left[ prod_{p=1}^P f_text{Normal}(theta_p| 0,sigma_theta^2) right]
f_text{Inverse-Gamma}left(sigma^2_theta left|
alpha_theta,
beta_theta right.
right)
\
& propto
f_text{Inverse-Gamma}left(sigma^2_theta left|
alpha_theta + frac{P}{2},
beta_theta + frac{1}{2} sum_{p=1}^P theta^2_p right.
right) \
f(a_i|text{rest}) & propto
prod_{p=1}^P
pi_{pi}^{u_{pi}} (1 – pi_{pi})^{1-u_{pi}}
f_text{Log-Normal}(a_i| mu_a,sigma_a^2) ~, \
f(b_i|text{rest}) & propto
prod_{p=1}^P
pi_{pi}^{u_{pi}} (1 – pi_{pi})^{1-u_{pi}}
f_text{Normal}(b_i| 0,sigma_b^2) ,
end{aligned}
where text{rest} stands in for the conditioning variables, pi_{pi} is defined in Equation ref{eq:pipi}, and f_star(dagger|dots) represents the density of the random variable named dagger, which has a star distribution.

Details of prior specification

Person ability

We model the ability of person p as being an iid draw from a normal distribution with a zero mean and an unknown variance
begin{equation*}
theta_i stackrel{iid}{sim} text{Normal}left(0, sigma^2_thetaright) quad .
end{equation*}

Although this normal distribution fulfills the mathematical role of a prior for boldsymboltheta, its interpretation in IRT is as the population distribution of person ability. Therefore, we are interested in estimating the value of sigma^2_theta from the data.

One approach to estimating sigma^2_theta is to build a new layer of the model hierarchy so that our sampler will also give us output for sigma^2_theta which we can then use to draw inferences. To do this, we give sigma^2_theta a prior. A computationally convenient prior is the conjugate prior for the variance of a normal distribution, the inverse-gamma distribution:
begin{equation*}
sigma^2_theta sim text{Inverse-gamma}left(alpha_theta, beta_theta right) quad
end{equation*}
where the alpha_theta and beta_theta parameters are the inverse-gamma distribution’s shape and scale parameters respectively. Since the prior is conjugate, the posterior is also inverse-gamma: begin{equation*}
sigma^2_theta left| boldsymboltheta right. sim
text{Inverse-gamma}left(
alpha_theta + frac{P}{2},
beta_theta + frac{1}{2} sum_{p=1}^P theta^2_p
right)
end{equation*}
where P is the total number of persons.

To set values for alpha_theta and beta_theta, we consider the effect of those values on the posterior of sigma^2_theta. Since the posterior is a well known distribution, we can consult the Wikipedia page to find the formula for its mean. The posterior mean is:
begin{aligned}
Egiven{sigma^2_theta}{boldsymboltheta} %& = frac{beta}{alpha-1} \
& = frac{beta_theta + frac{1}{2} sum_{p=1}^P theta^2_p}{ alpha_theta + frac{P}{2} – 1}
end{aligned}

If alpha_theta and beta_theta are small compared to P, then the posterior mean will be driven almost entirely by the data (via the theta_p estimates):
begin{aligned}
sum_{p=1}^P theta^2_p & approx sigma^2_theta P &
& text{and} &
Egiven{sigma^2_theta}{boldsymboltheta} & approx frac{ frac{sigma^2_theta P}{2} }{ frac{P}{2}} = sigma^2_theta
end{aligned}

so that the posterior mean will be approximately unbiased.

If alpha_theta and beta_theta are much larger than P, then the posterior mean will be approximately
begin{aligned}
Egiven{sigma^2_theta}{boldsymboltheta} & approx frac{ beta_theta }{ alpha_theta – 1} = E{sigma^2_theta}
end{aligned}
which is the prior mean.

If alpha_theta and beta_theta are of a similar size as P, then the posterior mean will be “shrunk” towards the prior mean. This can be useful for stabilizing estimation or incorporating prior information about the value of sigma^2_theta from past studies. For example, if a past study had n subjects and estimated s as the value of sigma^2_theta, then setting
begin{aligned}
alpha_theta & = frac{n}{2} &
& text{and} &
beta_theta & = s cdot frac{n}{2}
end{aligned}
will incorporate the prior information on sigma^2_theta into our study in a principled way.

For our purposes, we choose alpha_theta = beta_theta = 1. This leads to a “flat” prior for sigma^2_theta so that the posterior mean will be approximately unbiased.

Item difficulty

We model the difficulty of item i as being an iid draw from a normal distribution with a zero mean and an unknown variance
begin{equation*}
b_i stackrel{iid}{sim} Nleft(0, sigma^2_bright) quad .
end{equation*}

Since items are often regarded as fixed, we are not usually interested in the variance of the item parameters. Therefore we choose a large value of sigma^2_b so that the prior is flat over typical values.

For our purposes, we choose sigma^2_b = 100, which is flat because it is nearly two orders of magnitude larger than a typical item difficulty parameter.

Item discrimination

We model the discrimination of item i as being an iid draw from a log-normal distribution with an unknown mean and unknown variance
begin{equation*}
a_i stackrel{iid}{sim} text{Log-Normal}(mu_a,sigma^2_a) quad.
end{equation*}

Due to the multiplicative of nature of a_i in Equation ref{eq:pipi}, we chose a log-normal distribution for its prior. This is because the log-normal distribution is the limiting distribution for random variables which are multiplied, in the same way that the normal distribution is the limiting distribution for random variables which are added together.

The analogue of a mean zero normal distribution is a mode one log-normal. The mode of a log normal is given by
begin{equation*}
text{mode}left[ a_i right] = e^{mu_a-sigma^2_a} quad,
end{equation*}
which is equal to 1 when mu_a = sigma^2_a. The variance of a mode one log normal is given by
begin{aligned}
Var{a_i} & = left( e^{sigma^2_a}-1 right) e^{2mu_a + sigma^2_a} \
& = left( e^{mu_a}-1 right) e^{3mu_a} quad
end{aligned}
which is zero when mu_a = sigma^2_a = 0 and can be arbitrarily large as mu_a = sigma^2_a increases.

Due to the multiplicative and highly skewed nature of the log-normal distribution, it is somewhat difficult to gain an intuition for which values of mu_a = sigma^2_a lead to an informative prior, and which values lead to a flat prior. Our approach is to approximately match the spread of the prior on the item difficulty parameter b_i, by matching the upper quantile of the log-normal to a truncated version of the prior for b_i.

We can find the appropriate values of mu_a = sigma^2_a numerically in R as follows:

require(truncnorm)
## If the above command fails, you must install the 
## truncnorm library. To do so, un-comment the install.packages 
## line below, run it, and follow the directions it gives.
##
## install.packages('truncnorm')
 
## We use the quantile function to determine the critical 
## value where 95% of the mass of a normal distribution 
## truncated at zero (a=0), with a mean of 0 (mean=0) and 
## standard deviation of 10 (sd=10) lies. 
qtruncnorm(p=.95, mean=0, sd=10, a=0) 
# [1] 19.59964
 
## We used a guess and check method to find the
## corresponding quantile for a mode 1 log-normal distribution
xx = 1.185 ;qlnorm(.95, xx, sqrt(xx))
# [1] 19.6004

As a visual check, below we compare the CDF of a log-normal with mu_a = sigma^2_a = 1.185, with the CDF of a truncated version of the prior for b_i (a truncated normal):
a.prior

Note from the zoomed in CDF that the log-normal puts much less mass on values near zero. This is expected because for a given constant c, we want to have values less than frac{1}{c} to be rare in the same way that values greater than c are rare. Further note that the 95% quantile of both distributions are equal.

The code to produce the graph is given below. Note that it requires you to have run the previous R code from this page.

## Set a 2 by 1 grid for the graphs
par(mfrow=c(1,2))

## Adjust the margins of the plot 
par(mar=c(4,4,4,2))

## Plot the zoomed in CDF of the truncated normal
curve( ptruncnorm(q=x, a=0, mean=0, sd=sqrt(100)), from=0, to=1,
       main="Zoomed in", lty='dotted', lwd=2,
       xlab='quantile', ylab='probability')

## Add the CDF of the log-normal to the graph
curve( plnorm(q=x, mean=1.185, sd=sqrt(1.185)), from=0, to=1,
       add=TRUE, lwd=2, col='blue')

## Draw an arrow approximately where the two CDFs cross
arrows( x0=0.325, y0=.038, x1=0.45,  y1=.038,
        length=1/8)

## Draw the legend
legend( "topleft", c('truncated b_i prior', 'log-normal'),
        col=c('black','blue'), lwd=2, lty=c('dotted', 'solid'))

## Draw the zoomed out truncated normal CDF in the second panel
## N.B. It draws in the second panel because add=TRUE is missing.
curve( ptruncnorm(q=x, a=0, mean=0, sd=sqrt(100)), from=0, to=60,
             main="Zoomed out", lwd=2, lty='dotted',
        xlab='quantile', ylab='probability')

## Add the the CDF of the log-normal to the graph
curve( plnorm(q=x, mean=1.185, sd=sqrt(1.185)), from=0, to=60,
       add=TRUE, lwd=2, col='blue')

## Draw an arrow approximately where the two CDFs cross
arrows( x0=10, y0=.95, x1=16.6,  y1=.95,
        length=1/8)

## Reset to a 1 by 1 grid for any future graphs
par(mfrow=c(1,1))

For our purposes, we choose choose mu_a = sigma^2_a = 1.185 for the prior of the item discrimination parameter because those values approximately match it to the prior for the item difficulty parameter.

To leave a comment for the author, please follow the link and comment on their blog: Markov Chain Monte Carlo in Item Response Models.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers


Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)