
Post 1: A Bayesian 2PL IRT model


In this post, we define the Two-Parameter Logistic (2PL) IRT model, derive the complete conditionals that will form the basis of the sampler, and discuss our choice of prior specification.

2PL IRT Model specification

Let $\pi_{pi}$ be the probability that person $p$ answers item $i$ correctly. We map $\pi_{pi}$ to a linear scale with a logit link function \begin{equation} \ln{\frac{\pi_{pi}}{1-\pi_{pi}}} = a_i ( \theta_p - b_i) \label{eq:pipi} \end{equation} where the three parameters $a_i$, $b_i$, and $\theta_p$ are interpreted as item discrimination, item difficulty, and person ability parameters respectively. The observed response $U_{pi}$ of person $p$ to item $i$ is modeled as a flip of a biased coin \begin{aligned} U_{pi} & \stackrel{indep}{\sim} \text{Bernoulli}(\pi_{pi}) \quad . \end{aligned}
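
To make the notation concrete, here is a small simulation sketch (ours, not part of the original post; all parameter values are arbitrary placeholders) that generates responses from the 2PL model for a few persons and items:

## Simulate 2PL responses for P persons and I items.
## All values below are illustrative placeholders.
set.seed(123)
P <- 5; I <- 3
theta <- rnorm(P)                  # person abilities
a     <- c(1.0, 1.5, 0.8)          # item discriminations
b     <- c(-0.5, 0.0, 0.5)         # item difficulties

## P x I matrix of success probabilities via the inverse logit
## of a_i * (theta_p - b_i)
pi.mat <- plogis(sweep(outer(theta, b, "-"), 2, a, "*"))

## Bernoulli responses U_{pi}
U <- matrix(rbinom(P * I, size = 1, prob = pi.mat), nrow = P)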

This model becomes a Bayesian model when we place prior distributions on the item and person parameters. In the chapter and this supplement, we use the following prior specification \begin{aligned} \theta_p & \stackrel{iid}{\sim} \text{Normal}(0,\sigma_\theta^2) \\ \sigma_\theta^2 & \sim \text{Inverse-Gamma}(\alpha_\theta,\beta_\theta) \\ a_i & \stackrel{iid}{\sim} \text{Log-Normal}(\mu_a,\sigma^2_a) \\ b_i & \stackrel{iid}{\sim} \text{Normal}(0,\sigma_b^2) \end{aligned} where $\alpha_\theta$, $\beta_\theta$, $\mu_a$, $\sigma^2_a$, and $\sigma^2_b$ are constants that we choose below in the prior specification section.
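
A minimal sketch (ours, not from the original post; the hyperparameter values are illustrative placeholders, with the actual choices discussed below) of drawing one set of parameters from this prior specification:

## Illustrative hyperparameter values
alpha.theta <- 1; beta.theta <- 1
mu.a <- 1; sigma2.a <- 1
sigma2.b <- 100

P <- 5; I <- 3

## Draw from the priors, top of the hierarchy first.
## If X ~ Gamma(shape, rate) then 1/X ~ Inverse-Gamma(shape, scale).
sigma2.theta <- 1 / rgamma(1, shape = alpha.theta, rate = beta.theta)
theta <- rnorm(P, mean = 0, sd = sqrt(sigma2.theta))
a <- rlnorm(I, meanlog = mu.a, sdlog = sqrt(sigma2.a))
b <- rnorm(I, mean = 0, sd = sqrt(sigma2.b))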

Derivation of the complete conditionals

The idea of a Gibbs sampler is that we can sample from the joint posterior by iteratively sampling from the conditional posterior of each parameter. In this section, we derive these conditional posteriors, which are commonly referred to as complete conditionals because they are conditional on all other parameters and data in the model.

We begin with the likelihood function. Since each $U_{pi}$ is an independent Bernoulli random variable, the likelihood function is: \begin{equation*} f(\boldsymbol{U} | \boldsymbol\theta, \boldsymbol a, \boldsymbol b ) = \prod_{p=1}^P \prod_{i=1}^I \pi_{pi}^{u_{pi}} (1 - \pi_{pi})^{1-u_{pi}} \end{equation*}

where the bolded quantities $\boldsymbol{U}$, $\boldsymbol\theta$, $\boldsymbol a$, and $\boldsymbol b$ are vectors or matrices, and $\pi_{pi}$ is defined in Equation \ref{eq:pipi}. Note that the likelihood is independent of $\sigma^2_\theta$.
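
For concreteness, the following sketch (ours, not from the original post; the function and argument names are illustrative) evaluates this log-likelihood in R, with persons as rows and items as columns of the response matrix U:

## Log-likelihood of the 2PL model.
## U: P x I matrix of 0/1 responses; theta: length-P vector;
## a, b: length-I vectors of item parameters.
loglik.2pl <- function(U, theta, a, b) {
    ## eta[p, i] = a_i * (theta_p - b_i)
    eta <- sweep(outer(theta, b, "-"), 2, a, "*")
    sum(dbinom(U, size = 1, prob = plogis(eta), log = TRUE))
}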

By applying Bayes rule and using the conditional independence relationships in the model, we find the joint posterior is proportional to: \begin{equation} f(\boldsymbol\theta, \sigma^2_\theta, \boldsymbol a, \boldsymbol b | \boldsymbol{U} ) \propto f(\boldsymbol{U} | \boldsymbol\theta, \boldsymbol a, \boldsymbol b ) \cdot f(\boldsymbol\theta | \sigma^2_\theta ) \cdot f(\sigma^2_\theta ) \cdot f(\boldsymbol a) \cdot f(\boldsymbol b) \label{eq:post} \end{equation}

The complete conditionals are then essentially read off of Equation \ref{eq:post} by simply dropping any expression which does not contain the parameter of interest. To illustrate this technique, we derive the complete conditional for $\boldsymbol\theta$ formally: \begin{aligned} f(\boldsymbol\theta | \sigma^2_\theta, \boldsymbol a, \boldsymbol b, \boldsymbol{U} ) & \propto \frac{ f(\boldsymbol\theta, \sigma^2_\theta, \boldsymbol a, \boldsymbol b | \boldsymbol{U} ) }{ f(\sigma^2_\theta, \boldsymbol a, \boldsymbol b) } \\ & \propto \frac{ f(\boldsymbol{U} | \boldsymbol\theta, \boldsymbol a, \boldsymbol b ) \cdot f(\boldsymbol\theta | \sigma^2_\theta ) \cdot f(\sigma^2_\theta ) \cdot f(\boldsymbol a) \cdot f(\boldsymbol b) }{ f(\sigma^2_\theta ) \cdot f(\boldsymbol a) \cdot f(\boldsymbol b) } \\ & \propto f(\boldsymbol{U} | \boldsymbol\theta, \boldsymbol a, \boldsymbol b ) \cdot f(\boldsymbol\theta | \sigma^2_\theta ) \end{aligned} which gives the same result as the heuristic.

Repeating this process we find the other complete conditionals: \begin{aligned} f(\theta_p|\text{rest}) & \propto \prod_{i=1}^I \pi_{pi}^{u_{pi}} (1 - \pi_{pi})^{1-u_{pi}} \, f_\text{Normal}(\theta_p| 0,\sigma_\theta^2) ~, \\ f(\sigma^2_\theta|\text{rest}) & \propto \left[ \prod_{p=1}^P f_\text{Normal}(\theta_p| 0,\sigma_\theta^2) \right] f_\text{Inverse-Gamma}\left(\sigma^2_\theta \left| \alpha_\theta, \beta_\theta \right. \right) \\ & \propto f_\text{Inverse-Gamma}\left(\sigma^2_\theta \left| \alpha_\theta + \frac{P}{2}, \beta_\theta + \frac{1}{2} \sum_{p=1}^P \theta^2_p \right. \right) \\ f(a_i|\text{rest}) & \propto \prod_{p=1}^P \pi_{pi}^{u_{pi}} (1 - \pi_{pi})^{1-u_{pi}} \, f_\text{Log-Normal}(a_i| \mu_a,\sigma_a^2) ~, \\ f(b_i|\text{rest}) & \propto \prod_{p=1}^P \pi_{pi}^{u_{pi}} (1 - \pi_{pi})^{1-u_{pi}} \, f_\text{Normal}(b_i| 0,\sigma_b^2) , \end{aligned} where "rest" stands in for the conditioning parameters and data, $\pi_{pi}$ is defined in Equation \ref{eq:pipi}, and $f_\text{Name}(x | \cdot)$ represents the density of the random variable $x$, which has a Name distribution.
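
Only the complete conditional for $\sigma^2_\theta$ is a standard distribution; the conditionals for $\theta_p$, $a_i$, and $b_i$ are known only up to a normalizing constant, so a sampler would typically evaluate them on the log scale (for example, inside a Metropolis-Hastings step). As a rough illustration, here is a sketch (ours, not from the original post; the function and argument names are made up) of evaluating the log complete conditional of a single $\theta_p$:

## Log complete conditional of theta_p, up to an additive constant.
## u.p: person p's 0/1 responses across the I items;
## a, b: item parameter vectors; sigma2.theta: current variance draw.
log.cc.theta <- function(theta.p, u.p, a, b, sigma2.theta) {
    ## Item response probabilities from the logit link
    pi.p <- plogis(a * (theta.p - b))
    ## Bernoulli log-likelihood plus the Normal(0, sigma2.theta) log-prior
    sum(dbinom(u.p, size = 1, prob = pi.p, log = TRUE)) +
        dnorm(theta.p, mean = 0, sd = sqrt(sigma2.theta), log = TRUE)
}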

Details of prior specification

Person ability

We model the ability of person $p$ as an iid draw from a normal distribution with a zero mean and an unknown variance \begin{equation*} \theta_p \stackrel{iid}{\sim} \text{Normal}\left(0, \sigma^2_\theta\right) \quad . \end{equation*}

Although this normal distribution fulfills the mathematical role of a prior for $\theta_p$, its interpretation in IRT is as the population distribution of person ability. Therefore, we are interested in estimating the value of $\sigma^2_\theta$ from the data.

One approach to estimating $\sigma^2_\theta$ is to build a new layer of the model hierarchy so that our sampler will also give us output for $\sigma^2_\theta$, which we can then use to draw inferences. To do this, we give $\sigma^2_\theta$ a prior. A computationally convenient prior is the conjugate prior for the variance of a normal distribution, the inverse-gamma distribution: \begin{equation*} \sigma^2_\theta \sim \text{Inverse-Gamma}\left(\alpha_\theta, \beta_\theta \right) \quad \end{equation*} where the $\alpha_\theta$ and $\beta_\theta$ parameters are the inverse-gamma distribution's shape and scale parameters respectively. Since the prior is conjugate, the posterior is also inverse-gamma: \begin{equation*} \sigma^2_\theta \left| \boldsymbol\theta \right. \sim \text{Inverse-Gamma}\left( \alpha_\theta + \frac{P}{2}, \beta_\theta + \frac{1}{2} \sum_{p=1}^P \theta^2_p \right) \end{equation*} where $P$ is the total number of persons.
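
Because this update is conjugate, a Gibbs step for $\sigma^2_\theta$ can sample directly from the inverse-gamma posterior. A minimal sketch (ours, not from the original post; names are illustrative), using the fact that if $X \sim \text{Gamma}(\alpha, \text{rate} = \beta)$ then $1/X \sim \text{Inverse-Gamma}(\alpha, \beta)$:

## Draw sigma^2_theta from its complete conditional, given the current
## person abilities 'theta' and the inverse-gamma hyperparameters
## (shape alpha.theta, scale beta.theta).
draw.sigma2.theta <- function(theta, alpha.theta, beta.theta) {
    P <- length(theta)
    1 / rgamma(1,
               shape = alpha.theta + P / 2,
               rate  = beta.theta + sum(theta^2) / 2)
}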

To set values for $\alpha_\theta$ and $\beta_\theta$, we consider the effect of those values on the posterior of $\sigma^2_\theta$. Since the posterior is a well known distribution, we can consult the Wikipedia page to find the formula for its mean. The posterior mean is: \begin{aligned} E\left[\sigma^2_\theta \,\middle|\, \boldsymbol\theta\right] & = \frac{\beta_\theta + \frac{1}{2} \sum_{p=1}^P \theta^2_p}{ \alpha_\theta + \frac{P}{2} - 1} \end{aligned}

If $\alpha_\theta$ and $\beta_\theta$ are small compared to $P$ and $\frac{1}{2}\sum_{p=1}^P \theta^2_p$, then the posterior mean will be driven almost entirely by the data (via the $\theta_p$ estimates): \begin{aligned} \sum_{p=1}^P \theta^2_p & \approx \sigma^2_\theta P & & \text{and} & E\left[\sigma^2_\theta \,\middle|\, \boldsymbol\theta\right] & \approx \frac{ \frac{\sigma^2_\theta P}{2} }{ \frac{P}{2}} = \sigma^2_\theta \end{aligned}

so that the posterior mean will be approximately unbiased.

If $\alpha_\theta$ and $\beta_\theta$ are much larger than $P$ and $\frac{1}{2}\sum_{p=1}^P \theta^2_p$, then the posterior mean will be approximately \begin{aligned} E\left[\sigma^2_\theta \,\middle|\, \boldsymbol\theta\right] & \approx \frac{ \beta_\theta }{ \alpha_\theta - 1} = E\left[\sigma^2_\theta\right] \end{aligned} which is the prior mean.

If $\alpha_\theta$ and $\beta_\theta$ are of a similar size as $P$ and $\frac{1}{2}\sum_{p=1}^P \theta^2_p$, then the posterior mean will be "shrunk" towards the prior mean. This can be useful for stabilizing estimation or for incorporating prior information about the value of $\sigma^2_\theta$ from past studies. For example, if a past study had $n$ subjects and estimated $s$ as the value of $\sigma^2_\theta$, then setting \begin{aligned} \alpha_\theta & = \frac{n}{2} & & \text{and} & \beta_\theta & = s \cdot \frac{n}{2} \end{aligned} will incorporate the prior information on $\sigma^2_\theta$ into our study in a principled way.
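
To make the shrinkage concrete, here is a small numerical illustration (ours, not from the original post; all values are hypothetical) of the posterior-mean formula under a nearly flat prior and under a prior that encodes a past study:

## Posterior mean of sigma^2_theta given abilities 'theta' and
## inverse-gamma hyperparameters (shape alpha, scale beta).
post.mean.sigma2 <- function(theta, alpha, beta) {
    (beta + sum(theta^2) / 2) / (alpha + length(theta) / 2 - 1)
}

## Hypothetical current data: P = 50 abilities with true variance 1
set.seed(1)
theta <- rnorm(50, mean = 0, sd = 1)

## Nearly flat prior: the posterior mean is driven by the data
post.mean.sigma2(theta, alpha = 0.01, beta = 0.01)

## Prior encoding a past study with n = 100 subjects and estimate s = 1.5:
## the posterior mean is pulled towards 1.5
post.mean.sigma2(theta, alpha = 100 / 2, beta = 1.5 * 100 / 2)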

For our purposes, we choose values of $\alpha_\theta$ and $\beta_\theta$ that are small relative to $P$. This leads to a "flat" prior for $\sigma^2_\theta$ so that the posterior mean will be approximately unbiased.

Item difficulty

We model the difficulty of item $i$ as an iid draw from a normal distribution with a zero mean and a fixed variance \begin{equation*} b_i \stackrel{iid}{\sim} \text{Normal}\left(0, \sigma^2_b\right) \quad . \end{equation*} Since items are often regarded as fixed, we are not usually interested in the variance of the item parameters. Therefore we choose a large value of $\sigma^2_b$ so that the prior is flat over typical values of $b_i$.

For our purposes, we choose $\sigma^2_b = 100$, which is flat because it is nearly two orders of magnitude larger than a typical item difficulty parameter.

Item discrimination

We model the discrimination of item $i$ as an iid draw from a log-normal distribution with parameters $\mu_a$ and $\sigma^2_a$ \begin{equation*} a_i \stackrel{iid}{\sim} \text{Log-Normal}(\mu_a,\sigma^2_a) \quad. \end{equation*} Due to the multiplicative nature of $a_i$ in Equation \ref{eq:pipi}, we chose a log-normal distribution for its prior. This is because the log-normal distribution is the limiting distribution for random variables which are multiplied together, in the same way that the normal distribution is the limiting distribution for random variables which are added together.

The analogue of a mean zero normal distribution is a mode one log-normal. The mode of a log-normal is given by \begin{equation*} \text{mode}\left[ a_i \right] = e^{\mu_a-\sigma^2_a} \quad, \end{equation*} which is equal to 1 when $\mu_a = \sigma^2_a$. The variance of a mode one log-normal is given by \begin{aligned} \text{Var}\left[a_i\right] & = \left( e^{\sigma^2_a}-1 \right) e^{2\mu_a + \sigma^2_a} \\ & = \left( e^{\mu_a}-1 \right) e^{3\mu_a} \quad \end{aligned} which is zero when $\mu_a = 0$ and can be arbitrarily large as $\mu_a$ increases.
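
As a quick sanity check (ours, not part of the original derivation; the value of $\mu_a$ below is arbitrary), we can verify the mode and variance formulas for a mode one log-normal numerically:

## For a mode one log-normal we set sigma^2_a = mu_a
mu.a     <- 0.5
sigma2.a <- mu.a

## Theoretical mode (should be 1) and variance
exp(mu.a - sigma2.a)
(exp(sigma2.a) - 1) * exp(2 * mu.a + sigma2.a)

## Monte Carlo check of the variance
set.seed(1)
var(rlnorm(1e6, meanlog = mu.a, sdlog = sqrt(sigma2.a)))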

Due to the multiplicative and highly skewed nature of the log-normal distribution, it is somewhat difficult to gain an intuition for which values of $\mu_a$ and $\sigma^2_a$ lead to an informative prior, and which values lead to a flat prior. Our approach is to approximately match the spread of the prior on $a_i$ to that of the prior on the item difficulty parameter $b_i$, by matching the 95% quantile of the mode one log-normal to that of a truncated (at zero) version of the prior for $b_i$.


We can find the appropriate values of $\mu_a$ and $\sigma^2_a$ numerically in R as follows:
require(truncnorm)
## If the above command fails, you must install the 
## truncnorm library. To do so, un-comment the install.packages 
## line below, run it, and follow the directions it gives.
##
## install.packages('truncnorm')
 
## We use the quantile function to determine the critical 
## value where 95% of the mass of a normal distribution 
## truncated at zero (a=0), with a mean of 0 (mean=0) and 
## standard deviation of 10 (sd=10) lies. 
qtruncnorm(p=.95, mean=0, sd=10, a=0) 
# [1] 19.59964
 
## We used a guess and check method to find the
## corresponding quantile for a mode 1 log-normal distribution
xx = 1.185; qlnorm(p=.95, meanlog=xx, sdlog=sqrt(xx))
# [1] 19.6004

As a visual check, below we compare the CDF of a log-normal with $\mu_a = \sigma^2_a = 1.185$ with the CDF of a truncated version of the prior for $b_i$ (a normal with mean 0 and variance 100, truncated at zero):

Note from the zoomed in CDF that the log-normal puts much less mass on values near zero. This is expected because, for a given constant $c$, we want values of $a_i$ less than $1/c$ to be rare in the same way that values greater than $c$ are rare. Further note that the 95% quantiles of both distributions are equal.

The code to produce the graph is given below. Note that it requires you to have run the previous R code from this page.

## Set a 2 by 1 grid for the graphs
par(mfrow=c(1,2))

## Adjust the margins of the plot 
par(mar=c(4,4,4,2))

## Plot the zoomed in CDF of the truncated normal
curve( ptruncnorm(q=x, a=0, mean=0, sd=sqrt(100)), from=0, to=1,
       main="Zoomed in", lty='dotted', lwd=2,
       xlab='quantile', ylab='probability')

## Add the CDF of the log-normal to the graph
curve( plnorm(q=x, meanlog=1.185, sdlog=sqrt(1.185)), from=0, to=1,
       add=TRUE, lwd=2, col='blue')

## Draw an arrow approximately where the two CDFs cross
arrows( x0=0.325, y0=.038, x1=0.45,  y1=.038,
        length=1/8)

## Draw the legend
legend( "topleft", c('truncated b_i prior', 'log-normal'),
        col=c('black','blue'), lwd=2, lty=c('dotted', 'solid'))

## Draw the zoomed out truncated normal CDF in the second panel
## N.B. It draws in the second panel because add=TRUE is missing.
curve( ptruncnorm(q=x, a=0, mean=0, sd=sqrt(100)), from=0, to=60,
             main="Zoomed out", lwd=2, lty='dotted',
        xlab='quantile', ylab='probability')

## Add the CDF of the log-normal to the graph
curve( plnorm(q=x, meanlog=1.185, sdlog=sqrt(1.185)), from=0, to=60,
       add=TRUE, lwd=2, col='blue')

## Draw an arrow approximately where the two CDFs cross
arrows( x0=10, y0=.95, x1=16.6,  y1=.95,
        length=1/8)

## Reset to a 1 by 1 grid for any future graphs
par(mfrow=c(1,1))

For our purposes, we choose $\mu_a = \sigma^2_a = 1.185$ for the prior of the item discrimination parameter because those values approximately match it to the prior for the item difficulty parameter.
