[This article was first published on Markov Chain Monte Carlo in Item Response Models, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In order to check that an estimation algorithm is working properly, it is useful to see if the algorithm can recover the true parameter values in one or more simulated “test” data sets. This post explains how to build such a test data set for a 2PL model, gives two methods to check that the data are being generated properly, and examines the robustness of the checks by intentionally introducing bugs.

# Generating 2PL data

We defined the 2PL IRT model in the previous post in terms of the distribution of $U_{pi}$, the observed response of person $p$ to item $i$:
begin{equation*}
U_{pi} stackrel{indep}{sim} text{Bernoulli}(pi_{pi})
end{equation*}begin{equation}
ln{frac{pi_{pi}}{1+pi_{pi}}} = a_i ( theta_p – b_i) label{eq:pipi}
end{equation}
where the three parameters $a_i$, $b_i$, and $theta_p$ are interpreted as item discrimination, item difficulty, and person ability parameters respectively.

We can generate fake data by choosing parameter values and following the model definition. In this example we set the item discrimination parameters to be uniformly distributed between 0.5 and 1.5; the item difficulty parameters to be evenly spaced between -3 and 3; the variance of the person ability parameters to be $1.25^2$; and the person ability parameters themselves to be normally distributed around zero. We implement this in R as follows:

## Set the random-number generator seed,
## to make the results reproducible
set.seed(314159)

## set the number of items I and persons P
I.items    <- 30
P.persons  <- 2000

## set the fixed item and population parameters
a.disc     <- 1 + runif(I.items,-0.5,0.5)
b.diff     <- seq(-3,3,length=I.items)

## generate the person ability parameters
mean.theta <- 0
sig2.theta <- (1.25)^2
theta.abl  <- rnorm(P.persons, mean=mean.theta,
sd=sqrt(sig2.theta))


There are several ways to calculate Equation ref{eq:pipi}. In R, operations on whole vectors are much faster than operations on individual elements of vectors. Therefore the code will be much faster if we avoid looping over the $p$ and $i$ values to calculate Equation ref{eq:pipi} directly and instead translate Equation ref{eq:pipi} into a series of operations on matrices.

One way to do this is given below, where we decompose the linear part of Equation ref{eq:pipi} into two terms: an outer product of $boldsymbol theta$ and $boldsymbol a$ and a matrix where every row is the same. We then calculate the logit with the native R function plogis, so that P.prob is the matrix corresponding to $pi_{pi}$, and then flip our biased coins with the inverse CDF method to generate $boldsymbol U$.

## the I x P matrix of response probabilities
term.1     <- outer(theta.abl, a.disc)
term.2     <- matrix( rep(a.disc*b.diff, P.persons),
nrow=P.persons, byrow=TRUE)
P.prob     <- plogis(term.1-term.2) # 1/(1 + exp(term.2 - term.1))

## generate the 0/1 responses U as a matrix of Bernoulli draws
U          <- ifelse(runif(I.items*P.persons) < P.prob,1,0)


# Checking the fake data

## Check Method 1: Matching theoretical and empirical moments

One way to check the simulated data is to check that the moments (means, variances, covariances, etc.) of the simulated data agree with the same moments of the simulating model. This approach is especially useful when MCMC is meant to be the first estimation algorithm. Although the 2PL IRT model is itself a well-known model, we illustrate this approach below with 1 dimensional and 2 dimensional moments.

### 1 dimensional moments

We can use the model to calculate the probability of item $i$ being correctly answered by integrating over the person ability parameters: begin{equation} p_i = int_{-infty}^infty frac{exp{a_i(theta - b_i)}} {1 + exp{a_i(theta - b_i)}} cdot f_{text{Normal}}(theta|0,sigma^2_theta) ,dtheta quad . label{eq:1Dtheo} end{equation} A simple estimator of this probability is the average of the observed response across all of the persons: begin{equation} widehat{p}_{i} = frac{1}{P}sum_{p=1}^P U_{pi} quad . label{eq:1Demp} end{equation}

We can use R to calculate the theoretical values from equation ref{eq:1Dtheo} and visually compare them with the empirical estimates from ref{eq:1Demp}.

Equation ref{eq:1Dtheo} can be calculated in R by using the integrate function:

## 1 Dimensional moments
## Initialize a vector to hold the results
theo.1D <- rep(NA, I.items)

## Loop over each item
for( ii in 1:I.items){
## For a given item ii, evaluate Equation 1
## N.B. th.dummy   is the integration variable
##      a.disc[ii] is the discrimination parameter for item ii
##      b.diff[ii] is the difficulty parameter for item ii
theo.1D[ii] <- integrate(
function(th.dummy) {return(
1/(1+exp(-a.disc[ii]*(th.dummy-b.diff[ii])))
* dnorm(th.dummy,mean.theta, sqrt(sig2.theta))
)},-Inf, Inf )$value }  Equation ref{eq:1Demp} can be calculated in R by using apply to calculate the mean of every column of the matrix: ## The dimension of U is 2000 (persons) by 30 (items) dim(U) #[1] 2000 30 ## To calculate the item averages, we 'apply' the 'mean' ## function on the columns of items. We select the columns ## with 'MARGIN=2' (the first margin is the rows, the second ## is the columns). emp.1D <- apply(U, MARGIN=2, mean)  We can check that the empirical moments from Equation ref{eq:1Demp} match the theoretical moments from Equation ref{eq:1Dtheo} by making a scatter plot:  ## Draw a scatter plot with the theoretical values ## on the x-axis and the empirical values on the y-axis. plot( theo.1D, emp.1D, asp=1, main='1D Moments', xlab='Theoretical', ylab='Empirical') ## Draw a 45 degree line through the origin abline(0,1)  Since the points lie mostly on the 45 degree line, it looks like the data were generated correctly. ### 2 dimensional moments We can use the model to calculate the joint probability of item $i$ being correctly answered and item $j$ being correctly answered by integrating over the person ability parameters: begin{equation} p_{ij} = int_{-infty}^infty frac{exp{a_i(theta - b_i)}} {1 + exp{a_i(theta - b_i)}} cdot frac{exp{a_j(theta - b_j)}} {1 + exp{a_j(theta - b_j)}} cdot f_{text{Normal}}(theta|0,sigma^2_theta) ,dtheta quad . label{eq:2Dtheo} end{equation} A simple estimator of this probability is the average of the product of observed responses across all of the persons: begin{equation} widehat{p}_{ij} = frac{1}{P}sum_{p=1}^P U_{pi}U_{pj} quad . label{eq:2Demp} end{equation} Calculation of Equation ref{eq:2Dtheo} and Equation ref{eq:2Demp} use the same ideas as the 1D moments, but the details are more complex. We implement the calculation as follows: ## Initialize vectors to hold the results ## N.B. choose(n,p) calculates the binomial coefficient ## "n choose p" theo.2D <- rep(NA, choose(I.items, 2) ) emp.2D <- theo.2D ## Generate all possible combinations of items ## N.B. combn(x,y) generates all combinations of x ## taken y at a time. cmbn.matrix <- combn(I.items, 2) ## where each column of the matrix is a unique combination dim(cmbn.matrix) # [1] 2 435 ## We iterate over all unique combinations for( which.cmbn in 1:choose(I.items,2) ) { ## We define the ii and jj for this combination ii <- cmbn.matrix[1,which.cmbn] jj <- cmbn.matrix[2,which.cmbn] ## We calculate Equation 3 theo.2D[which.cmbn] <- integrate( function(th.dummy) {return( 1/(1+exp(-a.disc[ii]*(th.dummy-b.diff[ii]))) * 1/(1+exp(-a.disc[jj]*(th.dummy-b.diff[jj]))) * dnorm(th.dummy,mean.theta, sqrt(sig2.theta)) )},-Inf, Inf )$value

## We calculate Equation 4
emp.2D[which.cmbn] <- mean(U[,ii]*U[,jj])
}


We can check that the empirical moments from Equation ref{eq:2Demp} match the theoretical moments from Equation ref{eq:2Dtheo} by making a scatter plot:

 ## Draw a scatter plot with the theoretical values ## on the x-axis and the empirical values on the y-axis. plot( theo.2D, emp.2D, asp=1, main='2D Moments', xlab='Theoretical', ylab='Empirical') ## Draw a 45 degree line through the origin abline(0,1) 

Which again looks pretty good!

### Conclusions from Method 1

Based on the 1D and 2D moment scatter plots, it appears that the code to generate the fake data is working well.

## Check Method 2: Recovering parameters with an existing estimation method

In the case of a familiar model such as the 2PL IRT model for which many other estimation algorithms have been written, we can also check to see that parameters are recovered by a known algorithm.

In R, the ltm package can be used to recover the item discrimination $a_i$ and item difficulty $b_i$ parameters. Example code is as follows:

require(ltm)
## If the above command fails, you must install the
## ltm library. To do so, un-comment the install.packages
## line below, run it, and follow the directions it gives.
##
## install.packages('ltm')

## Fit a 2PL IRT model with ltm() and store the resulting
## object in ml.check
ml.check  <- ltm( U ~ z1, IRT.param=TRUE )
## See
##    help(ltm)
## for details on its syntax.

## Extract out the discrimination and difficulty parameters
ml.a.disc <- coef(ml.check)[,'Dscrmn']
ml.b.diff <- coef(ml.check)[,'Dffclt']


We can check that ltm estimates of $a_i$ and $b_i$ match their true values with two scatter plots:

 ## Generate the unequated scatter plots side-by-side par(mfrow=c(1,2)) plot( a.disc, ml.a.disc, asp=1, xlab="True values", ylab="ltm estimates", xlim=c(.5,1.6), ylim=c(.5,1.6), main="ltm Item Discrimination") abline(0,1) plot( b.diff, ml.b.diff, asp=1, xlab="True values", ylab="ltm estimates", main="ltm Item Difficulty") abline(0,1) 

The fit between the ltm estimates and the true values is bad.

In this case, the poor fit of the ltm estimates with the true values is because of a latent-space indeterminacy in estimating the 2PL IRT model. Cook and Eignor (1992) give a method to equate two sets of estimates from a 2PL model so that they can be compared. We implement the equating in R as follows:

## The Cook and Eignor (1992) method to equate
## *this* discrimination parameter with *that*
## discrimination parameter.
equate.2pl.a.disc <- function( this.a, this.b,
that.a, that.b ) {
## N.B. that.a is not used, but is included
##             for ease of use.
return( this.a * sd( this.b ) / sd( that.b ) )
}

## The Cook and Eignor (1992) method to equate
## *this* difficulty parameter with *that*
## difficulty parameter.
equate.2pl.b.diff <- function( this.a, this.b,
that.a, that.b ) {
## N.B. this.a and that.a are not used, but are
##                        included for ease of use.
return( (this.b-mean(this.b))*sd(that.b)/sd(this.b)
+ mean(that.b) )
}


And then equate this specific example as follows:

equated.a.disc <- equate.2pl.a.disc( ml.a.disc, ml.b.diff,
a.disc, b.diff )
equated.b.diff <- equate.2pl.b.diff( ml.a.disc, ml.b.diff,
a.disc, b.diff )


A scatter plot of the equated parameters shows that both the code to generate the fake data and the code to equate the parameter estimates works well:

 ## Generate the equated scatter plots side-by-side par(mfrow=c(1,2)) plot( a.disc, equated.a.disc, asp=1, xlab="True values", ylab="Equated ltm estimates", xlim=c(.5,1.6), ylim=c(.5,1.6), main="Equated ltm Item Discrimination") abline(0,1) plot( b.diff, equated.b.diff, asp=1, xlab="True values", ylab="Equated ltm estimates", main="Equated ltm Item Difficulty") abline(0,1) 

### Conclusion from checking with an existing method

Based on the equated plots, it appears that the code to generate the fake data is working well.

# Robustness of Method 1 and Method 2

In this section, we introduce a bug in the fake data generation code to see how well the two methods detect the bug.

We introduce a sign error as our candidate bug. Instead of subtracting the two terms in the logit, we add them:

## re-set the seed to keep results reproducible
set.seed(314159)

## term.1 and term.2 should be subtracted, not added:
P.prob.buggy  <- plogis(term.1 + term.2)
U.buggy       <- ifelse(runif(I.items*P.persons) < P.prob.buggy,1,0)


Now we calculate our empirical "buggy" moments:

## Calculate the "buggy" moments
emp.1D.buggy  <- apply(U.buggy, MARGIN=2, mean)
emp.2D.buggy  <- rep(NA, choose(I.items, 2) )
for( which.cmbn in 1:choose(I.items,2) ) {
ii <-  cmbn.matrix[1,which.cmbn]
jj <-  cmbn.matrix[2,which.cmbn]
emp.2D.buggy[which.cmbn] <- mean(U.buggy[,ii]*U.buggy[,jj])
}


And our "buggy" ltm estimates, which we also equate:

## Calculate the "buggy" ltm estimates
ml.check.buggy  <- ltm( U.buggy ~ z1, IRT.param=TRUE )
ml.a.disc.buggy <- coef(ml.check.buggy)[,'Dscrmn']
ml.b.diff.buggy <- coef(ml.check.buggy)[,'Dffclt']

## Equate the "buggy" ltm estimates
equated.a.disc.buggy <- equate.2pl.a.disc(
ml.a.disc.buggy, ml.b.diff.buggy,
a.disc         , b.diff )
equated.b.diff.buggy <- equate.2pl.b.diff(
ml.a.disc.buggy, ml.b.diff.buggy,
a.disc         , b.diff )


Now we can visualize the effect of the bug on our moment matching and parameter recovery approaches:

 par(mfrow=c(2,2)) ## Graph the buggy moments plot( theo.1D, emp.1D.buggy, asp=1, main='1D Moments -- Intentional Bug', xlab='Theoretical', ylab='Empirical (Buggy)') abline(0,1) plot( theo.2D, emp.2D.buggy, asp=1, main='2D Moments -- Intentional Bug', xlab='Theoretical', ylab='Empirical (Buggy)') abline(0,1) ## Graph the buggy equated ltm estimates plot( a.disc, equated.a.disc.buggy, asp=1, main="Equated ltm Item Discrimination -- Intentional Bug", xlab="True values", ylab="Equated ltm estimates (Buggy)") abline(0,1) plot( b.diff, equated.b.diff.buggy, asp=1, main="Equated ltm Item Difficulty -- Intentional Bug", xlab="True values", ylab="Equated ltm estimates (Buggy)") abline(0,1) 

The moment matching method detects this bug very well. The 1D moments have the wrong slope and the 2D moments end up with an interesting shape that is most decidedly not a 45 degree line. This sends a strong signal that something is wrong in the code.

The parameter recovery method does detect the bug, but it sends a weaker signal since the item discrimination parameters seem fine. This is because the form of the scale indeterminacy for the item discrimination parameters is agnostic to this particular bug. Thankfully, that is not the case for the item difficulty parameters.

# Conclusion

Based off of the results of both Method 1 and Method 2, we believe that our fake data is being generated correctly.

Note that the details of Method 1 and Method 2 are specific to the 2PL IRT model. In other models, one must derive the analogues of Equations ref{eq:1Dtheo} through ref{eq:2Demp} in order to compare moments, or use other available software and equating methods in order to compare recovered parameters.