Post 4: Sampling the person ability parameters

October 8, 2013
By

(This article was first published on Markov Chain Monte Carlo in Item Response Models, and kindly contributed to R-bloggers)

The previous post outlined the general strategy of writing a MH within Gibbs sampler by breaking the code into two levels: a high level shell and a series of lower-level samplers which do the actual work.

This post discusses the first of the lower-level samplers, the sampler for the person ability parameters. The chapter gives a complete treatment of how to write the sampler, please read it for the details. Here, we summarize the development of the sampler, check that the sampler code is correct, and give two examples of how bugs might be detected. The final bug will motivate the next post where we will refactor this specific person ability sampler into a general MH sampler.

The person ability sampler

The person ability sampler will use the Metropolis-Hastings (MH) algorithm to sample from the complete conditional. See the chapter for details on how (and why!) to write the code presented in this section.

In a previous post we derived the complete conditional as:

begin{equation}
f(theta_p|text{rest}) propto
prod_{i=1}^I
pi_{pi}^{u_{pi}} (1 – pi_{pi})^{1-u_{pi}}
f_text{Normal}(theta_p| 0,sigma_theta^2) label{eq:thetacc}
end{equation}
where pi_{pi} is the probability of person p answering item i correctly. pi_{pi} depends on both the person and item parameters, and as is defined by
begin{equation*}
ln{frac{pi_{pi}}{1+pi_{pi}}} = a_i ( theta_p – b_i) .
end{equation*}

The likelihood term in Equation ref{eq:thetacc} left( pi_{pi}^{u_{pi}} (1 - pi_{pi})^{1-u_{pi}} right) must be calculated twice for each iteration of the MH algorithm: first for the proposed values of boldsymbol theta and second for the “old” values of boldsymbol theta. To minimized code duplication, we define the log.prob function to calculate it on the log scale:

## log.prob
##
## Output : A matrix such that the
##          column sums are item log-likelihood values, and
##          row sums are person log-likelihood values.
log.prob <- function(U.data, th, a, b) {
  ## Step 1: Build the matrix of probabilities 
  ##         (see Post 2 for details) 
  term.1     <- outer(th, a)
  term.2     <- matrix(rep(a*b, P.persons), nrow=P.persons, 
                byrow=TRUE)
  P.prob <- plogis(term.1 - term.2) #### 1/(1 + exp(term.2 - term.1))

  ## Step 2: Build the likelihood from Equation 1 on the log scale
  log.bernoulli <- log(P.prob^U.data) + log((1-P.prob)^(1-U.data))
  return(log.bernoulli)
}

The chapter gives details on how to implement the MH sampler. A more highly commented version of the code is given here:

sample.th <- function(U.data, old) {
 ## Access the current (soon to be old) state  of the chain
 ## N.B. This is vector. All operations in this function
 ##      are on the complete vector instead of element-at-a-time
 th.old    <- old$th

 ## Draw the MH proposal
 ## ... get the MH proposal variance
 MH.th     <- old$MH$th
 ## ... get the number of persons in the data
 P.persons <- length(th.old)
 ## ... draw a new person ability parameter for each person
 ##     centered around their previous (th.old) parameter, with
 ##     a fixed variance (MH.th)
 th.star   <- rnorm(P.persons,th.old,MH.th)

 ## Calculate the probability of accepting the proposal draw
 ## ... N.B. All calculations are done on the log scale to keep
 ##          numerical accuracy when dealing with small numbers
 ## ... The acceptance probability is a function of four things:
 ##
 ## ... ... 1: The complete conditional at the proposal     
 log.cc.star <- (
     ## ... ... likelihood term
     apply(log.prob(U.data, th.star, old$a, old$b),1,sum)
     ## ... ... prior term
     + log(dnorm(th.star,0,sqrt(old$s2))))
 ##
 ## ... ... 2: The complete conditional at the old value
 log.cc.old  <- (
     ## ... ... likelihood term
     apply(log.prob(U.data, th.old, old$a, old$b),1,sum)
     ## ... ... prior term 
     + log(dnorm(th.old,0,sqrt(old$s2))))
 ##
 ## ... ... 3: The MH proposal density at the proposal
 log.prop.star <- log(dnorm(th.star,th.old,MH.th))
 ##   
 ## ... ... 4: The MH proposal density at the old value
 log.prop.old  <- log(dnorm(th.old,th.star,MH.th))
 ## 
 ## ... Calculate the acceptance probability on the log scale
 log.alpha.star <- pmin(log.cc.star + log.prop.old - log.cc.old -
                    log.prop.star,0)
     
 ## Accept or reject the proposals by flipping coins with the
 ## calculated acceptance probabilities 
 acc.new <- ifelse(log(runif(P.persons))<=log.alpha.star, 1, 0)

 ## Update the chain 
 cur          <- old   
 cur$th       <- ifelse(acc.new==1, th.star, th.old)

 ## Calculate acceptance probabilities (for tuning)
 cur$ACC$th   <- old$ACC$th + mean( acc.new )
 cur$ACC$th.n <-  cur$ACC$th.n + 1
 
 return(cur) 
}

Please see the chapter for additional details.

Running the sampler

We can run the sampler as follows:

## Load the necessary code from this and previous posts
source("http://mcmcinirt.stat.cmu.edu/setup/post-4.R")

## Set the seed to keep results reproducible 
set.seed(314159)

## Run the sampler with overdispersed theta.abl
## values and the true, simulated values for the other
## parameters. 
run.B <- run.chain.2pl(
    ## No burnin, keep 1000 iterations, do not thin
    M.burnin = 0, M.keep  = 1000, M.thin = 1,
    ## Use the fake data simulated in Post #2
    U.data  = U,
    ## Pass in the hyperpriors from Post #1
    hyperpars = hyperpars,
    ## Generate starting values uniformly from -5 to 5
    th.init = runif( length(theta.abl), 
                     min=-5, max=5     ),
    ## Set the other parameters to their true values 
    ## from Post #2
    a.init  = a.disc,
    b.init  = b.diff,
    s2.init = sig2.theta,
    ## Set the person ability proposal variance to 1, 
    ## the others to NULL
    MH.th=1, MH.a=NULL, MH.b=NULL,
    ## Output the acceptance rates
    verbose=TRUE)
## [1] 100
## [1] 200
## ... snip ... 
## [1] 1000
## Average acceptance rates: 
##  theta.abl: 0.481 
##  a.disc:    NaN 
##  b.diff:    NaN

## Convert to coda
run.B.mcmc <- mcmc( t(run.B) )

Note that the acceptance rate reported by the sampler is about 48%, which is near the optimal rate. I achieved this by changing the value for MH.th until it was "just right". The tuning runs are not shown (see Post 9 for how to tune the sampler).

Testing the sampler

Using the coda library and the get.2pl.params function from Post 3 (both of which were loaded by the source("http://mcmcinirt.stat.cmu.edu/setup/post-4.R") line in the code above), we can quickly visualize the trace plots for theta_1 through theta_9:
run.B.mcmc.theta.1-9
using the default plot function:

## N.B. ylim specifies the scale on the y-axis. 
plot( run.B.mcmc[, get.2pl.params(1:9,NULL,NULL,NULL)],
      density=FALSE,
      ylim=c(-5,5), smooth=TRUE )

From the trace plots, it appears that each chain quickly converges to the mode from its starting value. We can test that the chains are getting the right answers by calculating the average values of the chains (Expected A Posteriori estimates; EAP estimates),

theta.eap.B <- apply( 
       run.B.mcmc[, get.2pl.params(1:P.persons, NULL, NULL, NULL)],
       MARGIN=2,
       mean )

and graphically comparing them with the simulated true values:
run.B.mcmc.theta.check

To generate the graph, we define a special visualization function

## We'll be doing these checks a lot, so let's make a function
check.sampler.graph <- function(true.values, estimated.values,
                                desc, ylab, col) {    
par(mfrow=c(1,2))
  plot( true.values, estimated.values,
      xlab='True values', ylab=ylab,
      main=paste('Parameter recovery:',desc),
      col=col)
  abline(0,1)

  plot( ecdf( true.values ), xlab=deparse(substitute(true.values)),
        main=paste('Empirical CDFs:',desc) )
  plot( ecdf( estimated.values ), col=col, lty='dashed', add=TRUE )
  legend( 'topleft', c('True values', ylab), col=c('black', col),
           lty=c('solid','dashed'), bg='white')
   ## Add in the KS test
  legend( 
     'bottomright',
     paste("Kolmogorov-Smirnov p value:",
       round( ks.test( true.values, estimated.values )$p.value, 2)),
     bg='white')
par(mfrow=c(1,1))
}

and call it for this specific run of the sampler:

## Run the function for this example: 
check.sampler.graph(theta.abl, theta.eap.B, 'Person Ability',
                    'EAP estimates', 'blue')
## Parameter recovery looks a little biased
## Empirical CDF looks very close 

Both the parameter recovery scatter plot and the empirical CDF plot suggest that the sampler is properly recovering the simulated values. This is good news; it suggests that we implemented the sampler correctly.

What if there is a bug in the code?

Bug 1: Good sampler properties, but wrong answers

In my experience, a common bug in writing a sampler is not calculating the log-likelihood correctly. Here we introduce a bug in the calculation of the log-likelihood to see what happens with the sampler.

## Introduce a bug into the calculation of the log-likelihood
log.prob.buggy <- function(U.data, th, a, b) {
  term.1     <- outer(th, b) ## This should be 'a' not 'b'
  term.2     <- matrix(rep(a*b, P.persons), nrow=P.persons, 
                byrow=TRUE)
  P.prob <- plogis(term.1 - term.2) 
 
  log.bernoulli <- log(P.prob^U.data) + log((1-P.prob)^(1-U.data))
  return(log.bernoulli)
}

Now we swap in the buggy log.prob function, make a buggy run of the sampler, and the swap out the buggy log.prob function for the correct one:

## Set the seed to keep results reproducible 
set.seed(314159)

## Swap in the buggy log likelihood
log.prob.correct <- log.prob
log.prob         <- log.prob.buggy

## Make a buggy run
run.B.buggy <- run.chain.2pl(
    ## No burnin, keep 1000 iterations, do not thin
    M.burnin = 0, M.keep  = 1000, M.thin = 1,
    ## Use the fake data simulated in Post #2
    U.data  = U,
    ## Pass in the hyperpriors from Post #1
    hyperpars = hyperpars,
    ## Generate starting values uniformly from -5 to 5
    th.init = runif( length(theta.abl), 
                     min=-5, max=5     ),
    ## Set the other parameters to their true values 
    ## from Post #2
    a.init  = a.disc,
    b.init  = b.diff,
    s2.init = sig2.theta,
    MH.th=1, MH.a=NULL, MH.b=NULL, verbose=TRUE)

## Swap out the buggy sampler with the correct one
log.prob  <- log.prob.correct

## Convert to coda
run.B.buggy.mcmc <- mcmc( t(run.B.buggy) )

Now, the first nine person ability parameters still converge quickly to their mode:

run.B.buggy.mcmc.theta.1-9
plot( run.B.buggy.mcmc[, get.2pl.params(1:9,NULL,NULL,NULL)],
      density=FALSE,
      ylim=c(-5,5), smooth=TRUE )

However, those modes do not correspond to the true values:

run.B.buggy.mcmc.theta.check
## Calculate buggy EAP estimates
theta.eap.B.buggy <- apply( 
       run.B.buggy.mcmc[, get.2pl.params(1:P.persons, NULL, NULL, NULL)],
       MARGIN=2,
       mean )

## Visually check
check.sampler.graph(theta.abl, theta.eap.B.buggy, 'Person Ability',
                    'Buggy EAP estimates', 'purple')

Lesson learned from bug #1

Sometimes the sampler will converge well to a global mode, but that mode will be the wrong answer because there is an error in the way that the likelihood (or prior) is being calculated. Only a check with simulated data -- where the true values are known -- will be able to spot this bug.

Bug 2: Terrible sampler properties, but correct answers

In my experience, a common bug in writing a sampler is making a small error in how the Metropolis-Hastings algorithm is implemented. Here we introduce a bug in the calculation of the acceptance probability:

sample.th.buggy <- function(U.data, old) {
 th.old    <- old$th
 MH.th     <- old$MH$th
 P.persons <- length(th.old)
 th.star   <- rnorm(P.persons,th.old,MH.th)
 
 log.cc.star <- apply(
                   log.prob(U.data, th.star, 
                            old$a, old$b),1,sum) +
                   log(dnorm(th.star,0,sqrt(old$s2)))
 log.cc.old  <- apply(
                   log.prob(U.data, th.old, 
                            old$a, old$b),1,sum) +
                   log(dnorm(th.old ,0,sqrt(old$s2))) 
 log.prop.star <- log(dnorm(th.star,th.old,MH.th))
 log.prop.old  <- log(dnorm(th.old,th.star,MH.th))

 ## Bug: add log.prop.star, instead of subtract
 log.alpha.star <- pmin(log.cc.star + log.prop.old - log.cc.old +
                    log.prop.star,0) 

 acc.new <- ifelse(log(runif(P.persons))<=log.alpha.star, 1, 0)
 
 cur          <- old   
 cur$th       <- ifelse(acc.new==1, th.star, th.old)
 cur$ACC$th   <- old$ACC$th + mean( acc.new )
 cur$ACC$th.n <-  cur$ACC$th.n + 1
 
 return(cur) 
}

Since the bug involves the acceptance rate, we must adjust the variance of the Metropolis-Hastings proposal to get reasonable acceptance rates:

## Set the seed to keep results reproducible 
set.seed(314159)
 
## Swap in 
sample.th.correct <- sample.th
sample.th         <- sample.th.buggy 

## Make a buggy run
run.B.buggy.th <- run.chain.2pl(
    ## No burnin, keep 1000 iterations, do not thin
    M.burnin = 0, M.keep  = 1000, M.thin = 1,
    ## Use the fake data simulated in Post #2
    U.data  = U,
    ## Pass in the hyperpriors from Post #1
    hyperpars = hyperpars,
    ## Generate starting values uniformly from -5 to 5
    th.init = runif( length(theta.abl), 
                     min=-5, max=5     ),
    ## Set the other parameters to their true values 
    ## from Post #2
    a.init  = a.disc,
    b.init  = b.diff,
    s2.init = sig2.theta,

    ## Note that MH.th is 0.44 instead of 1
    MH.th=0.44,

    MH.a=NULL, MH.b=NULL, verbose=TRUE)
## Average acceptance rates: 
##  theta.abl: 0.443

## Swap out the buggy sampler with the correct one
sample.th <- sample.th.correct

## Convert to coda
run.B.buggy.th.mcmc <- mcmc( t(run.B.buggy.th) )

Again, the first nine person ability parameters still converge quickly to their modes:

run.B.mcmc.theta.buggy.th.1-9
plot( run.B.buggy.th.mcmc[, get.2pl.params(1:9,NULL,NULL,NULL)],
      density=FALSE, ylim=c(-5,5), smooth=TRUE )

And the chains even recover the true values:

run.B.buggy.th.mcmc.theta.check
## Calculate buggy EAP estimates
theta.eap.B.buggy.th <- apply( 
       run.B.buggy.th.mcmc[, get.2pl.params(1:P.persons, NULL, NULL, NULL)],
       MARGIN=2,
       mean )

## Visually check
check.sampler.graph(theta.abl, theta.eap.B.buggy.th, 'Person Ability',
                    'Buggy EAP estimates', 'purple')

So, if the chains converge quickly to their modes and recover their true values, in what sense is this a bug?

This bug is about efficiency. Note that the buggy chain has much higher auto-correlation than the correct chain, even though both acceptance rates are around 0.44.

run.B.mcmc.theta.buggy.th.acf
par(mfrow=c(1,2))
   acf( run.B.mcmc[, 'theta.abl 1'],
        main="ACF: correct sampler", col='blue')
   acf( run.B.buggy.th.mcmc[, 'theta.abl 1'],
        main="ACF: buggy sampler", col='purple')
par(mfrow=c(1,1))

Since the autocorrelation is so high in the buggy sampler, there is less information about the posterior distribution in the chain. One metric that is both simple to calculate and to interpret is the effective sample size. This is (roughly) the number of draws from an independent sampler that would have the same amount of information as the (in this case, 1,000) draws from the correlated sampler. In R, we can calculate an effective sample size for each theta_p and examine boxplots as follows:

run.B.mcmc.theta.buggy.th.viz.ess
## Effective sample size of the correct sampler
ess.run.B <- effectiveSize(
  run.B.mcmc[ , get.2pl.params(1:P.persons, NULL, NULL, NULL)]
) 

## Effective sample size of the bug #2 sampler
ess.run.B.buggy.th <- effectiveSize(
  run.B.buggy.th.mcmc[, get.2pl.params(1:P.persons, NULL, NULL, NULL)]
) 

## Compare the two distributions
boxplot( ess.run.B, ess.run.B.buggy.th,
         main="Effective Sample Size comparison",
         names=c("Correct sample.th", "Buggy sample.th"),
         col=c("lightblue","mediumpurple"))

Note that the 1,000 draws from the correct sampler is worth about 200 draws from a hypothetical independent sampler, while the 1,000 draws from the buggy sampler is worth only about 50. Therefore, this bug does affect our ability to draw inferences -- mainly because our standard error of estimation will be quite a bit worse.

Lesson learned from bug #2

Sometimes a sampler will appear to recover the true values, but the sampler will be very inefficient. This could be due to a mistake in the algorithm. Unfortunately, detecting these bugs is quite difficult.

In the next post, we give a partial solution to avoiding such bugs.

To leave a comment for the author, please follow the link and comment on their blog: Markov Chain Monte Carlo in Item Response Models.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)