[This article was first published on **Statistics et al.**, and kindly contributed to R-bloggers.]


Making good exam questions is universally hard. The ideal question should have a clear solution for those with the requisite understanding, but also be difficult enough that someone without the needed knowledge cannot guess their way to an answer.

An item response theory (IRT) based analysis can estimate the difficulty of a question, as well as the general skill of each of the test takers. The generalized partial credit model extends classical IRT from questions with binary scores to ones with an ordinal set of possible scores.

R code and example inside.

Additional complicating factors include accommodating correct but alternate or unexpected solutions, and barriers not directly related to the understanding being measured, such as language limitations. This is just one interpretation of ‘good’ for an exam question, because there are other issues, like rote memorization vs. integrative understanding, that are harder still to measure or define.

On top of that, exam questions have a limited lifetime. To be able to reuse a question and maintain integrity between exams, information about the question cannot leave the exam room, which is practically impossible because people remember the exams they have just written. Using new questions for each evaluation means bearing the risk of using untested questions, as well as the workload of going through this difficult process every time.

In a later post, I will be releasing a large set of the exam questions I have made and used in the past, as well as the answer key and some annotations about each problem, including measures of difficulty and discrimination power that can be found using item response theory (IRT). This post is a vignette on using IRT to get these measures from exams that have been graded using the Crowdmark learning management software, which retains the score awarded to each student for each question.

Item response theory (IRT) can be used to find, after the fact, not just which of your exam questions were difficult, but also which ones were more or less effective at finding differences in ability between students. There are many R packages that use IRT, as listed in the CRAN Psychometrics task view:

https://cran.r-project.org/web/views/Psychometrics.html

For this analysis, I opted to use the latent trait model (ltm) package because it includes an implementation of the generalized partial credit model (gpcm), which is particularly apt for the large, open-ended questions that I prefer for exams.

The exam data from Crowdmark has been scrubbed of identifying information. The order of the students has also been randomized and the original row labels removed. However, for safety, it is highly recommended that you check your data manually at this step to make sure no identifying information remains. Crowdmark exports the student data in alphabetical order, which is why reordering matters. If you plan to use multiple evaluations, such as the final exam and each midterm, make sure to combine the data sets with cbind() before reordering the rows.

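id_cols = c("Crowdmark.ID","Score.URL","Email","Canvas.ID","Name","Student.ID","Total")

dat = dat[, !(names(dat) %in% id_cols)] # a minus sign cannot drop columns by name

dat = dat[sample(1:nrow(dat)),] # shuffle the students

row.names(dat) = NULL # reset the row labels to 1, 2, ...

head(dat)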
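As a minimal sketch of the combining step, assume two hypothetical score tables, `midterm` and `final` (toy data, not the real exam), whose rows are still in the same student order. Binding them column-wise first and shuffling afterwards keeps each student's scores on one row.

```r
# Toy data: three students, rows in the same (alphabetical) order in both tables.
midterm = data.frame(MidQ1 = c(3, 5, 1), MidQ2 = c(2, 4, 0))
final   = data.frame(FinalQ1 = c(4, 5, 2), FinalQ2 = c(1, 6, 3))

# Combine first, so that one shuffle moves each student's whole record.
combined = cbind(midterm, final)

set.seed(1)                                   # only to make the sketch reproducible
shuffled = combined[sample(nrow(combined)), ]
row.names(shuffled) = NULL                    # drop labels that reveal the old order
shuffled
```

Shuffling each table separately before combining would pair one student's midterm with another student's final, which is why the order of operations matters.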

A standard latent trait model (LTM) won’t suffice because it only works for binary responses, and these questions have partial credit. Furthermore, a Rasch model won’t be appropriate because it assumes that all the questions have the same discriminatory ability, and finding which questions are better at discriminating between students is a key research question. More details are found here:

https://www.rdocumentation.org/packages/ltm/versions/1.1-0/topics/gpcm

First, an inspection of the model without further cleaning, looking specifically at the first question of a final exam.

The scores for question 1 range from 0 to 5 in half-point increments. Nobody scored 4.5/5. Some of these counts are small to the point of uselessness.

table(dat$Q1)

0 0.5 1 1.5 2 2.5 3 3.5 4 5

6 1 12 5 32 14 32 12 22 3

library(ltm)

mod = gpcm(dat)

summary(mod)

Coefficients:

$Q1

value std.err z.value

Catgr.1 9.144 6.788 1.347

Catgr.2 -14.322 7.073 -2.025

Catgr.3 4.426 3.353 1.320

Catgr.4 -10.542 4.108 -2.566

Catgr.5 4.480 2.289 1.957

Catgr.6 -4.524 2.282 -1.982

Catgr.7 5.649 2.492 2.267

Catgr.8 -2.991 2.287 -1.308

Catgr.9 11.566 4.745 2.438

Dscrmn 0.180 0.056 3.232

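If we reduce the number of categories to six by rounding each score to a whole number, then only the endpoints have fewer than 10 observations. Note that R's round() sends exact halves to the nearest even integer rather than always up, so 0.5 becomes 0 while 1.5 and 2.5 both become 2.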

dat = round(dat)

table(round(dat$Q1))

0 1 2 3 4 5

7 12 51 32 34 3
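A short check of the collapsing step, using only the counts from the first Q1 table: R's `round()` follows the round-half-to-even convention, and rebuilding the scores from their counts reproduces the six-category table.

```r
# Half-point scores and their counts, copied from the first Q1 table.
scores = c(0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 5)
counts = c(6, 1, 12, 5, 32, 14, 32, 12, 22, 3)

# round() sends exact halves to the nearest even integer.
round(c(0.5, 1.5, 2.5, 3.5))

# Rebuild the individual scores, round them, and tabulate again.
q1 = rep(scores, times = counts)
table(round(q1))
```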

How does the model for question 1 differ after rounding the scores to the nearest whole number?

mod = gpcm(dat)

summary(mod)$coefficients$Q1

value std.err z.value

Catgr.1 -2.25102527 1.4965783 -1.5041146

Catgr.2 -4.70417153 1.6356180 -2.8760821

Catgr.3 1.38695931 0.8272908 1.6765077

Catgr.4 0.07941232 0.7656370 0.1037206

Catgr.5 7.91144221 2.8570050 2.7691384

Dscrmn 0.32993218 0.1039574 3.1737254

The coefficients and their standard errors are smaller. Given that the small groups were removed, the smaller standard errors make sense.

That the coefficients are smaller is partly a reflection of the larger discrimination coefficient. To determine the log-odds of being in one category as opposed to an adjacent one, the discrimination coefficient is multiplied by the appropriate category coefficient. The discrimination coefficient also determines how much each unit of ability increases or decreases those same log-odds. By default, ability scores range from -4 for the weakest student in the cohort, to 0 for the average student, to +4 for the strongest student.
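In symbols, and as a sketch of the usual GPCM parameterisation rather than a quotation from the package documentation, the two statements above can be written as follows. The notation is introduced here for illustration: $\theta$ is the ability, $a$ the discrimination coefficient, $b_k$ the k-th category coefficient, and $K$ the maximum score. The empty sum for $k = 0$ is taken to be zero.

```latex
\log \frac{P(X = k \mid \theta)}{P(X = k - 1 \mid \theta)} = a\,(\theta - b_k),
\qquad k = 1, \dots, K

P(X = k \mid \theta) =
\frac{\exp\left( \sum_{j=1}^{k} a\,(\theta - b_j) \right)}
     {\sum_{m=0}^{K} \exp\left( \sum_{j=1}^{m} a\,(\theta - b_j) \right)},
\qquad k = 0, \dots, K
```

The first line is the adjacent-category log-odds; the second turns those log-odds into a probability for each possible score.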

For example, the fitted probability of an **average** student (ability = 0) getting 5/5 on the question, assuming that they got at least 4/5, would be

1 - (exp(7.911 * 0.3300) / (1 + exp(7.911 * 0.3300))) = **0.0685**.

Compare this to the observed 3/37 = 0.0811 for the entire cohort.

Repeating this for getting a score of 4+ given that the student got 3+:

Fitted: 1 - (exp(0.079 * 0.3300) / (1 + exp(0.079 * 0.3300))) = **0.4935**.

Compare this to the observed 37/69 = 0.5362.

For getting a score of 3+ given that the student got 2+:

Fitted: 1 - (exp(1.387 * 0.3300) / (1 + exp(1.387 * 0.3300))) = **0.3875**.

Compare this to the observed 69/120 = 0.5750.

For getting a score of 2+ given that the student got 1+:

Fitted: 1 - (exp(-4.704 * 0.3300) / (1 + exp(-4.704 * 0.3300))) = **0.8252**.

Compare this to the observed 120/132 = 0.9091.

And finally, for getting a score of 1+, unconditional:

Fitted: 1 - (exp(-2.251 * 0.3300) / (1 + exp(-2.251 * 0.3300))) = **0.6776**.

Compare this to the observed 132/139 = 0.9496.

The fitted probabilities can be found quickly with:

library(faraway) # for the inverse logit function ilogit()

coefs_cat = coef(mod)$Q1[1:5]

coefs_disc = coef(mod)$Q1[6]

ability = 0

prob_cond = 1 - ilogit((coefs_cat - ability) * coefs_disc)

prob_cond
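The observed proportions quoted alongside the fitted values can be reproduced from the rounded Q1 counts alone. A minimal sketch, with the counts copied from the six-category table:

```r
# Rounded Q1 score counts for scores 0, 1, ..., 5.
counts = c(7, 12, 51, 32, 34, 3)

# Number of students scoring at least 0, 1, ..., 5.
at_least = rev(cumsum(rev(counts)))

# Share of students reaching each score among those who reached the one below,
# ordered from "1+" up to "5".
obs_cond = at_least[-1] / at_least[-length(at_least)]
round(obs_cond, 4)
```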

For someone with an ability score of +4 instead of zero, subtract 4 from each category coefficient. So for a top student, the coefficients above would be 3.911, -3.921, -2.613, -8.704, and -6.251 respectively. Their counterpart probabilities would be 0.2157, 0.7848, 0.7031, 0.9465, and 0.8872 respectively.

ability = 4

prob_cond = 1 - ilogit((coefs_cat - ability) * coefs_disc)

prob_cond
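For readers without the fitted model at hand, here is a self-contained sketch of the same calculation. It hard-codes the Q1 estimates from the rounded model and uses base R's `plogis()` in place of `faraway::ilogit()`; `cond_prob` is a helper name introduced here for illustration.

```r
# Q1 estimates copied from the summary output of the rounded model.
coefs_cat  = c(-2.25102527, -4.70417153, 1.38695931, 0.07941232, 7.91144221)
coefs_disc = 0.32993218

# 1 - ilogit(x) is the same as plogis(-x).
cond_prob = function(ability) plogis(-(coefs_cat - ability) * coefs_disc)

round(cond_prob(0), 4)   # average student, categories 1 to 5
round(cond_prob(4), 4)   # top student, categories 1 to 5
```

Note that the output runs from category 1 to category 5, the reverse of the order used in the worked example.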

We can use these conditional probabilities to get marginal probabilities of score 1+, 2+, …, and therefore an expected score for an average student, or any student with ability in the -4 to 4 range.

Ncat = length(coef(mod)$Q1) - 1

coefs_cat = coef(mod)$Q1[1:Ncat]

coefs_disc = coef(mod)$Q1[Ncat + 1]

ability = 0

top = exp(c(0, cumsum(coefs_disc * (ability - coefs_cat))))

bottom = sum(top)

prob = top / bottom

prob

E_score = sum( 0:Ncat * prob)

E_score

These calculations yield an expected score of 2.61/5 for an average student, 3.85/5 for the best student, and 1.02/5 for the worst student.

Compare the observed mean of 2.59 and range of 0 to 5 for n=139 students.
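The observed mean can be recovered from the half-point score counts tabulated earlier. A quick check, using only those published counts:

```r
# Original (unrounded) Q1 scores and their counts.
scores = c(0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 5)
counts = c(6, 1, 12, 5, 32, 14, 32, 12, 22, 3)

n_students = sum(counts)                        # number of graded Q1 responses
obs_mean   = sum(scores * counts) / n_students  # count-weighted mean score

n_students
round(obs_mean, 2)
```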

For convenience and scalability, we can put the expected score calculations in a function.

get.exp.score = function(coefs, ability)

{

Ncat = length(coefs) - 1

coefs_cat = coefs[1:Ncat]

coefs_disc = coefs[Ncat + 1]

top = exp(c(0, cumsum(coefs_disc * (ability - coefs_cat))))

bottom = sum(top)

prob = top / bottom

E_score = sum( 0:Ncat * prob)

return(E_score)

}

get.discrim = function(coefs)

{

Ncat = length(coefs) - 1

coefs_disc = coefs[Ncat + 1]

return(coefs_disc)

}
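As a usage sketch, the expected-score calculation can be applied to the Q1 estimates printed earlier. So that the block runs on its own, it restates the logic of get.exp.score() under the hypothetical name `exp_score` and hard-codes the coefficients instead of reading them from the fitted model.

```r
# Compact restatement of the expected-score logic (hypothetical helper name).
exp_score = function(coefs, ability) {
  Ncat  = length(coefs) - 1
  steps = coefs[Ncat + 1] * (ability - coefs[1:Ncat])
  prob  = exp(c(0, cumsum(steps)))
  prob  = prob / sum(prob)      # probabilities of scores 0, 1, ..., Ncat
  sum(0:Ncat * prob)
}

# Q1 estimates from the rounded model: five category coefficients, then discrimination.
q1_coefs = c(-2.25102527, -4.70417153, 1.38695931, 0.07941232, 7.91144221, 0.32993218)

# Expected Q1 score (out of 5) across the ability range.
abilities = c(-4, -2, 0, 2, 4)
e_scores  = sapply(abilities, function(a) exp_score(q1_coefs, a))

round(e_scores, 2)        # on the 0 to 5 scale
round(e_scores / 5, 4)    # as a fraction of the maximum score
```

Dividing by the maximum score puts questions worth different numbers of points on a common 0 to 1 scale.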

Now we repeat this process across every question. Note that the coef() function for the gpcm model returns a list of vectors, not a matrix. Therefore, to reference the kth element, we need to use [[k]].

library(ltm)

mod = gpcm(dat)

Nquest = length(coef(mod))

ability_level = c(-4,-2,0,2,4)

Nability = length(ability_level)

E_scores_mat = matrix(NA,nrow=Nquest,ncol=Nability)

discrim = rep(NA,Nquest)

scoremax = apply(dat,2,max)

for(k in 1:Nquest)

{

discrim[k] = get.discrim(coef(mod)[[k]]) # one discrimination value per question

for(j in 1:Nability)

{

E_scores_mat[k,j] = get.exp.score(coef(mod)[[k]], ability_level[j])

}

E_scores_mat[k,] = E_scores_mat[k,] / scoremax[k] # normalize by the maximum score

}

E_scores_mat = round(E_scores_mat,4)

q_name = paste0("FinalQ",1:Nquest)

q_info = data.frame(q_name,E_scores_mat,scoremax,discrim)

names(q_info) = c("Name",paste0("Escore_at",ability_level),"max","discrim")

q_info

Name Escore_at-4 Escore_at-2 Escore_at0 Escore_at2 Escore_at4 max discrim

Q1 FinalQ1 0.2036 0.3572 0.5218 0.6666 0.7697 5 0.330

Q2 FinalQ2 0.0795 0.2585 0.7546 0.9789 0.9973 6 0.819

Q3 FinalQ3 0.0657 0.2916 0.6110 0.8681 0.9626 7 0.742

Q4 FinalQ4 0.0198 0.2402 0.7568 0.9281 0.9791 9 0.508

Q5 FinalQ5 0.1748 0.4496 0.5614 0.5870 0.5949 5 0.446

Q6 FinalQ6 0.1072 0.5685 0.9516 0.9927 0.9982 4 0.521

Q7 FinalQ7 0.3042 0.5379 0.7413 0.8465 0.9018 7 0.201

Q8 FinalQ8 0.1206 0.3451 0.6526 0.8945 0.9744 7 0.628

Q9 FinalQ9 0.1686 0.3031 0.4875 0.5827 0.6114 8 0.232

Q10 FinalQ10 0.1079 0.4182 0.7577 0.8466 0.8663 8 0.408

Q11 FinalQ11 0.2066 0.4157 0.5480 0.6187 0.6653 8 0.275
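One way to read this table, offered as a sketch and not as the only sensible summary, is to rank the questions by their discrimination estimate. The data frame `q_disc` below is built by hand from the printed values so that the block runs without the fitted model.

```r
# Discrimination estimates copied from the summary table.
q_disc = data.frame(
  Name    = paste0("FinalQ", 1:11),
  discrim = c(0.330, 0.819, 0.742, 0.508, 0.446, 0.521,
              0.201, 0.628, 0.232, 0.408, 0.275)
)

# Highest discrimination first: these questions separate students most sharply.
q_ranked = q_disc[order(q_disc$discrim, decreasing = TRUE), ]
row.names(q_ranked) = NULL
q_ranked
```

With the full `q_info` data frame available, the same `order()` call on its `discrim` column would keep the expected-score columns alongside the ranking.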

References:

Original paper on the Generalized Partial Credit Model, by Eiji Muraki (1992)
