**Deeply Trivial**, and kindly contributed to R-bloggers)

Because I can’t share data from our item banks, I’ll generate a fake dataset to use in my demonstration. For the exams I’m using for my upcoming standard setting, I want to draw a large sample of items, stratified by both item difficulty (so that I have a range of items across the Rasch difficulties) and item domain (the topic from the exam outline that is assessed by that item). Let’s pretend I have an exam with 3 domains, and a bank of 600 items. I can generate that data like this:

domain1 <- data.frame(domain = 1, b = sort(rnorm(200)))

domain2 <- data.frame(domain = 2, b = sort(rnorm(200)))

domain3 <- data.frame(domain = 3, b = sort(rnorm(200)))

The variable domain is the domain label, and b is the item difficulty. I decided to sort that varible within each dataset so I can easily see that it goes across a range of difficulties, both positive and negative.

head(domain1)

## domain b

## 1 1 -2.599194

## 2 1 -2.130286

## 3 1 -2.041127

## 4 1 -1.990036

## 5 1 -1.811251

## 6 1 -1.745899

tail(domain1)

## domain b

## 195 1 1.934733

## 196 1 1.953235

## 197 1 2.108284

## 198 1 2.357364

## 199 1 2.384353

## 200 1 2.699168

If I desire, I can easily combine these 3 datasets into 1:

item_difficulties <- rbind(domain1, domain2, domain3)

I can also easily visualize my item difficulties, by domain, as a group of histograms using ggplot2:

library(tidyverse)

item_difficulties %>%

ggplot(aes(b)) +

geom_histogram(show.legend = FALSE) +

labs(x = "Item Difficulty", y = "Number of Items") +

facet_wrap(~domain, ncol = 1, scales = "free") +

theme_classic()

Now, let’s say I want to draw 100 items from my item bank, and I want them to be stratified by difficulty and by domain. I’d like my sample to range across the potential item difficulties fairly equally, but I want my sample of items to be weighted by the percentages from the exam outline. That is, let’s say I have an outline that says for each exam: 24% of items should come from domain 1, 48% from domain 2, and 28% from domain 3. So I want to draw 24 from domain1, 48 from domain2, and 28 from domain3. Drawing such a random sample is pretty easy, but I also want to make sure I get items that are very easy, very hard, and all the levels in between.

I’ll be honest: I had trouble figuring out the best way to do this with a continuous variable. Instead, I decided to classify items by quartile, then drew an equal number of items from each quartile.

To categorize by quartile, I used the following code:

domain1 <- within(domain1, quartile <- as.integer(cut(b, quantile(b, probs = 0:4/4), include.lowest = TRUE)))

The code uses the quantile command, which you may remember from my post on quantile regression. The nice thing about using quantiles is that I can define that however I wish. So I didn’t have to divide my items into quartiles (groups of 4); I could have divided them up into more or fewer groups as I saw fit. To aid in drawing samples across domains of varying percentages, I’d probably want to pick a quantile that is a common multiple of the domain percentages. In this case, I purposefully designed the outline so that 4 was a common multiple.

To draw my sample, I’ll use the sampling library (which you’ll want to install with install.packages(“sampling”) if you’ve never done so before), and the strata function.

library(sampling)

domain1_samp <- strata(domain1, "quartile", size = rep(6, 4), method = "srswor")

The resulting data frame has 4 variables – the quartile value (since that was used for stratification), the ID_unit (row number from the original dataset), probability of being selected (in this case equal, since I requested equally-sized strata), and stratum number. So I would want to merge my item difficulties into this dataset, as well as any identifiers I have so that I can pull the correct items. (For the time being, we’ll just pretend row number is the identifier, though this is likely not the case for large item banks.)

domain1$ID_unit <- as.numeric(row.names(domain1))

domain1_samp <- domain1_samp %>%

left_join(domain1, by = "ID_unit")

qplot(domain1_samp$b)

For my upcoming study, my sampling technique is a bit more nuanced, but this gives a nice starting point and introduction to what I’m doing.

**leave a comment**for the author, please follow the link and comment on their blog:

**Deeply Trivial**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...