Balancing on multiple factors when the sample is too small to stratify

[This article was first published on ouR data generation, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Ideally, a study that uses randomization provides a balance of characteristics that might be associated with the outcome being studied. This way, we can be more confident that any differences in outcomes between the groups are due to the group assignments and not to differences in characteristics. Unfortunately, randomization does not guarantee balance, especially with smaller sample sizes. If we want to be certain that groups are balanced with respect to a particular characteristic, we need to do something like stratified randomization.

When the sample size is small and we want to guarantee balance across multiple characteristics, the task is a bit more challenging. Say we have 20 schools that we are randomizing to two groups, 10 in each, and want to make sure the groups are balanced with respect to 4 characteristics: language, poverty, location, and size. Simple stratification may not work so well. If we assume that these four characteristics are binary (e.g. either “yes” or “no”), there are 16 possible combinations. One or more of these combinations could easily be represented by a single school – so it would be impossible to randomize within each of the 16 combinations. What to do?

One possible approach is to generate all possible randomization schemes of the 20 schools, and keep only those schemes that are balanced with respect to the four characteristics. Once we have a list of acceptable randomization schemes, we can just pick one of those at random. (Of course, it is preferable if each school has close to a 50% chance of being assigned to either intervention group.)

Simulate school-level data

To start, we generate data for our 20 hypothetical schools using simstudy functions:

library(simstudy)
set.seed(125)

# define data characteristics for schools
ddef <- defData(varname = "language", formula = .3, dist = "binary")
ddef <- defData(ddef, "poverty", formula = .2, dist = "binary")
ddef <- defData(ddef, "location", formula = .5, dist = "binary")
ddef <- defData(ddef, "size", formula = .5, dist = "binary")
ddef
##     varname formula variance   dist     link
## 1: language     0.3        0 binary identity
## 2:  poverty     0.2        0 binary identity
## 3: location     0.5        0 binary identity
## 4:     size     0.5        0 binary identity
# generate schools
dt <- genData(20, ddef)

# number of schools in each combination
dt[, .N, keyby = .(language,poverty,location,size)]
##    language poverty location size N
## 1:        0       0        0    1 5
## 2:        0       0        1    0 1
## 3:        0       0        1    1 5
## 4:        0       1        0    0 1
## 5:        0       1        1    0 2
## 6:        1       0        0    0 2
## 7:        1       0        0    1 1
## 8:        1       0        1    0 1
## 9:        1       0        1    1 2

In this case, we have nine different combinations of the four characteristics, four of which include only a single school (rows 2, 4, 7, and 8). Stratification wouldn’t work necessarily work here if our goal was balance across all four characteristics.

Create randomization scenarios to assess for balance

Ideally, we would generate all possible randomization combinations and check them all for balance. If the number of total units (e.g. schools) is small, this does not pose a challenge (e.g. if N=4, then we only have six possible randomization schemes: TTCC, TCTC, TCCT, CTTC, CTCT, CCTT). However, with N=20, then there are 184,756 possible randomization schemes. Depending on the efficiency of the algorithm, it may be impractical to evaluate all the schemes. So, an alternative is to sample a subset of the schemes and evaluate those. For illustration purposes (so that you can understand what I am doing), I am using some very inefficient R code (using a loops). As a result, I cannot evaluate all possible schemes in a reasonable period of time to get this post out; I decided to sample instead to evaluate 1000 possible randomizations. (At the end of this post, I show results using much more efficient code that uses data.table and Rcpp code much more effectively - so that we can quickly evaluate millions of randomization schemes.)

To start, I create all combinations of randomization schemes:

totalSchools = 20
rxSchools = 10

xRx <- t(combn(totalSchools, rxSchools)) 

# show 5 randomly sampled combinations

sampleRows <- sample(nrow(xRx), 5, replace = FALSE)
xRx[sampleRows,]
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    2    3    5    6    7    8   10   12   14    19
## [2,]    5    6    7    8   13   14   15   16   17    18
## [3,]    1    3    4    5    7    9   12   15   17    20
## [4,]    2    3    4    5    9   11   14   15   19    20
## [5,]    3    5    6    7    8   10   11   12   15    16

Below is a function (which I chose to do in Rcpp) that converts the xRx matrix of school ids to a 20-column matrix of 1’s and 0’s indicating whether or not a school is randomized to the intervention in a particular scenario:

#include <RcppArmadillo.h>

// [[Rcpp::depends(RcppArmadillo)]]

using namespace Rcpp;
using namespace arma;

// [[Rcpp::export]]

NumericMatrix convert01(NumericMatrix xmat, int tcols) {
  
  int xrows = xmat.nrow();
  int xcols = xmat.ncol();
  
  NumericMatrix pmat(xrows, tcols);
  
  for (int i=0; i < xrows; i++) {
    for (int j=0; j < xcols; j++)  {
      pmat(i, xmat(i,j) - 1) = 1; 
    }
  } 
  return(pmat);
}
x01 <- convert01(xRx, totalSchools)

# show some rows

x01[sampleRows,]
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
## [1,]    0    1    1    0    1    1    1    1    0     1     0     1
## [2,]    0    0    0    0    1    1    1    1    0     0     0     0
## [3,]    1    0    1    1    1    0    1    0    1     0     0     1
## [4,]    0    1    1    1    1    0    0    0    1     0     1     0
## [5,]    0    0    1    0    1    1    1    1    0     1     1     1
##      [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]
## [1,]     0     1     0     0     0     0     1     0
## [2,]     1     1     1     1     1     1     0     0
## [3,]     0     0     1     0     1     0     0     1
## [4,]     0     1     1     0     0     0     1     1
## [5,]     0     0     1     1     0     0     0     0

Because the evaluation code is so inefficient, I draw 1,000 rows at random from this “intervention” matrix x01 (after converting it to a data.table).

# convert matrix to data.table
d01 <- data.table(x01)
d01[, id := .I]

ids <- sample(nrow(d01), 1000, replace = FALSE)
sampleD01 <- d01[id %in% ids]

Now we are ready to evaluate each of the 1,000 schemes. As I mentioned before, this approach is highly inefficient as the algorithm requires us to literally loop through each each combination to find the balanced ones. I have sacrificed efficiency and speed for clarity of code (I hope).

for (i in 1:1000) {
  
  dt[, grp:= t(sampleD01[i,1:20])]
  
  dx <- dt[ , .N, keyby = .(language, grp)]
  dc <- dcast(dx, language ~ grp, fill = 0, value.var = "N" )
  dc[, diff := abs(`1` - `0`)]
  
  # we declare a scheme balanced if counts differ by
  # no more than 1 school
  
  sampleD01[i, language := (sum(dc[, diff > 1]) == 0)]
  
  dx <- dt[ , .N, keyby = .(poverty, grp)]
  dc <- dcast(dx, poverty ~ grp, fill = 0, value.var = "N" )
  dc[, diff := abs(`1` - `0`)]
  
  sampleD01[i, poverty := (sum(dc[, diff > 1]) == 0)]
  
  dx <- dt[ , .N, keyby = .(location, grp)]
  dc <- dcast(dx, location ~ grp, fill = 0, value.var = "N" )
  dc[, diff := abs(`1` - `0`)]
  
  sampleD01[i, location := (sum(dc[, diff > 1]) == 0)]
  
  dx <- dt[ , .N, keyby = .(size, grp)]
  dc <- dcast(dx, size ~ grp, fill = 0, value.var = "N" )
  dc[, diff := abs(`1` - `0`)]
  
  sampleD01[i, size := (sum(dc[, diff > 1]) == 0)]
  
}

The final determination of balance is made if a scheme is balanced across all four characteristics. In this case, 136 of the 1,000 schemes were balanced based on this criterion:

sampleD01[, balanced := all(language, poverty, location, size), keyby = id]

# proportion of sampled combinations that are balanced ...

sampleD01[,mean(balanced)]
## [1] 0.136

And let’s inspect the actual balance of two randomly selected schemes - one which is balanced, and one which is not:

sTrue <- sampleD01[balanced == TRUE]
sFalse <- sampleD01[balanced == FALSE]

A balanced scheme

dtAssigned <- copy(dt)
dtAssigned[, group := as.vector(t(sTrue[sample(.N, 1), 1:20]))]

dtAssigned[, .N, keyby=.(language, group)]
##    language group N
## 1:        0     0 7
## 2:        0     1 7
## 3:        1     0 3
## 4:        1     1 3
dtAssigned[, .N, keyby=.(poverty, group)]
##    poverty group N
## 1:       0     0 9
## 2:       0     1 8
## 3:       1     0 1
## 4:       1     1 2
dtAssigned[, .N, keyby=.(location, group)]
##    location group N
## 1:        0     0 4
## 2:        0     1 5
## 3:        1     0 6
## 4:        1     1 5
dtAssigned[, .N, keyby=.(size, group)]
##    size group N
## 1:    0     0 3
## 2:    0     1 4
## 3:    1     0 7
## 4:    1     1 6

An unbalanced scheme

In this case, language and location are imbalanced, though size and poverty are fine.

dtAssigned <- copy(dt)
dtAssigned[, group := as.vector(t(sFalse[sample(.N, 1), 1:20]))]

dtAssigned[, .N, keyby=.(language, group)]
##    language group N
## 1:        0     0 8
## 2:        0     1 6
## 3:        1     0 2
## 4:        1     1 4
dtAssigned[, .N, keyby=.(poverty, group)]
##    poverty group N
## 1:       0     0 8
## 2:       0     1 9
## 3:       1     0 2
## 4:       1     1 1
dtAssigned[, .N, keyby=.(location, group)]
##    location group N
## 1:        0     0 3
## 2:        0     1 6
## 3:        1     0 7
## 4:        1     1 4
dtAssigned[, .N, keyby=.(size, group)]
##    size group N
## 1:    0     0 4
## 2:    0     1 3
## 3:    1     0 6
## 4:    1     1 7

Fast implementation with data.table and Rcpp

As I alluded to before, if we want to implement this in the real world, it would be preferable to use code that does not bog down when we want to search 100,000+ possible randomization schemes. I have written a set of R and Rcpp functions the facilitate this. (Code is available here.)

# generate all possible schemes

xperm <- xPerms(totalSchools, rxSchools, N=NULL) 

nrow(xperm)
## [1] 184756
xperm[sample(nrow(xperm), 5, replace = FALSE)]
##    V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19
## 1:  0  1  1  1  1  0  1  1  0   0   0   1   0   1   0   0   0   1   0
## 2:  1  1  0  1  1  1  1  1  0   1   0   0   0   0   1   0   1   0   0
## 3:  1  0  1  0  0  1  1  1  1   1   1   0   1   0   0   1   0   0   0
## 4:  1  1  1  0  0  1  0  1  0   1   1   1   1   0   0   1   0   0   0
## 5:  1  1  0  0  1  0  0  1  1   1   1   0   1   0   0   0   1   1   0
##    V20    id
## 1:   1 94784
## 2:   0 19535
## 3:   0 61644
## 4:   0 14633
## 5:   0 35651
# prepare data for evaluation

dtMat <- as.matrix(dt[,-1])
cc <- parse(text=attr(xperm, "varlist"))
cc
## expression(c(V1, V2, V3, V4, V5, V6, V7, V8, V9, V10, V11, V12, 
##     V13, V14, V15, V16, V17, V18, V19, V20))
# evaluate each combination

sF <-  xperm[, cppChk(eval(cc), dtMat), keyby = id]
sF[sample(nrow(sF), 5, replace = FALSE)]
##        id    V1
## 1:  15924 FALSE
## 2:  68284 FALSE
## 3: 149360 FALSE
## 4:  62924 FALSE
## 5:  14009  TRUE
# keep only the balanced schemes

sFinal <- xperm[sF$V1]
nrow(sFinal)
## [1] 7742
# randomize from the balanced schemes

selectRow <- sample(nrow(sFinal), 1)

# check balance of randomized scheme

dtAssigned <- copy(dt)
dtAssigned[, group := as.vector(t(sFinal[selectRow, -"id"]))]

dtAssigned[, .N, keyby=.(language, group)]
##    language group N
## 1:        0     0 7
## 2:        0     1 7
## 3:        1     0 3
## 4:        1     1 3
dtAssigned[, .N, keyby=.(poverty, group)]
##    poverty group N
## 1:       0     0 9
## 2:       0     1 8
## 3:       1     0 1
## 4:       1     1 2
dtAssigned[, .N, keyby=.(location, group)]
##    location group N
## 1:        0     0 5
## 2:        0     1 4
## 3:        1     0 5
## 4:        1     1 6
dtAssigned[, .N, keyby=.(size, group)]
##    size group N
## 1:    0     0 3
## 2:    0     1 4
## 3:    1     0 7
## 4:    1     1 6

To leave a comment for the author, please follow the link and comment on their blog: ouR data generation.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)