Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A question on stats.stackexchange.com reminded me of some code I wrote earlier this summer. The code provides a correspondence between the natural numbers 1 to (N choose K) and all the unique K sized combinations one could draw from N items. This relationship is know as the combinadic of an integer (and my code is pased on the reference implementation). Generating combinations is useful for permutation tests, in which one applies a test statistic on all possible allocations of treatment to an experimental pool.

Since the number of possible combinations grows extremely rapidly, realizing all possible combinations at once can be extremely memory intentsive. Using combinadics, one can trade increased execution time for lower memory usage. Since they are indexed by integers, keeping track of which combination is currently used is trivial. But since the number of combinations grows very quickly, we still need to handle extremely large integers, perhaps larger than the default integer type in R accepts. Luckily, the GMP package provides for “big ints,” with which we can write a “N Choose K” algorithm for arbitrarily large numbers:

 library(gmp) bigchoose <- function(n, k) { if (n < 1 || k < 1 || k > n) { return(as.bigz(0)) } if (n == k) { return(as.bigz(1)) } if (k > (n/2)) { k <- n - k } numer <- as.bigz(1) for (i in n:(n - k + 1)) { numer <- numer * i } denom <- as.bigz(1) for (i in 1:k) { denom <- denom * i } return(numer/denom) } 

Here are two functions to turn an integer into a vector representing a combination (a process I call decoding) and to turn a combination into an integer (encoding). As I expect that these operations will be frequent for a given N and K, these functions produce functions that take integers and combinations, respectively.

 combinadic.decoder.factory <- function(n, k) { # n, k are fixed at the start # i is the combinadic index (0 < i < n) # precompute a few sequences we'll need ks <- k:1 # the bottom of the choose tests max.combinadic <- bigchoose(n, k) function(i) { i <- as.bigz(i) if (i < 1 || >.bigz(i, max.combinadic)) { stop(paste("Combinatic out of range for", n, "choose", k )) } # part of the frequent translation to R's 1... sequences i <- i - 1 # initialize a vector to hold the values remaining <- i previous.candidate <- n combination <- numeric(k) for(j in ks) { value <- remaining + 1 while (>.bigz(value, remaining)) { current.candidate <- previous.candidate - 1 value <- bigchoose(current.candidate, j) previous.candidate <- current.candidate } remaining <- sub.bigz(remaining, value) combination[j] <- current.candidate } return(combination + 1) # translate to 1... counting } } combinadic.encoder.factory <- function(n, k) { ks <- 1:k function(encoded) { stopifnot(length(encoded) == k) encoded <- encoded - 1 # translate from 1... counting expanded <- as.bigz(0) for (i in ks) { expanded <- +.bigz(expanded, bigchoose(encoded[i], i)) } return(expanded + 1) } } 

Finally, as an illustration, the classic Lady Tasting Tea problem. There are 8 cups, 4 of which have the milk added first (for concreteness, say these are the first 4 cups). What is the distribution of correctly labeling cups as having milk added first? To answer the question, we need a test statistic to indicate how many cups were correctly labeled.

 > test.statistic <- function(cups) { + sum(cups %in% c(1, 2, 3, 4)) + } 

Apply this test statistic to each possible allocation of cups, which corresponds to a combination of size 4 taken from 8 possible units.

 > maxn <- bigchoose(8, 4) > decoder <- combinadic.decoder.factory(8, 4) > counts <- numeric(5) > names(counts) <- 0:4 > i <- 1 > while (i < (maxn + 1)) { + tmp <- test.statistic(decoder(i)) + counts[tmp + 1] = counts[tmp + 1] + 1 + i <- i + 1 + } > barplot(counts/as.numeric(maxn)) 

(NB: I’ve encountered strange results when creating lists/vectors of “big integers,” the result of encoding a combination. Use for and while loops instead.) The plot shows that while a guess by chance of one, two, or three cups are not unlikely, a random guess that results in zero or four correct cups would be extremely rare. While astute readers will notice that this result can be found analytically, it still serves as a simple demonstration of a permutation test. The test statistic employed in this example leads to an analytical result. Other test statistics may not be as simple to solve. Permutation tests using all possible combinations always generates an exact distribution for any test statistic.

This code was originally written for inclusion in RItools. Subsequently, I decided that only rare cases does one need to generate the entire null distribution of the test statistic, and in most cases sampling from possible combinations is sufficient. Future posts will address approximate approaches to permutation tests.

Code for this post.