Transition probabilities when adjacent sequence items must be different

September 24, 2012

(This article was first published on The Shape of Code » R, and kindly contributed to R-bloggers)

Generating a random sequence from a fixed set of items is a common requirement, e.g., given the items A, B and C we might generate the sequence BACABCCBABC. Often the randomness is tempered by requirements such as each item having each item appear a given number of times in a sequence of a given length, e.g., in a random sequence of 100 items A appears 20 times, B 40 times and C 40 times. If there are rules about what pairs of items may appear in the sequence (e.g., no identical items adjacent to each other), then sequence generation starts to get a bit complicated.

Let’s say we want our sequence to contain: A 6 times, B 12 times and C 12 times, and no same item pairs to appear (i.e., no AA, BB or CC). The obvious solution is to use a transition matrix containing the probability of generating the next item to be added to the end of the sequence based on knowing the item currently at the end of the sequence.

My thinking goes as follows:

  • given A was last generated there is an equal probability of it being followed by B or C,
  • given B was last generated there is a 6/(6+12) probability of it being followed by A and a 12/(6+12) probability of it being followed by C,
  • given C was last generated there is a 6/(6+12) probability of it being followed by A and a 12/(6+12) probability of it being followed by B.

giving the following transition matrix (this row by row approach having the obvious generalization to more items):

                 Second item
               A      B       C
          A    0     .5      .5
item      B   .33     0      .67
          C   .33    .67      0

Having read Generating constrained randomized sequences: Item frequency matters by Robert M. French and Pierre Perruchet (from whom I take these examples and algorithm on which the R code is based), I now know this algorithm for generating transition matrices is wrong. Before reading any further you might like to try and figure out why.

The key insight is that the number of XY pairs (reading the sequence left to right) must equal the number of YX pairs (reading right to left) where X and Y are different items from the fixed set (and sequence edge effects are ignored).

If we take the above matrix and multiply it by the number of each item we get the following (if A occurs 6 times it will be followed by B 3 times and C 3 times, if B occurs 12 times it will be followed by A 4 times and C 8 times, etc):

                 Second item
               A      B       C
          A    0      3       3
item      B    4      0       8
          C    4      8       0

which implies the sequence will contain AB 3 times when counted forward and BA 4 times when counted backwards (and similarly for AC/CA). This cannot happen, the matrix is not internally consistent.

The correct numbers are:

                 Second item
               A      B       C
          A    0      3       3
item      B    3      0       9
          C    3      9       0

giving the probability transition matrix:

                 Second item
               A      B       C
          A    0     .5      .5
item      B   .25     0      .75
          C   .25    .75      0

This kind of sequence generation occurs in testing and I wonder how many people have made the same mistake as me and scratched their heads over small deviations from the expected results.

The R code to calculate the transition matrix is straight forward but obscure unless you have the article to hand:

# Calculate expression (3) from:
# Generating constrained randomized sequences: Item frequency matters
# Robert M. French and Pierre Perruchet
# expected number of transitions
ni_nj=(item_count %*% t(item_count))/(N_total-1)
diag(ni_nj) = 0
# expected number of repeats
# Now juggle stuff around to put the repeats someplace else
s_k=n - 2*n_k
R_i=d_k / s_k
new_ij=ni_nj*(1-R) + (n_k %*% t(R_i)) +  (R_i %*% t(n_k))
tp=tc / rowSums(tc)  # relies on recycling

the following calls:

transition_count(c(6, 12, 12))
transition_prob(c(6, 12, 12))

return the expected results.

French and Perruchet provide an Excel spreadsheet (note this contains a bug, the formula in cell F20 should start with F5 rather than F6).

To leave a comment for the author, please follow the link and comment on their blog: The Shape of Code » R. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.


Mango solutions

plotly webpage

dominolab webpage

Zero Inflated Models and Generalized Linear Mixed Models with R

Quantide: statistical consulting and training




CRC R books series

Six Sigma Online Training

Contact us if you wish to help support R-bloggers, and place your banner here.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)