# A Sequence Clustering Model in R

February 22, 2012

(This article was first published on anrprogrammer » R, and kindly contributed to R-bloggers)

I’ve just released my first R package!
Over the past year and a half, I’ve been studying an obscure statistical model over ranking data (full or partial) called Mallows’ model.  It hypothesizes that a set of sequence data has a “modal” sequence about which the data cluster, and that the probability of a sequence decays with its distance from that modal sequence as $\exp(-\lambda d)$, where $d$ is the distance and $\lambda$ is estimated from the data.
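To make the shape of the model concrete, here is a minimal sketch (my own illustration, not RMallow code) of the Mallows distribution over all six orderings of three items, with the identity as the modal sequence and an arbitrary $\lambda = 1$:

```r
# All permutations of 1:3
perms <- list(c(1, 2, 3), c(1, 3, 2), c(2, 1, 3),
              c(2, 3, 1), c(3, 1, 2), c(3, 2, 1))

# Kendall's distance from the modal sequence 1:3,
# counted as the number of inverted pairs
kendall <- function(p) {
  m <- outer(p, p, ">")
  sum(m[upper.tri(m)])
}

lambda <- 1
weights <- sapply(perms, function(p) exp(-lambda * kendall(p)))
probs <- weights / sum(weights)
# The modal sequence 1:3 gets the highest probability, and the
# probability decays geometrically with distance from it
```

The modal sequence is the single most probable ordering, and every unit of distance away multiplies the probability by $e^{-\lambda}$.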

There is a closed-form solution for the $\lambda$ parameter and the modal sequence, as outlined in the third reference in the package manual. However, we often have heterogeneous data, for which it makes more sense to hypothesize multiple modal sequences. That is where this package comes in.

The RMallow package can take any set of sequence data of reasonable size (so far I have processed data sets with 2000+ rows and 20+ attributes in under a minute) and fit a multi-modal Mallows’ model to the data using an EM algorithm, as described in the first reference in the package summary.
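For intuition about what one EM iteration does, here is an illustrative E-step for a Mallows mixture (a sketch in my own notation, not RMallow’s internals): each row’s membership probability in cluster g is proportional to the cluster proportion times the Mallows likelihood of that row under cluster g’s mode and spread.

```r
# Mallows normalizing constant for sequences of length len
psi <- function(lambda, len) {
  prod((1 - exp(-(1:len) * lambda)) / (1 - exp(-lambda)))
}

# E-step sketch: p is the vector of cluster proportions, lambda the
# spread parameters, and D an n-by-G matrix of Kendall distances from
# each row to each cluster's modal sequence. Returns an n-by-G matrix
# of cluster-membership probabilities.
e_step <- function(p, lambda, D, len) {
  G <- length(p)
  # Unnormalized weight of row i in cluster g: p_g * exp(-lambda_g * D_ig) / psi
  w <- sapply(1:G, function(g) p[g] * exp(-lambda[g] * D[, g]) / psi(lambda[g], len))
  w <- matrix(w, nrow = nrow(D))
  w / rowSums(w)  # normalize so each row sums to 1
}
```

The M-step would then re-estimate the proportions, modes, and $\lambda$s from these memberships, and the two steps alternate until convergence.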

Certain conditions have to be met for the model to fit well. For example, if there are multiple modal sequences but they are very close together (as measured by Kendall’s metric, the number of adjacent transpositions, i.e. bubble-sort swaps, needed to turn one sequence into the other), Mallows’ model may simply group them all into the same “cluster.” Here is an example where I fit the model to some crude synthetic data and recover two constructed modal sequences. The two sequences are only a small distance apart relative to the maximum possible distance in this space, 20*(20-1)/2 = 190, and the model still groups the data very well!
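The bubble-sort connection can be checked directly: the number of swaps bubble sort makes while sorting a sequence equals its number of inverted pairs relative to the sorted order, which is exactly Kendall’s distance from the identity. A quick sketch, independent of RMallow:

```r
# Count the swaps bubble sort performs while sorting p
bubble_swaps <- function(p) {
  n <- length(p)
  swaps <- 0
  repeat {
    swapped <- FALSE
    for (i in 1:(n - 1)) {
      if (p[i] > p[i + 1]) {
        p[c(i, i + 1)] <- p[c(i + 1, i)]  # adjacent transposition
        swaps <- swaps + 1
        swapped <- TRUE
      }
    }
    if (!swapped) break
  }
  swaps
}

# Kendall's distance from the identity: the number of inverted pairs
inversions <- function(p) {
  m <- outer(p, p, ">")
  sum(m[upper.tri(m)])
}

bubble_swaps(c(3, 1, 4, 2, 5))  # 3
inversions(c(3, 1, 4, 2, 5))    # 3
```

For the fully reversed sequence `20:1`, both functions return 190, the maximum distance quoted above.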

The construction of the data below is as follows: I create two modal sequences for the code to find, build a good number of copies identical to each, and scramble between 1 and 10 randomly chosen positions of each copy (uniformly, which I stress is NOT how Mallows’ model perturbs sequences…). I then add a good deal of noise in the form of partial rankings (rankings with no full ordering between certain groups). The final call fits the model.

# Some synthetic data to represent the capability of RMallow
library(RMallow)
modal <- 1:20
second.modal <- c(1, 2, 3, 4, 5, 17, 7, 9, 8, 11, 10,
                  12, 13, 14, 15, 16, 6, 18, 19, 20)
partial.noise <- list()
modal.noise <- list()
second.noise <- list()
for (i in 1:300) {
  # Partial ranking noise: sampling with replacement creates ties
  partial.noise[[i]] <- sample(20, replace = TRUE)
  # Copy of modal sequence 1
  modal.noise[[i]] <- modal
  # Copy of modal sequence 2
  second.noise[[i]] <- second.modal
  # Choose between 1 and 10 random indices to scramble
  swit <- sample(20, sample(10, 1))
  swit.2 <- sample(20, sample(10, 1))
  # Reverse the values at the chosen positions
  modal.noise[[i]][swit] <- modal.noise[[i]][rev(swit)]
  second.noise[[i]][swit.2] <- second.modal[rev(swit.2)]
}
partial.noise <- do.call("rbind", partial.noise)
partial.noise <- SimplifySequences(partial.noise)
modal.noise <- do.call("rbind", modal.noise)
second.noise <- do.call("rbind", second.noise)
datas <- do.call("rbind", list(modal.noise, partial.noise, second.noise))
datas <- datas[sample(nrow(datas)), ]
test <- Mallows(datas, iter = 100, G = 2)


Now for the moment of truth: did the code find the sequences? The output is a list whose first element is the modal sequences found, second the proportion of data assigned to each cluster, third the cluster-membership probabilities for each row, and fourth a data frame with the data, cluster-membership probabilities, cluster assignments, and distance from each modal sequence.

> test[[1]]
[[1]]
[1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

[[2]]
[1]  1  2  3  4  5 17  7  9  8 11 10 12 13 14 15 16  6 18 19 20


GREAT SUCCESS! I purposely did not set a seed here because, when I first ran the code above, it did not produce what I was hoping for. I DID get this output from a function I will put into the next release, which runs the algorithm several times and selects the best fit based on the log-likelihood of the data under the model. EM algorithms are prone to local maxima, so this is a necessary fix. You may well get the output above when you run the code on your machine, depending on the state of your random number generator. If not, try running it five times or so and it will most likely work out!
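The restart strategy can be sketched generically. Here `score_fun` is assumed to extract the model’s log-likelihood from a fitted object; the `log_likelihood()` accessor in the usage comment is hypothetical, since the current release does not expose one directly.

```r
# Generic multi-start wrapper: run a fitting function several times and
# keep the fit with the highest score (e.g. log-likelihood), reducing
# the chance of settling into a poor local maximum of the EM algorithm.
best_of <- function(n_starts, fit_fun, score_fun) {
  fits <- replicate(n_starts, fit_fun(), simplify = FALSE)
  fits[[which.max(vapply(fits, score_fun, numeric(1)))]]
}

# Hypothetical usage, assuming log_likelihood() pulls the final
# log-likelihood out of a fitted Mallows object:
# best <- best_of(5, function() Mallows(datas, iter = 100, G = 2),
#                 log_likelihood)
```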

So there we have it: this package has some value in mining patterns from noisy sequence data. I will be working on implementing some solutions to increase the probability of convergence, and on extending the model to larger sequence spaces. I will also be providing a goodness-of-fit assessment of the model for data satisfying certain conditions (sample size, …). Finally, I hope to demonstrate the value of this model on real-world data in my next post.
