Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I’m always intrigued by techniques that have cool names: Support Vector Machines, State Space Models, Spectral Clustering, and an old favorite Hidden Markov Models (HMM’s). While going through some of my notes, I stumbled onto a fun experiment with HMM’s where you feed a bunch of English text into a two-state HMM and it will (tend to) discover what letters are vowels. The first thing we’ll need is a bunch of English text. You may have some sitting around, but the classic dataset is the Brown Corpus, which you can download and use `readChar` to pull out a section 40-50,000 characters long. (The Brown Corpus’ `words` file is 5.7 million characters long.) Say we have this string in `corpus`. We can lowercase everything and replace everything outside of A-Z with an underscore, and load the `HMM` library:

```a <- unlist (strsplit (gsub ("[^a-z]", "_", tolower (corpus)), ""))
letter.labels=c(“_”, letters)
library (HMM)
library (lattice)```

There are other HMM packages that you might prefer, including `hmm.discnp` and `RHmm`, but I’ve chosen to use `HMM` for this entry. Now, let’s prepare an HMM with two hidden states and our alphabet. You can let the start, transition, and emissions probabilities default, but I’ve had much better luck if the probabilities are not all perfectly flat, so I randomize them a bit:

```prob <- function (x) {x / sum (x)}  # Makes it a probability (it sums to 1)
hmm <- initHMM (c("A", "B"), letter.labels, startProbs=(prob (runif (2))),
transProbs=apply (matrix (runif (4), 2), 1, prob),
emissionProbs=apply (matrix (runif (54), 2), 1, prob))```

Then, we use the Baum Welch algorithm to update the probabilities based on repeated iteration over the corpus:

`system.time (a.bw <- baumWelch (hmm, a))`

CAUTION: On my laptop, this took about 15 minutes. The default maximum number of iterations is 100, and you can try smaller numbers to see if you’re on the right track:

`system.time (a.bw <- baumWelch (hmm, a, 5))`

would iterate at most 5 times. The results won’t be nearly as good as going, say, 50 iterations but you should see one of the two classes having larger bumps at the vowels and space. The graph at the top of the article was generated by:

`xyplot (a.bw\$hmm\$emissionProbs[1,] ~ c(1:27), scales=list(x=list(at=1:27, labels=colnames (a.bw\$hmm\$emissionProbs))), type=”h”)`

and the other state is:

`xyplot (a.bw\$hmm\$emissionProbs[2,] ~ c(1:27), scales=list(x=list(at=1:27, labels=colnames (a.bw\$hmm\$emissionProbs))), type=”h”)`

You can’t tell in advance which of the two states will reflect the vowels (and space). You can add more states and see what you get. From the reference I found, below, if you’re smart and patient you can find sensible interpretations for up to 12 states. You can get better results with more text, and for more states you may well need it.

It turns out that HMM’s are related to the Kalman Filter and are more or less discrete versions of State Space Models.

The idea for this experiment originally came from the paper, “Hidden Markov models for English”, by R. L. Cave and L. P. Neuwirth, which was in Hidden Markov Models for Speech, October 1980. I haven’t found that paper, but Prof. Mark Stamp wrote a nice introduction to HMM’s, and a student of his, Tun Tao Tsai, wrote a thesis around 2005 which mentions more details of the Cave and Neuwirth paper.

Filed under: R, Rblog, Statistics  