[This article was first published on R – Statistical Modeling, Causal Inference, and Social Science, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I’ve been talking to Michael Betancourt and Charles Margossian about implementing analytic derivatives for HMMs in Stan to reduce memory overhead and increase speed. For now, one has to implement the forward algorithm in the Stan program and let Stan autodiff through it. I worked out the adjoint method (aka reverse-mode autodiff) derivatives of the HMM likelihood (basically, reverse-mode autodiffing the forward algorithm), but it was stepwise and the connection to forward-backward wasn’t immediately obvious. So I thought maybe someone had already put a bow on this in the literature.

It was a challenging Google search, but I was rewarded with one of the best papers I’ve read in ages and by far the best thing I’ve ever read on hidden Markov models (HMM) and their application:

The paper provides elegant one-liners for the forward algorithm, the backward algorithm, the likelihood, and the derivative of the likelihood with respect to model parameters. For example, here’s the formula for the likelihood:

$L = \pi^{\top} \cdot \textrm{diag}(B_1) \cdot A \cdot \textrm{diag}(B_2) \cdot \cdots A \cdot \textrm{diag}(B_T) \cdot 1.$

where $\pi$ is the initial state distributions, $B_t$ is the vector of emission densities for the states, $A$ is the stochastic transition matrix, and $1$ is a vector of 1s. Qin et al.’s software uses an external package to differentiate the solution for $\pi$ as the stationary distribution for the transition matrix $A,$, i.e., $\pi^{\top} \cdot A = \pi^{\top}.$

The forward and backward algoritms are stated just as neatly, as are the derivatives of the likelihood w.r.t parameters. The authors put the likelihood and derivatives together to construct a quasi-Newton optimizer to fit max likelihood estimates of HMM. They even use second derivatives for estimating standard errors. For Stan, we just need the derivatives to plug into our existing quasi-Newton solvers and Hamiltonian Monte Carlo.

But that’s not all. The paper’s about an application of HMMs to single-channel kinetics in chemistry, a topic about which I know nothing. The paper starts with a very nice overview of HMMs and why they’re being chosen for this chemistry problem. The paper ends with a wonderfully in-depth discussion of the statistical and computational properties of the model. Among the highlights is the joint modeling of multiple experimental data sets with varying measurement error.

In conclusion, if you want to understand HMMs and are comfortable with matrix derivatives, read this paper. Somehow the applied math crowd gets these algorithms down correctly and cleanly where the stats and computer science literatures flail in comparison.

Of course, for stability in avoiding underflow of the densities, we’ll need to work on the log scale. Or if we want to keep the matrix formulation, we can use the incremental rescaling trick to rescale the columns of the forward algorithm and accmulate our own exponent to avoid underflow. We’ll also have to autodiff through the solution to the stationary distirbution algorithm, but Stan’s internals make that particular piece of plumbing easy to fit and also a potential point of dervative optimization. We also want to generalize to the case where the transition matrix $A$ depends on predictors at each time step through a multi-logit regression. With that, we’d be able to fit anything that can be fit with the nicely designed and documented R package moveHmm, which can already be used in R to fit a range of maximum likelihood estimates for HMMs.