I have recently restarted working with Stan and unfortunately ran into the problem that my (hierarchical) Bayesian models often produced divergent transitions. And when this happens, the warning basically only suggests to increase
1: There were X divergent transitions after warmup. Increasing adapt_delta above 0.8 may help.
2: Examine the pairs() plot to diagnose sampling problems
adapt_delta often does not help, even if one uses values such as .99. Also, I never found
pairs() especially illuminating. This is the first of two blog posts dealing with this issue. In this (the first) post I will show which Stan settings need to be changed to remove the divergent transitions (to foreshadow, these are
max_treedepth). In the next blog post I will show how reparameterizations of the model following Stan recommendations can remove divergent transitions often without the necessity to extensively fiddle with the sampler settings while at the same time dramatically improve the fitting speed.
My model had some similarities to the multinomial processing tree (MPT) example in the Lee and Wagenmakers cognitive modeling book. As I am a big fan of both, MPTs and the book, I investigated the issue of divergent transitions using this example. Luckily, a first implementation of all the examples of Lee and Wagenmakers in Stan has been provided by Martin Šmíra (who is now working on his PhD in Birmingham) and is part of the Stan example models. I submitted a pull request with the changes to the model discussed here so they are now also part of the example models (and contains a README file discussing those changes).
The example uses the pair-clustering model also discussed in the paper introducing MPTs formally . The model has three parameters,
c for cluster-storage,
r for cluster-retrieval, and
u for unique storage-retrieval. For the hierarchical structure the model employs the latent trait approach of : The group level (i.e., hyper-) parameters are estimated separately on the unconstrained space from -infinity to +infinity. Individual level parameters are added to the group means as displacements estimated from a multivariate normal with mean zero and freely estimated variance/covariance matrix. Only then is the unconstrained space mapped onto the unit range (i.e., 0 to 1), which represents the parameter space, via the probit transformation. This allows to freely estimate the correlation among the individual parameters on the unconstrained space and at the same time constrains the parameters after transformation onto the allowed range.
The original implementation employed two features that are particularly useful for models estimated via Gibbs sampling (as implemented in Jags), but not so much for the NUTS sampler implemented in Stan: (a) A scaled inverse Wishart as prior for the covariance matrix due to its computational convenience (following ) and (b) parameter expansion to move the scale parameters of the variance-covariance matrix away from zero and ensure reasonable priors.
The original implementation of the model in Stan is simply a literal translation of the Jags code given in Lee and Wagenmakers. Consequently, it retains the Gibbs specific features. When fitting this model it seems to produce stable estimates, but Stan reports several divergent transitions after warm up. Given that the estimates seem stable and the results basically replicate what is reported in Lee and Wagenmakers (Figures 14.5 and 14.6) one may wonder why not too trust these results. I can give no full explanation, so let me copy the relevant part from the
shinystan help. Important is the last section, it clearly says not to use the results if there are any divergent transitions.
Quick definition The number of leapfrog transitions with diverging error. Because NUTS terminates at the first divergence this will be either 0 or 1 for each iteration. The average value of
n_divergentover all iterations is therefore the proportion of iterations with diverging error.
Stan uses a symplectic integrator to approximate the exact solution of the Hamiltonian dynamics and when
stepsizeis too large relative to the curvature of the log posterior this approximation can diverge and threaten the validity of the sampler.
n_divergentcounts the number of iterations within a given sample that have diverged and any non-zero value suggests that the samples may be biased in which case the step size needs to be decreased. Note that, because sampling is immediately terminated once a divergence is encountered,
n_divergentshould be only 0 or 1.
If there are any post-warmup iterations for which
n_divergent = 1then the results may be biased and should not be used. You should try rerunning the model with a higher target acceptance probability (which will decrease the step size) until
n_divergent = 0for all post-warmup iterations.
My first step trying to get rid of the divergent transitions was to increase
adapt_delta as suggested by the warning. But as said initially, this did not help in this case even when using quite high values such as
.999. Fortunately, the quote above tells that divergent transitions are related to the stepsize with which the sampler traverses the posterior.
stepsize is also one of the
control arguments one can pass to Stan in addition to
adapt_delta. Unfortunately, the
stan help page is relatively uninformative with respect to the
stepsize argument and does not even provide its default value, it simply says
stepsize (double, positive). Bob Carpenter clarified on the Stan mailing list that the default value is
1 (referring to the CMD Stan documentation). He goes on:
The step size is just the initial step size. It lets the first few iterations move around a bit and set relative scales on the parameters. It’ll also reduce numerical issues. On the negative side, it will also be slower because it’ll take more steps at a smaller step size before hitting a U-turn.
The adapt_delta (target acceptance rate) determines what the step size will be during sampling — the higher the accept rate, the lower the step size has to be. The lower the step size, the less likely there are to be divergent (numerically unstable) transitions.
Taken together, this means that divergent transitions can be dealt with by increasing
adapt_delta above the default value of
.8 while at the same time decreasing the initial
stepsize below the default value of
1. As this may increase the necessary number of steps one might also need to increase the
max_treedepth above the default value of
10. After trying out various different values, the following set of control arguments seems to remove all divergent transitions in the example model (at the cost of prolonging the fitting process quite considerably):
control = list(adapt_delta = 0.999, stepsize = 0.001, max_treedepth = 20)
As this uses
R interface to stan, here the full call:
samples_1 <- stan(model_code=model, data=data, init=myinits, # If not specified, gives random inits pars=parameters, iter=myiterations, chains=3, thin=1, warmup=mywarmup, # Stands for burn-in; Default = iter/2 control = list(adapt_delta = 0.999, stepsize = 0.01, max_treedepth = 15) )
With these values the traceplots of the post-warmup samples look pretty good. Even for the
sigma parameters which occasionally have problems moving away from 0. As you can see from these nice plots,
traceplot(samples_1, pars = c("muc", "mur", "muu", "Omega", "sigma", "lp__"))