Questions on the parallel Rao-Blackwellisation


Pierre Jacob and I got this email from a student about our parallel Rao-Blackwellisation paper. Here are some parts of the questions and our answer:

Although I understand how the strategy proposed in the paper helps in variance reduction, I do not understand why you set b=1 (mentioned in Section 3.2) and why it plays no role in the comparison. If b=1 and p=4, for example, every chain (out of the 4 chains of the block) has a length of 4 samples, and there are no additional blocks. How is this length enough for the sampler to converge and sample the distribution adequately? You mention that you do 10,000 independent runs (which means you do the above procedure 10,000 times independently), but aren't all these runs equally short and inadequate? Even for p=100, aren't the chains too short?

Indeed, setting b = 1 and p "small" leads to very short chains, so approximations based on these short chains would clearly be very poor! However, we were interested in comparing the variances computed under standard IMH and block IMH, and our point there is that, for this comparison, there is no need to use b > 1. As a matter of fact, Figure 6 shows, for various b (1, 10, 100, 1000), that the variance reduction of tau_2 compared with tau_1 is pretty much the same (always around 30% in this case). This simply allowed us to produce the results very quickly. Using 10,000 independent replicates allows us to get good precision on the variance. In any practical situation where we would like to use the chains to get an actual estimate, we would use a much bigger b (typically, so that b * p = T would be more than 100,000). And we wouldn't necessarily compute 10,000 independent replicates either, depending on what kind of results we want to produce.
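To give a flavour of this kind of variance comparison, here is a minimal R sketch. It is not the block estimator of the paper: it runs a plain independent Metropolis-Hastings chain on a N(0,1) target with a N(0,4) proposal and compares, over many short independent replicates, the empirical variance of the usual ergodic average with that of a simple Rao-Blackwellisation of the accept/reject step. The target, proposal, chain length and number of replicates are illustrative choices only.

# Toy sketch: variance comparison across independent replicates of a short
# independent MH run. tau1 is the standard ergodic average; tau2 replaces the
# accept/reject indicator by its conditional expectation (a simple
# Rao-Blackwellisation, not the block estimator of the paper).
imh_run <- function(p, x0 = 0) {
  x <- x0
  tau1 <- tau2 <- 0
  for (t in 1:p) {
    y <- rnorm(1, 0, 2)                              # independent proposal
    alpha <- min(1, dnorm(y) * dnorm(x, 0, 2) /
                    (dnorm(x) * dnorm(y, 0, 2)))     # IMH acceptance probability
    tau2 <- tau2 + alpha * y + (1 - alpha) * x       # average over accept/reject
    if (runif(1) < alpha) x <- y
    tau1 <- tau1 + x                                 # standard ergodic average
  }
  c(tau1 = tau1 / p, tau2 = tau2 / p)
}

set.seed(1)
reps <- replicate(10000, imh_run(p = 100))
apply(reps, 1, var)   # empirical variances of the two estimators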

I would like to ask which applications and models that use MCMC as a simulation tool are in greater need of acceleration, and what are the characteristics of these problems that make them difficult (e.g. strongly correlated components, multimodality). Also, which sampler is preferred in each case? I have found some work by Lee, Yau, Giles, Doucet and Holmes, in which they accelerate population MCMC and SMC and use them for inference in mixture models, as a representative example of multimodal distributions (difficulty in exploring all modes). Could you suggest any other cases where you think hardware acceleration of MCMC would have an impact, and some representative problems I could work on?

As in the previous answer, you have to realise that our paper is about a formal post-processing of MCMC output, not a new way of running MCMC, nor even a way to speed it up!! We are simply reconsidering the way the MCMC output (the chain) is used for approximation purposes, and showing that reprogramming it through the abilities offered by parallel processors or even GPUs brings a free improvement, or even a decrease in computing time when the acceptance probability is the most expensive block in the computation. Thus, the paper does not advise about choices of kernels, strategies for better convergence, or even convergence checking. Once again, this is about post-processing.
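To make the "expensive acceptance probability" remark concrete, here is a minimal sketch, assuming a fork-capable system and an expensive_logtarget() stand-in of our own invention (not anything from the paper): the density evaluations for a block of proposed values are farmed out to several cores with the parallel package before the chain is updated and post-processed.

library(parallel)

# Stand-in for a costly likelihood/target evaluation (hypothetical example)
expensive_logtarget <- function(x) { Sys.sleep(0.01); dnorm(x, log = TRUE) }

p <- 8
proposals <- rnorm(p, 0, 2)                      # a block of p proposed values
logpi <- unlist(mclapply(proposals, expensive_logtarget,
                         mc.cores = 4))          # density evaluations in parallel
                                                 # (mc.cores > 1 requires forking,
                                                 #  i.e. not Windows)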

As to which applications would benefit most from acceleration, I do not see any that would not. Large-scale problems and complex likelihoods (e.g. in population genetics, graphical models, cosmology) cry out for faster implementations.

In the last paragraph of your text you also note that independent chains could be used to initialise an MCMC algorithm from several well-dispersed starting points. First, is this approach actually used in real applications to reduce variance by collecting more samples, or is it only used to detect convergence (Gelman-Rubin)? Does inference usually come from one long chain? Second, do we know that all of the chains mix at the same rate if we start them from different positions? For example, could one chain sample properly and fast from the whole space while another one samples slowly, or gets stuck? Would this mean that if we take all samples from all chains into account they would not be representative of the actual distribution (because some chains sample the space slowly or only partially)? Would this change if we started all chains from the same position?

Multiple parallel chains versus a single Markov chain is a debate that has somehow died out in the field. Mathematically, it is very hard to assess parallel chains (except as a whole) because of the dependence on the starting measure and all that. In practice, there is no loss in running parallel chains if you have the parallel abilities, and averaging across chains is harmless in the huge majority of cases. Another instance of a free lunch, in my opinion.

Starting MCMC algorithms from various starting points is common practice, both to assess convergence and to get estimates of the variance of the estimator of interest. If all the independent chains converge, then after the burn-in time you can consider your Markov chains to be in their stationary (invariant) distribution, and therefore 10 times 10,000 iterations is equivalent to 100,000 iterations; it is obviously faster to run 10 times 10,000 iterations with parallel processors. However, if it takes 50,000 iterations to converge, i.e. the burn-in time is around 50,000, then it is clearly better to run one chain for 100,000 iterations!! There cannot be any absolute rule or number in this game: it all depends on the problem at hand.
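As an illustration of that trade-off, here is a toy R sketch (a random-walk Metropolis-Hastings sampler on a N(0,1) target, with purely illustrative choices of burn-in, chain lengths and number of cores): ten chains of 10,000 iterations run in parallel each pay the burn-in once, whereas a single chain of 100,000 iterations pays it only once overall.

library(parallel)

# Toy random-walk MH on a N(0,1) target
rwmh <- function(n, x0 = 0, step = 1) {
  x <- numeric(n)
  x[1] <- x0
  for (t in 2:n) {
    y <- x[t - 1] + rnorm(1, 0, step)
    logratio <- dnorm(y, log = TRUE) - dnorm(x[t - 1], log = TRUE)
    x[t] <- if (log(runif(1)) < logratio) y else x[t - 1]
  }
  x
}

burn <- 1000
# ten dispersed-start chains of 10,000 iterations each, run on several cores
# (set mc.cores = 1 on Windows)
chains <- mclapply(1:10, function(i) rwmh(10000, x0 = rnorm(1, 0, 5)),
                   mc.cores = 4)
est_parallel <- mean(unlist(lapply(chains, function(x) x[-(1:burn)])))
# one long chain of 100,000 iterations, burn-in paid only once
est_single <- mean(rwmh(100000)[-(1:burn)])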

If your chains do not seem to mix, then you have identified an issue, which may be due to multimodality in the target distribution. In that case you cannot use the chains to produce estimates, since you cannot consider your values to be drawn from the target distribution. So, to answer your question precisely: if you start several chains, with the same transition kernel, from different starting points, you cannot have one chain mixing well over the whole space and another one mixing poorly and getting stuck, as this goes against their sharing the same (ergodic) transition kernel. However, observing that the chains get stuck is pretty common with multimodal targets. Imagine one chain that looks like it mixes well around one mode; then you run another chain and it looks like it mixes well around another mode. You then do not know how many modes there are, as there could be modes you have not found yet; so using both chains is better than using only one, but relying on them is still a very poor idea. Starting several chains from the same starting point does not seem useful to me: I cannot see any advantage over starting from different points or running a longer chain. Overall, there is not much theoretical difference between starting chains from a single point or from a single measure. Ergodicity tells us the chains should eventually free themselves from their starting value; at which speed is a problem-dependent issue.
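A standard way to detect this kind of between-chain disagreement is the Gelman-Rubin diagnostic, e.g. via the coda package (assumed installed). The toy sketch below, with an equal mixture of N(-5,1) and N(5,1) as target and a deliberately small random-walk step, produces two chains that each sit around a different mode, and hence a potential scale reduction factor far from 1.

library(coda)

# Toy bimodal target: equal mixture of N(-5,1) and N(5,1)
logmix <- function(x) log(0.5 * dnorm(x, -5) + 0.5 * dnorm(x, 5))

# Random-walk MH with a small step, so chains rarely jump between modes
rwmh_mix <- function(n, x0, step = 0.5) {
  x <- numeric(n)
  x[1] <- x0
  for (t in 2:n) {
    y <- x[t - 1] + rnorm(1, 0, step)
    x[t] <- if (log(runif(1)) < logmix(y) - logmix(x[t - 1])) y else x[t - 1]
  }
  x
}

set.seed(2)
chains <- mcmc.list(mcmc(rwmh_mix(5000, x0 = -5)), mcmc(rwmh_mix(5000, x0 = 5)))
gelman.diag(chains)   # potential scale reduction factor far above 1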

