understanding computational Bayesian statistics: a reply from Bill Bolstad

Posted on October 23, 2011 by xi'an in R bloggers | 0 Comments

[This article was first published on Xi'an's Og » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Bill Bolstad wrote a reply to my review of his book Understanding computational Bayesian statistics last week and here it is, unedited except for the first paragraph where he thanks me for the opportunity to respond, “so readers will see that the book has some good features beyond having a “nice cover”.” (!) I simply processed the Word document into an html output and put a Read More bar in the middle as it is fairly detailed. (As indicated at the beginning of my review, I am obviously biased on the topic: thus, I will not comment on the reply, lest we get into an infinite regress!)

The target audience for this book are upper division undergraduate students and first year graduate students in statistics whose prior statistical education has been mostly frequentist based. Many will have knowledge of Bayesian statistics at an introductory level similar to that in my first book, but some will have no previous Bayesian statistics course. Being self-contained, it will also be suitable for statistical practitioners without a background in Bayesian statistics.

The book aims to show that:

Bayesian statistics makes different assumptions from frequentist statistics, and these differences lead to the advantages of the Bayesian approach.
Finding the proportional posterior is easy, however finding the exact posterior distribution is difficult in practice, even numerically, especially for models with many parameters.
Inferences can be based on a (random) sample from the posterior.
There are methods for drawing samples from the incompletely known posterior.
Direct reshaping methods become inefficient for models with large number of parameters.
We can find a Markov chain that has the long-run distribution with the same shape as the posterior. A draw from this chain after it has run a long time can be considered a random draw from the posterior
We have many choices in setting up a Markov chain Monte Carlo. The book shows the things that should be considered, and how problems can be detected from sample output from the chain.
An independent Metropolis-Hastings chain with a suitable heavy-tailed candidate distribution will perform well, particularly for regression type models. The book shows all the details needed to set up such a chain.
The Gibbs sampling algorithm is especially well suited for hierarchical models.

I am satisfied that the book has achieved the goals that I set out above. The title “Understanding Computational Bayesian Statistics” explains what this book is about. I want the reader (who has background in frequentist statistics) to understand how computational Bayesian statistics can be applied to models he/she is familiar with. I keep an up-to-date errata on the book website..The website also contains the computer software used in the book. This includes Minitab macros and R-functions. These were used because because they had good data analysis capabilities that could be used in conjunction with the simulations. The website also contains Fortran executables that are much faster for models containing more parameters, and WinBUGS code for the examples in the book.

Some particular comments:

I do not think statements such as “statisticians have long known that the Bayesian approach offered clear cut advantages over the frequentist approach” or “clearly the Bayesian approach is more straightforward than the frequentist p-value” are either unbalanced or antagonistic. Wald’s findings that all admissible procedures are Bayesian goes way back. Frequentists all know that Bayesian statistics is the only coherent approach to inference (we have been telling them that for many years!). Historically, almost all applied statistics was done using frequentist methods despite this knowledge. Frequentist p-values are constructed in the parameter dimension using a probability distribution defined only in the observation dimension. Bayesian credible intervals are constructed in the parameter dimension using a probability distribution in the parameter dimension. I think that is more straightforward. The target audience for my book is people with statistical background mainly in frequentist statistics. My book aims to build on that knowledge, not antagonize them. I think the best way to convert frequentists to the Bayesian approach is to show them how it can be used for models they are familiar with.

In Chapter 6 we show and graphically illustrate the Metropolis-Hastings algorithm using both random-walk and independent candidate distributions for both one-dimensional and two-dimensional parameters. In the two-dimensional case we do show how it works blockwise as well and show how the Gibbs sampler is a special case of blockwise Metropolis-Hastings. An important aspect is how their traceplots show the different mixing properties these chains have, particularly for highly correlated parameters. The independent Metropolis-Hastings chain sampling all parameters in a single block has the best mixing properties when the candidate density has a similar shape to the target and dominates in the tails. In Chapter 7 we show how to find such a candidate density using the mode of the target and curvature of the target at the mode, but with Student’s t shape in the tails. This is shown graphically for the single dimensional case, and the steps of the algorithm in the multivariate case (including how to sample from it) are given in the Chapter Appendix. In the multivariate case, this candidate density will have similar relationship structure for the parameters as the target, as well as having heavy tails. This approach is not misguided as suggested in the review. It leads to chains having excellent mixing properties requiring only a short burn-in and minimal thinning. The review takes the position that thinning is not required, and more precise estimates can be obtained using all the draws after burn-in. Perhaps I am a dinosaur in my view that inferences should be obtained from a random sample. In particular I worry that inferences about the relations between parameters might depend on the path the Markov chain has a tendency to take through the parameter space as well as their true relationship in the posterior. This path dependence can be avoided by thinning to give an approximate random sample from the posterior. The real inference in Bayesian statistics is the entire posterior distribution, not univariate summaries. In any case, the simulations thrown away during burn-in and thinning are not data, rather they are computer generated Markov chain output. If we want larger random sample size for our inferences we can just let the Markov chain run longer. The sample autocorrelations, Gelman-Rubin statistics, and coupling with the past are used to determine burn-in and thinning required to get this random sample. In my view, the burn-in time and the number of steps omitted between consecutive elements of the thinned sample are essentially the same idea in principle. In both cases we are looking for the number of steps required for the effect of the state we start in (or are currently in) are in to have died out. That means we have converged to long run distribution. In practice, the long burn-in times relative to thinning only occur when we start a very inefficient chain a long way from the main body of the posterior. This doesn’t happen when we use the independent Metropolis-Hastings chain using the heavy tailed candidate distribution derived from the matched curvature normal.

Chapters 8 and 9 apply this method for logistic regression, Poisson regression, and proportional hazards models using both multivariate flat priors and multivariate normal priors, including how to choose multivariate normal priors. It is well known that in regression type models including predictor variables that the response does not depend on, degrades the fit and predictive effectiveness of the model. It is also well known that correlated predictor variables can steal each others effect. This means we should try to remove all suspect predictors simultaneously instead of one at a time. The variable selection issue is handled in a similar manner to hypothesis testing. We determine an (approximate) multivariate credible region for all suspect parameters and remove the whole group if the zero vector lies inside the credible region. This builds on frequentist ideas that readers would bring in with them. The credible region is based on the normal approximation to the posterior sample.

Chapter 10 shows how the Gibbs sampler works very well for hierarchical models. Including a short diversion into the empirical Bayes method for hierarchical models for comparison purposes. Perhaps my suggestion about not using improper priors in hierarchical models is a bit too strong. It is Jeffreys’ priors for scale parameters in the hierarchical model that cause the problem. This is due to the improperness caused by the vertical asymptote and the inability of the model to force this parameter away from the asymptote as shown in the Chapter appendix (I also note this fact in a footnote in Introduction to Bayesian Statistics.) The more moderate positive uniform prior although also improper does not cause this problem.