Bayesian model selection

Posted on December 7, 2010 by xi'an

[This article was first published on Xi'an's Og » R, and kindly contributed to R-bloggers.]

Last week, I received a box of books from the International Statistical Review for review. I grabbed the one whose title was most appealing to me, namely Bayesian Model Selection and Statistical Modeling by Tomohiro Ando. I am indeed interested both in the nature of testing hypotheses, or more accurately of assessing models, as discussed in my talk at the seminar on the philosophy of mathematics at Université Paris Diderot a few days ago and in the post on Murray Aitkin’s alternative, and in the computational aspects of the resulting Bayesian procedures, including evidence, the Savage-Dickey paradox, nested sampling, harmonic mean estimators, and more…

After reading through the book, I am alas rather disappointed. What I consider to be the innovative or at least “novel” parts in comparison with existing books (like Chen, Shao and Ibrahim, 2000, which remains a reference on this topic) are based on papers written by the author over the past five years, and they mostly amount to a sort of asymptotic Bayes analysis that I do not see as particularly Bayesian, because it involves the “true” distribution of the data. The coverage of the existing literature on Bayesian model choice is often incomplete and sometimes misses the point, as discussed below. This is especially true for the computational aspects, which are generally mistreated or at least not treated in a way from which a newcomer to the field would benefit. The author often takes complex econometric examples for illustration, which is nice; however, he does not pursue the details far enough for the reader to be able to replicate the study without further reading. (An example is given by the coverage of stochastic volatility in Section 4.5.1, pages 83-84.) The few exercises at the end of each chapter are rather unhelpful, often sounding more like notes than true problems. An extreme case is Exercise 6, pages 196-197, which introduces the Metropolis-Hastings algorithm within the exercise (although it has already been defined on pages 66-67) and then asks the reader to derive the marginal likelihood estimator. Another such exercise, on pages 164-165, introduces the theory of DNA microarrays and gene expression in ten lines (which are later repeated verbatim on page 227), then asks the reader to identify marker genes responsible for a certain trait. The overall feeling after reading this book is thus that the contribution to the field of Bayesian Model Selection and Statistical Modeling is too limited and disorganised for the book to be recommended as “helping you choose the right Bayesian model” (back cover).

This is rather minor, but I find the quality of the editing to be quite poor, with many typos, which makes me wonder if CRC Press is so financially pressed as to be unable to afford a copy-editor. For instance, one section of Chapter 6 covers the Gelfand-Day’s approximation instead of the Gelfand-Dey’s approximation; Gibbs sampling is spelled Gibb’s sampling in Chapter 6; the bibliography is not printed in alphabetical order and contains erroneous entries, like Jacquier, Nicolas and Rossi (2004) instead of Jacquier, Polson and Rossi (2004); Tierney and Kanade (1986) is used instead of Tierney and Kadane (1986); and some sentences are not grammatically correct (e.g., “the posterior has multimodal, because…”, page 55) or not meaningful (e.g., “the accuracy of this approximation on the tails may not be accurate”, page 49). While I do not want to discuss asymptotics, I do not understand the presentation made in the book of priors satisfying

$\log\pi(\theta) = O_p(1)\qquad\text{or}\qquad \log\pi(\theta) = O_p(n)$

where n is the sample size. Indeed, (a) this would mean priors that depend on the sample size and (b) the opposition between both cases does not seem to be exploited in the examples used in Bayesian Model Selection and Statistical Modeling. (I also think the nine assumptions for the “Bayesian central limit theorem” on page 48 are missing the definition of the value $\theta_0$.) A more important matter is the way improper priors are handled. The author recognises the difficulty with using improper priors in Bayesian model comparison; however, he resorts instead to proper priors with very large variances (see e.g. page 37), failing to mention that this is a perfect case for the Lindley-Jeffreys paradox. He further considers using Bayes factors to compare “models” that only differ via their prior distributions. I find this use difficult to defend from a Bayesian perspective, since it means picking the prior according to the data (and hence selecting the Dirac mass at the MLE as the optimal choice).
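The Lindley-Jeffreys paradox mentioned above can be illustrated with a minimal sketch. The setting is an assumption chosen for convenience, not an example from the book: a normal sample mean $\bar{x}\sim N(\theta,\sigma^2/n)$, the point null $H_0: \theta=0$, and the alternative prior $\theta\sim N(0,\tau^2)$. The function name `bayes_factor_01` and the numerical values are hypothetical. Both marginals are then normal densities in $\bar{x}$, so the Bayes factor is available in closed form.

```python
import math


def bayes_factor_01(xbar, n, sigma, tau):
    """Bayes factor B01 for H0: theta = 0 against H1: theta ~ N(0, tau^2),
    when the sample mean satisfies xbar ~ N(theta, sigma^2 / n).

    Under H0 the marginal of xbar is N(0, sigma^2/n);
    under H1 it is N(0, sigma^2/n + tau^2).
    """
    s2 = sigma ** 2 / n          # sampling variance of xbar
    v1 = s2 + tau ** 2           # marginal variance of xbar under H1
    log_m0 = -0.5 * math.log(2 * math.pi * s2) - xbar ** 2 / (2 * s2)
    log_m1 = -0.5 * math.log(2 * math.pi * v1) - xbar ** 2 / (2 * v1)
    return math.exp(log_m0 - log_m1)


if __name__ == "__main__":
    # Illustrative values (assumed): xbar is two standard errors away from 0.
    xbar, n, sigma = 0.2, 100, 1.0
    for tau in (1.0, 10.0, 100.0, 1000.0):
        b01 = bayes_factor_01(xbar, n, sigma, tau)
        print(f"tau = {tau:7.1f}   B01 = {b01:10.3f}")
```

In this toy setting the Bayes factor in favour of the null grows without bound as the prior standard deviation `tau` increases, even though the observation sits two standard errors away from zero. A “very large variance” is thus far from an innocuous default when comparing models.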

“In contrast [to maximum likelihood estimation], a Bayesian treatment of this inference problem relies solely on probability theory.” Bayesian Model Selection and Statistical Modeling, page 205

A rather confusing mistake about the nature of Bayesian testing is found on page 106. When comparing

$H_0:\,\theta\in\Theta_0\qquad\text{ versus }\qquad H_1:\,\theta\in\Theta_1$

under a prior covering both subsets, the Bayes factor is given as

$B_{01}=\dfrac{\int_{\Theta_0} f(y|\theta) \pi(\theta)\text{d}\theta}{\int_{\Theta_1} f(y|\theta) \pi(\theta)\text{d}\theta}$

instead of

$B_{01}=\dfrac{\int_{\Theta_0} f(y|\theta) \pi(\theta|\theta\in\Theta_0)\text{d}\theta}{\int_{\Theta_1} f(y|\theta) \pi(\theta|\theta\in\Theta_1)\text{d}\theta}$

and is thus missing the normalising factors for the prior restricted to each subset… This means that a small null set will never get a chance to achieve a high Bayes factor, which should have warned the author about the mistake.
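To make the missing normalisation explicit, here is a short derivation. It writes $\pi(\Theta_i)=\int_{\Theta_i}\pi(\theta)\text{d}\theta$ for the prior mass of each subset, a notation introduced here for convenience. Since $\pi(\theta|\theta\in\Theta_i)=\pi(\theta)/\pi(\Theta_i)$ on $\Theta_i$, the ratio printed in the book satisfies

$\dfrac{\int_{\Theta_0} f(y|\theta) \pi(\theta)\text{d}\theta}{\int_{\Theta_1} f(y|\theta) \pi(\theta)\text{d}\theta}=\dfrac{\pi(\Theta_0)}{\pi(\Theta_1)}\,B_{01}$

that is, it is the posterior odds of $H_0$ against $H_1$ rather than the Bayes factor. A null set with a small prior mass $\pi(\Theta_0)$ is therefore penalised through its prior odds before the data are even considered.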

“When selecting among various Bayesian models, the best one is chosen by maximising the posterior mean of the expected log-likelihood.” Bayesian Model Selection and Statistical Modeling, page 200.

The most problematic part of the book is, in my opinion, its treatment of the computational aspects. From this perspective, I consider the book to be a significant regression from Chen, Shao and Ibrahim, not to mention more recent works on the topic of model choice. On several occasions, it appears that the author is confused about those computational issues. For instance, take the first (true) introduction of the Metropolis-Hastings algorithm on page 66. As this algorithm is presented following the Gibbs sampler, the book applies the Metropolis-Hastings algorithm to the full conditional densities used in a Gibbs sampler, rather than to an arbitrary target, but fails to account for the other components of the parameter in the Metropolis-Hastings acceptance probability for the k-th component,

$\alpha(\theta_k^{(j)},\theta_k^{(j+1)})=\min\left\{1,\dfrac{f(x|\theta_k^{(j+1)})\pi(\theta_k^{(j+1)})/p(\theta_k^{(j+1)},\theta_k^{(j)})}{f(x|\theta_k^{(j)})\pi(\theta_k^{(j)})/p(\theta_k^{(j)},\theta_k^{(j+1)})}\right\}$

which makes the whole matter incomprehensible (not to mention the fact that the proposed value is denoted the same way as the next value of the Markov chain)! On page 75, we find the remark that simulating from a truncated normal can be done by simulating from the corresponding untruncated normal and discarding values outside the truncation region. Chapter 6 introduces the worst possible choice for the Gelfand-Day’s (sic!) estimator by considering the harmonic mean version with the sole warning that it “can be unstable in some applications” (page 172). Chib and Jeliazkov’s (2001) estimator is defined with a confusion between numerator and denominator (page 180). The presentation of the bridge sampling estimator in Section 6.5 misses the appeal of the method and concludes with a harmonic mean version, instead of the asymptotically optimal version well-covered by Chen, Shao and Ibrahim. The Savage-Dickey approach unsurprisingly misses the difficulty with the representation (as well as the spelling of Isabella Verdinelli’s last name). The description of Carlin and Chib’s (1995) representation of the product space via pseudo-priors (pages 190-191) does not put enough emphasis on the difficulty of calibrating those pseudo-priors. A final example of the computational difficulties within Bayesian Model Selection and Statistical Modeling is given by Section 7.1.4, where particle filtering is introduced on pages 205-206 in a very confusing manner with mixed-up indices and is then followed by a return to MCMC for the simulation of the model parameters on page 210, thus negating the whole appeal of running the filter.
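As a side illustration of one of the points above, the cost of the truncated normal recipe quoted from page 75 is easy to quantify: the acceptance probability of the naive scheme equals the normal mass of the truncation region, which can be tiny. The sketch below is an assumption-laden toy, not code from the book. It considers a standard normal truncated to $[a,\infty)$ with $a\ge 0$, the function names are hypothetical, and the inverse-cdf sampler is only one possible alternative.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # N(0, 1)


def acceptance_probability(a):
    """Probability that a N(0,1) draw lands in [a, inf), i.e. the
    acceptance rate of the draw-and-discard recipe."""
    return STD_NORMAL.cdf(-a)  # P(Z >= a) = P(Z <= -a) by symmetry


def sample_by_rejection(a, rng, max_tries=1_000_000):
    """Naive recipe: draw from N(0,1) until the draw falls in [a, inf).
    Returns the accepted draw and the number of draws it took."""
    for tries in range(1, max_tries + 1):
        x = rng.gauss(0.0, 1.0)
        if x >= a:
            return x, tries
    raise RuntimeError("no draw accepted within max_tries")


def sample_by_inverse_cdf(a, rng):
    """One-draw alternative (intended for a >= 0): invert the cdf on the
    lower tail (-inf, -a] and flip the sign, using symmetry of N(0,1)."""
    # 1 - rng.random() lies in (0, 1], so u lies in (0, P(Z <= -a)].
    u = STD_NORMAL.cdf(-a) * (1.0 - rng.random())
    return -STD_NORMAL.inv_cdf(u)


if __name__ == "__main__":
    rng = random.Random(2010)
    for a in (1.0, 2.0, 4.0):
        p = acceptance_probability(a)
        print(f"a = {a}: acceptance probability {p:.2e}, "
              f"one inverse-cdf draw {sample_by_inverse_cdf(a, rng):.3f}")
```

For a truncation point at four standard deviations, the acceptance probability is about 3 in 100,000, so the draw-and-discard recipe needs tens of thousands of normal draws per accepted value, whereas the inverse-cdf version uses a single uniform.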