Book Review: Regression and Other Stories by Gelman, Hill, and Vehtari

[This article was first published on Statistical Science & Related Matters on Less Likely, and kindly contributed to R-bloggers.]

Over a decade ago, Andrew Gelman and Jennifer Hill gave applied researchers a comprehensive book (Data Analysis Using Regression and Multilevel/Hierarchical Models) on fitting simple and complex statistical models in R, from both a classical and a Bayesian framework. Now they’re back with an updated version and a new coauthor (Aki Vehtari).

Much has changed in applied statistics since 2006 (when the book was first released). The primary software used at the time and in the book to fit Bayesian models was BUGS (Bayesian inference Using Gibbs Sampling).

However, both BUGS and some of the R code in the first edition are now outdated. The new edition updates the R code and contains intuitive instructions on how to fit simple and complex models using the probabilistic programming language, Stan (also developed by Gelman and colleagues), which is now used in several fields (even for studying wine!).

Indeed, running a Bayesian regression model in R is now as simple as:

library(rstanarm)  # provides stan_glm()
# I use the sample PlantGrowth dataset in R
pg <- PlantGrowth
# refresh = 0 suppresses the sampler's progress messages
model1 <- stan_glm(weight ~ group, data = pg, refresh = 0)
summary(model1)
plot(model1)
## Model Info:
##  function:     stan_glm
##  family:       gaussian [identity]
##  formula:      weight ~ group
##  algorithm:    sampling
##  sample:       4000 (posterior sample size)
##  priors:       see help('prior_summary')
##  observations: 30
##  predictors:   3
## Estimates:
##               mean   sd   10%   50%   90%
## (Intercept)  5.0    0.2  4.8   5.0   5.3 
## grouptrt1   -0.4    0.3 -0.7  -0.4   0.0 
## grouptrt2    0.5    0.3  0.1   0.5   0.8 
## sigma        0.6    0.1  0.5   0.6   0.8 
## Fit Diagnostics:
##            mean   sd   10%   50%   90%
## mean_PPD 5.1    0.2  4.9   5.1   5.3  
## The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
## MCMC diagnostics
##               mcse Rhat n_eff
## (Intercept)   0.0  1.0  2619 
## grouptrt1     0.0  1.0  2839 
## grouptrt2     0.0  1.0  2944 
## sigma         0.0  1.0  3068 
## mean_PPD      0.0  1.0  3707 
## log-posterior 0.0  1.0  1646 
## For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
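
For orientation, here is a minimal sketch of the classical counterpart of the same model, using only base R. It is my own illustration rather than something taken from the book; the object names fit_lm, coef_lm, and ci_lm are arbitrary, and the 80% interval level is chosen only to line up with the 10% and 90% columns in the summary above. With rstanarm's default weakly informative priors, one would expect the least-squares estimates to land close to the posterior means.

```r
# Classical least-squares fit of the same model, using base R only.
pg <- PlantGrowth
fit_lm <- lm(weight ~ group, data = pg)

# Point estimates: the control-group mean and the two treatment contrasts.
coef_lm <- coef(fit_lm)
print(round(coef_lm, 2))

# 80% confidence intervals, to compare with the 10% and 90% posterior quantiles.
ci_lm <- confint(fit_lm, level = 0.80)
print(round(ci_lm, 2))
```

Placing the two sets of estimates side by side is a quick sanity check that the priors are not dominating a dataset of this size.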

Another key difference between the first edition and the new edition is that the 2006 book attempted to cover several topics at once. It contained instructions on how to fit simple models in a classical framework all the way up to multilevel models in a Bayesian framework. The new edition attempts to reduce this information overload by splitting itself into two volumes.

The first volume (Regression and Other Stories) covers fitting simple and complex models using R and Stan. It is aimed at the applied researcher or statistician who wants a smooth introduction to fitting Bayesian models with Stan without diving into much theory or math.

A draft copy of the table of contents in the new edition can be found here, though it’s very likely that the published edition will have some changes.

The book does not cover much multilevel modeling, which is reserved for the second volume, Advanced Regression and Multilevel Models (planned for release in the next year or two).

Make no mistake: although neither book is likely to go deep into theory or math, they cannot be read without serious engagement and practice. Every chapter contains enough math for the reader to understand the concepts being discussed, with exercises at the end to solidify them.

The chapter exercises are incredibly similar to these exam questions that Gelman created for his Applied Regression class.

Questions: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, solution to 15.

I suspect that many of the blog's commenters who had difficulty with these questions would have had an easier time if they had been able to read the book first.

The new edition also covers several news stories from the past few years (some of which long-time blog readers will be familiar with) and gives readers a set of tools to think critically about these stories and about how proper statistical thinking could’ve prevented mishaps. In addition, it incorporates concepts that Gelman and colleagues have developed and solidified since the first edition was published, such as Type-M (magnitude) and Type-S (sign) errors.
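
To make the Type-M and Type-S idea concrete, here is a rough simulation sketch in base R. It is my own illustration, not code from the book: the true effect, the standard error, and the number of simulations are all hypothetical values picked to mimic a small effect studied with a noisy design. A Type-S error is a statistically significant estimate with the wrong sign, and the Type-M error is summarized here as the average factor by which significant estimates exaggerate the true effect.

```r
set.seed(1)

# Hypothetical setup: a small true effect measured with a noisy estimator.
true_effect <- 0.1
std_error   <- 0.5
n_sims      <- 100000

# Simulated estimates from repeated replications of the same study.
estimates <- rnorm(n_sims, mean = true_effect, sd = std_error)

# Keep only the replications that reach two-sided significance at the 5% level.
significant <- abs(estimates) > qnorm(0.975) * std_error

# Share of replications that are significant at all.
power <- mean(significant)

# Type S: share of significant estimates whose sign is wrong.
type_s <- mean(sign(estimates[significant]) != sign(true_effect))

# Type M: average exaggeration ratio among significant estimates.
type_m <- mean(abs(estimates[significant])) / abs(true_effect)

print(round(c(power = power, type_s = type_s, type_m = type_m), 3))
```

Under these assumed values the design has very low power, so the estimates that happen to cross the significance threshold are a noticeable fraction of the time in the wrong direction and, on average, far larger than the true effect.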

Overall, the book is quite comprehensive and will leave the reader with a rich set of tools to think critically about statistics and to fit models in the real world. I look forward to grabbing a hard copy once the book is out, which looks likely to be in the summer or fall of 2020.

Update: Looks like the book is out!
