Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I’ve been working through Gelman et al.’s otherwise excellent Bayesian Data Analysis and it’s going reasonably well. My statistics is a little bit rusty so it’s taken time to work through all of the exercises and really understand what’s going on. But I say “otherwise excellent” because yesterday I spent ages trying to figure out a problem, only to discover that the data published in the book don’t correspond to the text discussion.

The troublemaker is the SAT problem in section 5.5. The authors give values for two variables $y_j$ and $\sigma_j$ for $j=1\ldots8$ schools, rounded to integer values. Using the formulas given in the text, I then calculated estimates for the complete pooling statistics and came up with a posterior treatment effect, $y$, of 7.7 and a variance of 16.6. However the text says these values should be 7.9 and 17.4 respectively.

I puzzled over this for quite a while, thinking that maybe I’d missed a prior/posterior distinction somewhere and my estimates were supposed to be subtlely shifted. But no, when I went and checked the original data source, I found that the values were reported to 4 sig figs. Repeating the calculation with the new values gives the expected results. Grr…

Here’s the data and R-code if anyone’s interested.

### Load SAT data from the Gelman et al book and the original Rubin paper
df <- data.frame(school=LETTERS[1:8],
book.y=c(28,8,-3,7,-1,1,18,12),
rubin.y=c(28.39,7.94,-2.75,6.82,-0.64,0.63,18.01,12.16),
book.sigma=c(15,10,16,11,9,11,10,18),
rubin.sigma=c(14.9,10.2,16.3,11,9.4,11.4,10.4,17.6))

### Rearrange data into a handy form
df <- melt(df,id="school")
vals <- colsplit(df\$variable,"\\.",c("source","metric"))
df <- cbind(df,vals)
df <- df[,-2]
df <- cast(df,school+source ~ metric)

### Calculate the summary statistics
ddply(df,.(source),summarize,y=sum(y/sigma^2)/sum(1/sigma^2),sig=1/sum(1/sigma^2))