Grrr…

[This article was first published on James Keirstead » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I’ve been working through Gelman et al.’s otherwise excellent Bayesian Data Analysis and it’s going reasonably well. My statistics is a little bit rusty so it’s taken time to work through all of the exercises and really understand what’s going on. But I say “otherwise excellent” because yesterday I spent ages trying to figure out a problem, only to discover that the data published in the book don’t correspond to the text discussion.

The troublemaker is the SAT problem in section 5.5. The authors give values for two variables y_j and \sigma_j for j=1\ldots8 schools, rounded to integer values. Using the formulas given in the text, I then calculated estimates for the complete pooling statistics and came up with a posterior treatment effect, y, of 7.7 and a variance of 16.6. However the text says these values should be 7.9 and 17.4 respectively.

I puzzled over this for quite a while, thinking that maybe I’d missed a prior/posterior distinction somewhere and my estimates were supposed to be subtlely shifted. But no, when I went and checked the original data source, I found that the values were reported to 4 sig figs. Repeating the calculation with the new values gives the expected results. Grr…

Here’s the data and R-code if anyone’s interested.

### Load SAT data from the Gelman et al book and the original Rubin paper
df <- data.frame(school=LETTERS[1:8],
                   book.y=c(28,8,-3,7,-1,1,18,12),
                   rubin.y=c(28.39,7.94,-2.75,6.82,-0.64,0.63,18.01,12.16),
                   book.sigma=c(15,10,16,11,9,11,10,18),
                   rubin.sigma=c(14.9,10.2,16.3,11,9.4,11.4,10.4,17.6))
 
### Rearrange data into a handy form
df <- melt(df,id="school")
vals <- colsplit(df$variable,"\\.",c("source","metric"))
df <- cbind(df,vals)
df <- df[,-2]
df <- cast(df,school+source ~ metric)
 
### Calculate the summary statistics
ddply(df,.(source),summarize,y=sum(y/sigma^2)/sum(1/sigma^2),sig=1/sum(1/sigma^2))

To leave a comment for the author, please follow the link and comment on their blog: James Keirstead » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.