Empirical Bayes Estimation of On Base Percentage

December 30, 2010

(This article was first published on Statistically Significant, and kindly contributed to R-bloggers)

I guess you could call this On Bayes Percentage. *cough*

Fresh off learning Bayesian techniques in one of my classes last quarter, I thought it would be fun to try to apply the method. I was able to find some examples of Hierarchical Bayes being used to analyze baseball data at Wharton.

Setting up the problem
On base percentage (OBP) is probably the most important basic offensive statistic in baseball, so getting a reliable estimate of a player's true ability to get on base matters. The basic problem is that one season rarely provides a large enough sample for us to be certain of a player's ability. Even with 162 games in a season, the observed OBP may reflect luck rather than skill. Bayesian analysis will "regress" the observed OBP toward the mean: if a player has only a small number of plate appearances (PA), his observations get little weight and the estimate ends up close to the overall (MLB) average. On the other hand, if a player has many PAs, the method trusts that the results are not just luck and gives the observations a lot of weight.
We are trying to estimate the "true" OBP of each batter. Bayesian analysis treats the true OBP as random, and empirical Bayes is a method of estimating the distribution of "true" OBP from the data. OBP is times on base divided by PA. Times on base (X) for each batter is distributed binomial with n = PA and p = true OBP. We further assume that p is distributed Beta with parameters a and b. It follows that the marginal distribution of X is the beta-binomial distribution:
P(X = x) = (n choose x) * gamma(a+b) * gamma(a+x) * gamma(n-x+b) / (gamma(a) * gamma(b) * gamma(a+b+n))
where gamma is the gamma function.
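As a quick sanity check, the marginal above can be evaluated on the log scale to avoid overflowing the gamma function. A minimal Python sketch (the post's analysis was presumably done in R; the function name is mine):

```python
from math import comb, exp, lgamma

def beta_binom_pmf(x, n, a, b):
    # Marginal P(X = x) when X | p ~ Binomial(n, p) and p ~ Beta(a, b),
    # computed via log-gamma for numerical stability.
    return comb(n, x) * exp(lgamma(a + b) + lgamma(a + x) + lgamma(b + n - x)
                            - lgamma(a) - lgamma(b) - lgamma(a + b + n))

# Sanity check: the probabilities over x = 0..n sum to 1.
total = sum(beta_binom_pmf(x, 10, 2.0, 3.0) for x in range(11))
print(total)
```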
We will estimate the parameters a and b from the data (X) using its marginal distribution (the "empirical" part of empirical Bayes). To do this, I formed the likelihood of the marginal distribution over all the batters, then maximized that likelihood with respect to a and b. This is called ML-II (type II maximum likelihood).
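The ML-II step can be sketched as follows, here in Python on simulated data since the post does not include code. Parameterizing the grid search by the prior mean mu = a/(a+b) and concentration kappa = a+b is my own choice for illustration, not necessarily how the author optimized:

```python
from math import comb, lgamma, log
import random

def log_marginal(x, n, a, b):
    # Log of the beta-binomial marginal P(X = x) given in the text.
    return (log(comb(n, x)) + lgamma(a + b) + lgamma(a + x) + lgamma(b + n - x)
            - lgamma(a) - lgamma(b) - lgamma(a + b + n))

def ml2_fit(data, mus, kappas):
    # ML-II by grid search: pick (a, b) maximizing the summed log
    # marginal likelihood over all batters (players assumed independent).
    best = None
    for mu in mus:
        for kappa in kappas:
            a, b = mu * kappa, (1 - mu) * kappa
            ll = sum(log_marginal(x, n, a, b) for x, n in data)
            if best is None or ll > best[0]:
                best = (ll, a, b)
    return best[1], best[2]

# Simulate a league whose true prior is Beta(80, 175), prior mean ~0.314.
random.seed(42)
data = []
for _ in range(200):
    p = random.betavariate(80, 175)                  # a batter's true OBP
    n = random.randint(100, 600)                     # plate appearances
    x = sum(random.random() < p for _ in range(n))   # times on base
    data.append((x, n))

mus = [0.26 + 0.005 * i for i in range(25)]    # candidate prior means
kappas = [50, 100, 150, 200, 250, 300, 400]    # candidate a+b values
a_hat, b_hat = ml2_fit(data, mus, kappas)
print(a_hat / (a_hat + b_hat))                 # should land near 80/255
```

On simulated data the recovered prior mean lands close to the true 80/255, which is the same kind of check the post makes by comparing the fitted prior mean to the league average.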

The Analysis
I used data for all non-pitchers in 2010 and assume that each player is independent, so the likelihood is just the product of every player's marginal. Maximizing it with respect to a and b gives a = 83.48291 and b = 174.9038. These can be interpreted through the prior mean (what we would assume a batter's OBP to be before seeing him bat), which is a/(a+b) = 0.323. This is pretty close to the overall OBP of the league (0.330). It makes sense that the prior is lower than the league average: batters who do well get more opportunities and batters who do poorly get fewer, so the league average is biased high.
Below is a graph of the prior distribution and the updated posteriors of every batter. You can (sort of) see that the posteriors have tighter distributions than the prior does. (The posterior distribution of each batter in this case is the distribution of OBP after we have observed PA and the actual OBP.)

One way to see why this Bayesian analysis is useful is to compare the posterior means with the observed OBP. If someone has only a few PAs, their OBP could be very high or very low, which may mislead you into thinking that the batter is very good or very bad. The posterior mean, however, takes the number of PAs into account. Below is a graph comparing the two. You can see that the range of values for the posterior mean is fairly small, especially compared to actual OBP.
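Because the beta prior is conjugate to the binomial, the posterior for a batter with x times on base in n PAs is Beta(a + x, b + n - x), with posterior mean (a + x)/(a + b + n). A small sketch of the shrinkage using the fitted a and b (rounded) and two hypothetical batters of my own invention:

```python
a, b = 83.48, 174.90  # ML-II estimates from the post (rounded)

def posterior_mean(x, n):
    # Conjugate update: posterior is Beta(a + x, b + n - x).
    return (a + x) / (a + b + n)

# Two hypothetical batters with the same observed OBP of 0.450:
few_pa = posterior_mean(9, 20)      # 9-for-20: pulled almost to the prior
many_pa = posterior_mean(270, 600)  # 270-for-600: stays near 0.450
print(round(few_pa, 3), round(many_pa, 3))
```

The 20-PA batter's estimate collapses nearly to the prior mean of 0.323, while the 600-PA batter keeps most of his observed 0.450.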

Here is a list of the highest posterior mean OBP:

Batter               Posterior Mean   Actual OBP
Joey Votto               0.396           0.424
Miguel Cabrera           0.392           0.420
Albert Pujols            0.390           0.414
Justin Morneau           0.388           0.437
Josh Hamilton            0.383           0.411
Prince Fielder           0.380           0.401
Shin-Soo Choo            0.379           0.401
Kevin Youkilis           0.379           0.412
Joe Mauer                0.378           0.402
Adrian Gonzalez          0.374           0.393
Daric Barton             0.374           0.393
Jim Thome                0.373           0.412
Paul Konerko             0.373           0.393
Jason Heyward            0.373           0.393
Matt Holliday            0.371           0.390
Carlos Ruiz              0.371           0.400
Manny Ramirez            0.371           0.409
Billy Butler             0.370           0.388
Jayson Werth             0.370           0.388
Ryan Zimmerman           0.369           0.388

And here is a list of the lowest posterior mean OBP:

Batter               Posterior Mean   Actual OBP
Brandon Wood             0.252           0.175
Pedro Feliz              0.271           0.240
Jeff Mathis              0.276           0.219
Garret Anderson          0.277           0.204
Adam Moore               0.281           0.230
Josh Bell                0.285           0.224
Jose Lopez               0.286           0.270
Peter Bourjos            0.287           0.237
Aaron Hill               0.287           0.271
Tony Abreu               0.288           0.244
Koyie Hill               0.291           0.254
Gerald Laird             0.291           0.263
Drew Butera              0.291           0.237
Jeff Clement             0.291           0.237
Matt Carson              0.291           0.193
Humberto Quintero        0.292           0.262
Wil Nieves               0.292           0.244
Matt Tuiasosopo          0.292           0.234
Luis Montanez            0.292           0.155
Cesar Izturis            0.292           0.277

You can see that all of the posterior means are pulled closer to the overall mean (the good players look worse and the bad players look better). The order changes a little bit but not too much.

You can see the effect of sample size (PAs) by comparing Justin Morneau with Joey Votto. Morneau had a higher OBP, but Votto ended up with a higher posterior mean because he had more PAs (Votto had 648 while Morneau had 348). Here are their posterior distributions:

Because of the additional PAs, you can see that Votto's posterior distribution is a little tighter than Morneau's. We are more sure that Votto is excellent than we are that Morneau is.
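As a rough check, the conjugate update reproduces both the ordering and the tighter spread. The times-on-base counts below are my own approximation (PA times OBP, rounded), not figures from the post:

```python
a, b = 83.48, 174.90  # ML-II estimates from the post (rounded)

# (PA, approximate times on base reconstructed as PA * OBP, rounded)
votto = (648, 275)    # observed OBP ~0.424
morneau = (348, 152)  # observed OBP ~0.437

def posterior(n, x):
    # Posterior is Beta(a + x, b + n - x); return its mean and sd.
    a_post, b_post = a + x, b + n - x
    mean = a_post / (a_post + b_post)
    var = a_post * b_post / ((a_post + b_post) ** 2 * (a_post + b_post + 1))
    return mean, var ** 0.5

votto_mean, votto_sd = posterior(*votto)
morneau_mean, morneau_sd = posterior(*morneau)
print(round(votto_mean, 3), round(morneau_mean, 3))
```

This recovers the posterior means 0.396 and 0.388 reported in the table above, and Votto's posterior standard deviation comes out smaller than Morneau's, matching the tighter curve in the graph.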
