I gave a talk last month at SAP Labs in Palo Alto, along with Jim Porzak of ResponSys, introducing the R Statistical Language to a Business Intelligence interest group. The goal was to highlight how open source tools such as R can be used to build predictive models. The example I gave centered on baseball and a simple question: how do you measure a baseball slugger?
Michael Lewis, in Moneyball, described how the baseball analyst Bill James was frustrated that major league hitters were consistently rated by their batting averages. James wrote:
“a hitter should be measured by his success in that which he is trying to do, … create runs. It is startling, when you think about it, how much confusion there is about this.”
– Bill James, 1979 Baseball Abstract
However, since teams create runs, not batters, the only way to connect batting statistics with runs is to use team averages. The idea is that if we know which statistics predict runs at the team level, these statistics could be used to measure individual hitters.
I decided to test the value of three batting statistics myself (batting average, slugging percentage, and OPS, or on-base plus slugging) and see how well they predicted team runs, using MLB team data for the years 2000-2005 (available from baseball-databank.org). The results are shown in the three scatter plots below and, no surprise, Bill James is right: a team’s overall batting average (top-most chart) is a comparatively poor predictor of how many runs it will score in an average game. Slugging percentage (middle plot) is a slightly better predictor, and OPS (bottom plot) is the best of the three statistics I looked at, with a 0.95 correlation with runs scored. The r shown in the upper right corner of each plot is the Pearson correlation coefficient; the red lines are least-squares fits to the points.
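For readers who want to see the arithmetic behind that comparison, here is a minimal sketch in Python (not the R code from the talk). It uses the standard definitions of batting average, slugging percentage, on-base percentage, and OPS, plus a hand-rolled Pearson correlation. The function names and the three sample teams are hypothetical and made up purely for illustration; they are not values from the baseball-databank.org data.

```python
from math import sqrt


def batting_average(hits, at_bats):
    """Hits per at-bat."""
    return hits / at_bats


def slugging_percentage(singles, doubles, triples, home_runs, at_bats):
    """Total bases per at-bat."""
    total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs
    return total_bases / at_bats


def on_base_percentage(hits, walks, hit_by_pitch, at_bats, sac_flies):
    """Times on base per plate appearance counted for OBP."""
    reached = hits + walks + hit_by_pitch
    chances = at_bats + walks + hit_by_pitch + sac_flies
    return reached / chances


def ops(obp, slg):
    """On-base plus slugging."""
    return obp + slg


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up team-season figures (NOT real MLB data), one entry per team.
team_ops = [0.720, 0.765, 0.810]
team_runs_per_game = [4.3, 4.9, 5.4]

print("r(OPS, runs per game) =", pearson_r(team_ops, team_runs_per_game))
```

With real data, the same `pearson_r` call would be repeated once per statistic (batting average, slugging percentage, OPS) against runs per game, and the three coefficients compared.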
I highlighted a couple of interesting outliers in the top batting average plot: teams that achieved a high level of scoring with a comparatively low team batting average. Who were these teams? None other than Billy Beane’s 2000 and 2001 Oakland Athletics. This suggests that the A’s management may have found excess value in fielding players who, despite having slightly lower batting averages, were capable of generating runs.
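One way to pick out such outliers programmatically is to fit a least-squares line of runs against batting average and rank teams by how far they sit above it. The post does not say how the highlighted teams were chosen, so this Python sketch is an assumption about one reasonable approach. The function names are hypothetical and the sample values are invented for illustration.

```python
def least_squares_fit(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def residuals(xs, ys):
    """Observed y minus the fitted line's prediction, for each point."""
    slope, intercept = least_squares_fit(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def largest_positive_residuals(labels, xs, ys, k=2):
    """The k points that lie furthest above the fitted line."""
    ranked = sorted(zip(labels, residuals(xs, ys)),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Invented example values (NOT real team data): label, batting average, runs per game.
labels = ["Team A", "Team B", "Team C", "Team D", "Team E"]
avg = [0.255, 0.262, 0.268, 0.275, 0.281]
runs = [4.4, 4.6, 5.6, 4.9, 5.1]

for label, resid in largest_positive_residuals(labels, avg, runs):
    print(label, round(resid, 3))
```

A team with a large positive residual scores more than its batting average alone would predict, which is the pattern described above.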
These results show what Bill James and others already know: a baseball slugger should be measured not by his batting average but by OPS or other hybrid statistics that best correlate with his success at generating runs. There is nothing novel about the results of my analysis. What I hope is novel is showing how it can be done using open source tools (R and MySQL), free data (baseball-databank.org), and a few lines of code (see the sabermetrics using R page).