by Joseph Rickert
Baseball fans have been serious about statistics since Karl Pearson was a young man (although I doubt that Karl followed the game). It is not clear, though, exactly when baseball statisticians moved from doing descriptive stats into predictive analytics. In his book Super Crunchers, Ian Ayres credits Bill James of Baseball Abstract fame for getting this particular ball into play. And of course, Michael Lewis' Moneyball showed the public, in a very dramatic way, how predictive statistics could reshape the game of baseball itself.
Moreover, baseball statistics are likely to become even more dramatic in the future. For a look at how companies like Sportvision are building the infrastructure to overlay a TV image of a batter with a heat map of the sweet spots of his strike zone and the like, take a look at the presentation Graham Goldbeck made to the Bay Area User Group last October.
So, today, it is widely recognized that a deep understanding of the venerable American pastime requires a fairly high level of statistical play. Those of us blogging at Revolution Analytics have encouraged testing assumptions, looking for patterns and making predictions with several baseball-related posts, including:
- Learning R through baseball: sab-R-metrics
- Comparing baseball pitcher styles with lattice graphics
- Baseball Games: getting longer?
- Mariano Rivera’s baseball prowess, illustrated with R
- Where Ichiro hits
But now, Professor Michael Friendly of Canada’s York University has taken the analysis of the game to a new, more approachable level by wrapping up Sean Lahman’s Database into the R package Lahman. Version 2.0-1 is available on R-Forge and should be up on CRAN by the end of the week. Michael recently wrote me that “the original motivation was to provide a comprehensive, R version of data on baseball statistics for an annual project I run in a graduate course on multivariate data analysis”. (This project is actually a pretty cool thing itself. Students hone their data analysis and conference presentation skills by preparing papers on topics related to baseball statistics for presentation to the prestigious but fictional “Hotelling Society”.)
The file Lahman Data Sets (Download Lahman Data Sets) lists the 25 data sets that are available in the Lahman package. Note that some of these are of a pretty good size: Batting contains 96,600 rows and 24 columns, while Fielding is a 164,898 by 18 table. The file 400 Hitters Plot (Download 400 Hitters Plot) contains Michael's code for the following plot, which is a big league (geekier) version of the graph that appeared in the New York Times two years ago.
Looking at the way this curve breaks, I find myself vacillating between a sense of awe at the accomplishments of Ty Cobb and Rogers Hornsby and wondering how much money could be made by explaining the dip and rise of the curve. In any event, I'd like to think that the Lahman package will forever link Spring, baseball and lazy afternoons of doing some stats with R.
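For anyone who wants to start one of those lazy afternoons right away, here is a minimal sketch of how the Batting table might be explored. It is not Michael's code from the 400 Hitters Plot file. The helper name `top_average_by_year` and the 400 at-bat cutoff are my own assumptions for illustration (the official qualification rule for a batting title is different), and the dimensions you see will depend on which version of the package you have installed.

```r
# Hypothetical helper (not part of the Lahman package): for each season,
# find the highest batting average among players with at least `min_ab`
# at-bats. The 400 at-bat cutoff is an arbitrary assumption.
top_average_by_year <- function(batting, min_ab = 400) {
  # A player traded mid-season has several rows (stints) in one year,
  # so add up at-bats and hits per player and season first.
  totals <- aggregate(cbind(AB, H) ~ playerID + yearID,
                      data = batting, FUN = sum)
  totals <- totals[totals$AB >= min_ab, ]
  totals$BA <- totals$H / totals$AB
  # Keep the best average in each season, ordered by year.
  best <- aggregate(BA ~ yearID, data = totals, FUN = max)
  best[order(best$yearID), ]
}

# Run against the real data only if the package is installed,
# e.g. after install.packages("Lahman").
if (requireNamespace("Lahman", quietly = TRUE)) {
  data("Batting", package = "Lahman")
  print(dim(Batting))  # rows and columns; varies with the package version

  best <- top_average_by_year(Batting)
  plot(best$yearID, best$BA, type = "l",
       xlab = "Season", ylab = "Best batting average")
  abline(h = 0.400, lty = 2)  # reference line at .400
}
```

Summing over stints before dividing matters here: averaging the per-stint batting averages would give the wrong figure for players who changed teams during a season.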
Michael also indicated that he would welcome contributions to the Lahman package project. So, if you have some examples that you would like to share, or you would like to write some code, please sign up on R-Forge.