Analyzing baseball data with R

November 27, 2013

(This article was first published on Milano R net, and kindly contributed to R-bloggers)

This week, the post is an interview with Max Marchi. Max is the author, with Jim Albert, of the book "Analyzing baseball data with R".

Hi, Max. Welcome back to MilanoR. Last time, you wrote a series of articles for us about maps with R. Now you're here as the author of a book. How was this idea born?
Some time ago CRC Press sent a call for proposals to several mailing lists. They were accepting suggestions for books (for their R Series) on three main themes, one of which was “Applications of R to specific disciplines”. The examples they suggested were biology, epidemiology, genetics, engineering, finance, and the social sciences. But I thought, “Why not baseball?”

R is very popular among statisticians, but it's not as widespread a programming language as Java or C. At the same time, baseball is not very popular in Italy, and only a few people know it. You wrote a book about baseball and R. A gamble?
Not exactly. I definitely wasn’t thinking about selling copies in Italy, but I thought the book could be of some interest to baseball fans in the United States, especially those wanting to dip their toes in a field that is growing in popularity.

In fact, data analysis is very popular in baseball. Tell us more about that.
Well, baseball features what is probably the perfect combination for a data analyst: a long history of data collection, a season consisting of 162 games per team, and games that progress in discrete events, which makes the analysis easier. And then, a couple of years ago, a big movie was made about that (based on a best-selling book), starring Brad Pitt.

And is R popular for analyzing baseball data? What software is most often used to analyze sports data?
I believe many of the guys doing baseball data analysis have more of an IT background than a statistics one, so a lot of them use languages not directly related to stats, such as SQL and Python. Some of them told me they were thinking about learning R, so a book featuring baseball examples is just what they were looking for.
While writing the introduction I surveyed people working as analysts inside front offices of Major League Baseball teams, and most of them mentioned R as one of their tools.

The book is co-written with Jim Albert. Tell us about this collaboration.
Well, this is one of the great turns of luck that happen once in a while. It happened that the editor of the series, John Kimmell, had been the editor for the book Curve Ball, also co-authored by Jim, back in 2003, a very successful book on statistics applied to baseball. Can you believe that was the first book I read on the subject?
Well, John asked me if I would be fine if they gave me Jim as a teammate. From my perspective it was the perfect match: it was the first time I was writing a book, and I definitely needed an expert guide (just look at Jim’s body of work!).
In the third millennium, working with a guy who lives more than 4,000 miles away is not so difficult: we frequently exchanged emails, and we had a couple of videochats along the way.

What about using R to analyze data in other sports, worldwide and specifically in Italy?
Other sports are catching up. More and more frequently you see ads for open positions for analysts in NBA front offices, so basketball is joining the numbers revolution. Hockey and (American) football are in the mix as well. When you say sport in Italy, you’re basically saying soccer, and something is happening there too: if you take a look at the Opta Sports website and/or follow their Twitter handles, you get an idea of what’s going on.
I don’t know much about the situation of sports data analysis in Italy, but I feel there’s not much around. Unfortunately that’s not just true for sports: you see many more job ads for statisticians in the UK or in the US than here.

Let’s get into the book. What kind of knowledge is expected from the audience? Should readers be a bit familiar with R? What about baseball and baseball data analysis?
Having used R previously is not a prerequisite for reading the book. We devote one full chapter to explaining the basics, plus one dedicated to basic plots. On the other hand, we assume knowledge of how the game of baseball works. For those who know baseball but not sabermetrics (that’s how baseball analysis is often referred to), a bunch of initial chapters (one describing the publicly available datasets, one on how to quantify the events on the field in terms of runs, and one on the translation from runs to wins) should do the job.

Events in terms of runs, translation from runs to wins… That’s a bit obscure for the uninitiated.
OK, I’ll try to make it simple. In sports your goal is winning, thus the goal for the sports data analyst is to assess how much a player helps his/her team win. Ideally you would want to state “Player X is responsible for Y% of team Z’s wins”. Doing it directly is a nearly impossible task, but there are indirect ways. Generally teams win by outscoring opponents, thus scoring a lot of runs (in baseball), points (basketball, American football), or goals (hockey, soccer) in a season (and obviously allowing few of them) is highly correlated with winning games. So you are trying to give fair credit to players for their contribution to the runs/points/goals scored and prevented by the team.
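
To make the "runs to wins" idea concrete, here is a minimal sketch of one well-known conversion, Bill James's Pythagorean expectation, which estimates a team's winning percentage from runs scored and runs allowed. It is offered only as an illustration and is not necessarily the method used in the book; the function name `pythag_win_pct` and the team totals are hypothetical.

```r
# Pythagorean expectation: estimated winning percentage from
# runs scored and runs allowed. The exponent 2 is the classic choice.
# (pythag_win_pct is a hypothetical helper name, not from the book.)
pythag_win_pct <- function(runs_scored, runs_allowed, exponent = 2) {
  runs_scored^exponent / (runs_scored^exponent + runs_allowed^exponent)
}

# Hypothetical team totals (illustrative numbers, not real data)
runs_scored  <- 800
runs_allowed <- 700

win_pct       <- pythag_win_pct(runs_scored, runs_allowed)
expected_wins <- 162 * win_pct   # 162 games in a season

round(win_pct, 3)        # 0.566
round(expected_wins, 1)  # 91.8
```

With a relation like this in hand, crediting a player with a number of runs created or prevented can be turned, at least roughly, into an estimate of wins added.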

If you had to choose an example from your book, which code chunk would you share with the readers of this blog?
The good news is that all of the code used in the book is available on GitHub for everyone. The second piece of good news is that Jim and I are keeping a companion blog with even more code!

This is great!
This is the R essence, right?
Having said that, I’ll probably have different suggestions depending on the readers.

For those who are familiar with R but have struggled with getting their baseball data in a ready-for-analysis format, I’d point to code for performing the whole process (downloading and parsing) in R.
IT guys who already have well-organized databases would be more interested in going through the step-by-step examples for creating advanced plots.
Plus there are the chapters that introduce baseball data analysis that are suitable for the uninitiated, and then there’s the one dedicated to simulation… It’s my (and Jim’s) book, so I love every part of it!
By the way, on page 157 we show code for this chart.

I know it’s usually not a good idea to use a background image in a scatter plot (or any kind of chart for that matter), but here is one possible exception, as the background image is actually more useful as a reference than the grid.
And in R, it’s just a few lines of code (again, readers who want to run this in their R console will find the relevant files in the GitHub repository).

library(jpeg)
library(ggplot2)
 
# load the Comerica Park diagram
diamond <- readJPEG("Comerica.jpg")
 
# spray chart overlaid on jpeg image
# (stat_binhex() needs the hexbin package to be installed)
ggplot(cabrera, aes(hitx, hity)) +
  coord_equal() +
  annotation_raster(diamond, -310, 305, -100, 480) +
  stat_binhex(alpha = .9, binwidth = c(5, 5)) +
  scale_fill_gradient(low = "grey70", high = "black")

Neat, isn’t it? A background image, binning for a better visualization of overlapping data, plus some transparency, so that the field of play is seen behind the data points. The final line isn’t even necessary: it was needed for the book as it’s printed in black and white.

Is there a suggestion you’d give to someone who wants to write a book about R?
Are you still reading this? Start writing right now!
No, that’s not true actually. You definitely need a good plan laid out before starting to type on your keyboard. The publisher asked us for a full table of contents (and they submitted it to reviewers) before giving us the green light.
And the other important thing is having bright people reviewing your book as you are writing it. Our publisher definitely found us a number of smart guys who helped a lot with their suggestions and critiques.
Today you don’t even need a publisher to get your book done, as there are many print-on-demand services out there. But if you choose to go that way make sure to have a bunch of people willing to go through your TOC and your chapters as you write them. You may even think about making chapters publicly available as you write them, to get the wisdom of the crowds at your disposal.
Finally, as is probably true for books in general, reading a lot of R stuff is certainly going to help. I go to R-bloggers every day and read the good stuff coming out on the several blogs dedicated to R, including this one.
And now R-addicted sports fans have a new book to read.

