NESSIS 2013: The future of sports statistics is here!

October 24, 2013

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Joseph Rickert

We have been following sports statistics regularly on the Revolutions Blog, with quite a few sports-related posts this year. In a post back in April about the Lahman R package for baseball statistics, I speculated that baseball was poised to move from Moneyball-style predictive analytics to real-time descriptive stats, such as strike zone heat maps overlaid on TV images of batters swinging away. Sports statistics, however, is moving much more quickly than I imagined. Apparently, the NBA is blowing by this milestone and setting up to do real-time predictions.

Recently Mark Glickman, one of the organizers of NESSIS 2013 (the New England Symposium on Statistics in Sports), sent me links to the slides and videos of the presentations made at the conference. There are several excellent presentations here, but I was astounded by Dan Cervone's presentation on "State of Transition: Estimating Real-Time Expected Possession Value in the NBA with a Spatiotemporal Transition Model and Player Tracking Data".

Dan, a Harvard graduate student, describes how he and his fellow researchers are using data from an optical tracking system, developed by STATS and scheduled to be installed in all 30 NBA arenas, to build predictive state transition models. The system tracks the 2D locations of all 10 players on the court, as well as the 3D position of the ball, by taking 25 images per second. Using the 800 million data points generated from only 515 games, the Harvard researchers are trying to answer questions like "How many points is a team expected to score given the spatial evolution of its possession up to time t?"

EPV = E[X | F(t)], where X = number of points scored on this possession (unknown) and F(t) = space-time information of the possession up to time t.
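To make the conditional expectation concrete, here is a minimal Python sketch of the idea, not the model from the talk. It assumes a drastic simplification: the possession history F(t) is summarized by a single coarse state (a Markov assumption), and every state name and probability below is invented purely for illustration. Under those assumptions, the EPV of a state is the probability-weighted average of the EPVs of the states that can follow it.

```python
# Toy illustration of EPV = E[X | F(t)] under a simplifying Markov assumption.
# All states and probabilities are hypothetical; none come from the NESSIS
# talk or from real player tracking data.

# Points scored (X) when a possession ends in each terminal state.
TERMINAL_POINTS = {"made_three": 3.0, "made_two": 2.0, "no_score": 0.0}

# Invented transition probabilities between coarse possession states.
# Each row sums to 1.
TRANSITIONS = {
    "perimeter": {"perimeter": 0.40, "post": 0.20, "drive": 0.15,
                  "made_three": 0.09, "no_score": 0.16},
    "post": {"perimeter": 0.30, "made_two": 0.35, "no_score": 0.35},
    "drive": {"perimeter": 0.20, "made_two": 0.45, "no_score": 0.35},
}


def expected_possession_value(transitions, terminal_points,
                              tol=1e-12, max_iter=10_000):
    """Return expected points from every state by fixed-point iteration."""
    # Terminal states are worth their points; live states start at zero.
    epv = {state: 0.0 for state in transitions}
    epv.update(terminal_points)
    for _ in range(max_iter):
        biggest_change = 0.0
        for state, row in transitions.items():
            # Expected value of the next state, weighted by its probability.
            new_value = sum(p * epv[nxt] for nxt, p in row.items())
            biggest_change = max(biggest_change, abs(new_value - epv[state]))
            epv[state] = new_value
        if biggest_change < tol:
            break
    return epv


epv = expected_possession_value(TRANSITIONS, TERMINAL_POINTS)
for state in TRANSITIONS:
    print(f"EPV({state}) = {epv[state]:.3f}")
```

In this toy setting the EPV changes only when the possession moves between a handful of states. The appeal of tracking data is that F(t) can be far richer than that, so the estimate can be updated continuously as players and the ball move.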

The following graph shows spatial effect surface plots for some San Antonio Spurs players. These surfaces are components of the predictive model.


Just how big this kind of modeling is expected to be can be inferred from the remarks made by Mike Zarren, Assistant GM of the Boston Celtics, at the beginning of Dan's presentation. Speaking about plans for the continued availability of the data, Mr. Zarren says, "I've talked with people on both sides, at the league and also at Stats, and both are still interested in researchers getting some access to this data, but exactly what the model looks like is still up to debate." My guess is that there will be some serious money riding on this data and the predictive models based on it.

All of the NESSIS presentations exhibit a fairly high level of statistical play. In addition to Dan's presentation, there are four more basketball-related studies; one each on the Boston Marathon, soccer and tennis; one on football that uses Random Forest models to estimate win probabilities on each play during a game; and three presentations on baseball, including an R-based analysis of "streakiness" by Jim Albert, longtime R contributor and editor of the Journal of Quantitative Analysis in Sports. At the beginning of his talk, Jim recounts how early in his career he was surprised to find an analysis of baseball data in a paper by Brad Efron and Carl Morris on Stein's Paradox in Statistics. At that time, Jim remarks, "you don't write about sports to get tenure… maybe times have changed". Maybe they have.
