Predicting who will win an NFL game at half time

[This article was first published on Fellgernon Bit - rstats, and kindly contributed to R-bloggers].

It was great to have a little break for Spring break, although the weather didn’t feel like spring at all! During the early part of the break I worked on my final project for Jeff Leek’s data analysis class, known here as 140.753. Continuing my previous posts on the topic, this time I’ll share the results of my final project.

At the beginning of the course we had to submit a project plan (more like a proposal), and in mine I announced my interest in looking into some sports data. I included a few links to Brian Burke’s Advanced NFL Stats site (Burke); at the time I didn’t know that Burke’s site described in detail a lot of the information I would end up using.

My final project involved splitting NFL games by half and then using only the play-by-play data from the first half to predict whether team A or team B would win the game. My overall goal was to have some fun with sports data, which I had never looked at before, but also to come up with something I would personally use in the future. So, why split games by half? I would like to know at half time whether I should keep watching a game or not. Having a tool to help me decide would be great: if the team I’m rooting for has a high chance of losing (or winning), ideally I would switch to doing something else. A related question that I didn’t try to answer is which half is worth watching; this would be meaningful if you only have time to watch one of them.

To truly satisfy my goals, it wasn’t enough to just build a predictive model. That is why I also built a web application using the shiny package (RStudio, Inc., 2013). It was the first time I built a shiny app, but thanks to the good manual and some examples on GitHub from John Muschelli, like his Shiny_model, it wasn’t so bad. I invite you to test and browse my shiny app at http://glimmer.rstudio.com/lcolladotor/NFLhalf/. It could be improved by adding functions that scrape live data for the 2013 season so you don’t have to input all the required variables using the sliders. Anyhow, I’m happy with the result.

The entire project’s code, EDA steps, shiny app, and report are available via GitHub in my repository (lcollado753). While the details are in the report, I’ll give a brief summary here.

Basically, I summarized the play-by-play data for all NFL games from the 2002 to 2012 seasons as provided by Burke (Burke, 2010). I used some of the variables Burke uses (Burke, 2009) and some others, like the score difference, which team starts the second half, and the game-day winning percentages of both teams. After exploring the data, I discarded the years 2002 to 2005. I then trained a model using the 2006 to 2011 data and did some quick model selection. Note that I’m not adjusting by opponent the way Burke did (Burke, 2009-2), partly because I was running out of time, but also because the model already uses the current game-day winning percentages of both teams to account for the two teams’ strengths. I evaluated the model using the 2012 data and, after seeing that it worked decently, trained a second model using the data from 2006 to 2012 so it can be used for the 2013 season. These two trained models are the ones available in the shiny app I made.
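The exact model specification is in the report; as a rough sketch of the approach, a logistic regression of the game outcome on first-half summaries looks like the following. The variable names and the simulated data here are my own illustration, not the actual variables from the project:

```r
## Simulated stand-in for the real first-half summaries
## (variable names are illustrative, not the project's actual ones)
set.seed(753)
n <- 500
games <- data.frame(
    score.diff = rnorm(n, mean = 0, sd = 10),  # halftime score difference (team A - team B)
    starts.2nd = rbinom(n, 1, 0.5),            # does team A start the second half?
    wpct.A = runif(n),                         # team A's game-day winning percentage
    wpct.B = runif(n)                          # team B's game-day winning percentage
)
## Simulate the outcome so that it depends on the predictors
linpred <- with(games, 0.2 * score.diff + 0.3 * starts.2nd + 2 * (wpct.A - wpct.B))
games$win <- rbinom(n, 1, plogis(linpred))

## Fit a logistic regression, analogous in spirit to the project's model
fit <- glm(win ~ score.diff + starts.2nd + wpct.A + wpct.B,
    data = games, family = binomial)

## Predicted winning probabilities for team A
head(predict(fit, type = "response"))
```

With `type = "response"`, `predict()` returns winning probabilities on the 0 to 1 scale, which is what the shiny app reports.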

In the report I didn’t include ROC curves (a big miss), so here they are. The code below is heavily based on a post on GLMs (denishaine, 2013) and is written so that you can easily reproduce it if you have cloned my repository for the 140.753 class (lcollado753).

First, some setup steps.

<span class="c1">## Specify the directory where you cloned the lcollado753 repo</span>
maindir <span class="o"><-</span> <span class="s">"whereYouClonedTheRepo"</span>
<span class="c1">## Load packages needed</span>
suppressMessages<span class="p">(</span>library<span class="p">(</span>ROCR<span class="p">))</span>
library<span class="p">(</span>ggplot2<span class="p">)</span>

<span class="c1">## Load fits.</span>
<span class="c1">## Remember that 1st one used data from 2006 to 2011</span>
<span class="c1">## and the 2nd one used data from 2006 to 2012.</span>
load<span class="p">(</span>paste0<span class="p">(</span>maindir<span class="p">,</span> <span class="s">"/lcollado753/final/nfl_half/EDA/model/fits.Rdata"</span><span class="p">))</span>

Next, I make the ROC curves for both trained models using the data they were trained on. These should look quite good, since each model is predicting the same data that was used to build it.

<span class="c1">## Make the ROC plots</span>

<span class="c1">## Simple list where I'll store all the results so I can compare the ROC plots later on</span>
all <span class="o"><-</span> list<span class="p">()</span>

<span class="c1">## Loop over both trained models</span>
<span class="kr">for</span><span class="p">(</span>i <span class="kr">in</span> <span class="m">1</span><span class="o">:</span><span class="m">2</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">## Predict on the original data</span>
    pred <span class="o"><-</span> predict<span class="p">(</span>fits<span class="p">[[</span>i<span class="p">]])</span>
    
    <span class="c1">## Subset original data (remove NA's)</span>
    data <span class="o"><-</span> fits<span class="p">[[</span>i<span class="p">]]</span><span class="o">$</span>data
    data <span class="o"><-</span> data<span class="p">[</span>complete.cases<span class="p">(</span>data<span class="p">),]</span>
    
    <span class="c1">## Construct prediction function</span>
    pred.fn <span class="o"><-</span> prediction<span class="p">(</span>pred<span class="p">,</span> data<span class="o">$</span>win<span class="p">)</span>
    
    <span class="c1">## Get performance info</span>
    perform <span class="o"><-</span> performance<span class="p">(</span>pred.fn<span class="p">,</span> <span class="s">"tpr"</span><span class="p">,</span> <span class="s">"fpr"</span><span class="p">)</span>
    
    <span class="c1">## Get ready to plot</span>
    toPlot <span class="o"><-</span> data.frame<span class="p">(</span>tpr <span class="o">=</span> unlist<span class="p">(</span>slot<span class="p">(</span>perform<span class="p">,</span> <span class="s">"y.values"</span><span class="p">)),</span> fpr <span class="o">=</span> unlist<span class="p">(</span>slot<span class="p">(</span>perform<span class="p">,</span> <span class="s">"x.values"</span><span class="p">)))</span>
    all <span class="o"><-</span> c<span class="p">(</span>all<span class="p">,</span> list<span class="p">(</span>toPlot<span class="p">))</span>

    <span class="c1">## Make the plot</span>
    res <span class="o"><-</span> ggplot<span class="p">(</span>toPlot<span class="p">)</span> <span class="o">+</span> geom_line<span class="p">(</span>aes<span class="p">(</span>x<span class="o">=</span>fpr<span class="p">,</span> y<span class="o">=</span>tpr<span class="p">))</span> <span class="o">+</span> geom_abline<span class="p">(</span>intercept<span class="o">=</span><span class="m">0</span><span class="p">,</span> slope<span class="o">=</span><span class="m">1</span><span class="p">,</span> colour<span class="o">=</span><span class="s">"orange"</span><span class="p">)</span> <span class="o">+</span> ylab<span class="p">(</span><span class="s">"Sensitivity"</span><span class="p">)</span> <span class="o">+</span> xlab<span class="p">(</span><span class="s">"1 - Specificity"</span><span class="p">)</span> <span class="o">+</span> ggtitle<span class="p">(</span>paste<span class="p">(</span><span class="s">"Years 2006 to"</span><span class="p">,</span> c<span class="p">(</span><span class="s">"2011"</span><span class="p">,</span> <span class="s">"2012"</span><span class="p">)[</span>i<span class="p">]))</span>
    print<span class="p">(</span>res<span class="p">)</span>
    
    <span class="c1">## Print the AUC value</span>
    print<span class="p">(</span>unlist<span class="p">(</span>performance<span class="p">(</span>pred.fn<span class="p">,</span> <span class="s">"auc"</span><span class="p">)</span><span class="o">@</span>y.values<span class="p">))</span>
<span class="p">}</span>

[ROC plot: Years 2006 to 2011]

<span class="c1">## [1] 0.8506</span>

[ROC plot: Years 2006 to 2012]

<span class="c1">## [1] 0.8513</span>

Both ROC curves look pretty similar (well, the data sets are very similar!) and have relatively high AUC values.
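As a reminder, the AUC is simply the area under the ROC curve, so it can be approximated from the (fpr, tpr) points with the trapezoid rule. A toy sketch on a made-up curve, not the NFL data:

```r
## A toy ROC curve (made-up points for illustration)
fpr <- c(0, 0.5, 1)
tpr <- c(0, 1, 1)

## Sum the areas of the trapezoids between consecutive points
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
auc
## [1] 0.75
```

An AUC of 0.5 corresponds to the orange diagonal (random guessing), and 1 is a perfect classifier, which puts the values around 0.85 above in context.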

Next, I make the ROC plot using the model trained with the data from 2006 to 2011 to predict the outcomes for the 2012 games.

<span class="c1">## Load 2012 data</span>
load<span class="p">(</span>paste0<span class="p">(</span>maindir<span class="p">,</span> <span class="s">"/lcollado753/final/nfl_half/data/pred/info2012.Rdata"</span><span class="p">))</span>

<span class="c1">## Predict using model fit with data from 2006 to 2011</span>
pred <span class="o"><-</span> predict<span class="p">(</span>fits<span class="p">[[</span><span class="m">1</span><span class="p">]],</span> info2012<span class="p">)</span>

<span class="c1">## Construct prediction function</span>
pred.fn <span class="o"><-</span> prediction<span class="p">(</span>pred<span class="p">,</span> info2012<span class="o">$</span>win<span class="p">)</span>

<span class="c1">## Get performance info</span>
perform <span class="o"><-</span> performance<span class="p">(</span>pred.fn<span class="p">,</span> <span class="s">"tpr"</span><span class="p">,</span> <span class="s">"fpr"</span><span class="p">)</span>

<span class="c1">## Get ready to plot</span>
toPlot <span class="o"><-</span> data.frame<span class="p">(</span>tpr <span class="o">=</span> unlist<span class="p">(</span>slot<span class="p">(</span>perform<span class="p">,</span> <span class="s">"y.values"</span><span class="p">)),</span> fpr <span class="o">=</span> unlist<span class="p">(</span>slot<span class="p">(</span>perform<span class="p">,</span> <span class="s">"x.values"</span><span class="p">)))</span>
all <span class="o"><-</span> c<span class="p">(</span>all<span class="p">,</span> list<span class="p">(</span>toPlot<span class="p">))</span>

<span class="c1">## Make the plot</span>
ggplot<span class="p">(</span>toPlot<span class="p">)</span> <span class="o">+</span> geom_line<span class="p">(</span>aes<span class="p">(</span>x<span class="o">=</span>fpr<span class="p">,</span> y<span class="o">=</span>tpr<span class="p">))</span> <span class="o">+</span> geom_abline<span class="p">(</span>intercept<span class="o">=</span><span class="m">0</span><span class="p">,</span> slope<span class="o">=</span><span class="m">1</span><span class="p">,</span> colour<span class="o">=</span><span class="s">"orange"</span><span class="p">)</span> <span class="o">+</span> ylab<span class="p">(</span><span class="s">"Sensitivity"</span><span class="p">)</span> <span class="o">+</span> xlab<span class="p">(</span><span class="s">"1 - Specificity"</span><span class="p">)</span> <span class="o">+</span> ggtitle<span class="p">(</span><span class="s">"Model trained 2006-2011 predicting 2012"</span><span class="p">)</span>

[ROC plot: Model trained 2006-2011 predicting 2012]

<span class="c1">## Print the AUC value</span>
print<span class="p">(</span>unlist<span class="p">(</span>performance<span class="p">(</span>pred.fn<span class="p">,</span> <span class="s">"auc"</span><span class="p">)</span><span class="o">@</span>y.values<span class="p">))</span>
<span class="c1">## [1] 0.816</span>

The steps in the curve are more visible since less data is used. The model also performs a bit worse than in the previous two plots, as expected; this is clear when comparing the AUC values.

Finally, I plot all curves in the same picture to visually compare them.

names<span class="p">(</span>all<span class="p">)</span> <span class="o"><-</span> c<span class="p">(</span><span class="s">"train2011"</span><span class="p">,</span> <span class="s">"train2012"</span><span class="p">,</span> <span class="s">"pred2012"</span><span class="p">)</span>
<span class="kr">for</span><span class="p">(</span>i <span class="kr">in</span> <span class="m">1</span><span class="o">:</span><span class="m">3</span><span class="p">)</span> <span class="p">{</span>
    all<span class="p">[[</span>i<span class="p">]]</span> <span class="o"><-</span> cbind<span class="p">(</span>all<span class="p">[[</span>i<span class="p">]],</span> rep<span class="p">(</span>names<span class="p">(</span>all<span class="p">)[</span>i<span class="p">],</span> nrow<span class="p">(</span>all<span class="p">[[</span>i<span class="p">]])))</span>
    colnames<span class="p">(</span>all<span class="p">[[</span>i<span class="p">]])[</span><span class="m">3</span><span class="p">]</span> <span class="o"><-</span> <span class="s">"set"</span>
<span class="p">}</span>
all <span class="o"><-</span> do.call<span class="p">(</span>rbind<span class="p">,</span> all<span class="p">)</span>

ggplot<span class="p">(</span>all<span class="p">)</span> <span class="o">+</span> geom_line<span class="p">(</span>aes<span class="p">(</span>x<span class="o">=</span>fpr<span class="p">,</span> y<span class="o">=</span>tpr<span class="p">,</span> colour<span class="o">=</span>set<span class="p">))</span> <span class="o">+</span> geom_abline<span class="p">(</span>intercept<span class="o">=</span><span class="m">0</span><span class="p">,</span> slope<span class="o">=</span><span class="m">1</span><span class="p">,</span> colour<span class="o">=</span><span class="s">"orange"</span><span class="p">)</span> <span class="o">+</span> ylab<span class="p">(</span><span class="s">"Sensitivity"</span><span class="p">)</span> <span class="o">+</span> xlab<span class="p">(</span><span class="s">"1 - Specificity"</span><span class="p">)</span> <span class="o">+</span> ggtitle<span class="p">(</span><span class="s">"Comparing ROCs"</span><span class="p">)</span>

[ROC plot: Comparing ROCs]

Both ROCs from the training data (train2011, train2012) are nearly identical, and both are slightly better than the one predicting the 2012 games.

Overall I am happy with the results, and while some things can certainly be improved, I look forward to the 2013 NFL season. Also, remember that Burke publishes his estimated winning probabilities from week 4 onward (The Fifth Down Blog). So you might be interested in comparing the probability at half time against his estimated probability, which is calculated before the game starts. Maybe you could use the difference between the two to get an idea of how unexpected the first half was. After all, if a game falls outside the pattern it might be worth watching.
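For example, with made-up numbers (these are not Burke's published values): if the pregame estimate gave team A a 0.70 chance of winning but the halftime model puts it at 0.35, the large gap suggests an unusual first half:

```r
## Made-up probabilities for illustration only
pregame <- 0.70   # pregame winning probability for team A (e.g. Burke's estimate)
halftime <- 0.35  # halftime model's winning probability for team A

## A large absolute difference flags an unexpected first half
surprise <- abs(halftime - pregame)
surprise
## [1] 0.35
```

You would still need to pick a threshold for "unexpected enough to keep watching", which is a matter of taste.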

Citations made with knitcitations (Boettiger, 2013).
