
1. Contributed by James Hedges and Malcolm Hess.
2. James and Malcolm are part of the 12-Week Data Science Bootcamp with Vivian Zhang in the spring of 2015.
3. This post is based on their first in-class presentation, a review of Benjamin Morris’ article on FiveThirtyEight.com related to Seattle’s final offensive play in Super Bowl XLIX.

Videos

1. Video of the presentation can be found here:

Background
We’re interested in applying statistical and analytical approaches to competitive sports and in gaining surprising insights from doing so. To that end, we discussed a recent article by FiveThirtyEight.com’s Benjamin Morris in which he builds support for a contrarian position on the decision underlying what may be remembered as one of the most impactful plays in Super Bowl history. He develops a probabilistic model in support of the conclusion that Seattle’s decision to throw the ball on second down from the 1-yard line wasn’t actually a bad decision.
We wanted to learn more about his model and to see whether we could implement a version of it ourselves. We also wanted to provide some context for it and to consider other approaches to problems of this kind. Doing well with such problems may hinge on understanding and simplifying the dependencies between a context (e.g., second down, 1-yard line, down by 4 points), a decision (e.g., run the ball or throw the ball), a specific outcome (e.g., a touchdown or an interception), and a more general outcome (e.g., win or lose the game).

Objectives

We initially attempted to recreate a primary result from the article, in which the probabilities of sequential play outcomes and overall game outcomes are computed, and in which those estimates change based on some additional assumptions. While recreating the model is an important objective, we felt it was unrealistic to reach that point without more information on the data Morris used and on how the model was actually computed. Our attention instead turned to numerically replicating some elements of the model, such as the probability of scoring a touchdown on a run play, and to evaluating whether tweaks to the model were reasonable.

Situation

1. We start with the play. This image from NFL Breakdowns shows Seattle in a shotgun formation at New England’s 1-yard line in the seconds just prior to the snap. Seattle trailed by four points, so a touchdown would have put them up by three (assuming they went for and got the PAT), which is what many observers assumed was about to happen. Instead, Russell Wilson attempted a pass on a slant route to Ricardo Lockette, and it was intercepted.

Football in 5 rules
1. points: Touchdown = 6 pts; Field Goal = 3 pts; Point After Touchdown (easy kick) = 1 pt
2. the attacking team (offense) scores a touchdown by getting the ball into the end zone (the area beyond the goal line)
3. the offense has four attempts (downs) to move the ball 10 yds; inside the 10-yard line, it has those attempts to reach the goal line
4. the ball is advanced by throwing it to someone who catches it or by someone running with it (i.e., pass or run the ball)
5. a given play ends when the person with the ball is tackled or goes out of bounds, or when the ball is passed and not caught
source: http://usafootball.com/football-basics

In a simpler view, imagine two bowls, each holding balls of three colors. Blindfolded, you pull out one ball at a time. Pull a red ball and you win, pull a black ball and you lose, and a yellow ball lets you pull again. However, if you pull three yellow balls in a row you also lose. There are two bowls to choose from, one called run and one called pass, and each has a different mix of red, yellow, and black balls.

With this picture in mind, we created a probability tree that includes all possible outcomes of this decision.
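The bowl picture can be written out as an explicit tree. The sketch below is only an illustration, not the tree from our presentation: the proportions in `bowl` are made-up placeholder values, the helper `tree.leaves` is a name we introduce here, and we assume each pull is independent with the ball returned to the bowl afterwards.

```r
# Hypothetical proportions for one bowl (placeholders, not estimates)
bowl <- c(red = 0.5, yellow = 0.4, black = 0.1)

# Enumerate every path through a tree of at most `max.pulls` pulls.
# A path stops at the first red (win) or black (lose);
# pulling yellow on every allowed pull also loses.
tree.leaves <- function(p, max.pulls = 3) {
  paths <- character(0)
  probs <- numeric(0)
  results <- character(0)
  for (k in seq_len(max.pulls)) {
    prefix <- rep("yellow", k - 1)
    reach  <- p[["yellow"]]^(k - 1)   # chance of k - 1 yellows first
    for (colour in c("red", "black")) {
      paths   <- c(paths, paste(c(prefix, colour), collapse = " > "))
      probs   <- c(probs, reach * p[[colour]])
      results <- c(results, if (colour == "red") "win" else "lose")
    }
  }
  # the all-yellow path
  paths   <- c(paths, paste(rep("yellow", max.pulls), collapse = " > "))
  probs   <- c(probs, p[["yellow"]]^max.pulls)
  results <- c(results, "lose")
  data.frame(path = paths, prob = probs, result = results)
}

leaves <- tree.leaves(bowl)
print(leaves)

# total chance of winning from this bowl
p.win <- sum(leaves$prob[leaves$result == "win"])
print(p.win)
```

Running the same function on a second vector of proportions gives the tree for the other bowl, so the two choices can be compared leaf by leaf.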

Implementation

Play-by-play results for every NFL game of the 2014 season were found here: http://nflsavant.com/about.php
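```r
# get data ----------------------------------------------------------------
library(downloader)
fileUrl <- "http://nflsavant.com/pbp_data.php?year=2014"
# make sure the target directory exists, then download the CSV
dir.create("./data", showWarnings = FALSE)
download(fileUrl, destfile = "./data/data.pbp.2014.csv", mode = "wb")
list.files("./data")

# read the play-by-play data into a data frame
data.pbp.2014 <- read.csv("./data/data.pbp.2014.csv")

# check data --------------------------------------------------------------
str(data.pbp.2014)
```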

```r
# 45k observations by 45 vars

# 01 - GameId - integer - example: 2014090400 - date of game and two more digits
# 02 - GameDate - factor - example: 2014-09-04 - date of game
# 03 - Quarter - integer - example: 1 - quarter in game
# 04 - Minute - integer - example: 15 - minutes left in quarter
# 05 - Second - integer - example: 0 - seconds left in quarter
# 06 - OffenseTeam - factor - example: ARI - offensive team
# 07 - DefenseTeam - factor - example: ARI - defensive team
# 08 - Down - integer - example: 1 - down; not sure about 0?
# 09 - ToGo - integer - example: 10 - distance to go; not sure about 0?
# 10 - YardLine - integer - example: 35 - yard line; not sure about 0? ***
# 11 - X - logical - example: ?? - not sure
# 12 - SeriesFirstDown - integer - example: 1 - series 1st down
# 13 - X.1 - logical - example: ?? - not sure
# 14 - NextScore - integer - example: 0 - check this ***
# 15 - Description - factor - example: "D.CARR.." - description
# 16 - TeamWin - integer - example: 0 - unclear ***
# 17 - X.2 - logical - example: ?? - not sure
# 18 - X.3 - logical - example: ?? - not sure
# 19 - SeasonYear - integer - example: 2014 - season year
# 20 - Yards - integer - example: 0 - yards from result of play? ***
# 21 - Formation - factor - example: SHOTGUN - simple formation on play
# 22 - PlayType
# 23 - IsRush - integer - example: 0 - whether rush play or not ***
# 24 - IsPass - integer - example: 0 - whether pass play or not ***
# 25 - IsIncomplete
# 26 - IsTouchdown - integer - example: 0 - whether play was touchdown or not ***
# 27 - PassType
# 28 - IsSack
# 29 - IsChallenge
# 30 - IsChallengeReversed
# 31 - Challenger
# 32 - IsMeasurement
# 33 - IsInterception - integer - example: 0 - whether play was interception ***
# 34 - IsFumble - integer - example: 0 - whether play was fumble ***
# 35 - IsPenalty - integer - example: 0 - whether play was penalty ***
# 36 - IsTwoPointConversion
# 37 - IsTwoPointConversionSuccessful
# 38 - RushDirection
# 39 - YardLineFixed - integer - example: 35 - 0-50 yardline
# 40 - YardLineDirection - factor - example: OPP - which side of field
# 41 - IsPenaltyAccepted - integer - example: 0 - penalty accepted or not ***
# 42 - PenaltyTeam - factor - example: ARI - why 33 levels?
# 43 - IsNoPlay - integer - example: 0 - not sure what this means
# 44 - PenaltyType - factor - example: BLOCKED INTO PUNTER
# 45 - PenaltyYards - integer - example: 5 - yards from penalty
```

Then we count the plays that met all of our criteria: ball on the opponent's 1-yard line and no penalty on the play. For each play type, pass and run, we needed the total number of attempts, the number of touchdowns (successes), and the number of turnovers (a fumble on a run or an interception on a pass).

```r
# probability of outcomes -------------------------------------------------

n.rush <- nrow(data.pbp.2014[
  data.pbp.2014$YardLineFixed == 1 &
  data.pbp.2014$YardLineDirection == "OPP" &
  data.pbp.2014$IsPenalty == 0 &
  data.pbp.2014$IsRush == 1,])

n.rush.td <- nrow(data.pbp.2014[
  data.pbp.2014$YardLineFixed == 1 &
  data.pbp.2014$YardLineDirection == "OPP" &
  data.pbp.2014$IsPenalty == 0 &
  data.pbp.2014$IsRush == 1 &
  data.pbp.2014$IsTouchdown == 1,])

n.rush.no.td <- nrow(data.pbp.2014[
  data.pbp.2014$YardLineFixed == 1 &
  data.pbp.2014$YardLineDirection == "OPP" &
  data.pbp.2014$IsPenalty == 0 &
  data.pbp.2014$IsRush == 1 &
  data.pbp.2014$IsTouchdown == 0,])

n.rush.fumble <- nrow(data.pbp.2014[
  data.pbp.2014$YardLineFixed == 1 &
  data.pbp.2014$YardLineDirection == "OPP" &
  data.pbp.2014$IsPenalty == 0 &
  data.pbp.2014$IsRush == 1 &
  data.pbp.2014$IsFumble == 1,])

round(n.rush.td / n.rush, digits=3)
round(n.rush.no.td / n.rush, digits=3)
round(n.rush.fumble / n.rush, digits=4)

n.pass <- nrow(data.pbp.2014[
  data.pbp.2014$YardLineFixed == 1 &
  data.pbp.2014$YardLineDirection == "OPP" &
  data.pbp.2014$IsPenalty == 0 &
  data.pbp.2014$IsPass == 1,])

n.pass.td <- nrow(data.pbp.2014[
  data.pbp.2014$YardLineFixed == 1 &
  data.pbp.2014$YardLineDirection == "OPP" &
  data.pbp.2014$IsPenalty == 0 &
  data.pbp.2014$IsPass == 1 &
  data.pbp.2014$IsTouchdown == 1,])

n.pass.no.td <- nrow(data.pbp.2014[
  data.pbp.2014$YardLineFixed == 1 &
  data.pbp.2014$YardLineDirection == "OPP" &
  data.pbp.2014$IsPenalty == 0 &
  data.pbp.2014$IsPass == 1 &
  data.pbp.2014$IsTouchdown == 0,])

n.pass.interception <- nrow(data.pbp.2014[
  data.pbp.2014$YardLineFixed == 1 &
  data.pbp.2014$YardLineDirection == "OPP" &
  data.pbp.2014$IsPenalty == 0 &
  data.pbp.2014$IsPass == 1 &
  data.pbp.2014$IsInterception == 1,])
```

Lastly, we calculate the success and failure rates by dividing those counts by the total number of attempts.

```r
round(n.pass.td / n.pass, digits=3)
round(n.pass.no.td / n.pass, digits=3)
round(n.pass.interception / n.pass, digits=4)

# > round(n.rush.td / n.rush, digits=3)
# [1] 0.563
# > round(n.rush.no.td / n.rush, digits=3)
# [1] 0.437
# > round(n.rush.fumble / n.rush, digits=4)
# [1] 0.0101
# > round(n.pass.td / n.pass, digits=3)
# [1] 0.579
# > round(n.pass.no.td / n.pass, digits=3)
# [1] 0.421
# > round(n.pass.interception / n.pass, digits=4)
# [1] 0
```
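The same tabulation can be written more compactly with a small helper. This is a sketch only: `outcome.rates` is a hypothetical function name of ours, and `toy.pbp` is a made-up stand-in that borrows a few column names from `data.pbp.2014` so the example runs on its own.

```r
# Toy stand-in for data.pbp.2014 (made-up rows, real column names)
toy.pbp <- data.frame(
  YardLineFixed     = c(1, 1, 1, 1, 1, 1, 20, 1),
  YardLineDirection = c("OPP", "OPP", "OPP", "OPP", "OPP", "OPP", "OPP", "OWN"),
  IsPenalty         = c(0, 0, 0, 0, 0, 1, 0, 0),
  IsRush            = c(1, 1, 1, 0, 0, 1, 1, 1),
  IsPass            = c(0, 0, 0, 1, 1, 0, 0, 0),
  IsTouchdown       = c(1, 0, 1, 1, 0, 1, 0, 0),
  IsFumble          = c(0, 1, 0, 0, 0, 0, 0, 0),
  IsInterception    = c(0, 0, 0, 0, 1, 0, 0, 0)
)

# Attempts and outcome rates for one play type from the opponent's 1-yard line.
# type.col is "IsRush" or "IsPass"; turnover.col is "IsFumble" or "IsInterception".
outcome.rates <- function(pbp, type.col, turnover.col) {
  at.goal <- pbp$YardLineFixed == 1 &
    pbp$YardLineDirection == "OPP" &
    pbp$IsPenalty == 0
  plays <- pbp[at.goal & pbp[[type.col]] == 1, ]
  c(n        = nrow(plays),
    td       = mean(plays$IsTouchdown == 1),
    no.td    = mean(plays$IsTouchdown == 0),
    turnover = mean(plays[[turnover.col]] == 1))
}

rush.rates <- outcome.rates(toy.pbp, "IsRush", "IsFumble")
pass.rates <- outcome.rates(toy.pbp, "IsPass", "IsInterception")
print(round(rbind(rush = rush.rates, pass = pass.rates), 3))
```

Applying the filter once and reusing it keeps the criteria for rush and pass plays identical, which is easy to get wrong when the condition is copied by hand.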

These success rates are used to judge whether the decision made in the Super Bowl was a good one or not. We used league-wide rates because we felt it was unwise to use an individual team’s success rate: a single team in the 2014 season does not provide a big enough sample of plays with the exact parameters of the play (ball on the 1-yard line).
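To see why sample size matters here, one can compare confidence intervals for the same observed rate at two sample sizes. The counts below are hypothetical and are not taken from the 2014 data; `binom.test` from base R's stats package returns an exact (Clopper-Pearson) interval, and `ci.width` is a helper name we introduce for this sketch.

```r
# Hypothetical counts: the same 60% success rate at two sample sizes
small <- binom.test(x = 6,  n = 10)    # e.g. one team's goal-line plays
large <- binom.test(x = 60, n = 100)   # e.g. a league-wide sample

# width of the 95% confidence interval
ci.width <- function(bt) bt$conf.int[2] - bt$conf.int[1]

print(small$conf.int)
print(large$conf.int)
print(c(small = ci.width(small), large = ci.width(large)))
```

With only a handful of plays the interval is wide enough that a rate like 0.56 and a rate like 0.58 cannot be told apart, which is the concern with team-level estimates.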

Conclusion
We can recreate a victory probability model using these numbers. Doing so suggests that passing is in fact slightly more likely to succeed than running the ball (a touchdown rate of 0.579 versus 0.563 from the 1-yard line in our 2014 sample). Unfortunately, we cannot compare our model to the one in the FiveThirtyEight article because that model has many built-in assumptions, including a significant change in success rate that depends on whether the first play was a run or a pass.
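As a rough illustration of how the per-play rates feed into a multi-down calculation, the sketch below computes the chance of a touchdown within three downs. It is not Morris’ model and it is a touchdown probability, not a win probability. It assumes that the same call is repeated on every down, that plays are independent, that every fumble or interception ends the drive and never overlaps with a touchdown, and that the clock is ignored. The function name `td.within` is our own.

```r
# Per-play rates from the 1-yard line, as computed from the 2014 data
rush <- c(td = 0.563, turnover = 0.0101)   # turnover = fumble rate
pass <- c(td = 0.579, turnover = 0)        # turnover = interception rate

# Chance of a touchdown within `downs` plays when the same call is repeated.
# A play that is neither a touchdown nor a turnover leads to another attempt.
td.within <- function(rates, downs = 3) {
  again <- 1 - rates[["td"]] - rates[["turnover"]]
  sum(rates[["td"]] * again^(0:(downs - 1)))
}

p.rush <- td.within(rush)
p.pass <- td.within(pass)
print(round(c(rush = p.rush, pass = p.pass), 3))
```

Under these simplifying assumptions the pass-only sequence comes out slightly ahead of the run-only sequence, but the gap is small relative to the uncertainty in the underlying rates.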
