Link to paper here
In summary, I think the paper is about seeing whether “box score” data can be used to predict the winner of a game.
Put another way: if I showed you the AFL stats on, say, footywire at the end of a game, with the Goals and Behinds covered up, could you predict who won?
The good thing about this paper is that it only uses data that is already available in #fitzRoy. Of course, fitzRoy wasn’t around when the paper was written; it just so happens that the data used is also available within fitzRoy.
There are some caveats. The main one is that even though we have the same variables, we might not have the same data source. Nonetheless, this post should be a good guide to recreating the work and, if you are so inclined, maybe even improving on it.
The paper’s variables cover the 2013 and 2014 AFL seasons.
An interesting extension: footywire has since made some extra statistics available, such as meters gained, intercepts and tackles inside 50, to name a few. You could run the same model type that the paper used (logistic regression) and see whether variables like these add any extra value. This is why this part of my blog exists: it’s an interesting question, and one that a fan of the game can hopefully answer online.
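One way to check whether an extra statistic adds value is to fit the model with and without it and compare the two fits. The sketch below does this on simulated data only: the variable names (`i50_diff`, `mg_diff`), the effect sizes and the sample size are all made up for illustration and are not taken from the paper or from fitzRoy.

```r
# Simulated example: does adding a hypothetical "meters gained" difference
# improve a logistic regression that already has inside-50 difference?
set.seed(42)
n <- 400
i50_diff <- rnorm(n, mean = 0, sd = 10)    # made-up inside-50 differences
mg_diff  <- rnorm(n, mean = 0, sd = 300)   # made-up meters gained differences
win <- rbinom(n, 1, plogis(0.08 * i50_diff + 0.004 * mg_diff))

base_fit <- glm(win ~ i50_diff, family = binomial)
ext_fit  <- glm(win ~ i50_diff + mg_diff, family = binomial)

# Likelihood ratio test for the nested models, plus AIC for both
comparison <- anova(base_fit, ext_fit, test = "Chisq")
print(comparison)
print(AIC(base_fit, ext_fit))
```

With real data you would swap the simulated vectors for the relevant difference columns; a small p-value in the comparison and a lower AIC for the extended model would suggest the extra variable is worth keeping.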
Another thing to note from the paper is that it is not the raw team totals that go into the model; rather, relative difference values between opposing teams for each performance indicator were used.
What I think this means is that instead of using WCE’s total kicks, the model uses WCE’s total kicks minus Collingwood’s total kicks.
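As a small sketch of that reading, here is the difference calculation on one invented match. The kick totals are hypothetical, and it assumes exactly two rows per match.

```r
# Hypothetical team totals for a single match (numbers invented for illustration)
match_totals <- data.frame(
  Match_id = c(1, 1),
  Team     = c("West Coast", "Collingwood"),
  K        = c(210, 195),
  stringsAsFactors = FALSE
)

# Within each match, each team's kicks minus its opponent's kicks.
# diff(x) is second row minus first row, so the first row gets -diff(x).
match_totals$K_difference <- ave(
  match_totals$K, match_totals$Match_id,
  FUN = function(x) c(-diff(x), diff(x))
)
print(match_totals)
```

The two teams in a match always get equal and opposite values, which is why only the sign and size of the difference carry information.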
The other thing to note is that the paper only covers home and away matches, so we have to filter out finals.
    c("Variables used",
      "Kicks", "Marks", "Handballs", "Tackles", "Inside 50s", "Clearances",
      "Clangers", "Contested possessions", "Uncontested possessions",
      "Contested marks", "Marks inside 50", "Goal conversion",
      "Kick:handball ratio", "contested:uncontested possession ratio",
      "disposals", "hit outs", "Free kick differential")
They then removed some variables from the model fitting process. One reason to do so is multicollinearity, a typical problem in regression, which happens when there is a relationship between your predictor variables. In this case we know that essentially kicks + handballs = disposals, which would be an example of multicollinearity.
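To see why the kicks + handballs = disposals relationship is a problem, here is a minimal sketch on simulated counts (not real AFL data). When one predictor is an exact sum of two others, the design matrix loses a rank and the regression cannot estimate a separate coefficient for it.

```r
# Simulated counts, purely for illustration
set.seed(1)
n <- 50
kicks     <- rpois(n, 200)
handballs <- rpois(n, 150)
disposals <- kicks + handballs   # exact linear combination of the other two
y         <- rnorm(n)            # arbitrary response, just so we can fit a model

# The design matrix has 4 columns but fewer independent directions
X <- cbind(intercept = 1, kicks, handballs, disposals)
design_rank <- qr(X)$rank
print(c(columns = ncol(X), rank = design_rank))

# R drops the redundant predictor and reports its coefficient as NA
fit <- lm(y ~ kicks + handballs + disposals)
print(coef(fit))
```

In real data the relationship is rarely this exact, but strongly correlated predictors still inflate standard errors, which is one reason to remove variables like disposals before fitting.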
How to get the data ready
To fit the models described in the paper, we need a dataset that contains the game information joined with a binary indicator: 1 if the team won and 0 if the team lost (in the paper, draws were removed).
Doing this with fitzRoy requires joining two of its datasets, match_results and player_stats:
    library(tidyverse)
    ## ── Attaching packages ── tidyverse 1.2.1 ──
    ## ✔ ggplot2 3.1.0     ✔ purrr   0.2.5
    ## ✔ tibble  1.4.2     ✔ dplyr   0.7.8
    ## ✔ tidyr   0.8.2     ✔ stringr 1.3.1
    ## ✔ readr   1.3.1     ✔ forcats 0.3.0
    ## ── Conflicts ── tidyverse_conflicts() ──
    ## ✖ dplyr::filter() masks stats::filter()
    ## ✖ dplyr::lag()    masks stats::lag()
    library(fitzRoy)
    library(knitr)
    library(broom)

    df1 <- fitzRoy::match_results
    df2 <- fitzRoy::player_stats
Once we have the box-score statistics from footywire (fitzRoy::player_stats), we have to aggregate them to match level, because the paper looked at team-level statistics, not player-level ones.
To do that using the tidyverse, we first have to decide the level at which we want to summarise our data (team and game). To do this we use group_by():
    library(tidyverse)

    teamdata <- df2 %>%
      dplyr::select(-Player) %>%
      # sum the player rows up to one row per team per match
      group_by(Date, Season, Round, Venue, Team, Opposition, Status, Match_id) %>%
      summarise_all(sum) %>%
      filter(Season %in% c("2013", "2014")) %>%
      group_by(Date, Season, Round, Team, Status, Opposition, Venue, Match_id) %>%
      summarise_all(.funs = sum) %>%
      # within each match, take each team's total minus its opponent's total
      group_by(Match_id) %>%
      arrange(Match_id) %>%
      mutate_if(is.numeric, funs(difference = c(-diff(.), diff(.)))) %>%
      # the paper only uses home and away matches, so drop finals
      filter(!Round %in% c("Qualifying Final", "Elimination Final",
                           "Semi Final", "Preliminary Final", "Grand Final"))
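If the grouping step feels opaque, the same idea can be sketched in base R on a tiny invented table. The player names and counts below are hypothetical, and only two statistics are kept to make the result easy to check by hand.

```r
# Hypothetical player-level rows for one match (values invented for illustration)
player_rows <- data.frame(
  Match_id = c(1, 1, 1, 1),
  Team     = c("West Coast", "West Coast", "Collingwood", "Collingwood"),
  Player   = c("Player A", "Player B", "Player C", "Player D"),
  K        = c(20, 15, 18, 12),
  HB       = c(10, 8, 14, 9),
  stringsAsFactors = FALSE
)

# One row per team per match: sum kicks and handballs over that team's players
team_totals <- aggregate(cbind(K, HB) ~ Match_id + Team,
                         data = player_rows, FUN = sum)
print(team_totals)
```

The tidyverse pipeline does the same thing for every numeric column at once, which is why the Player column has to be dropped first.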
After that we want to join on the score dataset so we can create a win/loss column.
    dataset_scores1 <- df1 %>% dplyr::select(Date, Round, Home.Team, Home.Points, Game)
    dataset_scores2 <- df1 %>% dplyr::select(Date, Round, Away.Team, Away.Points, Game)

    # Sometimes when joining datasets together it helps to rename things for consistency
    # (column 3 is the team name, column 4 is the points scored)
    colnames(dataset_scores1)[3] <- "Team"
    colnames(dataset_scores1)[4] <- "Points"
    colnames(dataset_scores2)[3] <- "Team"
    colnames(dataset_scores2)[4] <- "Points"

    df5 <- rbind(dataset_scores1, dataset_scores2)
    dataset_margins <- df5 %>%
      group_by(Game) %>%
      arrange(Game) %>%
      mutate(margin = c(-diff(Points), diff(Points)))
    # View(dataset_margins) # I have commented this out, but always good to view

    dataset_margins$Date <- as.Date(dataset_margins$Date)
    dataset_margins <- dataset_margins %>%
      mutate(Team = str_replace(Team, "Brisbane Lions", "Brisbane"))
    dataset_margins <- dataset_margins %>%
      mutate(Team = str_replace(Team, "Footscray", "Western Bulldogs"))

    complete_df <- left_join(teamdata, dataset_margins,
                             by = c("Date" = "Date", "Team" = "Team"))
The next step in the paper is feature creation, which means creating some variables to be fed into the model. While the paper uses data available from fitzRoy, it also creates some variables of its own. These variables are:
    library(knitr)

    df <- data.frame(
      stringsAsFactors = FALSE,
      Feature.created = c("Goal Conversion",
                          "Kick to handball ratio",
                          "Contested to uncontested possession ratio"),
      Description = c("the percentage of scoring events that were goals",
                      "the number of kicks compared to handballs expressed as a ratio",
                      "the number of contested possessions compared to uncontested possessions expressed as a ratio")
    )
    df %>% kable("html", caption = "Table of Variables Created for Paper")

Table of Variables Created for Paper

| Feature created | Description |
|---|---|
| Goal Conversion | the percentage of scoring events that were goals |
| Kick to handball ratio | the number of kicks compared to handballs expressed as a ratio |
| Contested to uncontested possession ratio | the number of contested possessions compared to uncontested possessions expressed as a ratio |
- Goal conversion is an example of a variable that we might not be able to fully replicate. For example, if a player has a shot on goal but kicks it out of bounds, that isn’t recorded in fitzRoy.
So let’s just go ahead and create them. In the tidyverse, this is done using the mutate function.
    dataset_model <- complete_df %>%
      filter(margin != 0) %>%
      mutate(goal_conversion = (G / (G + B)),
             kicktohandball = K / HB,
             Contestedtouncontested = CP / UP,
             win_loss = ifelse(margin > 0, 1, 0)) %>%
      mutate(kicktohandball_diff = c(-diff(kicktohandball), diff(kicktohandball)),
             Contestedtouncontested_diff = c(-diff(Contestedtouncontested), diff(Contestedtouncontested)),
             goal_conversion_diff = c(-diff(goal_conversion), diff(goal_conversion))) %>%
      filter(!margin == 0) %>%
      filter(Season == 2013)
What’s the next step now that we have a dataset like the paper’s?
That would be to fit a logistic regression.
    mylogit <- glm(
      win_loss ~ K_difference + M_difference + HB_difference + T_difference +
        I50_difference + CL_difference + CG_difference + CP_difference +
        UP_difference + CM_difference + goal_conversion + kicktohandball +
        Contestedtouncontested,
      data = dataset_model, family = "binomial"
    )
    summary(mylogit)
    ##
    ## Call:
    ## glm(formula = win_loss ~ K_difference + M_difference + HB_difference +
    ##     T_difference + I50_difference + CL_difference + CG_difference +
    ##     CP_difference + UP_difference + CM_difference + goal_conversion +
    ##     kicktohandball + Contestedtouncontested, family = "binomial",
    ##     data = dataset_model)
    ##
    ## Deviance Residuals:
    ##      Min        1Q    Median        3Q       Max
    ## -2.17684  -0.42682  -0.00021   0.44963   2.48144
    ##
    ## Coefficients:
    ##                         Estimate Std. Error z value Pr(>|z|)
    ## (Intercept)            -5.098957   1.632251  -3.124 0.001785 **
    ## K_difference            0.017970   0.026365   0.682 0.495513
    ## M_difference           -0.040118   0.017819  -2.251 0.024356 *
    ## HB_difference          -0.045039   0.029275  -1.538 0.123927
    ## T_difference            0.027667   0.014805   1.869 0.061654 .
    ## I50_difference          0.065159   0.018815   3.463 0.000534 ***
    ## CL_difference          -0.062603   0.021532  -2.907 0.003644 **
    ## CG_difference          -0.008463   0.020111  -0.421 0.673890
    ## CP_difference           0.062228   0.028254   2.202 0.027633 *
    ## UP_difference           0.054032   0.028425   1.901 0.057324 .
    ## CM_difference           0.057289   0.041005   1.397 0.162379
    ## goal_conversion         6.563385   1.600541   4.101 4.12e-05 ***
    ## kicktohandball          1.224616   0.900988   1.359 0.174086
    ## Contestedtouncontested -0.771994   1.195178  -0.646 0.518328
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## (Dispersion parameter for binomial family taken to be 1)
    ##
    ##     Null deviance: 546.20  on 393  degrees of freedom
    ## Residual deviance: 256.91  on 380  degrees of freedom
    ## AIC: 284.91
    ##
    ## Number of Fisher Scoring iterations: 6
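The estimates above are on the log-odds scale, which is hard to read directly. A common way to interpret them is to exponentiate a coefficient to get an odds ratio. The sketch below does this for the I50_difference estimate, copied by hand from the summary output; the choice of a 10 inside-50 gap is just an illustrative assumption.

```r
# Estimate for I50_difference, copied from the summary output above
beta_i50 <- 0.065159

# Odds ratio for one extra inside 50 relative to the opponent,
# holding the other predictors fixed
odds_ratio_1 <- exp(beta_i50)

# Odds ratio for an (arbitrarily chosen) 10 inside-50 advantage
odds_ratio_10 <- exp(10 * beta_i50)

print(round(c(one_extra = odds_ratio_1, ten_extra = odds_ratio_10), 3))
```

So each extra inside 50 multiplies the odds of winning by roughly 1.07 in this fit, and a 10 inside-50 advantage roughly doubles them. With the fitted object available, exp(coef(mylogit)) gives the same conversion for every term at once.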
The question is: why doesn’t our model here match the paper?
- I’m wrong (if you have thoughts or feedback on what I should do differently, email me!)
- The data is different
- Reading the paper again, I didn’t remove draws, whoops! (now fixed)
Some possible interesting changes to make
- Include some more years or different years
I think this makes sense as AFL is a constantly evolving game: the variables that might have been important a few years ago might not be so important now.