Taking Home Charlie

September 16, 2018
By

(This article was first published on Analysis of AFL, and kindly contributed to R-bloggers)

Build your own Brownlow Model

G’day all, your friendly neighbourhood PhD student here. I am here to help you all build your own models.

That is very different from what you might be used to which could consist of any one of the following scenarios.

  • Here is a model I have build that you are not smart enough too, follow my tips blindly
  • Here is a model that I have build with data you don’t have access to, therefore it is better than what you can do follow my tips
  • Here is a model that I don’t explain well apart from performance, follow my tips

Hopefully you see my point and see the unfortunate pattern.

This is NOT going to be one of those things, this is a guide to help you build a competitive Brownlow prediction model. It is fully reproducible and you can use the variables created if you want, you can create your own variables but by the end of this process the model is yours.

So let me try to ease some possible thoughts.

  1. Where can I get the data from?

James Day and myself have worked together to produce an R package called fitzRoy which means that you and anyone else can have easy access to all the good stuff you see on afltables and footywire.

  1. But what is the point other people do models why build my own when I can just use someone else’s?

That is a good point, there are Brownlow Models floating around, and you could even go to a tipping service and pay for tips. But I assume you are interested in predicting the Brownlow Medallist because you are a football nuffie and as such have ideas about what you think have an effect on how umpires vote.

  1. So if I have the data and I can build my own model, but I don’t want to because it won’t be any good.

All models are wrong but some are useful – This model is by no means perfect, that isn’t the point. But it will provide you a predicted order that has less ranking error vs Champion Data’s official prediction using data that you don’t have access to without \($\)$

  1. So what?

In a nutshell, below is a fully reproducible method you can edit and tinker with that without any tinkering with produced a better top 10 Brownlow Medal prediction vs the official data provider without any of the bells and whistles!

What Packages do I need?

library(MASS)
library(ordinal)
library(fitzRoy)
library(tidyverse)

So now you have the packages you need, first steps lets get the data ready

Getting the data

First we can use the fitzRoy package to get all the data from afltables

This will consist of basically 3 steps

  1. Get the data from afltables, we can do this using the fitzRoy::get_afltables_stats

  2. Create the game by game summaries so we have the total kicks per game, total handballs per game by team for all the statistics available on afltables

  3. Join on scores – because people feel as though margin has an effect on the order of 3,2,1 (Hard to get the 3 votes in a flogging)

We do this using the fitzRoy::match_results and then we join the datasets together creating a final data frame that has all player statistics at an individual level, joined with their team totals and game score.

df<-fitzRoy::get_afltables_stats(start_date = "1897-01-01", end_date = Sys.Date())
## Returning data from 1897-01-01 to 2018-09-20
## Downloading data
## 
## Finished downloading data. Processing XMLs
## Finished getting afltables data
df<-df%>%filter(Season>2010)

names(df)
##  [1] "Season"                  "Round"                  
##  [3] "Date"                    "Local.start.time"       
##  [5] "Venue"                   "Attendance"             
##  [7] "Home.team"               "HQ1G"                   
##  [9] "HQ1B"                    "HQ2G"                   
## [11] "HQ2B"                    "HQ3G"                   
## [13] "HQ3B"                    "HQ4G"                   
## [15] "HQ4B"                    "Home.score"             
## [17] "Away.team"               "AQ1G"                   
## [19] "AQ1B"                    "AQ2G"                   
## [21] "AQ2B"                    "AQ3G"                   
## [23] "AQ3B"                    "AQ4G"                   
## [25] "AQ4B"                    "Away.score"             
## [27] "First.name"              "Surname"                
## [29] "ID"                      "Jumper.No."             
## [31] "Playing.for"             "Kicks"                  
## [33] "Marks"                   "Handballs"              
## [35] "Goals"                   "Behinds"                
## [37] "Hit.Outs"                "Tackles"                
## [39] "Rebounds"                "Inside.50s"             
## [41] "Clearances"              "Clangers"               
## [43] "Frees.For"               "Frees.Against"          
## [45] "Brownlow.Votes"          "Contested.Possessions"  
## [47] "Uncontested.Possessions" "Contested.Marks"        
## [49] "Marks.Inside.50"         "One.Percenters"         
## [51] "Bounces"                 "Goal.Assists"           
## [53] "Time.on.Ground.."        "Substitute"             
## [55] "Umpire.1"                "Umpire.2"               
## [57] "Umpire.3"                "Umpire.4"               
## [59] "group_id"
team_stats<-df%>%
  dplyr::select(Date, First.name,Surname,Season, Round, Playing.for, Kicks:Goal.Assists)%>%
  group_by(Date, Season, Round, Playing.for)%>%
  summarise_if(is.numeric,funs(sum=c(sum(.))))
## Adding missing grouping variables: `Home.team`, `Away.team`
player_stats<-df%>%
  dplyr::select(Date, First.name,Surname,Season, Round, Playing.for, Kicks:Goal.Assists)
## Adding missing grouping variables: `Home.team`, `Away.team`
complete_df<-left_join(player_stats,team_stats, by=c("Date"="Date", "Season"="Season",  "Playing.for"="Playing.for"))

dataset_scores<-fitzRoy::match_results
names(dataset_scores)
##  [1] "Game"         "Date"         "Round"        "Home.Team"   
##  [5] "Home.Goals"   "Home.Behinds" "Home.Points"  "Away.Team"   
##  [9] "Away.Goals"   "Away.Behinds" "Away.Points"  "Venue"       
## [13] "Margin"       "Season"       "Round.Type"   "Round.Number"
dataset_scores1<-dataset_scores%>%dplyr::select (Date, Round, Home.Team, Home.Points,Game)
dataset_scores2<-dplyr::select(dataset_scores, Date, Round, Away.Team, Away.Points,Game)
#Sometimes when joining datasets together it helps to rename things for consistency

colnames(dataset_scores1)[3]<-"Team"
colnames(dataset_scores1)[4]<-"Points"
colnames(dataset_scores2)[3]<-"Team"
colnames(dataset_scores2)[4]<-"Points"

df5<-rbind(dataset_scores1,dataset_scores2)
dataset_margins<-df5%>%group_by(Game)%>%
  arrange(Game)%>%
  mutate(margin=c(-diff(Points),diff(Points)))
# View(dataset_margins) # I have commented this out, but always good to view 
dataset_margins$Date<-as.Date(dataset_margins$Date)
complete_df$Date<-as.Date(complete_df$Date)

complete_df<-left_join(complete_df,dataset_margins,by=c("Date"="Date",  "Playing.for"="Team"))

Create the Ratios

Create ratios is important for the model, remember we are interested in their relative performance to their peers. Is it impressive kicking 3 goals if 25 have been kicked? Or is it more impressive getting 3 if only 5 goals have been kicked?

Intuition like this for kicks, handballs, clearances etc would lead us to believe that ratios are better predictors of polling votes over raw statistics.

complete_df_ratio<-complete_df%>%
  mutate(kick.ratio=Kicks/Kicks_sum,
         Marks.ratio=Marks/Marks_sum,
         handball.ratio=Handballs/Handballs_sum,
         Goals.ratio=Goals/Goals_sum,
         behinds.ratio=Behinds/Behinds_sum,
         hitouts.ratio=Hit.Outs/Hit.Outs_sum,
         tackles.ratio=Tackles/Tackles_sum,
         rebounds.ratio=Rebounds/Rebounds_sum,
         inside50s.ratio=Inside.50s/Inside.50s_sum,
         clearances.ratio=Clearances/Clearances_sum,
         clangers.ratio=Clangers/Clangers_sum,
         freefors.ratio=Frees.For/Frees.For_sum,
         freesagainst.ratio=Frees.Against/Frees.Against_sum,
         Contested.Possessions.ratio=Contested.Possessions/Contested.Possessions_sum,
         Uncontested.Possessions.ratio=Uncontested.Possessions/Uncontested.Possessions_sum,
         contested.marks.ratio=Contested.Marks/Contested.Marks_sum,
         marksinside50.ratio=Marks.Inside.50/Marks.Inside.50_sum,
         one.percenters.ratio=One.Percenters/One.Percenters_sum,
         bounces.ratio=Bounces/Bounces_sum,
         goal.assists.ratio=Goal.Assists/Goal.Assists_sum,
         disposals.ratio=(Kicks+Handballs)/(Kicks_sum+Handballs_sum))
df<-complete_df_ratio%>%dplyr::select(Date, First.name, Surname, Season, Round.x, Playing.for,-Brownlow.Votes, Brownlow.Votes_sum,everything())
df<-df%>%dplyr::select(-Brownlow.Votes,everything())
df[is.na(df)] <- 0 # we have to replace NA with 0

Now we have all that we need for the dataset now lets do some modelling!

Fit a proportional odds model

Basically all we are doing is using the dataset we have and fitting a proportional odds model. What this means is that we will end up with a probability that each player will poll 3 votes, 2 votes, 1 vote and 0 votes for each and every single game.

This will involve a few steps

  1. use subset to get our training data
  2. use filter to make sure we only have home and away games –this is a data knowledge step, we know that you can not get votes in finals, but the column instead of being left empty is filled with 0s so an easy way to make sure we only have home and away games is using Round as a filter.
in.sample  <- subset(df, Season %in% c(2013:2016)) # our training data

in.sample$Brownlow.Votes <- factor(in.sample$Brownlow.Votes)

in.sample<-in.sample%>%filter(Round.x %in% c("1","2","3","4","5","6","7","8",
                                             "9","10","11","12","13","14","15","16","17","18","19","20","21","22","23","24"))


names(in.sample) 
##  [1] "Date"                          "First.name"                   
##  [3] "Surname"                       "Season"                       
##  [5] "Round.x"                       "Playing.for"                  
##  [7] "Brownlow.Votes_sum"            "Home.team"                    
##  [9] "Away.team"                     "Kicks"                        
## [11] "Marks"                         "Handballs"                    
## [13] "Goals"                         "Behinds"                      
## [15] "Hit.Outs"                      "Tackles"                      
## [17] "Rebounds"                      "Inside.50s"                   
## [19] "Clearances"                    "Clangers"                     
## [21] "Frees.For"                     "Frees.Against"                
## [23] "Contested.Possessions"         "Uncontested.Possessions"      
## [25] "Contested.Marks"               "Marks.Inside.50"              
## [27] "One.Percenters"                "Bounces"                      
## [29] "Goal.Assists"                  "Round.y"                      
## [31] "Kicks_sum"                     "Marks_sum"                    
## [33] "Handballs_sum"                 "Goals_sum"                    
## [35] "Behinds_sum"                   "Hit.Outs_sum"                 
## [37] "Tackles_sum"                   "Rebounds_sum"                 
## [39] "Inside.50s_sum"                "Clearances_sum"               
## [41] "Clangers_sum"                  "Frees.For_sum"                
## [43] "Frees.Against_sum"             "Contested.Possessions_sum"    
## [45] "Uncontested.Possessions_sum"   "Contested.Marks_sum"          
## [47] "Marks.Inside.50_sum"           "One.Percenters_sum"           
## [49] "Bounces_sum"                   "Goal.Assists_sum"             
## [51] "Round"                         "Points"                       
## [53] "Game"                          "margin"                       
## [55] "kick.ratio"                    "Marks.ratio"                  
## [57] "handball.ratio"                "Goals.ratio"                  
## [59] "behinds.ratio"                 "hitouts.ratio"                
## [61] "tackles.ratio"                 "rebounds.ratio"               
## [63] "inside50s.ratio"               "clearances.ratio"             
## [65] "clangers.ratio"                "freefors.ratio"               
## [67] "freesagainst.ratio"            "Contested.Possessions.ratio"  
## [69] "Uncontested.Possessions.ratio" "contested.marks.ratio"        
## [71] "marksinside50.ratio"           "one.percenters.ratio"         
## [73] "bounces.ratio"                 "goal.assists.ratio"           
## [75] "disposals.ratio"               "Brownlow.Votes"
in.sample$Player<-paste(in.sample$First.name,in.sample$Surname)

in.sample<-in.sample%>%dplyr::select(Player, Date, Season, Round.x, Playing.for, margin:Brownlow.Votes)
## Adding missing grouping variables: `Home.team`, `Away.team`
in.sample<-in.sample[-c(1,2)] #removed for 'cleaner look'



fm1<-clm(Brownlow.Votes~ handball.ratio +  Marks.ratio +  
           disposals.ratio+  hitouts.ratio+
           freefors.ratio +  freesagainst.ratio +  tackles.ratio +  Goals.ratio +   behinds.ratio + Contested.Possessions.ratio+
           Uncontested.Possessions.ratio +  clangers.ratio +    contested.marks.ratio + marksinside50.ratio +
           clearances.ratio +   rebounds.ratio +    inside50s.ratio +   one.percenters.ratio +  bounces.ratio+
           goal.assists.ratio  +margin, 
         data = in.sample)
## Warning: (3) Model is nearly unidentifiable: large eigenvalue ratio
##  - Rescale variables? 
## In addition: Absolute and relative convergence criteria were met

Variable selection

Nothing fancy here, remember the point is that you can pick and choose what variables you want as your predictors, here I am just going to use stepwise selection and go backwards.

But by all means, I am not saying this is the best way, I’m just getting to a result as quickly as I can with little thinking about the variables.

Ideally you might think about the variables and test them out for fit but moving on.

fm2<- stepAIC(fm1, direction='backward',type=AIC)
## Start:  AIC=13631.84
## Brownlow.Votes ~ handball.ratio + Marks.ratio + disposals.ratio + 
##     hitouts.ratio + freefors.ratio + freesagainst.ratio + tackles.ratio + 
##     Goals.ratio + behinds.ratio + Contested.Possessions.ratio + 
##     Uncontested.Possessions.ratio + clangers.ratio + contested.marks.ratio + 
##     marksinside50.ratio + clearances.ratio + rebounds.ratio + 
##     inside50s.ratio + one.percenters.ratio + bounces.ratio + 
##     goal.assists.ratio + margin
## Warning: (3) Model is nearly unidentifiable: large eigenvalue ratio
##  - Rescale variables? 
## In addition: Absolute and relative convergence criteria were met

## Warning: (3) Model is nearly unidentifiable: large eigenvalue ratio
##  - Rescale variables? 
## In addition: Absolute and relative convergence criteria were met

## Warning: (3) Model is nearly unidentifiable: large eigenvalue ratio
##  - Rescale variables? 
## In addition: Absolute and relative convergence criteria were met

## Warning: (3) Model is nearly unidentifiable: large eigenvalue ratio
##  - Rescale variables? 
## In addition: Absolute and relative convergence criteria were met

## Warning: (3) Model is nearly unidentifiable: large eigenvalue ratio
##  - Rescale variables? 
## In addition: Absolute and relative convergence criteria were met

## Warning: (3) Model is nearly unidentifiable: large eigenvalue ratio
##  - Rescale variables? 
## In addition: Absolute and relative convergence criteria were met

## Warning: (3) Model is nearly unidentifiable: large eigenvalue ratio
##  - Rescale variables? 
## In addition: Absolute and relative convergence criteria were met

## Warning: (3) Model is nearly unidentifiable: large eigenvalue ratio
##  - Rescale variables? 
## In addition: Absolute and relative convergence criteria were met

## Warning: (3) Model is nearly unidentifiable: large eigenvalue ratio
##  - Rescale variables? 
## In addition: Absolute and relative convergence criteria were met

## Warning: (3) Model is nearly unidentifiable: large eigenvalue ratio
##  - Rescale variables? 
## In addition: Absolute and relative convergence criteria were met

## Warning: (3) Model is nearly unidentifiable: large eigenvalue ratio
##  - Rescale variables? 
## In addition: Absolute and relative convergence criteria were met

## Warning: (3) Model is nearly unidentifiable: large eigenvalue ratio
##  - Rescale variables? 
## In addition: Absolute and relative convergence criteria were met

## Warning: (3) Model is nearly unidentifiable: large eigenvalue ratio
##  - Rescale variables? 
## In addition: Absolute and relative convergence criteria were met

## Warning: (3) Model is nearly unidentifiable: large eigenvalue ratio
##  - Rescale variables? 
## In addition: Absolute and relative convergence criteria were met

## Warning: (3) Model is nearly unidentifiable: large eigenvalue ratio
##  - Rescale variables? 
## In addition: Absolute and relative convergence criteria were met

## Warning: (3) Model is nearly unidentifiable: large eigenvalue ratio
##  - Rescale variables? 
## In addition: Absolute and relative convergence criteria were met

## Warning: (3) Model is nearly unidentifiable: large eigenvalue ratio
##  - Rescale variables? 
## In addition: Absolute and relative convergence criteria were met

## Warning: (3) Model is nearly unidentifiable: large eigenvalue ratio
##  - Rescale variables? 
## In addition: Absolute and relative convergence criteria were met
##                                 Df   AIC
## - Uncontested.Possessions.ratio  1 13630
## - rebounds.ratio                 1 13630
## - behinds.ratio                  1 13631
## - goal.assists.ratio             1 13631
## - inside50s.ratio                1 13631
##                              13632
## - freesagainst.ratio             1 13638
## - Contested.Possessions.ratio    1 13639
## - clearances.ratio               1 13639
## - freefors.ratio                 1 13641
## - bounces.ratio                  1 13647
## - clangers.ratio                 1 13650
## - marksinside50.ratio            1 13650
## - Marks.ratio                    1 13650
## - handball.ratio                 1 13653
## - contested.marks.ratio          1 13656
## - one.percenters.ratio           1 13669
## - tackles.ratio                  1 13690
## - hitouts.ratio                  1 13790
## - disposals.ratio                1 13863
## - Goals.ratio                    1 14583
## - margin                         1 15649
## 
## Step:  AIC=13630.1
## Brownlow.Votes ~ handball.ratio + Marks.ratio + disposals.ratio + 
##     hitouts.ratio + freefors.ratio + freesagainst.ratio + tackles.ratio + 
##     Goals.ratio + behinds.ratio + Contested.Possessions.ratio + 
##     clangers.ratio + contested.marks.ratio + marksinside50.ratio + 
##     clearances.ratio + rebounds.ratio + inside50s.ratio + one.percenters.ratio + 
##     bounces.ratio + goal.assists.ratio + margin
## 
##                               Df   AIC
## - rebounds.ratio               1 13628
## - behinds.ratio                1 13629
## - goal.assists.ratio           1 13629
## - inside50s.ratio              1 13629
##                            13630
## - freesagainst.ratio           1 13636
## - clearances.ratio             1 13638
## - freefors.ratio               1 13639
## - bounces.ratio                1 13645
## - clangers.ratio               1 13648
## - Marks.ratio                  1 13648
## - marksinside50.ratio          1 13648
## - handball.ratio               1 13652
## - contested.marks.ratio        1 13655
## - Contested.Possessions.ratio  1 13661
## - one.percenters.ratio         1 13667
## - tackles.ratio                1 13688
## - hitouts.ratio                1 13789
## - Goals.ratio                  1 14581
## - disposals.ratio              1 14601
## - margin                       1 15647
## 
## Step:  AIC=13628.54
## Brownlow.Votes ~ handball.ratio + Marks.ratio + disposals.ratio + 
##     hitouts.ratio + freefors.ratio + freesagainst.ratio + tackles.ratio + 
##     Goals.ratio + behinds.ratio + Contested.Possessions.ratio + 
##     clangers.ratio + contested.marks.ratio + marksinside50.ratio + 
##     clearances.ratio + inside50s.ratio + one.percenters.ratio + 
##     bounces.ratio + goal.assists.ratio + margin
## 
##                               Df   AIC
## - behinds.ratio                1 13627
## - goal.assists.ratio           1 13627
## - inside50s.ratio              1 13628
##                            13628
## - freesagainst.ratio           1 13635
## - clearances.ratio             1 13636
## - freefors.ratio               1 13638
## - bounces.ratio                1 13644
## - marksinside50.ratio          1 13646
## - clangers.ratio               1 13646
## - Marks.ratio                  1 13647
## - contested.marks.ratio        1 13654
## - handball.ratio               1 13655
## - Contested.Possessions.ratio  1 13660
## - one.percenters.ratio         1 13668
## - tackles.ratio                1 13686
## - hitouts.ratio                1 13787
## - Goals.ratio                  1 14591
## - disposals.ratio              1 14799
## - margin                       1 15645
## 
## Step:  AIC=13627.36
## Brownlow.Votes ~ handball.ratio + Marks.ratio + disposals.ratio + 
##     hitouts.ratio + freefors.ratio + freesagainst.ratio + tackles.ratio + 
##     Goals.ratio + Contested.Possessions.ratio + clangers.ratio + 
##     contested.marks.ratio + marksinside50.ratio + clearances.ratio + 
##     inside50s.ratio + one.percenters.ratio + bounces.ratio + 
##     goal.assists.ratio + margin
## 
##                               Df   AIC
## - goal.assists.ratio           1 13626
## - inside50s.ratio              1 13627
##                            13627
## - freesagainst.ratio           1 13634
## - clearances.ratio             1 13634
## - freefors.ratio               1 13637
## - bounces.ratio                1 13643
## - Marks.ratio                  1 13646
## - clangers.ratio               1 13646
## - marksinside50.ratio          1 13651
## - contested.marks.ratio        1 13652
## - handball.ratio               1 13656
## - Contested.Possessions.ratio  1 13660
## - one.percenters.ratio         1 13666
## - tackles.ratio                1 13685
## - hitouts.ratio                1 13786
## - Goals.ratio                  1 14591
## - disposals.ratio              1 14806
## - margin                       1 15646
## 
## Step:  AIC=13626.24
## Brownlow.Votes ~ handball.ratio + Marks.ratio + disposals.ratio + 
##     hitouts.ratio + freefors.ratio + freesagainst.ratio + tackles.ratio + 
##     Goals.ratio + Contested.Possessions.ratio + clangers.ratio + 
##     contested.marks.ratio + marksinside50.ratio + clearances.ratio + 
##     inside50s.ratio + one.percenters.ratio + bounces.ratio + 
##     margin
## 
##                               Df   AIC
## - inside50s.ratio              1 13626
##                            13626
## - clearances.ratio             1 13633
## - freesagainst.ratio           1 13633
## - freefors.ratio               1 13636
## - bounces.ratio                1 13642
## - Marks.ratio                  1 13644
## - clangers.ratio               1 13645
## - marksinside50.ratio          1 13651
## - contested.marks.ratio        1 13651
## - handball.ratio               1 13654
## - Contested.Possessions.ratio  1 13659
## - one.percenters.ratio         1 13664
## - tackles.ratio                1 13684
## - hitouts.ratio                1 13784
## - Goals.ratio                  1 14590
## - disposals.ratio              1 14804
## - margin                       1 15645
## 
## Step:  AIC=13625.93
## Brownlow.Votes ~ handball.ratio + Marks.ratio + disposals.ratio + 
##     hitouts.ratio + freefors.ratio + freesagainst.ratio + tackles.ratio + 
##     Goals.ratio + Contested.Possessions.ratio + clangers.ratio + 
##     contested.marks.ratio + marksinside50.ratio + clearances.ratio + 
##     one.percenters.ratio + bounces.ratio + margin
## 
##                               Df   AIC
##                            13626
## - freesagainst.ratio           1 13633
## - clearances.ratio             1 13634
## - freefors.ratio               1 13635
## - bounces.ratio                1 13642
## - Marks.ratio                  1 13643
## - clangers.ratio               1 13644
## - marksinside50.ratio          1 13650
## - contested.marks.ratio        1 13650
## - Contested.Possessions.ratio  1 13658
## - handball.ratio               1 13662
## - one.percenters.ratio         1 13663
## - tackles.ratio                1 13683
## - hitouts.ratio                1 13784
## - Goals.ratio                  1 14600
## - disposals.ratio              1 15042
## - margin                       1 15645

Lets do the prediction

Instead of just taking the data from afltables in 2017, lets do something just a little bit different. Lets use footywire as our out.sample

This might seem like a bit of a weird step, but it is really valuable. Footywire has more data than afltables for games, it contains stats like tackles inside 50 as an example. Also as an example within fitzRoy we essentially have 2 data sources for you to use. Either afltables or footywire. There are benefits to both for example afltables data goes further back, but doesn’t have extra columns that footywire has that you might want to use for example supercoach scores.

names(fitzRoy::player_stats)
##  [1] "Date"           "Season"         "Round"          "Venue"         
##  [5] "Player"         "Team"           "Opposition"     "Status"        
##  [9] "Match_id"       "CP"             "UP"             "ED"            
## [13] "DE"             "CM"             "GA"             "MI5"           
## [17] "One.Percenters" "BO"             "TOG"            "K"             
## [21] "HB"             "D"              "M"              "G"             
## [25] "B"              "T"              "HO"             "GA1"           
## [29] "I50"            "CL"             "CG"             "R50"           
## [33] "FF"             "FA"             "AF"             "SC"            
## [37] "CCL"            "SCL"            "SI"             "MG"            
## [41] "TO"             "ITC"            "T5"
df_2017<-fitzRoy::player_stats%>%
  filter(Season==2017)

team_stats_out<-df_2017%>%
  dplyr::select(Date, Player,Season, Round, Team, CP:T5)%>%
  group_by(Date,Season, Round, Team)%>%
  summarise_if(is.numeric,funs(sum=c(sum(.))))

player_stats_out<-df_2017%>%
  dplyr::select(Date, Player,Season, Round, Team, CP:T5)


complete_df_out<-left_join(player_stats_out,team_stats_out, by=c("Date"="Date", "Season"="Season",  "Team"="Team"))



dataset_scores<-fitzRoy::match_results
names(dataset_scores)
##  [1] "Game"         "Date"         "Round"        "Home.Team"   
##  [5] "Home.Goals"   "Home.Behinds" "Home.Points"  "Away.Team"   
##  [9] "Away.Goals"   "Away.Behinds" "Away.Points"  "Venue"       
## [13] "Margin"       "Season"       "Round.Type"   "Round.Number"
dataset_scores1<-dataset_scores%>%dplyr::select (Date, Round, Home.Team, Home.Points,Game)
dataset_scores2<-dplyr::select(dataset_scores, Date, Round, Away.Team, Away.Points,Game)

colnames(dataset_scores1)[3]<-"Team"
colnames(dataset_scores1)[4]<-"Points"
colnames(dataset_scores2)[3]<-"Team"
colnames(dataset_scores2)[4]<-"Points"


df5<-rbind(dataset_scores1,dataset_scores2)
dataset_margins<-df5%>%group_by(Game)%>%
  arrange(Game)%>%
  mutate(margin=c(-diff(Points),diff(Points)))
dataset_margins$Date<-as.Date(dataset_margins$Date)
complete_df_out$Date<-as.Date(complete_df_out$Date)

dataset_margins<-dataset_margins %>%mutate(Team = str_replace(Team, "Brisbane Lions", "Brisbane"))

dataset_margins<-dataset_margins %>%mutate(Team = str_replace(Team, "Footscray", "Western Bulldogs"))


complete_df_out<-left_join(complete_df_out,dataset_margins,by=c("Date"="Date",  "Team"="Team"))

names(complete_df_out)
##  [1] "Date"               "Player"             "Season"            
##  [4] "Round.x"            "Team"               "CP"                
##  [7] "UP"                 "ED"                 "DE"                
## [10] "CM"                 "GA"                 "MI5"               
## [13] "One.Percenters"     "BO"                 "TOG"               
## [16] "K"                  "HB"                 "D"                 
## [19] "M"                  "G"                  "B"                 
## [22] "T"                  "HO"                 "GA1"               
## [25] "I50"                "CL"                 "CG"                
## [28] "R50"                "FF"                 "FA"                
## [31] "AF"                 "SC"                 "CCL"               
## [34] "SCL"                "SI"                 "MG"                
## [37] "TO"                 "ITC"                "T5"                
## [40] "Round.y"            "CP_sum"             "UP_sum"            
## [43] "ED_sum"             "DE_sum"             "CM_sum"            
## [46] "GA_sum"             "MI5_sum"            "One.Percenters_sum"
## [49] "BO_sum"             "TOG_sum"            "K_sum"             
## [52] "HB_sum"             "D_sum"              "M_sum"             
## [55] "G_sum"              "B_sum"              "T_sum"             
## [58] "HO_sum"             "GA1_sum"            "I50_sum"           
## [61] "CL_sum"             "CG_sum"             "R50_sum"           
## [64] "FF_sum"             "FA_sum"             "AF_sum"            
## [67] "SC_sum"             "CCL_sum"            "SCL_sum"           
## [70] "SI_sum"             "MG_sum"             "TO_sum"            
## [73] "ITC_sum"            "T5_sum"             "Round"             
## [76] "Points"             "Game"               "margin"
####create the new ratios
complete_df_ratio_out<-complete_df_out%>%
  mutate(kick.ratio=K/K_sum,
         Marks.ratio=M/M_sum,
         handball.ratio=HB/HB_sum,
         Goals.ratio=G/G_sum,
         behinds.ratio=B/B_sum,
         hitouts.ratio=HO/HO_sum,
         tackles.ratio=T/T_sum,
         rebounds.ratio=R50/R50_sum,
         inside50s.ratio=I50/I50_sum,
         clearances.ratio=(CCL+SCL)/(CCL_sum+SCL_sum),
         clangers.ratio=CL/CL_sum,
         freefors.ratio=FF/FF_sum,
         freesagainst.ratio=FA/FA_sum,
         Contested.Possessions.ratio=CP/CP_sum,
         Uncontested.Possessions.ratio=UP/UP_sum,
         contested.marks.ratio=CM/CM_sum,
         marksinside50.ratio=MI5/MI5_sum,
         one.percenters.ratio=One.Percenters/One.Percenters_sum,
         bounces.ratio=BO/BO_sum,
         goal.assists.ratio=GA/GA_sum,
         disposals.ratio=D/D_sum)




conforming<-complete_df_ratio_out%>%
  dplyr::select(Player, Date, Season, Round.x, Team, margin, 
                kick.ratio:disposals.ratio)

conforming$Brownlow.Votes<-0
out.sample=conforming

newdata   <- out.sample[ , -ncol(out.sample)]

pre.dict    <- predict(fm2,newdata=newdata, type='prob')
pre.dict.m  <- data.frame(matrix(unlist(pre.dict), nrow= nrow(newdata)))
colnames(pre.dict.m) <- c("vote.0", "vote.1", "vote.2", "vote.3")

newdata.pred  <- cbind.data.frame(newdata, pre.dict.m)

Expected Votes

A proportional odds model makes the players total probabilities for each voting category sum to 1 for every player that the player plays in.

That is for Dustin Martin in round 1 2017, \(1= \sum_{k=0}^3 p_{k}\) which makes sense.

But we know that there is another harder restriction. That is only one player can get 3 votes, only one player can get 2 votes and only one player can get the 1 vote.

Now lets add in our much tougher restriction. This will mean the probability of polling 3 votes will sum to 1 across all players. We will call this the std (standardised) votes.

#### Step 1: Get expected value on Votes
newdata.pred$expected.votes <- newdata.pred$vote.1 + 2*newdata.pred$vote.2 + 3*newdata.pred$vote.3

####Join back on matchID whoops! 

get_match_ID<-fitzRoy::player_stats

xx<-get_match_ID%>%dplyr::select(Date, Player, Match_id)
newdata.pred<-left_join(newdata.pred, xx, by=c("Date"="Date",  "Player"="Player"))



newdata.pred<-filter(newdata.pred, Date<"2017-09-01")


sum1 <- aggregate(vote.1~Match_id, data = newdata.pred, FUN = sum ); names(sum1) <- c("Match_id", "sum.vote.1");
sum2 <- aggregate(vote.2~Match_id, data = newdata.pred, FUN = sum ); names(sum2) <- c("Match_id", "sum.vote.2");
sum3 <- aggregate(vote.3~Match_id, data = newdata.pred, FUN = sum ); names(sum3) <- c("Match_id", "sum.vote.3");

#### Step 3: Add sum of each vote by matchId to big table
newdata.pred <- merge(newdata.pred, sum1, by = "Match_id")
newdata.pred <- merge(newdata.pred, sum2, by = "Match_id")
newdata.pred <- merge(newdata.pred, sum3, by = "Match_id")

#### Step 4: Add std1/2/3
newdata.pred$std.1  <- (newdata.pred$sum.vote.1/newdata.pred$vote.1)^-1
newdata.pred$std.2  <- (newdata.pred$sum.vote.2/newdata.pred$vote.2)^-1
newdata.pred$std.3  <- (newdata.pred$sum.vote.3/newdata.pred$vote.3)^-1


#### Step 5: Expected standard game vote
newdata.pred$exp_std_game_vote <- newdata.pred$std.1 + 2*newdata.pred$std.2 + 3*newdata.pred$std.3  


#### Step 6: List of winners

newdata.pred$PlayerName<-paste(newdata.pred$Player," ",newdata.pred$Team)
winners.stdgame   <- aggregate(exp_std_game_vote~PlayerName, data = newdata.pred, FUN = sum );
winners.stdgame   <- winners.stdgame[order(-winners.stdgame$exp_std_game_vote), ]
winners.stdgame[1:10, ]
##                        PlayerName exp_std_game_vote
## 168      Dustin Martin   Richmond          34.26070
## 487 Patrick Dangerfield   Geelong          31.19050
## 618       Tom Mitchell   Hawthorn          27.40018
## 157        Dayne Zorko   Brisbane          17.68230
## 344            Joshua Kelly   GWS          16.82974
## 510         Rory Laird   Adelaide          15.98579
## 652    Zachary Merrett   Essendon          14.80721
## 512        Rory Sloane   Adelaide          14.53580
## 375       Lance Franklin   Sydney          14.02063
## 464       Nathan Fyfe   Fremantle          13.81357

But how good is this prediction?

We want our predicted order to be as close to correct as possible, especially for the top 10.

So lets use as a measure of accuracy ranking error which I will define as difference between actual place vs predicted place. For example, we have predicted Dustin Martin to finish first and he did so error is 0, if we predicted Dustin Martin to be runner up (like Champion Data did) then we would have a ranking error of 1.

library(tidyverse)
data.frame(stringsAsFactors=FALSE,
           Predictor = c("anoafl", "anoafl", "anoafl", "anoafl", "anoafl", "anoafl", "anoafl", "anoafl",
                         "anoafl", "anoafl", "anoafl", "anoafl", "anoafl", "anoafl", "CD", "CD", "CD", "CD", "CD",
                         "CD", "CD", "CD", "CD", "CD", "CD", "CD", "CD", "CD"),
           Player = c("Dustin Martin", "Patrick DangerField", "Tom Mitchell",
                      "Dayne Zorko", "Josh Kelly", "Rory Laird", "Zach Merrett", "Rory Sloan",
                      "Lance Franklin", "Nat Fyfe", "Trent Cotchin", "Clayton Oliver",
                      "Callan Ward", "Marcus Bontempelli", "Patrick DangerField", "Dustin Martin",
                      "Clayton Oliver", "Tom Mitchell", "Rory Sloan", "Callan Ward", "Zach Merrett",
                      "Marcus Bontempelli", "Josh Kelly", "Trent Cotchin", "Rory Laird", "Lance Franklin",
                      "Nat Fyfe", "Dayne Zorko"),
           Ranking.Error = c(0L, 0L, 0L, 12L, 1L, 38L, 5L, 1L, 4L, 2L, 11L, 28L, 8L,
                             5L, 1L, 1L, 23L, 1L, 2L, 22L, 5L, 0L, 3L, 41L, 30L, 6L, 4L, 10L)
)%>%ggplot(aes(x=Ranking.Error, y=Player))+geom_point(aes(colour=Predictor), size=5)+ theme(axis.text.y=element_text(size=12)) +ylab("") 

So what we can see here is that by taking just the basic data and making them into game ratios and thinking a bit more about probability and with little thought into variable selection (backwards) we can come up with a prediction that does just as well (if not a little better) as champion data (although obviously biased).

So hopefully now you get the idea, this isn’t meant to be the very best method, but hopefully it does encourage you to instead of using just backwards aic to select your variables you can do something else instead.

What changes to make?

Probably the most obvious one is the variable selection step.

Now you could simply enter in the variables you thought might have an impact on polling brownlow votes and to be honest this is a good way to do things. As a football fan you would hope that your domain knowledge does improve things.

So how to do that?

Lets say you think that only contested possessions, inside 50s and clearances matter after all it is a midfielders award.

fm2<-clm(Brownlow.Votes~    Contested.Possessions.ratio+
           
           clearances.ratio +   inside50s.ratio,
         data = in.sample)

Then you can run everything through as normal and see how your model goes at predicting.

Why use AIC, why not BIC?

That’s a good point and exactly the point I would like to make lots of these choices can be changed and depends on use. The whole point if this exercise was to show that relatively quickly and with only freely available data you can do well in comparison to say CD neural net.

But if you favour interpretable models, you might want to use BIC, it penalises model complexity more so you will end up with models that have less variables.

That step is also quick to change

fm2<- stepAIC(fm1, direction='backward',type=BIC)

Lets do 2018

library(MASS)
library(ordinal)
library(fitzRoy)
library(tidyverse)

df<-fitzRoy::get_afltables_stats(start_date = "1897-01-01", end_date = Sys.Date())
## Returning data from 1897-01-01 to 2018-09-20
## Downloading data
## 
## Finished downloading data. Processing XMLs
## Finished getting afltables data
df<-df%>%filter(Season>2010)

names(df)
##  [1] "Season"                  "Round"                  
##  [3] "Date"                    "Local.start.time"       
##  [5] "Venue"                   "Attendance"             
##  [7] "Home.team"               "HQ1G"                   
##  [9] "HQ1B"                    "HQ2G"                   
## [11] "HQ2B"                    "HQ3G"                   
## [13] "HQ3B"                    "HQ4G"                   
## [15] "HQ4B"                    "Home.score"             
## [17] "Away.team"               "AQ1G"                   
## [19] "AQ1B"                    "AQ2G"                   
## [21] "AQ2B"                    "AQ3G"                   
## [23] "AQ3B"                    "AQ4G"                   
## [25] "AQ4B"                    "Away.score"             
## [27] "First.name"              "Surname"                
## [29] "ID"                      "Jumper.No."             
## [31] "Playing.for"             "Kicks"                  
## [33] "Marks"                   "Handballs"              
## [35] "Goals"                   "Behinds"                
## [37] "Hit.Outs"                "Tackles"                
## [39] "Rebounds"                "Inside.50s"             
## [41] "Clearances"              "Clangers"               
## [43] "Frees.For"               "Frees.Against"          
## [45] "Brownlow.Votes"          "Contested.Possessions"  
## [47] "Uncontested.Possessions" "Contested.Marks"        
## [49] "Marks.Inside.50"         "One.Percenters"         
## [51] "Bounces"                 "Goal.Assists"           
## [53] "Time.on.Ground.."        "Substitute"             
## [55] "Umpire.1"                "Umpire.2"               
## [57] "Umpire.3"                "Umpire.4"               
## [59] "group_id"
team_stats<-df%>%
  dplyr::select(Date, First.name,Surname,Season, Round, Playing.for, Kicks:Goal.Assists)%>%
  group_by(Date, Season, Round, Playing.for)%>%
  summarise_if(is.numeric,funs(sum=c(sum(.))))
## Adding missing grouping variables: `Home.team`, `Away.team`
player_stats<-df%>%
  dplyr::select(Date, First.name,Surname,Season, Round, Playing.for, Kicks:Goal.Assists)
## Adding missing grouping variables: `Home.team`, `Away.team`
complete_df<-left_join(player_stats,team_stats, by=c("Date"="Date", "Season"="Season",  "Playing.for"="Playing.for"))

dataset_scores<-fitzRoy::match_results
names(dataset_scores)
##  [1] "Game"         "Date"         "Round"        "Home.Team"   
##  [5] "Home.Goals"   "Home.Behinds" "Home.Points"  "Away.Team"   
##  [9] "Away.Goals"   "Away.Behinds" "Away.Points"  "Venue"       
## [13] "Margin"       "Season"       "Round.Type"   "Round.Number"
dataset_scores1<-dataset_scores%>%dplyr::select (Date, Round, Home.Team, Home.Points,Game)
dataset_scores2<-dplyr::select(dataset_scores, Date, Round, Away.Team, Away.Points,Game)

colnames(dataset_scores1)[3]<-"Team"
colnames(dataset_scores1)[4]<-"Points"
colnames(dataset_scores2)[3]<-"Team"
colnames(dataset_scores2)[4]<-"Points"

df5<-rbind(dataset_scores1,dataset_scores2)
dataset_margins<-df5%>%group_by(Game)%>%
  arrange(Game)%>%
  mutate(margin=c(-diff(Points),diff(Points)))
# View(dataset_margins)
dataset_margins$Date<-as.Date(dataset_margins$Date)
complete_df$Date<-as.Date(complete_df$Date)

complete_df<-left_join(complete_df,dataset_margins,by=c("Date"="Date",  "Playing.for"="Team"))


complete_df_ratio<-complete_df%>%
  mutate(kick.ratio=Kicks/Kicks_sum,
         Marks.ratio=Marks/Marks_sum,
         handball.ratio=Handballs/Handballs_sum,
         Goals.ratio=Goals/Goals_sum,
         behinds.ratio=Behinds/Behinds_sum,
         hitouts.ratio=Hit.Outs/Hit.Outs_sum,
         tackles.ratio=Tackles/Tackles_sum,
         rebounds.ratio=Rebounds/Rebounds_sum,
         inside50s.ratio=Inside.50s/Inside.50s_sum,
         clearances.ratio=Clearances/Clearances_sum,
         clangers.ratio=Clangers/Clangers_sum,
         freefors.ratio=Frees.For/Frees.For_sum,
         freesagainst.ratio=Frees.Against/Frees.Against_sum,
         Contested.Possessions.ratio=Contested.Possessions/Contested.Possessions_sum,
         Uncontested.Possessions.ratio=Uncontested.Possessions/Uncontested.Possessions_sum,
         contested.marks.ratio=Contested.Marks/Contested.Marks_sum,
         marksinside50.ratio=Marks.Inside.50/Marks.Inside.50_sum,
         one.percenters.ratio=One.Percenters/One.Percenters_sum,
         bounces.ratio=Bounces/Bounces_sum,
         goal.assists.ratio=Goal.Assists/Goal.Assists_sum,
         disposals.ratio=(Kicks+Handballs)/(Kicks_sum+Handballs_sum))
df<-complete_df_ratio%>%dplyr::select(Date,Game, First.name, Surname, Season, Round.x, Playing.for,-Brownlow.Votes, Brownlow.Votes_sum,everything())
df<-df%>%dplyr::select(-Brownlow.Votes,everything())
df[is.na(df)] <- 0


in.sample  <- subset(df, Season %in% c(2014:2017))
# out.sample <- subset(df, Season == 2014)
in.sample$Brownlow.Votes <- factor(in.sample$Brownlow.Votes)

in.sample<-in.sample%>%filter(Round.x %in% c("1","2","3","4","5","6","7","8",
                                             "9","10","11","12","13","14","15","16","17","18","19","20","21","22","23","24"))


names(in.sample)
##  [1] "Date"                          "Game"                         
##  [3] "First.name"                    "Surname"                      
##  [5] "Season"                        "Round.x"                      
##  [7] "Playing.for"                   "Brownlow.Votes_sum"           
##  [9] "Home.team"                     "Away.team"                    
## [11] "Kicks"                         "Marks"                        
## [13] "Handballs"                     "Goals"                        
## [15] "Behinds"                       "Hit.Outs"                     
## [17] "Tackles"                       "Rebounds"                     
## [19] "Inside.50s"                    "Clearances"                   
## [21] "Clangers"                      "Frees.For"                    
## [23] "Frees.Against"                 "Contested.Possessions"        
## [25] "Uncontested.Possessions"       "Contested.Marks"              
## [27] "Marks.Inside.50"               "One.Percenters"               
## [29] "Bounces"                       "Goal.Assists"                 
## [31] "Round.y"                       "Kicks_sum"                    
## [33] "Marks_sum"                     "Handballs_sum"                
## [35] "Goals_sum"                     "Behinds_sum"                  
## [37] "Hit.Outs_sum"                  "Tackles_sum"                  
## [39] "Rebounds_sum"                  "Inside.50s_sum"               
## [41] "Clearances_sum"                "Clangers_sum"                 
## [43] "Frees.For_sum"                 "Frees.Against_sum"            
## [45] "Contested.Possessions_sum"     "Uncontested.Possessions_sum"  
## [47] "Contested.Marks_sum"           "Marks.Inside.50_sum"          
## [49] "One.Percenters_sum"            "Bounces_sum"                  
## [51] "Goal.Assists_sum"              "Round"                        
## [53] "Points"                        "margin"                       
## [55] "kick.ratio"                    "Marks.ratio"                  
## [57] "handball.ratio"                "Goals.ratio"                  
## [59] "behinds.ratio"                 "hitouts.ratio"                
## [61] "tackles.ratio"                 "rebounds.ratio"               
## [63] "inside50s.ratio"               "clearances.ratio"             
## [65] "clangers.ratio"                "freefors.ratio"               
## [67] "freesagainst.ratio"            "Contested.Possessions.ratio"  
## [69] "Uncontested.Possessions.ratio" "contested.marks.ratio"        
## [71] "marksinside50.ratio"           "one.percenters.ratio"         
## [73] "bounces.ratio"                 "goal.assists.ratio"           
## [75] "disposals.ratio"               "Brownlow.Votes"
in.sample$Player<-paste(in.sample$First.name,in.sample$Surname)

in.sample<-in.sample%>%dplyr::select(Player, Date, Season, Round.x, Playing.for, margin:Brownlow.Votes)
## Adding missing grouping variables: `Home.team`, `Away.team`
# in.sample<-in.sample[-c(1,2)]







summary(in.sample)
##   Home.team          Away.team            Player         
##  Length:34804       Length:34804       Length:34804      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##       Date                Season       Round.x          Playing.for       
##  Min.   :2014-03-14   Min.   :2014   Length:34804       Length:34804      
##  1st Qu.:2014-08-31   1st Qu.:2014   Class :character   Class :character  
##  Median :2016-03-24   Median :2016   Mode  :character   Mode  :character  
##  Mean   :2015-12-12   Mean   :2016                                        
##  3rd Qu.:2017-03-23   3rd Qu.:2017                                        
##  Max.   :2017-08-27   Max.   :2017                                        
##      margin            kick.ratio       Marks.ratio      handball.ratio   
##  Min.   :-145.0000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.: -26.0000   1st Qu.:0.02985   1st Qu.:0.02597   1st Qu.:0.02646  
##  Median :   0.0000   Median :0.04367   Median :0.04274   Median :0.04132  
##  Mean   :  -0.3173   Mean   :0.04545   Mean   :0.04545   Mean   :0.04545  
##  3rd Qu.:  26.0000   3rd Qu.:0.05914   3rd Qu.:0.06173   3rd Qu.:0.06015  
##  Max.   : 145.0000   Max.   :0.15976   Max.   :0.19444   Max.   :0.19820  
##   Goals.ratio      behinds.ratio     hitouts.ratio     tackles.ratio    
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.01923  
##  Median :0.00000   Median :0.00000   Median :0.00000   Median :0.03947  
##  Mean   :0.04545   Mean   :0.04545   Mean   :0.04545   Mean   :0.04545  
##  3rd Qu.:0.07692   3rd Qu.:0.08333   3rd Qu.:0.00000   3rd Qu.:0.06383  
##  Max.   :0.71429   Max.   :1.00000   Max.   :1.00000   Max.   :0.27273  
##  rebounds.ratio    inside50s.ratio   clearances.ratio  clangers.ratio   
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.01887   1st Qu.:0.00000   1st Qu.:0.02128  
##  Median :0.03226   Median :0.04000   Median :0.02632   Median :0.04167  
##  Mean   :0.04545   Mean   :0.04545   Mean   :0.04545   Mean   :0.04545  
##  3rd Qu.:0.07143   3rd Qu.:0.06667   3rd Qu.:0.06818   3rd Qu.:0.06522  
##  Max.   :0.40909   Max.   :0.29412   Max.   :0.46154   Max.   :0.25000  
##  freefors.ratio    freesagainst.ratio Contested.Possessions.ratio
##  Min.   :0.00000   Min.   :0.00000    Min.   :0.00000            
##  1st Qu.:0.00000   1st Qu.:0.00000    1st Qu.:0.02703            
##  Median :0.04167   Median :0.04167    Median :0.04110            
##  Mean   :0.04545   Mean   :0.04545    Mean   :0.04545            
##  3rd Qu.:0.07143   3rd Qu.:0.07143    3rd Qu.:0.05932            
##  Max.   :0.50000   Max.   :0.50000    Max.   :0.20800            
##  Uncontested.Possessions.ratio contested.marks.ratio marksinside50.ratio
##  Min.   :0.00000               Min.   :0.00000       Min.   :0.00000    
##  1st Qu.:0.02899               1st Qu.:0.00000       1st Qu.:0.00000    
##  Median :0.04348               Median :0.00000       Median :0.00000    
##  Mean   :0.04545               Mean   :0.04543       Mean   :0.04545    
##  3rd Qu.:0.06000               3rd Qu.:0.08333       3rd Qu.:0.07143    
##  Max.   :0.15287               Max.   :1.00000       Max.   :1.00000    
##  one.percenters.ratio bounces.ratio     goal.assists.ratio
##  Min.   :0.00000      Min.   :0.00000   Min.   :0.00000   
##  1st Qu.:0.01471      1st Qu.:0.00000   1st Qu.:0.00000   
##  Median :0.03125      Median :0.00000   Median :0.00000   
##  Mean   :0.04545      Mean   :0.04454   Mean   :0.04545   
##  3rd Qu.:0.06452      3rd Qu.:0.00000   3rd Qu.:0.08333   
##  Max.   :0.40541      Max.   :1.00000   Max.   :1.00000   
##  disposals.ratio   Brownlow.Votes
##  Min.   :0.00000   0:32431       
##  1st Qu.:0.03161   1:  791       
##  Median :0.04323   2:  791       
##  Mean   :0.04545   3:  791       
##  3rd Qu.:0.05788                 
##  Max.   :0.14196
fm1<-clm(Brownlow.Votes~ handball.ratio +  Marks.ratio +  
           disposals.ratio+  hitouts.ratio+
           freefors.ratio +  freesagainst.ratio +  tackles.ratio +  Goals.ratio +   behinds.ratio + Contested.Possessions.ratio+
           Uncontested.Possessions.ratio +  clangers.ratio +    contested.marks.ratio + marksinside50.ratio +
           clearances.ratio +   rebounds.ratio +    inside50s.ratio +   one.percenters.ratio +  bounces.ratio+
           goal.assists.ratio  +margin, 
         data = in.sample)

fm2<- stepAIC(fm1, direction='backward',type=AIC)
## Start:  AIC=13509.69
## Brownlow.Votes ~ handball.ratio + Marks.ratio + disposals.ratio + 
##     hitouts.ratio + freefors.ratio + freesagainst.ratio + tackles.ratio + 
##     Goals.ratio + behinds.ratio + Contested.Possessions.ratio + 
##     Uncontested.Possessions.ratio + clangers.ratio + contested.marks.ratio + 
##     marksinside50.ratio + clearances.ratio + rebounds.ratio + 
##     inside50s.ratio + one.percenters.ratio + bounces.ratio + 
##     goal.assists.ratio + margin
## 
##                                 Df   AIC
## - Uncontested.Possessions.ratio  1 13508
##                              13510
## - goal.assists.ratio             1 13510
## - inside50s.ratio                1 13510
## - behinds.ratio                  1 13510
## - rebounds.ratio                 1 13511
## - freesagainst.ratio             1 13514
## - Contested.Possessions.ratio    1 13514
## - clangers.ratio                 1 13518
## - clearances.ratio               1 13520
## - bounces.ratio                  1 13522
## - marksinside50.ratio            1 13525
## - freefors.ratio                 1 13525
## - handball.ratio                 1 13529
## - contested.marks.ratio          1 13538
## - Marks.ratio                    1 13538
## - one.percenters.ratio           1 13548
## - tackles.ratio                  1 13585
## - hitouts.ratio                  1 13687
## - disposals.ratio                1 13738
## - Goals.ratio                    1 14490
## - margin                         1 15524
## 
## Step:  AIC=13508.46
## Brownlow.Votes ~ handball.ratio + Marks.ratio + disposals.ratio + 
##     hitouts.ratio + freefors.ratio + freesagainst.ratio + tackles.ratio + 
##     Goals.ratio + behinds.ratio + Contested.Possessions.ratio + 
##     clangers.ratio + contested.marks.ratio + marksinside50.ratio + 
##     clearances.ratio + rebounds.ratio + inside50s.ratio + one.percenters.ratio + 
##     bounces.ratio + goal.assists.ratio + margin
## 
##                               Df   AIC
##                            13508
## - inside50s.ratio              1 13509
## - goal.assists.ratio           1 13509
## - behinds.ratio                1 13509
## - rebounds.ratio               1 13510
## - freesagainst.ratio           1 13512
## - clangers.ratio               1 13517
## - clearances.ratio             1 13520
## - bounces.ratio                1 13520
## - freefors.ratio               1 13524
## - marksinside50.ratio          1 13524
## - handball.ratio               1 13529
## - Contested.Possessions.ratio  1 13535
## - Marks.ratio                  1 13536
## - contested.marks.ratio        1 13538
## - one.percenters.ratio         1 13546
## - tackles.ratio                1 13584
## - hitouts.ratio                1 13687
## - disposals.ratio              1 14403
## - Goals.ratio                  1 14489
## - margin                       1 15523
names(fitzRoy::player_stats)
##  [1] "Date"           "Season"         "Round"          "Venue"         
##  [5] "Player"         "Team"           "Opposition"     "Status"        
##  [9] "Match_id"       "CP"             "UP"             "ED"            
## [13] "DE"             "CM"             "GA"             "MI5"           
## [17] "One.Percenters" "BO"             "TOG"            "K"             
## [21] "HB"             "D"              "M"              "G"             
## [25] "B"              "T"              "HO"             "GA1"           
## [29] "I50"            "CL"             "CG"             "R50"           
## [33] "FF"             "FA"             "AF"             "SC"            
## [37] "CCL"            "SCL"            "SI"             "MG"            
## [41] "TO"             "ITC"            "T5"
df_2017<-fitzRoy::get_footywire_stats(9514:9709)
## Getting data from footywire.com
## Finished getting data
team_stats_out<-df_2017%>%
  dplyr::select(Date, Player,Season, Round, Team, CP:SC)%>%
  group_by(Date,Season, Round, Team)%>%
  summarise_if(is.numeric,funs(sum=c(sum(.))))

player_stats_out<-df_2017%>%
  dplyr::select(Date, Player,Season, Round, Team, CP:SC)


complete_df_out<-left_join(player_stats_out,team_stats_out, by=c("Date"="Date", "Season"="Season",  "Team"="Team"))



dataset_scores<-fitzRoy::match_results
names(dataset_scores)
##  [1] "Game"         "Date"         "Round"        "Home.Team"   
##  [5] "Home.Goals"   "Home.Behinds" "Home.Points"  "Away.Team"   
##  [9] "Away.Goals"   "Away.Behinds" "Away.Points"  "Venue"       
## [13] "Margin"       "Season"       "Round.Type"   "Round.Number"
dataset_scores1<-dataset_scores%>%dplyr::select (Date, Round, Home.Team, Home.Points,Game)
dataset_scores2<-dplyr::select(dataset_scores, Date, Round, Away.Team, Away.Points,Game)

colnames(dataset_scores1)[3]<-"Team"
colnames(dataset_scores1)[4]<-"Points"
colnames(dataset_scores2)[3]<-"Team"
colnames(dataset_scores2)[4]<-"Points"


df5<-rbind(dataset_scores1,dataset_scores2)
dataset_margins<-df5%>%group_by(Game)%>%
  arrange(Game)%>%
  mutate(margin=c(-diff(Points),diff(Points)))
dataset_margins$Date<-as.Date(dataset_margins$Date)
complete_df_out$Date<-as.Date(complete_df_out$Date)

dataset_margins<-dataset_margins %>%mutate(Team = str_replace(Team, "Brisbane Lions", "Brisbane"))

dataset_margins<-dataset_margins %>%mutate(Team = str_replace(Team, "Footscray", "Western Bulldogs"))


complete_df_out<-left_join(complete_df_out,dataset_margins,by=c("Date"="Date",  "Team"="Team"))

names(complete_df_out)
##  [1] "Date"               "Player"             "Season"            
##  [4] "Round.x"            "Team"               "CP"                
##  [7] "UP"                 "ED"                 "DE"                
## [10] "CM"                 "GA"                 "MI5"               
## [13] "One.Percenters"     "BO"                 "CCL"               
## [16] "SCL"                "SI"                 "MG"                
## [19] "TO"                 "ITC"                "T5"                
## [22] "TOG"                "K"                  "HB"                
## [25] "D"                  "M"                  "G"                 
## [28] "B"                  "T"                  "HO"                
## [31] "GA1"                "I50"                "CL"                
## [34] "CG"                 "R50"                "FF"                
## [37] "FA"                 "AF"                 "SC"                
## [40] "Round.y"            "CP_sum"             "UP_sum"            
## [43] "ED_sum"             "DE_sum"             "CM_sum"            
## [46] "GA_sum"             "MI5_sum"            "One.Percenters_sum"
## [49] "BO_sum"             "CCL_sum"            "SCL_sum"           
## [52] "SI_sum"             "MG_sum"             "TO_sum"            
## [55] "ITC_sum"            "T5_sum"             "TOG_sum"           
## [58] "K_sum"              "HB_sum"             "D_sum"             
## [61] "M_sum"              "G_sum"              "B_sum"             
## [64] "T_sum"              "HO_sum"             "GA1_sum"           
## [67] "I50_sum"            "CL_sum"             "CG_sum"            
## [70] "R50_sum"            "FF_sum"             "FA_sum"            
## [73] "AF_sum"             "SC_sum"             "Round"             
## [76] "Points"             "Game"               "margin"
####create the new ratios
complete_df_ratio_out<-complete_df_out%>%
  mutate(kick.ratio=K/K_sum,
         Marks.ratio=M/M_sum,
         handball.ratio=HB/HB_sum,
         Goals.ratio=G/G_sum,
         behinds.ratio=B/B_sum,
         hitouts.ratio=HO/HO_sum,
         tackles.ratio=T/T_sum,
         rebounds.ratio=R50/R50_sum,
         inside50s.ratio=I50/I50_sum,
         clearances.ratio=(CCL+SCL)/(CCL_sum+SCL_sum),
         clangers.ratio=CL/CL_sum,
         freefors.ratio=FF/FF_sum,
         freesagainst.ratio=FA/FA_sum,
         Contested.Possessions.ratio=CP/CP_sum,
         Uncontested.Possessions.ratio=UP/UP_sum,
         contested.marks.ratio=CM/CM_sum,
         marksinside50.ratio=MI5/MI5_sum,
         one.percenters.ratio=One.Percenters/One.Percenters_sum,
         bounces.ratio=BO/BO_sum,
         goal.assists.ratio=GA/GA_sum,
         disposals.ratio=D/D_sum)




conforming<-complete_df_ratio_out%>%
  dplyr::select(Player, Date, Season, Round.x, Team, margin, 
                kick.ratio:disposals.ratio)

conforming$Brownlow.Votes<-0
out.sample=conforming
out.sample[is.na(out.sample)] <- 0

newdata   <- out.sample[ , -ncol(out.sample)]



pre.dict    <- predict(fm2,newdata=newdata, type='prob')
pre.dict.m  <- data.frame(matrix(unlist(pre.dict), nrow= nrow(newdata)))
colnames(pre.dict.m) <- c("vote.0", "vote.1", "vote.2", "vote.3")

newdata.pred  <- cbind.data.frame(newdata, pre.dict.m)



#### Step 1: Get expected value on Votes
newdata.pred$expected.votes <- newdata.pred$vote.1 + 2*newdata.pred$vote.2 + 3*newdata.pred$vote.3

####Join back on matchID whoops!

get_match_ID<-fitzRoy::player_stats

xx<-get_match_ID%>%dplyr::select(Date, Player, Match_id)
newdata.pred<-left_join(newdata.pred, xx, by=c("Date"="Date",  "Player"="Player"))



# newdata.pred<-filter(newdata.pred, Date<"2018-09-01")


sum1 <- aggregate(vote.1~Match_id, data = newdata.pred, FUN = sum ); names(sum1) <- c("Match_id", "sum.vote.1");
sum2 <- aggregate(vote.2~Match_id, data = newdata.pred, FUN = sum ); names(sum2) <- c("Match_id", "sum.vote.2");
sum3 <- aggregate(vote.3~Match_id, data = newdata.pred, FUN = sum ); names(sum3) <- c("Match_id", "sum.vote.3");

#### Step 3: Add sum of each vote by matchId to big table
newdata.pred <- merge(newdata.pred, sum1, by = "Match_id")
newdata.pred <- merge(newdata.pred, sum2, by = "Match_id")
newdata.pred <- merge(newdata.pred, sum3, by = "Match_id")

#### Step 4: Add std1/2/3
newdata.pred$std.1  <- (newdata.pred$sum.vote.1/newdata.pred$vote.1)^-1
newdata.pred$std.2  <- (newdata.pred$sum.vote.2/newdata.pred$vote.2)^-1
newdata.pred$std.3  <- (newdata.pred$sum.vote.3/newdata.pred$vote.3)^-1


#### Step 5: Expected standard game vote
newdata.pred$exp_std_game_vote <- newdata.pred$std.1 + 2*newdata.pred$std.2 + 3*newdata.pred$std.3  


#### Step 6: List of winners

newdata.pred$PlayerName<-paste(newdata.pred$Player," ",newdata.pred$Team)
winners.stdgame   <- aggregate(exp_std_game_vote~PlayerName, data = newdata.pred, FUN = sum );
winners.stdgame   <- winners.stdgame[order(-winners.stdgame$exp_std_game_vote), ]
winners.stdgame[1:10, ]
##                            PlayerName exp_std_game_vote
## 623           Tom Mitchell   Hawthorn          30.71782
## 179          Dustin Martin   Richmond          19.29637
## 495     Patrick Dangerfield   Geelong          18.33302
## 30           Andrew Gaff   West Coast          16.92758
## 169            Dayne Beams   Brisbane          16.86057
## 119        Clayton Oliver   Melbourne          15.68180
## 193           Elliot Yeo   West Coast          15.65408
## 467           Nathan Fyfe   Fremantle          15.36249
## 254 Jackson Macrae   Western Bulldogs          15.15357
## 382           Lance Franklin   Sydney          14.14637

There you go, using the same method that gave a better predicted order vs Champion Data last year is now yours free and fully reproducible. (Find me another blog that does that for you!)

Betting tips

To evaluate the model or a model first lets build a quick one using the above code for 2017 and evaluate it according to the rough metrics that the Brownlow Professor uses.

library(fitzRoy)
library(tidyverse)
df<-fitzRoy::get_afltables_stats(start_date = "1897-01-01", end_date = Sys.Date())
## Returning data from 1897-01-01 to 2018-09-20
## Downloading data
## 
## Finished downloading data. Processing XMLs
## Finished getting afltables data
df<-df%>%filter(Season>2010)
names(df)
##  [1] "Season"                  "Round"                  
##  [3] "Date"                    "Local.start.time"       
##  [5] "Venue"                   "Attendance"             
##  [7] "Home.team"               "HQ1G"                   
##  [9] "HQ1B"                    "HQ2G"                   
## [11] "HQ2B"                    "HQ3G"                   
## [13] "HQ3B"                    "HQ4G"                   
## [15] "HQ4B"                    "Home.score"             
## [17] "Away.team"               "AQ1G"                   
## [19] "AQ1B"                    "AQ2G"                   
## [21] "AQ2B"                    "AQ3G"                   
## [23] "AQ3B"                    "AQ4G"                   
## [25] "AQ4B"                    "Away.score"             
## [27] "First.name"              "Surname"                
## [29] "ID"                      "Jumper.No."             
## [31] "Playing.for"             "Kicks"                  
## [33] "Marks"                   "Handballs"              
## [35] "Goals"                   "Behinds"                
## [37] "Hit.Outs"                "Tackles"                
## [39] "Rebounds"                "Inside.50s"             
## [41] "Clearances"              "Clangers"               
## [43] "Frees.For"               "Frees.Against"          
## [45] "Brownlow.Votes"          "Contested.Possessions"  
## [47] "Uncontested.Possessions" "Contested.Marks"        
## [49] "Marks.Inside.50"         "One.Percenters"         
## [51] "Bounces"                 "Goal.Assists"           
## [53] "Time.on.Ground.."        "Substitute"             
## [55] "Umpire.1"                "Umpire.2"               
## [57] "Umpire.3"                "Umpire.4"               
## [59] "group_id"
team_stats<-df%>%
  dplyr::select(Date,Home.team, Away.team, First.name,Surname,Season, Round, Playing.for, Kicks:Goal.Assists)%>%
  group_by(Date, Season, Round, Playing.for)%>%
  summarise_if(is.numeric,funs(sum=c(sum(.))))

player_stats<-df%>%
  dplyr::select(Date, First.name,Surname,Season, Round, Playing.for, Kicks:Goal.Assists)
## Adding missing grouping variables: `Home.team`, `Away.team`
complete_df<-left_join(player_stats,team_stats, by=c("Date"="Date", "Season"="Season",  "Playing.for"="Playing.for"))

#but we also need margins as per honours stuff

dataset_scores<-fitzRoy::match_results
names(dataset_scores)
##  [1] "Game"         "Date"         "Round"        "Home.Team"   
##  [5] "Home.Goals"   "Home.Behinds" "Home.Points"  "Away.Team"   
##  [9] "Away.Goals"   "Away.Behinds" "Away.Points"  "Venue"       
## [13] "Margin"       "Season"       "Round.Type"   "Round.Number"
dataset_scores1<-dataset_scores%>%dplyr::select (Date, Round, Home.Team, Home.Points,Game)
dataset_scores2<-dplyr::select(dataset_scores, Date, Round, Away.Team, Away.Points,Game)

colnames(dataset_scores1)[3]<-"Team"
colnames(dataset_scores1)[4]<-"Points"
colnames(dataset_scores2)[3]<-"Team"
colnames(dataset_scores2)[4]<-"Points"

df5<-rbind(dataset_scores1,dataset_scores2)
dataset_margins<-df5%>%group_by(Game)%>%
  arrange(Game)%>%
  mutate(margin=c(-diff(Points),diff(Points)))
# View(dataset_margins)
dataset_margins$Date<-as.Date(dataset_margins$Date)
complete_df$Date<-as.Date(complete_df$Date)

complete_df<-left_join(complete_df,dataset_margins,by=c("Date"="Date",  "Playing.for"="Team"))


complete_df_ratio<-complete_df%>%
  mutate(kick.ratio=Kicks/Kicks_sum,
         Marks.ratio=Marks/Marks_sum,
         handball.ratio=Handballs/Handballs_sum,
         Goals.ratio=Goals/Goals_sum,
         behinds.ratio=Behinds/Behinds_sum,
         hitouts.ratio=Hit.Outs/Hit.Outs_sum,
         tackles.ratio=Tackles/Tackles_sum,
         rebounds.ratio=Rebounds/Rebounds_sum,
         inside50s.ratio=Inside.50s/Inside.50s_sum,
         clearances.ratio=Clearances/Clearances_sum,
         clangers.ratio=Clangers/Clangers_sum,
         freefors.ratio=Frees.For/Frees.For_sum,
         freesagainst.ratio=Frees.Against/Frees.Against_sum,
         Contested.Possessions.ratio=Contested.Possessions/Contested.Possessions_sum,
         Uncontested.Possessions.ratio=Uncontested.Possessions/Uncontested.Possessions_sum,
         contested.marks.ratio=Contested.Marks/Contested.Marks_sum,
         marksinside50.ratio=Marks.Inside.50/Marks.Inside.50_sum,
         one.percenters.ratio=One.Percenters/One.Percenters_sum,
         bounces.ratio=Bounces/Bounces_sum,
         goal.assists.ratio=Goal.Assists/Goal.Assists_sum,
         disposals.ratio=(Kicks+Handballs)/(Kicks_sum+Handballs_sum))
df<-complete_df_ratio%>%dplyr::select(Date, First.name, Surname, Season, Round.x, Playing.for,-Brownlow.Votes, Brownlow.Votes_sum,everything())
df<-df%>%dplyr::select(-Brownlow.Votes,everything())
df[is.na(df)] <- 0
in.sample  <- subset(df, Season %in% c(2013:2016))
# out.sample <- subset(df, Season == 2014)
in.sample$Brownlow.Votes <- factor(in.sample$Brownlow.Votes)

in.sample<-in.sample%>%filter(Round.x %in% c("1","2","3","4","5","6","7","8",
                                             "9","10","11","12","13","14","15","16","17","18","19","20","21","22","23","24"))



in.sample$Player<-paste(in.sample$First.name,in.sample$Surname)

in.sample<-in.sample%>%dplyr::select(Player, Date, Season, Round.x, Playing.for, Home.team, Away.team, margin:Brownlow.Votes)


temp1<-scale(in.sample[,8:29])
in.sample[,8:29]<-temp1
#attributes(temp1)
temp1.center<-attr(temp1,"scaled:center")
temp1.scale<-attr(temp1,"scaled:scale")





library(ordinal)

fm1<-clm(Brownlow.Votes~ handball.ratio +  Marks.ratio +  
           disposals.ratio+  hitouts.ratio+
           freefors.ratio +  freesagainst.ratio +  tackles.ratio +  Goals.ratio +   behinds.ratio + Contested.Possessions.ratio+
           Uncontested.Possessions.ratio +  clangers.ratio +    contested.marks.ratio + marksinside50.ratio +
           clearances.ratio +   rebounds.ratio +    inside50s.ratio +   one.percenters.ratio +  bounces.ratio+
           goal.assists.ratio  +margin, 
         data = in.sample)

library(MASS)

fm2<- stepAIC(fm1, direction='backward',type=AIC)
## Start:  AIC=13631.84
## Brownlow.Votes ~ handball.ratio + Marks.ratio + disposals.ratio + 
##     hitouts.ratio + freefors.ratio + freesagainst.ratio + tackles.ratio + 
##     Goals.ratio + behinds.ratio + Contested.Possessions.ratio + 
##     Uncontested.Possessions.ratio + clangers.ratio + contested.marks.ratio + 
##     marksinside50.ratio + clearances.ratio + rebounds.ratio + 
##     inside50s.ratio + one.percenters.ratio + bounces.ratio + 
##     goal.assists.ratio + margin
## 
##                                 Df   AIC
## - Uncontested.Possessions.ratio  1 13630
## - rebounds.ratio                 1 13630
## - behinds.ratio                  1 13631
## - goal.assists.ratio             1 13631
## - inside50s.ratio                1 13631
##                              13632
## - freesagainst.ratio             1 13638
## - Contested.Possessions.ratio    1 13639
## - clearances.ratio               1 13639
## - freefors.ratio                 1 13641
## - bounces.ratio                  1 13647
## - clangers.ratio                 1 13650
## - marksinside50.ratio            1 13650
## - Marks.ratio                    1 13650
## - handball.ratio                 1 13653
## - contested.marks.ratio          1 13656
## - one.percenters.ratio           1 13669
## - tackles.ratio                  1 13690
## - hitouts.ratio                  1 13790
## - disposals.ratio                1 13863
## - Goals.ratio                    1 14583
## - margin                         1 15649
## 
## Step:  AIC=13630.1
## Brownlow.Votes ~ handball.ratio + Marks.ratio + disposals.ratio + 
##     hitouts.ratio + freefors.ratio + freesagainst.ratio + tackles.ratio + 
##     Goals.ratio + behinds.ratio + Contested.Possessions.ratio + 
##     clangers.ratio + contested.marks.ratio + marksinside50.ratio + 
##     clearances.ratio + rebounds.ratio + inside50s.ratio + one.percenters.ratio + 
##     bounces.ratio + goal.assists.ratio + margin
## 
##                               Df   AIC
## - rebounds.ratio               1 13628
## - behinds.ratio                1 13629
## - goal.assists.ratio           1 13629
## - inside50s.ratio              1 13629
##                            13630
## - freesagainst.ratio           1 13636
## - clearances.ratio             1 13638
## - freefors.ratio               1 13639
## - bounces.ratio                1 13645
## - clangers.ratio               1 13648
## - Marks.ratio                  1 13648
## - marksinside50.ratio          1 13648
## - handball.ratio               1 13652
## - contested.marks.ratio        1 13655
## - Contested.Possessions.ratio  1 13661
## - one.percenters.ratio         1 13667
## - tackles.ratio                1 13688
## - hitouts.ratio                1 13789
## - Goals.ratio                  1 14581
## - disposals.ratio              1 14601
## - margin                       1 15647
## 
## Step:  AIC=13628.54
## Brownlow.Votes ~ handball.ratio + Marks.ratio + disposals.ratio + 
##     hitouts.ratio + freefors.ratio + freesagainst.ratio + tackles.ratio + 
##     Goals.ratio + behinds.ratio + Contested.Possessions.ratio + 
##     clangers.ratio + contested.marks.ratio + marksinside50.ratio + 
##     clearances.ratio + inside50s.ratio + one.percenters.ratio + 
##     bounces.ratio + goal.assists.ratio + margin
## 
##                               Df   AIC
## - behinds.ratio                1 13627
## - goal.assists.ratio           1 13627
## - inside50s.ratio              1 13628
##                            13628
## - freesagainst.ratio           1 13635
## - clearances.ratio             1 13636
## - freefors.ratio               1 13638
## - bounces.ratio                1 13644
## - marksinside50.ratio          1 13646
## - clangers.ratio               1 13646
## - Marks.ratio                  1 13647
## - contested.marks.ratio        1 13654
## - handball.ratio               1 13655
## - Contested.Possessions.ratio  1 13660
## - one.percenters.ratio         1 13668
## - tackles.ratio                1 13686
## - hitouts.ratio                1 13787
## - Goals.ratio                  1 14591
## - disposals.ratio              1 14799
## - margin                       1 15645
## 
## Step:  AIC=13627.36
## Brownlow.Votes ~ handball.ratio + Marks.ratio + disposals.ratio + 
##     hitouts.ratio + freefors.ratio + freesagainst.ratio + tackles.ratio + 
##     Goals.ratio + Contested.Possessions.ratio + clangers.ratio + 
##     contested.marks.ratio + marksinside50.ratio + clearances.ratio + 
##     inside50s.ratio + one.percenters.ratio + bounces.ratio + 
##     goal.assists.ratio + margin
## 
##                               Df   AIC
## - goal.assists.ratio           1 13626
## - inside50s.ratio              1 13627
##                            13627
## - freesagainst.ratio           1 13634
## - clearances.ratio             1 13634
## - freefors.ratio               1 13637
## - bounces.ratio                1 13643
## - Marks.ratio                  1 13646
## - clangers.ratio               1 13646
## - marksinside50.ratio          1 13651
## - contested.marks.ratio        1 13652
## - handball.ratio               1 13656
## - Contested.Possessions.ratio  1 13660
## - one.percenters.ratio         1 13666
## - tackles.ratio                1 13685
## - hitouts.ratio                1 13786
## - Goals.ratio                  1 14591
## - disposals.ratio              1 14806
## - margin                       1 15646
## 
## Step:  AIC=13626.24
## Brownlow.Votes ~ handball.ratio + Marks.ratio + disposals.ratio + 
##     hitouts.ratio + freefors.ratio + freesagainst.ratio + tackles.ratio + 
##     Goals.ratio + Contested.Possessions.ratio + clangers.ratio + 
##     contested.marks.ratio + marksinside50.ratio + clearances.ratio + 
##     inside50s.ratio + one.percenters.ratio + bounces.ratio + 
##     margin
## 
##                               Df   AIC
## - inside50s.ratio              1 13626
##                            13626
## - clearances.ratio             1 13633
## - freesagainst.ratio           1 13633
## - freefors.ratio               1 13636
## - bounces.ratio                1 13642
## - Marks.ratio                  1 13644
## - clangers.ratio               1 13645
## - marksinside50.ratio          1 13651
## - contested.marks.ratio        1 13651
## - handball.ratio               1 13654
## - Contested.Possessions.ratio  1 13659
## - one.percenters.ratio         1 13664
## - tackles.ratio                1 13684
## - hitouts.ratio                1 13784
## - Goals.ratio                  1 14590
## - disposals.ratio              1 14804
## - margin                       1 15645
## 
## Step:  AIC=13625.93
## Brownlow.Votes ~ handball.ratio + Marks.ratio + disposals.ratio + 
##     hitouts.ratio + freefors.ratio + freesagainst.ratio + tackles.ratio + 
##     Goals.ratio + Contested.Possessions.ratio + clangers.ratio + 
##     contested.marks.ratio + marksinside50.ratio + clearances.ratio + 
##     one.percenters.ratio + bounces.ratio + margin
## 
##                               Df   AIC
##                            13626
## - freesagainst.ratio           1 13633
## - clearances.ratio             1 13634
## - freefors.ratio               1 13635
## - bounces.ratio                1 13642
## - Marks.ratio                  1 13643
## - clangers.ratio               1 13644
## - marksinside50.ratio          1 13650
## - contested.marks.ratio        1 13650
## - Contested.Possessions.ratio  1 13658
## - handball.ratio               1 13662
## - one.percenters.ratio         1 13663
## - tackles.ratio                1 13683
## - hitouts.ratio                1 13784
## - Goals.ratio                  1 14600
## - disposals.ratio              1 15042
## - margin                       1 15645
out.sample  <- subset(df, Season %in% c(2017))
out.sample$Brownlow.Votes <- factor(out.sample$Brownlow.Votes)

out.sample<-out.sample%>%filter(Round.x %in% c("1","2","3","4","5","6","7","8",
                                             "9","10","11","12","13","14","15","16","17","18","19","20","21","22","23","24"))


names(out.sample)
##  [1] "Date"                          "First.name"                   
##  [3] "Surname"                       "Season"                       
##  [5] "Round.x"                       "Playing.for"                  
##  [7] "Brownlow.Votes_sum"            "Home.team"                    
##  [9] "Away.team"                     "Kicks"                        
## [11] "Marks"                         "Handballs"                    
## [13] "Goals"                         "Behinds"                      
## [15] "Hit.Outs"                      "Tackles"                      
## [17] "Rebounds"                      "Inside.50s"                   
## [19] "Clearances"                    "Clangers"                     
## [21] "Frees.For"                     "Frees.Against"                
## [23] "Contested.Possessions"         "Uncontested.Possessions"      
## [25] "Contested.Marks"               "Marks.Inside.50"              
## [27] "One.Percenters"                "Bounces"                      
## [29] "Goal.Assists"                  "Round.y"                      
## [31] "Kicks_sum"                     "Marks_sum"                    
## [33] "Handballs_sum"                 "Goals_sum"                    
## [35] "Behinds_sum"                   "Hit.Outs_sum"                 
## [37] "Tackles_sum"                   "Rebounds_sum"                 
## [39] "Inside.50s_sum"                "Clearances_sum"               
## [41] "Clangers_sum"                  "Frees.For_sum"                
## [43] "Frees.Against_sum"             "Contested.Possessions_sum"    
## [45] "Uncontested.Possessions_sum"   "Contested.Marks_sum"          
## [47] "Marks.Inside.50_sum"           "One.Percenters_sum"           
## [49] "Bounces_sum"                   "Goal.Assists_sum"             
## [51] "Round"                         "Points"                       
## [53] "Game"                          "margin"                       
## [55] "kick.ratio"                    "Marks.ratio"                  
## [57] "handball.ratio"                "Goals.ratio"                  
## [59] "behinds.ratio"                 "hitouts.ratio"                
## [61] "tackles.ratio"                 "rebounds.ratio"               
## [63] "inside50s.ratio"               "clearances.ratio"             
## [65] "clangers.ratio"                "freefors.ratio"               
## [67] "freesagainst.ratio"            "Contested.Possessions.ratio"  
## [69] "Uncontested.Possessions.ratio" "contested.marks.ratio"        
## [71] "marksinside50.ratio"           "one.percenters.ratio"         
## [73] "bounces.ratio"                 "goal.assists.ratio"           
## [75] "disposals.ratio"               "Brownlow.Votes"
out.sample$Player<-paste(out.sample$First.name,out.sample$Surname)

out.sample<-out.sample%>%dplyr::select(Player, Date, Season, Round.x, Playing.for, Home.team, Away.team, margin:Brownlow.Votes)


out.sample[is.na(out.sample)] <- 0

newdata   <- out.sample[ , -ncol(out.sample)]

newdata[,8:29]<-scale(newdata[,8:29],center=temp1.center,scale=temp1.scale) 

pre.dict    <- predict(fm2,newdata=newdata, type='prob')
pre.dict.m  <- data.frame(matrix(unlist(pre.dict), nrow= nrow(newdata)))
colnames(pre.dict.m) <- c("vote.0", "vote.1", "vote.2", "vote.3")

newdata.pred  <- cbind.data.frame(newdata, pre.dict.m)
newdata.pred$expected.votes <- newdata.pred$vote.1 + 2*newdata.pred$vote.2 + 3*newdata.pred$vote.3
newdata.pred$Match_id<-paste(newdata.pred$Season, newdata.pred$Round.x, newdata.pred$Home.team)
sum1 <- aggregate(vote.1~Match_id, data = newdata.pred, FUN = sum ); names(sum1) <- c("Match_id", "sum.vote.1");
sum2 <- aggregate(vote.2~Match_id, data = newdata.pred, FUN = sum ); names(sum2) <- c("Match_id", "sum.vote.2");
sum3 <- aggregate(vote.3~Match_id, data = newdata.pred, FUN = sum ); names(sum3) <- c("Match_id", "sum.vote.3");

#### Step 3: Add sum of each vote by matchId to big table
newdata.pred <- merge(newdata.pred, sum1, by = "Match_id")
newdata.pred <- merge(newdata.pred, sum2, by = "Match_id")
newdata.pred <- merge(newdata.pred, sum3, by = "Match_id")

#### Step 4: Add std1/2/3
newdata.pred$std.1  <- (newdata.pred$sum.vote.1/newdata.pred$vote.1)^-1
newdata.pred$std.2  <- (newdata.pred$sum.vote.2/newdata.pred$vote.2)^-1
newdata.pred$std.3  <- (newdata.pred$sum.vote.3/newdata.pred$vote.3)^-1


#### Step 5: Expected standard game vote
newdata.pred$exp_std_game_vote <- newdata.pred$std.1 + 2*newdata.pred$std.2 + 3*newdata.pred$std.3  


#### Step 6: List of winners

newdata.pred$PlayerName<-paste(newdata.pred$Player," ",newdata.pred$Playing.for)
winners.stdgame   <- aggregate(exp_std_game_vote~PlayerName, data = newdata.pred, FUN = sum );
winners.stdgame   <- winners.stdgame[order(-winners.stdgame$exp_std_game_vote), ]
winners.stdgame[1:10, ]
##                              PlayerName exp_std_game_vote
## 168            Dustin Martin   Richmond          33.78545
## 490       Patrick Dangerfield   Geelong          31.31288
## 617             Tom Mitchell   Hawthorn          28.73881
## 157        Dayne Zorko   Brisbane Lions          18.52044
## 513              Rory Sloane   Adelaide          16.85977
## 340 Josh Kelly   Greater Western Sydney          16.70946
## 653             Zach Merrett   Essendon          16.42835
## 424              Matt Crouch   Adelaide          16.00151
## 341               Josh Kennedy   Sydney          15.23837
## 483         Ollie Wines   Port Adelaide          15.08383

What you might notice about this is that the order is slightly different to what is presented earlier. But what is the different? The out.sample uses the afltables data only no footywire data is used at all. In other words the above script just showed that there are differences between the data on AFLtables and footywire!!!!!

Recall earlier, one of the ways that Brownlow Professor makes an assessment of his model is he looks at home many times that the player ranked within the top 5 get the 3 votes 95% of the time, 84% of the time for the 2 votes and 74% of the time for the one vote.

Well using the above method, we could check it for the 2017 AFL season.

To do this we need to join on the actual votes that happened in 2018 to the predicted probabilities based on our model

out.sample  <- subset(df, Season %in% c(2017))
out.sample$Brownlow.Votes <- factor(out.sample$Brownlow.Votes)

out.sample<-out.sample%>%filter(Round.x %in% c("1","2","3","4","5","6","7","8",
                                             "9","10","11","12","13","14","15","16","17","18","19","20","21","22","23","24"))


names(out.sample)
##  [1] "Date"                          "First.name"                   
##  [3] "Surname"                       "Season"                       
##  [5] "Round.x"                       "Playing.for"                  
##  [7] "Brownlow.Votes_sum"            "Home.team"                    
##  [9] "Away.team"                     "Kicks"                        
## [11] "Marks"                         "Handballs"                    
## [13] "Goals"                         "Behinds"                      
## [15] "Hit.Outs"                      "Tackles"                      
## [17] "Rebounds"                      "Inside.50s"                   
## [19] "Clearances"                    "Clangers"                     
## [21] "Frees.For"                     "Frees.Against"                
## [23] "Contested.Possessions"         "Uncontested.Possessions"      
## [25] "Contested.Marks"               "Marks.Inside.50"              
## [27] "One.Percenters"                "Bounces"                      
## [29] "Goal.Assists"                  "Round.y"                      
## [31] "Kicks_sum"                     "Marks_sum"                    
## [33] "Handballs_sum"                 "Goals_sum"                    
## [35] "Behinds_sum"                   "Hit.Outs_sum"                 
## [37] "Tackles_sum"                   "Rebounds_sum"                 
## [39] "Inside.50s_sum"                "Clearances_sum"               
## [41] "Clangers_sum"                  "Frees.For_sum"                
## [43] "Frees.Against_sum"             "Contested.Possessions_sum"    
## [45] "Uncontested.Possessions_sum"   "Contested.Marks_sum"          
## [47] "Marks.Inside.50_sum"           "One.Percenters_sum"           
## [49] "Bounces_sum"                   "Goal.Assists_sum"             
## [51] "Round"                         "Points"                       
## [53] "Game"                          "margin"                       
## [55] "kick.ratio"                    "Marks.ratio"                  
## [57] "handball.ratio"                "Goals.ratio"                  
## [59] "behinds.ratio"                 "hitouts.ratio"                
## [61] "tackles.ratio"                 "rebounds.ratio"               
## [63] "inside50s.ratio"               "clearances.ratio"             
## [65] "clangers.ratio"                "freefors.ratio"               
## [67] "freesagainst.ratio"            "Contested.Possessions.ratio"  
## [69] "Uncontested.Possessions.ratio" "contested.marks.ratio"        
## [71] "marksinside50.ratio"           "one.percenters.ratio"         
## [73] "bounces.ratio"                 "goal.assists.ratio"           
## [75] "disposals.ratio"               "Brownlow.Votes"
out.sample$Player<-paste(out.sample$First.name,out.sample$Surname)

out.sample<-out.sample%>%dplyr::select(Player, Date, Season, Round.x, Playing.for, Home.team, Away.team, margin:Brownlow.Votes)

fulldataframe<-left_join(newdata.pred,out.sample,by=c("Date"="Date",  "Playing.for"="Playing.for", "Player"="Player"))

From there we might want to create a pivot table where we can see how often the highest ranked person by say expected standardised votes gets the 3 votes and so on.

To do that we can use this bit of script below, it requires a handy package called janitor

xx<-fulldataframe  %>%
  group_by(Match_id) %>%
  mutate(my_ranks = order(order(exp_std_game_vote, decreasing=TRUE)))
table(xx$my_ranks, xx$Brownlow.Votes)
##     
##        0   1   2   3
##   1   32  22  35 109
##   2   63  33  59  43
##   3  112  31  33  22
##   4  133  36  24   5
##   5  170  17   8   3
##   6  171  11   9   7
##   7  183   5   6   4
##   8  181  11   5   1
##   9  185   8   4   1
##   10 190   6   2   0
##   11 193   2   2   1
##   12 195   2   1   0
##   13 194   3   1   0
##   14 193   2   2   1
##   15 196   0   2   0
##   16 196   1   1   0
##   17 196   1   1   0
##   18 197   0   1   0
##   19 196   1   0   1
##   20 196   1   1   0
##   21 195   2   1   0
##   22 198   0   0   0
##   23 198   0   0   0
##   24 197   1   0   0
##   25 197   1   0   0
##   26 197   1   0   0
##   27 198   0   0   0
##   28 198   0   0   0
##   29 198   0   0   0
##   30 198   0   0   0
##   31 198   0   0   0
##   32 198   0   0   0
##   33 198   0   0   0
##   34 198   0   0   0
##   35 198   0   0   0
##   36 198   0   0   0
##   37 198   0   0   0
##   38 198   0   0   0
##   39 198   0   0   0
##   40 198   0   0   0
##   41 198   0   0   0
##   42 198   0   0   0
##   43 198   0   0   0
##   44 198   0   0   0
xx %>% janitor::tabyl(my_ranks, Brownlow.Votes)%>%
  adorn_totals(c("row", "col")) %>%
  adorn_percentages("col") %>% 
  adorn_pct_formatting(rounding = "half up", digits = 2) 
##  my_ranks       0       1       2       3   Total
##         1   0.39%  11.11%  17.68%  55.05%   2.27%
##         2   0.78%  16.67%  29.80%  21.72%   2.27%
##         3   1.38%  15.66%  16.67%  11.11%   2.27%
##         4   1.64%  18.18%  12.12%   2.53%   2.27%
##         5   2.09%   8.59%   4.04%   1.52%   2.27%
##         6   2.11%   5.56%   4.55%   3.54%   2.27%
##         7   2.25%   2.53%   3.03%   2.02%   2.27%
##         8   2.23%   5.56%   2.53%   0.51%   2.27%
##         9   2.28%   4.04%   2.02%   0.51%   2.27%
##        10   2.34%   3.03%   1.01%   0.00%   2.27%
##        11   2.38%   1.01%   1.01%   0.51%   2.27%
##        12   2.40%   1.01%   0.51%   0.00%   2.27%
##        13   2.39%   1.52%   0.51%   0.00%   2.27%
##        14   2.38%   1.01%   1.01%   0.51%   2.27%
##        15   2.41%   0.00%   1.01%   0.00%   2.27%
##        16   2.41%   0.51%   0.51%   0.00%   2.27%
##        17   2.41%   0.51%   0.51%   0.00%   2.27%
##        18   2.43%   0.00%   0.51%   0.00%   2.27%
##        19   2.41%   0.51%   0.00%   0.51%   2.27%
##        20   2.41%   0.51%   0.51%   0.00%   2.27%
##        21   2.40%   1.01%   0.51%   0.00%   2.27%
##        22   2.44%   0.00%   0.00%   0.00%   2.27%
##        23   2.44%   0.00%   0.00%   0.00%   2.27%
##        24   2.43%   0.51%   0.00%   0.00%   2.27%
##        25   2.43%   0.51%   0.00%   0.00%   2.27%
##        26   2.43%   0.51%   0.00%   0.00%   2.27%
##        27   2.44%   0.00%   0.00%   0.00%   2.27%
##        28   2.44%   0.00%   0.00%   0.00%   2.27%
##        29   2.44%   0.00%   0.00%   0.00%   2.27%
##        30   2.44%   0.00%   0.00%   0.00%   2.27%
##        31   2.44%   0.00%   0.00%   0.00%   2.27%
##        32   2.44%   0.00%   0.00%   0.00%   2.27%
##        33   2.44%   0.00%   0.00%   0.00%   2.27%
##        34   2.44%   0.00%   0.00%   0.00%   2.27%
##        35   2.44%   0.00%   0.00%   0.00%   2.27%
##        36   2.44%   0.00%   0.00%   0.00%   2.27%
##        37   2.44%   0.00%   0.00%   0.00%   2.27%
##        38   2.44%   0.00%   0.00%   0.00%   2.27%
##        39   2.44%   0.00%   0.00%   0.00%   2.27%
##        40   2.44%   0.00%   0.00%   0.00%   2.27%
##        41   2.44%   0.00%   0.00%   0.00%   2.27%
##        42   2.44%   0.00%   0.00%   0.00%   2.27%
##        43   2.44%   0.00%   0.00%   0.00%   2.27%
##        44   2.44%   0.00%   0.00%   0.00%   2.27%
##     Total 100.00% 100.00% 100.00% 100.00% 100.00%

So what we can tell from this is over half the time the player we rank first based on our model presented (which I hope you have improved by now) gets the 3 votes. only 32 times out of the 198 games did they fail to poll a single vote. That is not too bad but hopefully your one is better!

That seems OK but what if I want to see how often a player ranks in x?

That’s not too bad from here now that we have gotten most things, we can do that using a filter

xx%>%
  dplyr::select(Player, exp_std_game_vote, my_ranks, Brownlow.Votes)%>%
  dplyr::filter(Player=="Dustin Martin")%>%
  janitor::tabyl(my_ranks, Brownlow.Votes)
## Adding missing grouping variables: `Match_id`
##  my_ranks 0 1 2  3
##         1 1 2 0 10
##         2 2 1 0  1
##         3 1 0 0  0
##         5 1 0 0  0
##         9 1 0 0  0
##        15 1 0 0  0
##        21 1 0 0  0

What we can see is that out of the 13 times we thought Dustin Martin should poll the most votes he did 10 times, 2 times he polled 1 vote and 1 time he didn’t poll.

These are the kinds of things you might want to look at with your model. For instance, this year everyone is talking up Max Gawn chances. A nice check to do, might be how often do rucks get the 3 votes etc.

Lets show a quick example with Maxy.

df<-fitzRoy::get_afltables_stats(start_date = "1897-01-01", end_date = Sys.Date())
## Returning data from 1897-01-01 to 2018-09-20
## Downloading data
## 
## Finished downloading data. Processing XMLs
## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)
## Warning: 264 parsing failures.
## row # A tibble: 5 x 5 col     row col   expected   actual file                                       expected                                                   actual 1  8713 Round an integer QF     'https://afltables.com/afl/stats/2018_sta~ file 2  8714 Round an integer QF     'https://afltables.com/afl/stats/2018_sta~ row 3  8715 Round an integer QF     'https://afltables.com/afl/stats/2018_sta~ col 4  8716 Round an integer QF     'https://afltables.com/afl/stats/2018_sta~ expected 5  8717 Round an integer QF     'https://afltables.com/afl/stats/2018_sta~
## ... ................. ... .......................................................................... ........ .......................................................................... ...... .......................................................................... .... .......................................................................... ... .......................................................................... ... .......................................................................... ........ ..........................................................................
## See problems(...) for more details.
## Warning: Unknown columns: `Substitute`
## Finished getting afltables data
df%>%filter(Season>2010)%>%filter(First.name=="Max", Surname=="Gawn")%>%janitor::tabyl(Brownlow.Votes, Season)
##  Brownlow.Votes 2011 2013 2014 2015 2016 2017 2018
##               0    4   13    9   11   16   11   18
##               1    0    0    0    1    0    1    0
##               2    0    0    0    0    2    1    0
##               3    0    0    0    1    4    0    0
##              NA    0    0    0    0    0    0    6

Interestingly what we can see is that in Max Gawns other All Aus year (if you believe its a good thing) he polled the 3 votes 4 times. Was this year his better and if so should he poll more?

Lets compare his season in 2016 to his season in 2018

library(tidyverse)
library(fitzRoy)
df<-fitzRoy::player_stats
df<-df%>%
  filter(Season != 2018) #filters out the 2018 data (incomeplete that was downloaded when installing fitzRoy for first time) 
df1<-fitzRoy::get_footywire_stats(9514:9709) 
## Getting data from footywire.com
## Finished getting data
df2<-rbind(df, df1)
df3<-df2%>%
  filter(Season %in% c(2016,2018))%>%
  filter(Player =="Max Gawn") 

df3<-df2%>%
  filter(Season %in% c(2016,2018))%>%
  filter(Player =="Max Gawn") %>%
  group_by(Season)%>%
  summarise(ave.ho=mean(HO),
            ave.CM=mean(CM),
            ave.SC=mean(SC),
            ave.MG=mean(MG),
            ave.ITC=mean(ITC), 
            ave.AF=mean(AF),
            ave.SC=mean(SC),
            ave.Mi5=mean(MI5))

df4<-gather(df3,variables, values, -Season)   
df4%>%   
  ggplot(aes(x=values, y=variables)) +geom_point(aes(colour=as.factor(df4$Season)))+ geom_line(aes(group = variables))

This doesn’t look very clear because of the different ranges of values, i.e. supercoach scores vs Marks inside 50.

We should facet this to make it look clearer.

df4%>%   
    ggplot(aes(x=as.factor(df4$Season), y=values)) +
  geom_point()+
  facet_wrap(~variables,scales = "free")

So what we can see is that for this group of statistics apart from Marks inside 50, Gawn has done better, does this make him a Brownlow smokey?

YOU be the judge.

To leave a comment for the author, please follow the link and comment on their blog: Analysis of AFL.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)