
Something that gets to many a footy fan is the feeling that your team has won the game in most areas except on the scoreboard.

Thinking about this statement a little deeper has the following implication: there are some areas of the game that, if you win them, you tend to win the game.

The data we will use here are the game differentials of the various game statistics that are available using fitzRoy.

The general idea is fairly obvious: if you score more than the opposition, you win. Shocking, I know! What we are trying to do here is come up with a concept of "we won other things but not the scoreboard", where usually, when we win these things, we tend to win the game.

Thanks to fitzRoy, we have access to the data on footywire, which has a few extra variables that are not on afltables.

So imagine this scenario: you don't have access to the scores, only the in-game statistics as per these pages. What variables/differentials would you look at to decide who will win?

We know that, for example, looking only at free kick differential isn't very predictive. Is there a combination of differentials such that, when a team wins them, it is more likely to win the game? Let's build a model to find out.

The model I will use here, is a logistic regression model with the binary outcome being win/loss.

# Step One – Build the dataset

library(tidyverse)
## -- Attaching packages -------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1     v purrr   0.2.5
## v tibble  1.4.2     v dplyr   0.7.5
## v tidyr   0.8.1     v stringr 1.3.1
## v readr   1.1.1     v forcats 0.3.0
## -- Conflicts ----------------------------------- tidyverse_conflicts() --
fitzRoy::player_stats%>%
filter(Season>2014)%>%
select(-Player)%>%
group_by(Date,Season, Round, Team, Status, Opposition, Venue, Match_id)%>%
summarise_all(.funs = sum)%>%
group_by(Match_id)%>%
arrange(Match_id)%>%
mutate(diff_cp=c(-diff(CP), diff(CP)))
## # A tibble: 1,462 x 43
## # Groups:   Match_id [731]
##    Date       Season Round  Team    Status Opposition Venue Match_id    CP
##
##  1 2015-04-02   2015 Round~ Carlton Home   Richmond   MCG       5964   127
##  2 2015-04-02   2015 Round~ Richmo~ Away   Carlton    MCG       5964   121
##  3 2015-04-04   2015 Round~ Gold C~ Away   Melbourne  MCG       5965   136
##  4 2015-04-04   2015 Round~ Melbou~ Home   Gold Coast MCG       5965   150
##  5 2015-04-04   2015 Round~ Essend~ Away   Sydney     ANZ ~     5966   150
##  6 2015-04-04   2015 Round~ Sydney  Home   Essendon   ANZ ~     5966   176
##  7 2015-04-04   2015 Round~ Brisba~ Home   Collingwo~ Gabba     5967   152
##  8 2015-04-04   2015 Round~ Collin~ Away   Brisbane   Gabba     5967   165
##  9 2015-04-04   2015 Round~ West C~ Away   Western B~ Etih~     5968   137
## 10 2015-04-04   2015 Round~ Wester~ Home   West Coast Etih~     5968   156
## # ... with 1,452 more rows, and 34 more variables: UP, ED,
## #   DE, CM, GA, MI5, One.Percenters,
## #   BO, TOG, K, HB, D, M, G,
## #   B, T, HO, GA1, I50, CL, CG,
## #   R50, FF, FA, AF, SC, CCL,
## #   SCL, SI, MG, TO, ITC, T5,
## #   diff_cp

Something to remember with fitzRoy is that footywire doesn't have tackles inside 50, meters gained etc. for games before 2015, so this becomes our first filter.

The next thing we do is select all the columns except for Player. We don't need it, and it's also nice to see how to "deselect" columns as well as select them.

Remember, we want data at a game level for each team, and we want to be able to come up with the in-game differential. This is where our next group_by comes in handy, and it's related to the summarise_all.

You might be thinking, jeez mate, that's a lot to group by, can't you just use Match_id as they are already unique? Yes, that is true, I could have, but one thing about summarise_all is that it summarises every column that is not in the group_by. You can test this out by running the code below; it should spit out an error message: Error in summarise_impl(.data, dots) : Evaluation error: invalid 'type' (character) of argument.

fitzRoy::player_stats%>%
filter(Season>2014)%>%
select(-Player)%>%
group_by( Match_id)%>%
summarise_all(.funs = sum)
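The same failure can be reproduced without downloading anything. The sketch below uses a tiny made-up data frame (the values and the name `toy` are assumptions for illustration, not fitzRoy output) and base R rather than dplyr: summing a text column raises the "invalid 'type' (character)" complaint, while keeping the text columns as grouping keys leaves only numbers to add up.

```r
# Toy data, made up for illustration (not real fitzRoy output)
toy <- data.frame(
  Match_id = c(1, 1, 1, 1),
  Team     = c("Carlton", "Carlton", "Richmond", "Richmond"),
  Venue    = c("MCG", "MCG", "MCG", "MCG"),
  CP       = c(8, 11, 7, 9),
  stringsAsFactors = FALSE
)

# Summing a character column fails, which is what trips up summarise_all
msg <- tryCatch(sum(toy$Venue), error = function(e) conditionMessage(e))
print(msg)

# Keeping the text columns as grouping keys leaves only numbers to sum
team_totals <- aggregate(CP ~ Match_id + Team + Venue, data = toy, FUN = sum)
print(team_totals)
```

This is why the long group_by in the post is doing real work: every character column has to be either dropped or used as a key before summing.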

After the summarise_all, we group by Match_id so we can find the differentials within each match, for example the contested possession differential diff_cp. I like to arrange the dataset so I can do sanity checks against footywire.
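To see what `c(-diff(x), diff(x))` does inside each match, here is a minimal base R sketch. The contested possession totals are the first four rows printed earlier in the post; the data frame name `toy` and the use of `ave()` in place of the dplyr `group_by` and `mutate` are my own choices for illustration. The trick assumes exactly two rows per Match_id.

```r
# Team totals for two matches, taken from the first rows printed above
toy <- data.frame(
  Match_id = c(5964, 5964, 5965, 5965),
  Team     = c("Carlton", "Richmond", "Gold Coast", "Melbourne"),
  CP       = c(127, 121, 136, 150)
)

# diff(x) is (second row - first row), so the first team gets -diff(x)
# and the second team gets diff(x). Needs exactly two rows per match.
toy$diff_cp <- ave(toy$CP, toy$Match_id,
                   FUN = function(x) c(-diff(x), diff(x)))
print(toy)
```

Carlton won the contested possession count by 6, so it gets +6 and Richmond gets -6; the two values within a match always sum to zero.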

So the next thing you might be thinking, is jeez doing that for all variables finding their differentials seems a bit tedious.

That is where we can use mutate_if: we can mutate every column that is.numeric and is not in the group_by. This is also another way we could have done the original summarise_all; we could have used summarise_if instead, BUT Match_id is numeric!

You can check this by using

str(fitzRoy::player_stats)
## 'data.frame':    76296 obs. of  43 variables:
##  $ Date          : Date, format: "2010-03-25" "2010-03-25" ...
##  $ Season        : num  2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
##  $ Round         : chr  "Round 1" "Round 1" "Round 1" "Round 1" ...
##  $ Venue         : chr  "MCG" "MCG" "MCG" "MCG" ...
##  $ Player        : chr  "Daniel Connors" "Daniel Jackson" "Brett Deledio" "Ben Cousins" ...
##  $ Team          : chr  "Richmond" "Richmond" "Richmond" "Richmond" ...
##  $ Opposition    : chr  "Carlton" "Carlton" "Carlton" "Carlton" ...
##  $ Status        : chr  "Home" "Home" "Home" "Home" ...
##  $ Match_id      : num  5089 5089 5089 5089 5089 ...
##  $ CP            : int  8 11 7 9 8 6 7 7 6 7 ...
##  $ UP            : int  15 10 14 10 10 12 10 6 7 5 ...
##  $ ED            : int  16 14 16 11 13 16 13 7 10 7 ...
##  $ DE            : num  66.7 60.9 76.2 57.9 68.4 88.9 76.5 50 76.9 53.8 ...
##  $ CM            : int  0 1 0 0 1 0 0 0 1 0 ...
##  $ GA            : int  0 0 0 1 0 0 0 1 0 0 ...
##  $ MI5           : int  0 0 0 0 0 0 0 2 0 0 ...
##  $ One.Percenters: int  1 0 0 0 0 1 5 2 5 1 ...
##  $ BO            : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ TOG           : int  69 80 89 69 77 81 84 80 100 88 ...
##  $ K             : int  14 11 12 13 11 5 7 9 6 7 ...
##  $ HB            : int  10 12 9 6 8 13 10 5 7 6 ...
##  $ D             : int  24 23 21 19 19 18 17 14 13 13 ...
##  $ M             : int  3 2 5 1 6 4 2 3 4 2 ...
##  $ G             : int  0 0 1 1 0 0 0 1 0 0 ...
##  $ B             : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ T             : int  1 5 6 1 1 3 2 5 4 4 ...
##  $ HO            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ GA1           : int  0 0 0 1 0 0 0 1 0 0 ...
##  $ I50           : int  2 8 4 1 2 2 1 5 0 3 ...
##  $ CL            : int  2 5 3 2 3 3 4 4 1 1 ...
##  $ CG            : int  4 4 4 3 3 1 2 0 2 0 ...
##  $ R50           : int  6 1 3 4 2 0 2 0 3 1 ...
##  $ FF            : int  2 2 1 1 0 0 1 4 1 1 ...
##  $ FA            : int  0 0 2 0 2 1 0 0 0 0 ...
##  $ AF            : int  77 85 94 65 65 62 56 77 61 56 ...
##  $ SC            : int  85 89 93 70 63 72 79 73 68 59 ...
##  $ CCL           : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ SCL           : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ SI            : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ MG            : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ TO            : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ ITC           : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ T5            : int  NA NA NA NA NA NA NA NA NA NA ...

So hopefully you have seen a couple of shortcuts that come from knowing the dataset, and this is an example of why it's important to do your checks! Here is a great guide to bad data, which provides a nice series of steps for checking things that commonly go wrong.

fitzRoy::player_stats%>%
filter(Season>2014)%>%
select(-Player,-Date)%>%
group_by(Season, Round, Team, Status, Opposition, Venue, Match_id)%>%
summarise_all(.funs = sum)%>%
group_by(Match_id)%>%
arrange(Match_id)%>%
mutate_if(is.numeric, funs(difference=c(-diff(.), diff(.))))
## # A tibble: 1,462 x 76
## # Groups:   Match_id [731]
##    Season Round  Team   Status Opposition Venue Match_id    CP    UP    ED
##
##  1   2015 Round~ Carlt~ Home   Richmond   MCG       5964   127   199   233
##  2   2015 Round~ Richm~ Away   Carlton    MCG       5964   121   227   273
##  3   2015 Round~ Gold ~ Away   Melbourne  MCG       5965   136   165   203
##  4   2015 Round~ Melbo~ Home   Gold Coast MCG       5965   150   198   246
##  5   2015 Round~ Essen~ Away   Sydney     ANZ ~     5966   150   171   226
##  6   2015 Round~ Sydney Home   Essendon   ANZ ~     5966   176   192   248
##  7   2015 Round~ Brisb~ Home   Collingwo~ Gabba     5967   152   211   246
##  8   2015 Round~ Colli~ Away   Brisbane   Gabba     5967   165   179   231
##  9   2015 Round~ West ~ Away   Western B~ Etih~     5968   137   236   278
## 10   2015 Round~ Weste~ Home   West Coast Etih~     5968   156   214   275
## # ... with 1,452 more rows, and 66 more variables: DE, CM,
## #   GA, MI5, One.Percenters, BO, TOG,
## #   K, HB, D, M, G, B, T,
## #   HO, GA1, I50, CL, CG, R50,
## #   FF, FA, AF, SC, CCL, SCL,
## #   SI, MG, TO, ITC, T5,
## #   Season_difference, CP_difference, UP_difference,
## #   ED_difference, DE_difference, CM_difference,
## #   GA_difference, MI5_difference,
## #   One.Percenters_difference, BO_difference,
## #   TOG_difference, K_difference, HB_difference,
## #   D_difference, M_difference, G_difference,
## #   B_difference, T_difference, HO_difference,
## #   GA1_difference, I50_difference, CL_difference,
## #   CG_difference, R50_difference, FF_difference,
## #   FA_difference, AF_difference, SC_difference,
## #   CCL_difference, SCL_difference, SI_difference,
## #   MG_difference, TO_difference, ITC_difference,
## #   T5_difference

# Step Two – Joining Datasets

Now we could just sum up the goals and behinds kicked by each player, but this would miss out on rushed behinds.
So let's join our footywire dataset with the afltables match results.

# footywire team totals per game, with a differential for every numeric column
df<-fitzRoy::player_stats%>%
filter(Season>2014)%>%
select(-Player)%>%
group_by(Date,Season, Round, Team, Status, Opposition, Venue, Match_id)%>%
summarise_all(.funs = sum)%>%
group_by(Match_id)%>%
arrange(Match_id)%>%
mutate_if(is.numeric, funs(difference=c(-diff(.), diff(.))))

# afltables match results from the same seasons
df2<-fitzRoy::match_results
df2<-df2%>%filter(Season>2014)

# one row per team per game: stack the home and away sides
df3<-select(df2, Date, Round, Home.Team, Home.Points)
df4<-select(df2, Date, Round, Away.Team, Away.Points)

colnames(df3)[3]<-"Team"
colnames(df3)[4]<-"Points"
colnames(df4)[3]<-"Team"
colnames(df4)[4]<-"Points"

df5<-rbind(df4,df3)

# make the afltables team names match the footywire ones
df5<-df5 %>%mutate(Team = str_replace(Team, "Brisbane Lions", "Brisbane"))
df5<-df5 %>%mutate(Team = str_replace(Team, "Footscray", "Western Bulldogs"))

df6<-inner_join(df,df5, by=c("Team","Date"))

dataset_columns <- c(1,2,4,6,7,8,44:80,81)

dataset<-df6%>%group_by(Match_id)%>%
arrange(Match_id)%>%
mutate(Margin=c(-diff(Points), diff(Points)))%>%
mutate(Win_loss=if_else(Margin>0,1,0,NULL))%>%
select(dataset_columns)

# Taking Submissions

I asked Twitter: what areas do you think are important to win in order to win a game? I got a response, and here we go! So from our script above, all we need to add is goal accuracy as a predictor, or should that be goal accuracy differential?
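Both options are easy to compute once goals and behinds are in one row per team. Here is a small base R sketch on a made-up scoreline (the numbers and the column name `Accuracy_difference` are assumptions for illustration): accuracy is goals divided by scoring shots, and the differential reuses the same two-rows-per-match trick as the other statistics.

```r
# Made-up scoreline for one match (illustrative only)
scores <- data.frame(
  Match_id = c(1, 1),
  Team     = c("Home side", "Away side"),
  Goals    = c(12, 10),
  Behinds  = c(8, 15)
)

# A goal is worth 6 points and a behind 1
scores$Points <- 6 * scores$Goals + scores$Behinds

# Accuracy as a plain predictor: goals per scoring shot
scores$Accuracy <- scores$Goals / (scores$Goals + scores$Behinds)

# Accuracy as a differential within the match
scores$Accuracy_difference <- c(-diff(scores$Accuracy), diff(scores$Accuracy))
print(scores)
```

The script that follows in the post uses the plain Accuracy column; the differential is shown here only as the alternative raised in the question.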
df<-fitzRoy::player_stats%>%
filter(Season>2014)%>%
select(-Player)%>%
group_by(Date,Season, Round, Team, Status, Opposition, Venue, Match_id)%>%
summarise_all(.funs = sum)%>%
group_by(Match_id)%>%
arrange(Match_id)%>%
mutate_if(is.numeric, funs(difference=c(-diff(.), diff(.))))

df2<-fitzRoy::match_results
df2<-df2%>%filter(Season>2014)

df3<-select(df2, Date, Round, Home.Team, Home.Points,Home.Goals,Home.Behinds)
df4<-select(df2, Date, Round, Away.Team, Away.Points,Away.Goals,Away.Behinds)

df3$Accuracy<-(df3$Home.Goals/(df3$Home.Goals+df3$Home.Behinds))
colnames(df3)[3]<-"Team"
colnames(df3)[4]<-"Points"
colnames(df3)[5]<-"Goals"
colnames(df3)[6]<-"Behinds"

df4$Accuracy<-(df4$Away.Goals/(df4$Away.Goals+df4$Away.Behinds))
colnames(df4)[3]<-"Team"
colnames(df4)[4]<-"Points"
colnames(df4)[5]<-"Goals"
colnames(df4)[6]<-"Behinds"

df5<-rbind(df4,df3)
df5<-df5 %>%mutate(Team = str_replace(Team, "Brisbane Lions", "Brisbane"))
df5<-df5 %>%mutate(Team = str_replace(Team, "Footscray", "Western Bulldogs"))

df6<-inner_join(df,df5, by=c("Team","Date"))

dataset_columns <- c(1,2,4,6,7,8,44:77,82:84)

dataset<-df6%>%group_by(Match_id)%>%
arrange(Match_id)%>%
mutate(Margin=c(-diff(Points), diff(Points)))%>%
mutate(Win_loss=if_else(Margin>0,1,0,NULL))%>%
select(dataset_columns)

# Building the Logistic Regression Model

library(aod)
library(ordinal)
##
## Attaching package: 'ordinal'
## The following object is masked from 'package:dplyr':
##
##     slice
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
##
##     expand
##
## Attaching package: 'lme4'
## The following objects are masked from 'package:ordinal':
##
##     ranef, VarCorr
in.sample  <- subset(dataset, Season %in% c(2015:2017))
#in.sample  <- subset(mydata, year ==2008)
out.sample <- subset(dataset, Season == 2018)

in.sample$Win_loss <- factor(in.sample$Win_loss)
out.sample$Win_loss<-factor(out.sample$Win_loss)

To know which columns we want to scale, an easy way is to run names(dataset), which should print out the column names.

temp1<-scale(in.sample[,7:40])
in.sample[,7:40]<-temp1
#attributes(temp1)
temp1.center<-attr(temp1,"scaled:center")
temp1.scale<-attr(temp1,"scaled:scale")

m <- glm(Win_loss ~I50_difference+
Accuracy+
R50_difference+
CCL_difference+
SCL_difference+
MI5_difference ,
data = in.sample, family =binomial)

summary(m)
##
## Call:
## glm(formula = Win_loss ~ I50_difference + Accuracy + R50_difference +
##     CCL_difference + SCL_difference + MI5_difference, family = binomial,
##     data = in.sample)
##
## Deviance Residuals:
##      Min        1Q    Median        3Q       Max
## -2.43728  -0.28649  -0.00212   0.26964   2.63087
##
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)
## (Intercept)    -3.45340    0.58793  -5.874 4.26e-09 ***
## I50_difference  7.17314    0.51782  13.853  < 2e-16 ***
## Accuracy        6.43892    1.09304   5.891 3.84e-09 ***
## R50_difference  5.11442    0.38536  13.272  < 2e-16 ***
## CCL_difference  0.16421    0.11256   1.459    0.145
## SCL_difference  0.04502    0.10581   0.425    0.671
## MI5_difference  0.87950    0.16978   5.180 2.22e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 1718.92  on 1239  degrees of freedom
## Residual deviance:  620.62  on 1233  degrees of freedom
## AIC: 634.62
##
## Number of Fisher Scoring iterations: 7
newdata   <- out.sample[ , -ncol(out.sample)]

newdata[,7:40]<-scale(newdata[,7:40],center=temp1.center,scale=temp1.scale)

pre.dict    <- predict(m,newdata=newdata, type="response")
pre.dict.m  <- data.frame(matrix(unlist(pre.dict), nrow= nrow(newdata)))
colnames(pre.dict.m) <- c("prob.win")

newdata.pred  <- cbind.data.frame(newdata, pre.dict.m)
newdata.pred%>%
select(Team, Opposition, Venue,  Margin, prob.win)%>%
filter(Margin<0)%>%
arrange(desc(prob.win))%>%
top_n(10)
## Selecting by prob.win
##               Team      Opposition            Venue Margin  prob.win
## 1           Sydney North Melbourne              SCG     -2 0.9368923
## 2           Sydney        Adelaide              SCG    -10 0.8578722
## 3         Essendon         Carlton              MCG    -13 0.8239973
## 4       Gold Coast        St Kilda Metricon Stadium     -2 0.7796465
## 5         Essendon       Fremantle    Optus Stadium    -16 0.5778875
## 6         St Kilda      West Coast    Optus Stadium    -13 0.5625316
## 7  North Melbourne        Richmond   Etihad Stadium    -10 0.4065246
## 8    Port Adelaide        Hawthorn     UTAS Stadium     -3 0.4063489
## 9         Brisbane     Collingwood            Gabba     -7 0.3765427
## 10        Adelaide   Port Adelaide    Adelaide Oval     -5 0.3601077

The good thing about having a template sorted out is that you can make quick changes as you think of other variables you want to test. For example, JT asked about kicks, inside 50s, marks in 50, tackles in 50 and meters gained.
Well, let's look at his list of games.

m <- glm(Win_loss ~I50_difference+
K_difference+
I50_difference+
MI5_difference+
T5_difference+
MG_difference,
data = in.sample, family =binomial)

summary(m)
##
## Call:
## glm(formula = Win_loss ~ I50_difference + K_difference + I50_difference +
##     MI5_difference + T5_difference + MG_difference, family = binomial,
##     data = in.sample)
##
## Deviance Residuals:
##      Min        1Q    Median        3Q       Max
## -2.68051  -0.29449  -0.00427   0.29706   2.71729
##
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)
## (Intercept)    -0.05097    0.10101  -0.505 0.613808
## I50_difference -0.50237    0.17403  -2.887 0.003892 **
## K_difference    0.47670    0.15356   3.104 0.001907 **
## MI5_difference  0.63284    0.17320   3.654 0.000258 ***
## T5_difference  -0.03405    0.11939  -0.285 0.775503
## MG_difference   4.19630    0.31214  13.444  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 1718.92  on 1239  degrees of freedom
## Residual deviance:  631.05  on 1234  degrees of freedom
## AIC: 643.05
##
## Number of Fisher Scoring iterations: 7
newdata   <- out.sample[ , -ncol(out.sample)]

newdata[,7:40]<-scale(newdata[,7:40],center=temp1.center,scale=temp1.scale)

pre.dict    <- predict(m,newdata=newdata, type="response")
pre.dict.m  <- data.frame(matrix(unlist(pre.dict), nrow= nrow(newdata)))
colnames(pre.dict.m) <- c("prob.win")

newdata.pred  <- cbind.data.frame(newdata, pre.dict.m)
newdata.pred%>%
select(Team, Opposition, Venue,  Margin, prob.win)%>%
filter(Margin<0)%>%
arrange(desc(prob.win))%>%
top_n(10)
## Selecting by prob.win
##                Team      Opposition          Venue Margin  prob.win
## 1            Sydney        Adelaide            SCG    -10 0.9441431
## 2          Essendon         Carlton            MCG    -13 0.9373812
## 3          Brisbane      Gold Coast          Gabba     -5 0.8991898
## 4  Western Bulldogs          Sydney Etihad Stadium     -7 0.7757990
## 5       Collingwood         Geelong            MCG    -21 0.7610453
## 6     Port Adelaide         Geelong  Adelaide Oval    -34 0.6678467
## 7            Sydney North Melbourne            SCG     -2 0.5908486
## 8           Geelong      West Coast  Optus Stadium    -15 0.5313548
## 9          Hawthorn      West Coast Etihad Stadium    -15 0.4912265
## 10              GWS          Sydney            SCG    -16 0.3831652

Another example, from Troy Wheatley.

m <- glm(Win_loss ~I50_difference+
MG_difference+
CP_difference+
CM_difference+
CCL_difference+
CL_difference+
HO_difference+
ITC_difference+
D_difference+
CG_difference+
R50_difference+
One.Percenters_difference,
data = in.sample, family =binomial)

summary(m)
##
## Call:
## glm(formula = Win_loss ~ I50_difference + MG_difference + CP_difference +
##     CM_difference + CCL_difference + CL_difference + HO_difference +
##     ITC_difference + D_difference + CG_difference + R50_difference +
##     One.Percenters_difference, family = binomial, data = in.sample)
##
## Deviance Residuals:
##      Min        1Q    Median        3Q       Max
## -3.09976  -0.24111  -0.00167   0.23880   3.13712
##
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)
## (Intercept)               -0.05870    0.10841  -0.541  0.58818
## I50_difference             3.63170    0.68743   5.283 1.27e-07 ***
## MG_difference              3.06213    0.41218   7.429 1.09e-13 ***
## CP_difference             -0.13434    0.16628  -0.808  0.41915
## CM_difference              0.18235    0.13033   1.399  0.16176
## CCL_difference             0.40136    0.14838   2.705  0.00683 **
## CL_difference              0.06447    0.19133   0.337  0.73613
## HO_difference              0.15043    0.11475   1.311  0.18988
## ITC_difference             0.40376    0.18444   2.189  0.02859 *
## D_difference              -0.12950    0.17347  -0.747  0.45534
## CG_difference             -0.54681    0.12910  -4.235 2.28e-05 ***
## R50_difference             3.20051    0.48150   6.647 2.99e-11 ***
## One.Percenters_difference -0.07831    0.12351  -0.634  0.52606
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 1718.92  on 1239  degrees of freedom
## Residual deviance:  556.62  on 1227  degrees of freedom
## AIC: 582.62
##
## Number of Fisher Scoring iterations: 7
newdata   <- out.sample[ , -ncol(out.sample)]

newdata[,7:40]<-scale(newdata[,7:40],center=temp1.center,scale=temp1.scale)

pre.dict    <- predict(m,newdata=newdata, type="response")
pre.dict.m  <- data.frame(matrix(unlist(pre.dict), nrow= nrow(newdata)))
colnames(pre.dict.m) <- c("prob.win")

newdata.pred  <- cbind.data.frame(newdata, pre.dict.m)
newdata.pred%>%
select(Team, Opposition, Venue,  Margin, prob.win)%>%
filter(Margin<0)%>%
arrange(desc(prob.win))%>%
top_n(10)
## Selecting by prob.win
##               Team      Opposition            Venue Margin  prob.win
## 1         Essendon         Carlton              MCG    -13 0.9225119
## 2           Sydney        Adelaide              SCG    -10 0.8943902
## 3           Sydney North Melbourne              SCG     -2 0.8911554
## 4         Adelaide       Fremantle    Optus Stadium     -3 0.5566460
## 5      Collingwood         Geelong              MCG    -21 0.5081333
## 6              GWS          Sydney              SCG    -16 0.4832603
## 7       Gold Coast        St Kilda Metricon Stadium     -2 0.4386997
## 8  North Melbourne        Richmond   Etihad Stadium    -10 0.4137897
## 9         Hawthorn      West Coast   Etihad Stadium    -15 0.3802405
## 10        Brisbane   Port Adelaide    Adelaide Oval     -5 0.3625098

We have an idea from insightlane.

df<-fitzRoy::player_stats%>%
filter(Season>2014)%>%
select(-Player)%>%
group_by(Date,Season, Round, Team, Status, Opposition, Venue, Match_id)%>%
summarise_all(.funs = sum)%>%
mutate(disposaltoturnover=D/TO)%>%
group_by(Match_id)%>%
arrange(Match_id)%>%
mutate_if(is.numeric, funs(difference=c(-diff(.), diff(.))))

df2<-fitzRoy::match_results
df2<-df2%>%filter(Season>2014)

df3<-select(df2, Date, Round, Home.Team, Home.Points,Home.Goals,Home.Behinds)
df4<-select(df2, Date, Round, Away.Team, Away.Points,Away.Goals,Away.Behinds)

df3$Accuracy<-(df3$Home.Goals/(df3$Home.Goals+df3$Home.Behinds))
colnames(df3)[3]<-"Team"
colnames(df3)[4]<-"Points"
colnames(df3)[5]<-"Goals"
colnames(df3)[6]<-"Behinds"

df4$Accuracy<-(df4$Away.Goals/(df4$Away.Goals+df4$Away.Behinds))
colnames(df4)[3]<-"Team"
colnames(df4)[4]<-"Points"
colnames(df4)[5]<-"Goals"
colnames(df4)[6]<-"Behinds"

df5<-rbind(df4,df3)
df5<-df5 %>%mutate(Team = str_replace(Team, "Brisbane Lions", "Brisbane"))
df5<-df5 %>%mutate(Team = str_replace(Team, "Footscray", "Western Bulldogs"))

df6<-inner_join(df,df5, by=c("Team","Date"))

dataset_columns <- c(1,2,4,6,7,8,45:79,81:86)

dataset<-df6%>%group_by(Match_id)%>%
arrange(Match_id)%>%
mutate(Margin=c(-diff(Points), diff(Points)))%>%
mutate(Win_loss=if_else(Margin>0,1,0,NULL))%>%
select(dataset_columns)

in.sample  <- subset(dataset, Season %in% c(2015:2017))
#in.sample  <- subset(mydata, year ==2008)
out.sample <- subset(dataset, Season == 2018)

in.sample$Win_loss <- factor(in.sample$Win_loss)
out.sample$Win_loss<-factor(out.sample$Win_loss)

temp1<-scale(in.sample[,7:41])
in.sample[,7:41]<-temp1
#attributes(temp1)
temp1.center<-attr(temp1,"scaled:center")
temp1.scale<-attr(temp1,"scaled:scale")

m <- glm(Win_loss ~K_difference+
MG_difference+
disposaltoturnover_difference,
data = in.sample, family =binomial)

summary(m)
##
## Call:
## glm(formula = Win_loss ~ K_difference + MG_difference + disposaltoturnover_difference,
##     family = binomial, data = in.sample)
##
## Deviance Residuals:
##      Min        1Q    Median        3Q       Max
## -2.63495  -0.30591  -0.00522   0.30866   2.67106
##
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)
## (Intercept)                   -0.04937    0.09940  -0.497    0.619
## K_difference                   0.70737    0.17504   4.041 5.32e-05 ***
## MG_difference                  4.05987    0.29071  13.965  < 2e-16 ***
## disposaltoturnover_difference -0.13159    0.18617  -0.707    0.480
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 1716.15  on 1237  degrees of freedom
## Residual deviance:  651.09  on 1234  degrees of freedom
##   (2 observations deleted due to missingness)
## AIC: 659.09
##
## Number of Fisher Scoring iterations: 7
newdata   <- out.sample[ , -ncol(out.sample)]

newdata[,7:41]<-scale(newdata[,7:41],center=temp1.center,scale=temp1.scale)

pre.dict    <- predict(m,newdata=newdata, type="response")
pre.dict.m  <- data.frame(matrix(unlist(pre.dict), nrow= nrow(newdata)))
colnames(pre.dict.m) <- c("prob.win")

newdata.pred  <- cbind.data.frame(newdata, pre.dict.m)
newdata.pred%>%
select(Team, Opposition, Venue,  Margin, prob.win)%>%
filter(Margin<0)%>%
arrange(desc(prob.win))%>%
top_n(10)
## Selecting by prob.win
##                Team      Opposition          Venue Margin  prob.win
## 1            Sydney        Adelaide            SCG    -10 0.9249691
## 2          Essendon         Carlton            MCG    -13 0.8865205
## 4          Brisbane      Gold Coast          Gabba     -5 0.8617977
## 6       Collingwood         Geelong            MCG    -21 0.7254149
## 7            Sydney North Melbourne            SCG     -2 0.6606187
## 10         Brisbane     Collingwood          Gabba     -7 0.3885346
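The template used throughout (scale the training seasons, reuse that centre and spread on the hold-out season, fit a logistic regression, then rank the losses by predicted win probability) can be tried end to end without fitzRoy. The sketch below runs on simulated data: the numbers, the single predictor and the names `sim`, `centre` and `spread` are assumptions for illustration, not results from the post.

```r
set.seed(1)  # simulated data only, not real AFL results
n <- 400
sim <- data.frame(
  Season        = rep(c(2015, 2016, 2017, 2018), each = n / 4),
  MG_difference = rnorm(n, mean = 0, sd = 300)  # hypothetical metres gained differential
)
# Hypothetical margin: driven by metres gained plus noise
sim$Margin   <- 0.1 * sim$MG_difference + rnorm(n, sd = 20)
sim$Win_loss <- factor(as.integer(sim$Margin > 0))

in.sample  <- subset(sim, Season %in% 2015:2017)
out.sample <- subset(sim, Season == 2018)

# Scale the training predictor and keep its centre and spread
scaled <- scale(in.sample[, "MG_difference", drop = FALSE])
in.sample$MG_difference <- as.numeric(scaled)
centre <- attr(scaled, "scaled:center")
spread <- attr(scaled, "scaled:scale")

# Apply the training centre and spread to the hold-out season
out.sample$MG_difference <- as.numeric(
  scale(out.sample[, "MG_difference", drop = FALSE],
        center = centre, scale = spread)
)

m <- glm(Win_loss ~ MG_difference, data = in.sample, family = binomial)
out.sample$prob.win <- predict(m, newdata = out.sample, type = "response")

# Losses that the model rated as the most likely wins
losses <- out.sample[out.sample$Margin < 0, ]
print(head(losses[order(-losses$prob.win), c("Margin", "prob.win")], 5))
```

Reusing the training centre and spread matters: scaling the 2018 season on its own statistics would put the hold-out data on a slightly different footing from the data the coefficients were fitted on.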