Football model

[This article was first published on Wiekvoet, and kindly contributed to R-bloggers.]

After reading the Dutch football data (Eredivisie 2011-2012) and making a display of predictions, it is time to look at a few simple models to predict goals. To reiterate the data setup: each game played consists of two rows in the data frame, one row for the number of goals the home team makes and another row for the away team. We start with four models. Two of them I don't believe in: model 0, where the number of goals is independent of the clubs and everything else, and model 1, where the number of goals depends only on the team making the goals. The other two are plausible: in model 2 both the attacking and the defending team determine the number of goals, and in model 3 both teams determine the number of goals, together with which of them is playing at home.

model0 <- glm(Goals ~ 1,data=StartData,family='poisson')
model1 <- glm(Goals ~OffenseClub,data=StartData,family='poisson')
model2 <- glm(Goals ~OffenseClub + DefenseClub,data=StartData,family='poisson')
model3 <- glm(Goals ~OffenseClub + DefenseClub +
                     OffThuis,data=StartData,family='poisson')
anova(model0,model1,model2,model3,test='Chisq')
Analysis of Deviance Table

Model 1: Goals ~ 1
Model 2: Goals ~ OffenseClub
Model 3: Goals ~ OffenseClub + DefenseClub
Model 4: Goals ~ OffenseClub + DefenseClub + OffThuis
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1       611     865.23                          
2       594     754.17 17  111.064 7.610e-16 ***
3       577     699.13 17   55.043 6.743e-06 ***
4       576     668.96  1   30.172 3.955e-08 ***
Each step that makes the model more complex is significant, so we must reject the hypothesis that any of these terms is irrelevant. Hence the number of goals depends on both teams plus a home team effect.
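In formula form, the four nested models can be sketched as below. The symbols are shorthand introduced here and do not appear in the code: i is the attacking club, j the defending club, and h is 1 when the attacking club plays at home (OffThuis) and 0 otherwise. Goals are assumed Poisson distributed with the log link that glm uses by default for this family.

```latex
\begin{aligned}
\text{model 0:}\quad \log \mathrm{E}[\text{Goals}_{ij}] &= \mu \\
\text{model 1:}\quad \log \mathrm{E}[\text{Goals}_{ij}] &= \mu + \alpha_i \\
\text{model 2:}\quad \log \mathrm{E}[\text{Goals}_{ij}] &= \mu + \alpha_i + \beta_j \\
\text{model 3:}\quad \log \mathrm{E}[\text{Goals}_{ij}] &= \mu + \alpha_i + \beta_j + \gamma\, h
\end{aligned}
```

With R's default treatment contrasts, the offense and defense effects of the first club level are fixed at zero, so exp(mu) is the expected number of goals when that reference club attacks and defends.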

The twelfth man 

It does make a difference who is playing at home. In practical terms, because the model works on the log scale, this advantage is difficult to interpret. As a reference, take two equally strong clubs at the baseline of the parameterization (Ado Den Haag against itself): under model 2 each makes about 1.3 goals.
exp(coef(model2)[1])
(Intercept) 
   1.346538 
When one of these equally strong teams plays at home and the other away, the numbers change. The home team makes about 1.6 goals, the away team only about 1.1.
exp(coef(model3)[length(coef(model3))] + coef(model3)[1])
OffThuis 
 1.58019 
exp(coef(model3)[1])
(Intercept) 
   1.112886 
This makes playing at home or away both statistically and practically significant. Note that the size of this effect cannot be transferred to other circumstances.
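Because the model is fitted on the log scale, the home effect is a multiplicative factor on the goal scale. A small sketch of that arithmetic, using only the two numbers printed above and assuming OffThuis is coded as a 0/1 indicator, so that its coefficient is the log of this factor:

```r
# Expected goals for the reference club, copied from the output above
goals_home <- 1.58019    # exp(intercept + OffThuis) from model3
goals_away <- 1.112886   # exp(intercept) from model3

# Additive on the log scale means multiplicative on the goal scale
home_factor <- goals_home / goals_away
round(home_factor, 2)

# The matching value on the log scale; assumed to correspond to
# coef(model3)['OffThuis'] when OffThuis is a 0/1 indicator
round(log(home_factor), 3)
```

Under model 3 this factor of roughly 1.4 is the same for every pair of clubs, which is exactly the assumption the later extensions relax.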

The teams 

Each of the teams has two parameters in the model. These can most easily be interpreted as offensive and defensive power. The following code plots these powers.
co <- coef(model3)
coO <- co[grep('Offense',names(co))]
coD <- co[grep('Defense',names(co))]
names(coO) <- gsub('OffenseClub','',names(coO))
names(coD) <- gsub('DefenseClub','',names(coD))
# Ado Den Haag is the reference level, so it is missing from the coefficients; add it with value 0.
coB <- rbind(cbind(coO,coD),matrix(c(0,0)
             ,nrow=1,dimnames=list('Ado Den Haag',c('coO','coD'))))
# scaled for relative strength 
coB <- as.data.frame(scale(coB,scale=FALSE)) 
# -coD to make more defensive power visually larger
plot(-coD ~coO, type='n', data=coB,xlab='Offensive power',ylab='Defensive power',axes=FALSE)
text(-coD ~coO,data=coB,labels=rownames(coB))
abline(a=0,b=1)
abline(v=0)
abline(h=0)
The plot shows the two axes through the origin; a team close to the centre (NAC Breda, FC Utrecht) was average in both offensive and defensive strength. The diagonal line marks where defensive and offensive strength are equal. Feyenoord lies on this line and is equally strong in offense and defense; the same holds for De Graafschap. The line does not appear at 45 degrees, because the range in offensive strength is larger than the range in defensive strength. The best team is top right: Ajax. The worst teams are bottom left: De Graafschap and Excelsior were relegated to the Eerste Divisie. A few clubs stand out for the mismatch between their offensive and defensive strengths. SC Heerenveen has almost the same goal-making power as Ajax, but not enough defensive capacity. In contrast, Vitesse does not concede many goals, but lacks the power to make them. Overall, these two have about the same strength.
Stated otherwise: if SC Heerenveen played against itself, ignoring home advantage, it would probably make two or even three goals.
fbpredict(model2,'SC Heerenveen','SC Heerenveen')[[1]]
SC Heerenveen in rows against SC Heerenveen in columns 
  0      1      2      3      4      5      6      7      8      9     
0 0.0060 0.0153 0.0196 0.0167 0.0107 0.0055 0.0023 0.0009 0.0003 0.0001
1 0.0153 0.0391 0.0501 0.0428 0.0274 0.0140 0.0060 0.0022 0.0007 0.0002
2 0.0196 0.0501 0.0641 0.0548 0.0351 0.0180 0.0077 0.0028 0.0009 0.0003
3 0.0167 0.0428 0.0548 0.0467 0.0299 0.0153 0.0065 0.0024 0.0008 0.0002
4 0.0107 0.0274 0.0351 0.0299 0.0192 0.0098 0.0042 0.0015 0.0005 0.0001
5 0.0055 0.0140 0.0180 0.0153 0.0098 0.0050 0.0021 0.0008 0.0003 0.0001
6 0.0023 0.0060 0.0077 0.0065 0.0042 0.0021 0.0009 0.0003 0.0001 0     
7 0.0009 0.0022 0.0028 0.0024 0.0015 0.0008 0.0003 0.0001 0      0     
8 0.0003 0.0007 0.0009 0.0008 0.0005 0.0003 0.0001 0      0      0     
9 0.0001 0.0002 0.0003 0.0002 0.0001 0.0001 0      0      0      0     
If Vitesse played against itself it would make zero or one goal.
fbpredict(model2,'Vitesse','Vitesse')[[1]]
Vitesse in rows against Vitesse in columns 
  0      1      2      3      4      5      6      7      8      9     
0 0.1165 0.1252 0.0673 0.0241 0.0065 0.0014 0.0002 0      0      0     
1 0.1252 0.1346 0.0724 0.0259 0.0070 0.0015 0.0003 0      0      0     
2 0.0673 0.0724 0.0389 0.0139 0.0037 0.0008 0.0001 0      0      0     
3 0.0241 0.0259 0.0139 0.0050 0.0013 0.0003 0.0001 0      0      0     
4 0.0065 0.0070 0.0037 0.0013 0.0004 0.0001 0      0      0      0     
5 0.0014 0.0015 0.0008 0.0003 0.0001 0      0      0      0      0     
6 0.0002 0.0003 0.0001 0.0001 0      0      0      0      0      0     
7 0      0      0      0      0      0      0      0      0      0     
8 0      0      0      0      0      0      0      0      0      0     
9 0      0      0      0      0      0      0      0      0      0    
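fbpredict comes from the earlier post and its code is not repeated here. The tables above are consistent with treating the two goal counts as independent Poisson variables, so a matrix of this kind can be sketched as an outer product of two Poisson distributions. The function score_matrix and the value of lambda below are illustrative assumptions, not the original implementation; lambda is back-calculated from the 0-0 cell of the Vitesse table.

```r
# Hypothetical helper: joint probabilities of a score line, assuming the
# two teams' goals are independent Poisson counts
score_matrix <- function(lambda_row, lambda_col, max_goals = 9) {
  goals <- 0:max_goals
  p <- outer(dpois(goals, lambda_row), dpois(goals, lambda_col))
  dimnames(p) <- list(goals, goals)
  p
}

# For a team against itself both rates are equal, so P(0-0) = exp(-lambda)^2.
# Solving 0.1165 = exp(-2 * lambda) gives the rate implied by the table.
lambda <- -log(sqrt(0.1165))

m <- score_matrix(lambda, lambda)
round(m[1:3, 1:3], 4)
```

The rounded upper-left corner can be compared with the first rows and columns of the Vitesse table; small differences are possible because lambda is derived from a rounded value.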

Model extensions

The residual deviance of model3 is 668.96 on 576 degrees of freedom, which is somewhat more than the degrees of freedom. That might mean some more effects can be found in the data.
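A rough way to put a number on this is to compare the residual deviance with its degrees of freedom. The sketch below uses only the two values quoted above; the chi-squared comparison is an approximation that can be poor for Poisson data with small counts, so it should be read as an indication rather than a formal test.

```r
# Values quoted above for model3
resid_dev <- 668.96
resid_df  <- 576

# Ratio well above 1 hints at lack of fit or overdispersion
ratio <- resid_dev / resid_df
round(ratio, 2)

# Approximate upper-tail probability under a chi-squared reference
p_fit <- pchisq(resid_dev, df = resid_df, lower.tail = FALSE)
p_fit
```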

Twelfth man and teams

The first extension is that the home advantage differs between teams, either in their offense (model4a), in their defense (model4b), or in both (model5). Based on these data, none of these interactions seems to be statistically significant.
model4a <- glm(Goals ~OffenseClub*OffThuis + DefenseClub 
                       ,data=StartData,family='poisson')
model4b <- glm(Goals ~OffenseClub + DefenseClub*OffThuis 
                       ,data=StartData,family='poisson')
model5 <- glm(Goals ~(OffenseClub + DefenseClub)*OffThuis 
                       ,data=StartData,family='poisson')
anova(model3,model4a,model5,test='Chisq')
Analysis of Deviance Table

Model 1: Goals ~ OffenseClub + DefenseClub + OffThuis
Model 2: Goals ~ OffenseClub * OffThuis + DefenseClub
Model 3: Goals ~ (OffenseClub + DefenseClub) * OffThuis
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1       576     668.96                     
2       559     649.00 17   19.953   0.2766
3       542     626.77 17   22.236   0.1758
anova(model3,model4b,model5,test='Chisq')
Analysis of Deviance Table

Model 1: Goals ~ OffenseClub + DefenseClub + OffThuis
Model 2: Goals ~ OffenseClub + DefenseClub * OffThuis
Model 3: Goals ~ (OffenseClub + DefenseClub) * OffThuis
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1       576     668.96                     
2       559     647.46 17   21.499   0.2048
3       542     626.77 17   20.690   0.2404
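The p-values in these tables come from comparing the drop in deviance with a chi-squared distribution on the corresponding number of extra parameters. A small sketch reproducing the first comparison (model3 against model4a) from the printed numbers, assuming the dispersion is fixed at 1 as is usual for the Poisson family:

```r
# Drop in deviance and extra parameters for model4a versus model3,
# copied from the first table above
dev_drop <- 19.953
df_drop  <- 17

# Upper-tail chi-squared probability; this should agree with the
# Pr(>Chi) column up to rounding of the printed deviance
p_value <- pchisq(dev_drop, df = df_drop, lower.tail = FALSE)
round(p_value, 4)
```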

Before and after winter break

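The winter break gives clubs the opportunity to change players, so teams might change in quality over this period. The calendar year of the match date is used here as the indicator for before or after the break. In these data, the effect does not seem to be statistically significant.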
StartData$year <- factor(c(substr(old$Datum,1,4),substr(old$Datum,1,4)))
model6 <- glm(Goals ~OffenseClub + DefenseClub  + year + OffThuis 
             ,data=StartData,family='poisson')
model7 <- glm(Goals ~(OffenseClub + DefenseClub)*year + OffThuis 
             ,data=StartData,family='poisson')
anova(model3,model6,model7,test='Chisq')
Analysis of Deviance Table

Model 1: Goals ~ OffenseClub + DefenseClub + OffThuis
Model 2: Goals ~ OffenseClub + DefenseClub + year + OffThuis
Model 3: Goals ~ (OffenseClub + DefenseClub) * year + OffThuis
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1       576     668.96                     
2       575     668.82  1    0.135   0.7129
3       541     625.48 34   43.345   0.1308
