# Football (Eredivisie) goals

August 27, 2012
By

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The football season has started in Netherlands, so I went and had a look at last year’s scores. I did not find downloadable data, at http://www.eredivisiestats.nl/wedstrijden.php I could copy last season’s data and paste into a spreadsheet. The default layout did not look good to me for statistical analysis (first lines shown below), so first part is reformatting to get how many goals a club did against another. This means that each game becomes two lines and an indicator variable is used for who was playing at home.

 Datum Thuisclub (home) Uitclub (away) Thuisscore Uitscore 2011-08-05 Excelsior Feyenoord 0 2 2011-08-06 RKC Waalwijk Heracles Almelo 2 2 2011-08-06 Roda JC FC Groningen 2 1

# reformat lengthwise
StartData <- with(old,data.frame(
OffenseClub = c(Thuisclub,Uitclub),
DefenseClub = c(Uitclub,Thuisclub),
Goals = c(Thuisscore,Uitscore),
OffThuis = rep(c(1,0),each=length(Datum))
))
#remove space as first character in names
levels(StartData\$OffenseClub) <- sub(‘^ ‘,”,levels(StartData\$OffenseClub))
levels(StartData\$DefenseClub) <- sub(‘^ ‘,”,levels(StartData\$DefenseClub))
# check clubs have same name and ordening in both factors.
all(levels(StartData\$OffenseClub)==levels(StartData\$DefenseClub))

#### Distribution of goals

To get to know the data the distribution of goals is most interesting:
xt <- xtabs(~Goals ,data=StartData)
plot(xt)
The first question is; is this approximately Poisson distributed? MASS has some functions.
library(MASS)
(fd <- fitdistr(StartData\$Goals, densfun=”Poisson”))
lambda
1.62908497
(0.05159364)
round(dpois(0:8,coef(fd))*sum(xt))
plot(xt,ylim=c(0,200),ylab=’Frequency of goals’)
points(x=0:8,y=dpois(0:8,coef(fd))*sum(xt),col=’green’)
It is also possible to do the same calculations with the fitdistrplus package and get some extra information. Two different algorithms give essentially the same information.
library(fitdistrplus)
(fd2 <- fitdist(StartData\$Goals, distr=’pois’,method=’mle’))
Fitting of the distribution ‘ pois ‘ by maximum likelihood

Parameters:
estimate Std. Error
lambda 1.629085 0.05159362

fitdist(StartData\$Goals, distr=’pois’,method=’mme’)

Fitting of the distribution ‘ pois ‘ by matching moments
Parameters:
estimate
lambda 1.629085

plot(fd2)
This is nice. I get the same information as my own plot, but also a CDF and have to do less. If only I knew all packages in R, I would not have to program anything. But there are more extra’s in fitdistrplus. Goodness of fit and bootstrap confidence interval.

gofstat(fd2,print.test=TRUE)
Chi-squared statistic:  18.46365
Degree of freedom of the Chi-squared distribution:  4
Chi-squared p-value:  0.001001433

bd <- bootdist(fd2)
summary(bd)

Parametric bootstrap medians and 95% percentile CI
Median     2.5%    97.5%
1.627451 1.527803 1.728736
plot(density(bd\$estim\$lambda))

Finally, if desired the same can be done the Bayesian way via Jags
library(R2jags)
calcData <- with(StartData,list(N=length(Goals),Goals=Goals))
mijnmodel <- function() {
for (i in 1:N) {
Goals[i] ~ dpois(mu)
}
mu ~ dunif(0,3)
}
write.model(mijnmodel,’model.file’)
jags.inits <- function() {
list(
mu = runif(1,0,3)
)
}
jags.params=c(‘mu’)
jags(data=calcData, inits=jags.inits, jags.params,model.file=’model.file’)
Inference for Bugs model at “model.file”, fit using jags,
3 chains, each with 2000 iterations (first 1000 discarded)
n.sims = 3000 iterations saved
mu.vect sd.vect     2.5%      25%      50%      75%    97.5%  Rhat n.eff
mu         1.633   0.051    1.533    1.598    1.632    1.667    1.732 1.001  2100
devianc 2048.287   1.338 2047.308 2047.403 2047.757 2048.671 2052.118 1.006  1500

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 0.9 and DIC = 2049.2
DIC is an estimate of expected predictive error (lower deviance is better).
This leads to essentially the same result as before.

#### Conclusion

Three ways to get parameters of a distribution are used. The most obvious way, via MASS gives parameters and standard deviation. A dedicated package gives all that plus significance and distribution. The Bayesian way gives the distribution too and uses most code by far. The estimates do not differ in any relevant way.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.