The football season has started in Netherlands, so I went and had a look at last year’s scores. I did not find downloadable data, at http://www.eredivisiestats.nl/wedstrijden.php I could copy last season’s data and paste into a spreadsheet. The default layout did not look good to me for statistical analysis (first lines shown below), so first part is reformatting to get how many goals a club did against another. This means that each game becomes two lines and an indicator variable is used for who was playing at home.

Datum |
Thuisclub (home) |
Uitclub (away) |
Thuisscore |
Uitscore |

2011-08-05 |
Excelsior |
Feyenoord |
0 |
2 |

2011-08-06 |
RKC Waalwijk |
Heracles Almelo |
2 |
2 |

2011-08-06 |
Roda JC |
FC Groningen |
2 |
1 |

# read data

old <- read.csv(‘seizoen1112.csv’,stringsAsFactors=FALSE)

# reformat lengthwise

StartData <- with(old,data.frame(

OffenseClub = c(Thuisclub,Uitclub),

DefenseClub = c(Uitclub,Thuisclub),

Goals = c(Thuisscore,Uitscore),

OffThuis = rep(c(1,0),each=length(Datum))

))

#remove space as first character in names

levels(StartData$OffenseClub) <- sub(‘^ ‘,”,levels(StartData$OffenseClub))

levels(StartData$DefenseClub) <- sub(‘^ ‘,”,levels(StartData$DefenseClub))

# check clubs have same name and ordening in both factors.

all(levels(StartData$OffenseClub)==levels(StartData$DefenseClub))

#### Distribution of goals

To get to know the data the distribution of goals is most interesting:

xt <- xtabs(~Goals ,data=StartData)

plot(xt)

The first question is; is this approximately Poisson distributed? MASS has some functions.

library(MASS)

(fd <- fitdistr(StartData$Goals, densfun=”Poisson”))

lambda

1.62908497

(0.05159364)

round(dpois(0:8,coef(fd))*sum(xt))

plot(xt,ylim=c(0,200),ylab=’Frequency of goals’)

points(x=0:8,y=dpois(0:8,coef(fd))*sum(xt),col=’green’)

It is also possible to do the same calculations with the fitdistrplus package and get some extra information. Two different algorithms give essentially the same information.

library(fitdistrplus)

(fd2 <- fitdist(StartData$Goals, distr=’pois’,method=’mle’))

Fitting of the distribution ‘ pois ‘ by maximum likelihood

Parameters:

estimate Std. Error

lambda 1.629085 0.05159362

fitdist(StartData$Goals, distr=’pois’,method=’mme’)

Fitting of the distribution ‘ pois ‘ by matching moments

Parameters:

estimate

lambda 1.629085

plot(fd2)

This is nice. I get the same information as my own plot, but also a CDF and have to do less. If only I knew all packages in R, I would not have to program anything. But there are more extra’s in fitdistrplus. Goodness of fit and bootstrap confidence interval.

gofstat(fd2,print.test=TRUE)

Chi-squared statistic: 18.46365

Degree of freedom of the Chi-squared distribution: 4

Chi-squared p-value: 0.001001433

bd <- bootdist(fd2)

summary(bd)

Parametric bootstrap medians and 95% percentile CI

Median 2.5% 97.5%

1.627451 1.527803 1.728736

plot(density(bd$estim$lambda))

Finally, if desired the same can be done the Bayesian way via Jags

library(R2jags)

calcData <- with(StartData,list(N=length(Goals),Goals=Goals))

mijnmodel <- function() {

for (i in 1:N) {

Goals[i] ~ dpois(mu)

}

mu ~ dunif(0,3)

}

write.model(mijnmodel,’model.file’)

jags.inits <- function() {

list(

mu = runif(1,0,3)

)

}

jags.params=c(‘mu’)

jags(data=calcData, inits=jags.inits, jags.params,model.file=’model.file’)

Inference for Bugs model at “model.file”, fit using jags,

3 chains, each with 2000 iterations (first 1000 discarded)

n.sims = 3000 iterations saved

mu.vect sd.vect 2.5% 25% 50% 75% 97.5% Rhat n.eff

mu 1.633 0.051 1.533 1.598 1.632 1.667 1.732 1.001 2100

devianc 2048.287 1.338 2047.308 2047.403 2047.757 2048.671 2052.118 1.006 1500

For each parameter, n.eff is a crude measure of effective sample size,

and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)

pD = 0.9 and DIC = 2049.2

DIC is an estimate of expected predictive error (lower deviance is better).

This leads to essentially the same result as before.

#### Conclusion

Three ways to get parameters of a distribution are used. The most obvious way, via MASS gives parameters and standard deviation. A dedicated package gives all that plus significance and distribution. The Bayesian way gives the distribution too and uses most code by far. The estimates do not differ in any relevant way.

*Related*

To

**leave a comment** for the author, please follow the link and comment on his blog:

** Wiekvoet**.

R-bloggers.com offers

**daily e-mail updates** about

R news and

tutorials on topics such as: visualization (

ggplot2,

Boxplots,

maps,

animation), programming (

RStudio,

Sweave,

LaTeX,

SQL,

Eclipse,

git,

hadoop,

Web Scraping) statistics (

regression,

PCA,

time series,

trading) and more...

If you got this far, why not

__subscribe for updates__ from the site? Choose your flavor:

e-mail,

twitter,

RSS, or

facebook...