I’ve had a quick look at this article on R-bloggers $-$ I don’t think I’ve followed the whole exchange, but I believe they have discussed what models should/could be applied to estimate football scores (specifically, in this case they are using the Dutch league).

The main point of the post is that using ordinal regression models can improve the performance (I suppose in terms of prediction or validation of the probability associated with the observed frequency of the results).

At a very superficial level (since I’ve just read the article and have not thought about this a great deal), I think that assuming that the observed number of goals can be considered as an ordinal variable, much as you would do for a Likert scale, is not quite the best option.

This assumption *might* not have a huge impact on the actual results of this model; just as for an ordinal variable, the distance between the modalities is not linear (thus moving from scoring 0 to scoring 1 goal does not necessarily take the same effort required for moving from scoring 3 to scoring 4 goals). And ordinal regression can accommodate this situation. But I think this formulation is unnecessarily complicated and a bit confusing.

Moreover (and far more importantly, I think), if I understand it correctly, both the original models and those discussed in the post I’m considering seem to assume independence between the goals scored by the two teams competing in a single game. This is not realistic, I think, as we proved in our paper (of course drawing on other good examples in the literature).

In particular, we were considering a hierarchical structure in which the goals scored by the two competing teams are conditionally independent given a set of parameters (accounting for defence and attack, and home advantage); but because these were given exchangeable priors, correlation would be implied in the responses $-$ something like this:

The Bayesian machinery was very good at prediction, especially after we considered a slightly more complex structure in which we included information on each team’s propensity to be “good”, “average”, or “poor”. This helped avoid overshrinkage in the estimations and we did quite well.

An interesting point of the models discussed in the posts at R-bloggers is the introduction of a time effect (in this particular case to account for winter breaks in the Dutch league). In our experience, we have only considered the Italian, Spanish and English leagues (which, as far as I am aware of) do not have breaks.

But including external information is always good: for example, teams involved in European football (eg Champion’s or Europa League) may do worse on the league games immediately before (and/or immediately after) their European fixture. This would be easy enough to include and could perhaps increase the precision in the estimations.

*Related*

To

**leave a comment** for the author, please follow the link and comment on their blog:

** Gianluca Baio's blog**.

R-bloggers.com offers

**daily e-mail updates** about

R news and

tutorials on topics such as:

Data science,

Big Data, R jobs, visualization (

ggplot2,

Boxplots,

maps,

animation), programming (

RStudio,

Sweave,

LaTeX,

SQL,

Eclipse,

git,

hadoop,

Web Scraping) statistics (

regression,

PCA,

time series,

trading) and more...

If you got this far, why not

__subscribe for updates__ from the site? Choose your flavor:

e-mail,

twitter,

RSS, or

facebook...