Ordinal football

[This article was first published on Gianluca Baio's blog, and kindly contributed to R-bloggers].

I’ve had a quick look at this article on R-bloggers. I don’t think I’ve followed the whole exchange, but I believe the authors have been discussing which models should or could be used to estimate football scores (in this case, for the Dutch league).

The main point of the post is that using ordinal regression models can improve performance (I suppose in terms of prediction, or of how well the estimated probabilities agree with the observed frequency of the results).

At a very superficial level (since I’ve just read the article and have not thought about this a great deal), I think that assuming that the observed number of goals can be considered as an ordinal variable, much as you would do for a Likert scale, is not quite the best option. 

This assumption might not have a huge impact on the actual results of the model: just as for an ordinal variable, the distance between the categories is not constant (moving from scoring 0 to scoring 1 goal does not necessarily take the same effort as moving from scoring 3 to scoring 4 goals), and ordinal regression can accommodate this. But I think this formulation is unnecessarily complicated and a bit confusing.
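To make the "unequal distances" point concrete, here is a minimal sketch of a cumulative-logit (proportional odds) model for goals grouped as 0, 1, 2 and 3+. The cutpoints and the linear predictor values are hypothetical numbers chosen for illustration; they are not taken from the R-bloggers post or from any fitted model.

```python
import math

def ordinal_probs(eta, cutpoints):
    """Category probabilities under a cumulative-logit model.

    P(Y <= k) = logistic(c_k - eta) for increasing cutpoints c_0 < c_1 < ...
    The last category takes whatever probability is left.
    """
    def logistic(x):
        return 1.0 / (1.0 + math.exp(-x))

    cum = [logistic(c - eta) for c in cutpoints] + [1.0]
    return [cum[0]] + [cum[k] - cum[k - 1] for k in range(1, len(cum))]

# Hypothetical cutpoints separating 0 | 1 | 2 | 3+ goals.
# The gaps (1.5, then 2.0) are deliberately unequal: on the latent scale,
# going from 2 to 3+ goals is "further" than going from 0 to 1.
CUTPOINTS = [-1.0, 0.5, 2.5]

weak = ordinal_probs(-0.5, CUTPOINTS)   # assumed weaker attack
strong = ordinal_probs(0.5, CUTPOINTS)  # assumed stronger attack

for label, probs in (("weak", weak), ("strong", strong)):
    print(label, [round(p, 3) for p in probs])
```

The flexibility comes at the cost of one free cutpoint per category boundary, whereas a count model describes the whole distribution of goals through a single rate.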

Moreover (and far more importantly, I think), if I understand it correctly, both the original models and those discussed in the post I’m considering seem to assume independence between the goals scored by the two teams competing in a single game. This is not realistic, I think, as we proved in our paper (of course drawing on other good examples in the literature).

In particular, we considered a hierarchical structure in which the goals scored by the two competing teams are conditionally independent given a set of parameters (accounting for attack, defence and home advantage); but because these parameters were given exchangeable priors, correlation is implied in the responses.
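The mechanism can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the model from the paper: it assumes Poisson goal counts with a log-linear rate, Normal team effects centred on shared hyperparameters, and made-up values for the home advantage and all standard deviations. When the shared hyperparameters are uncertain, the two scores in a game become marginally correlated even though they are independent given the parameters.

```python
import math
import random

def rpois(lam, rng):
    """Draw one Poisson(lam) variate (Knuth's multiplication method)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def simulate_match(rng, hyper_sd, team_sd=0.3, home=0.3):
    """Simulate one (home goals, away goals) pair.

    hyper_sd, team_sd and home are illustrative assumptions.
    """
    # Shared hyperparameters: the means of the attack and defence effects.
    mu_att = rng.gauss(0.0, hyper_sd)
    mu_def = rng.gauss(0.0, hyper_sd)
    # Exchangeable team effects: i.i.d. given the hyperparameters.
    att_h, att_a = (rng.gauss(mu_att, team_sd) for _ in range(2))
    def_h, def_a = (rng.gauss(mu_def, team_sd) for _ in range(2))
    # Conditionally independent Poisson counts given the rates.
    theta_home = math.exp(home + att_h + def_a)
    theta_away = math.exp(att_a + def_h)
    return rpois(theta_home, rng), rpois(theta_away, rng)

def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def score_correlation(hyper_sd, n=20000, seed=1):
    rng = random.Random(seed)
    home, away = zip(*(simulate_match(rng, hyper_sd) for _ in range(n)))
    return pearson(home, away)

corr_shared = score_correlation(hyper_sd=0.5)  # uncertain hyperparameters
corr_fixed = score_correlation(hyper_sd=0.0)   # hyperparameters known exactly
print("correlation, uncertain hyperparameters:", round(corr_shared, 3))
print("correlation, fixed hyperparameters:", round(corr_fixed, 3))
```

With the hyperparameters fixed, the team effects are fully independent and so are the scores; once the hyperparameters are themselves uncertain, every team effect shares that uncertainty, which is what induces the dependence.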


The Bayesian machinery was very good at prediction, especially after we considered a slightly more complex structure in which we included information on each team’s propensity to be “good”, “average”, or “poor”. This helped avoid overshrinkage in the estimates and we did quite well.

An interesting point of the models discussed in the posts at R-bloggers is the introduction of a time effect (in this particular case to account for winter breaks in the Dutch league). In our own work, we have only considered the Italian, Spanish and English leagues which, as far as I am aware, do not have such breaks.

But including external information is always good: for example, teams involved in European football (e.g. the Champions League or Europa League) may do worse in the league games immediately before (and/or immediately after) their European fixture. This would be easy enough to include and could perhaps increase the precision of the estimates.
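As a rough sketch of how such a term could enter a log-linear scoring rate: the function below adds a coefficient to the linear predictor whenever the team has a European fixture close to the league game. The function name, the argument names and the value of the coefficient are all hypothetical; in practice the effect would be estimated from the data rather than fixed in advance.

```python
import math

def scoring_rate(att, opp_def, home_adv=0.0,
                 europe_effect=0.0, plays_europe_nearby=False):
    """Expected goals for one team in one game under a log-linear model.

    att                 : attacking effect of the team
    opp_def             : defensive effect of the opponent
    home_adv            : home advantage (0 for the away team)
    europe_effect       : hypothetical coefficient for a nearby European fixture
    plays_europe_nearby : True if the team has such a fixture around this game
    """
    eta = home_adv + att + opp_def
    if plays_europe_nearby:
        eta += europe_effect
    return math.exp(eta)

# Illustrative values only.
base = scoring_rate(att=0.2, opp_def=-0.1, home_adv=0.3)
tired = scoring_rate(att=0.2, opp_def=-0.1, home_adv=0.3,
                     europe_effect=-0.15, plays_europe_nearby=True)
print("rate without European fixture:", round(base, 3))
print("rate with European fixture:", round(tired, 3))
```

On the log scale the extra term acts multiplicatively on the rate, so a coefficient of -0.15 scales the expected goals by exp(-0.15) whatever the team’s baseline strength.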
