Modeling Match Results in La Liga Using a Hierarchical Bayesian Poisson Model: Part one.

Posted on July 21, 2013 by Rasmus Bååth in R bloggers | 0 Comments

[This article was first published on Publishable Stuff, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This is a slightly modified version of my submission to the UseR 2013 Data Analysis Contest which I had the fortune of winning 🙂 The purpose of the contest was to do something interesting with a dataset consisting of the match results from the last five seasons of La Liga, the premium Spanish football (aka soccer) league. In total there were 1900 rows in the dataset each with information regarding which was the home and away team, what these teams scored and what season it was. I decided to develop a Bayesian model of the distribution of the end scores. Here we go…

Ok, first I should come clean and admit that I know nothing about football. Sure, I’ve watched Sweden loose to Germany in the World Cup a couple of times, but that’s it. Never the less, here I will show an attempt to to model the goal outcomes in the La Liga data set provided as part of the UseR 2013 data analysis contest. My goal is not only to model the outcomes of matches in the data set but also to be able to (a) calculate the odds for possible goal outcomes of future matches and (b) to produce a credible ranking of the teams. The model I will be developing is a Bayesian hierarchical model where the goal outcomes will be assumed to be distributed according to a Poisson distribution. I will focus more on showing all the cool things you can easily calculate in R when you have a fully specified Bayesian Model and focus less on model comparison and trying to find the model with highest predictive accuracy (even though I believe my model is pretty good). I really would like to see somebody try to do a similar analysis in SPSS (which most people uses in my field, psychology). It would be a pain!

This post assumes some familiarity with Bayesian modeling and Markov chain Monte Carlo. If you’re not into Bayesian statistics you’re missing out on something really great and a good way to get started is by reading the excellent Doing Bayesian Data Analysis by John Kruschke. The tools I will be using is R (of course) with the model implemented in JAGS called from R using the rjags package. For plotting the result of the MCMC samples generated by JAGS I’ll use the coda package, the mcmcplots package, and the plotPost function courtesy of John Kruschke. For data manipulation I used the plyr and stringr packages and for general plotting I used ggplot2. This report was written in Rstudio using knitr and xtable.

I start by loading libraries, reading in the data and preprocessing it for JAGS. The last 50 matches have unknown outcomes and I create a new data frame d holding only matches with known outcomes. I will come back to the unknown outcomes later when it is time to use my model for prediction.

library(rjags)
library(coda)
library(mcmcplots)
library(stringr)
library(plyr)
library(xtable)
source("plotPost.R")
set.seed(12345)  # for reproducibility

load("laliga.RData")

# -1 = Away win, 0 = Draw, 1 = Home win
laliga$MatchResult <- sign(laliga$HomeGoals - laliga$AwayGoals)

# Creating a data frame d with only the complete match results
d <- na.omit(laliga)
teams <- unique(c(d$HomeTeam, d$AwayTeam))
seasons <- unique(d$Season)

# A list for JAGS with the data from d where the strings are coded as
# integers
data_list <- list(HomeGoals = d$HomeGoals, AwayGoals = d$AwayGoals, HomeTeam = as.numeric(factor(d$HomeTeam,
    levels = teams)), AwayTeam = as.numeric(factor(d$AwayTeam, levels = teams)),
    Season = as.numeric(factor(d$Season, levels = seasons)), n_teams = length(teams),
    n_games = nrow(d), n_seasons = length(seasons))

# Convenience function to generate the type of column names Jags outputs.
col_name <- function(name, ...) {
    paste0(name, "[", paste(..., sep = ","), "]")
}

Modeling Match Results: Iteration 1

How are the number of goals for each team in a football match distributed? Well, lets start by assuming that all football matches are roughly equally long, that both teams have many chances at making a goal and that each team have the same probability of making a goal each goal chance. Given these assumptions the distribution of the number of goals for each team should be well captured by a Poisson distribution. A quick and dirty comparison between the actual distribution of the number of scored goals and a Poisson distribution having the same mean number of scored goals support this notion.

par(mfcol = c(2, 1), mar = rep(2.2, 4))
hist(c(d$AwayGoals, d$HomeGoals), xlim = c(-0.5, 8), breaks = -1:9 + 0.5, main = "Distribution of the number of goals\nscored by a team in a match.")
mean_goals <- mean(c(d$AwayGoals, d$HomeGoals))
hist(rpois(9999, mean_goals), xlim = c(-0.5, 8), breaks = -1:9 + 0.5, main = "Random draw from a Poisson distribution with\nthe same mean as the distribution above.")

m1_string <- "model { for(i in 1:n_games) { HomeGoals[i] ~ dpois(lambda_home[HomeTeam[i],AwayTeam[i]]) AwayGoals[i] ~ dpois(lambda_away[HomeTeam[i],AwayTeam[i]]) } for(home_i in 1:n_teams) { for(away_i in 1:n_teams) { lambda_home[home_i, away_i] <- exp(baseline + skill[home_i] - skill[away_i]) lambda_away[home_i, away_i] <- exp(baseline + skill[away_i] - skill[home_i]) } } skill[1] <- 0 for(j in 2:n_teams) { skill[j] ~ dnorm(group_skill, group_tau) } group_skill ~ dnorm(0, 0.0625) group_tau <- 1 / pow(group_sigma, 2) group_sigma ~ dunif(0, 3) baseline ~ dnorm(0, 0.0625) }"

# Compiling model 1 m1 <- jags.model(textConnection(m1_string), data = data_list, n.chains = 3, n.adapt = 5000) # Burning some samples on the altar of the MCMC god update(m1, 5000) # Generating MCMC samples s1 <- coda.samples(m1, variable.names = c("baseline", "skill", "group_skill", "group_sigma"), n.iter = 10000, thin = 2) # Merging the three MCMC chains into one matrix ms1 <- as.matrix(s1)

# Plots histograms over home_goals, away_goals, the difference in goals # and a barplot over match results. plot_goals <- function(home_goals, away_goals) { n_matches <- length(home_goals) goal_diff <- home_goals - away_goals match_result <- ifelse(goal_diff < 0, "away_win", ifelse(goal_diff > 0, "home_win", "equal")) hist(home_goals, xlim = c(-0.5, 10), breaks = (0:100) - 0.5) hist(away_goals, xlim = c(-0.5, 10), breaks = (0:100) - 0.5) hist(goal_diff, xlim = c(-6, 6), breaks = (-100:100) - 0.5) barplot(table(match_result)/n_matches, ylim = c(0, 1)) } plot_pred_comp1 <- function(home_team, away_team, ms) { # Simulates and plots game goals scores using the MCMC samples from the m1 # model. par(mfrow = c(2, 4)) baseline <- ms[, "baseline"] home_skill <- ms[, col_name("skill", which(teams == home_team))] away_skill <- ms[, col_name("skill", which(teams == away_team))] home_goals <- rpois(nrow(ms), exp(baseline + home_skill - away_skill)) away_goals <- rpois(nrow(ms), exp(baseline + away_skill - home_skill)) plot_goals(home_goals, away_goals) # Plots the actual distribution of goals between the two teams home_goals <- d$HomeGoals[d$HomeTeam == home_team & d$AwayTeam == away_team] away_goals <- d$AwayGoals[d$HomeTeam == home_team & d$AwayTeam == away_team] plot_goals(home_goals, away_goals) }