Another Prediction for the FIFA World Cup 2018
Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.
Ava Yang, Data Scientist
Given that the UEFA Champion League final a few weeks ago between Real Madrid and Liverpool is the only match I’ve watched properly in over ten years, how dare I presume I can guess that Brazil is going to lift the trophy in the 2018 FIFA World Cup? Well, here goes…
By the way, if you find the below dry to read, it is because of my limited natural language on the subject matter…data science tricks to the rescue!
This blogpost is largely based on the prediction framework from an eRum 2018 talk by Claus Thorn Ekstrøm. For first hand materials please take a look at the slides, video and code.
The idea is that in each simulation run of a tournament, we find team winner, runnersup, third and fourth etc. N times of simulation runs e.g. 10k returns a list of winners with highest probability to be ranked top.
Apart from the winner question, this post seeks to answer which team will be top scorer and how many goals will they score. After following Claus’s analysis rmarkdown file, I collected new data, put functions in a package and tried another modelling approach. Whilst the model is too simplistic to be correct, it captures the trend and is a fair starting point to add complex layers on top.
Initialization
To begin with, we load packages including accompnying R package worldcup
where my utility functions reside. Package is a convenient way to share code, seal utility functions and speed up iteration. Global parameters normalgoals (the average number of goals scored in a world cup match) and nsim (number of simulations) are declared in the YAML section at the top of the RMarkdown document.
Next we load three datasets that have been tidied up from open source resource or updated from original version. Plenty of time was spent on gathering data, aligning team names and cleaning up features.
team_data
contains features associated with teamgroup_match_data
is match schedule, publicwcmatches_train
is a match dataset available on this Kaggle competition and can be used as training set to estimate parameter lamda i.e. the average goals scored in a match for a single team. Records from 1994 up to 2014 are kept in the training set.
library(tidyverse) library(magrittr) devtools::load_all("worldcup") normalgoals < params$normalgoals nsim < params$nsim data(team_data) data(group_match_data) data(wcmatches_train)
Play game
Claus proposed three working models to calculate single match outcome. The first is based on two independent poisson distributions, where two teams are treated equal and so the result is random regardless of their actual skills and talent. The second assumes the scoring event in a match are two possion events, the difference of two poisson events believed to have skellam distribution. The result turns out to be much more reliable as the parameters are estimated from actual bettings. The third one is based on World Football ELO Ratings rules. From current ELO ratings, we calculate expected result of one side in a match. It can be seen as the probability of success in a binomial distribution. It seems that this approach overlooked draw due to the nature of binomial distribution i.e. binary.
The fourth model presented here is my first attempt. To spell out: we assumed two independent poisson events, with lambdas predicted from a trained poisson model. Then predicted goal is simulated by rpois
.
Model candidate each has its own function, and it is specified by the play_fun parameter and provided to higher level wrapper function play_game
.
# Specify team Spain and Portugal play_game(play_fun = "play_fun_simplest", team1 = 7, team2 = 8, musthavewinner=FALSE, normalgoals = normalgoals) ## Agoals Bgoals ## [1,] 0 1 play_game(team_data = team_data, play_fun = "play_fun_skellam", team1 = 7, team2 = 8, musthavewinner=FALSE, normalgoals = normalgoals) ## Agoals Bgoals ## [1,] 1 4 play_game(team_data = team_data, play_fun = "play_fun_elo", team1 = 7, team2 = 8) ## Agoals Bgoals ## [1,] 0 1 play_game(team_data = team_data, train_data = wcmatches_train, play_fun = "play_fun_double_poisson", team1 = 7, team2 = 8) ## Agoals Bgoals ## [1,] 2 2
Estimate poisson mean from training
Let's have a quick look at the core of my training function. Target variable in the glm
function is the number of goals a team obtained in a match. Predictors are FIFA and ELO ratings at a point before the 2014 tournament started. Both are popular ranking systems  the difference being that the FIFA rating is official and the latter is in the wild, adapted from chess ranking methodology.
mod < glm(goals ~ elo + fifa_start, family = poisson(link = log), data = wcmatches_train) broom::tidy(mod) ## term estimate std.error statistic p.value ## 1 (Intercept) 3.5673415298 0.7934373236 4.4960596 6.922433e06 ## 2 elo 0.0021479463 0.0005609247 3.8292949 1.285109e04 ## 3 fifa_start 0.0002296051 0.0003288228 0.6982638 4.850123e01
From the model summary, the ELO rating is statistically significant whereas the FIFA rating is not. More interesting is that the estimate for the FIFA ratings variable is negative, inferring the effect is 0.9997704 relative to average. Overall, FIFA rating appears to be less predictive to the goals one may score than ELO rating. One possible reason is that ratings in 2014 alone are collected, and it may be worth future effort to go into history. Challenge to FIFA ratings' predictive power is not new after all.
Training set wcmatches_train has a home column, representing whether team X in match Y is the home team. However, it's hard to say in a third country whether a team/away position makes much difference comparing to league competetions. Also, I didn't find an explicit home/away split for the Russian World Cup. We could derive a similar feature  host advantage, indicating host nation or continent in future model interation. Home advantage stands no better chance for the time being.
Group and kickout stages
Presented below are examples showing how to find winners at various stages  from group to round 16, quarterfinals, semifinals and final.
find_group_winners(team_data = team_data, group_match_data = group_match_data, play_fun = "play_fun_double_poisson", train_data = wcmatches_train)$goals %>% filter(groupRank %in% c(1,2)) %>% collect() ## Warning: package 'bindrcpp' was built under R version 3.4.4 ## # A tibble: 16 x 11 ## number name group rating elo fifa_start points goalsFore ## <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <int> ## 1 2 Russia A 41.0 1685 493 7.00 5 ## 2 3 Saudi Arabia A 1001 1582 462 5.00 4 ## 3 7 Portugal B 26.0 1975 1306 7.00 6 ## 4 6 Morocco B 501 1711 681 4.00 2 ## 5 12 Peru C 201 1906 1106 5.00 3 ## 6 11 France C 7.50 1984 1166 5.00 6 ## 7 13 Argentina D 10.0 1985 1254 9.00 8 ## 8 15 Iceland D 201 1787 930 6.00 4 ## 9 17 Brazil E 5.00 2131 1384 7.00 8 ## 10 20 Serbia E 201 1770 732 6.00 4 ## 11 21 Germany F 5.50 2092 1544 6.00 8 ## 12 24 Sweden F 151 1796 889 6.00 5 ## 13 27 Panama G 1001 1669 574 5.00 3 ## 14 25 Belgium G 12.0 1931 1346 5.00 4 ## 15 31 Poland H 51.0 1831 1128 4.00 2 ## 16 29 Colombia H 41.0 1935 989 4.00 1 ## # ... with 3 more variables: goalsAgainst <int>, goalsDifference <int>, ## # groupRank <int> find_knockout_winners(team_data = team_data, match_data = structure(c(3L, 8L, 10L, 13L), .Dim = c(2L, 2L)), play_fun = "play_fun_double_poisson", train_data = wcmatches_train)$goals ## team1 team2 goals1 goals2 ## 1 3 10 2 2 ## 2 8 13 1 2
Run the tournament
Here comes to the most exciting part. We made a functionsimulate_one()
to play the tournament one time and then replicate()
(literally) it many many times. To run an ideal number of simulations, for example 10k, you might want to turn on parallel. I am staying at 1000 for simplicity.
Finally, simulate_tournament()
is an ultimate wrapper for all the above bullet points. The returned resultX
object is a 32 by R params$nsim
matrix, each row representing predicted rankings per simulation. set.seed()
is here to ensure the result of this blogpost is reproducible.
# Run nsim number of times world cup tournament set.seed(000) result < simulate_tournament(nsim = nsim, play_fun = "play_fun_simplest") result2 < simulate_tournament(nsim = nsim, play_fun = "play_fun_skellam") result3 < simulate_tournament(nsim = nsim, play_fun = "play_fun_elo") result4 < simulate_tournament(nsim = nsim, play_fun = "play_fun_double_poisson", train_data = wcmatches_train)
Get winner list
get_winner()
reports a winner list showing who has highest probability. Apart from the random poisson model, Brazil is clearly the winner in three other models. The top two teams are between Brazil and Germany. With different seeds, the third and fourth places (in darker blue) in my model are more likely to change. Variance might be an interesting point to look at.
get_winner(result) %>% plot_winner()
get_winner(result2) %>% plot_winner()
get_winner(result3) %>% plot_winner()
get_winner(result4) %>% plot_winner()
Who will be top scoring team?
The skellum model seems more reliable, my double poisson model gives systematically lower scoring frequency than probable actuals. They both favour Brazil though.
get_top_scorer(nsim = nsim, result_data = result2) %>% plot_top_scorer()
get_top_scorer(nsim = nsim, result_data = result4) %>% plot_top_scorer()
Conclusion
The framework is pretty clear, all you need is to customise the play_game
function, such as game_fun_simplest
, game_fun_skellam
and game_fun_elo
.
Ticktock... Don't hesitate to send a pull request to ekstroem/socceR2018 on GitHub. Who is winning the guesswhowinsworldcup2018 game?
If you like this post, please leave your star, fork, issue or banana on the GitHub repository of the post, including all code (https://github.com/MangoTheCat/blog_worldcup2018). The analysis couldn't have been done without help from Rich, Doug, Adnan and all others who have kindly shared ideas. I have passed on your knowledge to the algorithm.
Notes

Data collection. I didn't get to feed the models with the most updated betting odds and ELO ratings in the team_data dataset. If you would like to, they are available via the below three sources. FIFA rating is the easiest and can be scraped by rvest in the usual way. The ELO ratings and betting odds tables seem to have been rendered by javascript and I haven't found a working solution. For betting information, Betfair, an online betting exchange has an API and R package
abettor
which helps to pull those odds which are definetly interesting for anyone who are after strategy beyond predction:  Model enhancement. This is probably where it matters most. For example, previous research has suggested various bivariate poissons for football predictions.
 Feature engineering. Economic factors such as national GDP, market information like total player value or insurance value, and player injure data may be useful to improve accuracy.
 Model evaluation. One way to understand if our model has good prediction capibility or not is to evaluate the predictions against actual outcomes after 15 July 2018. Current odds from bookies can also be referred to. It is not imporssible to run the whole thing on historical data e.g. Year 2014. and perform model selection and tuning.
 Functions and package could be better parameterized; code to be tidied up.
Rbloggers.com offers daily email updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/datascience job.
Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.