Machine learning of a 2022 FIFA World Cup multiverse

This article was first published on Achim Zeileis, and kindly contributed to R-bloggers.

Probabilistic forecasts for the 2022 FIFA World Cup are obtained by using a hybrid model that combines data from three advanced statistical models through random forests. The favorite is Brazil, followed by Argentina, the Netherlands, Germany, and France.

The 2022 FIFA World Cup will take place in Qatar from 20 November to 18 December 2022. Thirty-two of the best teams from all around the world compete to determine the new World Champion. Although the event is overshadowed by many issues, both ethical and sporting, we decided for scientific purposes to employ the machine learning approach that we successfully used for probabilistic forecasts of previous tournaments. More specifically, our approach yields probabilistic forecasts for all possible matches, which can then be used to explore the likely course of the tournament along with its most likely champion by simulation.

Winning probabilities

The forecast is based on a conditional inference random forest learner that blends information capturing the past, present, and future of the competing football teams. Insights from the past are captured in an ability estimate for every team based on historic matches. Expectations about the future in the upcoming tournament are captured in an ability estimate for every team based on odds from international bookmakers. The present status of the teams (and their countries) is represented by covariates such as market value or the types of players in the team as well as country-specific socio-economic factors like population or GDP.

The random forest model is learned using the previous five FIFA World Cup tournaments from 2002 to 2018 as training data and then applied to current information to obtain a forecast for the 2022 FIFA World Cup. More precisely, the random forest is calibrated to predict the likely distribution of goals for each team in all possible matches in the tournament. This makes it possible to simulate the outcome of each match in normal time as well as potential extra time and penalties in order to obtain probabilities for a win, draw, or loss. Moreover, because every individual match can be simulated like that, a “multiverse” of potential courses of the entire tournament can be created, yielding overall winning probabilities for each team.

The results show that Brazil, 20 years after last winning the title, is the clear favorite for the World Cup with a winning probability of 15.0%, followed by Argentina with 11.2%, the Netherlands with 9.7%, Germany with 9.2%, and France with 9.1%. The winning probabilities for all teams are shown in the barchart below with more information linked in the interactive full-width version.

Interactive full-width graphic

Barchart: Winning probabilities

The full study has been conducted by an international team of researchers: Andreas Groll, Neele Hormann, Christophe Ley, Gunther Schauberger, Hans Van Eetvelde, Achim Zeileis. The core of the contribution is a hybrid approach that starts out from three state-of-the-art forecasting methods, based on disparate sets of information, and lets an adaptive machine learning model decide how to blend the different sources of information.

  • Historic information: Match abilities.
    An ability estimate is obtained for every team based on “retrospective” data, namely all historic national matches over the last 8 years. A bivariate Poisson model with team-specific fixed effects is fitted to the number of goals scored by both teams in each match. However, rather than equally weighting all matches to obtain average team abilities (or team strengths) over the entire history period, an exponential weighting scheme is employed. This assigns more weight to more recent results and thus yields an estimate of current team abilities. More details can be found in Ley, Van de Wiele, Van Eetvelde (2019).

  • Future expectation: Bookmaker consensus abilities.
    Another ability estimate for every team is obtained based on “prospective” data, namely the odds of 28 international bookmakers that reflect their expert expectations for the tournament. Using an enhanced version of the bookmaker consensus model from Leitner, Zeileis, Hornik (2010), the bookmaker odds are first adjusted for the bookmakers’ profit margins (“overround”) and then averaged (on a logit scale) to obtain a consensus for the winning probability of each team. To correct for the effects of the tournament draw (that might have led to easier or harder groups for some teams), an “inverse” simulation approach is used to infer which team abilities are most likely to lead to these winning probabilities.

  • Combination with present status: Hybrid random forests.
    Finally, machine learning is used to combine these highly aggregated ability estimates with a broad range of further relevant covariates reflecting the current states of the different teams and the countries they come from. Such a hybrid approach was first suggested by Groll, Ley, Schauberger, Van Eetvelde (2019). A random forest learner is trained to decide how to blend the different ability estimates with team-specific features that are typically less informative but still powerful enough to enhance the forecasts. The features considered comprise team-specific details (e.g., market value, FIFA rank, team structure) as well as country-specific socio-economic factors (population and GDP per capita). By combining a large ensemble of rather weakly informative regression trees in a random forest, the relative importances of all the covariates can be inferred automatically. The resulting predicted number of goals for each team can then finally be used to simulate the entire tournament 100,000 times.
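
Two of the building blocks described in the list above are simple enough to sketch in a few lines: the exponential down-weighting of older matches and the bookmaker consensus on the logit scale. The following Python sketch is only an illustration under stated assumptions, not the authors' implementation. The half-life form of the weights, the half-life value, the proportional removal of the overround, and all odds are hypothetical choices made for the example.

```python
import math


def exponential_weight(days_ago, half_life_days=1095.0):
    """Weight of a match played `days_ago` days before the forecast date.

    Assumption: half-life form, so a match that is `half_life_days` old
    counts half as much as one played today. The value is illustrative.
    """
    return 0.5 ** (days_ago / half_life_days)


def remove_overround(decimal_odds):
    """Convert one bookmaker's decimal odds to probabilities summing to 1.

    Assumption: the profit margin is removed by proportional rescaling.
    """
    raw = [1.0 / o for o in decimal_odds]
    total = sum(raw)
    return [p / total for p in raw]


def logit(p):
    return math.log(p / (1.0 - p))


def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))


def consensus(odds_by_bookmaker):
    """Average the adjusted probabilities across bookmakers on the logit scale."""
    adjusted = [remove_overround(odds) for odds in odds_by_bookmaker]
    n_teams = len(adjusted[0])
    result = []
    for t in range(n_teams):
        mean_logit = sum(logit(b[t]) for b in adjusted) / len(adjusted)
        result.append(inv_logit(mean_logit))
    return result


# Made-up odds for three teams from two hypothetical bookmakers.
odds = [[1.9, 3.8, 3.8], [1.85, 3.6, 4.0]]
probs = consensus(odds)
print([round(p, 3) for p in probs])
```

Because the averaging happens on the logit scale, the consensus probabilities need not sum exactly to one. The "inverse" simulation step that maps winning probabilities back to team abilities is not covered here.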

Match probabilities

Using the hybrid random forest, an expected number of goals is obtained for both teams in each possible match. The covariate information used for this is the difference between the two teams in each of the variables listed above, i.e., the difference in historic match abilities (on a log scale), the difference in bookmaker consensus abilities (on a log scale), the difference in market values (on a log scale), etc. Assuming a bivariate Poisson distribution with the expected numbers of goals for both teams, we can compute the probability that a certain match ends in a win, a draw, or a loss. The same can be repeated for extra time, if necessary, and a coin flip is used to decide penalties, if needed.
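
To make the step from expected goals to match probabilities concrete, here is a minimal Python sketch. It is a simplification and not the model used in the study: the two goal counts are treated as independent Poisson variables rather than bivariate Poisson, extra time is assumed to use the expected goals scaled by one third (30 of 90 minutes), and the expected goals passed in are made-up values. The function names are hypothetical.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k goals when lam goals are expected."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def outcome_probabilities(lam_a, lam_b, max_goals=15):
    """Win/draw/loss probabilities for team A over one period of play.

    Assumption: goals of both teams are independent Poisson variables.
    The score grid is truncated at max_goals and renormalized.
    """
    win = draw = loss = 0.0
    for i in range(max_goals + 1):
        p_i = poisson_pmf(i, lam_a)
        for j in range(max_goals + 1):
            p = p_i * poisson_pmf(j, lam_b)
            if i > j:
                win += p
            elif i == j:
                draw += p
            else:
                loss += p
    total = win + draw + loss
    return win / total, draw / total, loss / total


def knockout_win_probability(lam_a, lam_b, extra_time_factor=1.0 / 3.0):
    """Probability that team A advances from a knockout match.

    Normal time first, then extra time with scaled expected goals
    (an assumption), then a coin flip for the penalty shootout.
    """
    win, draw, _ = outcome_probabilities(lam_a, lam_b)
    et_win, et_draw, _ = outcome_probabilities(
        lam_a * extra_time_factor, lam_b * extra_time_factor
    )
    return win + draw * (et_win + 0.5 * et_draw)


# Made-up expected goals for a hypothetical match.
print(outcome_probabilities(1.6, 1.1))
print(knockout_win_probability(1.6, 1.1))
```

With equal expected goals for both teams, the knockout probability comes out at 50%, which is a quick sanity check for the extra time and coin flip logic.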

The following heatmap shows for each possible combination of teams the probability that one team beats the other team in a knockout match. The color scheme uses green vs. purple to signal probabilities above vs. below 50%, respectively. The tooltips for each match in the interactive version of the graphic also print the probabilities for the match to end in a win, draw, or loss after normal time.

Interactive full-width graphic

Heatmap: Match probabilities

Performance throughout the tournament

Based on the simulation of individual pairwise matches, as described above, we can create a “multiverse” of potential courses of the entire tournament (here: 100,000). The chances of the teams’ “survival” throughout the tournament can then be described by the proportions of multiverses in which they reach the different stages from the round of 16 to winning the overall title.
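
The idea of counting in how many simulated tournaments a team reaches each stage can be illustrated with a toy example. The Python sketch below covers only a single-elimination bracket with four placeholder teams and omits the group stage. The team labels, their strengths, and the ratio-based win probability are hypothetical stand-ins for the pairwise probabilities produced by the actual model.

```python
import random
from collections import Counter


def simulate_bracket(teams, win_prob, rng):
    """Play one single-elimination bracket.

    Returns a list with the survivors of each round. `teams` is in
    bracket order and its length is assumed to be a power of two.
    """
    rounds = []
    current = list(teams)
    while len(current) > 1:
        survivors = []
        for a, b in zip(current[0::2], current[1::2]):
            survivors.append(a if rng.random() < win_prob(a, b) else b)
        rounds.append(survivors)
        current = survivors
    return rounds


def survival_probabilities(teams, win_prob, n_sims=10000, seed=1):
    """Proportion of simulated brackets in which each team survives each round."""
    rng = random.Random(seed)
    n_rounds = len(teams).bit_length() - 1
    counts = [Counter() for _ in range(n_rounds)]
    for _ in range(n_sims):
        for r, survivors in enumerate(simulate_bracket(teams, win_prob, rng)):
            counts[r].update(survivors)
    return [{t: c[t] / n_sims for t in teams} for c in counts]


# Hypothetical strengths for four placeholder teams.
strength = {"A": 4.0, "B": 2.0, "C": 1.0, "D": 1.0}


def win_prob(a, b):
    # Assumption: a simple strength ratio instead of the model's probabilities.
    return strength[a] / (strength[a] + strength[b])


result = survival_probabilities(list(strength), win_prob, n_sims=2000)
print("reach final:", result[0])
print("win title:", result[1])
```

In each round the proportions add up to the number of teams still alive, so the last entry sums to one and plays the role of the overall winning probabilities.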

Interactive full-width graphic

Line plot: Survival probabilities

Odds and ends

All our forecasts are probabilistic, clearly below 100%, and by no means certain. Thus, although we can quantify this uncertainty in terms of probabilities from a multiverse of tournaments, it is far from being predetermined which of these possible tournaments we will see in our universe.

Unfortunately, the experience of observing the actual tournament will be far less exciting and joyful than usual for us as researchers/forecasters and also as football fans due to the special circumstances. In addition to the widely discussed ethical problems regarding this FIFA World Cup, there are also sporting issues that are absolutely critical: The climate in Qatar is extraordinarily hot, which necessitated shifting the event to the winter months. Therefore, all major football leagues in Europe and South America have to interrupt their usual schedule in order to accommodate the tournament. This gives the national teams less time for preparation and the players less time for recovery before and after the World Cup. In combination with the extreme climate conditions, this also increases the risk of injuries. Hence, having a team with many players in the international European leagues (Champions League, Europa League, Europa Conference League) might actually be a handicap rather than a strength this year.

All of these factors make the forecast of the tournament outcome more difficult as variables that have been highly predictive in previous World Cups might not work or work differently.

Finally, more from the perspective of football fans (rather than professional forecasters) we are sad that all the usual joy and anticipation of a football World Cup has been crushed by the terrible circumstances this year: starting from the alleged bribery and corruption in the FIFA assignment process, to the human rights and working conditions in Qatar, and the lack of sustainability in the construction and operation of the stadiums.
