Football meets machine learning: Forecasting the 2026 FIFA World Cup
Probabilistic forecasts for the 2026 FIFA World Cup are obtained by using a hybrid model that combines data, expert insights, and advanced statistical models. The favorite is Spain, closely followed by England, France, and Germany.
Winning probabilities
The forecast is based on a machine learning algorithm that blends several sources of information: an ability estimate for every team based on historic matches; an ability estimate for every team based on odds from 24 bookmakers; average ratings of the players in each team based on their individual performances in their home clubs and national teams; the average market value of all players in each team according to a wisdom-of-the-crowd approach; and further team and country covariates (e.g., FIFA and Elo ratings or GDP).
The algorithm is trained on the results of all major football tournaments (Men’s World Cups and Euros) between 2006 and 2024 and then applied to current information to obtain a forecast for the 2026 FIFA World Cup. More specifically, it predicts the expected number of goals for all possible matches between the 48 teams in the tournament. From the predicted goals, the probabilities for each potential outcome (i.e., 0-0, 1-0, 0-1, 2-0, etc.) in each of these matches can be computed from a bivariate Poisson distribution (here: assuming independence). This allows us to simulate all matches in the group phase, which teams proceed to the knockout stage, and who eventually wins. Repeating the simulation 100,000 times yields winning probabilities for each team.
The results show that Spain is the favorite for the title with a winning probability of 14.5%, closely followed by England and France, both with 12.4%, and Germany with 11.2%. The winning probabilities for all teams are shown in the bar chart below with more information linked in the interactive full-width version.
Interactive full-width graphic
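To make the step from predicted goals to scoreline probabilities more concrete, here is a minimal sketch in Python. It assumes two independent Poisson distributions, as described above; the expected goals of 1.6 and 1.1 are hypothetical values chosen for illustration and are not taken from the actual forecast, and the function names are our own.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k goals when the expected number of goals is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def scoreline_probabilities(lam_a: float, lam_b: float, max_goals: int = 10) -> dict:
    """Probability of every scoreline (a, b) with up to max_goals per team,
    assuming the two goal counts are independent Poisson variables."""
    return {
        (a, b): poisson_pmf(a, lam_a) * poisson_pmf(b, lam_b)
        for a in range(max_goals + 1)
        for b in range(max_goals + 1)
    }


# Hypothetical expected goals for the two teams (illustration only).
probs = scoreline_probabilities(lam_a=1.6, lam_b=1.1)

# The three most likely scorelines under these assumptions.
top3 = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:3]
for (a, b), p in top3:
    print(f"{a}-{b}: {p:.3f}")
```

Truncating at 10 goals per team loses only a negligible amount of probability mass for expected goals in this range.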
The study has been conducted by an international team of researchers: Andreas Groll, Agamyrat Hanekov, Lars Magnus Hvattum, Rouven Michels, Gunther Schauberger, Elina Sukhanova, Sebastian Witte, Achim Zeileis.
The basic idea for the forecast is to proceed in two steps. In the first step, sophisticated statistical models as well as expert insights are employed to determine the strengths of all teams and their players using disparate sets of information. In the second step, a machine learning algorithm decides how to best combine the strength estimates with other information about the teams.
- Historic information: Match abilities.
An ability estimate is obtained for every team based on “retrospective” data, namely all historic national matches over the last 8 years (curated by Mart Jürisoo and freely available on Kaggle). A bivariate Poisson model with team-specific fixed effects and assuming independence is fitted to the number of goals scored by both teams in each match. However, rather than equally weighting all matches to obtain average team abilities (or team strengths) over the entire history period, an exponential weighting scheme is employed. This assigns more weight to more recent results and thus yields an estimate of current team abilities. More details can be found in Ley, Van de Wiele, Van Eetvelde (2019).
- Future expectation: Bookmaker consensus abilities.
Another ability estimate for every team is obtained based on “prospective” data, namely the odds of 24 international bookmakers that reflect their expert expectations for the tournament. Using the bookmaker consensus model of Leitner, Zeileis, Hornik (2010), the bookmaker odds are first adjusted for the bookmakers’ profit margins (“overround”) and then averaged (on a logit scale) to obtain a consensus for the winning probability of each team. To adjust for the effects of the tournament draw (that might have led to easier or harder groups for some teams), an “inverse” simulation approach is used to infer which team abilities are most likely to lead to the consensus winning probabilities.
- Individual player contributions: Average player ratings.
To infer the “contributions of individual players” in a match, the plus-minus player ratings of Pantuso & Hvattum (2021) dissect all matches with a certain player (both on club and on national level) into segments, e.g., between substitutions. Subsequently, the goal difference achieved in these segments is linked to the presence of the individual players during that segment. This yields individual ratings for all players that can be aggregated to average player ratings for each team.
- Wisdom of the crowd: Average market values.
Another way to reflect the current quality and the future potential of each player in a team is to consider their expected market value. As the real market values are unknown, the Transfermarkt web portal employs a “wisdom-of-the-crowd” approach to determine current expected market values for all players. These are based on discussions among the online community members of the portal, which rely on publicly available data and are moderated and consolidated by expert community members and the portal’s employees.
- Combination with present status: Hybrid random forests.
Finally, machine learning is used to combine these four highly aggregated and informative variables with a broad range of further relevant covariates reflecting the current states of the different teams and the countries they come from. Such a hybrid approach was first suggested by Groll, Ley, Schauberger, Van Eetvelde (2019). A random forest algorithm is trained to decide how to blend the different ability estimates with team-specific features that are typically less informative but still powerful enough to enhance the forecasts. The features considered comprise team-specific details (e.g., FIFA rank, Elo rating, number of Champions League players) as well as country-specific socio-economic factors (such as GDP per capita). By combining a large ensemble of rather weakly informative regression trees in a random forest, the relative importances of all the covariates can be inferred automatically. The resulting predicted number of goals for each team can then finally be used to simulate the entire tournament 100,000 times.
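Two of the ingredients in the list above lend themselves to a small illustration: the exponential down-weighting of older matches and the bookmaker consensus. The Python sketch below is a simplified stand-in, not the models from the cited papers. It assumes a half-life form for the time weights, removes the overround by simply rescaling the inverse odds to sum to one, and uses made-up decimal odds for three teams from two bookmakers. All function names and numbers are hypothetical.

```python
import math
from typing import List


def time_weight(days_ago: float, half_life_days: float) -> float:
    """Exponential weight: a match half_life_days old counts half as much
    as a match played today (assumed functional form)."""
    return 0.5 ** (days_ago / half_life_days)


def remove_overround(decimal_odds: List[float]) -> List[float]:
    """Turn one bookmaker's decimal odds into probabilities that sum to 1.
    Simplification: the margin is removed by proportional rescaling."""
    raw = [1.0 / odd for odd in decimal_odds]
    total = sum(raw)  # exceeds 1 by the bookmaker's margin
    return [r / total for r in raw]


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def inv_logit(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def consensus(odds_by_bookmaker: List[List[float]]) -> List[float]:
    """Average the margin-adjusted probabilities on the logit scale
    and transform the averages back to probabilities."""
    adjusted = [remove_overround(odds) for odds in odds_by_bookmaker]
    n_books = len(adjusted)
    n_teams = len(adjusted[0])
    return [
        inv_logit(sum(logit(book[i]) for book in adjusted) / n_books)
        for i in range(n_teams)
    ]


# Made-up decimal odds for three teams from two bookmakers.
odds = [[2.0, 3.5, 4.5], [2.1, 3.2, 4.6]]
consensus_probs = consensus(odds)
print([round(p, 3) for p in consensus_probs])
print(round(time_weight(days_ago=730, half_life_days=365), 3))
```

Note that the back-transformed consensus probabilities need not sum to exactly one. The “inverse” simulation step that maps winning probabilities to team abilities is not covered by this sketch.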
Match probabilities
The machine learning algorithm yields the predicted number of goals for both teams in each possible match. The explanatory information used for this is the difference between the two teams in each of the variables listed above, i.e., the difference in historic match abilities (on a log scale), the difference in bookmaker consensus abilities (on a log scale), the difference in average player ratings of the teams, the difference in log market values, etc. The predicted number of goals for the two teams in each match can then be plugged as expectations into two independent Poisson distributions, from which we can compute the probability that a certain match ends in a win, a draw, or a loss. The same can be repeated for extra time, if necessary, and a coin flip is used to decide penalties, if needed.
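The following Python sketch illustrates how win, draw, and loss probabilities, and from them the probability of advancing in a knockout match, could be computed. It is a sketch under stated assumptions: the expected goals are hypothetical, and scaling them by 30/90 for extra time is our own assumption, since the text only says that the same procedure is repeated.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k goals when the expected number of goals is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def outcome_probabilities(lam_a: float, lam_b: float, max_goals: int = 15):
    """Win, draw, and loss probabilities for team A from two independent
    Poisson distributions with expectations lam_a and lam_b."""
    win = draw = loss = 0.0
    for a in range(max_goals + 1):
        p_a = poisson_pmf(a, lam_a)
        for b in range(max_goals + 1):
            p = p_a * poisson_pmf(b, lam_b)
            if a > b:
                win += p
            elif a == b:
                draw += p
            else:
                loss += p
    return win, draw, loss


def knockout_win_probability(lam_a: float, lam_b: float,
                             extra_time_factor: float = 30 / 90) -> float:
    """Probability that team A advances: win in normal time, or draw and
    win in extra time, or draw twice and win the coin flip for penalties.
    Assumption: expected goals in extra time scale with its length."""
    win, draw, _ = outcome_probabilities(lam_a, lam_b)
    et_win, et_draw, _ = outcome_probabilities(
        lam_a * extra_time_factor, lam_b * extra_time_factor
    )
    return win + draw * (et_win + et_draw * 0.5)


# Hypothetical expected goals (illustration only).
w, d, l = outcome_probabilities(1.8, 1.0)
print(f"win {w:.3f}, draw {d:.3f}, loss {l:.3f}")
print(f"advance {knockout_win_probability(1.8, 1.0):.3f}")
```

For two teams with identical expected goals, this construction gives an advancement probability of 50%, which serves as a quick sanity check.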
The following heatmap shows for each possible combination of teams the probability that one team beats the other team in a knockout match. The color scheme uses green vs. purple to signal probabilities above vs. below 50%, respectively. The tooltips for each match in the interactive version of the graphic also print the probabilities for the match to end in a win, draw, or loss after normal time.
Interactive full-width graphic
Performance throughout the tournament
As the goals for both teams in every single match can be simulated with the approach described above, it is also straightforward to simulate the entire tournament (here: 100,000 times), providing “survival” probabilities for each team across the different stages.
Interactive full-width graphic
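To give an idea of how such survival probabilities can be obtained by simulation, here is a heavily simplified Python sketch. It assumes a toy knockout bracket of eight teams without a group phase; the team abilities, the link between abilities and expected goals, and the extra-time scaling are all hypothetical and only stand in for the predictions of the actual model.

```python
import math
import random

# Hypothetical team abilities on a log scale (illustration only).
ABILITY = {"A": 0.40, "B": 0.25, "C": 0.10, "D": 0.00,
           "E": -0.05, "F": -0.15, "G": -0.30, "H": -0.45}


def expected_goals(team: str, opponent: str) -> float:
    """Assumed link between abilities and expected goals."""
    return math.exp(0.2 + ABILITY[team] - ABILITY[opponent])


def sample_poisson(lam: float, rng: random.Random) -> int:
    """Draw a Poisson random number with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def play_knockout_match(a: str, b: str, rng: random.Random) -> str:
    """Normal time, then extra time (assumed 1/3 of the goals), then a coin flip."""
    for factor in (1.0, 1 / 3):
        goals_a = sample_poisson(expected_goals(a, b) * factor, rng)
        goals_b = sample_poisson(expected_goals(b, a) * factor, rng)
        if goals_a != goals_b:
            return a if goals_a > goals_b else b
    return a if rng.random() < 0.5 else b


def simulate_survival(bracket, n_sims: int, seed: int = 2026) -> dict:
    """Share of simulations in which each team survives each knockout round."""
    rng = random.Random(seed)
    n_rounds = int(math.log2(len(bracket)))
    counts = {team: [0] * n_rounds for team in bracket}
    for _ in range(n_sims):
        alive = list(bracket)
        for stage in range(n_rounds):
            alive = [play_knockout_match(alive[i], alive[i + 1], rng)
                     for i in range(0, len(alive), 2)]
            for team in alive:
                counts[team][stage] += 1
    return {team: [c / n_sims for c in row] for team, row in counts.items()}


survival = simulate_survival(list(ABILITY), n_sims=10_000)
for team, probs in survival.items():
    print(team, [round(p, 3) for p in probs])
```

Each row lists the estimated probability of surviving the first round, the second round, and the final. The full tournament additionally requires simulating the group phase and the allocation of third-ranked teams.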
Odds and ends
All our forecasts are probabilistic, clearly below 100%, and hence by no means certain. Although we can quantify this uncertainty in terms of probabilities from a multiverse of potential tournaments, it is far from being predetermined which of these potential tournaments we will eventually see during the actual tournament.
Nevertheless, the probabilistic view provides us with some interesting insights: For example, compared to predictions for previous tournaments (see e.g., 2018, 2022), it is even more uncertain who will win the title as there are a number of teams with good (albeit none with very high) chances of winning the tournament. An important factor for this is the substantially increased size of the tournament with 48 teams (rather than the previous 32) and an additional knockout round. Also, the tournament draw is much more variable, because 8 of the 12 third-ranked teams proceed to the knockout stage, with 495 (!) possible combinations of groups whose third-ranked teams advance, which determine the pairings in the round of 32.
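The number 495 is simply the number of ways to choose 8 out of 12 groups, which can be checked in a few lines of Python. The group labels A to L are used here only to enumerate the sets.

```python
from itertools import combinations
from math import comb

groups = "ABCDEFGHIJKL"  # 12 groups

# Every possible set of 8 groups whose third-ranked teams advance.
advancing_sets = list(combinations(groups, 8))

print(len(advancing_sets), comb(12, 8))  # prints: 495 495
```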
Moreover, comparing our forecasts to those based only on the bookmakers’ odds, it is striking that Germany is ranked 4th, closely behind the three top teams, while it is only ranked 7th by many bookmakers. Conversely, Brazil and Argentina are typically ranked higher by the bookmakers but perform worse in our machine-learning-calibrated simulation.
In any case, all of this means that the probabilistic forecasts leave a lot of room for surprises and excitement during the 2026 FIFA World Cup. But what is absolutely certain is that we look forward to an entertaining tournament as football fans (much more than as professional forecasters).
