
Football meets machine learning: Forecasting the 2026 FIFA World Cup

[This article was first published on Achim Zeileis, and kindly contributed to R-bloggers].

Probabilistic forecasts for the 2026 FIFA World Cup are obtained by using a hybrid model that combines data, expert insights, and advanced statistical models. The favorite is Spain, closely followed by England, France, and Germany.

Football fans around the world are looking forward to the kick-off of the 2026 FIFA World Cup in Canada, Mexico, and the United States next week. 48 of the best teams from all around the world will compete from 11 June to 19 July to determine the new World Champion. In anticipation of the tournament, the big question is which teams will succeed, which will drop out, and who will eventually prevail. While it is, of course, not yet possible to give definitive answers to these questions, we are able to provide probabilistic forecasts for all possible matches using a refined machine learning algorithm. This allows us to explore the likely course of the tournament by simulation.

Winning probabilities

The forecast is based on a machine learning algorithm that blends several sources of information: an ability estimate for every team based on historic matches; an ability estimate for every team based on odds from 24 bookmakers; average ratings of the players in each team based on their individual performances in their home clubs and national teams; the average market value of all players in each team according to a wisdom-of-the-crowd approach; and further team and country covariates (e.g., FIFA and Elo ratings or GDP).

The algorithm is trained on the results of all major football tournaments (Men’s World Cups and Euros) between 2006 and 2024 and then applied to current information to obtain a forecast for the 2026 FIFA World Cup. More specifically, it estimates the predicted number of goals for all possible matches between all 48 teams in the tournament. Based on the predicted goals, the probabilities for each potential outcome (i.e., 0-0, 1-0, 0-1, 2-0, etc.) in each of these matches can be computed from a bivariate Poisson distribution (here: assuming independence). This allows us to simulate all matches in the group phase, which teams proceed to the knockout stage, and who eventually wins. Repeating the simulation 100,000 times yields winning probabilities for each team.

The results show that Spain is the favorite for the title with a winning probability of 14.5%, closely followed by England and France, both with 12.4%, and Germany with 11.2%. The winning probabilities for all teams are shown in the barchart below with more information linked in the interactive full-width version.

Interactive full-width graphic
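
To make the step from predicted goals to scoreline probabilities concrete, here is a minimal Python sketch, assuming two independent Poisson distributions as described above. The expected goals (1.6 and 0.9), the function names, and the cut-off at 10 goals are illustrative assumptions and not taken from the actual model.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k goals when lam goals are expected."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def score_matrix(lam_a: float, lam_b: float, max_goals: int = 10) -> list[list[float]]:
    """matrix[i][j] = P(team A scores i and team B scores j), assuming independence."""
    pmf_a = [poisson_pmf(i, lam_a) for i in range(max_goals + 1)]
    pmf_b = [poisson_pmf(j, lam_b) for j in range(max_goals + 1)]
    return [[pa * pb for pb in pmf_b] for pa in pmf_a]


# Hypothetical expected goals for one match (placeholders, not model output).
probs = score_matrix(1.6, 0.9)
for i, j in [(0, 0), (1, 0), (0, 1), (2, 0)]:
    print(f"P({i}-{j}) = {probs[i][j]:.3f}")
```

Truncating at 10 goals per team leaves out only a negligible share of the probability mass for expected goals in the usual football range.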

The study has been conducted by an international team of researchers: Andreas Groll, Agamyrat Hanekov, Lars Magnus Hvattum, Rouven Michels, Gunther Schauberger, Elina Sukhanova, Sebastian Witte, Achim Zeileis. The basic idea for the forecast is to proceed in two steps. In the first step, sophisticated statistical models as well as expert insights are employed to determine the strengths of all teams and their players using disparate sets of information. In the second step, a machine learning algorithm decides how to best combine the strength estimates with other information about the teams.

Match probabilities

The machine learning algorithm yields the predicted number of goals for both teams in each possible match. The explanatory information used for this is the difference between the two teams in each of the variables listed above, i.e., the difference in historic match abilities (on a log scale), the difference in bookmaker consensus abilities (on a log scale), the difference in average player ratings of the teams, the difference in log market values, etc. The predicted number of goals for the two teams in each match can then be plugged as expectations into two independent Poisson distributions, from which we can compute the probability that a certain match ends in a win, a draw, or a loss. The same can be repeated in overtime, if necessary, and a coin flip is used to decide penalties, if needed.
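
The following Python sketch illustrates how win, draw, and loss probabilities, and from them the probability of advancing in a knockout match, could be derived from two expected goal values. It is a sketch under stated assumptions rather than the authors' implementation: the function names are hypothetical, and scaling the expected goals by one third for overtime (30 of 90 minutes) is an assumption that the text above does not specify.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k goals when lam goals are expected."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def outcome_probs(lam_a, lam_b, max_goals=15):
    """Return (win, draw, loss) for team A from two independent Poissons."""
    win = draw = loss = 0.0
    for i in range(max_goals + 1):
        for j in range(max_goals + 1):
            p = poisson_pmf(i, lam_a) * poisson_pmf(j, lam_b)
            if i > j:
                win += p
            elif i == j:
                draw += p
            else:
                loss += p
    total = win + draw + loss  # renormalise the tiny truncated tail
    return win / total, draw / total, loss / total


def advance_prob(lam_a, lam_b, overtime_factor=1 / 3):
    """P(team A advances): normal time, then overtime, then a coin flip."""
    win, draw, _ = outcome_probs(lam_a, lam_b)
    # Assumption: overtime scoring rates are the 90-minute rates times 30/90.
    ot_win, ot_draw, _ = outcome_probs(lam_a * overtime_factor,
                                       lam_b * overtime_factor)
    return win + draw * (ot_win + 0.5 * ot_draw)


# Hypothetical expected goals for one pairing (placeholders, not model output).
w, d, l = outcome_probs(1.6, 0.9)
print(f"win {w:.3f}, draw {d:.3f}, loss {l:.3f}")
print(f"advance {advance_prob(1.6, 0.9):.3f}")
```

By construction, the two teams' probabilities of advancing add up to one, and two equally strong teams each advance with probability 0.5.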

The following heatmap shows for each possible combination of teams the probability that one team beats the other team in a knockout match. The color scheme uses green vs. purple to signal probabilities above vs. below 50%, respectively. The tooltips for each match in the interactive version of the graphic also print the probabilities for the match to end in a win, draw, or loss after normal time.

Interactive full-width graphic

Performance throughout the tournament

As the goals for both teams in every single match can be simulated with the approach described above, it is also straightforward to simulate the entire tournament (here: 100,000 times), providing “survival” probabilities for each team across the different stages.

Interactive full-width graphic
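
The idea of repeated tournament simulation can be illustrated with a deliberately small Python sketch: a four-team knockout bracket instead of the full 48-team tournament with its group phase. Everything specific here is a hypothetical placeholder, including the team labels, the log-scale strengths, the base scoring rate, the overtime scaling by one third, and the number of simulation runs.

```python
import math
import random
from collections import Counter

# Hypothetical team strengths on a log scale (placeholders, not model output).
STRENGTH = {"A": 0.4, "B": 0.0, "C": 0.1, "D": -0.3}
BASE_GOALS = 1.3  # assumed average goals per team and match


def expected_goals(team, opponent):
    """Stand-in for the model's predicted goals of `team` against `opponent`."""
    return BASE_GOALS * math.exp(STRENGTH[team] - STRENGTH[opponent])


def sample_poisson(lam, rng):
    """Draw a Poisson variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def play_knockout(a, b, rng):
    """Simulate one knockout match and return the winner."""
    lam_a, lam_b = expected_goals(a, b), expected_goals(b, a)
    goals_a, goals_b = sample_poisson(lam_a, rng), sample_poisson(lam_b, rng)
    if goals_a == goals_b:  # overtime, assumed to run at a third of the rates
        goals_a += sample_poisson(lam_a / 3, rng)
        goals_b += sample_poisson(lam_b / 3, rng)
    if goals_a == goals_b:  # penalties decided by a coin flip
        return a if rng.random() < 0.5 else b
    return a if goals_a > goals_b else b


def simulate_tournament(bracket, n_sims=20_000, seed=1):
    """Return {teams_remaining: {team: share of runs reaching that stage}}.

    `bracket` lists the teams in bracket order; its length must be a power of 2.
    """
    rng = random.Random(seed)
    reached = {}
    for _ in range(n_sims):
        alive = list(bracket)
        while len(alive) > 1:
            alive = [play_knockout(alive[i], alive[i + 1], rng)
                     for i in range(0, len(alive), 2)]
            reached.setdefault(len(alive), Counter()).update(alive)
    return {stage: {team: counts[team] / n_sims for team in bracket}
            for stage, counts in reached.items()}


survival = simulate_tournament(["A", "B", "C", "D"])
for team in STRENGTH:
    print(team, f"final: {survival[2][team]:.3f}",
          f"champion: {survival[1][team]:.3f}")
```

The keys of the result count the teams still in the tournament, so `survival[2]` holds the shares of runs in which a team reaches the final and `survival[1]` the shares in which it wins the title.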

Odds and ends

All our forecasts are probabilistic, clearly below 100%, and hence by no means certain. Although we can quantify this uncertainty in terms of probabilities from a multiverse of potential tournaments, it is far from being predetermined which of these potential tournaments we will eventually see during the actual tournament.

Nevertheless, the probabilistic view provides us with some interesting insights. For example, compared to predictions for previous tournaments (see, e.g., 2018, 2022), it is even more uncertain who will win the title, as there are a number of teams with good (albeit none with very high) chances of winning the tournament. An important factor for this is the substantially increased size of the tournament with 48 teams (rather than the previous 32) and an additional knockout round. Also, the tournament draw is much more variable because 8 of the 12 third-placed teams proceed to the knockout stage, with 495 (!) possible combinations of groups that determine the mapping to matches in the round of 32.
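
The figure of 495 is the number of ways to choose which 8 of the 12 groups send their third-placed team to the knockout stage, which a few lines of Python can confirm. The group labels A to L are used here only for illustration.

```python
from itertools import combinations
from math import comb

groups = "ABCDEFGHIJKL"  # 12 groups, labelled for illustration

# Every possible set of 8 groups whose third-placed team advances.
advancing_sets = list(combinations(groups, 8))

print(len(advancing_sets))  # 495
print(comb(12, 8))          # 495, the binomial coefficient "12 choose 8"
```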

Moreover, comparing our forecasts to those based only on the bookmakers' odds, it is striking that Germany is ranked 4th, closely behind the three top teams, while it is only ranked 7th by many bookmakers. Conversely, Brazil and Argentina are typically ranked higher by the bookmakers but perform worse in our machine-learning-calibrated simulation.

In any case, all of this means that the probabilistic forecasts leave a lot of room for surprises and excitement during the 2026 FIFA World Cup. But what is absolutely certain is that we look forward to an entertaining tournament as football fans (much more than as professional forecasters).
