Forecasting the UEFA Women’s Euro 2025 with enhanced statistical learning

[This article was first published on Achim Zeileis, and kindly contributed to R-bloggers].

Probabilistic forecasts for the UEFA Women’s Euro 2025 are obtained by using a machine learning ensemble that combines statistically-enhanced features and other information about the teams. The favorite is Spain, followed by Germany, France, and England.

The UEFA Women’s Euro 2025 will start tomorrow, hosted by Switzerland. An increasing number of football fans around the world are following not just men’s but also women’s tournaments. They look forward to seeing how 16 of the best European teams compete from 2 to 27 July to determine the new European Champion. In anticipation of the tournament, the big question is who among the teams will succeed, who will drop out, and who will eventually prevail. While, of course, it is not yet possible to give definitive answers to these questions, we are able to provide probabilistic forecasts for all possible matches based on a combination of machine learning, statistics, and computing. This allows us to explore the likely course of the tournament by simulation.

Winning probabilities

The forecast is based on an ensemble of machine learners that blends three main sources of information: an ability estimate for every team based on historic matches; an ability estimate for every team based on odds from 24 bookmakers; and further team and country features (e.g., FIFA rank or GDP). The ensemble is trained on the results of the UEFA Women’s Euro tournaments from 2013 to 2022 and then applied to obtain a forecast for the UEFA Women’s Euro 2025. More specifically, the ensemble predicts the number of goals for all possible matches between the 16 teams in the tournament. Based on the predicted goals, the probabilities for a win, draw, or loss in each of these matches can be computed from a bivariate Poisson distribution. This allows us to simulate all matches in the group phase, determine which teams proceed to the knockout stage, and see who eventually wins the tournament. Repeating the simulation 100,000 times yields winning probabilities for each team.

The results show that reigning World Champion Spain is also the favorite for the European title with a winning probability of 27.2%, followed by eight-time winner Germany with 23.0%, France with 17.6%, and defending champion England with 17.2%. The winning probabilities for all teams are shown in the barchart below with more information linked in the interactive full-width version.

Interactive full-width graphic

Barchart: Winning probabilities

The methodology for this study was developed by an international collaboration of teams around Andreas Groll (TU Dortmund), Christophe Ley (University of Luxembourg), Gunther Schauberger (TU München), and Achim Zeileis (Universität Innsbruck). This year, Marjan Farahani and Rouven Michels also contributed to the study.

The basic idea for the forecast is to proceed in two steps. In the first step, two sophisticated statistical models are employed to determine the strengths of all teams using disparate sets of information. In the second step, a machine learner ensemble decides how to best combine the strength estimates with other information about the teams.

  • Historic match abilities:
    An ability estimate is obtained for every team based on “retrospective” data, namely all historic national matches over the last 8 years. A bivariate Poisson model with team-specific fixed effects is fitted to the number of goals scored by both teams in each match. However, rather than equally weighting all matches to obtain average team abilities (or team strengths) over the entire history period, an exponential weighting scheme is employed. This assigns more weight to more recent results and thus yields an estimate of current team abilities. More details can be found in Ley, Van de Wiele, Van Eetvelde (2019).

  • Bookmaker consensus abilities:
    Another ability estimate for every team is obtained based on “prospective” data, namely the odds of 24 international bookmakers that reflect their expert expectations for the tournament. Using the bookmaker consensus model of Leitner, Zeileis, Hornik (2010), the bookmaker odds are first adjusted for the bookmakers’ profit margins (“overround”) and then averaged (on a logit scale) to obtain a consensus for the winning probability of each team. To adjust for the effects of the tournament draw (that might have led to easier or harder groups for some teams), an “inverse” simulation approach is used to infer which team abilities are most likely to lead up to the consensus winning probabilities.

  • Machine learning ensemble:
    Finally, a machine learning ensemble, a so-called random forest, is used to combine the two highly aggregated and informative ability estimates above with various further relevant variables, yielding refined probabilistic forecasts for each match. Such an approach was first suggested by Groll, Ley, Schauberger, Van Eetvelde (2019) and subsequently improved collaboratively. The machine learning ensemble is trained to decide how to blend the different ability estimates with team-specific features that are typically less informative but still powerful enough to enhance the forecasts. The features considered comprise team- and country-specific details (e.g., FIFA rank, number of Champions League players, and GDP per capita). By combining a large ensemble of machine learners, each of which employs the available information somewhat differently, the relative importance of each covariate can be inferred automatically. The resulting predicted number of goals for each team can then finally be used to simulate the entire tournament 100,000 times.
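
To make the first two ingredients more concrete, here is a minimal Python sketch of the two weighting ideas: exponential down-weighting of older matches and a logit-scale average of overround-adjusted bookmaker probabilities. It is an illustration under stated assumptions, not the authors' implementation. The function names, the half-life of 1095 days, and the example odds are all hypothetical, and the overround is removed by simple proportional normalization, which may differ from the adjustment used in the cited bookmaker consensus model.

```python
import math


def time_decay_weights(ages_in_days, half_life_days=1095.0):
    """Exponential weights: a match that is one half-life old counts half
    as much as a match played today. The half-life value is an assumption."""
    return [0.5 ** (age / half_life_days) for age in ages_in_days]


def consensus_probabilities(decimal_odds_by_bookmaker):
    """Average overround-adjusted winning probabilities on the logit scale.

    Each inner list holds one bookmaker's decimal odds for all teams.
    Overround removal here is proportional normalization (a simplification).
    """
    n_books = len(decimal_odds_by_bookmaker)
    n_teams = len(decimal_odds_by_bookmaker[0])
    logit_sums = [0.0] * n_teams
    for odds in decimal_odds_by_bookmaker:
        raw = [1.0 / o for o in odds]   # implied probabilities, sum exceeds 1
        total = sum(raw)                # 1 plus the bookmaker's overround
        for i, r in enumerate(raw):
            p = r / total               # adjusted probability
            logit_sums[i] += math.log(p / (1.0 - p))
    # Back-transform the mean logit to a probability for each team.
    return [1.0 / (1.0 + math.exp(-s / n_books)) for s in logit_sums]


# Made-up odds from two hypothetical bookmakers for three teams.
example_odds = [[1.8, 3.6, 3.6], [1.9, 3.4, 3.8]]
consensus = consensus_probabilities(example_odds)
weights = time_decay_weights([0.0, 365.0, 1095.0, 2920.0])
```

Note that probabilities averaged on the logit scale need not sum to exactly one across teams, so a final renormalization may be wanted depending on how the consensus is used afterwards.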

Match probabilities

Using the forecasts from the machine learning ensemble yields the predicted number of goals for both teams in each possible match. The explanatory information used for this is the difference between the two teams in each of the variables listed above, i.e., the difference in historic match abilities (on a log scale), the difference in bookmaker consensus abilities (again on a log scale), the difference in average player ratings of the teams, etc. Assuming a bivariate Poisson distribution with the predicted numbers of goals for both teams, we can compute the probability that a certain match ends in a win, a draw, or a loss. The same calculation can be repeated for extra time, if necessary, adjusting for the shorter time interval of 30 minutes. If the match is still tied after that, a coin flip is used to decide the penalty shoot-out.
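
The calculation described above can be sketched in a few lines of Python. This is a simplified illustration rather than the authors' code: it assumes the two goal counts are independent Poisson variables (the special case of a bivariate Poisson distribution with zero covariance), truncates the score grid at 15 goals per team, and scales the expected goals by 30/90 for extra time. The function names and the example goal expectations are hypothetical.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k goals given an expected number of goals lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def outcome_probabilities(lam_a, lam_b, max_goals=15):
    """Win, draw, and loss probabilities for team A against team B,
    assuming independent Poisson goal counts (a simplification)."""
    win = draw = loss = 0.0
    for i in range(max_goals + 1):
        p_i = poisson_pmf(i, lam_a)
        for j in range(max_goals + 1):
            p = p_i * poisson_pmf(j, lam_b)
            if i > j:
                win += p
            elif i == j:
                draw += p
            else:
                loss += p
    return win, draw, loss


def knockout_win_probability(lam_a, lam_b):
    """Probability that team A advances in a knockout match: normal time,
    then extra time with goals scaled to 30 of 90 minutes, then a coin flip."""
    win, draw, _ = outcome_probabilities(lam_a, lam_b)
    et_win, et_draw, _ = outcome_probabilities(lam_a / 3.0, lam_b / 3.0)
    return win + draw * (et_win + 0.5 * et_draw)


# Hypothetical expected goals for two teams.
win, draw, loss = outcome_probabilities(1.6, 1.1)
advance = knockout_win_probability(1.6, 1.1)
```

Because a knockout match cannot end in a draw, the probability of advancing is always at least as large as the normal-time win probability for the same pair of goal expectations.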

The following heatmap shows for each possible combination of teams the probability that one team beats the other team in a knockout match. The color scheme uses green vs. purple to signal probabilities above vs. below 50%, respectively. The tooltips for each match in the interactive version of the graphic also print the probabilities for the match to end in a win, draw, or loss after normal time.

Interactive full-width graphic

Heatmap: Match probabilities

Performance throughout the tournament

As every single match can be simulated with the pairwise probabilities above, it is also straightforward to simulate the entire tournament (here: 100,000 times), providing “survival” probabilities for each team across the different stages.
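
The simulation idea can be illustrated with a deliberately small Python sketch. It covers only a four-team knockout bracket (two semifinals and a final) rather than the full tournament with its group phase, and the team labels, strength values, and the rule for turning strengths into pairwise probabilities are hypothetical placeholders for the model-based probabilities described above.

```python
import random


def simulate_knockout(teams, p_beats, n_sims=100_000, seed=1):
    """Simulate a four-team bracket: teams[0] vs teams[1], teams[2] vs teams[3],
    then a final. p_beats[(a, b)] is the probability that a eliminates b.
    Returns the share of runs in which each team reaches the final and wins."""
    rng = random.Random(seed)
    reached_final = {t: 0 for t in teams}
    won_title = {t: 0 for t in teams}

    def play(a, b):
        # Draw the winner of one knockout match from the pairwise probability.
        return a if rng.random() < p_beats[(a, b)] else b

    for _ in range(n_sims):
        finalist_1 = play(teams[0], teams[1])
        finalist_2 = play(teams[2], teams[3])
        reached_final[finalist_1] += 1
        reached_final[finalist_2] += 1
        won_title[play(finalist_1, finalist_2)] += 1

    return (
        {t: reached_final[t] / n_sims for t in teams},
        {t: won_title[t] / n_sims for t in teams},
    )


# Hypothetical strengths; pairwise probabilities are proportional to them.
strength = {"A": 2.0, "B": 1.0, "C": 1.5, "D": 1.0}
teams = list(strength)
p_beats = {
    (a, b): strength[a] / (strength[a] + strength[b])
    for a in teams
    for b in teams
    if a != b
}
reach_final, win_title = simulate_knockout(teams, p_beats, n_sims=20_000)
```

Counting how often each team reaches a given stage across all simulated runs is what produces the stage-by-stage "survival" curves. A full version would add the group phase and the actual bracket rules.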

Interactive full-width graphic

Line plot: Survival probabilities

Odds and ends

All our forecasts are probabilistic, clearly below 100%, and by no means certain. Thus, although we can quantify this uncertainty in terms of probabilities from an ensemble of potential tournaments, it is far from being predetermined which of these potential tournaments we will eventually see during the actual tournament.

Nevertheless, the probabilistic view provides us with some interesting insights: For example, while most bookmakers clearly favor Spain over Germany, France, and England, the differences are much smaller in our model. In a match between Spain and any of the other three co-favorites, the probability of winning or losing is very close to a fair coin flip. This shows that the main reason for Spain’s high winning probability for the tournament is not that they are much stronger than their co-favorites but that they were somewhat luckier in the tournament draw. Spain starts out in the somewhat weaker group B and will very likely proceed to the quarterfinal and face a team from the weakest group A, including host Switzerland. Thus, the expected course of the tournament is very different from that of co-favorites France and England, who have been drawn together in the toughest group D, also including former European Champion Netherlands.

The four top teams are also most likely to be the competitors in the semifinals. However, the predicted probability of reaching the semifinal for host Switzerland is also moderately high (39.3%). This reflects that they have very good chances to proceed to the knockout stage and with a little bit of luck might be good for a surprise, even if the probability of going all the way and winning the title is rather low (3.4%).

In any case, all of this means that the probabilistic forecasts leave a lot of room for surprises and excitement during the UEFA Women’s Euro 2025. But what is absolutely certain is that we look forward to an entertaining tournament as football fans (much more than as professional forecasters).
