Using a random forest ensemble learner we obtain probabilistic forecasts for the FIFA Women’s World Cup in France: The clear favorite is defending champion USA followed, with some margin, by host France, England, and Germany.
The forecast is based on a hybrid random forest learner that combines three main sources of information: An ability estimate for every team based on historic matches; an ability estimate for every team based on odds from 18 bookmakers; further team covariates (e.g., age, team structure) and country-specific socio-economic factors (population, GDP). The random forest is learned using the FIFA Women’s World Cups in 2011 and 2015 as training data and then applied to current information to obtain a forecast for the 2019 FIFA Women’s World Cup. The random forest actually provides the predicted number of goals for each team in all possible matches in the tournament so that a bivariate Poisson distribution can be used to compute the probabilities for a win, draw, or loss in such a match. Based on these match probabilities the entire tournament can be simulated 100,000 times yielding winning probabilities for each team. The results show that the defending champion United States is the clear favorite with a winning probability of 28.1% followed by host France with a winning probability of 14.3%, England with 13.3%, and Germany with 12.9%. The winning probabilities for all teams are shown in the bar chart below with more information linked in the interactive full-width version.
The full study is available in a recent working paper written by an international team of researchers: Andreas Groll, Christophe Ley, Gunther Schauberger, Hans Van Eetvelde, Achim Zeileis. It provides a hybrid approach that combines three state-of-the-art forecasting methods:
Historic match abilities:
An ability estimate is obtained for every team based on “retrospective” data, namely 3418 historic matches of 167 international women’s teams over the last 8 years. A bivariate Poisson model with team-specific fixed effects is fitted to the number of goals scored by both teams in each match. However, rather than equally weighting all matches to obtain average team abilities (or team strengths) over the entire history period, an exponential weighting scheme is employed. This assigns more weight to more recent results and thus yields an estimate of current team abilities. More details can be found in Ley, Van de Wiele, Van Eetvelde (2019).
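To make the exponential weighting idea concrete, here is a minimal sketch in Python. It assumes a half-life parameterization, which is one common way to write such a scheme; the function name `match_weight` and the half-life of 365 days are hypothetical choices for illustration and are not taken from the study.

```python
def match_weight(days_ago: float, half_life_days: float = 365.0) -> float:
    """Weight of a match played `days_ago` days before the forecast date.

    Assumption: the weight halves every `half_life_days` days. The value
    of 365 is purely illustrative, not the one estimated in the paper.
    """
    if days_ago < 0:
        raise ValueError("days_ago must be non-negative")
    return 0.5 ** (days_ago / half_life_days)


# Matches played today, one year ago, two years ago, and eight years ago.
weights = [match_weight(d) for d in (0, 365, 730, 2920)]
```

With such weights, a match from eight years ago still enters the likelihood but contributes far less than a match from last month, which is what turns an average ability into an estimate of the current ability.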
Bookmaker consensus abilities:
Another ability estimate for every team is obtained based on “prospective” data, namely the odds of 18 international bookmakers that reflect their expert expectations for the tournament. Using the bookmaker consensus model of Leitner, Zeileis, Hornik (2010) the bookmaker odds are first adjusted for the bookmakers’ profit margins (“overround”) and then averaged (on a logit scale) to obtain a consensus for the winning probability of each team. To adjust for the effects of the tournament draw (that might have led to easier or harder groups for some teams), an “inverse” simulation approach is used to infer which team abilities are most likely to lead up to these winning probabilities.
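The two steps of the consensus (removing the overround, then averaging on the logit scale) can be sketched as follows. This is a simplified illustration: it removes the overround by proportional normalization of the implied probabilities, which is an assumption and not necessarily the exact adjustment used by Leitner, Zeileis, Hornik (2010). The function names and the example odds are hypothetical.

```python
import math


def implied_probabilities(decimal_odds):
    """Turn one bookmaker's decimal odds into probabilities summing to 1.

    Returns the adjusted probabilities and the overround, i.e. the amount
    by which the raw implied probabilities exceed 1.
    """
    raw = [1.0 / o for o in decimal_odds]
    total = sum(raw)
    return [p / total for p in raw], total - 1.0


def logit(p):
    return math.log(p / (1.0 - p))


def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))


def consensus(books):
    """Average the adjusted probabilities across bookmakers on the logit scale.

    `books` holds one list of decimal odds per bookmaker, all in the
    same team order.
    """
    adjusted = [implied_probabilities(b)[0] for b in books]
    n_teams = len(adjusted[0])
    return [
        inv_logit(sum(logit(a[i]) for a in adjusted) / len(adjusted))
        for i in range(n_teams)
    ]


# Two hypothetical bookmakers quoting odds for three hypothetical teams.
books = [[1.6, 3.2, 3.2], [1.7, 3.0, 3.3]]
probs = consensus(books)
```

Note that probabilities averaged on the logit scale need not sum to exactly one across teams, so a final renormalization may be needed depending on how the result is used.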
Hybrid random forest:
Finally, machine learning is used to combine the two ability estimates above along with a broad range of further relevant covariates, yielding refined probabilistic forecasts for each match. Specifically, the hybrid random forest approach of Groll, Ley, Schauberger, Van Eetvelde (2019) is used to combine the two highly-informative ability estimates with further team-specific information that may or may not be relevant to the team’s performance. The covariates considered comprise team-specific details (e.g., FIFA rank, average age, confederation, team structure, …) as well as country-specific socio-economic factors (population and GDP per capita). By learning a large ensemble of 5,000 regression trees, the relative importances of all the covariates can be inferred automatically. The resulting predicted number of goals for each team (averaged over all trees) can then finally be used to simulate the entire tournament 100,000 times.
Using the hybrid random forest an expected number of goals is obtained for both teams in each possible match. The covariate information used for this is the difference between the two teams in each of the variables listed above, i.e., the difference in historic match abilities (on a log scale), the difference in bookmaker consensus abilities (on a log scale), difference in mean age of the teams, etc. Assuming a bivariate Poisson distribution with the expected numbers of goals for both teams, we can compute the probability that a certain match ends in a win, a draw, or a loss.
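The step from expected goals to win, draw, and loss probabilities can be sketched as below. The sketch assumes two independent Poisson distributions, which is the special case of the bivariate Poisson distribution with zero covariance; the function names, the truncation at 25 goals, and the expected goals in the example are hypothetical.

```python
import math


def poisson_pmf(k, mu):
    """Probability of exactly k goals when mu goals are expected."""
    return math.exp(-mu) * mu**k / math.factorial(k)


def match_probabilities(mu_a, mu_b, max_goals=25):
    """Win, draw, and loss probabilities from team A's point of view.

    Assumption: the goals of both teams are independent Poisson variables.
    Scorelines beyond `max_goals` per team are ignored; their probability
    is negligible for realistic football scores.
    """
    win = draw = loss = 0.0
    for i in range(max_goals + 1):
        p_a = poisson_pmf(i, mu_a)
        for j in range(max_goals + 1):
            p = p_a * poisson_pmf(j, mu_b)
            if i > j:
                win += p
            elif i == j:
                draw += p
            else:
                loss += p
    return win, draw, loss


# Hypothetical expected goals for a match between a stronger and a weaker team.
win, draw, loss = match_probabilities(1.8, 0.9)
```

Swapping the two arguments swaps the win and loss probabilities, which mirrors exchanging the roles of the two teams.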
The following heatmap shows the win probabilities in each possible match between a pair of teams with green vs. pink signalling probabilities above vs. below 50%, respectively. The corresponding loss probability is displayed when changing the roles of the teams (i.e., switching rows and columns in the matrix below). The tooltips for each match in the interactive version of the graphic also print the three win, draw, and loss probabilities.
Performance throughout the tournament
As every single match can be simulated with the pairwise probabilities above, it is also straightforward to simulate the entire tournament (here: 100,000 times) providing “survival” probabilities for each team across the different stages.
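The simulation idea can be illustrated with a deliberately reduced sketch that covers only a single-elimination bracket and leaves out the group stage of the real tournament. The team names, the strengths, and the rule for `win_prob` are all hypothetical; in the actual study the pairwise probabilities come from the hybrid random forest.

```python
import random
from collections import Counter


def simulate_knockout(teams, win_prob, rng):
    """Play one single-elimination bracket and return the champion.

    `teams` is given in bracket order and its length must be a power of 2.
    `win_prob(a, b)` is the probability that a eliminates b.
    """
    alive = list(teams)
    while len(alive) > 1:
        next_round = []
        for a, b in zip(alive[0::2], alive[1::2]):
            next_round.append(a if rng.random() < win_prob(a, b) else b)
        alive = next_round
    return alive[0]


def title_probabilities(teams, win_prob, n_sims=10_000, seed=1):
    """Estimate each team's title probability by repeated simulation."""
    rng = random.Random(seed)
    titles = Counter(simulate_knockout(teams, win_prob, rng) for _ in range(n_sims))
    return {t: titles[t] / n_sims for t in teams}


# Hypothetical strengths; a stronger team is more likely to advance.
strength = {"A": 4.0, "B": 2.0, "C": 1.0, "D": 1.0}


def win_prob(a, b):
    return strength[a] / (strength[a] + strength[b])


probs = title_probabilities(list(strength), win_prob)
```

Recording the round in which each team is eliminated, rather than only the champion, would give the stage-by-stage “survival” probabilities in the same way.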
Odds and ends
All our forecasts are probabilistic, clearly below 100%, and thus by no means certain – even if the favorite United States clearly has the highest winning probability of all participating teams. Recall that a single poor performance in the playoffs is sufficient to drop out of the tournament. For example, this happened to host and clear favorite Germany in 2011 (with a winning probability of almost 40% according to the bookmakers) when they lost to Japan 0-1 in extra time in the quarter-finals. Japan then went on to become FIFA Women’s World Champion for the first time.
Another interesting observation is that the bookmakers see both the United States and France almost on par with bookmaker consensus probabilities of 18.1% and 18.7%, respectively. Clearly, the bookmakers (and presumably their customers) expect that France’s home advantage will play an important role. In contrast, our hybrid random forest does not find the home advantage to be an important factor and hence forecasts a much higher winning probability for the United States (28.1%) than for France (14.3%). This is due to the home advantage not having played an important role in our learning data: Germany in 2011 and Canada in 2015 both dropped out in the quarter-finals.
Finally, when considering the bookmaker consensus, it is also worth pointing out that the bookmakers seem to be less confident about their odds for the Women’s World Cup compared to the Men’s World Cup. This is reflected by the increased overround that assures the bookmakers’ profit margins. For men’s tournaments this overround is typically around 15% (which the bookmakers keep and do not pay out), whereas for the FIFA Women’s World Cup 2019 it is a sizeable 25% on average and thus ten percentage points higher.
This overround is also the main reason why we recommend against betting based on the results presented here. It assures that the best chances of making money based on sports betting lie with the bookmakers. Instead we recommend betting only privately among friends and colleagues – or simply enjoying the exciting matches we are surely about to see in France!
Groll A, Ley C, Schauberger G, Van Eetvelde H, Zeileis A (2019). “Hybrid Machine Learning Forecasts for the FIFA Women’s World Cup 2019”, arXiv:1906.01131, arXiv.org E-Print Archive. https://arXiv.org/abs/1906.01131