World Cup Analysis

[This article was first published on R – Gradient Metrics, and kindly contributed to R-bloggers.]

Launch Consulting, with an assist from Gradient Metrics, developed a statistical model (have a peek under the hood in this whitepaper) to assess the strength of every national soccer team that has ever played a match. Using a comprehensive dataset of all ~40,000 international matches dating back to 1872, data scientists Eric Thompson and Tom Vladeck constructed a multilevel regression model to evaluate team strength. This could be a powerful tool for predicting World Cup matches, and we will try to do just that!


Prior to the tournament, our model predicted 13 of the final 16 teams, missing only entrants Switzerland and Mexico, as well as Cinderella story Japan. In what many consider a wacky and wild World Cup, our model performed quite well. Have a look at the table below to see which teams are rated highest heading into the final matches of the 2018 World Cup.
So what’s our bet on the knockout stages? Below is the complete bracket our model implies.

Because this model provides an overall assessment of Team Strength for all international teams ever, controlling for opponent, the home/away effect, and a unique “stadium effect”, it can be a powerful baseline tool for modeling teams from entirely different decades or generations. The model uses the negative binomial distribution, commonly used in experimental biology (e.g. RNA sequencing) and in retail (e.g. modeling customer purchase frequency when groups of buyers exhibit distinct, heterogeneous behaviors). The model could also be naturally extended to other team sports, with lacrosse and volleyball being interesting applications due to the structure of those sports. (In our whitepaper, see Sections “4.0 – Model” and “7.0 – Extending this Work” for more detail on these topics.)

This model has already been showcased across South America at partner events for a major tech company, and now Launch is able to offer predictive modeling as a service.
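To make the negative binomial idea concrete, here is a minimal Python sketch of how goal counts and match outcomes could be computed from team-level effects. It is not the model from the whitepaper: the log-linear form of `expected_goals`, the independence of the two teams' scores, the function names, and every parameter value are assumptions chosen purely for illustration.

```python
import math


def neg_binomial_pmf(goals, mean, dispersion):
    """P(team scores exactly `goals`) under a negative binomial.

    Parameterised by its mean and a dispersion (size) parameter, so that
    variance = mean + mean**2 / dispersion.
    """
    log_p = (
        math.lgamma(goals + dispersion)
        - math.lgamma(dispersion)
        - math.lgamma(goals + 1)
        + dispersion * math.log(dispersion / (dispersion + mean))
        + goals * math.log(mean / (dispersion + mean))
    )
    return math.exp(log_p)


def expected_goals(intercept, attack, opponent_defense, home_effect=0.0):
    """Hypothetical log-linear mean: effects add on the log scale (an assumption)."""
    return math.exp(intercept + attack - opponent_defense + home_effect)


def outcome_probabilities(mean_a, mean_b, dispersion, max_goals=20):
    """Win/draw/loss probabilities for team A, treating both scores as independent."""
    win = draw = loss = 0.0
    for a in range(max_goals + 1):
        p_a = neg_binomial_pmf(a, mean_a, dispersion)
        for b in range(max_goals + 1):
            p = p_a * neg_binomial_pmf(b, mean_b, dispersion)
            if a > b:
                win += p
            elif a == b:
                draw += p
            else:
                loss += p
    return win, draw, loss


# Made-up parameter values, for illustration only (not estimates from the model).
mean_home = expected_goals(intercept=0.1, attack=0.3, opponent_defense=0.1, home_effect=0.25)
mean_away = expected_goals(intercept=0.1, attack=0.2, opponent_defense=0.2)
win, draw, loss = outcome_probabilities(mean_home, mean_away, dispersion=5.0)
print(f"home win {win:.3f}, draw {draw:.3f}, away win {loss:.3f}")
```

Compared with a Poisson, the extra dispersion parameter lets the variance of goals exceed the mean, which is the usual reason for reaching for a negative binomial with count data.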

Interesting findings from the model

See “Section 6.2 – World Cup 2018” in our whitepaper to easily evaluate the final teams’ strength. Pre-tournament Vegas favorite Germany stumbled early in its opening 0-1 loss to Mexico and was ultimately eliminated by South Korea. While Germany was a favorite to many forecasters, our model found them to be only the #3 team since 2016, behind Brazil and Spain. The Netherlands and Italy must be kicking themselves (pun intended) because, although they didn’t even qualify for the 2018 World Cup, they grade out near the top of our rankings, as the #6 and #7 strongest national teams since 2016 according to our model.

What else? Let’s talk about home-field advantage for a moment. Bolivia isn’t known for its football prowess, having been outscored by a total of 834-450 in the team’s history. But Bolivia boasts our model’s #1 home-field advantage, at almost double the #2 team. Our model estimates a baseline increase of +0.47 goals scored/game for a typical team’s home-field advantage, but on top of that, Bolivia receives an additional “stadium effect” of +0.42 goals/game. This is by far the largest in soccer, at almost 4.5 standard deviations above the mean.

This unique advantage may be tied to the high elevation at which the team plays. In 2007 FIFA temporarily banned World Cup qualifying games from being played in Bolivia, Ecuador and Colombia due to high elevations affecting players’ health. Perhaps not surprisingly, other high-elevation teams are littered throughout our ranking of top stadium effects, with Mexico and Ecuador ranked at #4 and #5 respectively.
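The home-field numbers above can be combined with a little arithmetic. The short Python sketch below uses only figures quoted in this post, plus two assumptions that the post does not confirm: that the baseline home advantage and the stadium effect simply add on the goals-per-game scale, and that stadium effects are centred on zero across teams.

```python
# Figures quoted in the article.
TYPICAL_HOME_ADVANTAGE = 0.47   # goals/game, baseline for a typical team
BOLIVIA_STADIUM_EFFECT = 0.42   # goals/game, Bolivia's extra "stadium effect"
BOLIVIA_Z_SCORE = 4.5           # "almost 4.5 standard deviations above the mean"
BOLIVIA_GOALS_FOR = 450
BOLIVIA_GOALS_AGAINST = 834

# Assumption: the two effects add on the goals-per-game scale.
bolivia_total_home_boost = TYPICAL_HOME_ADVANTAGE + BOLIVIA_STADIUM_EFFECT

# Assumption: stadium effects have mean zero across teams, so the quoted
# z-score implies a rough spread for the stadium effect.
implied_stadium_sd = BOLIVIA_STADIUM_EFFECT / BOLIVIA_Z_SCORE

# All-time goal difference from the quoted 834-450 record.
goal_difference = BOLIVIA_GOALS_FOR - BOLIVIA_GOALS_AGAINST

print(f"Bolivia's total home boost: about {bolivia_total_home_boost:.2f} goals/game")
print(f"Implied SD of stadium effects: about {implied_stadium_sd:.3f} goals/game")
print(f"All-time goal difference: {goal_difference}")
```

Under those assumptions, Bolivia's home boost is close to twice that of a typical team, despite a heavily negative all-time goal difference.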


In addition to having some fun, we hope this project opens your eyes to the possibilities of predictive modeling in your own business. Do your customers all purchase with homogeneous frequencies? Of course not. So they may need to be modeled as “nested” groups, using a hierarchical approach such as the one we’ve used here. Give us a shout and we’ll chat!

Whoever your team is, good luck / buena suerte / boa sorte / bonne chance / lycka till / sretno / held og lykke / がんばろう this weekend.
