Football by the numbers

[This article was first published on RSS Feed, and kindly contributed to R-bloggers.]

Salvino A. Salvaggio [1] [2] [3]
In this blog I publish data analysis cases based on the R statistical language. No statistical or mathematical theory here, no discussions of the R language, no software tutorials, but only concrete case studies using existing R tools. To download R code and dataset, click here (4.0 MB).
Over the last 40-50 years, the passion for football has spread internationally and proved to be one of the most contagious social phenomena. Something that was regarded in the 1960s and ’70s as a fun form of national craziness typical of Brazilians, Britons and Italians is now shared by a vast majority of the Earth’s population (including orbiting astronauts, who are regularly kept informed of match results). As I am an absolute outsider to that trend, I randomly scraped the web[4] in search of results and scores, and ended up with a dataset of approx. 400,000 first-league matches (381,257 after a bit of cleaning) that I don’t really have a precise idea of what to do with. A clear advantage of this outsider position is that I can dig deeper into the topic without an ounce of positive or negative preconception. A clear disadvantage, however, is that I may not even think of analytical approaches that would be obvious to a football fan or expert.
The dataset is very international, comprising matches from 60 different countries spread over 6 continents and representing all FIFA regions.
continent        matches    FIFA region   matches    top countries    matches
Africa             16748    AFC             28065    United Kingdom     66454
Asia               25062    CAF             16748    France             26417
Europe            292733    CONCACAF        13853    Italy              24716
North America      10649    CONMEBOL        25475    Spain              23140
Oceania             7386    OFC               560    Netherlands        17790
South America      28679    UEFA           296556    Germany            15854
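The breakdowns above can be cross-checked against the cleaned dataset size quoted earlier (381,257 matches). The original analysis was done in R; what follows is only a minimal Python sketch, with the counts retyped by hand from the table and variable names chosen purely for illustration.

```python
# Match counts retyped from the continent and FIFA-region columns of the table.
by_continent = {
    "Africa": 16748,
    "Asia": 25062,
    "Europe": 292733,
    "North America": 10649,
    "Oceania": 7386,
    "South America": 28679,
}
by_fifa_region = {
    "AFC": 28065,
    "CAF": 16748,
    "CONCACAF": 13853,
    "CONMEBOL": 25475,
    "OFC": 560,
    "UEFA": 296556,
}

total_by_continent = sum(by_continent.values())
total_by_region = sum(by_fifa_region.values())

# Share of each continent in the whole dataset, in percent.
continent_share = {
    name: 100 * count / total_by_continent
    for name, count in by_continent.items()
}

print(total_by_continent, total_by_region)
for name, share in continent_share.items():
    print(f"{name:<14}{share:5.1f}%")
```

Both totals come out at 381,257, and Europe accounts for about 76.8% of the matches.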
From 65 matches in 1888 to more than 15,000 matches per annum from 2006 onwards,[5] the dataset shows roughly exponential growth in the number of matches logged annually (with the exception of the two World Wars). This reflects not only an overall trend in the football industry but also the way the original data sources I tapped are fed: since the 1950s, a growing number of countries have had an official championship whose results are made available (by their respective federations or fan communities).
Summary statistics confirm what most fans and non-fans say: football is a low-scoring sport. The mean total number of goals per match is 2.77, with an average difference in scoring between winner and loser of only 0.57 goals. To put it differently, there was 1 goal every 32 minutes 30 seconds across the dataset.
The pattern of match results is quite predictable, with almost twice as many home wins as visiting wins or draws.
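The scoring-rate figure in the summary statistics can be retraced from the mean of 2.77 goals per match. This small Python sketch assumes 90 minutes of play per match (stoppage and extra time ignored), which the text does not state explicitly; the constant names are illustrative.

```python
# Assumption: every match lasts 90 minutes of regulation play.
MATCH_MINUTES = 90
MEAN_GOALS_PER_MATCH = 2.77  # figure quoted in the text

minutes_per_goal = MATCH_MINUTES / MEAN_GOALS_PER_MATCH
whole_minutes = int(minutes_per_goal)
seconds = round((minutes_per_goal - whole_minutes) * 60)

print(f"one goal every {whole_minutes} min {seconds} s")
```

Under that assumption the interval is just under 32 and a half minutes, in line with the figure quoted in the text.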
However, the continent where matches are played seems to have some impact on the distribution of home wins, draws and visiting wins. The over-representation of Europe in the dataset (76.8% of all cases) calls for caution when comparing subsets; for example, the under-representation of visiting wins in Africa compared with the rest contributes strongly to the chi-square statistic, even though these matches make up only a very small proportion (0.87%) of the whole dataset.
North America     4976    2693    2980
South America    13226    8024    7429
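To illustrate how a Pearson chi-square statistic of this kind is built, here is a minimal pure-Python sketch applied to the two rows above. It assumes each row splits into three counts that add up to the continent totals given earlier (4976 + 2693 + 2980 = 10649 and 13226 + 8024 + 7429 = 28679). Because only two of the six continents are included, it will not reproduce the statistic reported for the full table; the function name is illustrative and this is not the author's R code.

```python
def pearson_chi_square(table):
    """Return (statistic, degrees of freedom) for a contingency table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Assumed split of the two rows above into three result counts each.
two_rows = [
    [4976, 2693, 2980],   # North America
    [13226, 8024, 7429],  # South America
]
stat, dof = pearson_chi_square(two_rows)
print(f"chi-square = {stat:.2f}, df = {dof}")
```

With 6 continents and 3 possible results, the same formula gives (6 - 1) × (3 - 1) = 10 degrees of freedom for the full table.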
Chi-square: 985.26, df: 10, p-value: 2.797101e-205
On average, home teams score more goals than visiting teams. Overall, 636,034 goals in the dataset were scored by home teams and 419,775 by visiting teams. Not only do the sums differ significantly,[6] but so do the shapes of the distributions.
As further confirmation of the perception of football as a low-scoring sport, approximately two thirds of all the results in the dataset (67.8%) are within a 2:2 score (e.g., 0:0, 1:0, 2:0, 0:1, 0:2, 1:1, 2:2), and 86.4% if we consider all matches with scores up to 3:3.[7]
I have often heard football fans and, above all, newsreaders and commentators state that football is more offensive and more goals are scored in some specific countries, which makes the games more entertaining overall. According to the same commentators, other countries have a mainly defensive football tradition characterized by a lower number of goals per match and, ultimately, less fun watching the games. Brazil and Italy were always mentioned as typical examples of these 2 extremes: dynamic, high-scoring football in Brazil, chilly and defensive football in Italy.
Football in New Zealand, the Scandinavian countries, Germany, Holland, Canada and the UK offers its fans more goals overall, while Brazil and Italy both belong to a lower-scoring category, with an average total number of goals per match in the 2.51 to 3.00 range.[8]
From a historical perspective, the average total number of goals per match tends to decrease over time. From the dataset, I filtered out all the countries with fewer than 50 years (football seasons) of data and was left with 10 countries and a total of 226,671 matches.
Plotting the yearly average of total goals per match against the age of the championship shows that “younger championships” generate more goals but go through quite a steep fall over the initial 25 years, then a slower decrease over the following 50 years.[9]
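The filtering and averaging described above could be sketched as follows. This is a hedged Python illustration, not the author's R code: the tuple layout, the function name and the `min_seasons` parameter are assumptions, and championship age is taken as calendar years since a country's first season in the data, so gaps such as the World Wars still count towards the age.

```python
from collections import defaultdict


def mean_goals_by_championship_age(matches, min_seasons=50):
    """matches: iterable of (country, season_year, total_goals) tuples.

    Returns {age: mean goals per match}, where age 1 is a country's first
    season in the data. Countries with fewer than min_seasons distinct
    seasons are dropped.
    """
    matches = list(matches)
    seasons = defaultdict(set)
    for country, year, _ in matches:
        seasons[country].add(year)

    kept = {c for c, years in seasons.items() if len(years) >= min_seasons}
    first_season = {c: min(seasons[c]) for c in kept}

    goals = defaultdict(int)
    counts = defaultdict(int)
    for country, year, total_goals in matches:
        if country in kept:
            age = year - first_season[country] + 1
            goals[age] += total_goals
            counts[age] += 1

    return {age: goals[age] / counts[age] for age in sorted(counts)}


# Made-up toy rows, only to show the call; they are not taken from the dataset.
toy_matches = [
    ("A", 2000, 4), ("A", 2000, 2), ("A", 2001, 1),
    ("B", 2000, 1),
]
toy_result = mean_goals_by_championship_age(toy_matches, min_seasons=2)
print(toy_result)
```

In the toy call the threshold is lowered to 2 seasons so that country "A" is kept and country "B" is dropped; with the real data the threshold would be the 50 seasons mentioned in the text.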
[1] This document is the result of an analysis conducted by the author and exclusively reflects the author’s positions. Therefore, this document does not involve, either directly or indirectly, any of the employers, past and present, of the author. The author also declares not to have any conflict of interest with companies, institutions, organizations or authorities related to the football ecosystem.
[2] Contact: salvino [dot] salvaggio [at] gmail [dot] com
[3] In this document, football refers to the European definition, which is soccer in the USA.
[4] Sites such as or
[5] The current football season is still ongoing, which explains the substantial drop in the number of matches in the last available year of the dataset.
[6] p-value of t.test < 2.2e-16
[7] If no colored tile is shown in the graph, it means no matches in the dataset ended with such a score. If a colored tile reporting 0% is shown, it means that less than 0.005% (but more than 0) of all the matches ended with such a score.
[8] p-value of one-way ANOVA < 2e-16.
[9] Stabilization in the average total number of goals per match after the 75th year does not mean a lot in this case because only one national football championship, the UK, has such longevity.
