Football by the numbers

March 24, 2016

(This article was first published on RSS Feed, and kindly contributed to R-bloggers)

Salvino A. Salvaggio [1] [2] [3]

In this blog I publish data analysis cases based on the R statistical language. No statistical or mathematical theory here, no discussions of the R language, no software tutorials, but only concrete case studies using existing R tools.

To download R code and dataset, click here (4.0 MB).

Over the last 40-50 years, the passion for football has spread internationally and proved to be one of the most pervasive social phenomena. Something that was considered a fun form of national craziness typical of Brazilian, British and Italian people in the 1960s and ’70s is now shared by a vast majority of the Earth's population (including orbiting astronauts, who are regularly kept informed of match results).

As I am an absolute outsider to that trend, I randomly scraped the web[4] in search of results and scores, and ended up with a dataset of approx. 400,000 first-league matches (381,257 after a bit of cleaning) that I don’t really have a precise idea of what to do with. A clear advantage of this outsider position is that I can dig deeper into something without having an ounce of positive or negative preconceived ideas on the topic. However, a clear disadvantage is that I may not even think of analytical approaches that would be obvious to a football fan or expert.

…The dataset is very international, comprising matches from 60 different countries spread over 6 continents and representing all FIFA regions.

continent       matches     FIFAregion   matches     top_countries    matches
Africa            16748     AFC            28065     united kingdom     66454
Asia              25062     CAF            16748     france             26417
Europe           292733     CONCACAF       13853     italy              24716
North America     10649     CONMEBOL       25475     spain              23140
Oceania            7386     OFC              560     netherlands        17790
South America     28679     UEFA          296556     germany            15854

From 65 matches in 1888 to more than 15,000 matches per annum from 2006 onwards,[5] the dataset shows a roughly exponential growth in the number of matches logged annually (with the exception of the two World Wars). This is due not only to an overall trend in the football industry but also to the way the original data sources I tapped are fed.

As a matter of fact, since the 1950s a growing number of countries have had an official championship whose results are made available (by their respective federations or fan communities).

…Summary statistics confirm what most fans and non-fans say:

…Football is a low-scoring sport. The mean total number of goals per match is 2.77, with an average gap of only 0.57 goal between home and visiting scores. To put it differently, there was 1 goal every 32 minutes 30 seconds across the dataset.
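These figures can be cross-checked in a few lines of R from the goal totals reported later in this article (636,034 home goals and 419,775 visiting goals over 381,257 matches). The variable names are illustrative, and the 90-minute match length is an assumption of this sketch: it ignores stoppage and extra time. Note that the 0.57 figure matches the average home-minus-visiting margin.

```r
# Totals copied from the article; variable names are illustrative.
home_goals     <- 636034
visiting_goals <- 419775
n_matches      <- 381257
match_minutes  <- 90  # assumption: regulation time only

goals_per_match  <- (home_goals + visiting_goals) / n_matches
home_margin      <- (home_goals - visiting_goals) / n_matches
minutes_per_goal <- match_minutes / goals_per_match

round(c(goals_per_match  = goals_per_match,
        home_margin      = home_margin,
        minutes_per_goal = minutes_per_goal), 2)
```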

…The pattern of match results is quite predictable, with almost twice as many home wins as visiting wins or draws.

Win Frequency
Home 189886
Draw 97563
Visiting 93808
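The "almost twice as many" reading can be checked directly from the three counts above. This is a minimal sketch; the object names are illustrative.

```r
# Outcome counts copied from the table above.
wins <- c(Home = 189886, Draw = 97563, Visiting = 93808)

# Share of each outcome, in percent of all matches.
shares <- round(100 * prop.table(wins), 1)

# How many home wins there are per draw and per visiting win.
ratios <- round(wins["Home"] / wins[c("Draw", "Visiting")], 2)

shares
ratios
```

A ratio close to 2 for both draws and visiting wins is what the sentence above describes.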

However, the continent where matches are played seems to affect the distribution of home wins, draws and visiting wins. The over-representation of Europe in the dataset (76.8% of all cases) calls for caution when comparing subsets; for example, the under-representation of visiting wins in Africa strongly contributes to the chi-square statistic even though these matches are only a very small proportion (0.87%) of the whole dataset.

Continent          Home    Draw  Visiting
Africa             8736    4706      3306
Asia              11141    6828      7093
Europe           148413   73557     70763
North America      4976    2693      2980
Oceania            3394    1755      2237
South America     13226    8024      7429

ChiSquare: 985.26   —   df: 10   —   p.value: 2.797101e-205
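The test can be reproduced from the contingency table above with base R's `chisq.test()`. The sketch below is not necessarily the original script: the counts are copied from the table and the object names are illustrative. The squared Pearson residuals show how much each cell contributes to the statistic.

```r
continents <- c("Africa", "Asia", "Europe",
                "North America", "Oceania", "South America")
outcomes   <- c("Home", "Draw", "Visiting")

# Counts copied row by row from the table above.
results <- matrix(
  c(  8736,  4706,  3306,
     11141,  6828,  7093,
    148413, 73557, 70763,
      4976,  2693,  2980,
      3394,  1755,  2237,
     13226,  8024,  7429),
  ncol = 3, byrow = TRUE,
  dimnames = list(continent = continents, outcome = outcomes)
)

test <- chisq.test(results)
test$statistic  # chi-square statistic
test$parameter  # degrees of freedom
test$p.value

# Share of the statistic contributed by each cell, in percent.
contribution <- test$residuals^2 / unname(test$statistic)
round(100 * contribution, 1)
```

In the contribution table, the Africa/Visiting cell carries the largest single share, which is the effect described in the text.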

…On average, more goals are scored by home teams than by visiting teams. Overall, in the dataset 636,034 goals were scored by the home teams and 419,775 by the visiting teams. Not only is the sum significantly different,[6] but so is the shape of the distribution.

…As a further confirmation of the perception of football as a low-scoring sport, approximately two thirds of all the results in the dataset (67.8%) are within a 2:2 score (i.e., 0:0, 1:0, 0:1, 1:1, 2:0, 0:2, 2:1, 1:2, 2:2), and 86.4% if we consider all matches with a score up to 3:3.[7]
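A share of this kind can be computed by checking, for each match, that neither side scored more than a given number of goals. The sketch below uses a tiny made-up data frame, not the article's dataset; the column names `home` and `visiting` and the helper `share_within()` are hypothetical.

```r
# Toy scores for illustration only (NOT taken from the article's dataset).
scores <- data.frame(
  home     = c(0, 1, 2, 3, 2, 4, 1, 0),
  visiting = c(0, 0, 2, 1, 1, 4, 1, 3)
)

# Share of matches in which neither team scored more than k goals.
share_within <- function(df, k) {
  mean(pmax(df$home, df$visiting) <= k)
}

share_within(scores, 2)  # matches within a 2:2 score
share_within(scores, 3)  # matches within a 3:3 score
```

Using `pmax()` treats a score as "within k:k" when the higher of the two tallies is at most k, so results such as 2:1 and 1:2 are counted.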

…Many times I have heard football fans but, mostly, newsreaders and commentators state that football is more offensive and more goals are scored in some specific countries, which makes the games more entertaining overall. According to the same commentators, other countries seem to have a mainly defensive football tradition characterized by a lower number of goals per match and, ultimately, less fun watching the games. As typical examples of these two extreme ways of playing football, Brazil and Italy were always mentioned: dynamic, high-scoring football in Brazil, and chilly, defensive football in Italy. … Football in New Zealand, the Scandinavian countries, Germany, Holland, Canada and the UK offers its fans more goals scored overall, while Brazil and Italy both belong to a lower-scoring category, with an average total number of goals per match in the 2.51 to 3.00 range.[8]

…From a historical perspective, the average total number of goals per match tends to decrease over time. From the dataset, I filtered out all the countries with fewer than 50 years (football seasons) of data and was left with 10 countries for a total of 226,671 matches. Plotting the yearly average total number of goals per match against the age of each championship in seasons shows that “younger championships” generate more goals but go through quite a steep fall over the initial 25 years, then a slower decrease over the following 50 years.[9]
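The filtering and averaging steps can be sketched in base R as follows. This is an illustration on made-up data, not the original script: the column names (`country`, `season`, `total_goals`) are hypothetical, and the threshold is set to 2 seasons so that the toy example stays small (the article uses 50).

```r
# Toy matches for illustration only (NOT taken from the article's dataset).
matches <- data.frame(
  country     = c("A", "A", "A", "A", "A", "B", "C", "C"),
  season      = c(1900, 1900, 1901, 1902, 1902, 1950, 2000, 2001),
  total_goals = c(4, 2, 3, 1, 3, 5, 2, 4)
)
min_seasons <- 2  # the article uses 50

# Number of distinct seasons available for each match's country.
n_seasons <- ave(matches$season, matches$country,
                 FUN = function(s) length(unique(s)))

# Keep only countries with enough seasons of data.
long_run <- matches[n_seasons >= min_seasons, ]

# Age of the championship: 1 for a country's first season in the data.
long_run$age <- long_run$season -
  ave(long_run$season, long_run$country, FUN = min) + 1

# Average total goals per match at each championship age.
by_age <- aggregate(total_goals ~ age, data = long_run, FUN = mean)
by_age
```

With the full dataset, plotting `by_age$total_goals` against `by_age$age` would give a curve of the kind described above.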


[1] This document is the result of an analysis conducted by the author and exclusively reflects the author’s positions. Therefore, this document does not involve, either directly or indirectly, any of the employers, past and present, of the author. The author also declares not to have any conflict of interest with companies, institutions, organizations or authorities related to the football ecosystem.  

[2] Contact: salvino [dot] salvaggio [at] gmail [dot] com  

[3] In this document, football refers to the European definition, which is soccer in the USA.  

[4] Sites such as or  

[5] The current football season is still ongoing, which explains the substantial drop in the number of matches in the last available year of the dataset.  

[6] p-value of t.test < 2.2e-16  

[7] If no colored tile is shown in the graph, it means no matches in the dataset ended with such score. If a colored tile reporting 0% is shown, it means that less than 0.005% (but more than 0) of all the matches ended with such score.  

[8] p-value of one-way ANOVA < 2e-16.  

[9] Stabilization in the average total number of goals per match after the 75th year does not mean a lot in this case because only one national football championship, the UK, has such longevity.  
