# Football by the numbers

Salvino A. Salvaggio [1] [2] [3]
In this blog I publish data analysis cases based on the R statistical language. No statistical or mathematical theory here, no discussions of the R language, no software tutorials, but only concrete case studies using existing R tools. To download R code and dataset, click here (4.0 MB).

Over the last 40-50 years, the international spread of the passion for football has proved to be one of the most pervasive social phenomena. What was regarded in the 1960s and ’70s as an amusing form of national craziness typical of Brazilians, the British and Italians is now shared by a vast majority of the Earth's population (including orbiting astronauts, who are regularly kept informed of match results). As I am an absolute outsider to that trend, I randomly scraped the web[4] in search of results and scores, and ended up with a dataset of approx. 400,000 first-league matches (381,257 after a bit of cleaning) that I don’t have a precise idea of what to do with. A clear advantage of this outsider position is that I can dig into the topic without an ounce of positive or negative preconceived ideas. A clear disadvantage, however, is that I may not even think of analytical approaches that would be obvious to a football fan or expert.

…The dataset is very international, comprising matches from 60 different countries spread over 6 continents and representing all FIFA regions.

Continent | Matches | FIFA region | Matches | Top countries | Matches |
---|---|---|---|---|---|
Africa | 16748 | AFC | 28065 | United Kingdom | 66454 |
Asia | 25062 | CAF | 16748 | France | 26417 |
Europe | 292733 | CONCACAF | 13853 | Italy | 24716 |
North America | 10649 | CONMEBOL | 25475 | Spain | 23140 |
Oceania | 7386 | OFC | 560 | Netherlands | 17790 |
South America | 28679 | UEFA | 296556 | Germany | 15854 |

Result | Frequency |
---|---|
Home | 189886 |
Draw | 97563 |
Visiting | 93808 |
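
To make the size of the home advantage easier to read, the counts in the table above can be turned into percentages. The snippet below is only a minimal sketch: it retypes the three counts by hand instead of reading them from the downloadable dataset, and the variable names (`outcomes`, `shares`, `home_to_visiting`) are my own choices for illustration, not necessarily those used in the original R code.

```r
# Match outcome counts, copied by hand from the table above
outcomes <- c(Home = 189886, Draw = 97563, Visiting = 93808)

# Share of each outcome over all matches, in percent
shares <- round(100 * prop.table(outcomes), 1)
print(shares)

# How many home wins there are for each visiting win
home_to_visiting <- outcomes[["Home"]] / outcomes[["Visiting"]]
print(round(home_to_visiting, 2))
```

With these counts, home wins account for roughly half of all matches, and there are about two home wins for every visiting win.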

Continent | Home | Draw | Visiting |
---|---|---|---|
Africa | 8736 | 4706 | 3306 |
Asia | 11141 | 6828 | 7093 |
Europe | 148413 | 73557 | 70763 |
North America | 4976 | 2693 | 2980 |
Oceania | 3394 | 1755 | 2237 |
South America | 13226 | 8024 | 7429 |
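
Raw counts are hard to compare across continents because Europe dominates the dataset, so row-wise percentages are more telling. The following is a sketch under the same assumptions as before: the counts are retyped from the table above, and the object names (`results`, `row_shares`, `top_home`) are hypothetical, not taken from the original analysis.

```r
# Outcome counts per continent, copied by hand from the table above
results <- matrix(
  c(  8736,  4706,  3306,
     11141,  6828,  7093,
    148413, 73557, 70763,
      4976,  2693,  2980,
      3394,  1755,  2237,
     13226,  8024,  7429),
  ncol = 3, byrow = TRUE,
  dimnames = list(
    continent = c("Africa", "Asia", "Europe",
                  "North America", "Oceania", "South America"),
    outcome   = c("Home", "Draw", "Visiting")
  )
)

# Percentage of each outcome within each continent (each row sums to ~100)
row_shares <- round(100 * prop.table(results, margin = 1), 1)
print(row_shares)

# Continent with the largest share of home wins
top_home <- names(which.max(row_shares[, "Home"]))
print(top_home)
```

Using `margin = 1` normalizes each row by its own total, so a continent with few matches is put on the same footing as Europe.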

…

[1] This document is the result of an analysis conducted by the author and exclusively reflects the author’s positions. Therefore, this document does not involve, either directly or indirectly, any of the employers, past and present, of the author. The author also declares not to have any conflict of interest with companies, institutions, organizations, authorities related to the football eco-system.

[2] Contact: salvino [dot] salvaggio [at] gmail [dot] com

[3] In this document, football refers to the European definition, which is soccer in the USA.

[4] Sites such as http://www.calciostoria.it/ or http://www.calcio.com/

[5] Current football season is still ongoing, which explains the substantial drop in the number of matches of the last available year in the dataset.

[6] p-value of t.test < 2.2e-16

[7] If no colored tile is shown in the graph, it means no matches in the dataset ended with such score. If a colored tile reporting 0% is shown, it means that less than 0.005% (but more than 0) of all the matches ended with such score.

[8] Pr-value of one-way ANOVA < 2e-16.

[9] Stabilization in the average total number of goals per match after the 75th year does not mean a lot in this case because only one national football championship, the UK, has such longevity.

