Clustering EPL teams using k-means clustering

[This article was first published on R – Statistical Odds & Ends, and kindly contributed to R-bloggers.]

I recently got a hold of team rankings for the English Premier League (EPL) for the last 10 years (data was manually recorded from this Google sheet, available here in .csv format). I thought this would be a good opportunity to test out clustering of EPL teams and to answer the question: is there a group of teams which is a cut above the rest? Any longtime fan of the EPL will probably answer “yes”; now we have some analysis to back that up.

The R code I used for this analysis is available here.

First, let’s plot a line graph of each team’s ranking over the 10 seasons:

This is quite a mess! There is no clear pattern; in fact the figure looks more like a nice piece of modern art. Note also that even though only 20 teams play in the EPL each season, the graph above actually depicts 36 teams in total: not every team has played in every season of the EPL.

To make the figure more interpretable, let’s perform k-means clustering on the teams. Each team has 10 features associated with it (its ranking in each season); we treat these as a vector in \mathbb{R}^{10} and perform k-means clustering in that space.

Before we can do k-means, we need to make sure that each team has some ranking value for each of the 10 seasons. What value should we give teams in a season when they did not play in the EPL? For simplicity I gave them a rank of 21. (I tried a rank of 25 as well and the results were almost identical. A more principled way might be to use the rank that they achieved in the lower division, but getting that data would involve a fair amount of work.)
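The fill-in step can be sketched as follows. This is only an illustration, not the code linked above: the data frame `standings`, its column names, and the team labels and ranks are all made up, and the real data may be laid out differently.

```r
# Hypothetical long-format data: one row per team per season actually played.
# Team labels and ranks are invented for illustration.
standings <- data.frame(
  team   = c("A", "A", "A", "B", "B", "C"),
  season = c(1, 2, 3, 1, 3, 2),
  rank   = c(1, 2, 1, 18, 17, 20)
)

# Pivot to a team x season matrix; (team, season) pairs with no row become NA
rank_mat <- tapply(standings$rank,
                   list(standings$team, standings$season),
                   min)

# Seasons spent outside the league get the placeholder rank of 21
rank_mat[is.na(rank_mat)] <- 21
rank_mat
```

Changing the single `21` to `25` is all it takes to rerun the sensitivity check mentioned above.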

First I ran k-means clustering with 2 clusters. We plot the same figure as above, but with the line colors indicating cluster memberships:

Seven teams belong to the cluster in blue, while the remaining 29 teams belong to the cluster in red. With this coloring, a pattern is obvious: these 7 teams frequently occupy most of the top 7 positions of the league table, and all of them played in all 10 seasons! (Only one other team played in all 10 seasons. Can you guess which it is?) This observation lends credence to the hypothesis that these 7 teams are indeed a cut above the rest.
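A two-cluster fit of this kind can be sketched with `kmeans` from base R. The rank matrix below is simulated (7 made-up "strong" teams and 29 others), so it stands in for the real data only in shape; the `nstart` value is my own choice rather than something taken from the original analysis.

```r
set.seed(1)

# Simulated stand-in for the 36 x 10 rank matrix (NOT the real EPL data):
# 7 "strong" teams ranked 1-7 each season, 29 others ranked 8-21.
strong <- matrix(sample(1:7,  7 * 10,  replace = TRUE), nrow = 7)
others <- matrix(sample(8:21, 29 * 10, replace = TRUE), nrow = 29)
rank_mat <- rbind(strong, others)
rownames(rank_mat) <- paste0("team", 1:36)

# k-means with 2 centers; nstart > 1 reruns the algorithm from several
# random starts and keeps the fit with the lowest within-cluster sum of squares
fit2 <- kmeans(rank_mat, centers = 2, nstart = 20)

table(fit2$cluster)                  # cluster sizes
split(names(fit2$cluster), fit2$cluster)  # team labels in each cluster
```

The cluster labels (1 or 2) are arbitrary, so it is the membership, not the label, that should be compared across runs.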

Another way to view the clustering is by looking at the teams on a two-dimensional plot where the x axis and the y axis represent the 1st and 2nd principal component scores:

In this space there is an obvious group of points on the right corresponding to the good teams. Here is the same plot with the names of the teams:
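One way to produce such a view is with `prcomp`, again shown here on simulated ranks rather than the real data. Leaving the columns unscaled is an assumption on my part; since every column is a rank on the same scale, it seemed the natural default.

```r
set.seed(1)

# Simulated 36 x 10 rank matrix (illustrative only, not the real EPL data)
rank_mat <- rbind(
  matrix(sample(1:7,  7 * 10,  replace = TRUE), nrow = 7),
  matrix(sample(8:21, 29 * 10, replace = TRUE), nrow = 29)
)
rownames(rank_mat) <- paste0("team", 1:36)
fit2 <- kmeans(rank_mat, centers = 2, nstart = 20)

# PCA on the centered ranks; columns are not rescaled (an assumption)
pca <- prcomp(rank_mat, center = TRUE, scale. = FALSE)
scores <- pca$x[, 1:2]   # 1st and 2nd principal component scores

# Scatter plot colored by cluster, with team labels above each point
plot(scores, col = fit2$cluster, pch = 19, xlab = "PC1", ylab = "PC2")
text(scores, labels = rownames(rank_mat), pos = 3, cex = 0.6)
```

The sign of a principal component is arbitrary, so the strong teams may land on the left rather than the right in a given run.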

I guess there are no real surprises in the cluster on the right. We can repeat the analysis above with 3 clusters:

Again, the 7 best teams form a cluster of their own. The rest of the teams are now split into 2 groups, and from the line plot above, it seems that one group consists of teams which played in the EPL for only a few seasons, while the other consists of teams which are more consistently in the EPL but usually finish in the middle to lower part of the standings. The clusters are still pretty clear on the principal components plot:

I’ve only done the plots for 2 and 3 clusters. Feel free to take the code and see what happens with more clusters!
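For anyone who wants to experiment, a loop over several values of k might look like the sketch below. It uses simulated ranks, and comparing the total within-cluster sum of squares across k is my suggestion for a starting point, not part of the original analysis.

```r
set.seed(1)

# Simulated 36 x 10 rank matrix (illustrative only, not the real EPL data)
rank_mat <- rbind(
  matrix(sample(1:7,  7 * 10,  replace = TRUE), nrow = 7),
  matrix(sample(8:21, 29 * 10, replace = TRUE), nrow = 29)
)

# Fit k-means for k = 2, ..., 6 and record the total within-cluster
# sum of squares of each fit
ks  <- 2:6
wss <- sapply(ks, function(k) {
  kmeans(rank_mat, centers = k, nstart = 20)$tot.withinss
})

# A sharp flattening of this curve suggests extra clusters add little
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```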
