Soccer is all about money (?) – Part 3: More plots & analyses

October 19, 2012

(This article was first published on Rcrastinate, and kindly contributed to R-bloggers)

Let's play around a bit more with the dataset we built in Part 1 of this series.

Now we are going to compare data from more championships in Europe.

Let's check out the first divisions from the following countries:
- Germany (1. Bundesliga)
- England (Premier League)
- Spain (Primera División)
- Italy (Serie A)
- France (Ligue 1)

If you want to replicate the following steps, I assume that you got all data from these championships using the code from Part 1.

I added a column "col" to every championship table. This column holds a character vector with a color name that's different for each championship:
Germany: Black
England: Blue
Spain: Red
Italy: Green
France: Grey
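
As a minimal sketch of how such a column can be added, assuming each championship table is a plain data frame: the two tiny tables below are made-up stand-ins for the real ones from Part 1, and their numbers are purely illustrative.

```r
# Toy stand-ins for the tables built in Part 1 (all values are invented)
GER.tab <- data.frame(Team = c("A", "B"), Value = c(100, 50),
                      Goals.for = c(12, 7), Points = c(15, 9))
UK.tab  <- data.frame(Team = c("C", "D"), Value = c(200, 80),
                      Goals.for = c(14, 6), Points = c(16, 8))

# A single color name is recycled over all rows of the table
GER.tab$col <- "black"
UK.tab$col  <- "blue"

GER.tab
```

Since every table gets the same set of columns, they can be stacked later without any further adjustments.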

First, we create one big table with the data from all mentioned championships.
whole.tab <- rbind(GER.tab, UK.tab, ES.tab, IT.tab, FR.tab)

Now, let's have a look at the distribution of the teams from around Europe (using the previously mentioned color coding).

plot(whole.tab$Value, whole.tab$Goals.for, col = whole.tab$col, pch = 19, bty = "n", xlab = "Value", ylab = "Goals for")
whole.mod <- lm(Goals.for ~ Value, data = whole.tab)
abline(coef = coef(whole.mod), lty = "dashed")
whole.cor <- cor.test(whole.tab$Value, whole.tab$Goals.for)
title(sub = paste("r = ", round(whole.cor$estimate, 3), ", p = ", round(whole.cor$p.value, 8), sep = ""))


Isn't that nice - and so many colors :)

Guess which team is represented by the red dot on the far right of the plot...

What I'm interested in is the following question: Can we compare the championships included in our dataset in terms of the Value-Goals correlation? Sure, that should work. I compute Pearson's correlations for each championship, put the correlation coefficients into a vector and then plot them.

library(lattice)  # provides dotplot()
tab.l <- list(GER.tab, UK.tab, ES.tab, IT.tab, FR.tab)
cors <- c()
# Collect one Pearson coefficient per championship
for (tab in tab.l) {
    cors <- c(cors, cor.test(tab[,"Value"], tab[,"Goals.for"])$estimate)
}
names(cors) <- c("GER", "UK", "ES", "IT", "FR")
dotplot(cors, xlab = "Value/Goals correlation coefficient (Pearson)", cex = 1.5)
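
The same loop could probably be written a bit more compactly with sapply(). This is only a sketch with invented example tables (toy.l and toy.cors are hypothetical names, not part of the original code). It uses cor() directly, which returns the same Pearson coefficient as cor.test()$estimate.

```r
library(lattice)  # provides dotplot()

# Invented example tables; the real ones come from Part 1
toy.l <- list(
  GER = data.frame(Value = c(10, 20, 30, 40), Goals.for = c(5, 9, 8, 14)),
  UK  = data.frame(Value = c(15, 25, 35, 45), Goals.for = c(7, 6, 12, 10))
)

# One Pearson coefficient per table; the names are taken from the list
toy.cors <- sapply(toy.l, function(tab) cor(tab$Value, tab$Goals.for))

# print() makes sure the lattice plot is drawn even inside a sourced script
print(dotplot(toy.cors, xlab = "Value/Goals correlation coefficient (Pearson)", cex = 1.5))
```

Naming the list elements up front saves the separate names() call.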

The result:
Please note that this is only a visual comparison of the correlations between the value of a team and the number of goals it has scored so far in its national championship (only 7 or 8 games into the season). This is by no means a watertight statistical analysis! Nevertheless, let's merrily interpret this thing :)
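
For readers who want something a bit more formal than eyeballing the dots, one common option is a test for the difference between two independent correlations based on Fisher's z-transformation. The following is only a sketch: compare.cors is a hypothetical helper that is not part of the original analysis, and the numbers passed to it are illustrative, not taken from the real data.

```r
# Hypothetical helper: two-sided test for the difference between two
# independent Pearson correlations, using Fisher's z-transformation
compare.cors <- function(r1, n1, r2, n2) {
  z1 <- atanh(r1)                           # Fisher z of the first coefficient
  z2 <- atanh(r2)                           # Fisher z of the second coefficient
  se <- sqrt(1 / (n1 - 3) + 1 / (n2 - 3))   # standard error of z1 - z2
  z  <- (z1 - z2) / se
  p  <- 2 * pnorm(-abs(z))                  # two-sided p-value
  c(z = z, p = p)
}

# Illustrative numbers only (not the coefficients from the plot)
res <- compare.cors(r1 = 0.7, n1 = 20, r2 = 0.1, n2 = 20)
res
```

With only about 20 teams per championship, even fairly large differences between two coefficients may not reach significance.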

In Spain, the value-goals correlation is the highest, closely followed by the Premier League and the German Bundesliga. In Italy, however, the correlation between the value of a team and the goals it scored is very low. Only around 0.1! That's quite surprising to me.

I had an idea for another little analysis. The goal in soccer is to gain points in your national championship, right? You can only get points by scoring goals. But sometimes, if you score one or more goals, your opponent scores even more goals than you do. Then, the goals you scored are "wasted" somehow (because you don't get any points for that game). So, as a measure of effectiveness, we can correlate the goals a team scored and the points it has achieved so far. The higher this correlation is, the more effectively the teams in that championship turn goals into points. Now let's see in which championship this correlation between scored goals and points is the highest:

cors.gp <- c()
for (tab in tab.l) {
    cors.gp <- c(cors.gp, cor.test(tab[,"Goals.for"], tab[,"Points"])$estimate) }
names(cors.gp) <- c("GER", "UK", "ES", "IT", "FR")
dotplot(cors.gp, xlab = "Goals/Points correlation coefficient (Pearson)", cex = 1.5)


Well, ain't that a surprise - the Germans are the most effective :) Of course, these correlation coefficients are much higher than the ones we observed in the value-goals correlations. In France, however, the number of goals is the least predictive of having many points.
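
To get a feeling for how uncertain such a coefficient is this early in the season, the confidence interval that cor.test() already returns can be inspected. A small sketch with invented goals and points for six teams (not real data):

```r
# Invented goals and points for six teams (not real data)
goals  <- c(5, 9, 8, 14, 11, 7)
points <- c(4, 10, 9, 16, 9, 8)

ct <- cor.test(goals, points)   # Pearson is the default method
ct$estimate                     # the correlation coefficient itself
ct$conf.int                     # 95% confidence interval around it
```

With so few observations the interval tends to be wide, so small differences between championships should not be over-interpreted.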

Maybe I will present some more analyses with the soccer dataset soon. Or I'll present some other stuff I did; we'll see.





