Let's get back to the age-value relationship from my last post. I did some more plotting to see for which position this inverted U-shaped relationship is strongest. Please note that I use a data frame called eu.players throughout this post, which holds football player information downloaded from transfermarkt.de.

But first, let us get back to the original graph.

As you can see, very young players are not worth a lot of money. The quadratic function then peaks at 26 years and falls off again. (I limited the y-axis because the most valuable players would otherwise make the plot unreadable.) The dashed line in the plot is the regression line from the simple linear regression model I introduced in my last post.
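
To make the idea of a quadratic peak concrete, here is a minimal sketch on synthetic data. The toy data frame and its numbers are made up for illustration and are not the eu.players data; the model formula is an assumption about how such a curve could be fitted, not necessarily the one behind the graph above.

```r
# Synthetic stand-in for eu.players; all numbers below are made up.
set.seed(1)
age <- sample(17:38, 500, replace = TRUE)
val.mill <- 5 - 0.05 * (age - 26)^2 + rnorm(500, sd = 0.5)
toy <- data.frame(age = age, val.mill = val.mill)

# Quadratic model: value = b0 + b1 * age + b2 * age^2
fit <- lm(val.mill ~ age + I(age^2), data = toy)
b <- coef(fit)

# The vertex of the parabola lies at -b1 / (2 * b2);
# it is a maximum (an inverted U) when b2 is negative.
peak.age <- unname(-b["age"] / (2 * b["I(age^2)"]))
peak.age
```

A negative coefficient on the squared term is what gives the curve its inverted U shape; a plain linear model can only capture an overall upward or downward trend.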

Now, let us have a look at the age distributions for the different positions (player market values are not yet included in these plots).

library(RColorBrewer)
library(plotrix)

# One vector of ages per position
hist.list <- list(eu.players[eu.players$pos2 == "Goal", "age"],
                  eu.players[eu.players$pos2 == "Def", "age"],
                  eu.players[eu.players$pos2 == "Midf", "age"],
                  eu.players[eu.players$pos2 == "Forw", "age"])

# Stacked histogram (beside = FALSE) on the density scale (freq = FALSE)
multhist(hist.list, beside = FALSE, freq = FALSE,
         col = brewer.pal(4, "Paired"), border = "#00000000",
         main = "Age by Position")
legend(x = "topright",
       legend = c("Goal", "Defence", "Midfield", "Forward"),
       fill = brewer.pal(4, "Paired"),
       box.col = "#00000000", border = "#00000000")


We can visualize this distribution in many other ways. Let us use density plots next.

# Draw the first density with plot(), then add the others with lines()
plot(density(eu.players[eu.players$pos2 == "Goal", "age"]),
     ylim = c(0, 0.09), col = brewer.pal(4, "Paired")[1],
     lwd = 5, main = "Age density by position",
     xlab = "Age", bty = "n")
lines(density(eu.players[eu.players$pos2 == "Def", "age"]),
      col = brewer.pal(4, "Paired")[2], lwd = 5)
lines(density(eu.players[eu.players$pos2 == "Midf", "age"]),
      col = brewer.pal(4, "Paired")[3], lwd = 5)
lines(density(eu.players[eu.players$pos2 == "Forw", "age"]),
      col = brewer.pal(4, "Paired")[4], lwd = 5)
legend(x = "topright",
       legend = c("Goal", "Defence", "Midfield", "Forward"),
       col = brewer.pal(4, "Paired"),
       box.col = "#00000000", border = "#00000000",
       lty = "solid", lwd = 5)


And finally some box plots.

boxplot(age ~ pos2, data = eu.players,
        col = brewer.pal(4, "Paired"), boxcol = "#00000000",
        notch = TRUE, pch = 4, ylab = "Age", xlab = "Position",
        names = c("Goal", "Defence", "Midfield", "Forward"),
        main = "Age by Position")


Each of these plot types visualizes the same information. We can see that goalies are a little older overall and that their age distribution is much "flatter". You can see this quite clearly in the stacked histogram and the density graphs. Just as a side note: the outlier (marked by a cross) among the defenders is Javier Zanetti (39 years old), playing for Inter Milan. The two outliers among the midfielders are Ryan Giggs (age 39) and Paul Scholes (age 38), both playing, of course, for Manchester United.
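
For readers wondering how a box plot decides what counts as an outlier: by default, boxplot() draws a point individually when it lies more than 1.5 times the box height (roughly the interquartile range) beyond the box. The sketch below shows this with boxplot.stats() on a handful of made-up ages; the numbers are purely illustrative and not taken from eu.players.

```r
# Made-up ages for illustration; not taken from eu.players.
ages <- c(19, 21, 22, 23, 24, 24, 25, 26, 26, 27, 28, 29, 30, 31, 39)

# coef = 1.5 is the default, matching the default range of boxplot()
stats <- boxplot.stats(ages)

stats$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out    # values that would be drawn as individual outlier points
```

Because the rule depends on the spread of each group, the same age can be an outlier in one group and an ordinary value in another.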

With the following boxplot, we'll have a look at the age distributions in the five major European championships.

boxplot(age ~ league, data = eu.players,
        col = brewer.pal(4, "Paired"),
        boxcol = "#00000000", notch = TRUE, pch = 4,
        ylab = "Age", xlab = "Championship",
        names = c("DE", "FR", "UK", "ES", "IT"),
        bty = "n", main = "Age by Championship")


Obviously, the German Bundesliga has the youngest players. Only one player is older than 36 years, and he counts as an outlier: Oka Nikolov, goalkeeper with Eintracht Frankfurt. He wouldn't count as an outlier in the Premier League. Taking the Premier League as a whole, Giggs and Scholes are no longer outliers, but Brad Friedel, goalie with the Spurs, is.

So, we learned that age distributions are equal neither across positions nor across championships. But what about the age-value relationship mentioned above?

Here is the plan: we will plot lowess lines (see ?lowess for details). A lowess smoother fits a weighted regression within a local span of the data around each x-value, so the fitted curve can change shape as x increases. Let's first compare the age-value relationships for the different positions.

First, we extract the data for every position and get rid of NAs.

goal <- eu.players[eu.players$pos2 == "Goal" &
                   !is.na(eu.players$val.mill) &
                   !is.na(eu.players$age), ]
def <- eu.players[eu.players$pos2 == "Def" &
                  !is.na(eu.players$val.mill) &
                  !is.na(eu.players$age), ]
midf <- eu.players[eu.players$pos2 == "Midf" &
                   !is.na(eu.players$val.mill) &
                   !is.na(eu.players$age), ]
forw <- eu.players[eu.players$pos2 == "Forw" &
                   !is.na(eu.players$val.mill) &
                   !is.na(eu.players$age), ]
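
The same kind of extraction can also be written more compactly. The following is only a sketch of an alternative, using a small made-up data frame (toy, with hypothetical values) that borrows the column names pos2, age and val.mill from this post.

```r
# Toy stand-in for eu.players (hypothetical values).
toy <- data.frame(
  pos2     = c("Goal", "Def", "Def", "Midf", "Forw", "Forw"),
  age      = c(30, 24, NA, 26, 22, 28),
  val.mill = c(1.5, 2.0, 3.0, NA, 4.0, 2.5)
)

# Keep only the rows where both age and value are present ...
complete <- toy[complete.cases(toy[, c("age", "val.mill")]), ]

# ... and split them into one data frame per position.
by.pos <- split(complete, complete$pos2)
names(by.pos)
by.pos$Def
```

With this layout, by.pos$Def plays the role of def above, and the approach carries over to any other grouping column.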

Now, we plot the lowess lines for each position sub-dataset, all in the same plotting window.

# Set up an empty plotting window (type = "n"), then add one smoother per position
plot(goal$age, goal$val.mill, bty = "n",
     type = "n", ylim = c(0, 3.5), xlim = c(16, 42),
     main = "Age by Value smoothers, divided by Position",
     xlab = "Age", ylab = "Value")
lines(lowess(goal$age, goal$val.mill), lwd = 4,
      col = brewer.pal(4, "Paired")[1])
lines(lowess(def$age, def$val.mill), lwd = 4,
      col = brewer.pal(4, "Paired")[2])
lines(lowess(midf$age, midf$val.mill), lwd = 4,
      col = brewer.pal(4, "Paired")[3])
lines(lowess(forw$age, forw$val.mill), lwd = 4,
      col = brewer.pal(4, "Paired")[4])
legend(x = "topright",
       legend = c("Goal", "Defence", "Midfield", "Forward"),
       col = brewer.pal(4, "Paired"),
       box.col = "#00000000", border = "#00000000",
       lty = "solid", lwd = 5)


Several things can be derived from the graph:
• The inverted U-curve is very pronounced for forwards and midfielders, slightly less pronounced for defenders and least pronounced for goalies.
• Goalies "peak" later. For all outfield players, the peak is around 26 years. Even defenders don't have to be more experienced to be more valuable. Goalies, however, peak at around 30 years and are more valuable than outfield players in later years (from 35 to 40).
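
Peak ages like these can be read off the plot, but they can also be computed from the smoother itself. Here is a rough sketch on synthetic data with a known peak at age 26; the data are made up and the result will not match the real curves.

```r
# Synthetic data with a known peak at age 26 (made up for illustration).
set.seed(42)
age <- runif(400, min = 17, max = 40)
val.mill <- 3 * exp(-((age - 26) / 6)^2) + rnorm(400, sd = 0.3)

# lowess() returns a list with the sorted x values and the smoothed y values
sm <- lowess(age, val.mill)

# The age at which the smoothed curve reaches its maximum
peak.age <- sm$x[which.max(sm$y)]
peak.age
```

The width of the local span is controlled by the f argument of lowess() (default 2/3); smaller values follow the data more closely and can move the estimated peak.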

To see if the age-value relationships differ between the championships around Europe, I also divided the dataset into one sub-dataset per championship. Here is the code for the respective plot:

# Assumes buli, ligue1, preml, primd and serA hold the per-championship subsets
# and that `palette` holds an RColorBrewer palette name; neither definition is
# shown in this post.
plot(buli$age, buli$val.mill, bty = "n",
     type = "n", ylim = c(0, 5.5), xlim = c(16, 42),
     main = "Age by Value smoothers, divided by Championship",
     xlab = "Age", ylab = "Value")
lines(lowess(buli$age, buli$val.mill), lwd = 4,
      col = brewer.pal(5, palette)[1])
lines(lowess(ligue1$age, ligue1$val.mill), lwd = 4,
      col = brewer.pal(5, palette)[2])
lines(lowess(preml$age, preml$val.mill), lwd = 4,
      col = brewer.pal(5, palette)[3])
lines(lowess(primd$age, primd$val.mill), lwd = 4,
      col = brewer.pal(5, palette)[4])
lines(lowess(serA$age, serA$val.mill), lwd = 4,
      col = brewer.pal(5, palette)[5])
legend(x = "topright",
       legend = c("Bundesliga", "Ligue 1",
                  "Premier League", "Primera División", "Serie A"),
       col = brewer.pal(5, palette), box.col = "#00000000",
       border = "#00000000", lty = "solid", lwd = 5)

As with the different positions, the relationships differ quite distinctly between the championships, especially for the Premier League, whose curve peaks much higher. Presumably, this is only because the mean player value is also highest in the Premier League, which leaves the curve more room to peak. In Serie A (Italy), the peak is shifted slightly to the right, in the Bundesliga slightly to the left. These championships seem to have differing preferences regarding the age of their players. In France (Ligue 1), the relationship is least pronounced. However, this does not mean that older players are more valuable there, as was the case for goalies.

Well, enough football for now. I’ll see if I come back to this dataset some other time… also, feel free to propose football-related analyses…