
It’s been some time since my last post on football. And we’re talking about European soccer here.

So I finally managed to write some functions which allow me to extract player stats from www.transfermarkt.de. The site tracks lots of stats in the world of soccer. For each player, there is information about the dominant foot, height, age, the estimated market value of the player and a load more.

I extracted stats for all registered players from the five major national championships in Europe: the Bundesliga (Germany), Ligue 1 (France), the Premier League (England), the Primera División (Spain) and Serie A (Italy). Now I have information on position, dominant foot, age, height and estimated market value for 2628 players.

The information is in a dataframe called “eu.players”.

So let’s see if the age of a player predicts his market value and if so, in which way.

The first step is a simple linear regression model. We predict the value of a player (in million Euros) by his age:

> age.val.mod <- lm(val.mill ~ age, data = eu.players)
> anova(age.val.mod)

Analysis of Variance Table

Response: val.mill
            Df Sum Sq Mean Sq F value Pr(>F)
age          1     61  60.836  1.1445 0.2848
Residuals 2619 139209  53.153

That’s a little bit disappointing. According to the linear model, there is no significant relation between the age and the value of a player. Let’s plot this relation.
[Figure: player value (million Euros) plotted against age]

This plot strongly suggests a different regression model: it seems we need to include a non-linear term. Maybe players get more valuable over time but then lose value again.
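To see why a straight line can miss this kind of pattern entirely, here is a minimal sketch on made-up numbers (not the transfermarkt data). The values rise and fall symmetrically around an assumed peak at age 26, and the linear slope comes out as essentially zero even though value depends on age perfectly. The names toy and toy.lin are purely illustrative.

```r
# Made-up, noise-free data (NOT eu.players): an inverted U peaking at age 26
toy <- data.frame(age = 16:36)
toy$val.mill <- 10 - 0.1 * (toy$age - 26)^2

# Straight-line fit: the rise before 26 and the fall after it cancel out
toy.lin <- lm(val.mill ~ age, data = toy)
slope <- coef(toy.lin)[["age"]]
r2 <- summary(toy.lin)$r.squared

# Both numbers are zero up to floating-point error
round(c(slope = slope, r.squared = r2), 6)
```

A squared age term lets the model capture that rise and fall.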

> age.val.mod2 <- lm(val.mill ~ age + I(age^2), data = eu.players)
> anova(age.val.mod2)
Analysis of Variance Table

Response: val.mill
            Df Sum Sq Mean Sq  F value Pr(>F)
age          1     61    60.8   1.1979 0.2738
I(age^2)     1   6248  6248.5 123.0332 <2e-16 ***
Residuals 2618 132960    50.8
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> age.val.mod3 <- lm(val.mill ~ age + I(age^2) + I(age^3), data = eu.players)
> anova(age.val.mod2, age.val.mod3)
Analysis of Variance Table

Model 1: val.mill ~ age + I(age^2)
Model 2: val.mill ~ age + I(age^2) + I(age^3)
  Res.Df    RSS Df Sum of Sq     F   Pr(>F)
1   2618 132960
2   2617 132459  1    501.54 9.909 0.001663 **
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

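The quadratic term for age is highly significant, so including it clearly improves on the purely linear model. The model comparison done by anova(age.val.mod2, age.val.mod3) shows that a cubic term improves the fit further (F = 9.909, p = 0.0017), though the gain is much smaller than the one from the quadratic term. The rest of this post works with the quadratic model age.val.mod2.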

Now, let’s plot this relationship again, including model estimates for the quadratic term. Also, we use limits for the y-axis because Lionel Messi has an estimated worth of 120 million Euros and this “crushes” the majority of the players down to the x-axis.

[Figure: player value against age, with the fitted quadratic curve and a limited y-axis]
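A plot of this kind could be drawn roughly as follows. This is only a sketch with base graphics on simulated numbers: sim.players is a stand-in, not the real eu.players, and the y-axis limit of 20 is an arbitrary choice for the illustration.

```r
# Simulated stand-in for eu.players (NOT the real data): value peaks in the mid-20s
set.seed(1)
sim.players <- data.frame(age = sample(17:38, 500, replace = TRUE))
sim.players$val.mill <- pmax(0.1,
  8 - 0.08 * (sim.players$age - 26)^2 + rnorm(500, sd = 3))

# Quadratic model, same formula as age.val.mod2
sim.mod2 <- lm(val.mill ~ age + I(age^2), data = sim.players)

# Predicted value for every age on a grid
age.grid <- data.frame(age = 17:38)
age.grid$fit <- predict(sim.mod2, newdata = age.grid)

# Scatter plot with a capped y-axis, then the fitted curve on top
plot(val.mill ~ age, data = sim.players, ylim = c(0, 20),
     pch = 16, col = "grey50",
     xlab = "Age", ylab = "Estimated market value (million Euros)")
lines(age.grid$age, age.grid$fit, lwd = 2, col = "red")
```

Points above the ylim cap are simply not drawn; they still enter the model fit, which is the behaviour wanted for an outlier like Messi.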

We get a quite clear result: The relationship between age and value of a football player seems to be a quadratic one. The “golden age” is 26 years. We can extract this value from the predicted values of model age.val.mod2:
> df <- data.frame(age = sort(unique(eu.players$age)))
> df$age[which.max(predict(age.val.mod2, newdata = df))]
[1] 26
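The peak can also be read off the coefficients directly: a quadratic b0 + b1*age + b2*age^2 with b2 < 0 has its maximum at age = -b1 / (2 * b2). The sketch below checks this on made-up, noise-free data with a known peak at 26, since the fitted coefficients of age.val.mod2 are not printed in this post. peak.age is a hypothetical helper, not part of the original analysis.

```r
# Hypothetical helper: age at which a fitted quadratic model peaks
peak.age <- function(mod) {
  b <- coef(mod)
  stopifnot(b[["I(age^2)"]] < 0)  # only a downward-opening parabola has a maximum
  -b[["age"]] / (2 * b[["I(age^2)"]])
}

# Made-up data (NOT eu.players) with a known peak at age 26
toy <- data.frame(age = 16:36)
toy$val.mill <- 10 - 0.1 * (toy$age - 26)^2
toy.quad <- lm(val.mill ~ age + I(age^2), data = toy)

peak <- peak.age(toy.quad)
peak
```

On the real data, peak.age(age.val.mod2) would give a fractional age, whereas the which.max() approach returns the best whole-year age that actually occurs in the data.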

Obviously, there’s a lot more one can do with the kind of data we have. But this has to wait for some other time. I’ll start with a teaser: Let’s see if there are some clear relationships between “footedness” (dominant foot) and position…

[Figure: mosaic plot of dominant foot by position]

This mosaic plot and the standardized residuals obtained from a chi-square test suggest that there are more players with a dominant left foot in defense than would be expected if footedness were independent of position. Also, players with no real dominant foot (“both”) are overrepresented in midfield and forward positions. Being two-footed should give them the opportunity for a more flexible style of play, a competence mostly needed in forward positions.
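For readers who want to try this kind of analysis, here is a minimal sketch of a chi-square test with standardized residuals and a shaded mosaic plot. The counts are invented for illustration and are not the transfermarkt numbers; the table name foot.pos and the three position labels are assumptions as well.

```r
# Invented counts (NOT the real data): rows = dominant foot, columns = position
foot.pos <- as.table(matrix(
  c( 60,  40,  30,
    120, 150, 110,
     10,  25,  20),
  nrow = 3, byrow = TRUE,
  dimnames = list(foot = c("left", "right", "both"),
                  position = c("Defense", "Midfield", "Forward"))))

# Chi-square test of independence between foot and position
chi <- chisq.test(foot.pos)

# Standardized residuals: positive = more players than expected under independence
round(chi$stdres, 2)

# Mosaic plot with cells shaded by their residuals from the independence model
mosaicplot(foot.pos, shade = TRUE, main = "Dominant foot by position (invented counts)")
```

Cells with large positive residuals are the overrepresented combinations, which is how statements like "more left-footed defenders than expected" are read off the output.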

More insights from this dataset coming soon…