Who is the most complete athlete? – An insight with the Mahalanobis distance (sport & data analysis)

Posted on September 21, 2012 by Edwin Grappin in R bloggers

[This article was first published on ProbaPerception, and kindly contributed to R-bloggers].


The Olympic Games finished a couple of days ago: two entire weeks of complete devotion to sport. Unfortunately I didn't get any tickets, but I didn't fail to watch many events on TV and online. I watched the men's decathlon and was very impressed by the all-round quality of these athletes. They have to be able to do everything: sprint (100 m), jump high (high jump), fast (110 m hurdles) and long (long jump), throw heavy (shot put) and light (javelin) things, run longer (400 m) and even longer (1500 m)… It became obvious to me that this was the quintessence of sport: every athlete has to find the perfect balance between these different performances to compete effectively. The event calls on all the qualities of a strong athlete: power, endurance, flexibility, speed…

Is it really true? Is it really the most balanced athlete who wins the decathlon?

I decided to test this assumption with the results of the previous Olympic Games (Beijing, 2008). I kept only the athletes who completed all the disciplines, so that the data set has no missing values. For each discipline I used the points score, which is calculated from the time or the distance achieved by the athlete. If you are interested in the details, you can look at how the scores are calculated at: http://www.iaaf.org/mm/Document/Competitions/ … _Tables_of_Athletics_2011_23299.pdf.

I was very surprised to see that the winner, Bryan Clay, who averaged 879 points per discipline, did poorly in the 400 meters (865 points), the high jump (794 points) and the 1500 meters (522 points). On the other hand, he performed very well in the 100 meters, the 110 meters hurdles and the long jump. Thus, I started wondering whether the decathlon was about power rather than about the balance across all the different areas that I had assumed.

Prasanta Chandra Mahalanobis answered this question some decades ago. In 1936 he introduced a new function to measure the distance separating two observations. The most common distance is the Euclidean distance. However, this distance does not take into account two important elements.

The first element is the variance of the different variables. Consider the high jump and the pole vault: a gap of 30 centimeters between two athletes is huge in the high jump, whereas it is a reasonable difference in the pole vault. The reason is easy to understand: the variance of results is higher in the pole vault than in the high jump. Fortunately, the international athletics federation (which sets the scoring tables) already accounts for most of these differences in variance, although we will see that this is not perfectly true.

The second element, which is even more important, is the correlation between the different disciplines. For example, the following graphic shows a positive correlation between shot put and discus throw, which, if we think about it, makes sense! Thus, if we look for the most complete athlete, there should be no cumulative rewards: we don't want to give athletes too many points when they have performed well in two very similar disciplines. Conversely, if two disciplines are negatively correlated, such as the 1500 meters and the 100 meters, we want to give extra points to athletes who perform well in both. The Mahalanobis distance was created for this purpose.
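
The effect of correlation can be illustrated with a small simulation. The sketch below uses made-up data (not the Beijing scores): two strongly correlated variables, and two hypothetical test points, `along` and `across`, that lie at the same Euclidean distance from the centre but in different directions relative to the correlation. All names in it are assumptions chosen for the illustration.

```r
# Illustrative sketch on simulated data; all names here are hypothetical.
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.9 * x1 + sqrt(1 - 0.9^2) * rnorm(n)  # strongly correlated with x1
sim    <- cbind(x1, x2)
S      <- cov(sim)
centre <- colMeans(sim)

# Two points at the same Euclidean distance from the centre:
along  <- centre + c(1,  1)  # follows the correlation (high on both variables)
across <- centre + c(1, -1)  # goes against it (high on one, low on the other)
pts <- rbind(along = along, across = across)

euclid <- sqrt(rowSums(sweep(pts, 2, centre)^2))
# mahalanobis() returns the squared distance, hence the sqrt()
mahal <- sqrt(mahalanobis(pts, center = centre, cov = S))

print(round(cbind(euclid, mahal), 2))
```

The Euclidean distance treats both points alike, whereas the Mahalanobis distance should come out larger for `across`, the combination that the correlation makes unusual.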

If S is the variance-covariance matrix of the data set, we can formally write the Mahalanobis distance between the vectors x and y as:

$$d(x, y) = \sqrt{(x - y)^{T} S^{-1} (x - y)}$$

Once the matrix S is computed, we can calculate the Mahalanobis score for every athlete, that is, the distance between zero and the athlete's vector of scores in the different disciplines. Unexpectedly, the gold medal would be claimed by Oleksiy Kasyanov, who finished 7th at the Olympic Games, while Bryan Clay, the Olympic champion, would rank only 5th. You can find below two tables: the first is the ranking of the athletes according to the Mahalanobis distance, and the second is the official decathlon ranking. As you can see, there are many differences. Therefore, the decathlon is not the ultimate sport of the complete athlete.
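
For readers who want to see what this computation amounts to, here is a minimal sketch on simulated scores. The real Beijing data are not reproduced here, so `scores` and its column names are placeholders. It computes the distance from the origin by hand, checks it against R's built-in `mahalanobis()` (which returns the squared distance), and ranks the athletes by that score. Whether the published scores are on the squared or the square-root scale is not stated in this post; the ranking is the same either way.

```r
set.seed(42)
# Placeholder data: 24 athletes, 3 made-up disciplines, scores in points
scores <- data.frame(
  sprint = rnorm(24, mean = 850, sd = 50),
  jump   = rnorm(24, mean = 800, sd = 60),
  throw  = rnorm(24, mean = 750, sd = 70)
)
X <- as.matrix(scores)
S <- cov(X)

# By hand: d(x, 0) = sqrt(x' S^{-1} x) for each row x of X
d_manual <- sqrt(rowSums((X %*% solve(S)) * X))

# Built-in: squared distance to the centre (here the origin), then sqrt
d_builtin <- sqrt(mahalanobis(X, center = rep(0, ncol(X)), cov = S))

# Both routes should agree up to floating-point error
stopifnot(isTRUE(all.equal(unname(d_manual), unname(d_builtin))))

# Rank 1 = largest distance from zero, i.e. the highest Mahalanobis score
mahal_rank <- rank(-d_builtin)
print(head(order(mahal_rank)))  # row indices of the top-ranked athletes
```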

Mahalanobis Ranking	Athlete	Mahalanobis score
1	Oleksiy Kasyanov	790.60
2	Andrei Krauchanka	789.16
3	Maurice Smith	767.85
4	Leonel Suárez	754.27
5	Bryan Clay	742.40
6	Yordanis García	737.40
7	Michael Schrader	723.31
8	Romain Barras	709.31
9	Aleksandr Pogorelov	701.18
10	Andres Raja	696.00
11	Roman Sebrle	693.79
12	Aleksey Drozdov	690.95
13	André Niklaus	687.12
14	Massimo Bertocchi	681.92
15	Jangy Addy	681.16
16	Mikk Pahapill	677.04
17	Mikalai Shubianok	667.82
18	Hadi Sepehrzad	653.71
19	Damjan Sitar	651.63
20	Eugene Martineau	637.66
21	Haifeng Qi	631.22
22	Aliaksandr Parkhomenka	630.64
23	Slaven Dizdarevic	607.92
24	Daniel Awde	607.78

Decathlon Ranking	Athlete	Decathlon Score
1	Bryan Clay	8791
2	Andrei Krauchanka	8551
3	Leonel Suárez	8527
4	Aleksandr Pogorelov	8328
5	Romain Barras	8253
6	Roman Sebrle	8241
7	Oleksiy Kasyanov	8238
8	André Niklaus	8220
9	Maurice Smith	8205
10	Michael Schrader	8194
11	Mikk Pahapill	8178
12	Aleksey Drozdov	8154
13	Andres Raja	8118
14	Eugene Martineau	8055
15	Yordanis García	7992
16	Mikalai Shubianok	7906
17	Aliaksandr Parkhomenka	7838
18	Haifeng Qi	7835
19	Massimo Bertocchi	7714
20	Jangy Addy	7665
21	Daniel Awde	7516
22	Hadi Sepehrzad	7483
23	Damjan Sitar	7336
24	Slaven Dizdarevic	7021
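
One way to put a number on how much the two rankings differ is a rank correlation. The sketch below is an addition, not part of the original analysis: it transcribes the ranks from the two tables above and computes Spearman's coefficient with base R's `cor()`. The variable names are chosen for the illustration.

```r
# Mahalanobis rank of each athlete, listed in official decathlon order
# (position 1 = Bryan Clay, ..., position 24 = Slaven Dizdarevic),
# transcribed from the two tables above.
decathlon_rank <- 1:24
mahal_rank <- c(5, 2, 4, 9, 8, 11, 1, 13, 3, 7, 16, 12,
                10, 20, 6, 17, 22, 21, 14, 15, 24, 18, 19, 23)

# Spearman's rank correlation: 1 means identical rankings, 0 no association
rho <- cor(decathlon_rank, mahal_rank, method = "spearman")

# Places gained (positive) or lost (negative) under the Mahalanobis ranking
shift <- decathlon_rank - mahal_rank

print(rho)
print(range(shift))
```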

The code (R):

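# data and data3 are randomly generated for the example
set.seed(2012)  # added so that the example is reproducible
a <- rnorm(24)

# Two positively correlated scores, standing in for shot put and discus throw
data <- data.frame(shotPut = a, discusThrow = 0.5 * a + 0.5 * rnorm(24))

# Six placeholder disciplines: X1 and X2 are correlated, the others independent
data3 <- data.frame(X1 = a, X2 = 0.5 * a + 0.5 * rnorm(24), X3 = rnorm(24), X4 = rnorm(24), X5 = rnorm(24), X6 = rnorm(24))

# Scatter plot of the two scores with the fitted regression line
lm.shotPut <- lm(shotPut ~ discusThrow, data = data)
plot(data$discusThrow, data$shotPut, axes = TRUE, ann = FALSE)
abline(lm.shotPut)
title(ylab = "Score at shot put", xlab = "Score at discus throw", col.lab = rgb(0, 0, 0))

# Covariance matrix, then the squared Mahalanobis distance of each row from the origin
Sigma <- cov(data3)
distance <- mahalanobis(data3, rep(0, ncol(data3)), Sigma, inverted = FALSE)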
