plyr, ggplot2 and triathlon results, part II

Posted on October 13, 2011 by CL in R bloggers | 0 Comments

[This article was first published on Christophe Ladroue » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I ended my previous post by mentioning how one could imagine other ways of looking at the triathlon data with plyr and ggplot2. I couldn’t help but carry on playing with it so here are more stats and graphs from the same dataset: the results of a local sprint triathlon. This post will have a slightly more statistical bent to it.

First we load the data and turn the character-coded times into numbers (time in minutes):
(see the previous post for more information about this part)

Select All Code:

library("ggplot2")
library("stringr")
 
times<-read.table("stratford.csv",header=TRUE,sep="\t",stringsAsFactors=FALSE,na.strings="")
stringToMinutes<-colwise(
  function(s) unlist(lapply(str_split(s,":"),function(s) sum(as.numeric(s)*c(60,1,1/60)))),
  .(Swim,Cycle,Run,Total))
 
times<-ddply(times,.(Position,StartingPosition,Category),stringToMinutes)
times<-subset(times,Swim+Cycle+Run<Total+5)
head(times)
  Position StartingPosition Category Swim Cycle  Run Total
1        1              441        F 6.07  36.8 19.2  62.0
2        2                5        G 5.92  37.4 20.3  63.6
3        3               26        D 6.47  37.6 19.5  63.6
5        5              445        E 6.72  37.4 21.6  65.8
6        6               32        F 6.42  39.5 21.1  67.0
7        7                2        H 6.03  38.8 22.3  67.1

The data is then melted to be in a suitable format for ggplot2:

Select All Code:

meltedTimes<-melt(
  times,
  c("StartingPosition","Category","Position"),
  c("Swim","Cycle","Run","Total"),
  variable_name="Discipline")
head(meltedTimes)
  StartingPosition Category Position Discipline value
1              441        F        1       Swim  6.07
2                5        G        2       Swim  5.92
3               26        D        3       Swim  6.47
4              445        E        5       Swim  6.72
5               32        F        6       Swim  6.42
6                2        H        7       Swim  6.03

The thing I’ll focus on in this post is the athletes’ rank in each discipline, in particular the top athletes’.
I start by defining a function that’ll give the quartile the athlete is in:

Select All Code:

whichQuartile<-function(s) {
  q<-quantile(s,probs=c(0.25,0.5,0.75,1),names=FALSE)
  x<-sapply(s,function(y) which(y<=q)[1])
  x
}

I apply rank and whichQuartile to the melted data to create 2 new columns: ranking and Quartile.

Select All Code:

meltedTimes<-ddply(
	meltedTimes,
	.(Discipline),
	transform,ranking=rank(value,ties.method="random"),Quartile=whichQuartile(value))
meltedTimes$Quartile<-factor(meltedTimes$Quartile)
head(meltedTimes)
  StartingPosition Category Position Discipline value ranking Quartile
1              441        F        1       Swim  6.07       7        1
2                5        G        2       Swim  5.92       4        1
3               26        D        3       Swim  6.47      16        1
4              445        E        5       Swim  6.72      25        1
5               32        F        6       Swim  6.42      14        1
6                2        H        7       Swim  6.03       6        1

Now we can see how individual ranking changes across disciplines:

Select All Code:

1
2
3

p<-ggplot(meltedTimes,aes(x=Discipline,y=ranking))
p<-p+geom_line(aes(group=StartingPosition),colour="#431600",alpha=0.2)
print(p)

I have to say, the plot is aesthetically very pleasing; it has almost a 3D feel to it. We can also spot a couple of dramatic changes for some athletes.
However things get more interesting if we colour the trajectories by the athletes’ final quartile:

Select All Code:

1
2
3

p<-ggplot(meltedTimes,aes(x=Discipline,y=ranking,colour=Quartile[Discipline=='Total']))
p<-p+geom_line(aes(group=StartingPosition),alpha=0.8)
print(p)

There’s perfect colour separation on the Total column, by construction. The pattern is more or less respected across disciplines, suggesting that the discipline quartiles are predictive of the final quartile. We’ll say more about this in a minute but first let’s look at the same graphs when the colour is done with respect to the disciplines, because they’re informative and because they’re pretty.

Select All Code:

for(discipline in c("Swim","Cycle","Run")){
  p<-ggplot(meltedTimes,aes(x=Discipline,y=ranking,colour=Quartile[Discipline==discipline]))
  p<-p+geom_line(aes(group=StartingPosition),alpha=0.8)
  p<-p+scale_colour_discrete(name = paste(discipline,"Quartile"))
  print(p)
}

Ok, back to our questions. To what extent are the intra-discipline quartiles predictive of the final position? Which discipline makes the most difference? Answering those questions means calculating some conditional probabilities, which we’ll approximate by frequencies.

First we need redress the data to make the counting easier. We could add four quartile columns manually to the times data frame using ddplyr and the function whichQuartile but I’ll use the opportunity to show how to go from the melted data frame to a ‘wide’ format:

Select All Code:

allQuartiles<-reshape(
    subset(
        meltedTimes,
        select=c("StartingPosition","Discipline","Quartile")),
    idvar="StartingPosition",
    timevar="Discipline",
    direction="wide")
head(allQuartiles)
  StartingPosition Quartile.Swim Quartile.Cycle Quartile.Run Quartile.Total
1              441             1              1            1              1
2                5             1              1            1              1
3               26             1              1            1              1
4              445             1              1            1              1
5               32             1              1            1              1
6                2             1              1            1              1

Are disciplines equal?
We want to compare $P(in\, final\, top\, quartile\, |\, in\, top\, quartile\, for\, a\, discipline)$ for each of the three disciplines. A simple and readable way to do it is:

Select All Code:

conditionals=100*c(
  Swim=nrow(subset(allQuartiles,Quartile.Swim==1 & Quartile.Total==1))/nrow(subset(allQuartiles,Quartile.Swim==1)),
  Cycle=nrow(subset(allQuartiles,Quartile.Cycle==1 & Quartile.Total==1))/nrow(subset(allQuartiles,Quartile.Cycle==1)),
  Run=nrow(subset(allQuartiles,Quartile.Run==1 & Quartile.Total==1))/nrow(subset(allQuartiles,Quartile.Run==1))
  )
cat("\n Probability of being in the top quartile, given being in the top quartile in one discipline:\n")
print(conditionals,digits=4)
 Probability of being in the top quartile, given being in the top quartile in one discipline:
 Swim Cycle   Run 
58.89 84.44 76.67

It seems that cycling is really important: among the athletes in the first quartile for cycling, 84% finished in the top 25%. Being well-placed in the swim gives you a good chance to finish at the top as well but not as much. What if you’re in the first quartile for all disciplines? Well:

Select All Code:

1
2
3

# 100% chance of ending in top 25% if always been in top 25%
100*nrow(subset(allQuartiles,Quartile.Swim==1 & Quartile.Cycle==1 & Quartile.Run==1 & Quartile.Total==1))/nrow(subset(allQuartiles,Quartile.Swim==1 & Quartile.Cycle==1 & Quartile.Run==1))
[1] 100

Not too surprisingly.

How to get into the top 25%?
Who is in the top 25%? The count function from plyr is making the process very easy:

Select All Code:

f<-count(
    subset(allQuartiles,Quartile.Total==1),
    c("Quartile.Swim","Quartile.Cycle","Quartile.Run"))
f$Proportion<-100*f$freq/sum(f$freq)
f<-f[order(f$freq,decreasing=TRUE),]
cat("\nWho is in the top 25%?\n")
print(f,digits=3)
   Quartile.Swim Quartile.Cycle Quartile.Run freq Proportion
1              1              1            1   35      38.89
7              2              1            1   12      13.33
8              2              1            2   10      11.11
4              1              2            1    7       7.78
2              1              1            2    6       6.67
10             3              1            1    6       6.67
13             4              1            1    5       5.56
5              1              2            2    3       3.33
3              1              1            3    1       1.11
6              1              3            1    1       1.11
9              2              2            1    1       1.11
11             3              2            1    1       1.11
12             3              4            1    1       1.11
14             4              1            2    1       1.11

So 38.8% of the top athletes actually got there by being at the top for all disciplines. And more than 60% of them got there by excelling in only some disciplines, so there’s hope for us mere mortals.

Careful though: the list shows us the proportion of athletes in different quartiles that make up the final top 25%. In other words:
$P(Swim.Quartile, Cycle.Quartile, Run.Quartile | Total.Quartile=1)$
This is interesting but not what we really want: we’d like to know how likely it is to end up in the top 25% given our performance on each discipline i.e.
$P(Total.Quartile=1 | Swim.Quartile, Cycle.Quartile, Run.Quartile)$
(Aren’t things always easier to read when written down as probability?)
We could use the code we’ve seen in the previous section but it’d be really tedious as we’ll have to condition on all combinations of disciplines’ quartiles.
count and merge to the rescue!

Select All Code:

d<-count(r,c("Quartile.Swim","Quartile.Cycle","Quartile.Run"))
m<-merge(f,d,by=c("Quartile.Swim","Quartile.Cycle","Quartile.Run"))
m$ProbabilityTop25<-100*m$freq.x/m$freq.y
m<-m[order(m$ProbabilityTop25,decreasing=TRUE),]
cat("\nProbability of being in the top 25%?\n")
print(m,digits=3)

d has the count of all possibilities of each discipline’s quartiles. f has the count of all possibilities with the extra event that Quartile.Total==1. We merge the two data frames on the rows for which the discipline’s quartiles all agree and use f‘s counts to normalize the probability.
We end up with this:

Select All Code:

   Quartile.Swim Quartile.Cycle Quartile.Run freq.x Proportion freq.y ProbabilityTop25
1              1              1            1     35      38.89     35            100.0
2              1              1            2      6       6.67      6            100.0
3              1              1            3      1       1.11      1            100.0
7              2              1            1     12      13.33     12            100.0
10             3              1            1      6       6.67      6            100.0
13             4              1            1      5       5.56      5            100.0
14             4              1            2      1       1.11      1            100.0
4              1              2            1      7       7.78      8             87.5
8              2              1            2     10      11.11     12             83.3
6              1              3            1      1       1.11      2             50.0
11             3              2            1      1       1.11      3             33.3
5              1              2            2      3       3.33     10             30.0
9              2              2            1      1       1.11      4             25.0
12             3              4            1      1       1.11      4             25.0

That’s more informative. Be at the top for the swim and the cycling, at least 3rd quartile for the run and you’ll finish in the first quartile. Same for the cycling and the run. Actually, if you’re at the top for cycling and running, you can even allow yourself to be at the bottom quartile for the swim.

There’s still hope if you excel in only one discipline but it’s much harder. And if you’re not at the top for at least one discipline, well, no first quartile for you!

Amazingly, someone (see last line) managed to get to the top despite being in the bottom quartile for cycling (and third for swimming), which we saw earlier is a rather important discipline! He must have done a fantastic run to be able to catch up. Let’s check that out:

Select All Code:

id<-subset(allQuartiles,Quartile.Cycle==4 & Quartile.Total==1)$StartingPosition
print(subset(times,StartingPosition==id))
   Position StartingPosition Category Swim Cycle   Run Total
69       69              297        H  9.3  54.5 12.05 75.85

12 minutes for a 5km run! Either we’ve got ourselves a new world champion or something got wrong when the time got recorded. (let’s not suspect foul play please)

Well, that was fun. There’s still more to do with this dataset but I’ll stop here for now. Thank you for watching.

Caveat: these conclusions are made from a small, probably unrepresentative sample. Don’t take it too seriously and keep listening to your coach. I’d be curious to see how the analysis pans out for larger datasets though.

To leave a comment for the author, please follow the link and comment on their blog: Christophe Ladroue » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

plyr, ggplot2 and triathlon results, part II

Related

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)