**Christophe Ladroue » R**, and kindly contributed to R-bloggers)

I ended my previous post by mentioning how one could imagine other ways of looking at the triathlon data with `plyr` and `ggplot2`. I couldn’t help but carry on playing with it so here are more stats and graphs from the same dataset: the results of a local sprint triathlon. This post will have a slightly more statistical bent to it.

First we load the data and turn the character-coded times into numbers (time in minutes):

(see the previous post for more information about this part)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
library("ggplot2") library("stringr") times<-read.table("stratford.csv",header=TRUE,sep="\t",stringsAsFactors=FALSE,na.strings="") stringToMinutes<-colwise( function(s) unlist(lapply(str_split(s,":"),function(s) sum(as.numeric(s)*c(60,1,1/60)))), .(Swim,Cycle,Run,Total)) times<-ddply(times,.(Position,StartingPosition,Category),stringToMinutes) times<-subset(times,Swim+Cycle+Run<Total+5) head(times) Position StartingPosition Category Swim Cycle Run Total 1 1 441 F 6.07 36.8 19.2 62.0 2 2 5 G 5.92 37.4 20.3 63.6 3 3 26 D 6.47 37.6 19.5 63.6 5 5 445 E 6.72 37.4 21.6 65.8 6 6 32 F 6.42 39.5 21.1 67.0 7 7 2 H 6.03 38.8 22.3 67.1 |

The data is then melted to be in a suitable format for `ggplot2`:

1 2 3 4 5 6 7 8 9 10 11 12 13 |
meltedTimes<-melt( times, c("StartingPosition","Category","Position"), c("Swim","Cycle","Run","Total"), variable_name="Discipline") head(meltedTimes) StartingPosition Category Position Discipline value 1 441 F 1 Swim 6.07 2 5 G 2 Swim 5.92 3 26 D 3 Swim 6.47 4 445 E 5 Swim 6.72 5 32 F 6 Swim 6.42 6 2 H 7 Swim 6.03 |

The thing I’ll focus on in this post is the athletes’ rank in each discipline, in particular the top athletes’.

I start by defining a function that’ll give the quartile the athlete is in:

1 2 3 4 5 |
whichQuartile<-function(s) { q<-quantile(s,probs=c(0.25,0.5,0.75,1),names=FALSE) x<-sapply(s,function(y) which(y<=q)[1]) x } |

I apply `rank` and `whichQuartile` to the melted data to create 2 new columns: `ranking` and `Quartile`.

1 2 3 4 5 6 7 8 9 10 11 12 13 |
meltedTimes<-ddply( meltedTimes, .(Discipline), transform,ranking=rank(value,ties.method="random"),Quartile=whichQuartile(value)) meltedTimes$Quartile<-factor(meltedTimes$Quartile) head(meltedTimes) StartingPosition Category Position Discipline value ranking Quartile 1 441 F 1 Swim 6.07 7 1 2 5 G 2 Swim 5.92 4 1 3 26 D 3 Swim 6.47 16 1 4 445 E 5 Swim 6.72 25 1 5 32 F 6 Swim 6.42 14 1 6 2 H 7 Swim 6.03 6 1 |

Now we can see how individual ranking changes across disciplines:

1 2 3 |
p<-ggplot(meltedTimes,aes(x=Discipline,y=ranking)) p<-p+geom_line(aes(group=StartingPosition),colour="#431600",alpha=0.2) print(p) |

I have to say, the plot is aesthetically very pleasing; it has almost a 3D feel to it. We can also spot a couple of dramatic changes for some athletes.

However things get more interesting if we colour the trajectories by the athletes’ final quartile:

1 2 3 |
p<-ggplot(meltedTimes,aes(x=Discipline,y=ranking,colour=Quartile[Discipline=='Total'])) p<-p+geom_line(aes(group=StartingPosition),alpha=0.8) print(p) |

There’s perfect colour separation on the `Total` column, by construction. The pattern is more or less respected across disciplines, suggesting that the discipline quartiles are predictive of the final quartile. We’ll say more about this in a minute but first let’s look at the same graphs when the colour is done with respect to the disciplines, because they’re informative and because they’re pretty.

1 2 3 4 5 6 |
for(discipline in c("Swim","Cycle","Run")){ p<-ggplot(meltedTimes,aes(x=Discipline,y=ranking,colour=Quartile[Discipline==discipline])) p<-p+geom_line(aes(group=StartingPosition),alpha=0.8) p<-p+scale_colour_discrete(name = paste(discipline,"Quartile")) print(p) } |

Ok, back to our questions. To what extent are the intra-discipline quartiles predictive of the final position? Which discipline makes the most difference? Answering those questions means calculating some conditional probabilities, which we’ll approximate by frequencies.

First we need redress the data to make the counting easier. We could add four quartile columns manually to the `times` data frame using `ddplyr` and the function `whichQuartile` but I’ll use the opportunity to show how to go from the melted data frame to a ‘wide’ format:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
allQuartiles<-reshape( subset( meltedTimes, select=c("StartingPosition","Discipline","Quartile")), idvar="StartingPosition", timevar="Discipline", direction="wide") head(allQuartiles) StartingPosition Quartile.Swim Quartile.Cycle Quartile.Run Quartile.Total 1 441 1 1 1 1 2 5 1 1 1 1 3 26 1 1 1 1 4 445 1 1 1 1 5 32 1 1 1 1 6 2 1 1 1 1 |

**Are disciplines equal?**

We want to compare for each of the three disciplines. A simple and readable way to do it is:

1 2 3 4 5 6 7 8 9 10 |
conditionals=100*c( Swim=nrow(subset(allQuartiles,Quartile.Swim==1 & Quartile.Total==1))/nrow(subset(allQuartiles,Quartile.Swim==1)), Cycle=nrow(subset(allQuartiles,Quartile.Cycle==1 & Quartile.Total==1))/nrow(subset(allQuartiles,Quartile.Cycle==1)), Run=nrow(subset(allQuartiles,Quartile.Run==1 & Quartile.Total==1))/nrow(subset(allQuartiles,Quartile.Run==1)) ) cat("\n Probability of being in the top quartile, given being in the top quartile in one discipline:\n") print(conditionals,digits=4) Probability of being in the top quartile, given being in the top quartile in one discipline: Swim Cycle Run 58.89 84.44 76.67 |

It seems that cycling is really important: among the athletes in the first quartile for cycling, 84% finished in the top 25%. Being well-placed in the swim gives you a good chance to finish at the top as well but not as much. What if you’re in the first quartile for all disciplines? Well:

1 2 3 |
# 100% chance of ending in top 25% if always been in top 25% 100*nrow(subset(allQuartiles,Quartile.Swim==1 & Quartile.Cycle==1 & Quartile.Run==1 & Quartile.Total==1))/nrow(subset(allQuartiles,Quartile.Swim==1 & Quartile.Cycle==1 & Quartile.Run==1)) [1] 100 |

Not too surprisingly.

**How to get into the top 25%?**

Who is in the top 25%? The `count` function from `plyr` is making the process very easy:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 |
f<-count( subset(allQuartiles,Quartile.Total==1), c("Quartile.Swim","Quartile.Cycle","Quartile.Run")) f$Proportion<-100*f$freq/sum(f$freq) f<-f[order(f$freq,decreasing=TRUE),] cat("\nWho is in the top 25%?\n") print(f,digits=3) Quartile.Swim Quartile.Cycle Quartile.Run freq Proportion 1 1 1 1 35 38.89 7 2 1 1 12 13.33 8 2 1 2 10 11.11 4 1 2 1 7 7.78 2 1 1 2 6 6.67 10 3 1 1 6 6.67 13 4 1 1 5 5.56 5 1 2 2 3 3.33 3 1 1 3 1 1.11 6 1 3 1 1 1.11 9 2 2 1 1 1.11 11 3 2 1 1 1.11 12 3 4 1 1 1.11 14 4 1 2 1 1.11 |

So 38.8% of the top athletes actually got there by being at the top for all disciplines. And more than 60% of them got there by excelling in only some disciplines, so there’s hope for us mere mortals.

Careful though: the list shows us the proportion of athletes in different quartiles that make up the final top 25%. In other words:

This is interesting but not what we really want: we’d like to know how likely it is to end up in the top 25% given our performance on each discipline *i.e.*

(Aren’t things always easier to read when written down as probability?)

We could use the code we’ve seen in the previous section but it’d be really tedious as we’ll have to condition on all combinations of disciplines’ quartiles.

`count` and `merge` to the rescue!

1 2 3 4 5 6 |
d<-count(r,c("Quartile.Swim","Quartile.Cycle","Quartile.Run")) m<-merge(f,d,by=c("Quartile.Swim","Quartile.Cycle","Quartile.Run")) m$ProbabilityTop25<-100*m$freq.x/m$freq.y m<-m[order(m$ProbabilityTop25,decreasing=TRUE),] cat("\nProbability of being in the top 25%?\n") print(m,digits=3) |

`d` has the count of all possibilities of each discipline’s quartiles. `f` has the count of all possibilities with the extra event that `Quartile.Total==1`. We merge the two data frames on the rows for which the discipline’s quartiles all agree and use `f`‘s counts to normalize the probability.

We end up with this:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
Quartile.Swim Quartile.Cycle Quartile.Run freq.x Proportion freq.y ProbabilityTop25 1 1 1 1 35 38.89 35 100.0 2 1 1 2 6 6.67 6 100.0 3 1 1 3 1 1.11 1 100.0 7 2 1 1 12 13.33 12 100.0 10 3 1 1 6 6.67 6 100.0 13 4 1 1 5 5.56 5 100.0 14 4 1 2 1 1.11 1 100.0 4 1 2 1 7 7.78 8 87.5 8 2 1 2 10 11.11 12 83.3 6 1 3 1 1 1.11 2 50.0 11 3 2 1 1 1.11 3 33.3 5 1 2 2 3 3.33 10 30.0 9 2 2 1 1 1.11 4 25.0 12 3 4 1 1 1.11 4 25.0 |

That’s more informative. Be at the top for the swim and the cycling, at least 3rd quartile for the run and you’ll finish in the first quartile. Same for the cycling and the run. Actually, if you’re at the top for cycling and running, you can even allow yourself to be at the bottom quartile for the swim.

There’s still hope if you excel in only one discipline but it’s much harder. And if you’re not at the top for at least one discipline, well, no first quartile for you!

Amazingly, someone (see last line) managed to get to the top despite being in the bottom quartile for cycling (and third for swimming), which we saw earlier is a rather important discipline! He must have done a fantastic run to be able to catch up. Let’s check that out:

1 2 3 4 |
id<-subset(allQuartiles,Quartile.Cycle==4 & Quartile.Total==1)$StartingPosition print(subset(times,StartingPosition==id)) Position StartingPosition Category Swim Cycle Run Total 69 69 297 H 9.3 54.5 12.05 75.85 |

12 minutes for a 5km run! Either we’ve got ourselves a new world champion or something got wrong when the time got recorded. (let’s not suspect foul play please)

Well, that was fun. There’s still more to do with this dataset but I’ll stop here for now. Thank you for watching.

Caveat: these conclusions are made from a small, probably unrepresentative sample. Don’t take it too seriously and keep listening to your coach. I’d be curious to see how the analysis pans out for larger datasets though.

**leave a comment**for the author, please follow the link and comment on their blog:

**Christophe Ladroue » R**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...