[This article was first published on **R on datawookie**, and kindly contributed to R-bloggers.]


How long does it take to cross the start line at the Comrades Marathon? If you’re lucky enough to be starting in one of the batches close to the front, it might be anything from a few seconds to a couple of minutes. But if you’re in a batch closer to the back, it could be up to ten or eleven minutes. This is an agonising wait when all you want to do is start running.

Using data from the 2019 edition of the Comrades Marathon I set out to answer this question.

We’ll start off by looking at summary statistics broken down by batch.

```
   batch    min    max    avg median
1  Elite 00'12" 03'24" 00'18" 00'15"
2      A 00'11" 10'25" 00'30" 00'28"
3      B 00'14" 11'10" 01'07" 00'59"
4      C 00'21" 10'50" 02'15" 02'09"
5     CC 00'35" 10'34" 02'24" 02'11"
6      D 00'27" 10'33" 04'05" 04'03"
7      E 00'29" 10'49" 05'41" 05'44"
8      F 00'35" 10'52" 07'07" 07'05"
9      G 00'27" 10'53" 08'13" 08'36"
10     H 00'20" 11'08" 08'52" 09'23"
```
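For readers who want to build a similar table from their own results, here is a minimal sketch in base R. It is not the code used for this post: the `results` data frame, its `batch` and `delay` columns (delay in seconds) and the `format_delay()` helper are all assumptions made for illustration, and the six delays are invented rather than taken from the race data.

```r
# Invented example data (NOT the 2019 race data): one row per finisher,
# with the start batch and the delay in seconds before crossing the mat.
results <- data.frame(
  batch = factor(c("A", "A", "A", "H", "H", "H"), levels = c("A", "H")),
  delay = c(11, 28, 51, 20, 563, 668)
)

# Hypothetical helper: format a number of seconds as mm'ss".
format_delay <- function(seconds) {
  seconds <- as.integer(round(seconds))
  sprintf("%02d'%02d\"", seconds %/% 60L, seconds %% 60L)
}

# Split the delays by batch (the order follows the factor levels).
by_batch <- split(results$delay, results$batch)

# One row of summary statistics per batch.
delay_summary <- data.frame(
  batch  = names(by_batch),
  min    = format_delay(sapply(by_batch, min)),
  max    = format_delay(sapply(by_batch, max)),
  avg    = format_delay(sapply(by_batch, mean)),
  median = format_delay(sapply(by_batch, median))
)

print(delay_summary)
```

Declaring `batch` as a factor with explicit levels keeps the rows in start order (front of the field first) rather than alphabetical order.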

It’s apparent that the average delay increases consistently as you progress from the front of the field (the Elite and A batches) through to the back (batches G and H). What’s somewhat surprising is that there are runners who should ostensibly be starting towards the back of the field who still manage to cross the starting mat with only a short delay (see the `min` values for batches E through H).

The above table hides a lot of detail. Below is a plot showing the distribution of start delays broken down by batch. As one would expect, the delays for the first few batches are small and sharply peaked. However, the distribution of delays becomes broader for the later batches. As hinted above, there are a significant number of runners who manage to cross the start mat very quickly given their nominal starting batch.

There’s a problem with the above plot: the scale of the y axis is linear, which means that small values are hard to see. If we apply a `sqrt()` transform to this axis we get a much clearer view.
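The two plots described above could be approximated along the following lines. This is only a sketch, assuming ggplot2 is installed: the plot type used in the original post isn't stated, so a histogram faceted by batch is an assumption, and the delays are simulated purely for illustration (they are not the race data). The `results` data frame and its column names are hypothetical.

```r
library(ggplot2)

batches <- c("Elite", "A", "B", "C", "CC", "D", "E", "F", "G", "H")

# Simulated delays in seconds (NOT the race data): later batches are given
# larger and more widely spread delays, with a floor of 5 seconds.
set.seed(2019)
results <- do.call(rbind, lapply(seq_along(batches), function(i) {
  data.frame(
    batch = batches[i],
    delay = pmax(rnorm(200, mean = 60 * i, sd = 10 * i), 5)
  )
}))
results$batch <- factor(results$batch, levels = batches)

# Distribution of start delays, one panel per batch, linear y axis.
p <- ggplot(results, aes(x = delay)) +
  geom_histogram(binwidth = 15) +
  facet_wrap(~ batch, ncol = 1) +
  labs(x = "Start delay (seconds)", y = "Number of runners")

# The same plot with a square-root transform applied to the y axis,
# which stretches small counts so that sparse tails become visible.
p_sqrt <- p + scale_y_sqrt()

print(p_sqrt)
```

A square-root scale is a convenient choice for counts because, unlike a log scale, it copes with bins containing zero runners.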

Now we can see that the start batches are really not being very strictly controlled: there are H batch runners who are evidently starting from very close to the front of the field. Conversely, there are also numerous runners who are starting further back in the field than they are entitled to based on their qualifying batch. Of course, the latter case is allowed (starting in a slower batch), while the former (starting in a faster batch) is not.

It’s important to note that these results are subject to significant selection bias: only those runners who finished the race are accounted for. It’d be great to have more extensive data which includes all runners who started the race.
