[This article was first published on Apply R, and kindly contributed to R-bloggers.]
I am a regular participant in the Prague International Half Marathon. In a mass event like this, the horde of runners needs a long time to reach the starting line. To make the times mutually comparable, each runner's “start time” is measured and afterwards subtracted from the “finish time”. The crowd is also organized into corridors so that faster runners start ahead of slower ones.
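The subtraction itself is simple enough to sketch in R. The data frame below is made up purely for illustration: the column names `bib`, `start` and `finish` and all values are assumptions, not the format of the official results. Times are in seconds since the starting gun.

```r
# Hypothetical results, in seconds elapsed since the starting gun.
results <- data.frame(
  bib    = c(101, 2345, 6789),
  start  = c(5, 180, 420),      # moment the runner crossed the starting line
  finish = c(4205, 6480, 8220)  # moment the runner crossed the finish line
)

# Net time: what the runner actually needed from line to line.
results$net <- results$finish - results$start

print(results)
```

A runner deep in the crowd can lose several minutes before even reaching the starting line, so only the net times are comparable between runners.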
Sometimes everything goes wrong, and that was the case in 2010. Imagine training for months and then doing your best, only to discover that your start time was not recorded. The organizers apologized but claimed that fewer than 2% of the participants were affected. Really?
Let us use R to scrape the results and compare histograms of the 2009, 2010 and 2011 start times to see the truth (the red dashed line at 20 marks the approximate capacity of the starting line):
See? The peaks in the 2010 data are actually a nice try by the organizers to do some statistics and correct for the missing measurements. Since the starting number reflects both the expected time and the position in the corridors, they tried to estimate a start time for each corridor. These averages were imputed into the ~25% of observations that were actually missing. Why is this so wrong? Because the ordering of runners was not under control. In 2009 and 2010 runners went wherever they wanted, as you can see in the following graphs. In fact, in 2010 the slow runners just behind the Kenyans caused the jam.
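To see why imputing corridor averages leaves such a fingerprint in a histogram, here is a small simulation. Everything in it is synthetic and assumed for illustration: the six corridors, the 1000 runners per corridor, the gamma-distributed start times and the random 25% loss. It is a sketch of the effect, not a reconstruction of what the organizers actually did.

```r
set.seed(1)

# Synthetic field: 6 corridors of 1000 runners each (made-up sizes).
n_corridors  <- 6
per_corridor <- 1000
corridor <- rep(seq_len(n_corridors), each = per_corridor)

# Made-up start times (seconds after the gun): each corridor starts about
# 90 s after the previous one, with a wide spread so that corridors overlap.
start <- 90 * (corridor - 1) + rgamma(length(corridor), shape = 4, scale = 20)

# Lose roughly 25% of the measurements at random.
missing  <- runif(length(start)) < 0.25
observed <- ifelse(missing, NA, start)

# Replace every missing value with the mean observed time of its corridor.
corridor_mean <- ave(observed, corridor,
                     FUN = function(x) mean(x, na.rm = TRUE))
imputed <- ifelse(missing, corridor_mean, observed)

# Compare the two histograms on common 5-second bins.
brks <- seq(0, ceiling(max(start) / 5) * 5, by = 5)
par(mfrow = c(2, 1))
hist(start,   breaks = brks, main = "All start times measured",
     xlab = "Start time [s]")
hist(imputed, breaks = brks, main = "25% replaced by corridor means",
     xlab = "Start time [s]")
```

In the second histogram all imputed runners of a corridor land in a single bin, which produces one tall spike per corridor. A share of missing values below 2% could not produce spikes that dominate the picture like this.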
Good news at the end? Yes! Even though the organizers were denying the truth, they learned a lesson from their mistakes. In 2011 extra care was devoted to time measurement and, as you can see, the ordering into corridors got much better.
Finally, the code for all of the above: