Using segmented regression to analyse world record running times

[This article was first published on Revolutions, and kindly contributed to R-bloggers.]

by Andrie de Vries

A week ago my high school friend, @XLRunner, sent me a link to the article “How Zach Bitter Ran 100 Miles in Less Than 12 Hours”. Zach's effort was rewarded with the American record for the 100-mile event.

Zach Bitter holds the American record for the 100 mile

This reminded me of some analysis I did, many years ago, of the world record speeds for various running distances. The International Association of Athletics Federations (IAAF) keeps track of world records for distances from 100m up to the marathon (42.2km). Distances longer than the marathon do not fall in the IAAF event list, but records for these are tracked by various other organisations.

You can find a list of IAAF world records at Wikipedia, and a list of ultramarathon world best times, also at Wikipedia.

I extracted only the men's running events from these lists, and used R to plot the average running speeds for these records:
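
As a minimal sketch of the speed calculation (not the author's original code), average speed is simply distance divided by elapsed time. The helper name `average_speed_kmh` is an assumption made for illustration. The two inputs are the 100 miles in 12 hours from the article title above, which gives a lower bound on Zach Bitter's record pace, and the marathon record of 2:02:57 that Dennis Kimetto held when this was written.

```r
# Average speed in km/h from a distance in km and a time in seconds.
# `average_speed_kmh` is a hypothetical helper, not part of any package.
average_speed_kmh <- function(distance_km, time_s) {
  distance_km / (time_s / 3600)
}

km_per_mile <- 1.609344

# 100 miles in exactly 12 hours: a lower bound on the record pace
ultra_speed <- average_speed_kmh(100 * km_per_mile, 12 * 3600)

# Marathon (42.195 km) in 2:02:57
marathon_speed <- average_speed_kmh(42.195, 2 * 3600 + 2 * 60 + 57)

round(c(ultra = ultra_speed, marathon = marathon_speed), 2)
```

Working in km and km/h throughout keeps the units consistent with the break point that is quoted in km later in the article.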


You can immediately see that the speed declines very rapidly from the sprint events. Perhaps it would be better to plot this using a logarithmic x-scale, adding some labels at the same time. I also added some colour for what I call standard events, where “standard” is the type of distance you would regularly see at a World Championships or Olympic Games. Thus the mile is “standard”, but the 2,000m race is not.
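
A plot along these lines can be sketched in base R. This is an illustration under stated assumptions rather than the author's plotting code: the data frame `records` holds only five men's records as they stood when the article was written, typed in by hand, and the `standard` flag follows the article's own examples (the mile counts as standard, the 2,000m does not).

```r
# A handful of men's records (times in seconds), entered by hand for
# illustration only; this is not the author's full data set.
records <- data.frame(
  event       = c("200m", "800m", "Mile", "2000m", "Marathon"),
  distance_km = c(0.2, 0.8, 1.609344, 2, 42.195),
  time_s      = c(19.19, 100.91, 223.13, 284.79, 7377),
  standard    = c(TRUE, TRUE, TRUE, FALSE, TRUE)
)
records$speed_kmh <- records$distance_km / (records$time_s / 3600)

plot(
  records$distance_km, records$speed_kmh,
  log  = "x",                                      # logarithmic distance axis
  col  = ifelse(records$standard, "red", "grey40"), # highlight standard events
  pch  = 19,
  xlab = "Distance (km, log scale)",
  ylab = "Average speed (km/h)"
)

# Label each point with its event name; xpd = NA lets labels spill
# outside the plot region instead of being clipped
text(records$distance_km, records$speed_kmh,
     labels = records$event, pos = 3, xpd = NA)
```

The `log = "x"` argument changes only the axis, so the underlying distances stay in km; for the regression itself the distances are transformed explicitly with `log10()`.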


Now our data points are in somewhat more of a straight line, meaning we could consider fitting a linear regression.

However, it seems that there might be two kinks in the line:

  • The first kink occurs somewhere between the 800m distance and the mile. It seems that the sprinting distances (and the 800m is sometimes called a long sprint) have different dynamics from the events up to the marathon.
  • And then there is another kink for the ultra-marathon distances. The standard marathon is 42.2km, and distances longer than this are called ultramarathons.

Also, note that the speed for the 100m is actually slower than for the 200m. This reflects the cost of accelerating from a standing start, which clearly plays a large role over the shortest sprint distance.

Subsetting the data

For the analysis below, I excluded the data for:

  • The 100m sprint (transition effects play too large a role)
  • The ultramarathon distances (they get raced less frequently, and something strange seems to be happening in the data for the 50km race in particular).
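
The two exclusions can be sketched as a single filter on distance. The data frame `records` and its column names are assumptions made for illustration, and only the event distances are filled in, since they are all the filter needs.

```r
# Hypothetical data frame: only the distances matter for this sketch.
records <- data.frame(
  event       = c("100m", "200m", "800m", "Marathon", "50km", "100 miles"),
  distance_km = c(0.1, 0.2, 0.8, 42.195, 50, 160.9344)
)

# Keep everything longer than the 100m and no longer than the marathon
analysis_set <- subset(records, distance_km > 0.1 & distance_km <= 42.195)

# log10 of distance in km is the single predictor used in the model
analysis_set$logDistance <- log10(analysis_set$distance_km)
analysis_set
```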


Using the segmented package

To fit a regression line with kinks, more properly known as a segmented regression (or sometimes called piecewise regression), you can use the segmented package, available on CRAN.

The segmented() function allows you to modify a fitted object of class lm or glm, specifying which of the independent variables should have segments (kinks).  In my case, I fitted a linear model with a single variable (log of distance), and allowed segmented() to find a single kink point.
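
The two-step workflow can be sketched as follows. Because the record data are not reproduced here, this example runs on synthetic data: speeds are simulated from the coefficients reported in the summary output below, plus a little noise. The object names `lfit` and `sfit` match that output; the data frame `sim`, the noise level and the starting value `psi = 0.5` are assumptions made for illustration.

```r
library(segmented)

set.seed(42)

# Synthetic data only: log10(km) distances between 200m and the marathon,
# with speeds generated from the reported coefficients (intercept 27.2064,
# slope -15.1305, slope change 11.2046 at a break point of 0.055).
logDistance <- seq(log10(0.2), log10(42.195), length.out = 16)
speed <- 27.2064 - 15.1305 * logDistance +
  11.2046 * pmax(logDistance - 0.055, 0) +
  rnorm(length(logDistance), sd = 0.2)
sim <- data.frame(logDistance, speed)

# Step 1: ordinary linear model on log distance
lfit <- lm(speed ~ logDistance, data = sim)

# Step 2: let segmented() estimate one break point in logDistance;
# psi is only a starting guess for the break-point location
sfit <- segmented(lfit, seg.Z = ~logDistance, psi = 0.5)

summary(sfit)
sfit$psi      # starting value, estimated break point and its standard error
slope(sfit)   # estimated slope of each segment
```

On simulated data like this the estimated break point should land close to the 0.055 used to generate it, which is a quick way to check that the model is set up as intended before applying it to real records.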

My analysis indicates that there is a kink point at 1.13km (10^0.055 = 1.13, since the model uses the base-10 log of distance in km), i.e. just beyond the 1,000m event and short of the 1,500m event.

> summary(sfit)

***Regression Model with Segmented Relationship(s)***

segmented.lm(obj = lfit, seg.Z = ~logDistance)

Estimated Break-Point(s):
  Est. St.Err
 0.055  0.021

Meaningful coefficients of the linear terms:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)     27.2064     0.1755  155.04  < 2e-16 ***
logDistance    -15.1305     0.4332  -34.93 1.94e-13 ***
U1.logDistance  11.2046     0.4536   24.70       NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2373 on 12 degrees of freedom
Multiple R-Squared: 0.9981, Adjusted R-squared: 0.9976

Convergence attained in 4 iterations with relative change -4.922372e-16
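
To read this output, it helps to know that the U1.logDistance coefficient is the change in slope at the break point, not the slope of the second segment itself. The sketch below copies the numbers from the summary above and derives the second slope, the break point in km with a rough range of plus or minus two standard errors, and fitted speeds at a few distances. The helper `predict_speed` is a hypothetical name, and the code assumes speed is measured in km/h, which the intercept of about 27 at a distance of 1km suggests.

```r
# Values copied from the summary output above
intercept   <- 27.2064
slope_left  <- -15.1305   # change in speed per tenfold increase in distance
slope_diff  <- 11.2046    # U1.logDistance: change in slope at the break
break_log10 <- 0.055
break_se    <- 0.021

slope_right <- slope_left + slope_diff   # slope beyond the break point
break_km    <- 10^break_log10            # break point back on the km scale

# Rough range: estimate +/- 2 standard errors, back-transformed to km
break_range_km <- 10^(break_log10 + c(-2, 2) * break_se)

# Fitted speed at a given distance in km (hypothetical helper;
# units assumed to be km/h)
predict_speed <- function(distance_km) {
  x <- log10(distance_km)
  intercept + slope_left * x + slope_diff * pmax(x - break_log10, 0)
}

slope_right
break_km
break_range_km
predict_speed(c(0.4, 1, 10, 42.195))
```

The second slope is much shallower than the first: beyond the break, each tenfold increase in distance costs roughly 4 units of speed instead of about 15.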


The final plot shows the same data, but this time with the segmented regression line also displayed.



I conclude:

  1. It is really easy to fit a segmented linear regression model using the segmented package
  2. There seems to be a different physiological process for the sprint events and the middle distance events. The segmented regression puts this kink point at about 1.13km, between the 1,000m event and the 1,500m event
  3. The ultramarathon distances have a completely different dynamic. However, it's not clear to me whether this is due to inherent physiological constraints, or vastly reduced competition in these "non-standard" events.
  4. The 50km world record seems too "slow". Perhaps the competition for this event is less intense than for the marathon?

Dennis Kimetto holds the world record for the marathon

The code

Here is my code for the analysis:
