R-bloggers

Marathon Man: how to pace a marathon

This article was first published on Rstats – quantixed, and kindly contributed to R-bloggers.

How does the average marathoner pace their race? In this post, we’ll use R to have a look at a large dataset of marathon times to try to answer this question.

The ideal strategy would be to “even split” the race. This is where you run continually at the same pace from kilometre 0 to the finish. Let’s forget about “negative splitting”. This is where you speed up through the race, usually by running at a constant pace for the first half or three-quarters and then increasing the pace. Negative splits are for the pros, not mere mortals! The difficulty with even-splitting the race is that it is very hard to know what pace you can maintain. The marathon gets hard for everyone after 30 km, so a slowdown is almost inevitable. Certainly, if you have started too fast, you will fade. Running the second half slower than the first is known as “positive splitting”.
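
To make the three terms concrete, here is a minimal sketch that labels a runner from their two half times. The function name classify_split and the one-minute tolerance for calling a race "even" are assumptions for illustration; the post itself does not define a threshold.

```r
# Label a race as negative, even or positive split.
# `tol` (minutes) is an assumed tolerance, not a value from the post.
classify_split <- function(first_half, second_half, tol = 1) {
  diff <- second_half - first_half
  ifelse(diff < -tol, "negative",
         ifelse(diff > tol, "positive", "even"))
}

# three hypothetical runners, all with a 90 minute first half
classify_split(first_half = c(90, 90, 90), second_half = c(88, 90.5, 97))
```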

Why is it so hard to know what pace you can maintain? Well, you can predict a pace based on existing races, e.g. a half marathon, and there are various ways to do this, but it is difficult to tell if you can hold that pace for the marathon. It’s such a brutal event that training up to run one takes time and it equally takes a while to recover, so experimentation is limited. Running a full marathon (at pace) in training is not advised. So determining an ideal pace involves quite a bit of guesswork.

Let’s take a look at a big dataset of marathon times – we’ll use the New York City Marathon from 2025 – to see if we can understand how to pace a marathon. There’s an available dataset of chip times (meaning we don’t have to worry about dodgy GPS data) and the course has similar first and second half profiles, allowing us to use these times to understand negative/even/positive splitting. Let’s dive in.

You can skip to the code to play along or just see the analysis here.

First, histograms of the difference between the second-half and first-half times show that most runners positive split the marathon. There are very few runners who run a negative split (blue bars, left of the dashed line). More runners even-split (yellow), but the majority run positive (red) split times.

For marathoners finishing in under 3 h, the modal split is only +2 minutes. Over 21.1 km this is a loss of only about 6 s per km. For marathoners finishing in over three hours, this loss gets more severe. Those finishing outside 5 h ship 20 minutes or more in the second half.
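
The per-kilometre figure is just the time lost divided by the half-marathon distance. A quick sketch of that arithmetic (the helper name loss_per_km is made up for illustration):

```r
# half of the marathon distance, in km
half_km <- 42.195 / 2

# seconds lost per km in the second half, given the split difference in minutes
loss_per_km <- function(diff_min) diff_min * 60 / half_km

loss_per_km(2)   # a +2 minute split
loss_per_km(20)  # a +20 minute split
```

A +2 minute split works out at a little under 6 s per km, while a +20 minute split is closer to a minute per km.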

At first glance this looks like better pace management by the faster runners, but these positive splits could be proportional to the paces being run. In other words, a slower runner should ship more time in the second half, because they’re running more slowly.
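
One way to check this is to express the split as a percentage of the first half, which is what the Difference_Fraction column in the code does. A minimal sketch with two invented runners (the numbers are for illustration only, not taken from the dataset):

```r
# two hypothetical runners: one fast, one slower
runners <- data.frame(first_half = c(88, 150),
                      second_half = c(90, 170))

# absolute split in minutes
runners$difference <- runners$second_half - runners$first_half
# split as a percentage of the first half time
runners$difference_pct <- runners$difference / runners$first_half * 100

runners
```

If the fade were purely proportional to pace, the percentage column would be similar for both runners even though the absolute difference is not.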

We can look at this data a different way and directly compare the first and second half times for each runner. Again this highlights just how few runners negative- or even-split the marathon. Most are positive splitting and are in the upper left half of the plot. We can also see that the data veers away from the ideal even-split (dashed line) with the slower paces. This veering looks linear (straight line).

We can fit a line to this data, and constrain it to go through (1,1) in hours, i.e. (60, 60) in minutes: a 2 h marathoner even-splitting the race. To do this in R we can use lm(formula = I(y - 60) ~ I(x - 60) + 0, data = fitting) and this gives the coefficient for I(x - 60) as 1.24 (1.239 to three decimal places). This is essentially the fade coefficient for the average runner in the 2025 edition of this race.
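
To see why this formula pins the line to (60, 60), here is a small sketch on simulated data. Everything about the simulation (the seed, the sample size, the noise level and the slope of 1.24) is an assumption chosen for illustration; only the lm() formula comes from the analysis.

```r
set.seed(1)

# simulate first half times (minutes) and second halves that fade
# linearly away from the (60, 60) anchor, plus some noise
n <- 500
true_slope <- 1.24
x <- runif(n, min = 70, max = 200)
y <- 60 + true_slope * (x - 60) + rnorm(n, sd = 5)
fitting <- data.frame(x = x, y = y)

# shifting both axes by 60 and dropping the intercept (+ 0)
# forces the fitted line through x = 60, y = 60
fit <- lm(I(y - 60) ~ I(x - 60) + 0, data = fitting)
slope <- unname(coef(fit))
slope

# the prediction at a 60 minute first half is 60 minutes by construction
60 + slope * (60 - 60)
```

The fit has a single coefficient, the slope, which should land close to the value used to simulate the data.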

What does that mean? Well, for a runner achieving a 90 minute first half, their second half would most likely be: 60 + 1.239 * (90 - 60) = 97.17 minutes, so this would be a finish time of 3:07:10.

For anyone looking to run a 3 h New York Marathon, the average runner would therefore need to run 60 / 2.239 + 60 = 86.8 minutes for the first half to anticipate the fade. So 1:26:48 for the first half, and then 1:33:12 for the second half.
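
Both calculations can be wrapped in small helpers. If the first half is x and the target finish is T (in minutes), then x + 60 + k * (x - 60) = T, which rearranges to x = 60 + (T - 120) / (1 + k). A sketch, where the function names are made up for illustration and k = 1.239 is the coefficient from the fit:

```r
# predicted second half (minutes) from the first half, using the fade coefficient k
predict_second_half <- function(first_half, k = 1.239) {
  60 + k * (first_half - 60)
}

# first half (minutes) needed so that first + predicted second = target finish
first_half_for_target <- function(finish, k = 1.239) {
  60 + (finish - 120) / (1 + k)
}

predict_second_half(90)              # second half after a 90 minute first half
first_half <- first_half_for_target(180)
first_half                           # first half for a 3 h finish
predict_second_half(first_half)      # the matching second half
```

The two halves returned for a 180 minute target add back up to 180, which is a handy sanity check on the algebra.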

A simpler calculation is to take the mean of the ratio between the two half times (second half divided by first half) for everyone in the dataset. This gives a fade coefficient of 1.13. The difference between these two fade coefficients arises because the ratio, unlike the fit, is not constrained to pass through an even split at 2 h pace. The ratio therefore predicts a positive split being inevitable for the fastest runners, which is probably not true. Anyhow, this puts the first half time at 88 minutes for folks looking to run 3 h. These fade coefficients are good predictors for a range of times, and I suspect would be similar at other marathon events with a similar profile. You can use them to calculate your ideal pace for a target finish time.
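
The difference between the two coefficients is easier to see side by side. In this sketch the function names are made up for illustration; the two coefficients (1.239 and 1.127581) are the values reported in the code section.

```r
# constrained fit: line through (60, 60) with slope k
fit_model <- function(first_half, k = 1.239) 60 + k * (first_half - 60)

# mean ratio: second half is a fixed multiple of the first half
ratio_model <- function(first_half, r = 1.127581) r * first_half

first_half <- c(60, 90, 120, 150)
comparison <- data.frame(first_half,
                         fit = fit_model(first_half),
                         ratio = ratio_model(first_half))
comparison
```

At a 60 minute first half the constrained fit returns an even split, whereas the ratio still adds several minutes, which is the "inevitable positive split" for the fastest runners.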

Finally, for the most accurate answer about sub-3 h pacing, we can look directly at runners finishing between 02:50:00 and 03:00:00 and see what they actually ran. The median first half time was 86.3 min (IQR = 84.42 – 87.87) and the median second half was 89.62 min (IQR = 88.07 – 91.12). This gives a median finish time of 2:56:00. So running a 1:26:18 first half would give someone their best chance of finishing in under 3 h, allowing for the inevitable fade.
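
The same kind of summary can be pulled out for any finish-time band. A minimal sketch on a toy data frame: the five runners are invented for illustration, the column names follow the code section, and fmt_hms is a made-up helper for turning decimal minutes into h:mm:ss.

```r
# toy data, values invented for illustration
toy <- data.frame(split_HALF = c(84, 86, 87, 88, 95),
                  split_SECOND_HALF = c(88, 89, 91, 91.5, 100))
toy$split_MAR <- toy$split_HALF + toy$split_SECOND_HALF

# keep finishers between 170 and 180 minutes
band <- toy[toy$split_MAR > 170 & toy$split_MAR < 180, ]

# lower quartile, median and upper quartile of the first half time
first_half_q <- quantile(band$split_HALF, probs = c(0.25, 0.5, 0.75))
first_half_q

# convert decimal minutes to h:mm:ss
fmt_hms <- function(mins) {
  secs <- as.integer(round(mins * 60))
  sprintf("%d:%02d:%02d", secs %/% 3600L, (secs %% 3600L) %/% 60L, secs %% 60L)
}

fmt_hms(first_half_q[["50%"]])
fmt_hms(86.3)  # the median first half reported above
```

Changing the two limits in the filter gives the equivalent numbers for, say, a sub-3:30 or sub-4 h target.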

The takeaway message is: to finish within a goal time, do not assume even splits. That is, if you want to run 3 hours 30 min and bank on 105 minutes per half (4:59/km), you will most likely fail to hit the target. Build in a buffer of time to allow for the inevitable fade. A pace of 4:45/km is a better target pace (see below).

Good luck!

Finish time   Even split pace (per km)   Target pace (per km)
03:00:00      00:04:16                   00:04:07
03:30:00      00:04:59                   00:04:45
04:00:00      00:05:41                   00:05:23
04:30:00      00:06:24                   00:06:01
05:00:00      00:07:07                   00:06:39
06:00:00      00:08:32                   00:07:55
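
The table can be rebuilt for any set of goal times. This sketch assumes the target pace is the first-half pace implied by the constrained fit (k = 1.239), which matches the values in the table, although the exact calculation behind the table is not shown in the post. The helper fmt_pace is made up for illustration.

```r
marathon_km <- 42.195
k <- 1.239  # fade coefficient from the constrained fit

# format seconds per km as m:ss
fmt_pace <- function(secs_per_km) {
  s <- as.integer(round(secs_per_km))
  sprintf("%d:%02d", s %/% 60L, s %% 60L)
}

# goal finish times in minutes
finish_min <- c(180, 210, 240, 270, 300, 360)

# even split: same pace over the whole distance
even_pace <- finish_min * 60 / marathon_km

# target: first half time that anticipates the fade, as a pace over 21.0975 km
first_half_min <- 60 + (finish_min - 120) / (1 + k)
target_pace <- first_half_min * 60 / (marathon_km / 2)

pace_table <- data.frame(finish_min,
                         even_split = fmt_pace(even_pace),
                         target = fmt_pace(target_pace),
                         stringsAsFactors = FALSE)
pace_table
```

Swapping in a different k, for example one fitted to another race, gives a table tailored to that course.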

The code

This analysis was possible thanks to the uploader for making the chip time data available. Also, a shoutout to Nicola Rennie for sharing how to style social media handles in {ggplot2} graphics. This part of my code requires my {qBrand} library and should be skipped if you are running the code yourself (remove the caption = cap argument in the ggplot calls).
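
Before running the full script, the wrangling step can be tried on a toy table. This is a minimal sketch: the two runners, their times and the extra "10K" split code are invented for illustration, while the column names (RunnerID, splitCode, time) and the reshape and as.difftime calls follow the script. The helper to_mins is a made-up name.

```r
# toy long table: one row per runner per timing point (values invented)
toy <- data.frame(
  RunnerID = c(1, 1, 1, 2, 2, 2),
  splitCode = c("10K", "HALF", "MAR", "10K", "HALF", "MAR"),
  time = c("0:40:00", "1:26:18", "2:56:00", "0:55:00", "2:00:00", "4:20:30"),
  stringsAsFactors = FALSE
)

# keep only the half and finish rows, then go from long to wide
toy <- toy[toy$splitCode %in% c("HALF", "MAR"), ]
wide <- reshape(toy, idvar = "RunnerID", timevar = "splitCode",
                direction = "wide")

# convert "H:M:S" strings to decimal minutes
to_mins <- function(x) {
  as.numeric(as.difftime(x, format = "%H:%M:%S", units = "mins"))
}
wide$split_HALF <- to_mins(wide$time.HALF)
wide$split_MAR <- to_mins(wide$time.MAR)
wide$split_SECOND_HALF <- wide$split_MAR - wide$split_HALF

wide
```

After the reshape there is one row per runner, with time.HALF and time.MAR as separate columns, which is the shape the rest of the analysis relies on.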

library(ggplot2)
library(ggtext)

sysfonts::font_add_google("Roboto", "roboto")
showtext::showtext_auto()

## data wrangling ----

# load csv file from url
url <- paste0("https://huggingface.co/datasets/donaldye8812/",
              "nyc-2025-marathon-splits/resolve/main/",
              "nyrr_marathon_2025_summary_56480_runners_WITH_SPLITS.csv")
df <- read.csv(url)

# the data frame is a long table
# we need to grab the time values where splitCode is "HALF" or "MAR"
df <- df[df$splitCode %in% c("HALF", "MAR"), c("RunnerID", "splitCode", "time")]
# reshape to wide format, values are in time
df <- reshape(df, idvar = "RunnerID", timevar = "splitCode", direction = "wide")
# calculate the split times in minutes
df$split_HALF <- as.numeric(
  as.difftime(df$time.HALF, format = "%H:%M:%S", units = "mins"))
df$split_MAR <- as.numeric(
  as.difftime(df$time.MAR, format = "%H:%M:%S", units = "mins"))
# calculate the second half time
df$split_SECOND_HALF <- df$split_MAR - df$split_HALF
# remove rows with NA values
df <- df[!is.na(df$split_SECOND_HALF), ]
# calculate the difference
df$Difference <- df$split_SECOND_HALF - df$split_HALF
# difference as a fraction of first half
df$Difference_Fraction <- df$Difference / df$split_HALF * 100
# classify by finish time: sub 3 h, 3:00-3:30, 3:30-4:00, 4:00-5:00, over 5 h
df$Category <- cut(df$split_MAR,
                   breaks = c(0, 180, 210, 240, 300, Inf),
                   labels = c("Sub 3 h", "3:00-3:30", "3:30-4:00",
                              "4:00-5:00", "Over 5 h"))

## plot styling ----

social <- qBrand::qSocial()
cap <-  paste0(
  "**Data:** New York City Marathon 2025 Results<br>**Graphic:** ",social
)

my_palette <- c("Sub 3 h" = "#cb2029",
                "3:00-3:30" = "#147f77",
                "3:30-4:00" = "#cf6d21",
                "4:00-5:00" = "#28a91b",
                "Over 5 h" = "#a31a6d")

## make the plots ----

ggplot(df, aes(x = Difference, fill = after_stat(x))) +
  # vertical line at x = 0
  geom_vline(xintercept = 0, linetype = "dashed", color = "black") +
  geom_histogram(breaks = seq(
    from = -59.5, to = 81.5, by = 1), color = "black") +
  scale_colour_gradient2(
    low = "#2b83ba",
    mid = "#ffffbf",
    high = "#d7191c",
    midpoint = 0,
    limits = c(-15,15),
    na.value = "#ffffffff",
    guide = "colourbar",
    aesthetics = "fill",
    oob = scales::squish
  ) +
  scale_x_continuous(breaks = seq(-45,90,15), limits = c(-40, 80)) +
  facet_wrap(~ Category, ncol = 1, scales = "free_y") +
  labs(title = "Most runners positive split the marathon",
       x = "Difference in minutes (Second Half - First Half)",
       y = "Number of Runners",
       caption = cap) +
  theme_classic() +
  # hide legend
  theme(legend.position = "none") +
  theme(
    plot.caption = element_textbox_simple(
      colour = "grey25",
      hjust = 0,
      halign = 0,
      margin = margin(b = 0, t = 5),
      size = rel(0.9)
    ),
    text = element_text(family = "roboto", size = 16),
    plot.title = element_text(size = rel(1.2),
                              face = "bold")
  )

ggsave("Output/Plots/nyc_marathon_2025_split_difference_histogram.png",
       width = 900, height = 1200, dpi = 72, units = "px", bg = "white")

ggplot() +
  geom_abline(slope = 1, linetype = "dashed", color = "black") +
  geom_point(data = df,
             aes(x = split_HALF, y = split_SECOND_HALF, colour = Category),
             shape = 16, size = 1.5, alpha = 0.1) +
  scale_x_continuous(breaks = seq(from = 0, to = 12 * 30, by = 30),
                     labels = seq(from = 0, to = 6, by = 0.5),
                     limits = c(1 * 60, 5 * 60)) +
  scale_y_continuous(breaks = seq(from = 0, to = 12 * 30, by = 30),
                     labels = seq(from = 0, to = 6, by = 0.5),
                     limits = c(1 * 60, 5 * 60)) +
  scale_colour_manual(values = my_palette) +
  labs(x = "First half time (h)",
       y = "Second half time (h)",
       caption = cap) +
  theme_bw() +
  theme(
    plot.caption = element_textbox_simple(
      colour = "grey25",
      hjust = 0,
      halign = 0,
      margin = margin(b = 0, t = 10),
      size = rel(0.9)
    ),
    text = element_text(family = "roboto", size = 16)
  ) +
  guides(colour = guide_legend(override.aes = list(alpha = 1)))

ggsave("Output/Plots/nyc_marathon_2025_split_difference_scatter.png",
       width = 1000, height = 800, dpi = 72, units = "px", bg = "white")

From this data we can also make some calculations to understand…

## fitting ----

# to fit, we'll constrain the line to go through (60,60), i.e. a
# 2 h marathoner who runs even splits
fitting <- data.frame(x = df$split_HALF,y = df$split_SECOND_HALF)
lm( I(y-60) ~ I(x-60) + 0, data = fitting)


# Call:
#   lm(formula = I(y - 60) ~ I(x - 60) + 0, data = fitting)
# 
# Coefficients:
#   I(x - 60)  
# 1.239  

# so for a 90 minute first half, second half would be:
# 60 + 1.239 * (90 - 60) = 97.17 minutes, a finish time of 3:07:10

# to run a 3 h New York Marathon, the average runner needs to run
# 60 / 2.239 + 60 = 86.8 minutes for the first half
# so 1:26:48 for the first half, and 1:33:12 for the second half

# a simpler approach is to calculate the mean of the ratios
mean_ratio <- mean(df$split_SECOND_HALF / df$split_HALF)
mean_ratio
# [1] 1.127581

# filter the df for finish times between 170 and 180 minutes
target <- df[df$split_MAR > 170 & df$split_MAR < 180,]
summary(target)


    RunnerID         time.HALF           time.MAR           split_HALF      split_MAR     split_SECOND_HALF   Difference     
 Min.   :48819892   Length:1289        Length:1289        Min.   :70.25   Min.   :170.0   Min.   : 82.70    Min.   :-7.6833  
 1st Qu.:48834548   Class :character   Class :character   1st Qu.:84.42   1st Qu.:173.6   1st Qu.: 88.07    1st Qu.: 0.7167  
 Median :48849752   Mode  :character   Mode  :character   Median :86.30   Median :176.0   Median : 89.62    Median : 3.0500  
 Mean   :48849498                                         Mean   :85.98   Mean   :175.7   Mean   : 89.73    Mean   : 3.7585  
 3rd Qu.:48864551                                         3rd Qu.:87.87   3rd Qu.:178.2   3rd Qu.: 91.12    3rd Qu.: 5.9000  
 Max.   :48878979                                         Max.   :92.87   Max.   :180.0   Max.   :106.02    Max.   :35.7667  
 Difference_Fraction      Category   
 Min.   :-8.5008     Sub 3 h  :1289  
 1st Qu.: 0.8051     3:00-3:30:   0  
 Median : 3.5390     3:30-4:00:   0  
 Mean   : 4.5178     4:00-5:00:   0  
 3rd Qu.: 7.0055     Over 5 h :   0  
 Max.   :50.9134   

The post title comes from “Marathon Man” by Ian Brown from his “My Way” album. He’s wearing a track suit on the cover but that’s not optimal wear for running a marathon.
