Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
How does the average marathoner pace their race? In this post, we’ll use R to have a look at a large dataset of marathon times to try to answer this question.
The ideal strategy would be to “even split” the race. This is where you run continually at the same pace from kilometre 0 to the finish. Let’s forget about “negative splitting”. This is where you speed up through the race, usually by running at a constant pace for the first half or three-quarters and then increasing the pace. Negative splits are for the pros, not mere mortals! The difficulty with even-splitting the race is that it is very hard to know what pace you can maintain. The marathon gets hard for everyone after 30 km, so a slowdown is almost inevitable. Certainly, if you have started too fast, you will fade. Running the second half slower than the first is known as “positive splitting”.
Why is it so hard to know what pace you can maintain? Well, you can predict a pace based on existing races, e.g. a half marathon, and there are various ways to do this, but it is difficult to tell if you can hold that pace for the marathon. It’s such a brutal event that training up to run one takes time, and it equally takes a while to recover, so experimentation is limited. Running a full marathon (at pace) in training is not advised. So determining an ideal pace involves quite a bit of guesswork.
Let’s take a look at a big dataset of marathon times – we’ll use the New York City Marathon from 2025 – to see if we can understand how to pace a marathon. There’s an available dataset of chip times (meaning we don’t have to worry about dodgy GPS data) and the course has similar first and second half profiles, allowing us to use these times to understand negative/even/positive splitting. Let’s dive in.
You can skip to the code to play along or just see the analysis here.

First, histograms of the difference between second-half and first-half times show that most runners positive split the marathon. There are very few runners who run a negative split (blue bars, left of the dashed line). More runners even-split (yellow), but the majority run positive (red) split times.
For marathoners finishing in under 3 h, the modal split is only +2 minutes. Over 21.1 km this is a loss of only about 6 s per km. For marathoners finishing in over 3 h, this loss gets more severe. Those finishing outside 5 h ship 20 minutes or more in the second half.
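As a quick sanity check on that arithmetic, here is a minimal sketch in R that converts a split difference in minutes into seconds lost per kilometre over the second half. The helper name `loss_per_km` is made up for illustration, and 21.0975 km is the standard half-marathon distance.

```r
# Half-marathon distance in km (a full marathon is 42.195 km)
half_km <- 21.0975

# Hypothetical helper: seconds lost per km in the second half,
# given the split difference (second half - first half) in minutes
loss_per_km <- function(split_diff_min) {
  split_diff_min * 60 / half_km
}

loss_per_km(2)   # roughly 6 s per km
loss_per_km(20)  # roughly 57 s per km
```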
At first glance this looks like better pace management by the faster runners, but these positive splits could be proportional to the paces being run. In other words, a slower runner should ship more time in the second half, because they’re running more slowly.
We can look at this data a different way and directly compare the first and second half times for each runner. Again this highlights just how few runners negative- or even-split the marathon. Most are positive splitting and are in the upper left half of the plot. We can also see that the data veers away from the ideal even-split (dashed line) with the slower paces. This veering looks linear (straight line).
We can fit a line to this data, and constrain it to go through (1,1) on the plot, i.e. (60, 60) in minutes: a 2 h marathoner even-splitting the race. To do this in R we can use lm(formula = I(y - 60) ~ I(x - 60) + 0, data = fitting), which gives the coefficient for I(x - 60) as 1.24. This is essentially the fade coefficient for the average runner in the 2025 edition of this race.
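To see what the constraint is doing, here is a small sketch using made-up data (not the NYC results). Subtracting 60 from both variables and dropping the intercept with `+ 0` forces the fitted line through the point (60, 60) in minutes. The `toy` data frame, the seed and the true slope of 1.2 are all assumptions chosen purely for illustration.

```r
# Made-up first/second half times in minutes (not real race data)
set.seed(1)
x <- runif(200, min = 70, max = 180)
y <- 60 + 1.2 * (x - 60) + rnorm(200, sd = 3)
toy <- data.frame(x = x, y = y)

# Shift both axes by 60 and suppress the intercept, so the only free
# parameter is the slope of a line passing through (60, 60)
fit <- lm(I(y - 60) ~ I(x - 60) + 0, data = toy)
fade <- unname(coef(fit))
fade

# By construction, a 60 minute first half predicts a 60 minute second half
pred_at_60 <- 60 + fade * (60 - 60)
pred_at_60
```

The fitted slope should land close to the 1.2 used to simulate the data, which is the same role the 1.24 plays for the real dataset.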
What does that mean? Well, for a runner achieving a 90 minute first half, their second half would most likely be: 60 + 1.239 * (90 - 60) = 97.17 minutes, so this would be a finish time of 3:07:10.
For anyone looking to run a 3 h New York Marathon, the average runner would therefore need to run 60 / 2.239 + 60 = 86.8 minutes for the first half to anticipate the fade. So 1:26:48 for the first half, and then 1:33:12 for the second half.
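The two calculations above can be wrapped in a couple of small helpers. This is just a sketch: the function names (`second_half`, `first_half_for`, `fmt_hms`) are made up for illustration, while `fade` and `anchor` are the 1.239 coefficient and the 60 minute constraint from the fit.

```r
fade   <- 1.239  # coefficient from the constrained fit
anchor <- 60     # minutes; the fitted line passes through (60, 60)

# Predicted second half (minutes) for a given first half (minutes)
second_half <- function(first_half) {
  anchor + fade * (first_half - anchor)
}

# First half (minutes) needed to hit a target finish time (minutes).
# Solves finish = x + anchor + fade * (x - anchor) for x.
first_half_for <- function(finish) {
  anchor + (finish - 2 * anchor) / (1 + fade)
}

# Format minutes as h:mm:ss
fmt_hms <- function(mins) {
  secs <- as.integer(round(mins * 60))
  sprintf("%d:%02d:%02d", secs %/% 3600L, (secs %% 3600L) %/% 60L, secs %% 60L)
}

fmt_hms(90 + second_half(90))          # "3:07:10"
fmt_hms(first_half_for(180))           # "1:26:48"
fmt_hms(180 - first_half_for(180))     # "1:33:12"
```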
A simpler calculation is to take the mean of the ratio between the two half times for everyone in the dataset. This gives a fade coefficient of 1.13. The difference between these two fade coefficients arises because the ratio lacks the constraint used in the fit (an even-splitting 2 h marathoner). The ratio predicts a positive split being inevitable for the fastest runners, which is probably not true. Anyhow, this puts the first half time at 88 minutes for folks looking to run 3 h. These fade coefficients are good predictors for a range of times, and I suspect would be similar at other marathon events with a similar profile. You can use them to calculate your ideal pace for a target finish time.
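One way to see how the two approaches differ at the fast end is to ask each of them what second half follows a 60 minute first half. The sketch below assumes the ratio is applied as a simple multiplier on the first half time; the function names are made up for illustration.

```r
ratio <- 1.13   # mean of second half / first half
fade  <- 1.239  # slope from the fit constrained through (60, 60)

# Ratio approach: second half is a fixed multiple of the first half
ratio_model <- function(first_half) ratio * first_half

# Constrained fit: no fade at all for a 60 minute first half
constrained_model <- function(first_half) 60 + fade * (first_half - 60)

ratio_model(60)        # 67.8, i.e. a fade even at 2 h marathon pace
constrained_model(60)  # 60, i.e. an even split
```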
Finally, for the most accurate answer about sub-3 h pacing, we can look directly at runners finishing between 02:50:00 and 03:00:00 and see what they actually ran. The median first half time was 86.3 min (IQR = 84.42 – 87.87) and the median second half was 89.62 min (IQR = 88.07 – 91.12). This gives a median finish time of 2:56:00. So running a 1:26:18 first half would give someone their best chance of finishing in under 3 h, allowing for the inevitable fade.
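The same kind of summary can be sketched on a toy example. The numbers below are invented purely to show the mechanics of filtering by finish time and taking the median and quartiles; `in_window` is a hypothetical helper, not part of the analysis code.

```r
# Made-up half times in minutes, purely for illustration
toy <- data.frame(
  first  = c(84, 85, 86, 87, 88),
  second = c(88, 89, 90, 91, 92)
)
toy$finish <- toy$first + toy$second

# Keep rows with finish times strictly between lo and hi minutes
in_window <- function(d, lo, hi) d[d$finish > lo & d$finish < hi, ]
target <- in_window(toy, 170, 180)

# 25th percentile, median and 75th percentile for each half
first_summary  <- quantile(target$first,  probs = c(0.25, 0.5, 0.75))
second_summary <- quantile(target$second, probs = c(0.25, 0.5, 0.75))
first_summary
second_summary
```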
The takeaway message is: to finish within a goal time, do not assume even splits. That is, if you want to run 3 hours 30 min and bank on 105 minutes per half (4:59/km), you will most likely fail to hit the target. Build in a buffer of time to allow for the inevitable fade. A pace of 4:45/km is a better target pace (see below).
Good luck!
| Finish time | Even-split pace (per km) | Target pace (per km) |
| 03:00:00 | 00:04:16 | 00:04:07 |
| 03:30:00 | 00:04:59 | 00:04:45 |
| 04:00:00 | 00:05:41 | 00:05:23 |
| 04:30:00 | 00:06:24 | 00:06:01 |
| 05:00:00 | 00:07:07 | 00:06:39 |
| 06:00:00 | 00:08:32 | 00:07:55 |
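The paces in the table can be reproduced with a few lines of R, assuming the target pace is the first-half pace implied by the constrained fit (coefficient 1.239, line through 60 minutes for each half). The variable and function names here are made up for illustration.

```r
fade    <- 1.239     # fade coefficient from the constrained fit
anchor  <- 60        # minutes; the fitted line passes through (60, 60)
half_km <- 21.0975   # half-marathon distance in km

# Goal finish times in minutes (3:00, 3:30, 4:00, 4:30, 5:00, 6:00)
finish <- c(180, 210, 240, 270, 300, 360)

# Even-split pace: the same pace over the full 42.195 km
even_pace <- finish / (2 * half_km)

# Target pace: first-half pace that anticipates the fade
first_half  <- anchor + (finish - 2 * anchor) / (1 + fade)
target_pace <- first_half / half_km

# Format a pace in minutes per km as m:ss
fmt_pace <- function(mins) {
  secs <- as.integer(round(mins * 60))
  sprintf("%d:%02d", secs %/% 60L, secs %% 60L)
}

pace_table <- data.frame(
  finish_min  = finish,
  even_split  = fmt_pace(even_pace),
  target_pace = fmt_pace(target_pace)
)
pace_table
```

Swapping in a different coefficient, or your own goal times, gives the equivalent paces for another race or target.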
The code
This analysis was possible thanks to the uploader for making the chip time data available. Also, a shoutout to Nicola Rennie for sharing how to style social media handles in {ggplot2} graphics. This part of my code requires my {qBrand} library and should be skipped if you are running the code yourself (remove the caption = cap argument in the ggplot calls).
library(ggplot2)
library(ggtext)
sysfonts::font_add_google("Roboto", "roboto")
showtext::showtext_auto()
## data wrangling ----
# load csv file from url
url <- paste0("https://huggingface.co/datasets/donaldye8812/",
"nyc-2025-marathon-splits/resolve/main/",
"nyrr_marathon_2025_summary_56480_runners_WITH_SPLITS.csv")
df <- read.csv(url)
# the data frame is a long table
# we need to grab the time values where splitCode is "HALF" or "MAR"
df <- df[df$splitCode %in% c("HALF", "MAR"), c("RunnerID", "splitCode", "time")]
# reshape to wide format, values are in time
df <- reshape(df, idvar = "RunnerID", timevar = "splitCode", direction = "wide")
# calculate the split times in minutes
df$split_HALF <- as.numeric(
as.difftime(df$time.HALF, format = "%H:%M:%S", units = "mins"))
df$split_MAR <- as.numeric(
as.difftime(df$time.MAR, format = "%H:%M:%S", units = "mins"))
# calculate the second half time
df$split_SECOND_HALF <- df$split_MAR - df$split_HALF
# remove rows with NA values
df <- df[!is.na(df$split_SECOND_HALF), ]
# calculate the difference
df$Difference <- df$split_SECOND_HALF - df$split_HALF
# difference as a fraction of first half
df$Difference_Fraction <- df$Difference / df$split_HALF * 100
# classify by finish time: sub 3 h, 3:00-3:30, 3:30-4:00, 4:00-5:00, over 5 h
df$Category <- cut(df$split_MAR,
breaks = c(0, 180, 210, 240, 300, Inf),
labels = c("Sub 3 h", "3:00-3:30", "3:30-4:00",
"4:00-5:00", "Over 5 h"))
## plot styling ----
social <- qBrand::qSocial()
cap <- paste0(
"**Data:** New York City Marathon 2025 Results<br>**Graphic:** ",social
)
my_palette <- c("Sub 3 h" = "#cb2029",
"3:00-3:30" = "#147f77",
"3:30-4:00" = "#cf6d21",
"4:00-5:00" = "#28a91b",
"Over 5 h" = "#a31a6d")
## make the plots ----
ggplot(df, aes(x = Difference, fill = after_stat(x))) +
# vertical line at x = 0
geom_vline(xintercept = 0, linetype = "dashed", color = "black") +
geom_histogram(breaks = seq(
from = -59.5, to = 81.5, by = 1), color = "black") +
scale_colour_gradient2(
low = "#2b83ba",
mid = "#ffffbf",
high = "#d7191c",
midpoint = 0,
limits = c(-15,15),
na.value = "#ffffffff",
guide = "colourbar",
aesthetics = "fill",
oob = scales::squish
) +
scale_x_continuous(breaks = seq(-45,90,15), limits = c(-40, 80)) +
facet_wrap(~ Category, ncol = 1, scales = "free_y") +
labs(title = "Most runners positive split the marathon",
x = "Difference in minutes (Second Half - First Half)",
y = "Number of Runners",
caption = cap) +
theme_classic() +
# hide legend
theme(legend.position = "none") +
theme(
plot.caption = element_textbox_simple(
colour = "grey25",
hjust = 0,
halign = 0,
margin = margin(b = 0, t = 5),
size = rel(0.9)
),
text = element_text(family = "roboto", size = 16),
plot.title = element_text(size = rel(1.2),
face = "bold")
)
ggsave("Output/Plots/nyc_marathon_2025_split_difference_histogram.png",
width = 900, height = 1200, dpi = 72, units = "px", bg = "white")
ggplot() +
geom_abline(slope = 1, linetype = "dashed", color = "black") +
geom_point(data = df,
aes(x = split_HALF, y = split_SECOND_HALF, colour = Category),
shape = 16, size = 1.5, alpha = 0.1) +
scale_x_continuous(breaks = seq(from = 0, to = 12 * 30, by = 30),
labels = seq(from = 0, to = 6, by = 0.5),
limits = c(1 * 60, 5 * 60)) +
scale_y_continuous(breaks = seq(from = 0, to = 12 * 30, by = 30),
labels = seq(from = 0, to = 6, by = 0.5),
limits = c(1 * 60, 5 * 60)) +
scale_colour_manual(values = my_palette) +
labs(x = "First half time (h)",
y = "Second half time (h)",
caption = cap) +
theme_bw() +
theme(
plot.caption = element_textbox_simple(
colour = "grey25",
hjust = 0,
halign = 0,
margin = margin(b = 0, t = 10),
size = rel(0.9)
),
text = element_text(family = "roboto", size = 16)
) +
guides(colour = guide_legend(override.aes = list(alpha = 1)))
ggsave("Output/Plots/nyc_marathon_2025_split_difference_scatter.png",
width = 1000, height = 800, dpi = 72, units = "px", bg = "white")
From this data we can also make some calculations to understand…
## fitting ----
# to fit, we'll constrain the line to go through (60,60), i.e. a
# 2 h marathoner who runs even splits
fitting <- data.frame(x = df$split_HALF,y = df$split_SECOND_HALF)
lm( I(y-60) ~ I(x-60) + 0, data = fitting)
# Call:
# lm(formula = I(y - 60) ~ I(x - 60) + 0, data = fitting)
#
# Coefficients:
# I(x - 60)
# 1.239
# so for a 90 minute first half, second half would be:
# 60 + 1.239 * (90 - 60) = 97.17 minutes, a finish time of 3:07:10
# to run a 3 h New York Marathon, the average runner needs to run
# 60 / 2.239 + 60 = 86.8 minutes for the first half
# so 1:26:48 for the first half, and 1:33:12 for the second half
# a simpler approach is to calculate the mean of the ratios
mean_ratio <- mean(df$split_SECOND_HALF / df$split_HALF)
mean_ratio
# [1] 1.127581
# filter the df for finish times between 170 and 180 minutes
target <- df[df$split_MAR > 170 & df$split_MAR < 180,]
summary(target)
# RunnerID          time.HALF         time.MAR          split_HALF     split_MAR      split_SECOND_HALF  Difference
# Min.   :48819892  Length:1289       Length:1289       Min.   :70.25  Min.   :170.0  Min.   : 82.70     Min.   :-7.6833
# 1st Qu.:48834548  Class :character  Class :character  1st Qu.:84.42  1st Qu.:173.6  1st Qu.: 88.07     1st Qu.: 0.7167
# Median :48849752  Mode  :character  Mode  :character  Median :86.30  Median :176.0  Median : 89.62     Median : 3.0500
# Mean   :48849498                                      Mean   :85.98  Mean   :175.7  Mean   : 89.73     Mean   : 3.7585
# 3rd Qu.:48864551                                      3rd Qu.:87.87  3rd Qu.:178.2  3rd Qu.: 91.12     3rd Qu.: 5.9000
# Max.   :48878979                                      Max.   :92.87  Max.   :180.0  Max.   :106.02     Max.   :35.7667
#
# Difference_Fraction  Category
# Min.   :-8.5008      Sub 3 h  :1289
# 1st Qu.: 0.8051      3:00-3:30:   0
# Median : 3.5390      3:30-4:00:   0
# Mean   : 4.5178      4:00-5:00:   0
# 3rd Qu.: 7.0055      Over 5 h :   0
# Max.   :50.9134
—
The post title comes from “Marathon Man” by Ian Brown from his “My Way” album. He’s wearing a track suit on the cover but that’s not optimal wear for running a marathon.