Marathon Man II: how to pace a marathon
It’s often the way. I posted recently about how to pace a marathon and very quickly received feedback that would’ve improved the original post. Oh well, no going back. This is take two.
So, we have a dataset of all runners from the 2025 New York City Marathon. We want to know how you should pace a marathon. What is the best strategy?
Determining your optimal pace is complex. There’s the theoretical pace that you can achieve – a mix of biomechanics, physiology and training – but it can be very hard to know what this pace is. Anyway, this theoretical pace is what you could achieve when all goes well. You need to factor in the conditions on the day – how you slept, how you fuel, mental attitude, is it windy? can you get in a group and work with others? and so on. A runner may toe the line in the shape to run a sub 3 h marathon, but by the 30 km mark, the story may be very different.
In the last post, we saw that positive splitting (otherwise known as slowing down) is inevitable. So it seems the best strategy is to start out faster than your goal pace and bank some time to account for the fade.
A reader responded with this insightful comment:
What I question, though, is whether a (very thorough) analysis of how marathons get run tells us much about how they should be run? This seems to be saying, “Forget about an optimal pace, here’s how to compensate for the sub-optimal pace you’re going to run despite your plans.”
This is correct. Any post hoc analysis like this can only tell us how the marathon was run, not how it should be run. This is because we don’t know the intention of any runner in the dataset. If we did, then we would know how a runner intended to run the race (i.e. what their pacing strategy was) and then we could ask: did that work out for them?
If only we knew their intention… hmm…
The idea
The sub-3 marathon is one of the big goals in running. That is, trying to run it in less than 3 hours. So we know that there are a bunch of runners in the dataset trying to do just that. We know the finish times too. So, to a first approximation, the runners finishing between 02:55:00 and 03:00:00 were the folks shooting for sub-3 who achieved it, while those finishing between 03:00:00 and 03:05:00 were those who didn’t make it. Sure, there will be some in this window who were hoping for 02:50:00 and failed and some who were hoping to do 03:10:00 and ran amazingly well. But by narrowing the window to 5 min either side of 3 h, we have fewer of those than if we took 10 min either side.
If we assume that runners in the 02:55:00 to 03:05:00 finishing window intended to run for a finish time of 3 h, we can analyse how they paced the marathon and how it worked out for them.
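The window selection described above can be sketched in a few lines of base R. This is only an illustration on made-up finish times (the `finish` data frame and its values are hypothetical, not taken from the dataset); it mirrors the script further down in treating exactly 03:00:00 as "Sub 3".

```r
# Toy finish times in seconds (made-up values, NOT from the NYC dataset)
finish <- data.frame(
  RunnerID = 1:6,
  secs     = c(10450, 10620, 10799, 10800, 10985, 11200)
)

lower <- 2 * 3600 + 55 * 60  # 02:55:00
goal  <- 3 * 3600            # 03:00:00
upper <- 3 * 3600 + 5 * 60   # 03:05:00

# Keep runners finishing within 5 min either side of the 3 h goal
in_window <- finish[finish$secs > lower & finish$secs <= upper, ]

# Label the outcome; a finish of exactly 03:00:00 counts as "Sub 3" here,
# matching the <= comparison used in the main script
in_window$Category <- ifelse(in_window$secs <= goal, "Sub 3", "Over 3")
in_window
```

Widening `lower` and `upper` would add more runners but also more of the "contaminants" who were aiming for a different time.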
Analysing this window also has the advantage that runners of this calibre know how to pace well, compared to those trying for 03:30:00 or 04:00:00. There are also plenty of them, given that it is a popular goal.
So let’s take a look. Plots first and then the code below.
Going for sub-3
We’ll use difference from goal pace to visualise runners’ progress. The goal pace here is ~04:16/km. Below 0 is running ahead of pace (banking time) and running above 0 means being behind schedule.
We colour the runners by whether they made it, sub 3 (red) or failed, went over 3 (blue).
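As a rough sketch of the "difference from goal pace" measure, here is the arithmetic in base R. The helper names `diff_from_goal()` and `fmt_mmss()` are hypothetical, and the sketch assumes the official 42.195 km distance (the script further down uses 42.19, which differs only marginally).

```r
# Assumptions: official marathon distance of 42.195 km and a 3 h goal
marathon_km <- 42.195
goal_secs   <- 3 * 60 * 60

# Even goal pace in seconds per km
goal_pace <- goal_secs / marathon_km

# Hypothetical helper: seconds relative to goal pace at a waypoint.
# Negative = ahead of pace (time banked), positive = behind schedule.
diff_from_goal <- function(elapsed_secs, distance_km) {
  elapsed_secs - goal_pace * distance_km
}

# Hypothetical helper: format whole seconds as mm:ss
fmt_mmss <- function(secs) {
  secs <- round(secs)
  sprintf("%02d:%02d", secs %/% 60, secs %% 60)
}

fmt_mmss(goal_pace)          # goal pace per km
diff_from_goal(42 * 60, 10)  # a runner through 10 km in 42:00
```

A runner through 10 km in 42:00 comes out roughly 40 s ahead of schedule, i.e. below the dashed zero line in the plots.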

770 runners were sub 3 whereas 628 were over 3. This can be difficult to see, so let’s take a different view.


For both outcomes we have several runners who were clearly shooting for a faster time, but something went wrong and they ended up in our window. They appear as U-shapes in the difference plots. Rather than remove them, we’ll accept these contaminants and assume that most folks in this window are shooting for a 3 h finish.
We can see different pacing as the race progresses for different runners. Some folks are behind schedule but end up making sub-3, others are ahead of time and fail. To answer our question we need to know: what is the best strategy?
On-pace, positive split, negative split?
We’ll take 10 km as our marker point. It’s almost one-quarter of the way through. Any excitement of the start, with all the crowds messing up pacing, is done and we can see at this point who is intending to run at what pace. Let’s say that “on pace” means within 2 s/km of goal pace. So at 10 km, an “on pace” runner could be ± 20 s from where they should be (00:42:40). If they are more than 20 s slower than that we’ll say they are behind pace, and if they are more than 20 s faster, they are ahead of pace.
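The ± 2 s/km rule can be written as a small function. This is a minimal sketch in base R; `classify_pace()` is a hypothetical helper, not part of the script below, although it applies the same thresholds (values exactly on the boundary count as goal pace).

```r
# Hypothetical helper: label a runner's pacing at a waypoint.
#   diff_secs   seconds relative to goal pace (negative = ahead)
#   distance_km waypoint distance in km
#   tol_per_km  tolerance in s/km, 2 s/km as in the text
classify_pace <- function(diff_secs, distance_km, tol_per_km = 2) {
  tol <- tol_per_km * distance_km
  ifelse(diff_secs < -tol, "Ahead",
         ifelse(diff_secs > tol, "Behind", "Goal Pace"))
}

# Five made-up runners at the 10 km mark (tolerance is +/- 20 s)
classify_pace(c(-45, -20, 5, 20, 31), distance_km = 10)
```

Because the tolerance scales with distance, the same function works for any waypoint: at 30 km the "on pace" band is ± 60 s.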
Knowing this, we can look at the outcome. Of the runners going for 3 h, what was the best strategy?

We can see that most people do not negative split a sub-3 marathon. The majority of people who make the goal run the first 10 km (and indeed most of the race) ahead of goal pace.
There’s a risk here though, going out at faster than goal pace means that you might fail. The yellow traces really show how, at 30-35 km, the race gets very tough and people can slow down significantly. Anyone who has run a marathon will tell you that “the race only starts at the 30 km mark”. It’s where people start to hit the wall and this plot really shows that. These folks could have misjudged their theoretical best pace or just struggled on this occasion.
I find the strategy for success interesting. A lot of advice out there is to start out a marathon at an easier pace and speed up if you can. While it’s true you shouldn’t go too fast and blow up, the advice should be to train to run more than 2 s/km ahead of goal pace and try to maintain that.
Tell me the odds
With all the caveats in place, let’s try and get some individual-level probabilities from our population-level data.
We looked at the 10 km point, applying a ± 2 s/km threshold for goal pace, and the behind/ahead classifications. We can do this for every waypoint that we have data for. Now, we can say for a given waypoint: of the runners that were say, ahead of pace, how many finished sub-3 (succeeded) and how many were over-3 (failed). This gives us a probability of success for that strategy at that waypoint. We can then plot these probabilities out.
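To make the calculation concrete, here is a minimal base R sketch on a made-up table. The `toy` data frame is hypothetical: one row per runner per waypoint, with the pacing class at that waypoint and the final outcome. The full script below does the same thing with dplyr on the real data.

```r
# Made-up data: pacing class at two waypoints and the final outcome
toy <- data.frame(
  distance     = rep(c(10, 30), each = 6),
  par_category = c("Ahead", "Ahead", "Ahead", "Goal Pace", "Goal Pace", "Behind",
                   "Ahead", "Ahead", "Goal Pace", "Behind", "Behind", "Behind"),
  Category     = c("Sub 3", "Sub 3", "Over 3", "Sub 3", "Over 3", "Over 3",
                   "Sub 3", "Sub 3", "Sub 3", "Over 3", "Over 3", "Over 3")
)

# TRUE where the runner finished sub-3
toy$success <- toy$Category == "Sub 3"

# Share of sub-3 finishers within each (waypoint, pacing class) group
p_success <- aggregate(success ~ distance + par_category, data = toy, FUN = mean)
p_success$percentage <- 100 * p_success$success
p_success[order(p_success$distance, p_success$par_category), ]
```

With so few runners per group the toy percentages are meaningless; the point is only the shape of the calculation, a success rate conditional on pacing class at each waypoint.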

If your strategy was to go ahead of pace and you were ahead of pace at 5 km, you have a 65% chance of going sub-3. If you are ahead of pace at 30 km, it climbs to a 72% chance. Obviously it keeps climbing to certain success the further the race progresses.
Running at goal pace gives a 50/50 chance of making it if you’re on pace at 5 km. But if you are only on-pace at the halfway point, your chance of success drops to 37%.
If you are behind pace at 10 km, you have a 19% chance of success and this probability drops as the race continues. Eventually, we hit the point where it is not possible to make up the time that’s lost and it is 100% likely that you will fail.
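One way to see why a deficit becomes unrecoverable late in the race is to work out the pace needed over the remaining distance. The sketch below is illustrative arithmetic only, not part of the analysis; `required_pace()` is a hypothetical helper and it assumes the official 42.195 km distance.

```r
# Assumptions: official marathon distance of 42.195 km and a 3 h goal
marathon_km <- 42.195
goal_secs   <- 3 * 3600
goal_pace   <- goal_secs / marathon_km  # s/km for an even 3 h race

# Hypothetical helper: pace (s/km) needed over the rest of the race to
# finish in exactly 3 h, given a deficit (seconds behind) at a waypoint
required_pace <- function(deficit_secs, distance_km) {
  elapsed <- goal_pace * distance_km + deficit_secs
  (goal_secs - elapsed) / (marathon_km - distance_km)
}

# The same 60 s deficit at 10 km and at 40 km
required_pace(60, 10)
required_pace(60, 40)
```

At 10 km, a 60 s deficit costs about 2 s/km over the remaining distance. At 40 km, the same deficit means running the last 2.195 km nearly 30 s/km faster than goal pace, which is unrealistic for a runner who has already fallen behind.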
The best strategy?
The best strategy is to go out faster than goal pace and this is what you should train for.
Negative splitting is rare. Slowing down after 30 km is highly likely. Failing to account for this means potentially missing out on your goal.
This message is not too different from the previous post, but we now have some probabilities to back up advice on how the race should be run.
Caveats
Most people running a marathon are first-timers who will run this one race and their goal is to simply finish. Let’s face it, most non-runners have no idea whether your finish time was good/bad/whatever. They will just be impressed that you finished! This post is intended for repeat offenders who strive to improve their time. Maybe the best advice is to just go out there, enjoy running your marathon and not worry about pacing. It’s the best feeling in the world to have achieved it whether it’s your first or fifth.
This analysis is obviously limited to one dataset, the 2025 New York City Marathon. It has a flat profile, so any of the probabilities will likely only apply over a similarly flat course in similar conditions. I also mentioned that we assume a 3 h goal for the runners in the window and we saw how this is not perfect, but it is the best we can do. Obviously, the pacing for other goal times may be different, but we saw in the previous analysis that positive splitting is the most likely scenario regardless of pace.
The code
library(ggplot2)
library(ggtext)
library(dplyr)
library(hms)
## plot styling ----
# qBrand plot styling is used via theme_q(). Assumption: if theme_q() is not
# available, fall back to theme_minimal() so that the script still runs.
if (!exists("theme_q")) {
  theme_q <- function(...) ggplot2::theme_minimal(...)
}
my_colours <- c("Sub 3 - Behind" = "#003d5c",
                "Sub 3 - Goal Pace" = "#954e9b",
                "Sub 3 - Ahead" = "#ff6b59",
                "Over 3 - Behind" = "#464c89",
                "Over 3 - Goal Pace" = "#dd4d88",
                "Over 3 - Ahead" = "#ffa600")
my_levels <- c("Sub 3 - Behind",
               "Sub 3 - Goal Pace",
               "Sub 3 - Ahead",
               "Over 3 - Behind",
               "Over 3 - Goal Pace",
               "Over 3 - Ahead")
## data wrangling ----
# load csv file from url
# url <- paste0("https://huggingface.co/datasets/donaldye8812/",
# "nyc-2025-marathon-splits/resolve/main/",
# "nyrr_marathon_2025_summary_56480_runners_WITH_SPLITS.csv")
# df <- read.csv(url)
# save locally
# write.csv(df, "Output/Data/nyc_marathon_2025_splits.csv", row.names = FALSE)
## main script ----
df <- read.csv("Output/Data/nyc_marathon_2025_splits.csv")
times_df <- df %>%
  select(RunnerID, splitCode, time)
runners_df <- df %>%
  select(RunnerName, RunnerID, OverallTime, OverallPlace, Gender,
         Age, City, Country, Bib) %>%
  unique()
runners_df$OverallTime <- as_hms(runners_df$OverallTime)
# unique pairs of splitCode and distance -- and add distance in km
split_distances <- df %>%
  select(splitCode, distance) %>%
  unique()
# NB: distances are assigned by position, so this relies on the order in
# which the split codes first appear in the data
split_distances$distance <- c(4.83, 5.00, 6.44, 8.05, 9.66, 10.00, 11.27, 12.87, 14.48,
                              15.00, 16.09, 17.70, 19.31, 20.00, 20.92, 21.08, 22.53,
                              24.14, 25.00, 25.75, 27.36, 28.97, 30.00, 30.58, 32.19,
                              33.80, 35.00, 35.41, 37.01, 38.62, 40.00, 40.23, 41.84,
                              42.20)
# merge split distances with times_df
times_df <- merge(times_df, split_distances, by = "splitCode", sort = FALSE)
# order the table by RunnerID and then by distance
times_df <- times_df[order(times_df$RunnerID, times_df$distance), ]
row.names(times_df) <- NULL
# time is character, change it
times_df$time <- as_hms(times_df$time)
# make a df of RunnerID, OverallTime, and a new column called Category which is
# "Sub 3" or "Over 3"
category_df <- runners_df %>%
  select(RunnerID, OverallTime) %>%
  filter(OverallTime > as_hms("02:55:00") & OverallTime <= as_hms("03:05:00")) %>%
  mutate(Category = ifelse(OverallTime <= as_hms("03:00:00"), "Sub 3", "Over 3"))
# merge category_df with times_df to get the pace for each runner in each
# category and drop any rows with NA values
# on_par is the time relative to an even 3 h pace (negative = ahead of pace)
times_df <- merge(times_df, category_df,
                  by = "RunnerID", all.x = TRUE, sort = FALSE) %>%
  filter(!is.na(Category)) %>%
  mutate(on_par = time - (as_hms("03:00:00") / 42.19 * distance))
ggplot(times_df, aes(x = distance, y = on_par, group = RunnerID, color = Category)) +
  geom_abline(slope = 0, intercept = 0, linetype = "dashed", color = "black") +
  geom_line(alpha = 0.2) +
  scale_color_manual(values = c("Sub 3" = "#ff6b59", "Over 3" = "#464c89")) +
  labs(title = "Difference from Goal Pace for Sub-3 and Over-3 Runners in NYC Marathon 2025",
       x = "Distance (km)",
       y = "Difference from Goal Pace (seconds)",
       color = "Category") +
  theme_q() +
  guides(colour = guide_legend(override.aes = list(alpha = 1)))
ggsave("Output/Plots/sub_over_3_comparison.png", width = 7, height = 4, dpi = 300)
ggplot() +
  geom_abline(slope = 0, intercept = 0, linetype = "dashed", color = "black") +
  geom_line(data = times_df %>%
              filter(Category == "Sub 3"),
            aes(x = distance, y = on_par, group = RunnerID),
            color = "grey", alpha = 0.2) +
  geom_line(data = times_df %>%
              filter(Category == "Over 3"),
            aes(x = distance, y = on_par, group = RunnerID),
            color = "#464c89", alpha = 0.2) +
  labs(title = "Over-3 Runners in NYC Marathon 2025",
       x = "Distance (km)",
       y = "Difference from Goal Pace (seconds)") +
  theme_q()
ggsave("Output/Plots/over_3_comparison.png", width = 7, height = 4, dpi = 300)
ggplot() +
  geom_abline(slope = 0, intercept = 0, linetype = "dashed", color = "black") +
  geom_line(data = times_df %>% filter(Category == "Over 3"),
            aes(x = distance, y = on_par, group = RunnerID),
            color = "grey", alpha = 0.2) +
  geom_line(data = times_df %>% filter(Category == "Sub 3"),
            aes(x = distance, y = on_par, group = RunnerID),
            color = "#ff6b59", alpha = 0.2) +
  labs(title = "Sub-3 Runners in NYC Marathon 2025",
       x = "Distance (km)",
       y = "Difference from Goal Pace (seconds)") +
  theme_q()
ggsave("Output/Plots/sub_3_comparison.png", width = 7, height = 4, dpi = 300)
# classify on_par into three categories: "Ahead" for values less than
# -20, "Goal Pace" for values between -20 and 20, and "Behind" for values
# greater than 20 at the 10K mark, i.e. 2 seconds per km * 10 km = 20 seconds
class_df <- times_df %>%
  mutate(par_category = case_when(
    distance == 10 & on_par < -20 ~ "Ahead",
    distance == 10 & on_par >= -20 & on_par <= 20 ~ "Goal Pace",
    distance == 10 & on_par > 20 ~ "Behind",
    TRUE ~ NA_character_
  )) %>%
  filter(!is.na(par_category))
# paste Category and par_category together to make a new column called final_category
class_df <- class_df %>%
  mutate(final_category = paste(Category, par_category, sep = " - ")) %>%
  select(RunnerID, final_category)
# merge class_df with times_df to get the final_category for each runner in
# each category and drop any rows with NA values
times_df <- merge(times_df, class_df, by = "RunnerID", all.x = TRUE) %>%
  filter(!is.na(final_category))
# use my_levels to get facets in the order of my_levels
times_df$final_category <- factor(times_df$final_category, levels = my_levels)
# ggplot of on_par by distance colored by final_category
ggplot(times_df, aes(x = distance, y = on_par, group = RunnerID, color = final_category)) +
  geom_abline(slope = 0, intercept = 0, linetype = "dashed", color = "black") +
  geom_line(alpha = 0.2) +
  scale_color_manual(values = my_colours) +
  labs(title = "Pacing at 10 km and Overall Outcome",
       x = "Distance (km)",
       y = "Difference from Par Time (seconds)",
       color = "Category") +
  facet_wrap(~ final_category) +
  theme_q() +
  # hide the legend (the facet strips already label each group); this is set
  # after theme_q() so that a complete theme cannot override it
  theme(legend.position = "none")
ggsave("Output/Plots/pacing_by_final_category.png", width = 10, height = 6, dpi = 300)
# calculate the probability of success
# list of unique distances in numerical order
distance_list <- sort(unique(times_df$distance))
all_p_df <- tibble()
for (i in seq_along(distance_list)) {
  dist <- distance_list[i]
  # tolerance at this waypoint: 2 s per km covered so far
  par <- as_hms("00:00:02") * dist
  class_df <- times_df %>%
    mutate(par_category = case_when(
      distance == dist & on_par < -par ~ "Ahead",
      distance == dist & on_par >= -par & on_par <= par ~ "Goal Pace",
      distance == dist & on_par > par ~ "Behind",
      TRUE ~ NA_character_
    )) %>%
    filter(!is.na(par_category)) %>%
    select(RunnerID, Category, par_category) %>%
    group_by(Category, par_category) %>%
    summarise(count = n(), .groups = "drop") %>%
    # share of each outcome within a pacing class at this waypoint
    group_by(par_category) %>%
    mutate(percentage = count / sum(count) * 100) %>%
    ungroup() %>%
    mutate(distance = dist) %>%
    select(distance, Category, par_category, percentage)
  all_p_df <- rbind(all_p_df, class_df)
}
all_p_df$final_category <- paste(all_p_df$Category, all_p_df$par_category, sep = " - ")
all_p_df$final_category <- factor(all_p_df$final_category, levels = my_levels)
all_p_df %>%
  filter(grepl("^Sub 3", final_category)) %>%
  ggplot(aes(x = distance, y = percentage, colour = final_category)) +
  geom_line() +
  scale_color_manual(values = my_colours) +
  labs(title = "Probability of Success for Pacing Strategies by Distance",
       x = "Distance (km)",
       y = "Probability of Sub-3 (%)",
       color = "Category") +
  theme_q()
ggsave("Output/Plots/probability_of_success_sub_3.png", width = 7, height = 4, dpi = 300)
—
The post title comes from “Marathon Man” by Ian Brown from his “My Way” album.