Pledging My Time VII

[This article was first published on Rstats – quantixed, and kindly contributed to R-bloggers].

Here we go again! I ran the Mainova Frankfurt Marathon 2025 and wanted to look at the race results. How can we do this using R?

I couldn’t see an easy way to download the data, so I used R to scrape them. Note that these times are currently provisional, but they give us a good idea of what happened.

The results are available with a search function to find an individual’s results. If we leave everything blank and set the number of results to display to the maximum, we get the first of 16 pages that together show all the results. The rule is: if we can see it, we can scrape it!

With {rvest} we can scrape these data. The steps are: figure out the format of the items to extract (in this case, each runner is a list item, and the fields of data are divs within each item); write a function to extract all runners on a page; write a function to process a page; and call this function for each page! It’s perhaps easier to see the code:

library(rvest)
library(dplyr)
library(purrr)
library(stringr)
library(ggforce)


## Functions ----
# reads one results page and returns its runners as a data frame
scrape_results_page <- function(url) {
  webpage <- read_html(url)
  df <- scrape_startlist(webpage)
  # drop the first row, which appears to be the list header rather than a runner
  df <- df[-1, ]
  return(df)
}

# scrapes the data
scrape_startlist <- function(page) {
  rows <- page %>% html_nodes("li.list-group-item.row")
  map_df(rows, function(row) {
    # helper to get text from a selector, remove small labels and trim
    get_text <- function(sel) {
      node <- row %>% html_node(sel)
      # html_node() returns an xml_missing object when nothing matches
      if (inherits(node, "xml_missing") || length(node) == 0) return(NA_character_)
      # remove the mobile label nodes inside if present
      node %>% html_nodes(".visible-xs-block, .visible-sm-block, .list-label") %>% xml2::xml_remove()
      text <- node %>% html_text(trim = TRUE)
      if (length(text) == 0) return(NA_character_) else return(text)
    }
    
    # place primary/secondary
    place_primary <- get_text(".type-place.place-primary")
    place_secondary <- get_text(".type-place.place-secondary")
    
    # fullname and link
    fullname_a <- row %>% html_node("h4.type-fullname a")
    fullname <- if (length(fullname_a) == 0) NA_character_ else fullname_a %>% html_text(trim = TRUE)
    link <- if (length(fullname_a) == 0) NA_character_ else fullname_a %>% html_attr("href")
    
    # bib, club/city, age class (these are under second column)
    bib <- get_text(".type-field")
    club_city <- get_text(".type-priority")
    age_class <- get_text(".type-age_class")
    
    # finish and gun time: there are multiple .type-time entries; take them in order
    times <- row %>% html_nodes(".type-time") %>% html_text(trim = TRUE)
    times <- times[times != ""] # drop blanks
    finish <- if (length(times) >= 1) times[1] else NA_character_
    gun_time <- if (length(times) >= 2) times[2] else NA_character_
    
    # make data frame. We don't need gun time or link
    data.frame(
      place_primary = place_primary,
      place_secondary = place_secondary,
      fullname = fullname,
      bib = bib,
      club_city = club_city,
      age_class = age_class,
      finish = finish
    )
  })
}

# Specifying the base url for website to be scraped
url <- "https://live.frankfurt-marathon.com/2025/?page="

# the pages are like this:
# "https://live.frankfurt-marathon.com/2025/?page=2&event=L_HCH3BKLB3B8&num_results=1000&pid=startlist_list&pidp=startlist&search%5Bage_class%5D=%25&search%5Bsex%5D=%25&search%5Bnation%5D=%25&search_sort=name"
# we have 1000 results on a page and the first page shows there are 16 pages total
n_pages <- 16
# make a list of all urls to be scraped
urls <- paste0(url, seq(n_pages), "&event=L_HCH3BKLB3B8&num_results=1000&pid=startlist_list&pidp=startlist&search%5Bage_class%5D=%25&search%5Bsex%5D=%25&search%5Bnation%5D=%25&search_sort=name")
# scrape each page one by one and rbind into large df
result  <- do.call(rbind, lapply(urls, scrape_results_page))

Really the hardest part here is to figure out the names of the html nodes that contain the data. I just had a look at the source of the page in my browser and made a note of which div classes were needed.
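As a rough illustration of that process, here is a minimal sketch on a made-up HTML snippet. The markup below is hypothetical and much simpler than the real results page; only the class names are borrowed from the selectors used above. It uses html_elements() and html_element(), which are the current {rvest} names for html_nodes() and html_node().

```r
library(rvest)

# hypothetical, simplified markup: NOT the real page source
page <- minimal_html('
  <ul>
    <li class="list-group-item row">
      <h4 class="type-fullname"><a href="#1">Runner, A</a></h4>
      <div class="type-time">02:59:59</div>
    </li>
    <li class="list-group-item row">
      <h4 class="type-fullname"><a href="#2">Runner, B</a></h4>
    </li>
  </ul>')

# one node per runner
rows <- html_elements(page, "li.list-group-item.row")

# html_element() on a node set returns one result per row,
# so a missing field becomes NA instead of shifting the column
toy <- data.frame(
  fullname = html_text(html_element(rows, "h4.type-fullname a"), trim = TRUE),
  finish   = html_text(html_element(rows, ".type-time"), trim = TRUE)
)
print(toy)
```

The second runner has no time div, so their finish entry comes back as NA and the columns stay aligned.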

So we now have a dataframe called result that has all the data. We need to tidy things up a bit first:

ages <- c("U18", "JU20", "U23", "H",
          "30", "35", "40", "45", "50", "55", "60", "65", "70", "75", "80", "85", "–")
# order the age_class factor levels
result$age_class <- factor(result$age_class, levels = ages)
# if the bib number starts with F add "Female" to new "gender" column, otherwise assume "Male"
result$gender <- ifelse(startsWith(result$bib, "F"), "Female", "Male")
# remove "Finish" text from finish times
result$finish <- str_replace(result$finish, "Finish", "")
# convert hh:mm:ss strings to POSIXct (today's date gets attached; only the time of day matters here)
result$finish_time <- as.POSIXct(result$finish, format = "%H:%M:%S", tz = "UTC")

Here we just get the age classes into the correct order. There are two genders for this event and we can parse them from the bib numbers. Finally, the finish time is slightly borked after scraping so we needed to correct that. We left behind the gun time and the link to each runner’s details in the first function because we don’t need them.
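To make the tidying steps concrete, here is a small sketch on invented values. The bib numbers and times are hypothetical, and the exact form of the raw scraped strings is an assumption based on the "Finish" label mentioned above. It also adds a finish_secs column (not part of the original analysis), which can be handy for arithmetic.

```r
# hypothetical raw values, mimicking scraped fields with a "Finish" label attached
toy <- data.frame(
  bib    = c("F12", "345", "F678", "9"),
  finish = c("Finish03:30:00", "Finish04:00:00", NA, ""),
  stringsAsFactors = FALSE
)

# bibs starting with "F" are treated as Female, everything else as Male
toy$gender <- ifelse(startsWith(toy$bib, "F"), "Female", "Male")

# strip the label, then parse; missing or empty strings become NA
toy$finish <- sub("Finish", "", toy$finish, fixed = TRUE)
toy$finish_time <- as.POSIXct(toy$finish, format = "%H:%M:%S", tz = "UTC")

# assumed extra: the same time as seconds, independent of the attached date
toy$finish_secs <- as.numeric(format(toy$finish_time, "%H")) * 3600 +
  as.numeric(format(toy$finish_time, "%M")) * 60 +
  as.numeric(format(toy$finish_time, "%S"))
print(toy)
```

Runners with no parsable time end up with NA, which is what the finisher counts later rely on.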

Let’s find some facts and figures!

## Some facts and figures ----
# total runners
total_runners <- nrow(result)
cat("Total runners:", total_runners, "\n")
# total finishers (those without NA as finish_time)
total_finishers <- sum(!is.na(result$finish_time))
cat("Total finishers:", total_finishers, "\n")
# average finish time
avg_finish_time <- mean(result$finish_time, na.rm = TRUE)
cat("Average finish time:", format(avg_finish_time, "%H:%M:%S"), "\n")
# fastest finish time
fastest_finish_time <- min(result$finish_time, na.rm = TRUE)
cat("Fastest finish time:", format(fastest_finish_time, "%H:%M:%S"), "\n")
# slowest finish time
slowest_finish_time <- max(result$finish_time, na.rm = TRUE)
cat("Slowest finish time:", format(slowest_finish_time, "%H:%M:%S"), "\n")

# break down the same stats by gender
for (g in unique(result$gender)) {
  cat("Gender:", g, "\n")
  res_g <- result[result$gender == g, ]
  total_runners_g <- nrow(res_g)
  cat("  Total runners:", total_runners_g, "\n")
  total_finishers_g <- sum(!is.na(res_g$finish_time))
  cat("  Total finishers:", total_finishers_g, "\n")
  avg_finish_time_g <- mean(res_g$finish_time, na.rm = TRUE)
  cat("  Average finish time:", format(avg_finish_time_g, "%H:%M:%S"), "\n")
  fastest_finish_time_g <- min(res_g$finish_time, na.rm = TRUE)
  cat("  Fastest finish time:", format(fastest_finish_time_g, "%H:%M:%S"), "\n")
  slowest_finish_time_g <- max(res_g$finish_time, na.rm = TRUE)
  cat("  Slowest finish time:", format(slowest_finish_time_g, "%H:%M:%S"), "\n")
}

This gives us:

Total runners: 15456 
Total finishers: 12323 
Average finish time: 03:53:51 
Fastest finish time: 02:06:16 
Slowest finish time: 07:13:07

Gender: Male 
  Total runners: 11913 
  Total finishers: 9497 
  Average finish time: 03:48:09 
  Fastest finish time: 02:06:16 
  Slowest finish time: 07:13:07 
Gender: Female 
  Total runners: 3543 
  Total finishers: 2826 
  Average finish time: 04:12:59 
  Fastest finish time: 02:19:34 
  Slowest finish time: 06:47:55 

Let’s assume that each runner listed in the results started the event and that the lack of a finish time indicates a DNF (did not finish). This means the completion rate was 80%, and it was the same for men and women. I am surprised that 20% of runners did not finish. The course is very flat and, although it was quite windy, it was not challenging as marathons go. It could be that the 20% includes people who did not start (DNS).
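The completion rates can be checked directly from the counts printed above. This is just a sketch of the arithmetic; the data frame name counts is made up for the example.

```r
# runner and finisher counts copied from the output above
counts <- data.frame(
  group     = c("All", "Male", "Female"),
  runners   = c(15456, 11913, 3543),
  finishers = c(12323, 9497, 2826)
)

# completion rate as a percentage, rounded to one decimal place
counts$completion_pct <- round(100 * counts$finishers / counts$runners, 1)
print(counts)
```

All three groups come out within half a percentage point of 80%.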

Let’s have a look at the finish times and how they break down.

## Plots ----

# filter out DNFs and runners with no age class ("–")
result <- result %>%
  filter(!is.na(finish_time)) %>%
  filter(age_class != "–")

mycolors <- c(rgb(218,63,65, maxColorValue = 255),
              rgb(11,46,114, maxColorValue = 255))

ggplot(result, aes(x = finish_time)) +
  geom_histogram(binwidth = 60, fill = mycolors[1]) +
  labs(x = "Finish Time",
       y = "Count") +
  # 20 minute ticks on x axis
  scale_x_datetime(date_breaks = "20 min", date_labels = "%H:%M") +
  theme_minimal()
# save plot
ggsave("Output/Plots/frankfurt_marathon_2025_finish_time_histogram.png", width = 10, height = 6, bg = "white")

# plot finish times by age class, faceted by gender
ggplot(result, aes(x = age_class, y = finish_time, colour = gender)) +
  geom_sina(alpha = 0.2, stroke = 0) +
  scale_y_datetime(date_breaks = "20 min", date_labels = "%H:%M") +
  stat_summary(fun = mean, geom = "point", size = 2, colour = "black", alpha = 0.8) +
  scale_colour_manual(values = mycolors) +
  facet_wrap(~ gender) +
  labs(x = "",
       y = "Finish Time") +
  theme_minimal() +
  theme(legend.position = "none")

# save plot
ggsave("Output/Plots/frankfurt_marathon_2025_finish_times_by_age_class.png", width = 10, height = 6)

This gives us two plots. Firstly, the finish times by age class and by gender:

The average (mean) time per category is shown as a black circle; otherwise, each runner is a red or blue spot. The average time for Males seems to peak in the 35-40 age categories, although the very fastest times are in the under-35 categories. For Females, there’s a similar slowing of average finish times in older age groups, but there is less of a peak effect. The number of Female participants is lower though, so we might miss the effect for this reason.

This plot is quite nice because you can see the density of runners in each category to get a feel for participation. There’s also a striping effect that is clearest in the Male data.
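The category means shown as black circles could also be tabulated rather than read off the plot. Here is a minimal sketch using base R's aggregate() on a few invented rows; the values and the finish_secs column are hypothetical stand-ins for the real data.

```r
# invented rows: finish times in seconds for a handful of runners
toy <- data.frame(
  gender      = c("Male", "Male", "Male", "Female", "Female"),
  age_class   = c("35", "35", "40", "35", "40"),
  finish_secs = c(12000, 13000, 12600, 14000, 15000)
)

# mean finish time for each gender and age class combination
means <- aggregate(finish_secs ~ gender + age_class, data = toy, FUN = mean)
print(means)
```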

The second plot is a histogram of all participants’ finish times. There are peaks at just under 3 hours and 4 hours. There’s also an accumulation of runners finishing around 3:30 and 5:00. These round numbers are obviously goal times for many runners.
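One way to put a number on this bunching is to compare how many runners finish just before a goal time with how many finish just after it. The helper below is a hypothetical sketch, not part of the original analysis; the function name, the 5-minute window, and the toy times are all assumptions.

```r
# count finishers in the window just before and just after a goal time
# finish_secs: finish times in seconds; goal: goal time in seconds
goal_bunching <- function(finish_secs, goal, window = 300) {
  before <- sum(finish_secs >= goal - window & finish_secs < goal, na.rm = TRUE)
  after  <- sum(finish_secs >= goal & finish_secs < goal + window, na.rm = TRUE)
  c(before = before, after = after)
}

# invented finish times clustered just under 3 hours (10800 seconds)
toy_secs <- c(10700, 10750, 10790, 10795, 10810, 11200, NA)
res <- goal_bunching(toy_secs, goal = 3 * 3600)
print(res)
```

A much larger count before the goal than after it would be consistent with runners pacing themselves to duck under a round number.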

Congrats to all who participated and especially those who met the goals they set for themselves.

The post title is taken from “Pledging My Time”, a track from Blonde on Blonde by Bob Dylan.
