Pledging My Time VI: scraping and analysis of race results in R
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I’ve posted in the past about analysing race results in R (most recently here). I ran the 2023 MK Marathon and wanted to have a look at the finishing times. The days of race results being made available as a csv or xls for easy analysis seem to be behind us. Instead they tend to be served up on multiple webpages of 50 athletes’ results at a time.
Oh no, 29 pages of results and now Download option…. let’s scrape the data!
The code
We need {rvest}
to do the scraping. We also need three custom functions. The first, will extract the data we need from a resultsbase.net page. We can call it multiple times for each page of 50 athletes at a time. The second and third functions allow us to look at how a single runner compares with the field.
library(rvest) require(tidyverse) require(ggforce) ## Functions ---- # performs the scrape of a webpage scrape_results_page <- function(url) { webpage <- read_html(url) # there are three table objects. We want the second table runner <- as.data.frame(html_table(html_nodes(webpage, "table")[2])) # this is an unnecessary column, remove runner$Name.1 <- NULL return(runner) } # discretely highlight person of interest by bib number in a plot label_plot <- function(plot, raceno) { plot <- plot + geom_point(data = result[result$Race.No == raceno,], colour = "dark grey") return(plot) } # lookup a runner by bib number and compare to the field lookup_runner <- function(df, raceno) { orig_row <- which(df$Race.No == raceno) all_df <- df %>% mutate(rank = rank(Time)) rankno <- all_df$rank[all_df$Race.No == raceno] denom <- nrow(all_df) cat("Runner",raceno,"is",rankno,"out of",denom,"Runners",":",rankno/denom*100,"%ile\n") gender_df <- df %>% arrange(Gender, Time) %>% group_by(Gender) %>% mutate(rank = rank(Time)) rankno <- gender_df$rank[gender_df$Race.No == raceno] key <- df$Gender[orig_row] denom <- length(which(gender_df$Gender == key)) cat("Runner",raceno,"is",rankno,"out of",denom,key,":",rankno/denom*100,"%ile\n") cat_df <- df %>% arrange(Category, Time) %>% group_by(Category) %>% mutate(rank = rank(Time)) rankno <- cat_df$rank[cat_df$Race.No == raceno] key <- df$Category[orig_row] denom <- length(which(gender_df$Category == key)) cat("Runner",raceno,"is",rankno,"out of",denom,key,":",rankno/denom*100,"%ile\n") }
Now we can move on to the business of scraping.
## Main script ---- #Specifying the url for website to be scraped url <- "https://results.resultsbase.net/Results.aspx?CId=8&RId=20209" # subsequent pages are like this # "https://results.resultsbase.net/results.aspx?CId=8&RId=20209&EId=1&dt=0&PageNo=2&adv=0" # total pages (displayed on 1st page) n_pages <- 29 # make a list of all urls to be scraped urls <- paste0(url, "&EId=1&dt=0&PageNo=", 1 + (seq(n_pages) - 1), "&adv=0") # scrape each page one by one and rbind into large df result <- do.call(rbind, lapply(urls, scrape_results_page))
Hopefully you can see how this works. There are several pages of data and we need to extract them all. We build a list of all urls to be scraped by first figuring out how the url should be formatted. Then, we can call our scrape_results_page()
function repeatedly, rbind
ing everything into a big data frame.
The function loads the url and then extracts one of the tables (that contains the data). If you are performing your own scrape of a different site, you may need to change this function a bit. The best way to do that is to load in the first page as an object and figure out the steps needed to get the data.
OK. Now we have all the data from all athletes in a large data frame called result
. Great. There are 1450 runners in the dataset, 61 runners were DNF or have queried data.
Now let’s have a look at the data.
## Plots ---- # format Date column to POSIXct result$Time <- as.POSIXct(strptime(result$Gun.Time, format = "%H:%M:%S")) orig_var <- as.POSIXct("00:00:00", format = "%H:%M:%S") # we need categories to be the same for men and women and then use the Gender column to differentiate result$cat <- ifelse(startsWith(result$Category, "Sen"),"Senior",result$Category) result$cat <- sub("VetF", "Vet", result$cat) p1 <- ggplot( data = result, aes(x = cat, y = Time, color = Gender)) + geom_sina(alpha = 0.5, stroke = 0) + scale_colour_brewer(palette = "Set1") + stat_summary(fun = mean, geom = "point", size=2, colour = "black", alpha = 0.5) + scale_y_datetime(date_labels = "%H:%M:%S", limits = c(orig_var,NA)) + labs(x = "Category", y = "Time") + theme_light() + theme(legend.position = "none") p2 <- ggplot( data = result, aes(x = cat, y = Time, color = Gender)) + geom_sina(alpha = 0.5, stroke = 0) + scale_colour_brewer(palette = "Set1") + stat_summary(fun = mean, geom = "point", size=2, colour = "black", alpha = 0.5) + scale_y_datetime(date_labels = "%H:%M:%S", limits = c(orig_var,NA)) + facet_wrap(. ~ Gender) + labs(x = "Category", y = "Time") + theme_light() + theme(legend.position = "none") # for example if we are interested in runner with bib number (raceno) 1413 p1 <- label_plot(p1,1413) p2 <- label_plot(p2,1413) ggsave("Output/Plots/times.png", p1, width = 7, height = 6, dpi = 300, units = "in") ggsave("Output/Plots/times_break.png", p2, width = 10, height = 6, dpi = 300, units = "in")
And this gives us a plot of finishing times per category. Red points are athletes identifying as Female, and Blue, Male. The light grey point is me! Dark grey points show the mean time for the category (Male and Female together).
We can facet the plots by gender to get a clearer view. Again I am the light grey spot.
My time is above average for my category but I didn’t exactly trouble the podium… How can we look at this? The handy function above helps us to see. With some dplyr magic:
> lookup_runner(result,1413) Runner 1413 is 212 out of 1450 Runners : 14.62069 %ile Runner 1413 is 189 out of 1053 Male : 17.94872 %ile Runner 1413 is 41 out of 235 Vet45 : 17.44681 %ile
The lookup_runner()
function does the business of calculating the percentile (%ile) per category.
Hmmm, I was a bit disappointed with my time on the day but I figure hey, I was well beyond the IQR and it means I have some room for improvement.
Just for fun
In large datasets of race results, you can often see some anomalies in the finishing time data. Instead of a smooth curve, athletes bunch up around “round number finishing times”, e.g. the 4 h mark or the 3:45 mark; because these are target times for groups of runners.
p3 <- ggplot(data = result, aes(x = Time)) + geom_histogram(breaks = as.POSIXct(seq(from = 2 * 60 * 60, to = 8 * 60 * 60, by = 300), origin = orig_var)) + scale_x_datetime(date_labels = "%H:%M:%S", date_breaks = "hour") + facet_wrap(. ~ Gender) + theme_light() + theme(legend.position = "none") ggsave("Output/Plots/times_histo.png", p3, width = 10, height = 6, dpi = 300, units = "in")
There do seem to be peaks, particularly in the Male data, at 3:00, 3:15, 3:45 and others. The 3:30 peak is less obvious. There are 1053 Male runners in this dataset and I suspect this is not enough to see this effect clearly.
—
The post title is taken from “Pledging My Time” a track from Blonde on Blonde by Bob Dylan.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.