Fast, Deep, Fast and Deep

Welcome to Swimming + Data Science on Swimming + Data Science

4 months ago

[This article was first published on Welcome to Swimming + Data Science on Swimming + Data Science, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Swimming + Data Science has been quiet for a while as I worked on other projects. Today though we’re back at it in an attempt to quantify the speed and depth of swim meets. It’s more than that though, because this is vintage me, back on my hobby horses. We will be using SwimmeR. We will be focusing, perhaps myopically, on swim meets from my local area. We will also be using a discussion I had with some friends about old swimming results as a jumping off point. It’s a perfect Swimming + Data Science trifecta!

So here’s the story. I was officiating at the Section IV Class C sectional finals meet, held recently. During an awards break I got into a conversation with a couple of people about swim meets past, as I am wont to do. These people, and I’m paraphrasing, remarked that

The meet was generally both slower and less deep than previous years, particularly 2006
That a specific team from 2006 that these people were associated with would have run rampant all over the current iteration of the meet

Now I enjoy a good “things were better in my day” bit as much as anyone – perhaps even more – but I also enjoy testing assertions. So that’s what we’re going to do. These tests will proceed in two parts, one for each assertion.

To start let’s load some packages. There’s a new version of SwimmeR, v0.13.0, released November 5th 2021. It’s great and has lots of new features, none of which are particularly relevant to the contents of this article. Still though, might as well update.

Setup

install.packages("SwimmeR")

We’ll use a few tidyverse packages plus flextable.

library(SwimmeR)
library(dplyr)
library(stringr)
library(tidyr)
library(flextable)

flextable_style <- function(x) {
  x %>%
    flextable() %>%
    bold(part = "header") %>% # bolds header
    bg(bg = "#D3D3D3", part = "header")  # puts gray background behind the header row
}

Collecting the meet results

Collecting meet results with SwimmeR is easy. Simply find a link to the results and pipe them into read_results followed by swim_parse. here we’ll do that with both the 2006 and 2021 Section VI Class C meet results.

C_2006 <- "http://www.section4swim.com/Results/GirlsHS/2006/Sec4/C/Single.htm" %>% 
  read_results() %>% 
  swim_parse() %>% 
  mutate(Year = 2006) %>% 
  mutate(Event = str_remove(Event, "Girls "))

C_2006 <- C_2006 %>% 
    mutate(Event = factor(Event, levels = unique(C_2006$Event)))
  
C_2021 <- "http://www.section4swim.com/Results/GirlsHS/2021/Sec4/C/Single.htm" %>% 
  read_results() %>% 
  swim_parse() %>% 
  mutate(Year = 2021) %>% 
  mutate(Event = str_remove(Event, "Girls "))

C_2021 <- C_2021 %>% 
    mutate(Event = factor(Event, levels = unique(C_2021$Event)))

Assertion #1 – Faster and Deeper

Fast?

First off let me apologize to our diving friends. “Fast” is a term commonly used by swimming people to describe an entire meet, including diving. Diving obviously isn’t timed though, so the better term might be “strength” – a strong meet, a weak meet etc. Still, I’m going to mostly stick with references to speed just because that’s the more common term.

“Fast” is almost always a relative measure. Is a 4:56.00 in the 500 free fast? Depends on the context. That time would have won the 500 free in a recent men’s dual meet at my alma mater – it would be a fast time in the context of that meet. On the other hand in this women’s dual meet at Stanford a 4:56.00 would have been good for 6th – i.e. last place. Katie Ledecky won that 500 going 4:33.94 – which is fast full stop. What we’d really like is a quantitative measure of the speed for each event, which taken together describe the strength of a meet.

I propose that what’s really meant by a statement like “the meet is fast this year” is that in some kind of weighted fashion the times are faster than years past for the same finish places. Sounds obvious I know. What’s the weighing though? We need it in order to be quantitative. Luckily it’s already been determined. It’s point value by place. First place in these meets is worth 20 points, 16th place is worth 1 point and it seems reasonable to me that the first place time is 20x more important in describing the speed of a meet than the 16th place time. The 16th place time does still matter a bit though.

Let’s try and compare the “fastness” between the 2006 and 2021 meets. Our goal is to use the weighed.mean function on times and place-points to get a time (or score for diving), describing the overall strength of each event. We’ll stick C_2006 and C_2021 together and then summarise using weighed.mean and then pivot_wider to format a nice table.

C_Combined <- C_2021 %>% 
  bind_rows(C_2006) %>% 
    filter(Exhibition < 1,
         DQ < 1) %>% 
  group_by(Event) %>% 
  select(Place, Name, Team, Event, Finals_Time, Points, Year) %>% 
  filter(Place <= 16) %>% 
  filter(is.na(Points) == FALSE)

C_Combined %>% 
  group_by(Event, Year) %>% 
  summarise(W_Avg = mmss_format(weighted.mean(sec_format(Finals_Time), Points, na.rm = TRUE))) %>% 
  mutate(W_Avg = case_when(str_detect(Event, "Diving") ~ as.character(sec_format(W_Avg)),
                           TRUE ~ W_Avg)) %>% 
  pivot_wider(names_from = "Year", values_from = "W_Avg") %>% 
  flextable_style()

< template id="8cdd122d-935a-459f-bbf0-1feadcfab1be">

Event	2006	2021
200 Yard Medley Relay	2:08.18	2:14.39
200 Yard Freestyle	2:14.29	2:18.60
200 Yard IM	2:32.42	2:36.67
50 Yard Freestyle	26.63	27.64
1 mtr Diving	238.13	260.45
100 Yard Butterfly	1:09.05	1:13.90
100 Yard Freestyle	59.21	1:02.58
500 Yard Freestyle	6:07.36	6:23.79
200 Yard Freestyle Relay	1:54.92	1:57.09
100 Yard Backstroke	1:08.89	1:12.08
100 Yard Breaststroke	1:18.93	1:22.85
400 Yard Freestyle Relay	4:04.83	4:21.62

The 2006 meet was faster than the 2021 meet in every event except diving. The margin was about 4 seconds per 100 – which is a lot. Diving is interesting too – let’s look closer.

C_Combined %>% 
  ungroup() %>% 
  filter(str_detect(Event, "Diving")) %>% 
  select(Place, "Score" = Finals_Time, Year) %>% 
  pivot_wider(names_from = "Year", values_from = "Score") %>% 
  select(Place, `2006`, `2021`) %>% 
  flextable_style()

< template id="fa44284c-e825-4e7d-90ec-a2a0209bb035">

Place	2006	2021
1	284.80	267.10
2	267.30	258.05
3	258.50	254.70
4	239.50
5	228.65
6	225.50
7	222.45
8	176.85
9	169.60

In 2006 there were 9 divers and a winning score of 284.80. In 2021 there were 3 divers and a winning score of 267.10. Yes the weighed average score was higher for 2021 but I ask you – do you think 2021 was a better year for the sport of diving in Section IV Class C than 2006? I sure don’t. As I discuss frequently in this space, models and quantitative analysis are great, but it does help to actually know things too.

Deep?

A quantitative description of meet depth is simpler – it’s the difference in time (or diving score) between 1st place and some other place of interest – usually the last scoring place. If the difference is small the meet is deep. Here we’ll compute the difference, in seconds, between the min and max scoring marks.

C_Combined %>% 
  group_by(Event, Year) %>% 
  mutate(Finals_Time = sec_format(Finals_Time)) %>% 
  summarise(Range = max(Finals_Time, na.rm = TRUE) - min(Finals_Time, na.rm = TRUE)) %>% 
    mutate(Range = case_when(str_detect(Event, "Diving", negate = TRUE) ~ mmss_format(Range),
                           TRUE ~ as.character(Range))) %>% 
  pivot_wider(names_from = "Year", values_from = "Range") %>% 
  flextable_style()

< template id="78213329-60d3-4c90-b0f8-6be6228b0c1a">

Event	2006	2021
200 Yard Medley Relay	21.56	51.07
200 Yard Freestyle	27.07	34.50
200 Yard IM	36.27	50.27
50 Yard Freestyle	04.69	04.79
1 mtr Diving	115.2	12.4
100 Yard Butterfly	18.06	43.41
100 Yard Freestyle	10.52	16.95
500 Yard Freestyle	56.91	2:27.39
200 Yard Freestyle Relay	22.53	27.86
100 Yard Backstroke	15.95	26.31
100 Yard Breaststroke	16.16	31.73
400 Yard Freestyle Relay	39.29	46.32

The range is much smaller in 2006 for all swimming events, meaning the meet was much deeper.

Faster and Deeper

Those people were right on both counts. The 2006 meet was indeed faster and deeper, by a wide margin, than the 2021 meet.

Assertion #2 – An old team returns

The team in question here is the 2006 Dryden Lady Lions, headed by eventual DIII All-American Sheila Rhoades.

To start we’ll need to remove all of the Dryden results from C_2021 – it wouldn’t be fair for Dryden to have two teams. Then we’ll collect only the Dryden results from 2006 and join the two sets of results together. The 2006 Dryden team will then be “swimming” in the 2021 meet.

C_2021_No_Dryden <- C_2021 %>% 
  filter(Team != "Dryden")

C_2006_Only_Dryden <- C_2006 %>% 
  filter(Team == "Dryden")

C_Comp <- C_2021_No_Dryden %>%
  bind_rows(C_2006_Only_Dryden) %>% 
  mutate(Team = str_replace(Team, "Susquehanna Vall$", "Susquehanna Valley"),
         Team = str_replace(Team, "Notre Dame-E$", "Notre Dame-Elmira"),
         Team = str_replace(Team, "Watkins Glen-OM", "Watkins Glen-Odessa Montour"))

It’s more complicated than that though. We need to reconstruct prelims to sort athletes into A Finals and B Finals before the finals themselves can be scored. Fortunately we have access to everyone’s Prelims_Time, although because of how SwimmeR works some of them are in the Finals_Time column. In SwimmeR the Finals_Time represents the final swim for each swimmer, which may or may not actually be in the finals. It’s confusing I know and I apologize, but it’s important to the underlying functioning of swim_parse so that’s the way it will stay. Anyway, swimmers who have Finals_Times but no Prelims_Time in a prelims-finals meet actually have their prelims performance listed in Finals_Time. Let’s fix our results.

We’ll also need to handle diving separately, because in diving the highest value (score) wins, whereas in swimming the lowest value (time) wins.

C_Comp <- C_Comp %>%
  mutate(Prelims_NA = case_when(is.na(Prelims_Time) ~ 1,
                                TRUE ~ 0)) %>%
  mutate(Prelims_Time = case_when(is.na(Prelims_Time) ~ Finals_Time,
                                  TRUE ~ Prelims_Time)) %>%
  mutate(Finals_Time = case_when(Prelims_NA > 0 ~ "NA",
                                 TRUE ~ Finals_Time)) %>%
  na_if("NA") %>% 
  select(-Prelims_NA)

C_Comp_Diving <- C_Comp %>% 
  filter(str_detect(Event, "Diving") == TRUE) %>% 
  mutate(Prelims_Time = as.numeric(Prelims_Time)) %>%
    mutate(
      Place = rank(desc(Prelims_Time), ties.method = "min"),
      # highest score gets rank 1
      Prelims_Time = as.character(Prelims_Time)
    ) 

C_Comp_Swimming <- C_Comp %>%
  filter(str_detect(Event, "Diving") == FALSE) %>%
  filter(DQ < 1) %>% 
  mutate(Prelims_Time_sec = SwimmeR::sec_format(Prelims_Time)) %>% # time as seconds
  group_by(Event) %>% 
  mutate(Place = rank(Prelims_Time_sec, ties.method = "min")) %>% # places, low number wins
  arrange(Place)

C_Comp_Prelims <- C_Comp_Diving %>% 
  bind_rows(C_Comp_Swimming)

Great, now we have our meet, with athletes sorted into A and B finals by place, with 8th place as the cutoff for the A finals. We can use results_score to rescore the meet on the basis of Finals_Time.

C_Finals_Score <- C_Comp_Prelims %>% 
  results_score(
    events = unique(C_2021$Event),
    meet_type = "prelims_finals",
    lanes = 8,
    point_values = c(20, 17, 16, 15, 14, 13, 12, 11, 9, 7, 6, 5, 4, 3, 2, 1),
    scoring_heats = 2
  )

C_Finals_Score %>% 
  group_by(Team) %>% 
  summarise(Score = sum(Points, na.rm = TRUE)) %>% 
  mutate(Place = round(rank(desc(Score), ties.method = "min"), 1)) %>% 
  arrange(desc(Score)) %>% 
  select(Place, everything()) %>% 
  flextable_style()

< template id="2c140ef3-2ce6-4ab2-a798-000e81781014">

Place	Team	Score
1	Dryden	447
2	Watkins Glen-Odessa Montour	435
3	Lansing	325
4	Chenango Forks	224
4	Southern Cayuga	224
6	Notre Dame-Elmira	174
7	Greene	101
8	Susquehanna Valley	88
9	Whitney Point	15

My friends are vindicated again. Dryden 2006 would have indeed won the 2021 meet – although only by 12 points. The real question is how did they know?

In Conclusion

We’ve come up with a quantitative framework for determining the speed and depth of a swim meet, which I’m sure will come in handy somewhere. We’ve also made use of some SwimmeR functions and lots of tidyverse functions and hopefully learned a little bit on the way. Join us next time back here at Swimming + Data Science where we’ll be doing something else fun.

To leave a comment for the author, please follow the link and comment on their blog: Welcome to Swimming + Data Science on Swimming + Data Science.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.