So it begins – the first match-up of the first round of the first ever Swimming + Data Science High School Swimming State-Off. If you missed the introductory post and are wondering what all this is about you can catch up here. Today’s match-up is between number 4 seeded New York and number 5 seeded Pennsylvania.
To begin let’s load some packages.
Swimmer will be our workhorse for getting hold of the results. Then we’ll use the various
purrr to facilitate our analysis and
flextable for nice tables of the results.
library(SwimmeR) library(dplyr) library(stringr) library(purrr) library(flextable)
Please note the following analysis was updated November 22nd 2020 to reflect changes beginning with
SwimmeR v0.6.0 released via CRAN on November 22nd 2020. Please make sure your version of SwimmeR is up-to-date.
New York State has two seasons in a given school year. A girls’ season in the fall, followed by a boys season in the winter. They also have an excellent repository of results. We’re just interested in the most recent state meets, Girls 2019 and Boys 2020, so let’s put links to those results into a list.
NY_Boys <- "http://www.nyhsswim.com/Results/Boys/2020/NYS/Single.htm" NY_Girls <- "http://nyhsswim.com/Results/Girls/2019/NYS/Single.htm" NY_Links <- c(NY_Boys, NY_Girls)
The goal is to read in and clean the raw results, which we’ll do with
SwimmeR::swim_parse respectively. Before doing so however it’s useful to look the raw results over for potential issues.
We can see in the New York raw results that federation and state records for the two meets are recorded as “Federation”, “NYS Fed”, “NYSPHSAA” and “NYS Meet Rec”. Those are strings we’ll want to tell
NY_Avoid <- c("Federation", "NYS Fed", "NYSPHSAA", "NYS Meet Rec")
Pennsylvania is a similar story, with a nice results repository. Unlike New York however Pennsylvania has two different divisions for their state championships, somewhat confusingly called 2A and 3A. The 3A championships were held in 2020 (boys and girls) but the 2A where canceled due to COVID-19. Also diving wasn’t included in the Girls 3A 2020 results so as State-Off meet director I’ll be subsisting 2019 results for 3A diving and all of 2A. There will be five total links.
PA_Boys_3A <- "http://www.paswimming.com/19_20/results/state/PIAA_3_A_boys_states_Results.htm" PA_Girls_3A <- "http://www.paswimming.com/19_20/results/state/PIAA_3_A_girls_states_Results.htm" PA_Girls_3A_Diving <- "http://www.piaa.org/assets/web/documents/2019_3a_girls_f_dive_results.htm" PA_Boys_2A <- "http://www.paswimming.com/18_19/results/states/Results/2_A_Boys_Results_2019.htm" PA_Girls_2A <- "http://www.paswimming.com/18_19/results/states/Results/2_A_Girls_Results_2019.htm" PA_Links <- c(PA_Boys_3A, PA_Girls_3A, PA_Girls_3A_Diving, PA_Boys_2A, PA_Girls_2A)
Inspecting the Pennsylvania raw results gives us a few more strings to avoid, namely “PIAA” (PA record), plus “NFHS” and “NF Hon. Roll”.
PA_Avoid <- c("PIAA", "NFHS", "NF Hon. Rol")
Reading in Results with the SwimmeR Package
Getting our results is now a simple matter of
swim_parse over our list of links with our avoid lists passed to the
avoid argument of
We’ll then add columns
Gender since those are the parameters of our meet - each state is a team, with boys and girls meets, plus a combined total.
Results <- map(c(NY_Links, PA_Links), Read_Results, node = "pre") %>% map(Swim_Parse, avoid = c(NY_Avoid, PA_Avoid)) %>% set_names(c("NY_Boys", "NY_Girls", "PA_Boys", "PA_Girls", "PA_Girls", "PA_Boys", "PA_Girls")) %>% bind_rows(.id = "Source") Results <- Results %>% mutate( State = str_split_fixed(Source, "_", n = 2)[, 1], Gender = str_split_fixed(Source, "_", n = 2)[, 2] ) %>% select(-Source) %>% filter(str_detect(Event, "Swim-off") == FALSE) # remove swim-offs
More Detail on Meet Parameters
We’ll use the National Federation of High School athletics scoring as below. It’s important to specify that 17th place gets 0 points when it comes to dealing with ties.
Point_Values <- c(20, 17, 16, 15, 14, 13, 12, 11, 9, 7, 6, 5, 4, 3, 2, 1, 0) names(Point_Values) <- 1:17
In order to score the meet we need to reorder finishes from each state meet in the context of our larger meet. At first glance this is simple, because the fastest (i.e. lowest) time will win, followed by the second fastest/lowest in second etc. There are several complications though.
Unique swims: In the New York results there are “Federation” results and “Association” results for each event. Federation is a subset of Association though, so athletes/relay teams in the Federation are listed twice, once in the Federation results and then again (with the same times) in the Association results. The Pennsylvania results include preliminary swims, so athletes/relay teams are also listed twice, once in the finals (which appear first) and again in the prelims, with different times in each instance. We’ll need a way to get only the first instance of an athlete or relay team in a given event.
- Relays: Relays are different from individual swims for two reasons
- Naming: relays are named by the team/school (Central High), whereas athletes have both a team/school and an individual name (Sally Swimfast from Central High)
- Scoring: point values are doubled for relays
Ties: Ties happen, and the procedure (per NFHS rules) is for competitors to be awarded the average of their place and the voided place. For example, if two athletes tie for 9th place then there will be no 10th place finisher (both athletes get 9th, 10th is voided). The point value for 9th place is 9 points and the point value for 10th is 7, so each athlete receives (9 + 7) = 16, divided by two, equals 8 points. Our scoring needs to handle this.
- Diving: Here at Swimming + Data Science we love diving even if it is a complication. We’re not going to just cut diving out, we’re going to deal with diving on its own terms. Diving results are different from swimming results for two reasons.
- Format: Diving results are scores not times.
- Ordering: The highest score in diving wins, compared to the fastest (i.e. lowest) time winning in swimming.
- Break up
Resultsinto relays, diving, and individual swimming using
- Take only the first instance of an athlete/team in an event using
- For relays and individual swims convert times in minutes:seconds.hundreths to seconds with
- Reorder and record finishes on basis of time (or score) across the new NY vs. PA meet using
- Award points, accounting for ties using a nifty little combo of
Relay_Results <- Results %>% filter(str_detect(Event, "Relay") == TRUE) %>% # only want relays group_by(Event, Team) %>% slice(1) %>% # select first occurrence of team in each event ungroup() %>% mutate(Finals_Time_sec = sec_format(Finals_Time)) %>% # convert time to seconds group_by(Event) %>% mutate(Place = rank(Finals_Time_sec, ties.method = "min")) %>% # places, low number wins filter(Place <= 16) %>% # only top 16 score select(-Points) Relay_Results <- Relay_Results %>% # deal with ties mutate(New_Place = rank(Place, ties.method = "first"), Points = Point_Values[New_Place]) %>% group_by(Place, Event) %>% summarize(Points = mean(Points)) %>% inner_join(Relay_Results) %>% mutate(Points = Points * 2) # double point values for relays
Same basic structure as relays, but we need to handle scores differently than times.
Diving_Results <- Results %>% filter(str_detect(Event, "Diving") == TRUE) %>% # only want diving events mutate(Finals_Time = as.numeric(Finals_Time)) %>% group_by(Event, Name) %>% slice(1) %>% # first instance of every diver ungroup() %>% group_by(Event) %>% mutate(Place = rank(desc(Finals_Time), ties.method = "min"), # again, highest score gets rank 1 Finals_Time = as.character(Finals_Time)) %>% filter(Place <= 16) %>% #only top 16 score select(-Points) Diving_Results <- Diving_Results %>% # deal with ties mutate(New_Place = rank(Place, ties.method = "first"), Points = Point_Values[New_Place]) %>% group_by(Place, Event) %>% summarize(Points = mean(Points)) %>% inner_join(Diving_Results)
Again, very similar to diving and relays.
Ind_Swimming_Results <- Results %>% filter(str_detect(Event, "Diving") == FALSE, str_detect(Event, "Relay") == FALSE) %>% group_by(Event, Name) %>% slice(1) %>% # first instance of every swimmer ungroup() %>% group_by(Event) %>% mutate(Finals_Time_sec = sec_format(Finals_Time)) %>% # time as seconds mutate(Place = rank(Finals_Time_sec, ties.method = "min")) %>% # places, low number wins filter(Place <= 16) %>% #only top 16 score select(-Points) Ind_Swimming_Results <- Ind_Swimming_Results %>% # deal with ties mutate(New_Place = rank(Place, ties.method = "first"), Points = Point_Values[New_Place]) %>% group_by(Place, Event) %>% summarize(Points = mean(Points)) %>% inner_join(Ind_Swimming_Results)
The Final Results
bind together the results from our three cases (relays, diving and individual swims) and do a but of cleaning up. Pennsylvania for example has all their results in block capitals. That can be fixed with
Results_Final <- bind_rows(Relay_Results, Diving_Results, Ind_Swimming_Results) %>% mutate(Name = str_to_title(Name), Team = str_to_title(Team)) %>% mutate(Team = str_remove_all(Team, "[:punct:]"), Team = str_remove_all(Team, "[0-9]"))
summarise and see who won!
Scores <- Results_Final %>% group_by(State, Gender) %>% summarise(Score = sum(Points)) Scores %>% arrange(Gender, desc(Score)) %>% flextable() %>% bold(part = "header") %>% bg(bg = "#D3D3D3", part = "header")
Scores %>% group_by(State) %>% summarise(Score = sum(Score)) %>% arrange(desc(Score)) %>% flextable() %>% bold(part = "header") %>% bg(bg = "#D3D3D3", part = "header")
Pennsylvania wins both meets and the combined in an upset, by quite a wide margin!
It’s interesting to think for a moment about why this might be. The State-Off is seeded by population. New York has about 19 million people, but about 8 million of them live in New York City. New York City doesn’t have very many swimmers. Swimmers from the new York City Public High School Athletic League have a -P designation after their school name in the raw results. The cleaning we did on
Final_Results reduced this to a trailing
P, which we can search for with
Results_Final %>% ungroup() %>% filter(str_detect(Team, "P$")) %>% summarise(Count = n()) ## # A tibble: 1 x 1 ## Count ## <int> ## 1 3 Results_Final %>% ungroup() %>% filter(State == "NY") %>% summarise(Count = n()) ## # A tibble: 1 x 1 ## Count ## <int> ## 1 134
Only three swims out of New York’s total of 134 swims are from New York City. Pools take up a lot space so they’re difficult to install in cities generally. New York City is also very dense, which makes building pools that much harder. Pennsylvania on the other hand has a total population of 12 million. Philadelphia (1.5 million) and Pittsburgh (300k) are much smaller than New York City, so it’s possible that much more of the Pennsylvania population lives in areas conducive to swimming. There’s also a racial component. New York City has a higher proportion of African American residents than New York State as a whole, and African Americans have been subjected to segregation and systematic discrimination including specifically with respect to swimming pools to the extent that even today black children drown at a rate 3x that of white children. New York’s larger than expected non-swimming population may be reflected in its lower than expected State-Off score.
Swimmers of the Meet
To determine the swimmers of the meet there will be two qualifications: 1. An athlete must have competed in two events - sorry divers. Winner will be the athlete with the lowest average place (winning two events gives an average place of 1). This is an individual award so relays don’t count. 2. As a tiebreaker from 1. above, the athlete whose times are fastest across their two events relative to the All-American cuts will be Swimmer of the Meet.
Now if only someone had the All-American cuts readily accessible. Oh wait someone does and that someone is me. Let’s grab those cuts and join them to
Ind_Swimming_Results. Then we can do some math to calculate each athlete’s average difference from the All-American cut.
Cuts_Link <- "https://raw.githubusercontent.com/gpilgrim2670/Pilgrim_Data/master/State_Cuts.csv" Cuts <- read.csv(url(Cuts_Link)) '%!in%' <- function(x,y)!('%in%'(x,y)) # "not in" function Cuts <- Cuts %>% # clean up Cuts filter(Stroke %!in% c("MR", "FR", "11 Dives")) %>% rename(Gender = Sex) %>% mutate( Event = case_when((Distance == 200 & #match events Stroke == 'Free') ~ "200 Yard Freestyle", (Distance == 200 & Stroke == 'IM') ~ "200 Yard IM", (Distance == 50 & Stroke == 'Free') ~ "50 Yard Freestyle", (Distance == 100 & Stroke == 'Fly') ~ "100 Yard Butterfly", (Distance == 100 & Stroke == 'Free') ~ "100 Yard Freestyle", (Distance == 500 & Stroke == 'Free') ~ "500 Yard Freestyle", (Distance == 100 & Stroke == 'Back') ~ "100 Yard Backstroke", (Distance == 100 & Stroke == 'Breast') ~ "100 Yard Breaststroke", TRUE ~ paste(Distance, "Yard", Stroke, sep = " ")), Event = case_when(Gender == "M" ~ paste("Boys", Event, sep = " "), Gender == "F" ~ paste("Girls", Event, sep = " "))) Ind_Swimming_Results <- Ind_Swimming_Results %>% left_join(Cuts %>% filter((Gender == "M" & Year == 2020) | (Gender == "F" & Year == 2019)) %>% select(AAC_Cut, AA_Cut, Event), by = 'Event') Swimmer_Of_Meet <- Ind_Swimming_Results %>% mutate(AA_Diff = (Finals_Time_sec - sec_format(AA_Cut))/sec_format(AA_Cut), Name = str_to_title(Name)) %>% group_by(Name) %>% filter(n() == 2) %>% # get swimmers that competed in two events summarise(Avg_Place = sum(Place)/2, AA_Diff_Avg = round(mean(AA_Diff, na.rm = TRUE), 2), Gender = unique(Gender), State = unique(State)) %>% arrange(Avg_Place, AA_Diff_Avg) %>% group_split(Gender) # split out a dataframe for boys (1) and girls (2)
Boys swimmer of the meet is Matt Brownstead from Pennsylvania, the only boy to win two events! He also broke the national high school record in the 50 free. Let’s see his results.
Swimmer_Of_Meet[] %>% slice_head(n = 5) %>% select(-Gender) %>% flextable::flextable() %>% bold(part = "header") %>% bg(bg = "#D3D3D3", part = "header")
Results_Final %>% filter(Name == "Brownstead, Matt") %>% select(Place, Name, Team, Finals_Time, Event) %>% arrange(desc(Event)) %>% flextable::flextable() %>% bold(part = "header") %>% bg(bg = "#D3D3D3", part = "header")
Boys 50 Yard Freestyle
Boys 100 Yard Freestyle
As for the girls the competition was a bit tighter, with two athletes, Chloe Stepanek and Megan Deuel, both winning two events. Going to our All-American standard tiebreaker gives the win to Chloe Stepanek! Winning here is hopefully some solace for Chloe after Megan won the award at the NYS girls meet.
Swimmer_Of_Meet[] %>% slice_head(n = 5) %>% select(-Gender) %>% flextable::flextable() %>% bold(part = "header") %>% bg(bg = "#D3D3D3", part = "header")
Results_Final %>% filter(Name == "Chloe Stepanek") %>% select(Place, Name, Team, Finals_Time, Event) %>% arrange(desc(Event)) %>% flextable::flextable() %>% bold(part = "header") %>% bg(bg = "#D3D3D3", part = "header")
Girls 200 Yard Freestyle
Girls 100 Yard Freestyle
That wraps up this match up. Join us next time here at Swimming + Data Science for another Round 1 match up - number 1 seed California vs. number 8 seed Georgia. We’ll see you then!