Scraping the Sugarcoat


Abstract:

Web-scraped data are used to put a Rubik’s cube competition result into perspective. The sugarcoating consists of altering the sampling frame of the comparison to the more relevant population of senior first time cubers.


Creative Commons License This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. The markdown+Rknitr source code of this blog is available under a GNU General Public License (GPL v3) license from github.

Motivation

I just finished teaching an undergraduate course on data wrangling with R at Stockholm University covering the tidyverse, SQL, and web-scraping1. Inspired by Jenny Bryan’s STAT545 course, the course used GitHub as communication platform. Similar to the useR! lightning talks, each student had to pitch their project work in a 5 minute presentation in order to convince other students to read their report. I was utterly amazed by the content of the reports and the creativity of the presentations (sung slide titles, shiny apps, cliffhangers, and much more). Enabling mathematics students to pull their own data gives them the power to realize ideas and test hypotheses that were not possible before! Most of the students did web-scraping or API calls to get their data. Since I, thanks to the support of two TAs, never got around to implementing any scraping myself, a blog post feels like the right way to catch up.

After finishing in last place at the Berlin Winter Cubing 2020 competition, there was an acute need to sugarcoat the result. The aim of this post is thus to substantiate that the last place was purely due to a lack of competitors. Since my results were not yet part of the WCA results database at the time of the analysis, the idea was to use web-scraping of the live feed to pull my results and compare them to the database.

Scraping WCA live results

WCA competition results are reported live, i.e. as they are entered. The results can be queried and a dynamically generated web page displays the information. Below are shown the round 1 results of the Berlin Winter Cubing 2020. In the traditional Rubik’s cube (aka 3x3x3) event, one round of the competition consists of 5 solves. A trimmed mean, the so-called average of 5 (Ao5), is computed from the five solve times by removing the best and worst result and averaging the three remaining results.
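
To make the Ao5 computation concrete, here is a minimal R sketch (the helper name ao5 and the five example times are made up for illustration):

# Trimmed mean of five solve times (Ao5): drop the best and the worst
# solve, then average the three remaining ones.
ao5 <- function(times) {
  stopifnot(length(times) == 5)
  mean(sort(times)[2:4])
}

# Example with five hypothetical solve times in seconds
ao5(c(72.31, 65.24, 105.12, 80.05, 78.90))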

From discussions in the Competitor’s Area it sounds like the last ranks are usually occupied by parents accompanying their kids to the competition.

The data science job is now to automatically scrape the above results as they become available. In other words, dynamically generated pages are to be scraped. The post RSelenium Tutorial: A Tutorial to Basic Web Scraping With RSelenium provided help here, including an explanation of how to change the web driver version to match the installed Chrome version. The RSelenium-based scraping code to get the above table looks as follows.

library(RSelenium)
driver <- rsDriver(browser = c("chrome"), chromever = "79.0.3945.36")
remote_driver <- driver[["client"]] 

# Fetch WCA live results of the 3x3x3 round 1 from the Berlin Winter Cubing 2020 competition
url <- "https://live.worldcubeassociation.org/competitions/BerlinWinterCubing2020/rounds/333-r1"
remote_driver$navigate(url)

# Wait a little to make sure page has been generated.
Sys.sleep(5)

The SelectorGadget bookmarklet was then used to find the CSS selector of the table containing the results, and the rvest::html_table function was used to extract the table as a data.frame.

library(rvest)
library(tidyverse)   # dplyr, stringr, tibble, ... used below

# Extract table with all results from round 1
results <- remote_driver$getPageSource() %>% .[[1]] %>% read_html() %>% 
  html_nodes(css = ".MuiTable-root") %>% html_table(header=1) %>% .[[1]] %>% as_tibble()
# Small helper function to parse WCA results with lubridate, i.e. add "0:" if no minutes.
time_2_ms <- function(x) { if_else(str_detect(x, ":"),  x, str_c("0:", x)) %>% lubridate::ms() }

# Convert reported timings to lubridate periods 
my_results <- results %>% filter(Name == "Michael Höhle") %>%
  mutate_at(vars(`1`,`2`,`3`,`4`,`5`,`Average`,`Best`), .funs= ~ time_2_ms(.))

# Extract the relevant results
my_avg_333 <- my_results %>% pull(`Average`) %>% as.numeric() * 100 # in centiseconds
my_rank    <- my_results %>% pull(`#`) %>%  as.numeric()  # 84
max_rank <- results %>% summarise(n=n()) %>% pull(n)   # 84
my_range   <- my_results %>% select(`1`:`5`) %>%  # best and worst result, in centiseconds
  mutate_all(lubridate::period_to_seconds) %>% unlist() %>% range() * 100

In other words, my first (and as of today only) official 3x3x3 average is 1M 18.05S, which corresponds to 7805 centiseconds. This is much better than the 3 minutes anticipated in my analysis from May 2019 and was well under the 4:00 cutoff of the round. Still, I finished in last place in the 3x3x3 competition (rank 84/84). However, the competition was in no way representative of my peer group (senior newbie cubers) as, for example, numbers 1, 3 and 7 of the World Championship 2019 also competed.

The aim of this post is thus to use a data-based approach to alter the sampling frame of the comparison in order to make the comparison more relevant (aka sugarcoating):

  • How does my result rank within the population of German first time competitors?
  • How does my result rank within the population of age 40+ cubers?

German first time competitors

The WCA results database is used to determine all 3x3x3 results by German cubers, as shown in the previous Speedmining the Cubing Community with dbplyr post. We perform a comparison with the round 1 results of all German first time 3x3x3 competitors within the last 5 years.
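
As a rough illustration, the cohort could be assembled along the following lines. This is a hypothetical sketch in which results and competitions are assumed to be dbplyr tbls pointing at the corresponding tables of the WCA database export, and a cuber's first competition is approximated by the earliest competition year:

# Round 1 Ao5 of German cubers at their first 3x3x3 competition (sketch).
first_comp <- results %>%
  filter(eventId == "333", personCountryId == "Germany",
         roundTypeId %in% c("1", "d")) %>%           # (combined) first rounds
  left_join(competitions %>% select(competitionId = id, year),
            by = "competitionId") %>%
  collect() %>%
  group_by(personId) %>%
  arrange(year) %>%
  filter(row_number() == 1) %>%                      # first competition per cuber
  ungroup() %>%
  filter(year >= 2015, average > 0)                  # last 5 years, valid Ao5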

This gives us 1024 cubers to compare with, which constitutes a more relevant population of comparison than, e.g., podium contestants from World’s 2019. The plot below shows the cumulative distribution of the Ao5 the cubers obtained in round 1 of their first competition. Given a value on the x-axis, the y-axis denotes the proportion of cubers who obtained an average less than or equal to that value.
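
A plot of this kind can be drawn with ggplot2’s empirical CDF layer; the sketch below assumes the hypothetical first_comp cohort from above, with averages in centiseconds as stored in the WCA database:

library(ggplot2)

# Empirical CDF of first-competition Ao5s; the dashed line marks my average.
ggplot(first_comp, aes(x = average / 100)) +        # centiseconds -> seconds
  stat_ecdf() +
  geom_vline(xintercept = my_avg_333 / 100, linetype = "dashed") +
  labs(x = "Ao5 in first competition (s)",
       y = "Proportion of cubers with average <= x")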

From the graph it becomes clear that my time corresponds to the 94.34th percentile of the distribution, i.e. 94% of the German first time competitors from the last 5 years had a better average than me in their first competition. In other words, my result was within the 95th percentile of German competition newbies. Yay!
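
The quoted percentile can be computed with the empirical CDF, analogous to the senior analysis further below (again using the hypothetical first_comp cohort):

# Percentile of my average within the German first-time cohort
ecdf(first_comp %>% pull(average))(my_avg_333)   # approx. 0.94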

How do these cubers evolve after their first competition? I was particularly interested in the trajectory of cubers within my skill bracket, which here shall be defined as an average located between my best and worst solve time, i.e. between 65.24s and 105.12s.
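
In code, the bracket can be expressed with the my_range object computed earlier (values in centiseconds); first_comp is again the hypothetical cohort from above:

# First-time cubers whose Ao5 lies between my best and worst solve time
skill_bracket <- first_comp %>%
  filter(average >= my_range[1], average <= my_range[2])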

In the figure, the two horizontal lines indicate the limits of the skill bracket and the cross denotes my obtained average. A smooth line is fitted to the longitudinal data; for simplicity, the smoothed fit takes neither the longitudinal data structure nor the drop-out mechanisms into account. Focusing on the cohort of cubers who started competing within the last 5 years induces censoring: cubers who entered their first competition, say, 1 year ago cannot have results more than 1 year back in time. Still, a clear downward trend is visible for cubers who go to further competitions. However, only 37% of the first time cubers have a second competition recorded in the data. Somewhat demotivating is that only 3 out of the 84 first time cubers in the skill bracket manage to obtain a sub-30s average at a later stage.
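
One way to draw such a figure is sketched below; trajectories is a hypothetical data frame with one row per cuber and competition, containing personId, the time since the first competition and the Ao5 (this is not the post’s actual plotting code):

# One line per cuber plus an overall smooth; the smooth ignores the
# longitudinal structure and drop-out, as discussed above.
ggplot(trajectories, aes(x = years_since_first, y = average / 100)) +
  geom_line(aes(group = personId), alpha = 0.3) +
  geom_smooth(se = FALSE) +
  geom_hline(yintercept = c(65.24, 105.12), linetype = "dotted") +  # skill bracket
  labs(x = "Years since first competition", y = "Ao5 (s)")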

Comparing with senior cubers

Michael George maintains an unofficial ranking for the senior cubing community based on the WCA results database and a voluntary registration of senior cubers. As in other sports disciplines, “senior” is defined as aged 40+. Based on a one-time anonymised extract from the WCA database containing the true age of the cubers, the completeness of the self-report sample as well as a statistical extrapolation of the true rank within the WCA 40+ population can be computed. Around 30% of the senior cubers are contained in the self-reported sample. The WCA ID as well as the personal records of all self-reported senior cubers are available in a JSON-like format and can be scraped using the httr package.

# Download the senior rankings; the file is JavaScript wrapping a JSON
# object, so the "rankings =" assignment is stripped before parsing.
response <- httr::GET("https://logiqx.github.io/wca-ipy-www/data/Senior_Rankings.js") %>% 
  httr::content(as="text") %>% 
  str_replace("rankings =\n", "") %>% 
  jsonlite::fromJSON()

From the response we can extract the WCA IDs of the self-reported senior cubers, which we then match to the WCA database to get their round 1 result at their first cubing competition. Note: This is a slight approximation to the population of relevance, as the cubers could have been younger than 40 at the time of their first WCA average.

# WCA IDs of the senior cubers
ids <- response$persons %>% pull(id) %>% unique()
 
# Extract all WCA 3x3x3 results of these senior cubers and restrict to their first
# competition result.
first_senior <- detailed_results %>% filter(personId %in% ids) %>% 
  group_by(personId) %>% 
  arrange(date,roundTypeId) %>% 
  filter(row_number() == 1) %>% 
  ungroup

# Percentile in the first comp average of senior cubers
senior_percentile_first_average <- ecdf(first_senior %>% pull(average))(my_avg_333)

From this it becomes clear that my average is located at the 70th percentile of the first competition results of senior cubers. Not so bad at all.

Discussion

It’s only logical and in the nature of competitions that somebody has to finish last. From my previous analysis I knew this would happen, but being both the age and skill outlier is a bit of a party pooper. On the positive side: signing up for a competition helped me free up some time to practice, I learned how a competition works, saw numbers 1, 3 and 7 from the World’s 2019 final and got to judge others. The statistical analyses in this post show that, by rectifying the sampling frame to a more comparable group, the results are not so bad at all.

Technical note

I cube with a stickerless YuXin Little Magic using CFOP (F2L + 4LL, accelerated with additional PLL algos). My 3x3x3 PBs at home are 46.19 (single) and 58.10 (Ao5) with scrambles generated by csTimer. This illustrates that a competition, in terms of pressure, is something quite different from cubing relaxed at home. In one of the attempts I failed the T-perm twice, despite having made a regular expression exercise for it as part of the course…

Acknowledgments

The terms of use of the WCA database request that any use of it be accompanied by the following text:

This information is based on competition results owned and maintained by the World Cube Association, published at https://worldcubeassociation.org/results as of Jan 22, 2020.

Besides this formal note, I thank the WCA Results Team for providing the WCA data for download in this comprehensive form!

Literature


  1. Original course development was done by Martin Sköld in 2018-2019.
