Analyzing extreme skiing and snowboarding in R: Freeride World Tour 1996–2018

Matt.0

4 years ago

[This article was first published on Stories by Matt.0 on Medium, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The Freeride World Tour (FWT) has been hosting extreme skiing & snowboarding events since 1996. Having just wrapped up the 2018 season in March I did an analysis on rankings and past FWT winners using R.

If you haven’t heard of the FWT yet it’s an exciting sport where riders choose gnarley-looking lines through cliff-faces, cornices and nasty couloirs (like this line). There’s no artificial jumps or half-pipes just a gate at the top and one at the bottom. The judges use five criteria (Difficulty of Line, Control, Fluidity, Jumps and Technique) which are scored from 0 to 100.

My desire to do this project was mainly to practice some web-scraping, accessing the Twitter API and reinforce my own understanding of the concepts applied within. Skiing and snowboarding are forgotten when it comes to sports analytics — I mean even tennis has an R package– so I thought it would be cool project.

#FWT18 on Twitter

First, I collected data from the Twitter API using the TwitteR package. To do this, I needed to set up a developer account to authenticate the connection from R (a good tutorial on how to do this is here).

library(twitteR)
# set objects
api_key <- "your api key"
api_secret <- "your api secret"
access_token <- "your access token"
access_token_secret <- "your access token secret"
# authenticate
setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)

Quickly visualize the frequency of tweets mentioning #FWT18

I get a warning that I asked the Twitter API for a maximum of 1,000 tweets but it only returned 58 from the past 9 days.

It would be nice to get a longer history but the twitter API only indexs a few days worth of tweets and does not allow historic queries (there’s a [Python package that can](https://github.com/Jefferson-Henrique/GetOldTweets-python) but I haven’t tried it out yet.

Where in the world are people discussing #FWT18?

It looks like most of the tweets originated from Europe and North America, although we see a few tweets from Japan, Australia and Africa.

Note that geo-tagged tweets are only available for those who have opted in for that in the settings, which is a mere 3% of Twitter users.

What are some of the top words coming up in tweets about #FWT18?

This time I included @FreerideWTour and the Twitter handles of a few of the riders by using a function that looks for mentions and hash-tags and then create a wordcloud for the most popular things being mentioned along the #FWT18 tag.

Looks like crashes and weekend are used often.

Exploring FWT18 Rankings

Since the data is not available as a .txt or a .csv file on the website, nor do they provide and API I needed to crawl for it.

It’s worth mentioning that administrators may want to protect certain parts of their website for a number of reasons, “such as indexing of an unannounced site, traversal of parts of the site which require vast resources of the server, recursive traversal of an infinite URL space, etc.”

Therefore, one should always check if they have permission. One way to do this, is to use the robotstxt package to check if your webbot has permission to access certain parts of a webpage (Thanks to Maëlle Salmon for that tip).

library(robotstxt)
paths_allowed("https://www.freerideworldtour.com/rider/")
[1] TRUE

Okay, it looks like we have permission.

Unfortunately the code for the FWT 2018 rankings page is “fancy” meaning one needs to click the drop-down arrows to get a riders score for each event.

I think the data is being loaded with JavaScript which means that I would need to use a program which can programmatically click the button. I’ve heard splashr or RSelenium may accomplish this. But, I’m new to web-scraping and only familiar with rvest so I came up with a (relatively) quick work-around.

I placed the names from the 2018 roster into a datase t and loaded it as an object. I can automatically crawl every rider by feeding these names to rvest with a for loop to the end of https://www.freerideworldtour.com/rider/

# read in the names from 2018 roster
roster <- read_csv("https://ndownloader.figshare.com/files/11173433")
# create a url prefix
url_base <- "https://www.freerideworldtour.com/rider/"
riders <- roster$name
# Assemble the dataset
output <- data_frame()
for (i in riders) { 
  temp <- read_html(paste0(url_base, i)) %>% 
    html_node("div") %>% 
    html_text() %>% 
    gsub("\\s*\\n+\\s*", ";", .) %>% 
    gsub("pts.", "\n", .) %>% 
    read.table(text = ., fill = T, sep = ";", row.names = NULL, 
               col.names = c("Drop", "Ranking", "FWT", "Events", "Points")) %>%
    subset(select = 2:5) %>% 
    dplyr::filter(
      !is.na(as.numeric(as.character(Ranking))) & 
        as.character(Points) != ""
    ) %>%
    dplyr::mutate(name = i)
  output <- bind_rows(output, temp)
}

I was going to look at the overall standings for each category (skiing and snowboarding) broken-down by how many points athletes earned at each FWT event in 2018; however, I noticed there was something odd going on.

# How many riders in the roster?
unique(roster)
# A tibble: 56 x 3
   name           sport sex  
   <chr>          <chr> <chr>
 1 aymar-navarro  ski   male 
 2 berkeley-patt~ ski   male 
 3 carl-regner-e~ ski   male 
 4 conor-pelton   ski   male 
 5 craig-murray   ski   male 
 6 drew-tabke     ski   male 
 7 fabio-studer   ski   male 
 8 felix-wiemers  ski   male 
 9 george-rodney  ski   male 
10 grifen-moller  ski   male 
# ... with 46 more rows
# How many names in the output object?
unique(output$name)
[1] "aymar-navarro"       
 [2] "berkeley-patterson"  
 [3] "carl-regner-eriksson"
 [4] "conor-pelton"        
 [5] "craig-murray"        
 [6] "drew-tabke"          
 [7] "fabio-studer"        
 [8] "felix-wiemers"       
 [9] "george-rodney"       
[10] "grifen-moller"       
[11] "ivan-malakhov"       
[12] "kristofer-turdell"   
[13] "leo-slemett"         
[14] "logan-pehota"        
[15] "loic-collomb-patton" 
[16] "markus-eder"         
[17] "mickael-bimboes"     
[18] "reine-barkered"      
[19] "ryan-faye"           
[20] "sam-lee"             
[21] "stefan-hausl"        
[22] "taisuke-kusunoki"    
[23] "thomas-rich"         
[24] "trace-cooke"         
[25] "yann-rausis"         
[26] "arianna-tricomi"     
[27] "elisabeth-gerritzen" 
[28] "eva-walkner"         
[29] "hazel-birnbaum"      
[30] "jaclyn-paaso"        
[31] "kylie-sivell"        
[32] "lorraine-huber"      
[33] "rachel-croft"        
[34] "blake-hamm"          
[35] "christoffer-granbom" 
[36] "clement-bochatay"    
[37] "davey-baird"

Turns out the name elias-elhardt, which was row 40, is creating an issue. I’m not sure why this was but since Elias only competed in the qualifiers let’s simply remove him then re-run the code from above.

# Remove Elias Elhardt
roster <- roster[-40,]
riders <- roster$name

Historical FWT winners

The FWT lists past event winners on their website. I gathered the data of all winners from the 23 tours between 1996 and 2018 and included their age from either the website or a quick web-search. The dataset can be found on figshare.

# load the data
df <- read_csv("https://ndownloader.figshare.com/files/11300864")
# Get summary statistics on age of winners
df %>% 
  summarize(mean_age = median(age, na.rm = TRUE),
            max_age = max(age, na.rm = TRUE),
            min_age = min(age, na.rm = TRUE))
  mean_age max_age min_age
1       29      43      15
# Find minimum age of winner by sex and sport
df %>%
  group_by(sex, sport) %>% 
  slice(which.min(age)) %>% 
  dplyr::select(name, sex, sport, age)
# A tibble: 4 x 4
# Groups:   sex, sport [4]
  name            sex    sport       age
  <chr>           <chr>  <chr>     <int>
1 Arianna Tricomi female ski          23
2 Michelle Gmitro female snowboard    16
3 George Rodney   male   ski          21
4 Cyril Neri      male   snowboard    15

How many times have riders won FWT events?

The large number of riders who won at least one FWT event dwarfs those unique athlets who won a considerable number of events.

We can have another look at the data in a table.

Xavier De Le Rue is near the top with 8 and appears under both the ski and snowboarder categories? That’s strange. We can replace that mistake in the data set like this.

df$sport[df$name == "Xavier De Le Rue"] <- "snowboard"

Which countries have the most winners?

Looks like USA, France and Austria produce some of the best free riders.

How old are the winners from each country?

We can use box plots to get an idea of how the age distribution for each country looks like.

However, a better way to get a visual overview of distributions is with a rain cloud plot. Since a few of the countries only had one competitor let’s remove them.

It may also be interesting to ask how has the age of winners changed in the history of the competition.

Are FWT winners getting younger with time?

There doesn’t appear to be any trend with age and winning FWT events over time.

It would have been nice to play around with data on how judges scored each riders run from 0 to 100 for each of the five categories for each event but it doesn’t look like that sort of information is available to the public at this time.

The full code for this analysis can be found on Github.

Analyzing extreme skiing and snowboarding in R: Freeride World Tour 1996–2018 was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

To leave a comment for the author, please follow the link and comment on their blog: Stories by Matt.0 on Medium.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.