The Tour de France – a short primer
The Tour de France (‘Le Tour’) is the world’s biggest and most prestigious cycling event, with a history stretching back to 1903. Each annual ‘edition’ of the race is composed of around 21 stages that traverse France, and each stage is a standalone race in itself. The racing is complex: each team of eight riders (nine before 2018) competes for any combination of individual stage wins, sprint points, mountain climbing, aggressive riding and team ability. The most coveted prize of all is the ‘General Classification’ (GC), which is awarded to the rider with the lowest aggregate time at the end of the race. Each day, the rider with the lowest aggregate time following the previous stage wears the ‘Maillot Jaune’ (yellow jersey), indicating that they are the current race leader.
tdf: an R package for Tour de France data
The tdf package is hosted on GitHub and contains information about the overall winning rider for each edition of the race, the winner’s biographical information and the results for each stage in each edition. It can be installed from its GitHub repository.
The package is just a container for the editions data frame:

    library(tdf)
    library(tidyverse)
    # visualise contents
    tdf::editions
    glimpse(editions)
    ## Observations: 106
    ## Variables: 20
    ## $ edition       1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
    ## $ start_date    1903-07-01, 1904-07-02, 1905-07-09, 1906-07-04, 1907-0…
    ## $ winner_name   "Maurice Garin", "Henri Cornet", "Louis Trousselier", "…
    ## $ winner_team   "La Française", "Conte", "Peugeot–Wolber", "Peugeot–Wol…
    ## $ distance      2428, 2428, 2994, 4637, 4488, 4497, 4498, 4734, 5343, 5…
    ## $ time_overall  94.55389, 96.09861, NA, NA, NA, NA, NA, NA, NA, NA, 197…
    ## $ time_margin   2.98916667, 2.27055556, NA, NA, NA, NA, NA, NA, NA, NA,…
    ## $ stage_wins    3, 1, 5, 5, 2, 5, 6, 4, 2, 3, 1, 1, 1, 4, 2, 0, 3, 4, 4…
    ## $ stages_led    6, 3, 10, 12, 5, 13, 13, 3, 13, 13, 8, 15, 2, 14, 14, 3…
    ## $ height        1.62, NA, NA, NA, NA, NA, 1.78, NA, NA, NA, NA, NA, NA,…
    ## $ weight        60, NA, NA, NA, NA, NA, 88, NA, NA, NA, NA, NA, NA, NA,…
    ## $ age           32, 19, 24, 27, 24, 25, 22, 22, 26, 23, 23, 24, 33, 30,…
    ## $ born          1871-03-03, 1884-08-04, 1881-06-29, 1879-06-05, 1882-1…
    ## $ died          1957-02-19, 1941-03-18, 1939-04-24, 1907-01-25, 1917-1…
    ## $ full_name     NA, NA, NA, NA, "Lucien Georges Mazan", "Lucien Georges…
    ## $ nickname      "The Little Chimney-sweep", "Le rigolo (The joker)", "L…
    ## $ birth_town    "Arvier", "Desvres", "Paris", "Moret-sur-Loing", "Pless…
    ## $ birth_country "Italy", "France", "France", "France", "France", "Franc…
    ## $ nationality   " France", " France", " France", " France", " France", …
    ## $ stage_results [[ , ,

editions is a tibble whose rows each correspond to a single edition of the Tour de France. The columns contain information about the race itself and the overall winner, including:
distance is the aggregate distance in kilometres covered by the entire race.
time_overall is the time in hours taken by the winner to complete the race.
time_margin is the difference in finishing times between the race winner and the first runner-up.
stage_wins is the number of stages won by the eventual winner during the edition (note that it is possible to win the GC without winning any stages at all).
stages_led is the number of stages spent as the race leader (wearing the yellow jersey) by the eventual winner.
weight is the winner’s body weight in kilograms.
height is the winner’s height in metres.
stage_results is a column containing a list of lists. Each element contains a list of stage results for a particular edition of the Tour de France.
How has the race changed over time?
Forget ultra-marathons and Tough Mudder: early editions of Le Tour were really tough. Riders were mostly self-supported and rode in woollen jerseys for hundreds of kilometres per day on steel-framed bicycles. The longest stage in Tour history was 482 kilometres (Stage 5, 1919); the stage winner, Jean Alavoine, took almost 19 hours to complete it.
To get a sense for how the length of the race has varied since 1903, we can visualise the total distance in the editions data:

    library(ggplot2)
    editions %>%
      ggplot(aes(x = start_date, y = distance, color = edition)) +
      geom_point() +
      xlab('Race start date') +
      ylab('Distance in kilometres') +
      ggtitle('Tour de France total distance covered over time') +
      theme(legend.position = "none")

It’s pretty clear that the distances covered have decreased dramatically over time, and have roughly stabilised at about 3,500 kilometres during the last two decades (still a huge distance). The longest ever Tour de France edition was in 1926, with a total distance of 5,745 kilometres!
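The longest edition can also be read off the data directly rather than from the plot. The sketch below is only an illustration: it uses a small hand-typed data frame (the first four distances from the glimpse() output, plus the 1926 total quoted in the text) so that it runs without the tdf package, and the name editions_sample is hypothetical. With the package loaded, the same which.max() call should presumably work on editions$distance.

```r
# Hand-typed sample: distances (km) taken from the glimpse() output,
# plus the 1926 total quoted in the text.
# 'editions_sample' is a hypothetical name, not part of the tdf package.
editions_sample <- data.frame(
  year     = c(1903, 1904, 1905, 1906, 1926),
  distance = c(2428, 2428, 2994, 4637, 5745)
)

# Row with the largest total distance
longest <- editions_sample[which.max(editions_sample$distance), ]
print(longest)

# Editions ordered from longest to shortest
ranked <- editions_sample[order(editions_sample$distance, decreasing = TRUE), ]
print(head(ranked, 3))
```

Base R is used here purely to keep the example free of dependencies; a dplyr pipeline with arrange(desc(distance)) would be the closer match to the rest of this post.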
On the face of it, it seems like the riders of today have it substantially easier compared to riders of the past. But how fast are today’s riders going?

    library(ggrepel)
    editions %>%
      ggplot(aes(x = start_date, y = distance / time_overall, color = edition)) +
      geom_point(na.rm = TRUE) +
      geom_label_repel(data = editions %>% sample_n(20),
                       aes(label = winner_name),
                       size = 2.3, nudge_y = -9, na.rm = TRUE,
                       segment.alpha = 0.2) +
      xlab('Edition start date') +
      ylab('Average speed km/h') +
      ggtitle('Tour de France winners average speed') +
      theme(legend.position = "none")

They’re going pretty fast. It looks like while the race has been getting gradually shorter, the speeds have been getting much faster. The change also coincides with the professionalisation of the sport, better equipment and smarter training, so it’s hard to attribute the change in speed to any single cause. It’s worth highlighting the two fastest average speeds in Tour de France history:

    # Top 2 average speeds of Tour de France winners
    editions %>%
      mutate(speed = distance / time_overall) %>%
      select(start_date, winner_name, speed) %>%
      arrange(desc(speed)) %>%
      print(n = 2)
    ## # A tibble: 106 x 3
    ##   start_date winner_name     speed
    ## 1 1998-07-11 Marco Pantani    41.7
    ## 2 2005-07-02 Lance Armstrong  41.7
    ## # … with 104 more rows

The two fastest editions of the Tour de France in this data were won by Marco Pantani (in 1998) and Lance Armstrong (in 2005). Armstrong was later stripped of this and his six other Tour wins for his use of banned performance-enhancing substances. Pantani remains the official winner of the 1998 edition, although retrospective testing of samples from that race, made public in 2013, reportedly found EPO. The speed of doped riders in such Tour editions was so obviously faster than that of non-doped riders that French media declared a culture of “cyclisme à deux vitesses” (“two-speed cycling”). It is unknown how much riders still use banned substances for performance enhancement, but the average speeds of the Pantani / Armstrong years are not matched by any later edition in this data.
Note: the data in the tdf package retains the winning times of banned, disqualified and otherwise sanctioned riders for the purposes of data analysis. The overall standings are as they would have appeared on the final day of the race, so the officially recognised winner of a particular edition may not be the rider with the fastest time.
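The average speeds in this section are simply the total distance divided by the winner’s overall time. As a rough worked example, the sketch below recomputes that ratio for the first three editions, using values copied by hand from the glimpse() output; the data frame name speed_sample is hypothetical. It also shows why the plotting code passes na.rm = TRUE: an edition with no recorded time_overall yields an NA speed.

```r
# First three editions, copied from the glimpse() output.
# 'speed_sample' is a hypothetical name used only for this illustration.
speed_sample <- data.frame(
  year         = c(1903, 1904, 1905),
  distance     = c(2428, 2428, 2994),        # kilometres
  time_overall = c(94.55389, 96.09861, NA)   # hours; NA where no time is recorded
)

# Average speed in km/h; a missing time propagates to a missing speed
speed_sample$speed <- speed_sample$distance / speed_sample$time_overall
print(round(speed_sample$speed, 1))  # roughly 25.7 and 25.3 km/h, then NA

# which.max() skips NA values, so this picks the fastest edition with a known time
fastest <- speed_sample[which.max(speed_sample$speed), ]
print(fastest$year)
```

At about 25 to 26 km/h, these early winners were far slower than the 41.7 km/h shown in the table above, which is the gap the plot makes visible.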
How have the riders changed over time?
France is a mountainous country, and a crucial ingredient for success in the Tour de France is a rider’s ability to climb hills quickly and efficiently. Hill climbing is a fight against gravity that pits a rider’s strength against their total weight (bike + equipment + body). The rider has two options to improve: get stronger or get leaner. Using the editions data we can explore the latter over time by using rider weight and height data to calculate body mass index (BMI), which is a (very rough) proxy for leanness.

    library(ggrepel)
    editions %>%
      ggplot(aes(x = start_date, y = weight / height^2, color = edition)) +
      geom_point(na.rm = TRUE) +
      geom_label_repel(data = editions %>% sample_n(20),
                       aes(label = winner_name),
                       size = 1.8, nudge_y = 4, na.rm = TRUE,
                       segment.alpha = 0.2) +
      xlab('Edition start date') +
      ylab('Body mass index') +
      ggtitle('Tour de France winners body mass index') +
      theme(legend.position = "none")

It’s pretty clear that over time, the trend has been towards winners having lower BMI, and likely being leaner overall. Apart from the obvious issues with BMI as a metric (body shapes are more complex than just height and weight), it’s interesting to consider why this trend has occurred. It’s tempting to conclude that more careful dieting and preparation in recent years has led to riders having lower body fat percentages, which can enhance a rider’s power-to-weight ratio and overall performance. However, it could also be due to changes in the race: if race winning becomes more dependent on performance in the mountains (for example, because the number of mountain stages has increased overall), this could result in the lighter and leaner athletes tending to excel overall.
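For reference, the BMI plotted above is weight in kilograms divided by the square of height in metres. The sketch below works through that arithmetic for the only two winners whose height and weight are visible in the glimpse() output (editions 1 and 7), alongside one edition with missing values; the data frame name bmi_sample is hypothetical.

```r
# Height and weight copied from the glimpse() output.
# 'bmi_sample' is a hypothetical name used only for this illustration.
bmi_sample <- data.frame(
  edition = c(1, 2, 7),
  height  = c(1.62, NA, 1.78),  # metres
  weight  = c(60, NA, 88)       # kilograms
)

# BMI = weight / height^2; in R, ^ binds more tightly than /
bmi_sample$bmi <- bmi_sample$weight / bmi_sample$height^2
print(round(bmi_sample$bmi, 1))  # 22.9, NA, 27.8
```

Many early editions have no recorded height or weight, so the left-hand side of the BMI plot rests on only a handful of riders and the trend there should be read with caution.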
stage_results contains the breakdown of results by stage for each edition of the Tour de France. For example, the results of the final stage of the 2019 Tour de France can be printed using

    editions$stage_results$`2019`$`stage-21`
    ## # A tibble: 155 x 8
    ##     rank time     rider         bib_number   age team            points elapsed
    ##  1     1 3H 4M 8S Ewan Caleb           161    25 Lotto Soudal       100 3H 4M 8S
    ##  2     2 0S       Groenewegen …         84    26 Team Jumbo-Vis…     70 3H 4M 8S
    ##  3     3 0S       Bonifazio Ni…        172    25 Team Total Dir…     50 3H 4M 8S
    ##  4     4 0S       Richeze Maxi…         27    36 Deceuninck - Q…     40 3H 4M 8S
    ##  5     5 0S       Boasson Hage…        201    32 Team Dimension…     32 3H 4M 8S
    ##  6     6 0S       Greipel André        215    37 Team Arkéa Sam…     26 3H 4M 8S
    ##  7     7 0S       Trentin Matt…        107    29 Mitchelton-Sco…     22 3H 4M 8S
    ##  8     8 0S       Stuyven Jasp…        138    27 Trek - Segafre…     18 3H 4M 8S
    ##  9     9 0S       Arndt Nikias         142    27 Team Sunweb         14 3H 4M 8S
    ## 10    10 0S       Sagan Peter           11    29 BORA - hansgro…     10 3H 4M 8S
    ## # … with 145 more rows

The important columns for the stage data are:
time: the finishing time of the stage winner; for every other rider, the time difference to the winner.
rider: the rider name formatted as ‘Surname Forename’.
age: the age of the rider at the start of the stage.
elapsed: the time taken to reach the finish line; this is stored as a lubridate::period object for easier printing and manipulation.
In the case above, Caleb Ewan won the finish-line sprint of the final stage. Since the first 53 riders crossed the line as one contiguous group, they were all granted the same finishing time as Ewan, but their finishing order corresponds to the order in which they crossed the line.
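The relationship between the time and elapsed columns can be sketched as follows: a rider’s elapsed time is the winner’s finishing time plus that rider’s gap. The example is a minimal illustration under stated assumptions. It uses three rows typed by hand from the table above, stores the times as plain strings rather than lubridate periods, and relies on a hypothetical helper, to_seconds(), written here in base R. With the package’s real period columns, lubridate::period_to_seconds() would presumably take the place of that helper.

```r
# Hypothetical helper: convert strings such as "3H 4M 8S" or "0S" to seconds.
to_seconds <- function(x) {
  vapply(x, function(s) {
    # Extract the number directly preceding a unit letter; 0 if the unit is absent
    value <- function(unit) {
      hit <- regmatches(s, regexpr(paste0("[0-9]+(?=", unit, ")"), s, perl = TRUE))
      if (length(hit) == 0) 0 else as.numeric(hit)
    }
    3600 * value("H") + 60 * value("M") + value("S")
  }, numeric(1), USE.NAMES = FALSE)
}

# Three rows typed by hand from the stage 21 table ('stage_sample' is hypothetical)
stage_sample <- data.frame(
  rank    = c(1, 9, 10),
  rider   = c("Ewan Caleb", "Arndt Nikias", "Sagan Peter"),
  time    = c("3H 4M 8S", "0S", "0S"),
  elapsed = c("3H 4M 8S", "3H 4M 8S", "3H 4M 8S"),
  stringsAsFactors = FALSE
)

# The winner's row holds a finishing time; every other row holds a gap
winner_seconds  <- to_seconds(stage_sample$time[stage_sample$rank == 1])
gap_seconds     <- ifelse(stage_sample$rank == 1, 0, to_seconds(stage_sample$time))
elapsed_seconds <- winner_seconds + gap_seconds

# Number of riders in the sample credited with the winner's time
same_time <- sum(elapsed_seconds == winner_seconds)
print(same_time)
```

Applied to the full stage table, the same comparison would be one way to count how many riders were given the winner’s time.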