Formula 1 Analysis in R with f1dataR: Lap Times, Pit Stops, and Driver Performance
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Formula 1 is one of the most compelling areas for data analysis in R because it combines structured results, lap-by-lap timing, pit strategy, and driver performance into one of the richest datasets in sport. For anyone building authority in technical R content, this is an excellent niche: it is specific enough to stand out, but broad enough to support tutorials, visualizations, predictive models, and long-form analytical writing.
One of the biggest advantages of working in this space is that f1dataR gives R users access to both historical Formula 1 data and richer session-level workflows linked to the wider Ergast/Jolpica and FastF1 ecosystem. That makes it possible to move from simple race results into much more interesting questions: Who had the strongest race pace? Which driver managed tyre degradation best? Did a pit stop strategy actually work? Can we build a basic model to estimate race outcomes?
This is where Formula 1 becomes much more than a sports topic. It becomes a practical case study in data wrangling, time-series thinking, feature engineering, visualization, and prediction. And because the R blog space has relatively little deep Formula 1 content compared with more general analytics topics, a strong tutorial here can help position your site as a serious source of expertise.
Why Formula 1 analysis in R is such a strong niche
Most R tutorials on the web focus on standard examples: sales dashboards, housing prices, or generic machine learning datasets. Formula 1 is different. The data has context, drama, and a built-in audience. Every race gives you new material to analyze, and every session contains multiple layers of information: qualifying pace, stint length, tyre compounds, safety car timing, sector performance, overtakes, and pit strategy.
That is part of what makes this topic attractive for long-form content. You are not just teaching code. You are showing how code helps explain real competitive decisions. A lap time is not just a number. It is evidence of tyre wear, traffic, fuel load, track evolution, and driver execution.
For readers who want to go deeper into this kind of workflow, resources such as Racing with Data: Formula 1 and NASCAR Analytics with R are useful because they reinforce the idea that racing analytics in R can go well beyond basic charts and into serious, code-driven analysis.
Installing the packages
The first step is to set up a workflow that is both reproducible and flexible. For most Formula 1 analysis projects in R, you will want f1dataR plus a small set of packages for data cleaning, plotting, reporting, and modeling.
install.packages(c(
  "f1dataR", "tidyverse", "lubridate", "janitor", "scales",
  "slider", "broom", "tidymodels", "gt", "patchwork"
))
library(f1dataR)
library(tidyverse)
library(lubridate)
library(janitor)
library(scales)
library(slider)
library(broom)
library(tidymodels)
library(gt)
library(patchwork)
If you want to work with official session-level timing data, it is also a good idea to configure FastF1 support and define a local cache.
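# Create the cache directory before pointing f1dataR at it
dir.create("f1_cache", showWarnings = FALSE)
options(f1dataR.cache = "f1_cache")
# One-time setup of the Python environment that FastF1 runs in
setup_fastf1()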
That may look like a small detail, but caching matters when you are building serious analytical content. It makes your workflow faster, cleaner, and much easier to reproduce when updating notebooks, reports, or blog posts later.
Start with race results
Before diving into laps and strategy, start with historical race results. They provide the backbone for season summaries, driver comparisons, constructor trends, and predictive features.
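# load_results() returns one round per call, so loop over the calendar.
# driver_id / constructor_id are assumed from current f1dataR releases; check names() on your install.
schedule_2024 <- load_schedule(season = 2024)
results_2024 <- map2_dfr(
  as.integer(schedule_2024$round), schedule_2024$race_name,
  ~ load_results(season = 2024, round = .x) %>% mutate(round = .x, race_name = .y)
) %>%
  rename(driver = driver_id, constructor = constructor_id) %>%
  mutate(across(c(grid, position, points), as.numeric))
glimpse(results_2024)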
Once the results are loaded, you can build a season summary table that gives readers an immediate overview of the competitive picture.
season_table <- results_2024 %>%
  clean_names() %>%
  group_by(driver, constructor) %>%
  summarise(
    races = n(),
    wins = sum(position == 1, na.rm = TRUE),
    podiums = sum(position <= 3, na.rm = TRUE),
    avg_finish = mean(position, na.rm = TRUE),
    avg_grid = mean(grid, na.rm = TRUE),
    points = sum(points, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(points), avg_finish)
season_table
You can also convert that summary into a cleaner publication table for a blog or report.
season_table %>%
  mutate(
    avg_finish = round(avg_finish, 2),
    avg_grid = round(avg_grid, 2)
  ) %>%
  gt() %>%
  tab_header(
    title = "2024 Driver Season Summary",
    subtitle = "Wins, podiums, average finish, and points"
  )
This type of summary is useful, but by itself it does not explain much about how results were achieved. That is why the next step matters.
Looking beyond the finishing position
One of the easiest ways to improve an F1 analysis is to move beyond final classification. A driver finishing sixth may have delivered an excellent performance in a midfield car, while a podium in a dominant car may tell a much simpler story. A stronger framework compares results to starting position, teammate performance, and race pace.
A good place to begin is position gain.
position_gain_table <- results_2024 %>%
  clean_names() %>%
  mutate(
    position_gain = grid - position
  ) %>%
  group_by(driver, constructor) %>%
  summarise(
    mean_gain = mean(position_gain, na.rm = TRUE),
    median_gain = median(position_gain, na.rm = TRUE),
    total_gain = sum(position_gain, na.rm = TRUE),
    races = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(mean_gain))
position_gain_table
This metric is simple, but it is still valuable because it gives a first signal of race execution. Of course, it has limits. Front-runners have less room to gain places, and midfield races are often influenced by strategy variance, incidents, and reliability. Still, that nuance is exactly what makes the discussion interesting.
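To make the ceiling effect concrete, here is a minimal sketch on made-up data. The drivers, grid slots, and the `gain_share` metric are illustrative assumptions, not part of f1dataR. It compares raw position gain with gain expressed as a share of the places that were actually available.

```r
# Toy results: hypothetical drivers, not real data
toy <- data.frame(
  driver   = c("A", "B", "C"),
  grid     = c(1, 8, 15),
  position = c(1, 5, 9)
)

# Raw gain: places made up between start and finish
toy$position_gain <- toy$grid - toy$position

# Places available to gain: a pole-sitter has none
toy$max_gain <- toy$grid - 1

# Share of the available places actually gained (NA when nothing was available)
toy$gain_share <- ifelse(toy$max_gain > 0, toy$position_gain / toy$max_gain, NA_real_)

toy
```

Driver A wins from pole and still scores a gain of zero, which is why raw position gain should be read alongside starting position rather than on its own.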
Add race and circuit context
Formula 1 performance is always track-dependent. Some cars are stronger on high-speed circuits, some drivers thrive on street tracks, and some teams handle tyre-sensitive venues better than others. Joining race results with schedule data allows you to frame those questions more clearly.
schedule_2024 <- load_schedule(season = 2024) %>%
  clean_names() %>%
  mutate(round = as.integer(round))
results_with_schedule <- results_2024 %>%
  clean_names() %>%
  mutate(round = as.integer(round)) %>%
  left_join(
    schedule_2024 %>%
      # The schedule's race date column is assumed to be named `date`; it is renamed here
      select(round, race_name, circuit_name, locality, country, race_date = date),
    by = c("round", "race_name")
  )
results_with_schedule %>%
  select(round, race_name, circuit_name, country, driver, constructor, grid, position) %>%
  slice_head(n = 10)
Even at this stage, you already have enough structure to write multiple types of posts: best performing drivers by circuit type, constructor consistency across the season, teammate gaps by venue, or overperformance relative to starting position.
Lap times: where the analysis gets serious
Race results tell you what happened. Lap times tell you how it happened. This is where Formula 1 analysis becomes much more valuable, because you can begin to evaluate race pace, traffic effects, tyre degradation, and the shape of a driver’s performance over the full event.
It is usually best to focus on one race session first, especially if your goal is to explain the process clearly.
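# Session-level laps (tyres, stints, pit times) come from FastF1 via load_session_laps();
# load_laps() has no session argument and returns lap times only.
session_laps <- load_session_laps(season = 2024, round = 10, session = "R") %>%
  clean_names()
session_laps %>%
  select(driver, lap_number, lap_time, compound, tyre_life, stint, pit_out_time, pit_in_time) %>%
  glimpse()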
Lap time fields often need cleaning before they are suitable for visualization or modeling. Converting them into seconds is usually the most practical approach.
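Timing data does not always arrive as plain seconds. As a minimal sketch, assuming lap times stored as text in `m:ss.sss` form, the conversion can be done in base R. The helper name `lap_to_seconds()` is hypothetical and not an f1dataR function.

```r
# Convert "m:ss.sss" (or plain "ss.sss") strings to seconds.
lap_to_seconds <- function(x) {
  x <- as.character(x)
  has_minutes <- grepl(":", x, fixed = TRUE)
  minutes <- ifelse(has_minutes, sub(":.*$", "", x), "0")
  seconds <- sub("^.*:", "", x)
  # suppressWarnings: unparseable entries become NA instead of raising a warning
  suppressWarnings(as.numeric(minutes) * 60 + as.numeric(seconds))
}

converted <- lap_to_seconds(c("1:23.456", "59.900", NA, "bad"))
converted
```

Missing and malformed entries come back as `NA`, so they can be dropped with the same `filter(!is.na(...))` step used for numeric lap times.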
laps_clean <- session_laps %>%
  # Sector columns may be named sector1time or sector_1_time depending on how names were cleaned
  rename_with(~ sub("^sector_?([123])_?time$", "sector\\1_seconds", .x)) %>%
  mutate(
    lap_time_seconds = as.numeric(lap_time),
    across(matches("^sector[123]_seconds$"), as.numeric)
  ) %>%
  filter(!is.na(lap_time_seconds)) %>%
  # Drop implausible laps (timing glitches, red-flag laps)
  filter(lap_time_seconds > 50, lap_time_seconds < 200)
summary(laps_clean$lap_time_seconds)
Comparing race pace by driver
Once the lap data is cleaned, you can compare selected drivers and visualize how their pace evolves through the race.
selected_drivers <- c("VER", "NOR", "LEC", "HAM")
laps_clean %>%
  filter(driver %in% selected_drivers) %>%
  ggplot(aes(x = lap_number, y = lap_time_seconds, color = driver)) +
  geom_line(alpha = 0.8, linewidth = 0.8) +
  geom_point(size = 1.2, alpha = 0.7) +
  scale_y_continuous(labels = label_number(accuracy = 0.1)) +
  labs(
    title = "Race pace by lap",
    subtitle = "Raw lap times across the Grand Prix",
    x = "Lap",
    y = "Lap time (seconds)",
    color = "Driver"
  ) +
  theme_minimal(base_size = 13)
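Raw lap time plots are useful, but they are often noisy because pit laps, out-laps, and unusual traffic can distort the pattern. A stronger analysis filters some of that noise to approximate green-flag pace: the code below drops in-laps and out-laps, then removes any lap more than five seconds off the driver's median, which typically also removes laps run behind a safety car.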
green_flag_laps <- laps_clean %>%
  filter(driver %in% selected_drivers) %>%
  filter(is.na(pit_in_time), is.na(pit_out_time)) %>%
  group_by(driver) %>%
  mutate(
    median_lap = median(lap_time_seconds, na.rm = TRUE),
    lap_delta = lap_time_seconds - median_lap
  ) %>%
  ungroup() %>%
  filter(abs(lap_delta) < 5)
green_flag_laps %>%
  ggplot(aes(lap_number, lap_time_seconds, color = driver)) +
  geom_line(linewidth = 0.9) +
  geom_smooth(se = FALSE, method = "loess", span = 0.25, linewidth = 1.1) +
  labs(
    title = "Green-flag race pace",
    subtitle = "Smoothed lap-time profile after removing pit laps and large outliers",
    x = "Lap",
    y = "Lap time (seconds)"
  ) +
  theme_minimal(base_size = 13)
This kind of chart is one of the most useful in F1 analytics because it shows whether a driver was genuinely fast, merely benefiting from track position, or fading late in the race.
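The median-based outlier filter is easier to trust once you see it on a handful of laps. The sketch below uses invented lap times for one driver (one slow in-lap, one safety-car lap) and the same five-second threshold; the numbers are assumptions chosen for illustration.

```r
# Hypothetical stint: steady laps plus a slow in-lap and a safety-car lap
laps <- data.frame(
  lap_number = 1:8,
  lap_time_seconds = c(92.1, 91.8, 92.0, 97.9, 91.9, 118.4, 92.3, 92.2)
)

# Reference pace: the median is robust to a few very slow laps
median_lap <- median(laps$lap_time_seconds)

# Keep laps within 5 seconds of the driver's median
laps$lap_delta <- laps$lap_time_seconds - median_lap
representative <- laps[abs(laps$lap_delta) < 5, ]

representative$lap_number
```

Using the median rather than the mean matters here: the two slow laps pull the mean up by several seconds, while the median stays at the driver's typical pace.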
Tyre degradation and stint analysis
One of the best ways to add real authority to an F1 post is to quantify degradation. Instead of simply saying a driver “managed tyres well,” you can estimate how lap time changed as tyre life increased during a stint.
stint_degradation <- laps_clean %>%
  filter(driver %in% selected_drivers) %>%
  filter(!is.na(stint), !is.na(tyre_life), !is.na(compound)) %>%
  filter(is.na(pit_in_time), is.na(pit_out_time)) %>%
  group_by(driver, stint, compound) %>%
  filter(n() >= 8) %>%
  nest() %>%
  mutate(
    model = map(data, ~ lm(lap_time_seconds ~ tyre_life, data = .x)),
    tidied = map(model, broom::tidy)
  ) %>%
  unnest(tidied) %>%
  filter(term == "tyre_life") %>%
  transmute(
    driver,
    stint,
    compound,
    degradation_per_lap = estimate,
    p_value = p.value
  ) %>%
  arrange(degradation_per_lap)
stint_degradation
A positive slope generally means pace is dropping as the stint gets older. A smaller slope suggests better tyre preservation or more stable pace. The interpretation is not always simple: fuel burn makes the car lighter and faster as the race goes on, so a raw slope tends to understate tyre wear, and traffic or safety-car periods can distort it further. Even so, the method is very effective for turning race discussion into evidence.
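A quick way to check that the slope means what you think it means is to simulate a stint with a known degradation rate and see whether the regression recovers it. This is a sketch on synthetic data: the 90-second base pace, the 0.08 s/lap loss, and the noise level are all assumptions.

```r
set.seed(1)

# Simulated 20-lap stint: 90 s base pace, 0.08 s lost per lap of tyre age, small noise
tyre_life <- 1:20
lap_time_seconds <- 90 + 0.08 * tyre_life + rnorm(20, sd = 0.05)
stint <- data.frame(tyre_life, lap_time_seconds)

# The tyre_life coefficient is the estimated pace loss per lap
fit <- lm(lap_time_seconds ~ tyre_life, data = stint)
degradation_per_lap <- unname(coef(fit)["tyre_life"])

degradation_per_lap
```

The estimate should land close to 0.08. On real stints the noise is much larger, which is one reason to require a minimum number of laps per stint before fitting.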
laps_clean %>%
  filter(driver %in% selected_drivers, !is.na(stint), !is.na(tyre_life)) %>%
  filter(is.na(pit_in_time), is.na(pit_out_time)) %>%
  ggplot(aes(tyre_life, lap_time_seconds, color = driver)) +
  geom_point(alpha = 0.5, size = 1.6) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 1) +
  facet_wrap(~ compound, scales = "free_x") +
  labs(
    title = "Tyre degradation by compound",
    subtitle = "Linear approximation of pace loss as the stint ages",
    x = "Tyre life (laps)",
    y = "Lap time (seconds)"
  ) +
  theme_minimal(base_size = 13)
This is exactly the kind of analysis that makes a technical article memorable, because it moves from “who won?” to “why did the performance pattern look the way it did?”
Pit stops and strategy
Pit strategy is one of the clearest examples of how Formula 1 combines data and decision-making. A stop is not just an event; it is a trade-off between track position, tyre life, race pace, and the behaviour of nearby competitors.
pit_summary <- session_laps %>%
  clean_names() %>%
  group_by(driver) %>%
  summarise(
    total_laps = n(),
    # Count in-laps only: flagging in-laps and out-laps together would count each stop twice
    pit_stops = sum(!is.na(pit_in_time)),
    stints = n_distinct(stint, na.rm = TRUE),
    first_compound = first(na.omit(compound)),
    last_compound = last(na.omit(compound)),
    .groups = "drop"
  ) %>%
  arrange(desc(pit_stops))
pit_summary
A better way to explain strategy is to reconstruct the stints directly.
strategy_table <- session_laps %>%
  clean_names() %>%
  arrange(driver, lap_number) %>%
  group_by(driver, stint) %>%
  summarise(
    start_lap = min(lap_number, na.rm = TRUE),
    end_lap = max(lap_number, na.rm = TRUE),
    laps_in_stint = n(),
    compound = first(na.omit(compound)),
    avg_lap = mean(as.numeric(lap_time), na.rm = TRUE),
    median_lap = median(as.numeric(lap_time), na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(driver, stint)
strategy_table
strategy_table %>%
  ggplot(aes(x = start_lap, xend = end_lap, y = driver, yend = driver, color = compound)) +
  geom_segment(linewidth = 6, lineend = "round") +
  labs(
    title = "Race strategy by driver",
    subtitle = "Stint map reconstructed from lap-level data",
    x = "Lap window",
    y = "Driver",
    color = "Compound"
  ) +
  theme_minimal(base_size = 13)
Once you have stint maps, your analysis immediately becomes more strategic. You can discuss undercuts, overcuts, long first stints, aggressive early stops, and whether a team actually converted tyre freshness into meaningful gains.
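The undercut in particular comes down to simple arithmetic. The sketch below uses invented lap times and a deliberately simplified model: it assumes both drivers lose the same time in the pit lane, so only the laps run while one car is on fresh tyres and the other is still out decide the outcome.

```r
# Hypothetical numbers: driver A stops two laps before driver B.
# While B stays out, compare A's laps on fresh tyres with B's laps on old tyres.
a_fresh_laps <- c(93.5, 91.2)  # A: out-lap, then first flying lap
b_old_laps   <- c(92.6, 92.8)  # B: the same two laps on worn tyres

# Positive values mean A gained time on B over that window
undercut_gain <- sum(b_old_laps) - sum(a_fresh_laps)

# Gap to B before the stops (A was 0.5 s behind)
gap_before <- 0.5

# The undercut works if the gain exceeds the gap, given equal pit-lane losses
undercut_succeeds <- undercut_gain > gap_before

undercut_gain
undercut_succeeds
```

With real data, the same comparison can be built from the lap table by lining up the two drivers' lap times over the laps between their stops.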
Measuring post-stop pace
A useful extension is to examine whether a driver actually benefitted from fresh tyres after a stop. That is one of the simplest ways to move from descriptive pit analysis into strategic interpretation.
post_stop_pace <- session_laps %>%
  clean_names() %>%
  arrange(driver, lap_number) %>%
  group_by(driver) %>%
  mutate(
    pit_out_lap = !is.na(pit_out_time),
    # Stint counter: increases by one on the lap after each out-lap
    stint_index = cumsum(lag(pit_out_lap, default = FALSE))
  ) %>%
  ungroup() %>%
  # Drop in-laps and out-laps so the averages reflect flying laps only
  filter(!is.na(lap_time), is.na(pit_in_time), !pit_out_lap) %>%
  group_by(driver, stint_index) %>%
  summarise(
    # Average of the first three flying laps of the stint
    first_laps_avg = mean(as.numeric(lap_time)[1:min(3, n())], na.rm = TRUE),
    stint_avg = mean(as.numeric(lap_time), na.rm = TRUE),
    .groups = "drop"
  )
post_stop_pace
This kind of table helps answer a much better question than “when did they pit?” It asks: “Did the stop create usable pace, and was that pace strong enough to influence the race?”
Teammate comparison as the best benchmark
In Formula 1, teammate comparison is often more informative than full-grid comparison because two drivers in the same car are the closest thing the sport has to a controlled comparison. If one driver consistently beats the other in grid position, race finish, or pace consistency, that tells you something much more precise than the overall championship table.
teammate_table <- results_2024 %>%
  clean_names() %>%
  group_by(constructor, round, race_name) %>%
  mutate(
    teammate_finish_rank = min_rank(position),
    teammate_grid_rank = min_rank(grid)
  ) %>%
  ungroup() %>%
  group_by(driver, constructor) %>%
  summarise(
    avg_finish = mean(position, na.rm = TRUE),
    avg_grid = mean(grid, na.rm = TRUE),
    teammate_beating_rate_finish = mean(teammate_finish_rank == 1, na.rm = TRUE),
    teammate_beating_rate_grid = mean(teammate_grid_rank == 1, na.rm = TRUE),
    points = sum(points, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(teammate_beating_rate_finish), desc(points))
teammate_table
That kind of comparison is especially strong in a technical post because it gives readers a benchmark they already understand intuitively, while still grounding the discussion in data.
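A common variant is a straight head-to-head count that only uses races where both cars reached the finish. The sketch below shows the idea on an invented two-driver team; the data and the choice to skip races with a retirement are assumptions, and other conventions are equally defensible.

```r
# Hypothetical two-driver team over four races (NA = did not finish)
team <- data.frame(
  round    = rep(1:4, each = 2),
  driver   = rep(c("A", "B"), times = 4),
  position = c(3, 5, 7, 6, 2, NA, 4, 9),
  stringsAsFactors = FALSE
)

# One row per round, one column per driver
wide <- reshape(team, idvar = "round", timevar = "driver", direction = "wide")

# Count only rounds where both cars were classified
both_finished <- !is.na(wide$position.A) & !is.na(wide$position.B)
a_ahead <- wide$position.A[both_finished] < wide$position.B[both_finished]

head_to_head <- c(A = sum(a_ahead), B = sum(!a_ahead))
head_to_head
```

Excluding the round with a retirement avoids crediting a driver for a teammate's mechanical failure, at the cost of a smaller sample.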
Sector analysis
If lap times tell you the overall pace story, sectors can help reveal where that pace is being gained or lost. Even without diving into full telemetry, sector splits can expose whether a driver is strong in traction zones, high-speed sections, or braking-heavy parts of the circuit.
sector_summary <- laps_clean %>%
  filter(driver %in% selected_drivers) %>%
  group_by(driver) %>%
  summarise(
    s1 = mean(sector1_seconds, na.rm = TRUE),
    s2 = mean(sector2_seconds, na.rm = TRUE),
    s3 = mean(sector3_seconds, na.rm = TRUE),
    total = mean(lap_time_seconds, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  pivot_longer(cols = c(s1, s2, s3), names_to = "sector", values_to = "seconds")
sector_summary %>%
  ggplot(aes(sector, seconds, fill = driver)) +
  geom_col(position = "dodge") +
  labs(
    title = "Average sector times by driver",
    subtitle = "A simple way to localize pace differences",
    x = "Sector",
    y = "Average time (seconds)",
    fill = "Driver"
  ) +
  theme_minimal(base_size = 13)
This type of breakdown is useful because it adds shape to the analysis. Instead of saying a driver was faster overall, you can show where the time was coming from.
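Because sector times differ by only a few tenths, absolute bars can hide the gaps. One option is to express each driver's average as a delta to the quickest driver in that sector, as in this sketch on invented sector averages.

```r
# Hypothetical average sector times in seconds
sectors <- data.frame(
  driver = c("A", "B", "C"),
  s1 = c(28.10, 28.25, 28.05),
  s2 = c(35.40, 35.20, 35.55),
  s3 = c(24.90, 24.95, 24.80),
  stringsAsFactors = FALSE
)

# Gap to the quickest driver in each sector (0 marks the sector benchmark)
sector_cols <- c("s1", "s2", "s3")
deltas <- sectors
deltas[sector_cols] <- lapply(sectors[sector_cols], function(x) x - min(x))

deltas
```

In this toy example driver B sets the benchmark in sector 2 but gives time away in sectors 1 and 3, which is the kind of pattern a single lap time cannot show.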
From description to prediction
One of the strongest editorial angles for an article like this is to end with a predictive modeling section. A title such as Formula 1 Data Science in R: Predicting Race Results works well because it combines clear intent, technical interest, and a topic with built-in audience appeal.
The key is to be realistic. The purpose is not to promise perfect forecasts. It is to show how descriptive Formula 1 data can be converted into features for a baseline model.
model_data <- results_2024 %>%
  clean_names() %>%
  arrange(driver, round) %>%
  group_by(driver) %>%
  mutate(
    # Form features are lagged one race so they only use results known before the start;
    # without the lag, the current finish and points would leak into the predictors.
    rolling_avg_finish_3 = lag(slide_dbl(position, mean, .before = 2, .complete = FALSE, na.rm = TRUE)),
    rolling_avg_grid_3 = lag(slide_dbl(grid, mean, .before = 2, .complete = FALSE, na.rm = TRUE)),
    rolling_points_3 = lag(slide_dbl(points, mean, .before = 2, .complete = FALSE, na.rm = TRUE)),
    prev_finish = lag(position),
    prev_grid = lag(grid)
  ) %>%
  ungroup() %>%
  mutate(
    target_top10 = if_else(position <= 10, 1, 0),
    target_podium = if_else(position <= 3, 1, 0)
  ) %>%
  select(
    round, race_name, driver, constructor, grid, points, position,
    rolling_avg_finish_3, rolling_avg_grid_3, rolling_points_3,
    prev_finish, prev_grid, target_top10, target_podium
  ) %>%
  drop_na()
glimpse(model_data)
This dataset is intentionally simple, but that is a strength in a tutorial. It makes the logic visible and gives readers something they can actually reproduce and extend.
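One thing to watch when building rolling form features is leakage: a feature for a given race should only use races that came before it. The base R sketch below makes that rule explicit. The helper `lagged_rolling_mean()` is a hypothetical name written for this example, not a slider function.

```r
# Rolling mean of the previous `n` values, excluding the current one,
# so that a feature for race i only uses races 1..(i - 1).
lagged_rolling_mean <- function(x, n = 3) {
  out <- rep(NA_real_, length(x))
  for (i in seq_along(x)) {
    if (i > 1) {
      window <- x[max(1, i - n):(i - 1)]
      out[i] <- mean(window, na.rm = TRUE)
    }
  }
  out
}

# Hypothetical finishing positions for one driver across five races
finishes <- c(4, 6, 2, 8, 5)
form <- lagged_rolling_mean(finishes)
form
```

The first race has no history, so its feature is `NA`. Rows like that are what the `drop_na()` step removes before modeling.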
Predicting a top-10 finish
set.seed(42)
# logistic_reg() needs a factor outcome; "1" (top-10 finish) is the second level
model_data <- model_data %>%
  mutate(target_top10 = factor(target_top10, levels = c(0, 1)))
split_obj <- initial_split(model_data, prop = 0.8, strata = target_top10)
train_data <- training(split_obj)
test_data <- testing(split_obj)
log_recipe <- recipe(
  target_top10 ~ grid + rolling_avg_finish_3 + rolling_avg_grid_3 +
    rolling_points_3 + prev_finish + prev_grid,
  data = train_data
) %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors())
log_spec <- logistic_reg() %>%
  set_engine("glm")
log_workflow <- workflow() %>%
  add_recipe(log_recipe) %>%
  add_model(log_spec)
log_fit <- fit(log_workflow, data = train_data)
top10_predictions <- predict(log_fit, new_data = test_data, type = "prob") %>%
  bind_cols(predict(log_fit, new_data = test_data)) %>%
  bind_cols(test_data %>% select(target_top10))
top10_predictions
top10_predictions %>%
  # event_level = "second" scores .pred_1 against the "1" class
  roc_auc(truth = target_top10, .pred_1, event_level = "second")
top10_predictions %>%
  accuracy(truth = target_top10, estimate = .pred_class)
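It helps to know what the AUC number actually measures: the share of (top-10, non-top-10) pairs in which the model gave the top-10 finisher the higher probability. The sketch below computes that by hand on invented predictions using the rank-sum identity. It is meant as an illustration of the metric, not as a description of how yardstick implements it.

```r
# Hypothetical predicted probabilities and true outcomes (1 = top-10 finish)
truth <- c(1, 1, 0, 1, 0, 0)
prob  <- c(0.9, 0.7, 0.6, 0.4, 0.3, 0.1)

# AUC as the share of (positive, negative) pairs ranked correctly,
# computed with the rank-sum (Mann-Whitney) identity
r <- rank(prob)
n_pos <- sum(truth == 1)
n_neg <- sum(truth == 0)
auc <- (sum(r[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

auc
```

Here eight of the nine possible pairs are ordered correctly. The one miss is the top-10 finisher rated at 0.4, who sits below a non-finisher rated at 0.6.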
Predicting finishing position
finish_recipe <- recipe(
  position ~ grid + rolling_avg_finish_3 + rolling_avg_grid_3 +
    rolling_points_3 + prev_finish + prev_grid,
  data = train_data
) %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors())
lm_spec <- linear_reg() %>%
  set_engine("lm")
lm_workflow <- workflow() %>%
  add_recipe(finish_recipe) %>%
  add_model(lm_spec)
lm_fit <- fit(lm_workflow, data = train_data)
finish_predictions <- predict(lm_fit, new_data = test_data) %>%
  bind_cols(test_data %>% select(position, driver, constructor, race_name, grid))
metrics(finish_predictions, truth = position, estimate = .pred)
finish_predictions %>%
  ggplot(aes(position, .pred)) +
  geom_point(alpha = 0.7, size = 2) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(
    title = "Predicted vs actual finishing position",
    subtitle = "Baseline linear model",
    x = "Actual finish",
    y = "Predicted finish"
  ) +
  theme_minimal(base_size = 13)
A baseline model like this is not meant to be a perfect forecasting system. Its real value is educational. It shows how to move from results tables to feature engineering, then from features into a reproducible predictive workflow.
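A model's error only means something next to a naive reference. In Formula 1 the obvious one is "everyone finishes where they started". The sketch below compares mean absolute error for that rule and for a model on invented test rows; all numbers are assumptions for illustration.

```r
# Hypothetical test-set rows: grid slot, actual finish, and a model's prediction
eval_data <- data.frame(
  grid     = c(1, 4, 7, 12, 18),
  position = c(2, 3, 9, 10, 15),
  .pred    = c(2.5, 4.0, 8.0, 11.5, 16.0)
)

mae <- function(truth, estimate) mean(abs(truth - estimate))

# Naive baseline: every driver finishes where they started
baseline_mae <- mae(eval_data$position, eval_data$grid)
model_mae    <- mae(eval_data$position, eval_data$.pred)

baseline_mae
model_mae
```

If the model cannot beat the grid-order rule on held-out races, the engineered features are not yet adding information beyond qualifying.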
A simple custom driver rating
If you want the article to feel more original, one strong option is to create a custom driver score. Composite metrics work well in Formula 1 writing because they combine multiple dimensions of performance into one interpretable ranking.
driver_rating <- results_2024 %>%
  clean_names() %>%
  group_by(driver, constructor) %>%
  summarise(
    avg_finish = mean(position, na.rm = TRUE),
    avg_grid = mean(grid, na.rm = TRUE),
    points = sum(points, na.rm = TRUE),
    wins = sum(position == 1, na.rm = TRUE),
    podiums = sum(position <= 3, na.rm = TRUE),
    gain = mean(grid - position, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(
    finish_score = rescale(-avg_finish, to = c(0, 100)),
    grid_score = rescale(-avg_grid, to = c(0, 100)),
    points_score = rescale(points, to = c(0, 100)),
    gain_score = rescale(gain, to = c(0, 100)),
    win_score = rescale(wins, to = c(0, 100)),
    rating = 0.30 * finish_score +
      0.20 * grid_score +
      0.25 * points_score +
      0.15 * gain_score +
      0.10 * win_score
  ) %>%
  arrange(desc(rating))
driver_rating
The important thing here is transparency. Readers do not need to agree with every weight in the formula. What matters is that the method is explicit, interpretable, and easy to critique or improve.
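One way to make that critique concrete is to re-rank the drivers under a different set of weights and see what changes. The sketch below does this in base R on invented season summaries with three components instead of five. The helper names `rescale_100()` and `rank_drivers()` are hypothetical.

```r
# Min-max rescale to 0-100 (base R stand-in for scales::rescale)
rescale_100 <- function(x) (x - min(x)) / (max(x) - min(x)) * 100

# Hypothetical season summaries
drivers <- data.frame(
  driver     = c("A", "B", "C"),
  avg_finish = c(3.0, 4.5, 8.0),
  points     = c(300, 310, 120),
  gain       = c(0.2, 1.5, 2.5),
  stringsAsFactors = FALSE
)

scores <- data.frame(
  finish = rescale_100(-drivers$avg_finish),
  points = rescale_100(drivers$points),
  gain   = rescale_100(drivers$gain)
)

# Rank drivers under a given weight vector (weights should sum to 1)
rank_drivers <- function(w) {
  rating <- as.matrix(scores) %*% w
  drivers$driver[order(-rating)]
}

results_heavy   <- rank_drivers(c(0.5, 0.4, 0.1))  # emphasis on finishes and points
racecraft_heavy <- rank_drivers(c(0.2, 0.2, 0.6))  # emphasis on places gained

results_heavy
racecraft_heavy
```

The two weightings produce different orders from the same data, which is a useful reminder to publish the weights alongside any composite rating.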
Final thoughts
Formula 1 analysis in R is an unusually strong content niche because it combines technical rigor with a naturally engaged audience. With f1dataR, you can begin with historical race results, move into lap-time and stint analysis, explore pit strategy and driver benchmarking, and then build baseline predictive models that make the workflow feel complete.
That range is exactly what makes this such a good topic for an authority-building article. It is practical, it is reproducible, and it opens the door to an entire cluster of follow-up posts on telemetry, qualifying, tyre degradation, teammate comparisons, and race prediction.
If your goal is to publish technical content that demonstrates real expertise rather than just covering surface-level examples, Formula 1 data science in R is one of the best domains you can choose.
The post Formula 1 Analysis in R with f1dataR: Lap Times, Pit Stops, and Driver Performance appeared first on R Programming Books.