Pole Position Prediction: A tidymodels Example

[This article was first published on Sport Data Science, and kindly contributed to R-bloggers].

Hello readers, in today’s blog I will be looking at predicting the Formula 1 grid using the tidymodels collection of R packages. The idea is to use data from the practice sessions on a Friday to give an idea of what the grid is expected to be for the race on Sunday, before qualifying on Saturday. You may have seen me post this on my Twitter:

This is the output from my model to forecast the formula 1 grid. In this blog I am going to explain how I went about it.

First things first, I need some data to train the model on. The way a Formula 1 weekend works is that there is free practice on a Friday, qualifying on a Saturday and the race on a Sunday. The aim is to use the data generated on the Friday to forecast the grid. So what data points am I going to use:

  • Practice 1 lap time
  • Practice 1 difference
  • Practice 1 laps
  • Practice 2 lap time
  • Practice 2 difference
  • Practice 2 laps

There are other variables that could be used, such as a way of classifying the type of circuit to try and tease out a car’s strengths and weaknesses, but that is an area for future development. I collected data from Practice 1 and Practice 2 from 2015 to now.

This produced the above data frame with over 1700 records in total. As I will be creating a classification model, I also need to add each driver’s qualifying position as the target. I wanted a classification model because I want the output to be a percentage chance of achieving each position.

Now I have the data, I can use the tidymodels collection of packages to create the model.

First I use rsample to split the data into training and testing sets. The first part of rsample is to create the rules for splitting your data. The arguments are the data you are using, the proportion you want in training versus testing, and then there is the strata argument, which I haven’t used. The strata argument allows you to balance your target variable across the training and testing sets. As I have lots of possible outcomes (20, the total size of the F1 grid) and a relatively small data set, it was not possible to use the strata argument.
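The split step can be sketched like this. The f1_data frame below, and all its column names, are synthetic stand-ins for illustration, not the post’s actual data set:

```r
# Minimal sketch of the rsample split. f1_data is a synthetic stand-in
# with the six practice features and qualifying position as a factor.
library(rsample)

set.seed(42)
n <- 200
f1_data <- data.frame(
  fp1_time = rnorm(n, 90, 2),        # Practice 1 lap time (seconds)
  fp1_diff = abs(rnorm(n, 1, 0.5)),  # gap to fastest Practice 1 lap
  fp1_laps = rpois(n, 25),           # Practice 1 lap count
  fp2_time = rnorm(n, 89, 2),
  fp2_diff = abs(rnorm(n, 1, 0.5)),
  fp2_laps = rpois(n, 30),
  quali_pos = factor(sample(1:20, n, replace = TRUE))  # target: grid slot
)

# 75% of rows for training, 25% for testing; no strata argument,
# since 20 classes is too many for such a small data set.
f1_split <- initial_split(f1_data, prop = 0.75)
f1_train <- training(f1_split)
f1_test  <- testing(f1_split)
```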

The next step is that you can either create a recipe to pre-process your data using the recipes package, or go straight to parsnip to create your model. I went straight to parsnip as my data was relatively simple and there was no need to pre-process it. As you can see in the code above, I am using a random forest model initially, and you can see how simple it is to create.
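A parsnip random forest of this kind takes only a few lines. The ranger engine and the column names here are assumptions for illustration; the stand-in data mirrors the sketch above:

```r
library(parsnip)  # the ranger package must also be installed

# Synthetic stand-in training data (illustrative column names).
set.seed(42)
n <- 200
f1_train <- data.frame(
  fp1_time = rnorm(n, 90, 2), fp1_diff = abs(rnorm(n, 1, 0.5)),
  fp1_laps = rpois(n, 25),
  fp2_time = rnorm(n, 89, 2), fp2_diff = abs(rnorm(n, 1, 0.5)),
  fp2_laps = rpois(n, 30),
  quali_pos = factor(sample(1:20, n, replace = TRUE))
)

# Model specification: a classification random forest via ranger.
rf_spec <- rand_forest(trees = 1000) %>%
  set_engine("ranger") %>%
  set_mode("classification")

# Fit on the training set; the six practice features predict quali_pos.
rf_fit <- fit(rf_spec, quali_pos ~ ., data = f1_train)
```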

To review the model and test its quality on the testing set, the yardstick package has numerous functions. Above I have plotted a ROC curve for each classification option, i.e. every position on the grid. There is a difference between positions: the model seems to be a lot better at predicting the first 3 grid positions compared to the others. I think this is because the pattern in F1 recently is that the top 2 or 3 teams have been well ahead of the rest, which makes the classification job a bit easier for the model.
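The per-position ROC curves can be produced roughly like this. Everything here is a stand-in sketch, and to keep it short the target is reduced to four grid positions instead of twenty:

```r
library(parsnip)
library(yardstick)
library(ggplot2)

# Stand-in data and model, as in the earlier sketches.
set.seed(42)
n <- 300
dat <- data.frame(
  fp1_time = rnorm(n, 90, 2), fp2_time = rnorm(n, 89, 2),
  quali_pos = factor(sample(1:4, n, replace = TRUE))
)
train <- dat[1:225, ]
test  <- dat[226:300, ]

rf_fit <- rand_forest(trees = 200) %>%
  set_engine("ranger") %>%
  set_mode("classification") %>%
  fit(quali_pos ~ ., data = train)

# Class probabilities on the test set: one .pred_* column per position.
probs <- predict(rf_fit, test, type = "prob")
probs$quali_pos <- test$quali_pos

# One-vs-all ROC curve per qualifying position, then plot them all.
curves <- roc_curve(probs, truth = quali_pos, .pred_1:.pred_4)
autoplot(curves)
```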

The next step, now I have a baseline model, is to tune it using the tune package. There are 3 tunable hyperparameters for a ranger random forest: trees, mtry and min_n. Trees is the number of trees in the forest and needs to be high enough to reduce the error rate; I am going to set it to 1000 to start off with. Mtry controls the split-variable randomisation and is limited by the number of features in your data set. Min_n is short for minimum node size and controls when the trees stop splitting. The tune package allows you to conduct a grid search across those parameters to find the right ones.
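The tuning setup can be sketched as follows, again on synthetic stand-in data with a reduced number of classes and folds so the example runs quickly:

```r
library(parsnip)
library(tune)
library(workflows)
library(rsample)

# Stand-in training data, as in the earlier sketches.
set.seed(42)
n <- 200
f1_train <- data.frame(
  fp1_time = rnorm(n, 90, 2), fp1_diff = abs(rnorm(n, 1, 0.5)),
  fp2_time = rnorm(n, 89, 2), fp2_diff = abs(rnorm(n, 1, 0.5)),
  quali_pos = factor(sample(1:5, n, replace = TRUE))
)

# Mark mtry and min_n for tuning; keep trees fixed at 1000.
rf_tune_spec <- rand_forest(trees = 1000, mtry = tune(), min_n = tune()) %>%
  set_engine("ranger") %>%
  set_mode("classification")

rf_wf <- workflow() %>%
  add_model(rf_tune_spec) %>%
  add_formula(quali_pos ~ .)

# Cross-validation folds from the training data.
folds <- vfold_cv(f1_train, v = 3)

# First pass: let tune pick a space-filling grid of 10 candidates;
# the mtry range is finalised automatically from the predictors.
tune_res <- tune_grid(rf_wf, resamples = folds, grid = 10)
show_best(tune_res, metric = "roc_auc")
```

Plotting the results with `autoplot(tune_res)` gives the kind of parameter-vs-AUC scatter discussed below.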

The results of the tuning run, ranked on the ROC AUC metric, are shown above. Clearly trees doesn’t make any difference; its points are spread almost randomly. Mtry shows some variation, with the best values looking to be around 3 to 5. Min_n is on a slope: the tune function automatically selects a range to train over, but it looks like the AUC is still increasing, and maybe the best value is a lot higher than 40. Therefore I am going to use a grid search to tune the model further.

I conducted a grid search across a range of values for min_n and mtry. You can see the best value for min_n is between 200 and 300, and for mtry it is 4. Doing a grid search across both parameters means you control for the influence of each on the other, and therefore get the best value for both. I then trained another model using those values.
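A regular grid over both parameters, followed by refitting with the winning combination, can be sketched like this. The setup is the same synthetic stand-in as before, and the min_n range here is scaled down to suit the small fake data set (the post’s real data favoured values around 200 to 300):

```r
library(parsnip)
library(tune)
library(dials)
library(workflows)
library(rsample)

# Same stand-in setup as the tuning sketch.
set.seed(42)
n <- 200
f1_train <- data.frame(
  fp1_time = rnorm(n, 90, 2), fp1_diff = abs(rnorm(n, 1, 0.5)),
  fp2_time = rnorm(n, 89, 2), fp2_diff = abs(rnorm(n, 1, 0.5)),
  quali_pos = factor(sample(1:5, n, replace = TRUE))
)

rf_wf <- workflow() %>%
  add_model(
    rand_forest(trees = 1000, mtry = tune(), min_n = tune()) %>%
      set_engine("ranger") %>%
      set_mode("classification")
  ) %>%
  add_formula(quali_pos ~ .)

folds <- vfold_cv(f1_train, v = 3)

# Regular grid focused on the promising region: mtry near 4,
# min_n well above the default range.
rf_grid <- grid_regular(
  mtry(range = c(2, 4)),
  min_n(range = c(20, 100)),
  levels = 3
)

grid_res <- tune_grid(rf_wf, resamples = folds, grid = rf_grid)

# Pick the best combination on ROC AUC and refit with those values.
best_params <- select_best(grid_res, metric = "roc_auc")
final_fit <- fit(finalize_workflow(rf_wf, best_params), data = f1_train)
```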

Comparing the original model to the now tuned model, it is slightly better for most positions. This is the model I will use going forward. To improve it, I think I need to add weather data. For example, in the recent Hungarian Grand Prix second practice was affected by rain, and that makes this model difficult to run. During an F1 race weekend the teams have 3 different compounds of tyre (soft, medium and hard) to use. Adding the tyre compound each lap was set on would improve the model, because a lap might have been set on a slower compound compared to others.

Full Code:


Also check out the tidymodels website here.

