# F1Stats – Correlations Between Qualifying, Grid and Race Classification

February 9, 2013
By

(This article was first published on OUseful.Info, the blog... » Rstats, and kindly contributed to R-bloggers)

Following directly on from F1Stats – Visually Comparing Qualifying and Grid Positions with Race Classification, and continuing in my attempt to replicate some of the methodology and results used in A Tale of Two Motorsports: A Graphical-Statistical Analysis of How Practice, Qualifying, and Past SuccessRelate to Finish Position in NASCAR and Formula One Racing, here’s a quick look at the correlation scores between the final practice, qualifying and grid positions and the final race classification.

I’ve already done brief review of what correlation means (sort of) in F1Stats – A Prequel to Getting Started With Rank Correlations, so I’m just going to dive straight in with some R code that shows how I set about trying to find the correlations between the different classifications:

Here’s the answer from the back of the book paper that we’re aiming for…

Here’s what I got:

```> corrs.df[order(corrs.df\$V1),]
V1   p3pos.int    qpos.int     grid.int racepos.raw    pval.grid    pval.qpos  pval.p3pos
2      AUSTRALIA  0.30075188  0.01503759  0.087218045           1 7.143421e-01 9.518408e-01 0.197072158
13      MALAYSIA  0.42706767  0.57293233  0.630075188           1 3.584362e-03 9.410805e-03 0.061725312
6          CHINA -0.26015038  0.57443609  0.514285714           1 2.183596e-02 9.193214e-03 0.266812583
3        BAHRAIN  0.13082707  0.73233083  0.739849624           1 2.900250e-04 3.601434e-04 0.581232598
16         SPAIN  0.25112782  0.80451128  0.804511278           1 2.179221e-05 2.179221e-05 0.284231482
14        MONACO  0.51578947  0.48120301  0.476691729           1 3.513870e-02 3.326706e-02 0.021403708
17        TURKEY  0.52330827  0.73082707  0.730827068           1 3.756531e-04 3.756531e-04 0.019344720
9  GREAT BRITAIN  0.65413534  0.83007519  0.830075188           1 8.921842e-07 8.921842e-07 0.002260234
8        GERMANY  0.32030075  0.46917293  0.452631579           1 4.657539e-02 3.844275e-02 0.168419054
10       HUNGARY  0.49649123  0.37017544  0.370175439           1 1.194050e-01 1.194050e-01 0.032293715
7         EUROPE  0.28120301  0.72030075  0.720300752           1 4.997719e-04 4.997719e-04 0.228898214
4        BELGIUM  0.06766917  0.62105263  0.621052632           1 4.222076e-03 4.222076e-03 0.777083014
11         ITALY  0.52932331  0.52481203  0.524812030           1 1.895282e-02 1.895282e-02 0.017815489
15     SINGAPORE  0.50526316  0.58796992  0.715789474           1 5.621214e-04 7.414170e-03 0.024579520
12         JAPAN  0.34912281  0.74561404  0.849122807           1 0.000000e+00 3.739715e-04 0.143204045
5         BRAZIL -0.51578947 -0.02105263 -0.007518797           1 9.771776e-01 9.316030e-01 0.021403708
1      ABU DHABI  0.42556391  0.66466165  0.628571429           1 3.684738e-03 1.824565e-03 0.062722332```

The paper mistakenly reports the grid values as the qualifying positions, so if we look down the grid.int column that I use to contain the correlation values between the grid and final classifications, we see they broadly match the values quoted in the paper. I also calculated the p-values and they seem to be a little bit off, but of the right order.

And here’s the R-code I used to get those results… The first chunk is just the loader, a refinement of the code I have used previously:

```require(RSQLite)
require(reshape)

f1 = dbConnect(drv="SQLite", dbname="f1com_megascraper.sqlite")

getRacesData.full=function(year='2012'){
#Data query
results.combined=dbGetQuery(f1,
paste('SELECT raceResults.year as year, qualiResults.pos as qpos, p3Results.pos as p3pos, raceResults.pos as racepos, raceResults.race as race, raceResults.grid as grid, raceResults.driverNum as driverNum, raceResults.raceNum as raceNum FROM raceResults, qualiResults, p3Results WHERE raceResults.year==',year,' and raceResults.year = qualiResults.year and raceResults.year = p3Results.year and raceResults.race = qualiResults.race and raceResults.race = p3Results.race and raceResults.driverNum = qualiResults.driverNum and raceResults.driverNum = p3Results.driverNum;',sep=''))

#Data tidying
results.combined=ddply(results.combined,.(race),mutate,racepos.raw=1:length(race))
for (i in c('racepos','grid','qpos','p3pos','driverNum'))
results.combined[[paste(i,'.int',sep='')]]=as.integer( as.character(results.combined[[i]]))
results.combined\$race=reorder(results.combined\$race,results.combined\$raceNum)

results.combined
}

f1 = dbConnect(drv="SQLite", dbname="f1com_megascraper.sqlite")

results.combined=getRacesData.full(2009)
corrs.df[order(corrs.df\$V1),]```

Here’s the actual correlation calculation – I use the cor function:

```#The cor() function returns data that looks like:
#            p3pos.int   qpos.int   grid.int racepos.raw
#p3pos.int   1.0000000 0.31578947 0.28270677  0.30075188
#qpos.int    0.3157895 1.00000000 0.97744361  0.01503759
#grid.int    0.2827068 0.97744361 1.00000000  0.08721805
#racepos.raw 0.3007519 0.01503759 0.08721805  1.00000000
#Row/col 4 relates to the correlation with the race classification, so for now just return that

corr.rank.race=function(results.combined,cmethod='spearman'){
##Correlations
corrs=NULL
#Run through the races
for (i in levels(factor(results.combined\$race))){
results.classified = subset( results.combined,
race==i,
select=c('p3pos.int','qpos.int','grid.int','racepos.raw'))
#print(i)
#print( results.classified)
cp=cor(results.classified,method=cmethod,use="complete.obs")
#print(cp[4,])
corrs=rbind(corrs,c(i,cp[4,]))
}
corrs.df=as.data.frame(corrs)

signif=data.frame()
for (i in levels(factor(results.combined\$race))){
results.classified = subset( results.combined,
race==i,
select=c('p3pos.int','qpos.int','grid.int','racepos.raw'))
#p.value
pval.grid=cor.test(results.classified\$racepos.raw,results.classified\$grid.int,method=cmethod,alternative = "two.sided")\$p.value
pval.qpos=cor.test(results.classified\$racepos.raw,results.classified\$qpos.int,method=cmethod,alternative = "two.sided")\$p.value
pval.p3pos=cor.test(results.classified\$racepos.raw,results.classified\$p3pos.int,method=cmethod,alternative = "two.sided")\$p.value

signif=rbind(signif,data.frame(race=i,pval.grid=pval.grid,pval.qpos=pval.qpos,pval.p3pos=pval.p3pos))
}

corrs.df\$qpos.int=as.numeric(as.character(corrs.df\$qpos.int))
corrs.df\$grid.int=as.numeric(as.character(corrs.df\$grid.int))
corrs.df\$p3pos.int=as.numeric(as.character(corrs.df\$p3pos.int))

corrs.df=merge(corrs.df,signif,by.y='race',by.x='V1')
corrs.df\$V1=factor(corrs.df\$V1,levels=levels(results.combined\$race))
corrs.df
}

corrs.df=corr.rank.race(results.combined)```

It’s then trivial to plot the result:

```require(ggplot2)
xRot=function(g,s=5,lab=NULL) g+theme(axis.text.x=element_text(angle=-90,size=s))+xlab(lab)

g=ggplot(corrs.df)+geom_point(aes(x=V1,y=grid.int))
g=xRot(g,6)+xlab(NULL)+ylab('Correlation')+ylim(0,1)
g=g+ggtitle('F1 2009 Correlation: grid and final classification')
g```

Recalling that there are different types of rank correlation function, specifically “Kendall’s τ (that is, Kendall’s Tau; this coefficient is based on concordance, which describes how the sign of the difference in rank between pairs of numbers in one data series is the same as the sign of the difference in rank between a corresponding pair in the other data series”, I wondered whether it would make sense to look at correlations under this measure to see whether there were any obvious looking differences compared to Spearmans’s rho, that might prompt us to look at the actual grid/race classifications to see which score appears to be more meaningful.

The easiest way to spot the difference is probably graphically:

```corrs.df2=corr.rank.race(results.combined,'kendall')
corrs.df2[order(corrs.df2\$V1),]

g=ggplot(corrs.df)+geom_point(aes(x=V1,y=grid.int),col='red',size=4)
g=g+geom_point(data=corrs.df2, aes(x=V1,y=grid.int),col='blue')
g=xRot(g,6)+xlab(NULL)+ylab('Correlation')+ylim(0,1)
g=g+ggtitle('F1 2009 Correlation: grid and final classification')
g```

```corrs.df2[order(corrs.df2\$V1),]
V1   p3pos.int    qpos.int    grid.int racepos.raw    pval.grid    pval.qpos  pval.p3pos
2      AUSTRALIA  0.17894737 -0.01052632  0.04210526           1 8.226829e-01 9.744669e-01 0.288378196
13      MALAYSIA  0.26315789  0.41052632  0.46315789           1 3.782665e-03 1.110136e-02 0.112604127
6          CHINA -0.20000000  0.41052632  0.35789474           1 2.832863e-02 1.110136e-02 0.233266557
3        BAHRAIN  0.07368421  0.51578947  0.52631579           1 8.408301e-04 1.099522e-03 0.677108239
16         SPAIN  0.17894737  0.64210526  0.64210526           1 2.506940e-05 2.506940e-05 0.288378196
14        MONACO  0.38947368  0.35789474  0.35789474           1 2.832863e-02 2.832863e-02 0.016406081
17        TURKEY  0.37894737  0.64210526  0.64210526           1 2.506940e-05 2.506940e-05 0.019784403
9  GREAT BRITAIN  0.46315789  0.63157895  0.63157895           1 3.622261e-05 3.622261e-05 0.003782665
8        GERMANY  0.23157895  0.31578947  0.30526316           1 6.380788e-02 5.475355e-02 0.164976406
10       HUNGARY  0.36842105  0.36842105  0.36842105           1 2.860214e-02 2.860214e-02 0.028602137
7         EUROPE  0.21052632  0.62105263  0.62105263           1 5.176962e-05 5.176962e-05 0.208628398
4        BELGIUM  0.02105263  0.46315789  0.46315789           1 3.782665e-03 3.782665e-03 0.923502331
11         ITALY  0.35789474  0.36842105  0.36842105           1 2.373450e-02 2.373450e-02 0.028328627
15     SINGAPORE  0.35789474  0.45263158  0.55789474           1 3.589956e-04 4.748310e-03 0.028328627
12         JAPAN  0.26315789  0.57894737  0.69590643           1 6.491222e-06 3.109641e-04 0.124796908
5         BRAZIL -0.37894737 -0.05263158 -0.04210526           1 8.226829e-01 7.732195e-01 0.019784403
1      ABU DHABI  0.34736842  0.61052632  0.55789474           1 3.589956e-04 7.321900e-05 0.033643947```

Hmm.. Kendall gives lower values for all races except Hungary – maybe put that on the “must look at Hungary compared to the other races” pile…;-)

One thing that did occur to me was that I have access to race data from other years, so it shouldn’t be too hard to see how the correlations play out over the years at different circuits (do grid/race correlations tend to be higher at some circuits, for example?).

```testYears=function(years=2009:2012){
bd=NULL
for (year in years) {
d=getRacesData.full(year)
corrs.df=corr.rank.race(d)
bd=rbind(bd,cbind(year,corrs.df))
}
bd
}

a=testYears(2006:2012)
ggplot(a)+geom_point(aes(x=year,y=grid.int))+facet_wrap(~V1)+ylim(0,1)

g=ggplot(a)+geom_boxplot(aes(x=V1,y=grid.int))
g=xRot(g)
g
```

So Spain and Turkey look like they tend to the processional? Let’s see if a boxplot bears that out:

How predictable have the years been, year on year?

```g=ggplot(a)+geom_point(aes(x=V1,y=grid.int))+facet_wrap(~year)+ylim(0,1)
g=xRot(g)
g

ggplot(a)+geom_boxplot(aes(x=factor(year),y=grid.int))```

And as a boxplot:

From a betting point of view, (eg Getting Started with F1 Betting Data and The Basics of Betting as a Way of Keeping Score…) it possibly also makes sense to look at the correlation between the P3 times and the qualifying classification to see if there is a testable edge in the data when it comes to betting on quali?

I think I need to tweak my code slightly to make it easy to pull out correlations between specific columns, but that’ll have to wait for another day…

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...