Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The title says it all, doesn’t it?!

Take the following example – it happens to show race positions by driver for each lap of a particular F1 grand prix, but it could be the evolution over time of any rank-based population.

The question I had in mind was – how can I identify positions that are being contested during a particular window of time, where by contested I mean that the particular position was held by more than one person in a particular window of time?

Let’s zoom in to look at a couple of particular steps.

We see distinct groups of individuals who swap positions with each other between those two consecutive steps, so how can we automatically detect the positions that these drivers are fighting over?

A solution given to a Stack Overflow question on how to get disjoint sets from a list in R gives what I thought was a really nice solution: treat it as a graph, and then grab the connected components.

Here’s my working of it. Start by getting a list of results that show a particular driver held different positions in the window selected – each row in the original dataframe identifies the position held by a particular driver at the end of a particular lap:

library(DBI)
ergastdb =dbConnect(RSQLite::SQLite(), './ergastdb13.sqlite')

#Get a race identifier for a specific race
raceId=dbGetQuery(ergastdb,
'SELECT raceId FROM races WHERE year="2012" AND round="1"')

q=paste('SELECT * FROM lapTimes WHERE raceId=',raceId[[1]])

lapTimes=dbGetQuery(ergastdb,q)
lapTimes$position=as.integer(lapTimes$position)

library(plyr)

#Sort by lap first just in case
lapTimes=arrange(lapTimes,driverId,lap)

#Create a couple of new columns
#pre is previous lap position held by a driver given their current lap
#ch is position change between the current and previous lap
tlx=ddply(lapTimes,.(driverId),transform,pre=(c(0,position[-length(position)])),ch=diff(c(0,position)))

#Find rows where there is a change between a given lap and its previous lap
#In particular, focus on lap 17
llx=tlx[tlx['ch']!=0 & tlx['lap']==17,c("position","pre")]

llx

This filters the complete set of data to just those rows where there is a difference between a driver’s current position and previous position (the first column in the result just shows row numbers and can be ignored).

##      position pre
## 17          2   1
## 191        17  18
## 390         9  10
## 448         1   2
## 506         6   4
## 719        10   9
## 834         4   5
## 892        18  19
## 950         5   6
## 1008       19  17

We can now create a graph in which nodes represent positions (position or pre values) and edges connect a current and previous position.

#install.packages("igraph")
#http://stackoverflow.com/a/25130575/454773
library(igraph)

posGraph = graph.data.frame(llx)

}

plot(posGraph)


The resulting graph is split into several components:

We can then identify the connected components:

posBattles=split(V(posGraph)$name, clusters(posGraph)$membership)
#Find the position change battles
for (i in 1:length(posBattles)) print(posBattles[[i]])

This gives the following clusters, and their corresponding members:

## [1] "2" "1"
## [1] "17" "18" "19"
## [1] "9"  "10"
## [1] "6" "4" "5"

To generalise this approach, I think we need to do a couple of things:

• allow a wider window within which to identify battles (so look over groups of three or more consecutive laps);
• simplify the way we detect position changes for a particular driver; for example, if we take the set of positions held by a driver within the desired window, if the cardinality of the set (that is, its size) is greater than one, then we have had at least one position change for that driver within that window. Each set of size > 1 of unique positions held by different drivers can be used to generate a set of distinct, unordered pairs that connect the positions (I think it only matters that they are connected, not that a driver specifically went from position x to position y going from one lap to the next?). If we generate the graph from the set of distinct unordered pairs taken across all drivers, we should then be able to identify the contested/driver change position clusters.

Hmm… I need to try that out… And when I do, if and when it works(?!), I’ll add a complete demonstration of it – and how we might make use of it – to the Wrangling F1 Data With R book.