
Why don’t X-Y plots of latitude and longitude data look “right” compared to traditional map views?

For example, here’s an X-Y scatterplot of some of Jenson Button’s McLaren telemetry data from the 2010 Australian Formula One Grand Prix. The image was generated from a data file hosted on Google Spreadsheets, using the following R script and the ggplot2 library:

require(ggplot2)
require(RCurl)

#Data was originally grabbed from the McLaren F1 Live Dashboard during the race and is Copyright (©) McLaren Marketing Ltd 2010 (I think? Or possibly Vodafone McLaren Mercedes F1 2010(?)). I believe that speed, throttle and brake data were sponsored by Vodafone.
key='0AmbQbL4Lrd61dER5Qnl3bHo4MkVNRlZ1OVdicnZnTHc'
q='select *'

#Run the query on the database
df=gsqAPI(key,q)

#Sanity check - preview the imported data
head(df)

#Example circuit map - sort of - showing the gLat (lateral 'g-force') values around the circuit (point size is absolute value of gLat, colour has two values, one for + and one for - values (swing to left and swing to right)).
g=ggplot(df) + geom_point(aes(x=NGPSLongitude,y=NGPSLatitude,col=sign(gLat),size=abs(gLat)))
print(g)
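The script above calls gsqAPI(), which isn't defined in this listing. Here is a minimal sketch of what such a helper might look like, assuming the spreadsheet is queried through the Google Visualization API query endpoint and returned as CSV. The URL pattern, the gid argument and the gsq_url() helper name are all assumptions for illustration, and the endpoint may no longer respond for this particular spreadsheet:

```r
# Hypothetical helper: build the query URL for a Google Spreadsheet.
# The endpoint and parameter names are assumptions, not taken from the post.
gsq_url <- function(key, query, gid = 0) {
  paste0("https://spreadsheets.google.com/tq?",
         "tqx=out:csv",                                # ask for CSV output
         "&tq=", URLencode(query, reserved = TRUE),    # escape the query text
         "&key=", key,
         "&gid=", gid)                                 # sheet id within the spreadsheet
}

# Hypothetical reconstruction of gsqAPI(): run the query and
# read the CSV response into a data frame (needs network access).
gsqAPI <- function(key, query, gid = 0) {
  read.csv(gsq_url(key, query, gid))
}
```

Splitting out the URL builder makes it easy to print the URL and check it in a browser before trying to load the data.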

What’s lacking is a map projection: the plot treats longitude and latitude as plain Cartesian x and y values, where a map would use something like a Mercator projection. Fortunately, the Grammar of Graphics model that underpins ggplot allows us to write the necessary co-ordinate system transformation into our chart generating command:

#coord_map() needs the mapproj package to be installed; the argument's full name is projection
ggplot(df) + geom_point(aes(x=NGPSLongitude,y=NGPSLatitude,col=sign(gLat),size=abs(gLat))) + coord_map(projection="mercator")

Here’s the result: (Note: I haven’t totally got my head round what the different co-ordinate transforms do, or how they relate to any sort of ‘reality’! But they’re another thing I’m now aware of…;-)
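For what it's worth, one way to think about why the raw X-Y plot looks squashed: away from the equator, a degree of longitude covers less ground than a degree of latitude, by a factor of roughly cos(latitude). Over an area as small as a race circuit, a fixed aspect ratio gets you most of the way there. The following is a rough sketch of that idea rather than a replacement for a proper projection; latlong_aspect() is a name made up for this example:

```r
# Approximate y/x aspect ratio for a lat/long plot centred on a given latitude.
# One degree of longitude spans roughly cos(latitude) times the ground distance
# of one degree of latitude, so the y axis is stretched by 1/cos(latitude).
latlong_aspect <- function(lat_degrees) {
  1 / cos(mean(lat_degrees, na.rm = TRUE) * pi / 180)
}

# Usage with the telemetry data frame from the post (assumes ggplot2 and df are loaded):
# ggplot(df) + geom_point(aes(x=NGPSLongitude,y=NGPSLatitude)) +
#   coord_fixed(ratio = latlong_aspect(df$NGPSLatitude))
```

More recent versions of ggplot2 also provide coord_quickmap(), which applies the same sort of approximation for you.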

As to what the chart shows? It’s a plot of how the lateral (left-right) ‘g-force’ acts on Button as he tours the circuit. The points are coloured according to whether the force acts from the left or the right (+ or -) and sized according to the magnitude of the force (away from normal?). So it points out left and right hand corners and how tight they are, essentially;-)

To show just how easy it is to write simple graphics and even statistical charts using ggplot, here are a few more examples:

#Example "driver DNA" trace, showing low gear throttle usage (distance round track on x-axis, lap number on y-axis, node size is inversely proportional to gear number (low gear, large point size), colour relative to throttle pedal depression)
#NB the plotting command is missing from this copy of the post; the next line is a reconstruction from the description above (size=-NGear and the red-green colour scale are assumptions)
g=ggplot(df) + geom_point(aes(x=sLap,y=Lap,col=rThrottlePedal,size=-NGear)) + scale_colour_gradient(low='red',high='green')
print(g)

I started calling things like the above chart “Driver DNA” charts: the x-axis represents distance round the track, the y-axis is lap number. In this case, nodes are sized inversely to the gear (so low gear, large point size) and coloured by throttle pedal pressure. You’ll notice how consistent Button is lap on lap. The idea behind the colouring/sizing for this chart was that it would provide a glimpse into behaviour around low gear turns.

#Example of gear value around the track
g=ggplot(df) + geom_line(aes(x=sLap,y=NGear))
print(g)

This chart soooo reminds me of simple op art…:-) I’m not sure how useful it is for showing gear selection according to distance round the track, but I just love the line of it:-)

#We can also show a trace for a single lap, such as speed coloured by gear
g=ggplot(subset(df,Lap==22)) + geom_line(aes(x=sLap,y=vCar,colour=NGear))
print(g)

ggplot can, of course, do line charts. In the above example, I make an initial exploration into how line segment colour can be used to highlight the gear selection that allows the car to reach the speed (y-axis) it does as it goes round the circuit (x is distance round lap).

#We can also do statistical graphics - like a boxplot showing the distribution of speed values by gear
g = ggplot(df) + geom_boxplot(aes(factor(NGear),vCar))
print(g)

ggplot isn’t just about literal graphing from data elements directly to marks on a canvas. It can also do stats as part of the mapping; in this case, we generate a boxplot that summarises the range of speeds achieved for different gear values.
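To get a feel for the sort of summary the boxplot is doing for us, here is a rough sketch that works out the quartiles of speed for each gear by hand. It assumes a data frame with the vCar and NGear columns used above; speed_quartiles_by_gear() is just a name made up for this example, and it only reproduces the box (quartiles and median), not the whiskers or outliers:

```r
# Quartiles of car speed for each gear: one row per gear,
# columns for the lower quartile, median and upper quartile.
speed_quartiles_by_gear <- function(data) {
  t(sapply(split(data$vCar, data$NGear), quantile,
           probs = c(0.25, 0.5, 0.75), na.rm = TRUE))
}

# Usage with the telemetry data frame from the post (assumes df is loaded):
# speed_quartiles_by_gear(df)
```

The three columns correspond to the bottom, middle line and top of each box in the chart.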

#Footwork - brake and throttle pedal depression based on gear
g = ggplot(df) + geom_jitter(aes(factor(NGear),rThrottlePedal),colour='darkgreen') + geom_jitter(aes(factor(NGear),pBrakeF),colour='darkred')
print(g)

Some more dots… in the above case, I try to explore Button’s footwork in a little more detail, seeing how he applies brake and throttle pressure according to gear selection (throttle depression is green, brake is red).

#Forces on the driver
#gLong by brake and gear
g = ggplot(df) + geom_jitter(aes(factor(NGear),gLong,col=pBrakeF)) + scale_colour_gradient(low='red',high='green')
print(g)

In the diagrams immediately above and below, I try to show what sorts of longitudinal forces are typically experienced by Button according to gear selection, the idea being that we may get to see whether gears are used for linear acceleration or deceleration.

#gLong by throttle and gear
g = ggplot(df) + geom_jitter(aes(factor(NGear),gLong,col=rThrottlePedal)) + scale_colour_gradient(low='red',high='green')
print(g)

#gLong boxplot
ggplot(df) + geom_boxplot(aes(factor(NGear),gLong))+ geom_jitter(aes(factor(NGear),gLong),size=1)

Here, I use a boxplot to try to see whether or not the longitudinal g-force is typically experienced under acceleration or braking by gear. Note that the points are scattered according to random jitter about their actual, integer gear values.

Finally, here’s a look at how engine RPM and the car speed relate to gear selection. Would you be able to work out how to write this diagram? Here’s how I did it…

#How do engine revs and speed relate to gear selection?
ggplot(df)+geom_point(aes(x=nEngine,y=vCar,col=factor(NGear)))

Hopefully what this quick tour of ggplot has illustrated is how easy it can be to generate a wide range of charts from the same data set. May all your charts be written, and then generated directly from their source data;-)

PS I did try to generate these images via CloudStat, but for some reason the images didn’t generate properly (it seemed to work fine when I tried just a couple?). Here’s the link anyway: Cloudstat: F1 telemetry demo. The plots were generated in RStudio and saved using ggsave().

As to how this fits in with other things? Regular readers may remember the occasional rant I’ve had about the importance of providing the queries that map from data sets onto summary data tables, so that the means by which the summaries were generated from open raw data are made transparent. In a similar way, data cleansing tools such as Google Refine or Stanford Data Wrangler allow you to log the transformations applied to a messy raw data set in order to get it into a state where you can actually work with it. The same is true of images. It’s far too easy to generate complex graphics from even more complex datasets, and then forget how the image was actually created, or even what it represents. By writing the diagram, essentially generating a query that maps from the data onto the visual representation provided by the generated diagram or chart, we preserve the audit trail from data to the chart output.

In the same way we might imagine the word equation:

DATA + QUERY = REPORT

we might also imagine:

DATA + GRAPHER = CHART

(You get the idea? The words are probably not the right words, but the sentiment is there…)

In an academic research setting, where it’s common to find lists of figures presented separately in books or theses, it would also make sense to include a ‘figure appendix’ which gives, for example, the ggplot commands required to generate each statistical graphic presented from the actual data source.
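As a rough sketch of how such a figure appendix might be kept alongside the charts themselves, the plotting commands can be stored as unevaluated R expressions and then listed as text. The figure names and the figure_appendix() helper below are made up for this example; the two commands are taken from the charts above:

```r
# Store each figure's plotting command as an unevaluated expression
# (quote() means ggplot2 and df don't need to be loaded at this point).
figure_recipes <- list(
  fig_gear_trace    = quote(ggplot(df) + geom_line(aes(x=sLap,y=NGear))),
  fig_speed_by_gear = quote(ggplot(df) + geom_boxplot(aes(factor(NGear),vCar)))
)

# Render the appendix as text: one "name: command" entry per figure
figure_appendix <- function(recipes) {
  vapply(names(recipes), function(n) {
    paste0(n, ": ", paste(deparse(recipes[[n]]), collapse = " "))
  }, character(1))
}

# writeLines(figure_appendix(figure_recipes))
# Each stored command can be re-run against the data with eval(), e.g.
# print(eval(figure_recipes$fig_gear_trace))   # needs ggplot2 and df
```

Because the same expression is used both to draw the chart and to print the appendix entry, the two can't drift apart.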

Sigh…if only my first name was Damien and I could persuade my minions to do the Lichtenstein thing and paint-by-numbers over some large projections of some of the more aesthetically interesting charts;-)        