Efficient data management and SQL data selection in R

September 18, 2018

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

    Categories

    1. Data Management

    Tags

    1. Data Manipulation
    2. R Programming
    3. Tips & Tricks

    Before running any data analysis, every data scientist needs to perform data management, data cleaning and data selection.

    This can be done in many different ways. For the data scientist this process is typically quite time consuming, and it does not create value to the same degree as the analysis itself. It is therefore in the interest of both the data scientist and the customer to make data management and data selection efficient, in order to free up more value-creating time for the analysis part of the project. I always look for R packages that bring the most value to the data scientist as well as the customer. The code below has been trimmed down to the most efficient form that still ensures good-quality data management, data cleaning and data selection. For the data selection I use SQL queries with the package sqldf (SQL for data frames), because this code is efficient, thorough and elegant in its structure.

    # Packages for data management, data cleaning and data selection
    library(skimr)
    library(Hmisc)
    library(plyr)
    library(dplyr)
    # SQL package
    library(sqldf)
    

    After loading the packages in R it is time to load the dataset. Because the working directory is different for each data scientist, I have left the path as a placeholder. Remember that you can set your working directory with setwd().

    # Load data
    data <- read.csv(file = "/Mydata", header = TRUE)
    dataname <- data
    
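    As a self-contained illustration of the load step, the sketch below writes a small made-up dataset (the columns id and score are invented for demonstration) to a temporary file and reads it back the same way:

    ```r
    # Write a small example dataset to a temporary file, then load it as above
    tmp <- tempfile(fileext = ".csv")
    write.csv(data.frame(id = 1:3, score = c(10, 20, 30)), tmp, row.names = FALSE)
    data <- read.csv(file = tmp, header = TRUE)
    dataname <- data
    str(dataname)  # a data frame with 3 observations of 2 variables
    ```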

    The next step is efficient data management and cleaning of the dataset. I remove rows with missing values using complete.cases() and remove duplicate observations using unique(). After that I subset the data by column type. The code ends with descriptive statistics and histogram graphs from the elegant and efficient package skimr.

    # Data management & data cleaning
    skim(dataname)
    cleandata <- dataname[complete.cases(dataname), ] # remove rows with missing values
    cleandata <- unique(cleandata) # remove duplicate observations
    write.csv(cleandata, file = "cleanfile.csv", row.names = FALSE, na = "")
    cleanfiledata <- read.csv(file = "cleanfile.csv")
    #Subsetting the data to types
    logicmeint <- cleanfiledata[, sapply(cleanfiledata, is.integer)]
    logicmedouble <- cleanfiledata[, sapply(cleanfiledata, is.double)]
    logicmefactor <- cleanfiledata[, sapply(cleanfiledata, is.factor)]
    logicmenum <- cleanfiledata[, sapply(cleanfiledata, is.numeric)]
    mainlogicmefactors <- cleanfiledata[, sapply(cleanfiledata, is.factor) | sapply(cleanfiledata, is.numeric)]
    #Descriptive Analytics
    skim(cleanfiledata)
    skim(logicmeint)
    skim(logicmedouble)
    skim(logicmefactor)
    skim(logicmenum)
    skim(mainlogicmefactors)
    
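    To make the cleaning step concrete, here is a minimal, self-contained sketch on a made-up toy data frame with one missing value and one duplicated row:

    ```r
    # Toy data: row 3 duplicates row 2, row 4 has a missing value
    df <- data.frame(x = c(1, 2, 2, NA), y = c("a", "b", "b", "d"))
    clean <- df[complete.cases(df), ]  # drops the row with NA -> 3 rows
    clean <- unique(clean)             # drops the duplicated row -> 2 rows
    nrow(clean)  # 2
    ```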

    The last step is data selection with SQL queries. I have written the R code below in general form: variable names are written as variable1, variable2, variable3..., conditions as condition, and values as num if numerical or something if not.

    # Load clean data
    mydata <- cleanfiledata
    #Filtering data
    sqldf("select variable1 from mydata where variable2 = 'something'")
    sqldf("select * from mydata order by variable1") #order by 1 column
    sqldf("select * from mydata order by variable1, variable2") #order by 2 columns
    sqldf("select * from mydata where variable like 'something%' ") #variable starts with something
    sqldf("select * from mydata where variable like '%something' ") #variable ends with something
    sqldf("select * from mydata where variable like '%something%' ") #variable must contain something
    sqldf("select * from mydata where variable in ('something1','something2','something3')") #using IN
    sqldf("select * from mydata where variable between something1 and something2") #using BETWEEN
    #Multiple filters - and,or,not
    sqldf("select * from mydata where variable1 >= 1980 and variable2 not between 1980 and 2000") #filter variable1 and exclude a range of variable2
    sqldf("select * from mydata where variable1 > num and variable2 not between year1 and year2")
    sqldf("select * from mydata where variable1 > num order by variable2, variable3")
    #Basic aggregations
    sqldf("select sum(n) as 'Total_Count' from mydata")
    sqldf("select min(n), max(n) from mydata")
    sqldf("select variable1, avg(n) as 'Average' from mydata group by variable1 order by Average, variable1") #average by variable1
    sqldf("select variable1,count(*) as count from mydata group by variable1 limit 100") #count by variable1
    sqldf("select variable1,n,count(*) as 'my_count' from mydata where n > num group by variable1 order by my_count desc limit num") #multiple filters
    #Use having
    sqldf("select variable1,variable2,sum(n) as 'my_sum' from mydata group by variable1 having my_sum > num order by my_sum desc limit num")
    #Unique count for a variable
    sqldf("select count(distinct variable1) as 'count_variable1' from mydata ")
    sqldf("select variable1, n, case when variable1 = num then 'condition1' else 'condition2' end as 'condition3' from mydata limit num")
    sqldf("select *, case when variable1 not like '%condition1%' then 'statement1' when variable1 = 'condition2' then 'statement2' else 'statement3' end as 'statement4' from mydata")
    
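    To see the query templates in action on real data, here is a runnable sketch using R's built-in mtcars dataset (mtcars and its columns cyl and mpg are stand-ins for the article's dataset, not part of it):

    ```r
    library(sqldf)

    # Count cars and compute average mpg per cylinder group,
    # mirroring the group-by / order-by templates above
    res <- sqldf("select cyl, count(*) as n, avg(mpg) as avg_mpg
                  from mtcars group by cyl order by avg_mpg desc")
    res  # three rows, with the 4-cylinder group having the highest average mpg
    ```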

    Happy coding!

