# Data Aggregation in R: plyr, sqldf and data.table

**Psychwire » R**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I’ve also previously put up a couple of posts about aggregating data in R. In this post, I’m going to be trying some other alternative methods for aggregating the dataset. Before I begin, I’d like to thank Matthew Dowle for highlighting these to me. It’s a bit daunting at first, deciding which method of aggregating data is best. So I decided to give them all a go to see what they were like. Let’s go for it!

For this post, I’m going to be using the **lexdec **dataset that comes with the **languageR **package. For information see here. I’ve called it full_list here, in order to play around with it. The details of the dataset are not that important; it’s just a case of getting hold of some data from human subjects (i.e., what I’m used to!).

The Target

Before we get into the functions themselves, let’s take a look at the aggregated data that we want. It has the mean, median and standard error of the **RT** variable (RT stands for Response Time, or time taken to press a button). I want to get some summary statistics of this variable for every level of each participant (**Subject **column) and for every level of the **Class **column. * *So, ultimately, the target is the following summary table (note that I’ve truncated this as there are lots of participants):

A tool by any other name: **plyr**

Let’s begin with **plyr**. The power of plyr comes from the fact that it splits up data, runs a function on the split-up data, and then sticks it all back together. It has a wide variety of useful aggregation functions, but here I’m going to use **ddply**. This function gives as it’s output a **dataframe **and gives as output another **dataframe. **The plyr functions are written in the syntax of *XYply *where X is the input object type and Y is the output object type. In this case, both ds of ddply stand for dataframe. Let’s look at some initial code:

ddply(full_list, c("Subject","Class"), function(df)mean(df$RT))

This is fine, and gives us mean DPS values for each class and spec. But there’s a problem. The “mean” column is labelled V1, which isn’t that helpful, especially if we have multiple columns computed (i.e., ending up with V1, V2, V3 makes it hard to remember which column is which). So let’s get the column labelled:

ddply(full_list, c("Subject","Class"), function(df) return(c(AVERAGE=mean(df$RT))))

Great! Now let’s add some more columns to output:

ddply(full_list, c("Subject","Class"), function(df) return(c(AVERAGE=mean(df$RT), MEDIAN=median(df$RT),SE=sqrt(var(df$RT)/length(df$RT)))))

This then gives us the target aggregated table pictured above.

It needs no sequel: *sqldf*

Next up is **sqldf**. The name gives is away slightly: it’s a library for running SQL statements on data frames. SQL stands for Structured Query Language, with data stored on tables in a database. There are a number of SQL database types, which are all reasonably similar, and sqldf uses as default the incredibly popular **SQLite**. To get the target aggregated data using this, it’s a case of running a simple query:

sqldf("SELECT SUBJECT, CLASS, AVG(RT) AS AVERAGE, MEDIAN(RT) AS MEDIAN, SQRT((VARIANCE(RT)/COUNT(RT))) AS SE FROM full_list GROUP BY SUBJECT, CLASS")

Note that to get the number of rows involved, we need to use **COUNT **rather than **LENGTH**. Easy!

How the tables have turned: *data.table*

The last library to look at here is data.table. This has the benefit of being considered the *roadrunner *of aggregation functions. It’s damn fast! This can be achieved as follows:

dps_dt = data.table(full_list) dps_dt[,list(AVERAGE=.Internal(mean(RT)), MEDIAN=median(RT), SE= sqrt(var(RT)/length(RT))),by=list(Subject,Class)]

Note that the first line takes our data.frame called *full_list *and casts it as a data.table object type. Here, two lists are used to do two things: (1) create the column names and (2) group the data by class and spec. The first list call sets up the column names and the calculations that need to be run. The second list gets fed to the **by **function which then aggregates by class and spec.

Summary

So, there we have three additional ways to aggregate data using R, to be added to tapply() and aggregate() which I have covered previously. Whichever one you end up using will probably depend on your own experience with using them (or, for example, whether you are familiar with SQL in the cae of sqldf), what needs you have, and how fast you need your aggregation processing to be.

**leave a comment**for the author, please follow the link and comment on their blog:

**Psychwire » R**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.