# Transforming subsets of data in R with by, ddply and data.table

[This article was first published on **mages' blog**, and kindly contributed to R-bloggers].

Transforming data sets with R is usually the starting point of my data analysis work. Here is a scenario which comes up from time to time: transform subsets of a data frame, based on context given in one or a combination of columns.

As an example I use a data frame, `df`, which shows sales figures by product for a number of years.
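The original data set is not reproduced here, so the following is only a minimal sketch of what such a data frame might look like. The three products, the ten years and the sales values are made-up assumptions, chosen so that the later examples have something to run on.

```r
# Hypothetical example data (not the author's original figures):
# three products, ten years, one sales value per product and year.
df <- data.frame(
  Product = gl(3, 10, labels = c("A", "B", "C")),  # A x 10, B x 10, C x 10
  Year    = factor(rep(2002:2011, 3)),             # 2002..2011 for each product
  Sales   = 1:30                                   # made-up sales figures
)
head(df)
```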

I am interested in absolute and relative sales developments by product over time. Hence, I would like to add a column to my data frame that shows the sales figures divided by the total sum of sales in each year, so I can create a chart which looks like this:

There are lots of ways of doing this transformation in R. Here are three approaches using:

- base R with `by`,
- `ddply` of the `plyr` package,
- `data.table` of the package with the same name.

### by

The idea here is to use `by` to split the data for each year and to apply the `transform` function to each subset, calculating the share of sales for each product with a small helper function, `fn`.

Having defined the function `fn`, I can apply it in a `by` statement. As its output is a list, I wrap the call in `do.call` to row-bind (`rbind`) the list elements into the result, `R1`.
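The steps described above can be sketched as follows. This is not necessarily the author's exact code: the definition of `fn` (each value divided by the group total) and the sample `df` are assumptions, and `transform` is wrapped in an anonymous function so that `fn` is found regardless of where the code is run.

```r
# Hypothetical example data, as assumed earlier in this post
df <- data.frame(
  Product = gl(3, 10, labels = c("A", "B", "C")),
  Year    = factor(rep(2002:2011, 3)),
  Sales   = 1:30
)

# Assumed helper: share of each element in the total of its vector
fn <- function(x) x / sum(x)

# Split df by Year, add a Share column to each subset,
# then row-bind the per-year pieces back into one data frame
R1 <- do.call(
  rbind,
  by(df, df["Year"], function(d) transform(d, Share = fn(Sales)))
)
head(R1)
```

Within each year the `Share` column sums to one, which is a quick sanity check on the result.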

### ddply

Hadley’s plyr package provides an elegant wrapper for this job with the `ddply` function. Again I use the `transform` function with my self-defined `fn` function, after loading the package with `library(plyr)`, and store the result in `R2`.
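A minimal sketch of the `ddply` variant, assuming the `plyr` package is installed and reusing the made-up `df` and the assumed `fn` from the `by` example:

```r
library(plyr)

# Hypothetical example data and helper, as assumed earlier in this post
df <- data.frame(
  Product = gl(3, 10, labels = c("A", "B", "C")),
  Year    = factor(rep(2002:2011, 3)),
  Sales   = 1:30
)
fn <- function(x) x / sum(x)

# ddply splits df by Year, applies the function to each piece
# and combines the pieces into a single data frame
R2 <- ddply(df, "Year", function(d) transform(d, Share = fn(Sales)))
head(R2)
```

`ddply` also accepts `transform` directly with the extra arguments passed along, which is the more compact form often seen in plyr examples; the anonymous function is used here only to keep the lookup of `fn` explicit.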

### data.table

With data.table I have to do a little bit more legwork; in particular, I have to think about the indices I need to use. Yet it is still straightforward. After loading the package with `library(data.table)`, the first step is to convert `df` into a data.table, `dt`.
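The remaining steps are not shown above, so here is one way they could look with a current version of `data.table`. This is a sketch under assumptions, not the author's original code: it sets a key on `Year` and computes the share in a grouped `j` expression, using the made-up `df` from earlier.

```r
library(data.table)

# Hypothetical example data, as assumed earlier in this post
df <- data.frame(
  Product = gl(3, 10, labels = c("A", "B", "C")),
  Year    = factor(rep(2002:2011, 3)),
  Sales   = 1:30
)

# Convert df into a data.table and index (key) it by Year
dt <- data.table(df)
setkey(dt, Year)

# For each Year, keep Product and Sales and add the share of the yearly total
R3 <- dt[, list(Product, Sales, Share = Sales / sum(Sales)), by = Year]
head(R3)
```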

Although `data.table` may look cumbersome compared to `ddply` and `by`, I will show below that it is actually a lot faster than the two other approaches.

### Plotting the results

With any of the three outputs I can create the chart from above with `latticeExtra`:

```r
library(latticeExtra)
# Plot Sales and Share against Year, grouped by Product, in The Economist style
asTheEconomist(
  xyplot(Sales + Share ~ Year, groups=Product,
         data=R3, type="b",
         scales=list(relation="free", x=list(rot=45)),
         auto.key=list(space="top", columns=3),
         main="Product information")
)
```

## Comparing performance of by, ddply and data.table

Let me move on to a more realistic example with 100 companies, each with 20 products and a 10-year history. After calling `set.seed(1)`, I again store the data in `df`.
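The code that generates this larger data set is not shown, so the following is only a sketch of how a data frame of that shape could be built. The company and product labels and the normally distributed sales values are assumptions.

```r
set.seed(1)

# One row per combination of company, product and year:
# 100 companies x 20 products x 10 years = 20,000 rows
df <- expand.grid(
  Company = paste("Company", 1:100),
  Product = LETTERS[1:20],
  Year    = 2002:2011,
  stringsAsFactors = TRUE
)

# Made-up sales figures (assumption: mean 100, standard deviation 10)
df$Sales <- rnorm(nrow(df), mean = 100, sd = 10)
str(df)
```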

I use the same three approaches to calculate the share of sales by product for each year and company, but this time I measure the execution time on my old iBook G4, running R 2.15.0, and store the timings in `r1`, `r2` and `r3`.
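A comparison along these lines can be sketched with `system.time`. The data generation, the helper `fn` and the grouped `data.table` expression are assumptions rather than the author's code, and the timings will of course differ from machine to machine.

```r
library(plyr)
library(data.table)

# Hypothetical data: 100 companies x 20 products x 10 years
set.seed(1)
df <- expand.grid(Company = paste("Company", 1:100),
                  Product = LETTERS[1:20],
                  Year    = 2002:2011,
                  stringsAsFactors = TRUE)
df$Sales <- rnorm(nrow(df), mean = 100, sd = 10)

# Assumed helper: share of each element in the total of its vector
fn <- function(x) x / sum(x)

# by: split by company and year, transform, row-bind
r1 <- system.time(
  R1 <- do.call(rbind, by(df, df[c("Company", "Year")],
                          function(d) transform(d, Share = fn(Sales))))
)

# ddply: same split-apply-combine via plyr
r2 <- system.time(
  R2 <- ddply(df, c("Company", "Year"),
              function(d) transform(d, Share = fn(Sales)))
)

# data.table: convert, key, and compute the share by group
r3 <- system.time({
  dt <- data.table(df)
  setkey(dt, Company, Year)
  R3 <- dt[, list(Product, Sales, Share = Sales / sum(Sales)),
           by = list(Company, Year)]
})

# user, system and elapsed time in seconds for each approach
rbind(by = r1, ddply = r2, data.table = r3)[, 1:3]
```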

And here are the results (times in seconds):

```
r1 # by
##    user  system elapsed
##  13.690   4.178  42.118
r2 # ddply
##    user  system elapsed
##  18.215   6.873  53.061
r3 # data.table
##    user  system elapsed
##   0.171   0.036   0.442
```

It is quite astonishing to see the speed of `data.table` in comparison to `by` and `ddply`: it finished in under half a second of elapsed time, roughly 100 times faster than the other two. But maybe it shouldn’t be a surprise that the elegance of `ddply` comes with a price as well.

Finally my session info:

```
> sessionInfo() # iBook G4 800 MHZ, 640 MB RAM
R version 2.15.0 Patched (2012-06-03 r59505)
Platform: powerpc-apple-darwin8.11.0 (32-bit)

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] latticeExtra_0.6-19 lattice_0.20-6      RColorBrewer_1.0-5
[4] data.table_1.8.0    plyr_1.7.1

loaded via a namespace (and not attached):
[1] grid_2.15.0
```
