cumplyr: Extending the plyr Package to Handle Cross-Dependencies

May 3, 2012
By

(This article was first published on John Myles White » Statistics, and kindly contributed to R-bloggers)

Introduction

For me, Hadley Wickham‘s reshape and plyr packages are invaluable because they encapsulate omnipresent design patterns in statistical computing: reshape handles switching between the different possible representations of the same underlying data, while plyr automates what Hadley calls the Split-Apply-Combine strategy, in which you split up your data into several subsets, perform some computation on each of these subsets and then combine the results into a new data set. Many of the computations implicit in traditional statistical theory are easily described in this fashion: for example, comparing the means of two groups is computationally equivalent to splitting a data set of individual observations up into subsets based on the group assignments, applying mean to those subsets and then pooling the results back together again.

The Split-Apply-Combine Strategy is Broader than plyr

The only weakness of plyr, which automates so many of the computations that instantiate the Split-Apply-Combine strategy, is that plyr implements one very specific version of the Split-Apply-Combine strategy: plyr always splits your data into disjoint subsets. By disjoint, I mean that any row of the original data set can occur in only one of the subsets created by the splitting function. For computations that involve cross-dependencies between observations, this makes plyr inapplicable: cumulative quantities like running means and broadly local quantities like kernelized means cannot be computed using plyr. To highlight that concern, let’s consider three very simple data analysis problems.

Computing Forward-Running Means

Suppose that you have the following data set:

TimeValue
11
23
35

To compute a forward-running mean, you need to split this data into three subsets:

TimeValue
11
TimeValue
11
23
TimeValue
11
23
35

In each of these clearly non-disjoint subsets, you would then compute the mean of Value and combine the results to give:

TimeValue
11
22
33

This sort of computation occurs often enough in a simpler form that R provides tools like cumsum and cumprod to deal with cumulative quantities. But the splitting problem in our example is not addressed by those tools, nor by plyr, because the cumulative quantities have to computed on subsets that are not disjoint.

Computing Backward-Running Means

Consider performing the same sort of calculation as described above, but moving in the opposite direction. In that case, the three non-disjoint subsets are:

TimeValue
35
TimeValue
23
35
TimeValue
11
23
35

And the final result is:

TimeValue
13
24
35

Computing Local Means (AKA Kernelized Means)

Imagine that, instead of looking forward or backward, we only want to know something about data that is close to the current observation being examined. For example, we might want to know the mean value of each row when pooled with its immediately proceeding and succeeding neighbors. This computation must create the following subsets of data:

TimeValue
11
23
TimeValue
11
23
35
TimeValue
23
35

Within these non-disjoint subsets, means are computed and the result is:

TimeValue
12
23
34

A Strategy for Handling Non-Disjoint Subsets

How can we build a general purpose tool to handle these sorts of computations? One way is to rethink how plyr works and then extend it with some trivial variations on its core principles. We can envision plyr as a system that uses a splitting operation that partitions our data into subsets in which each subset satisfies a group of equality constraints: you split the data into groups in which Variable 1 = Value 1 AND Variable 2 = Value 2, etc. Because you consider the conjunction of several equality constraints, the resulting subsets are disjoint.

Seen in this fashion, there is a simple relaxation of the equality constraints that allows us to solve the three problems described a moment ago: instead of looking at the conjunction of equality constraints, we use a conjunction of inequality constraints. For the time being, I’ll describe just three instantiations of this broader strategy.

Using Upper Bounds

Here, we divide data into groups in which Variable 1 <= Value 1 AND Variable 2 <= Value 2, etc. We will also allow equality constraints, so that the operations of plyr are a strict subset of the computations in this new model. For example, we might use the constraint Variable = Value 1 AND Variable 2 <= Value 2. If the upper bound is the Time variable, these contraints will allow us to compute the forward-moving mean we described earlier.

Using Lower Bounds

Instead of using upper bounds, we can use lower bounds to divide data into groups in which Variable >= Value 1 AND Variable 2 >= Value 2, etc. This allows us to implement the backward-moving mean described earlier.

Using Norm Balls

Finally, we can consider a combination of upper and lower bounds. For simplicity, we'll assume that these bounds have a fixed tightness around the "center" of each subset of our split data. To articulate this tightness formally, we look at a specific hypothetical equality constraint like Variable 1 = Value 1 and then loosen it so that norm(Variable 1 - Value 1) <= r. When r = 0, this system gives the original equality constraint. But when r > 0, we produce a "ball" of data around the constraint whose tightness is r. This lets us estimate the local means from our third example.

Implementation

To demo these ideas in a usable fashion, I've created a draft package for R called cumplyr. Here is an extended example of its usage in solving simple variants of the problems described in this post:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
library('cumplyr')
 
data <- data.frame(Time = 1:5, Value = seq(1, 9, by = 2))
 
iddply(data,
       equality.variables = c('Time'),
       lower.bound.variables = c(),
       upper.bound.variables = c(),
       norm.ball.variables = list(),
       func = function (df) {with(df, mean(Value))})
 
iddply(data,
       equality.variables = c(),
       lower.bound.variables = c('Time'),
       upper.bound.variables = c(),
       norm.ball.variables = list(),
       func = function (df) {with(df, mean(Value))})
 
iddply(data,
       equality.variables = c(),
       lower.bound.variables = c(),
       upper.bound.variables = c('Time'),
       norm.ball.variables = list(),
       func = function (df) {with(df, mean(Value))})
 
iddply(data,
       equality.variables = c(),
       lower.bound.variables = c(),
       upper.bound.variables = c(),
       norm.ball.variables = list('Time' = 1),
       func = function (df) {with(df, mean(Value))})
 
iddply(data,
       equality.variables = c(),
       lower.bound.variables = c(),
       upper.bound.variables = c(),
       norm.ball.variables = list('Time' = 2),
       func = function (df) {with(df, mean(Value))})
 
iddply(data,
       equality.variables = c(),
       lower.bound.variables = c(),
       upper.bound.variables = c(),
       norm.ball.variables = list('Time' = 5),
       func = function (df) {with(df, mean(Value))})

You can download this package from GitHub and play with it to see whether it helps you. Please submit feedback using GitHub if you have any comments, complaints or patches.

Comparing plyr with cumplyr

In the long run, I'm hoping to make the functions in cumplyr robust enough to submit a patch to plyr. I see these tools as one logical extension of plyr to encompass more of the framework described in Hadley's paper on the Split-Apply-Combine strategy.

For the time being, I would advise any users of cumplyr to make sure that you do not use cumplyr for anything that plyr could already do. cumplyr is very much demo software and I am certain that both its API and implementation will change. In contrast, plyr is fast and stable software that can be trusted to perform its job.

But, if you have a problem that cumplyr will solve and plyr will not, I hope you'll try cumplyr out and submit patches when it breaks.

Happy hacking!

To leave a comment for the author, please follow the link and comment on his blog: John Myles White » Statistics.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Tags: ,

Comments are closed.