Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Consider the common following problem: compute for a data set (say the infamous `iris` example data set) per-group ranks. Suppose we want the rank of `iris` `Sepal.Length`s on a per-`Species` basis. Frankly this is an “ugh” problem for many analysts: it involves all at the same time grouping, ordering, and window functions. It also is not likely ever the analyst’s end goal but a sub-step needed to transform data on the way to the prediction, modeling, analysis, or presentation they actually wish to get back to.

Iris, by DiliffOwn work, CC BY-SA 3.0, Link

In our previous article in this series we discussed the general ideas of “row-ID independent data manipulation” and “Split-Apply-Combine”. Here, continuing with our example, we will specialize to a data analysis pattern I call: “Grouped-Ordered-Apply”.

The example

Let’s start (as always) with our data. We are going to look at the `iris` data set in `R`. You can view the data in by typing the following in your `R` console:

``````data(iris)
View(iris)
``````

The package `dplyr` makes the grouped calculation quite easy. We define our “window function” (function we want applied to sub-groups of data in a given order) and then use `dplyr` to apply the function to grouped and arranged data:

``````library('dplyr')

# define our windowed operation, in this case ranking
rank_in_group <- . %>% mutate(constcol=1) %>%
mutate(rank=cumsum(constcol)) %>% select(-constcol)

# calculate
res <-
iris %>% group_by(Species) %>% arrange(desc(Sepal.Length)) %>%
rank_in_group

# display first few results
res %>% filter(rank<=2) %>% arrange(Species,rank)

# Source: local data frame [6 x 6]
# Groups: Species [3]
#
#   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species  rank
#
# 1          5.8         4.0          1.2         0.2     setosa     1
# 2          5.7         4.4          1.5         0.4     setosa     2
# 3          7.0         3.2          4.7         1.4 versicolor     1
# 4          6.9         3.1          4.9         1.5 versicolor     2
# 5          7.9         3.8          6.4         2.0  virginica     1
# 6          7.7         3.8          6.7         2.2  virginica     2
``````

Some possible confusion

The above works well, because all the operators we used were “grouping aware.” I think all `dplyr` operations are “grouping aware”, but some of the “in the street” tactics of “working with or around `dplyr`” may not be. For example `slice` is part of `dplyr` and grouping aware:

``````iris %>% group_by(Species) %>% arrange(desc(Sepal.Length)) %>% slice(1)

# Source: local data frame [3 x 5]
# Groups: Species [3]
#
#   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#
# 1          5.8         4.0          1.2         0.2     setosa
# 2          7.0         3.2          4.7         1.4 versicolor
# 3          7.9         3.8          6.4         2.0  virginica
``````

But `head` is not part of `dplyr` and not grouping aware:

``````iris %>% group_by(Species) %>% arrange(desc(Sepal.Length)) %>% head(n=1)

# Source: local data frame [1 x 5]
# Groups: Species [1]
#
#   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
#
# 1          7.9         3.8          6.4           2 virginica
``````

Thus `head` “is not part of `dplyr`” even when it is likely a `dplyr` adapter supplying the actual implementation.

This can be very confusing to new analysts. We are seeing changes in semantics of downstream operators based on a data annotation (the “Groups:”). To the analyst grouping and ordering probably have equal stature. In `dplyr` grouping comes first, has a visible annotation, is durable, and changes the semantics of downstream operators. In `dplyr` ordering has no annotation, is not durable (it is quietly lost by many operators such as `dplyr::compute` and `dplyr::collapse`, though this is possibly changing), and can’t be stored (as it isn’t a concept in many back-ends such as relational databases).

It is hard for new analysts to trust `dplyr` the data `iris %>% group_by(Species) %>% arrange(desc(Sepal.Length))` is both grouped and ordered. As we see below order is in the presentation (and not annotated) and grouping is annotated (but not in the presentation):

``````iris %>% group_by(Species) %>% arrange(desc(Sepal.Length)) %>% print(n=50)

# Source: local data frame [150 x 5]
# Groups: Species [3]
#
#    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#
# 1           7.9         3.8          6.4         2.0  virginica
# ...
# 12          7.1         3.0          5.9         2.1  virginica
# 13          7.0         3.2          4.7         1.4 versicolor
# 14          6.9         3.1          4.9         1.5 versicolor
# 15          6.9         3.2          5.7         2.3  virginica
# ...
``````

Notice the apparent mixing of the groups in presentation. That is part of why there is a visible `Groups:` annotation, the grouping can not be inferred from the data presentation.

My opinion

Frankly it can be hard to document and verify which `dplyr` pipelines are maintaining the semantics you intended. We have every reason to believe the following is both grouped and ordered:

``````  iris %>% group_by(Species) %>% arrange(desc(Sepal.Length))
``````

It is ordered as `dplyr::arrange` is the last step and we can verify the grouping is present with `dplyr::groups()`.

We have less reason to trust the following is also grouped and ordered (especially in remote databases or `Spark`):

``````  iris %>% arrange(desc(Sepal.Length)) %>% group_by(Species)
``````

The above may be simultaneously grouped and ordered (i.e. have not lost the order), but for reasons of “trust, but verify” it would be nice to have a user-visible annotation certifying that. Remember, explicitly verifying the order requires the use of a window-function (such as `lag`) so verifying order by hand isn’t always a convenient option.

We need to put some of the `dplyr` machinery in a housing to keep our fingers from getting into the gears.

Essentially this is saying wrap Hadley Wickham’s “The Split-Apply-Combine Strategy for Data Analysis” (link) concept into a single atomic operation with semantics:

``````    data %>% split(column1) %>% lapply(arrange(column2) %>% f()) %>% bind_rows()
``````

which we call “Grouped-Ordered-Apply.”

By wrapping the pipeline into a single “Grouped-Ordered-Apply” operation we are deliberately making intermediate results not visible. This is exactly what is needed to get rid of depending on distinctions of how partitioning is enforced (be it by a grouping annotation, or with an actual split) and worrying about the order of the internal operations.

A Suggested Solution

Our new package `replyr` supplies the “Grouped-Ordered-Apply” operation as `replyr::gapply` (itself built on top of `dplyr`). It performs the above grouped/ordered calculation as follows:

``````library('dplyr')

data(iris)

# define our operation, in this case ranking
rank_in_group <- . %>% mutate(constcol=1) %>%
mutate(rank=cumsum(constcol)) %>% select(-constcol)

# apply our operation to data that is simultaneously grouped and ordered
res  <-
iris %>% gapply(gcolumn='Species',
f=rank_in_group,
ocolumn='Sepal.Length',decreasing=TRUE)

# present results
res %>% filter(rank<=2) %>% arrange(Species,rank)

#   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species rank
# 1          5.8         4.0          1.2         0.2     setosa    1
# 2          5.7         4.4          1.5         0.4     setosa    2
# 3          7.0         3.2          4.7         1.4 versicolor    1
# 4          6.9         3.1          4.9         1.5 versicolor    2
# 5          7.9         3.8          6.4         2.0  virginica    1
# 6          7.7         3.8          6.7         2.2  virginica    2
``````

`replyr::gapply` can use either a split based strategy, or a `dplyr::group_by_` based strategy for calculation. Notice `replyr::gapply`‘s preference for “parametric treatment of variables.” `replyr::gapply` anticipates that the analyst may not know the names of columns or variables when they are writing their code, but may in fact need to take names as values stored in other variables. Essentially we are making `dplyr::*_` forms preferred. The `rank_in_group` is using `dplyr` preferred non-standard evaluation, which assumes the analyst knows the names of the columns they are manipulating; that is appropriate for transient user code.

Now that we have the rank annotations present we can try to confirm they are in fact correct (i.e. that the implementation maintained grouping and ranking throughout). The calculation is detailed (checking ranks are unique per-group, integers in the range 1 to group-size, and order compatible with the value column `Sepal.Length`), so we have wrapped the calculation in `replyr`:

``````replyr_check_ranks(res,
GroupColumnName='Species',
ValueColumnName='Sepal.Length',
RankColumnName='rank',
decreasing=TRUE)

# 1            TRUE     setosa    50       1         0           50          0
# 2            TRUE versicolor    50       1         0           50          0
# 3            TRUE  virginica    50       1         0           50          0
```
```

For simplicity we wrote the primary checking function in terms of operations that happen to be only correct when there is only one group present (i.e. the function needs formal splitting and isolation, not just `dplyr::group_by`). This isn't a problem as we can then use `replyr::gapply(partitionMetod='split')` to correctly apply such code to all groups in turn.
Notice the Split-Apply-Combine steps are all wrapped together and supplied as part of the service; the user only supplies `(column1,column2,f())`. The transient lifetime and limited visibility of the sub-stages of the wrapped calculation are the appropriate abstractions given the fragility of row-order in modern data stores. The user doesn't care if the data is actually split and ordered, as long as it is presented to their function as if it were so structured. We are using the Split-Apply-Combine pattern, but abstracting out if it is actually implemented by formal splitting (ameliorating the differences between `base::split`, `tidyr::nest` and `SQL GROUP BY ... ORDER BY ...`). There are benefits in isolating the user-visible semantics from the details of realization.

Much can be written in terms of this pattern including grouped ranking problems, `dplyr::summarize`, and more. And this is precisely the semantics of `gapply` (grouped ordered apply) found in `replyr`.

An advantage of using the general notation as above is that `dplyr` has implementations that work on large remote data services such as databases and `Spark`.

For example here is the "rank within group" calculation performed in `PostgreSQL` (assuming you have such a database up, and using your own user/password). For these additional examples we are going to continue to suppose our goal is to compute the rank of `Sepal.Length` for irises grouped by `Species`.

``````# install.packages('replyr') # Run this if you don't already have replyr
library('dplyr')
data(iris)

# define our windowed operation, in this case ranking
rank_in_group <- . %>% mutate(constcol=1) %>%
mutate(rank=cumsum(constcol)) %>% select(-constcol)

# get a databse handle and copy the data into the database
my_db <- dplyr::src_postgres(host = 'localhost', port = 5432,
user = 'postgres', password = 'pg')

# run the ranking in the database
res <-
irisD %>% gapply(gcolumn='Species',
f=rank_in_group,
ocolumn='Sepal.Length',decreasing=TRUE)

# present results
res %>% filter(rank<=2) %>% arrange(Species,rank)

# Source:   query [?? x 6]
# Database: postgres 9.6.1 [[email protected]:5432/postgres]
#
#   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species  rank
#
# 1          5.8         4.0          1.2         0.2     setosa     1
# 2          5.7         3.8          1.7         0.3     setosa     2
# 3          7.0         3.2          4.7         1.4 versicolor     1
# 4          6.9         3.1          4.9         1.5 versicolor     2
# 5          7.9         3.8          6.4         2.0  virginica     1
# 6          7.7         3.0          6.1         2.3  virginica     2
``````

`dplyr::group_by` can also perform the grouped ordered calculation in `PostgreSQL` using the code below:

``````irisD %>% group_by(Species) %>% arrange(desc(Sepal.Length)) %>% rank_in_group %>%
filter(rank<=2) %>% arrange(Species,rank)
``````

We can even perform the same calculation using Spark and sparklyr.

``````# get a Spark handle and copy the data in
my_s <- sparklyr::spark_connect(master = "local")

# re-run the ranking in Spark
irisS %>% gapply(gcolumn='Species',
f=rank_in_group,
ocolumn='Sepal_Length',decreasing=TRUE) %>%
filter(rank<=2) %>% arrange(Species,rank)

# Source:   query [?? x 6]
# Database: spark connection master=local[4] app=sparklyr local=TRUE
#
#   Sepal_Length Sepal_Width Petal_Length Petal_Width    Species  rank
#
# 1          5.8         4.0          1.2         0.2     setosa     1
# 2          5.7         4.4          1.5         0.4     setosa     2
# 3          7.0         3.2          4.7         1.4 versicolor     1
# 4          6.9         3.1          4.9         1.5 versicolor     2
# 5          7.9         3.8          6.4         2.0  virginica     1
# 6          7.7         3.8          6.7         2.2  virginica     2
``````

Notice the `sparklyr` adapter changed column names by replacing "`.`" with "`_`", so we had to change our ordering column specification "`ocolumn='Sepal_Length'`" to match. This is the only accommodation we had to make to switch to a `Spark` service. Outside of `R` (and `Lisp`) dots in identifiers are considered a bad idea and should be avoided. For instances most `SQL` databases reserved dots to indicate relations between schemas, tables, and columns (so it is only through sophisticated quoting mechanisms that the `PostgreSQL` example was able to use dots in column names).

`dplyr` directly completes the same calculation with:

``````irisS %>% group_by(Species) %>% arrange(desc(Sepal_Length)) %>% rank_in_group %>%
filter(rank<=2) %>% arrange(Species,rank)
``````

And that is the Grouped-Ordered-Apply pattern and our `dplyr` based reference implementation. (Exercise for the reader: implement `SQL`'s useful "`UPDATE WHERE`" operation in terms of `replyr::gapply`.)