Organize your data manipulation in terms of “grouped ordered apply”

December 15, 2016
By

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Consider the common following problem: compute for a data set (say the infamous iris example data set) per-group ranks. Suppose we want the rank of iris Sepal.Lengths on a per-Species basis. Frankly this is an “ugh” problem for many analysts: it involves all at the same time grouping, ordering, and window functions. It also is not likely ever the analyst’s end goal but a sub-step needed to transform data on the way to the prediction, modeling, analysis, or presentation they actually wish to get back to.


Iris germanica Purple bearded Iris Wakehurst Place UK DiliffIris, by DiliffOwn work, CC BY-SA 3.0, Link

In our previous article in this series we discussed the general ideas of “row-ID independent data manipulation” and “Split-Apply-Combine”. Here, continuing with our example, we will specialize to a data analysis pattern I call: “Grouped-Ordered-Apply”.

The example

Let’s start (as always) with our data. We are going to look at the iris data set in R. You can view the data in by typing the following in your R console:

data(iris)
View(iris)

The package dplyr makes the grouped calculation quite easy. We define our “window function” (function we want applied to sub-groups of data in a given order) and then use dplyr to apply the function to grouped and arranged data:

library('dplyr')

# define our windowed operation, in this case ranking
rank_in_group <- . %>% mutate(constcol=1) %>%
          mutate(rank=cumsum(constcol)) %>% select(-constcol)

# calculate
res <-
  iris %>% group_by(Species) %>% arrange(desc(Sepal.Length)) %>% 
  rank_in_group

# display first few results
res %>% filter(rank<=2) %>% arrange(Species,rank)

 # Source: local data frame [6 x 6]
 # Groups: Species [3]
 # 
 #   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species  rank
 #                                      
 # 1          5.8         4.0          1.2         0.2     setosa     1
 # 2          5.7         4.4          1.5         0.4     setosa     2
 # 3          7.0         3.2          4.7         1.4 versicolor     1
 # 4          6.9         3.1          4.9         1.5 versicolor     2
 # 5          7.9         3.8          6.4         2.0  virginica     1
 # 6          7.7         3.8          6.7         2.2  virginica     2

Some possible confusion

The above works well, because all the operators we used were “grouping aware.” I think all dplyr operations are “grouping aware”, but some of the “in the street” tactics of “working with or around dplyr” may not be. For example slice is part of dplyr and grouping aware:

iris %>% group_by(Species) %>% arrange(desc(Sepal.Length)) %>% slice(1)

 # Source: local data frame [3 x 5]
 # Groups: Species [3]
 # 
 #   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
 #                                     
 # 1          5.8         4.0          1.2         0.2     setosa
 # 2          7.0         3.2          4.7         1.4 versicolor
 # 3          7.9         3.8          6.4         2.0  virginica

But head is not part of dplyr and not grouping aware:

iris %>% group_by(Species) %>% arrange(desc(Sepal.Length)) %>% head(n=1)

 # Source: local data frame [1 x 5]
 # Groups: Species [1]
 # 
 #   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
 #                                    
 # 1          7.9         3.8          6.4           2 virginica

Thus head “is not part of dplyr” even when it is likely a dplyr adapter supplying the actual implementation.

This can be very confusing to new analysts. We are seeing changes in semantics of downstream operators based on a data annotation (the “Groups:”). To the analyst grouping and ordering probably have equal stature. In dplyr grouping comes first, has a visible annotation, is durable, and changes the semantics of downstream operators. In dplyr ordering has no annotation, is not durable (it is quietly lost by many operators such as dplyr::compute and dplyr::collapse, though this is possibly changing), and can’t be stored (as it isn’t a concept in many back-ends such as relational databases).

It is hard for new analysts to trust dplyr the data iris %>% group_by(Species) %>% arrange(desc(Sepal.Length)) is both grouped and ordered. As we see below order is in the presentation (and not annotated) and grouping is annotated (but not in the presentation):

iris %>% group_by(Species) %>% arrange(desc(Sepal.Length)) %>% print(n=50)

 # Source: local data frame [150 x 5]
 # Groups: Species [3]
 # 
 #    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
 #                                      
 # 1           7.9         3.8          6.4         2.0  virginica
 # ...
 # 12          7.1         3.0          5.9         2.1  virginica
 # 13          7.0         3.2          4.7         1.4 versicolor
 # 14          6.9         3.1          4.9         1.5 versicolor
 # 15          6.9         3.2          5.7         2.3  virginica
 # ...

Notice the apparent mixing of the groups in presentation. That is part of why there is a visible Groups: annotation, the grouping can not be inferred from the data presentation.

My opinion

Frankly it can be hard to document and verify which dplyr pipelines are maintaining the semantics you intended. We have every reason to believe the following is both grouped and ordered:

  iris %>% group_by(Species) %>% arrange(desc(Sepal.Length)) 

It is ordered as dplyr::arrange is the last step and we can verify the grouping is present with dplyr::groups().

We have less reason to trust the following is also grouped and ordered (especially in remote databases or Spark):

  iris %>% arrange(desc(Sepal.Length)) %>% group_by(Species)

The above may be simultaneously grouped and ordered (i.e. have not lost the order), but for reasons of “trust, but verify” it would be nice to have a user-visible annotation certifying that. Remember, explicitly verifying the order requires the use of a window-function (such as lag) so verifying order by hand isn’t always a convenient option.

We need to put some of the dplyr machinery in a housing to keep our fingers from getting into the gears.

Essentially this is saying wrap Hadley Wickham’s “The Split-Apply-Combine Strategy for Data Analysis” (link) concept into a single atomic operation with semantics:

    data %>% split(column1) %>% lapply(arrange(column2) %>% f()) %>% bind_rows()

which we call “Grouped-Ordered-Apply.”

By wrapping the pipeline into a single “Grouped-Ordered-Apply” operation we are deliberately making intermediate results not visible. This is exactly what is needed to get rid of depending on distinctions of how partitioning is enforced (be it by a grouping annotation, or with an actual split) and worrying about the order of the internal operations.

A Suggested Solution

Our new package replyr supplies the “Grouped-Ordered-Apply” operation as replyr::gapply (itself built on top of dplyr). It performs the above grouped/ordered calculation as follows:

library('dplyr')
# install.packages('replyr') # Run this if you don't already have replyr
library('replyr')

data(iris)

# define our operation, in this case ranking
rank_in_group <- . %>% mutate(constcol=1) %>%
          mutate(rank=cumsum(constcol)) %>% select(-constcol)

# apply our operation to data that is simultaneously grouped and ordered
res  <-
   iris %>% gapply(gcolumn='Species',
                   f=rank_in_group,
                   ocolumn='Sepal.Length',decreasing=TRUE)

# present results
res %>% filter(rank<=2) %>% arrange(Species,rank)

 #   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species rank
 # 1          5.8         4.0          1.2         0.2     setosa    1
 # 2          5.7         4.4          1.5         0.4     setosa    2
 # 3          7.0         3.2          4.7         1.4 versicolor    1
 # 4          6.9         3.1          4.9         1.5 versicolor    2
 # 5          7.9         3.8          6.4         2.0  virginica    1
 # 6          7.7         3.8          6.7         2.2  virginica    2

replyr::gapply can use either a split based strategy, or a dplyr::group_by_ based strategy for calculation. Notice replyr::gapply‘s preference for “parametric treatment of variables.” replyr::gapply anticipates that the analyst may not know the names of columns or variables when they are writing their code, but may in fact need to take names as values stored in other variables. Essentially we are making dplyr::*_ forms preferred. The rank_in_group is using dplyr preferred non-standard evaluation, which assumes the analyst knows the names of the columns they are manipulating; that is appropriate for transient user code.

Now that we have the rank annotations present we can try to confirm they are in fact correct (i.e. that the implementation maintained grouping and ranking throughout). The calculation is detailed (checking ranks are unique per-group, integers in the range 1 to group-size, and order compatible with the value column Sepal.Length), so we have wrapped the calculation in replyr:

replyr_check_ranks(res,
                   GroupColumnName='Species',
                   ValueColumnName='Sepal.Length',
                   RankColumnName='rank',
                   decreasing=TRUE)

 #   goodRankedGroup    groupID nRows nGroups nBadRanks nUniqueRanks nBadOrders
 # 1            TRUE     setosa    50       1         0           50          0
 # 2            TRUE versicolor    50       1         0           50          0
 # 3            TRUE  virginica    50       1         0           50          0

For simplicity we wrote the primary checking function in terms of operations that happen to be only correct when there is only one group present (i.e. the function needs formal splitting and isolation, not just dplyr::group_by). This isn't a problem as we can then use replyr::gapply(partitionMetod='split') to correctly apply such code to all groups in turn.
Notice the Split-Apply-Combine steps are all wrapped together and supplied as part of the service; the user only supplies (column1,column2,f()). The transient lifetime and limited visibility of the sub-stages of the wrapped calculation are the appropriate abstractions given the fragility of row-order in modern data stores. The user doesn't care if the data is actually split and ordered, as long as it is presented to their function as if it were so structured. We are using the Split-Apply-Combine pattern, but abstracting out if it is actually implemented by formal splitting (ameliorating the differences between base::split, tidyr::nest and SQL GROUP BY ... ORDER BY ...). There are benefits in isolating the user-visible semantics from the details of realization.

Much can be written in terms of this pattern including grouped ranking problems, dplyr::summarize, and more. And this is precisely the semantics of gapply (grouped ordered apply) found in replyr.

Additional examples

An advantage of using the general notation as above is that dplyr has implementations that work on large remote data services such as databases and Spark.

For example here is the "rank within group" calculation performed in PostgreSQL (assuming you have such a database up, and using your own user/password). For these additional examples we are going to continue to suppose our goal is to compute the rank of Sepal.Length for irises grouped by Species.

# install.packages('replyr') # Run this if you don't already have replyr
library('dplyr')
library('replyr')
data(iris)

# define our windowed operation, in this case ranking
rank_in_group <- . %>% mutate(constcol=1) %>%
          mutate(rank=cumsum(constcol)) %>% select(-constcol)

# get a databse handle and copy the data into the database
my_db <- dplyr::src_postgres(host = 'localhost', port = 5432,
                             user = 'postgres', password = 'pg')
irisD <- replyr_copy_to(my_db,iris)

# run the ranking in the database
res <-
   irisD %>% gapply(gcolumn='Species',
                    f=rank_in_group,
                    ocolumn='Sepal.Length',decreasing=TRUE)

# present results
res %>% filter(rank<=2) %>% arrange(Species,rank)

 # Source:   query [?? x 6]
 # Database: postgres 9.6.1 [[email protected]:5432/postgres]
 # 
 #   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species  rank
 #                                       
 # 1          5.8         4.0          1.2         0.2     setosa     1
 # 2          5.7         3.8          1.7         0.3     setosa     2
 # 3          7.0         3.2          4.7         1.4 versicolor     1
 # 4          6.9         3.1          4.9         1.5 versicolor     2
 # 5          7.9         3.8          6.4         2.0  virginica     1
 # 6          7.7         3.0          6.1         2.3  virginica     2

dplyr::group_by can also perform the grouped ordered calculation in PostgreSQL using the code below:

irisD %>% group_by(Species) %>% arrange(desc(Sepal.Length)) %>% rank_in_group %>% 
  filter(rank<=2) %>% arrange(Species,rank)

We can even perform the same calculation using Spark and sparklyr.

# get a Spark handle and copy the data in
my_s <- sparklyr::spark_connect(master = "local")
irisS <- replyr_copy_to(my_s,iris)

# re-run the ranking in Spark
irisS %>% gapply(gcolumn='Species',
                 f=rank_in_group,
                 ocolumn='Sepal_Length',decreasing=TRUE) %>%
          filter(rank<=2) %>% arrange(Species,rank)

  # Source:   query [?? x 6]
 # Database: spark connection master=local[4] app=sparklyr local=TRUE
 # 
 #   Sepal_Length Sepal_Width Petal_Length Petal_Width    Species  rank
 #                                       
 # 1          5.8         4.0          1.2         0.2     setosa     1
 # 2          5.7         4.4          1.5         0.4     setosa     2
 # 3          7.0         3.2          4.7         1.4 versicolor     1
 # 4          6.9         3.1          4.9         1.5 versicolor     2
 # 5          7.9         3.8          6.4         2.0  virginica     1
 # 6          7.7         3.8          6.7         2.2  virginica     2

Notice the sparklyr adapter changed column names by replacing "." with "_", so we had to change our ordering column specification "ocolumn='Sepal_Length'" to match. This is the only accommodation we had to make to switch to a Spark service. Outside of R (and Lisp) dots in identifiers are considered a bad idea and should be avoided. For instances most SQL databases reserved dots to indicate relations between schemas, tables, and columns (so it is only through sophisticated quoting mechanisms that the PostgreSQL example was able to use dots in column names).

dplyr directly completes the same calculation with:

irisS %>% group_by(Species) %>% arrange(desc(Sepal_Length)) %>% rank_in_group %>% 
  filter(rank<=2) %>% arrange(Species,rank)

And that is the Grouped-Ordered-Apply pattern and our dplyr based reference implementation. (Exercise for the reader: implement SQL's useful "UPDATE WHERE" operation in terms of replyr::gapply.)

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)