Peeling of group layers.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
As an experienced dplyr user since almost day one, I thought I knew every aspect of it. But when my new colleague, who is learning dplyr from scratch, asked me to explain the peeling of group layers with summarise, I was like, what?
Turns out this actually is a thing. Let me show the example from the dplyr introduction:
    library(dplyr)
    library(nycflights13)

    daily <- flights %>%
      group_by(year, month, day) %>%
      summarise(flights = n())
    daily
    ## Source: local data frame [365 x 4]
    ## Groups: year, month
    ##
    ##    year month day flights
    ## 1  2013     1   1     842
    ## 2  2013     1   2     943
    ## 3  2013     1   3     914
    ## 4  2013     1   4     915
    ## 5  2013     1   5     720
    ## 6  2013     1   6     832
    ## 7  2013     1   7     933
    ## 8  2013     1   8     899
    ## 9  2013     1   9     902
    ## 10 2013     1  10     932
    ## ..  ...   ... ...     ...
Notice that the newly generated data.frame is grouped by year and month only, not by day anymore, which is the effect of the aforementioned peeling. The idea is that if you want to aggregate your data further, in most cases you would aggregate it with one layer off, since you have already exploited the initial grouping.
So if you want to calculate the number of flights per month and then per year, you don’t have to regroup anymore.
    monthly <- daily %>%
      summarise(flights = sum(flights))
    monthly
    ## Source: local data frame [12 x 3]
    ## Groups: year
    ##
    ##    year month flights
    ## 1  2013     1   27004
    ## 2  2013     2   24951
    ## 3  2013     3   28834
    ## 4  2013     4   28330
    ## 5  2013     5   28796
    ## 6  2013     6   28243
    ## 7  2013     7   29425
    ## 8  2013     8   29327
    ## 9  2013     9   27574
    ## 10 2013    10   28889
    ## 11 2013    11   27268
    ## 12 2013    12   28135
And finally:
    yearly <- monthly %>%
      summarise(flights = sum(flights))
    yearly
    ## Source: local data frame [1 x 2]
    ##
    ##   year flights
    ## 1 2013  336776
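If you would rather check than count, you can ask a data frame which grouping variables it still carries. The following is only a minimal sketch: it assumes a dplyr version that provides group_vars() (older versions have groups() instead), and it uses the built-in mtcars data as a stand-in for flights so that it runs without nycflights13. The object names are made up for illustration.

```r
library(dplyr)

# mtcars stands in for flights; cyl and gear play the roles of year and month.
by_cyl_gear <- mtcars %>%
  group_by(cyl, gear) %>%
  summarise(cars = n())

# One layer peeled: only the first grouping variable is left.
group_vars(by_cyl_gear)  # "cyl"

by_cyl <- by_cyl_gear %>%
  summarise(cars = sum(cars))

# Last layer peeled: the result is no longer grouped.
group_vars(by_cyl)       # character(0)
```

Recent dplyr versions also print a message after summarise saying how the output is grouped, which tells you the same thing.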
How does dplyr determine which layer to peel off? It depends on the order of the groups: the variable listed last in group_by is the one that gets dropped.
    daily <- flights %>%
      group_by(day, month, year) %>%
      summarise(flights = n())
    daily
    ## Source: local data frame [365 x 4]
    ## Groups: day, month
    ##
    ##    day month year flights
    ## 1    1     1 2013     842
    ## 2    1     2 2013     926
    ## 3    1     3 2013     958
    ## 4    1     4 2013     970
    ## 5    1     5 2013     964
    ## 6    1     6 2013     754
    ## 7    1     7 2013     966
    ## 8    1     8 2013    1000
    ## 9    1     9 2013     718
    ## 10   1    10 2013     965
    ## .. ...   ...  ...     ...
This produces the same counts as the first daily computation, only in a different row order, but now year got peeled off instead of day.
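The order dependence matters most for the next summarise, because it decides what level you end up aggregating to. Here is a small sketch of that, again assuming group_vars() is available and using mtcars instead of flights; all object names are hypothetical.

```r
library(dplyr)

# Same two grouping variables, opposite order.
a <- mtcars %>% group_by(cyl, gear) %>% summarise(cars = n())
b <- mtcars %>% group_by(gear, cyl) %>% summarise(cars = n())

group_vars(a)  # "cyl":  gear was listed last, so gear was peeled off
group_vars(b)  # "gear": cyl was listed last, so cyl was peeled off

# The very same second step now aggregates to different levels.
per_cyl  <- a %>% summarise(cars = sum(cars))  # one row per cyl
per_gear <- b %>% summarise(cars = sum(cars))  # one row per gear
```

Both a and b hold identical counts, yet the follow-up summarise gives totals per cyl in one case and per gear in the other.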
Although it might help you write code faster (by saving you a group_by-clause), in my humble opinion it deteriorates readability. Or would you instantly know what to expect from this code, which is basically the summary of the ones above:
    result <- flights %>%
      group_by(year, month, day) %>%
      summarise(flights = n()) %>%
      summarise(flights = sum(flights)) %>%
      summarise(flights = sum(flights))
    result
    ## Source: local data frame [1 x 2]
    ##
    ##   year flights
    ## 1 2013  336776
To understand the result, you have to count the peeled-off layers and work out which ones are affected. You can’t simply rely on the group_by anymore while reading the code. Now imagine you are reviewing code or looking at old code of yours. Every time you see a wild summarise without a directly preceding group_by, you will have to go back through the code looking for the previous group_by, then find every summarise statement after it and count the peeling.
Peeling onions isn’t fun either. Lilly Martin Spencer, Peeling Onions, ca. 1852.
While it might be a convenient way for quick and dirty analysis, I would not use it in production code, and my best practice is to always put an explicit group_by statement right before a summarise. Keeping in mind that the behaviour of group_by already changed from adding groups to overriding groups completely in dplyr 0.2, the safest thing you can do is to add an ungroup statement right before grouping.
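In code, that explicit style could look roughly like the sketch below. It uses mtcars as a stand-in for flights so it runs on its own, and the object names are hypothetical.

```r
library(dplyr)

by_cyl_gear <- mtcars %>%
  group_by(cyl, gear) %>%
  summarise(cars = n())

# Explicit version: drop whatever grouping is left over,
# then state the grouping the next summarise should use.
by_cyl <- by_cyl_gear %>%
  ungroup() %>%
  group_by(cyl) %>%
  summarise(cars = sum(cars))
```

It is two lines longer than relying on the peeling, but a reader sees the grouping right where it is used. As a side note that goes beyond the dplyr version this post was written for: from dplyr 1.0.0 on, summarise also accepts a .groups argument (for example .groups = "drop"), which is another way to make the resulting grouping explicit.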
Another problem is that some users might not be aware of this behaviour (I wasn’t, therefore the blog post) and might be surprised by the somewhat counterintuitive results.
So be aware of this behaviour and you can keep enjoying dplyr!
Peeling of group layers. was originally published by Kirill Pomogajko at Opiate for the masses on August 07, 2015.