**R – Jocelyn Ireson-Paine's Blog**, and kindly contributed to R-bloggers)

Most people have a job. Some don’t. And a few have more than one. I’ve mentioned before that our economic model works on data about British households, gathered from surveys such as the Family Resources Survey. Each collection of data is a year long, and contains a file that describes all the adults in the 20,000 or so households covered. There’s another file that describes their jobs. I have to make the model aggregate and merge these.

As an example, assume Pat has three jobs. Her adult data consists of one record, with details such as age and sex. Her jobs data has one record per job, stating how much the job pays. I need to squish these three records down to one which has a field for the number of jobs, a field for the total earnings, and a field with a list of earnings. And I must then merge the result with Pat’s adult record. Now, the jobs file and the adults file are both sorted in order of ID. In our Python version of the model, we looped over these, matching IDs, doing subsidiary loops round records with the same ID to aggregate data, and then combining the results. But in R, I’d like to write code that’s more concise, easier to apprehend, easier to tweak, and that suits R’s programming idioms. As it happens, the Tidyverse can do this easily with `summarise()`

. But it took me a bit of time to get there, and I’m going to show the experiments I did, and why. I’ll precede these by explaining

the main points I wanted to investigate, and what I discovered.

First, I’ll show the call to `summarise()`

that worked.

Suppose my jobs table `t`

has a column `id`

which identifies the adult, and a column `value`

which gives a job’s earnings. Here’s such

a table, obviously not realistic.

t <- tribble( ~id, ~value , 1 , 101 , 1 , 102 , 2 , 201 , 3 , 301 , 3 , 302 , 3 , 303 , 4 , 401 )

From this, I want to make a table whose `counts`

column says how many jobs the adult

has, whose `sums`

column is their total earnings, and whose `lists`

column is a list of earnings. This is how:

t %>% group_by( id ) %>% summarise( counts=n(), sums=sum(value), lists=list(value) )

This may look obvious. But I was thrown off course, partly by

some typos I made early on, and partly by the

questions I was thinking about when I wrote

“Second-Guessing R”.

Second, I’ll mention something that surprised me. In the tables these generate,

each element of the lists column turns out to be an atomic vector, not

a list. So `identical( tsl[[1,4]], list(101,102) )`

is

`FALSE`

, but `identical( tsl[[1,4]], c(101,102) )`

is `TRUE`

. I don’t know whether this is something odd

about base R or about `summarise()`

.

Third, a surprise arising from the above. I can’t call

`c()`

instead of `list()`

. If I do,

I get an error `Error in summarise_impl(.data, dots): Column`

.

`lists` must be length 1 (a summary value), not 2

That sort of makes sense, in R’s strange world. Something

seems to be “lowering”

`list()`

so that it produces atomic vectors when it

would normally produce lists. So perhaps any other function

also gets “lowered”. Including `c()`

. But

`c()`

is already as low as it can get. You can’t

have anything lower than an atomic vector. So “lowering” `c()`

would render it useless. Which is what’s happened.

Fourth, I wanted to see whether there was anything special about

`summarise()`

. Would the same functions work from

`mutate()`

? Indeed they would: they’re not restricted

to `summarise()`

.

The code above produces *one* row per adult. Each group (i.e. each collection of rows for an adult’s jobs) has been “collapsed”, as the Tidyverse documentation for `tally()`

puts it. But if I want to calculate counts, sums and lists *without* collapsing, then I can call `mutate()`

instead of `summarise()`

:

tsl <- t %>% group_by( id ) %>% mutate( counts=n(), sums=sum(value), lists=list(value) )

This adds the count, sum, and list to every row of the original uncollapsed table.

Fifth, another surprise involving `c()`

.

This time, I can call it. But it just causes the

lists column to duplicate the original values.

Sixth, I wanted to see whether `summarise()`

would work if I didn’t group the table first. And it

does. It treats the entire table as one group,

and therefore collapses it to a single row.

Seventh, would `mutate()`

also work if I didn’t group the table first?

It should, of course, because that’s how

it’s normally used. And it does.

`n()`

counts the rows in the

original table, and `sum()`

and

`list()`

get applied to the

entire value column.

My eighth and final point is perhaps

more general, being about how

`summarise()`

and

`mutate()`

treat their

functions. For

more about why I wanted to

investigate this, read

“Second-Guessing R”.

In essence, I’ve found that the

name of a column is always replaced by

a vector. If the table holding the

column is ungrouped, the vector

is the entire column, just

as it would be in a base-R call such as

`data_frame$values * 2`

. If it’s grouped,

the vector becomes whatever

slice of that column belongs to the first

group, then whatever

slice of that column belongs to the second

group, and so on. This seems to be true

whatever the function it’s passed to.

And, the Tidyverse does not appear

to recognise certain functions as special

because of their name or other identity.

It treats the built-in `list()`

the same as a function with different code

or a different name.

With all that out of the way,

here are my experiments. Incidentally, there

are some that I haven’t mentioned above,

mainly with `tally()`

and `count()`

.

# try_sum_count_and_list.R # # When processing adults data, I # need to summarise the earnings # from each adult's jobs. An adult # may have more than one job, and # I need to make fields containing # the number of jobs, the earnings # summed, and a list of the # individual earnings. These are # experiments in doing this with # Tidyverse functions. # # This led on to some general # questions about the functions, # which I've blogged in # http://www.j-paine.org/blog/2017/11/experiments-with-count-tally-and-summarise-how-to-count-and-sum-and-list-elements-of-a-column-in-the-same-call.html . library( tidyverse ) t <- tribble( ~id, ~value , 1 , 101 , 1 , 102 , 2 , 201 , 3 , 301 , 3 , 302 , 3 , 303 , 4 , 401 ) t %>% group_by( id ) %>% tally() # # id n # 1 1 2 # 2 2 1 # 3 3 3 # 4 4 1 # # The 'n' column gives the # number of rows in each # group. Unlike with add_count() # below, the result is # collapsed so there's only # one row per ID. t %>% count( id ) # # The same. t %>% add_count( id ) # # id value n # 1 1 101 2 # 2 1 102 2 # 3 2 201 1 # 4 3 301 3 # 5 3 302 3 # 6 3 303 3 # 7 4 401 1 # # The table is not collapsed, # so each ID has the same # number of rows as before. But # they all get an 'n' column # giving their count. add_tally() # would do the same. t %>% group_by( id ) %>% tally( wt=value ) # # id n # 1 1 203 # 2 2 201 # 3 3 906 # 4 4 401 # # Here, the 'n' column # is the sum over 'value' # within each group. t %>% count( id, wt=value ) # # The same. t %>% add_count( id, wt=value ) # # id value n # 1 1 101 203 # 2 1 102 203 # 3 2 201 201 # 4 3 301 906 # 5 3 302 906 # 6 3 303 906 # 7 4 401 401 # # As before with add_count(), # the table is not collapsed. # An 'n' cell has been appended # to each row, giving the # sum over the value's in # that row's group. add_tally() # would do the same. t %>% add_count( id ) %>% filter( n== 1 ) # # id value n # 1 2 201 1 # 2 4 401 1 # # As the documentation notes, # "add_count() is useful for groupwise # filtering. E.g.: show only species # that have a single member." t %>% group_by( id ) %>% summarise( n(), sum(value) ) # # id `n()` `sum(value)` # 1 1 2 203 # 2 2 1 201 # 3 3 3 906 # 4 4 1 401 # # With tally() and count(), I # can't see a way to count rows and # sum values in the same call. With # summarise(), I can. t %>% group_by( id ) %>% summarise( counts=n(), sums=sum(value) ) # # As above, but the new columns have # my names 'counts' and 'sums'. tsl <- t %>% group_by( id ) %>% summarise( counts=n(), sums=sum(value), lists=list(value) ) # # id counts sums lists # 1 1 2 203# 2 2 1 201 # 3 3 3 906 # 4 4 1 401 # # So as well as counting rows and # summing values, I can aggregate values # into a collection. But beware. Each # cell of 'lists' becomes an atomic # vector, not a list. The code below # proves this. # # (These are the first and second points # in my blog post.) identical( tsl[[1,4]], list(101,102) ) # [1] FALSE identical( tsl[[1,4]], c(101,102) ) # [1] TRUE tsc <- t %>% group_by( id ) %>% summarise( counts=n(), sums=sum(value), lists=c(value) ) # # Gives an error: # Error in summarise_impl(.data, dots) : # Column `lists` must be length 1 (a summary value), not 2 # # So I can't aggregate by calling c(). # I do have to call list(). # # (This is the third point in my blog post.) tsl <- t %>% group_by( id ) %>% mutate( counts=n(), sums=sum(value), lists=list(value) ) # # id value counts sums lists # 1 1 101 2 203 # 2 1 102 2 203 # 3 2 201 1 201 # 4 3 301 3 906 # 5 3 302 3 906 # 6 3 303 3 906 # 7 4 401 1 401 # # If I want the same effect as with # add_tally() and add_count(), this # is the way to do it. It adds the # count, sum, and list to every row # of the original uncollapsed table. # # This also shows that n() and sum() # are not restricted to use from # summarise(). You can use them # on grouped tables from mutate(). # And on ungrouped tables, I suppose, # but that's probably not useful. # # Note: I've not shown it here, but # the result is grouped, so it's # probably best to ungroup it. # # (This is the fourth point in my blog # post.) tsl <- t %>% group_by( id ) %>% mutate( counts=n(), sums=sum(value), lists=c(value) ) # # id value counts sums lists # 1 1 101 2 203 101 # 2 1 102 2 203 102 # 3 2 201 1 201 201 # 4 3 301 3 906 301 # 5 3 302 3 906 302 # 6 3 303 3 906 303 # 7 4 401 1 401 401 # # Very odd, and not useful. The result # of c() doesn't get put into each cell # of 'lists', but treated as a slice of the # column. Probably related to the # behaviour I commented on in # http://www.j-paine.org/blog/2017/10/experiments-with-summarise-or-when-does-x-sub-1-equal-x.html . # # As above, and as expected, the table # is grouped. # # (This is the fifth point in my blog # post.) tsl <- t %>% group_by( id ) %>% transmute( counts=n(), sums=sum(value), lists=list(value) ) # # This produced the same table as above, # but without the 'value' column, and # accompanied by a warning: # Adding missing grouping variables: `id` tsl <- t %>% summarise( counts=n(), sums=sum(value), lists=list(value) ) # # counts sums lists # 1 7 1711 # # Out of interest, this is what # happens if I summarise without grouping. # The table is treated as one group. # # (This is the sixth point in my blog # post.) tsl <- t %>% mutate( counts=n(), sums=sum(value), lists=list(value) ) # # id value counts sums lists # 1 1 101 7 1711 # 2 1 102 7 1711 # 3 2 201 7 1711 # 4 3 301 7 1711 # 5 3 302 7 1711 # 6 3 303 7 1711 # 7 4 401 7 1711 # # And this is what happens if I mutate # without grouping. The aggregating # functions n(), sum() and list() get # applied to the entire column. Their # result is then appended to each row. # # (This is the seventh point in my # blog post.) my_list <- list tsl <- t %>% mutate( counts=n(), sums=sum(value), lists=my_list(value) ) # # The same table as above. # # I wanted to see whether mutate() is # treating 'list' specially. For example, # does it recognise the name and do # something special because of it? # This shows that it doesn't, because # I'm using a different name but getting # the same result. # # (This and the stuff below make the # eigth point in my blog post.) my_list <- function( x ) list( x ) tsl <- t %>% mutate( counts=n(), sums=sum(value), lists=my_list(value) ) # # The same table as above. # # mutate() might have been recognising # the value of list(), i.e. the pointer # to its code. This shows it isn't, # because I'm using a different pointer # but getting the same result. my_list <- function( x ) str_c( x[1], x[2], x[3], x[4], x[5], x[6], x[7], sep="," ) tsl <- t %>% mutate( counts=n(), sums=sum(value), lists=my_list(value) ) # # A table like those above but where each # element of the 'lists' column is a string # concatenation of all the values in # the 'value' column. This shows that # my_list() is receiving all the values # in one go. my_list <- function( x ) x[1] tsl <- t %>% mutate( counts=n(), sums=sum(value), lists=my_list(value) ) # # A table like those above but where each # element of the 'lists' column is 101. # This is consistent with my conclusions # above. tsl <- t %>% mutate( counts=n(), sums=sum(value), lists=value[1] ) # # The same as the last table. # # This shows that it doesn't matter whether # the subscripting is hidden from mutate()'s # view. In other words, it doesn't treat # [] specially. tsl <- t %>% mutate( counts=n(), sums=sum(value), lists=value*2 ) # # The same as the tables above, but each element # of 'lists' is twice the corresponding element # of 'value. my_list <- function(x) { cat("Calling my_list() with argument "); dput(x); x*2 } tsl <- t %>% mutate( counts=n(), sums=sum(value), lists=my_list(value) ) # # The same table as above. Outputs # Calling my_list() with argument c(101, 102, 201, 301, 302, 303, 401) . # # This confirms that my_list() gets called # only once and is passed the entire column. # # The point of these last two experiments is # that when I first used calls such as # mutate( new_value=value*2 ) # I assumed * was getting called once per row, # and that mutate() substituted that row's # 'value' as the argument to * . But that's # not true. * gets the entire column. This # works because of R's vectorisation.

**leave a comment**for the author, please follow the link and comment on their blog:

**R – Jocelyn Ireson-Paine's Blog**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...