Thoughts on nest()

[This article was first published on R – Jocelyn Ireson-Paine's Blog, and kindly contributed to R-bloggers.]

I’ve been experimenting with the Tidyverse’s nest function, because it may
be useful when, for example, using households together with benefit units.
Below are some thoughts that I first posted as a comment to Hadley Wickham’s
blog entry “tidyr 0.4.0”. More on this in future posts.

First, nest is likely to be very useful to me. I’m translating
a microeconomic model into R. Its input is a set of British
households, where each household record contains data on income
and expenditure. The model uses these to predict how their
incomes will change if you change tax (e.g. by increasing
income tax) or benefits (e.g. by increasing pensions or child benefit).

Our data splits households into “benefit units”.
A benefit unit ( http://www.poverty.org.uk/summary/households.shtml ) is
defined as an adult plus their spouse if they have one,
plus any dependent children they are living with. So for example,
mum and dad plus 10-year-old Johnnie would be one benefit unit. But if
Johnnie is over 18, he becomes an adult who just happens to live with
his parents, and the household has two benefit units. These are
treated more-or-less independently by the benefit system,
for example when they receive money while out of work.

This matters because each dataset contains one table for
household-wide data, and another for benefit-unit-wide data.
I’ve been combining these with joins. But it might be cleaner
to nest each household’s benefit units inside the household
data frame, not least because we sometimes have to change
data in a household, for example when simulating inflation,
and then propagate the changes down to the benefit units.
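
Here is a minimal sketch of what that nesting might look like. The tables, the column names (hh_id, bu_id, income, uprate) and the numbers are all invented for illustration and are not taken from the real model's data. It also assumes a reasonably recent dplyr and tidyr.

```r
library(dplyr)
library(tidyr)

# Hypothetical household-level table: one row per household.
households <- tibble(hh_id  = c(1, 2),
                     region = c("North", "South"),
                     uprate = c(1.02, 1.05))   # invented inflation factors

# Hypothetical benefit-unit table: household 1 contains two benefit units.
benefit_units <- tibble(hh_id  = c(1, 1, 2),
                        bu_id  = c(1, 2, 1),
                        income = c(300, 100, 400))

# Collapse the benefit units to one row per household, holding a list-column
# of small tables, then attach that column to the household table.
nested_units <- benefit_units %>%
  group_by(hh_id) %>%
  nest() %>%
  ungroup() %>%
  rename(units = data)

households_nested <- left_join(households, nested_units, by = "hh_id")

# Propagate a household-level change (the uprating factor) down to each
# household's benefit units.
uprated <- households_nested %>%
  mutate(units = Map(function(bu, rate) mutate(bu, income = income * rate),
                     units, uprate))
```

With the join done once, the household row and its benefit units travel together, so a household-level change can be pushed down without another join.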

Second, nest and unnest could be useful elsewhere in our
data. Each household’s expenditure data is divided into
categories, for example food, rent, and alcohol. We may
want to group and ungroup these. For example, I make:

library(dplyr)   # tibble(), group_by(), arrange(), %>%
library(tidyr)   # nest(), unnest()

# One row per household (ID) and expenditure category.
d <- tibble( ID=c( 1, 1, 1,  2, 2,  3, 3,
                   4, 4, 4,  5, 5, 5,  6, 6 ),
             expensetype=c( 'food', 'alcohol', 'rent',  'food', 'rent',  'food', 'rent',
                            'food', 'cigarettes', 'rent',  'food', 'alcohol', 'rent',  'food', 'rent' ),
             expenditure = c( 100, 50, 400,  75, 300,  90, 400,
                              100, 30, 420,  75, 50, 550,  150, 600 )
           )

Then

d %>% group_by(ID) %>% nest %>% arrange(ID)

gives me a table

1 tibble [3 x 2]
2 tibble [2 x 2]
3 tibble [2 x 2]
4 tibble [3 x 2]
5 tibble [3 x 2]
6 tibble [2 x 2]

where the first column is the ID and the second is a table such as

food      100
alcohol    50
rent      400

So in effect, it turns my original table into a function from ID to
℘(expensetype × expenditure): each ID maps to a set of
(expensetype, expenditure) pairs.
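
To make the "function" reading concrete, here is a small sketch that applies the nested table to one ID and then undoes the nesting. The helper name lookup is my own invention, not part of tidyr, and the sketch assumes a version of tidyr in which nest() names its list-column data.

```r
library(dplyr)
library(tidyr)

d <- tibble(ID = c(1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6),
            expensetype = c("food", "alcohol", "rent", "food", "rent", "food", "rent",
                            "food", "cigarettes", "rent", "food", "alcohol", "rent",
                            "food", "rent"),
            expenditure = c(100, 50, 400, 75, 300, 90, 400,
                            100, 30, 420, 75, 50, 550, 150, 600))

by_id <- d %>% group_by(ID) %>% nest() %>% ungroup() %>% arrange(ID)

# Hypothetical helper: "apply" the function to one ID, returning the
# small table of (expensetype, expenditure) pairs stored for it.
lookup <- function(nested, id) nested$data[[match(id, nested$ID)]]

lookup(by_id, 1)

# unnest() reverses the nesting and gives back one row per ID and category.
flat_again <- by_id %>% unnest(data)
```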

Whereas if I do

d %>% group_by(expensetype) %>% nest %>% arrange(expensetype)

I get the table

alcohol    tibble [2 x 2]
cigarettes tibble [1 x 2]
food       tibble [6 x 2]
rent       tibble [6 x 2]

where the first column is the expenditure category
and the second holds tables of ID versus expenditure. In effect, this is a
function from expensetype to ℘(ID × expenditure).
This sort of reformatting may be very useful.
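
As one illustration of why, the sketch below summarises each category's nested table in place. The column names total and n_households are my own choices, and I use base R's vapply only to avoid an extra package dependency.

```r
library(dplyr)
library(tidyr)

d <- tibble(ID = c(1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6),
            expensetype = c("food", "alcohol", "rent", "food", "rent", "food", "rent",
                            "food", "cigarettes", "rent", "food", "alcohol", "rent",
                            "food", "rent"),
            expenditure = c(100, 50, 400, 75, 300, 90, 400,
                            100, 30, 420, 75, 50, 550, 150, 600))

by_type <- d %>%
  group_by(expensetype) %>%
  nest() %>%
  ungroup() %>%
  arrange(expensetype)

# Each element of the list-column is a small table of ID versus expenditure,
# so per-category summaries are just functions applied to those tables.
summary_by_type <- by_type %>%
  mutate(total        = vapply(data, function(t) sum(t$expenditure), numeric(1)),
         n_households = vapply(data, nrow, integer(1)))
```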

Third, continuing from the above, I wrote this function:

functionalise <- function( data, cols )
{
  data %>% group_by_( .dots=cols ) %>% nest %>% arrange_( .dots=cols )
}

The idea is that functionalise( data, cols ) reformats data into a data
frame that represents a function. The columns cols represent the
function’s domain, and will never contain more than one
occurrence of the same tuple. The remaining list-column
holds the function’s values, which are elements of its codomain. Thus,
functionalise(d,"expensetype") returns the data frame
shown in the previous example.
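
The underscore verbs group_by_ and arrange_ have since been deprecated in dplyr. Below is a sketch of how the same function might be written for dplyr 1.0 or later, using across() and all_of(). This is my adaptation under that version assumption, not the code I originally ran.

```r
library(dplyr)
library(tidyr)

# cols is a character vector naming the columns that form the domain.
functionalise <- function(data, cols) {
  data %>%
    group_by(across(all_of(cols))) %>%
    nest() %>%
    ungroup() %>%
    arrange(across(all_of(cols)))
}

d <- tibble(ID = c(1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6),
            expensetype = c("food", "alcohol", "rent", "food", "rent", "food", "rent",
                            "food", "cigarettes", "rent", "food", "alcohol", "rent",
                            "food", "rent"),
            expenditure = c(100, 50, 400, 75, 300, 90, 400,
                            100, 30, 420, 75, 50, 550, 150, 600))

f <- functionalise(d, "expensetype")

# The domain columns should contain no repeated tuple.
stopifnot(anyDuplicated(f["expensetype"]) == 0)
```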

Fourth, I note that I can write either

d %>% group_by( expensetype ) %>% nest %>% arrange( expensetype )

or

nest( d, ID, expenditure )

In the first, I have to name those columns that I want
to act as the function’s domain. In the second, I have
to name the others. I find the first more natural.
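
In tidyr 1.0 and later, the second form is spelled with a named argument that gives the new list-column's name. The sketch below, which assumes such a version, checks that the two styles produce the same nesting for this data.

```r
library(dplyr)
library(tidyr)

d <- tibble(ID = c(1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6),
            expensetype = c("food", "alcohol", "rent", "food", "rent", "food", "rent",
                            "food", "cigarettes", "rent", "food", "alcohol", "rent",
                            "food", "rent"),
            expenditure = c(100, 50, 400, 75, 300, 90, 400,
                            100, 30, 420, 75, 50, 550, 150, 600))

# Style 1: name the domain columns; everything else gets nested.
by_domain <- d %>%
  group_by(expensetype) %>%
  nest() %>%
  ungroup() %>%
  arrange(expensetype)

# Style 2: name the columns to be nested; everything else is the domain.
by_nested <- d %>%
  nest(data = c(ID, expenditure)) %>%
  arrange(expensetype)

# Same keys, and the same number of rows nested under each key.
stopifnot(identical(by_domain$expensetype, by_nested$expensetype))
stopifnot(identical(vapply(by_domain$data, nrow, integer(1)),
                    vapply(by_nested$data, nrow, integer(1))))
```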

Fifth, nest and unnest, as well as spread and gather,
make it very easy to generate alternative but logically
equivalent representations of a data frame. But every
time I change representation in this way, I have to
rewrite all the code that accesses the data. It
would be really nice if either a
representation-independent way of accessing it
could be found, or if nest/unnest and
spread/gather could be made to operate on the
code as well as the data. Paging Douglas
Ross and the Uniform Referent Principle…
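
One very limited approximation, offered only as a sketch, is to hide the representation behind accessor functions that first normalise the data to the flat form. The helpers as_flat and expenditure_of are hypothetical names of my own. The sketch assumes tidyr 1.0 or later, and it handles only the nested representations, not the wide form that spread produces.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: unnest any list-columns so that callers always
# see one row per (ID, expensetype), however the data was nested.
as_flat <- function(data) {
  data <- ungroup(data)
  list_cols <- names(data)[vapply(data, is.list, logical(1))]
  if (length(list_cols) == 0) return(data)
  unnest(data, all_of(list_cols))
}

# Hypothetical accessor: the calling code is the same for every representation.
expenditure_of <- function(data, id, type) {
  as_flat(data) %>%
    filter(ID == id, expensetype == type) %>%
    pull(expenditure)
}

d <- tibble(ID = c(1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6),
            expensetype = c("food", "alcohol", "rent", "food", "rent", "food", "rent",
                            "food", "cigarettes", "rent", "food", "alcohol", "rent",
                            "food", "rent"),
            expenditure = c(100, 50, 400, 75, 300, 90, 400,
                            100, 30, 420, 75, 50, 550, 150, 600))

by_id   <- d %>% group_by(ID) %>% nest()
by_type <- d %>% group_by(expensetype) %>% nest()

expenditure_of(d, 4, "cigarettes")
expenditure_of(by_id, 4, "cigarettes")
expenditure_of(by_type, 4, "cigarettes")
```

This pays for the uniform interface by unnesting on every access, so it illustrates the idea rather than solving the problem.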
