Multinest: Factoring Out Hierarchical Names by Converting Tibbles to Nested Named Lists

[This article was first published on R – Jocelyn Ireson-Paine's Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Consider this tibble:

vat_rates <- tribble( ~l1,   ~l2         , ~l3         , ~val
                    , 'VAT', 'Cigarettes', '' 	       , 'standard'
                    , 'VAT', 'Tobacco'   , ''          , 'standard'
                    , 'VAT', 'Narcotics' , ''	       , 0
                    , 'VAT', 'Clothing'  , 'Adult'     , 'standard'
                    , 'VAT', 'Clothing'  , 'Children'  , 0
                    , 'VAT', 'Clothing'  , 'Protective', 0
                    )
It contains three name columns followed by a value. The name columns are hierarchical. The top category is VAT (Value Added Tax). The second column subdivides this into VAT on cigarettes, other tobacco, narcotics, and clothing. And the third column further subdivides clothing into adult, children’s, or protective. This is part of the VAT parameters worksheet that provides our economic model with its knowledge of the British tax system. Here’s the same fragment as a screen shot:
I want to convert this into a tree-structured named list in which I can access parameter values by multiple selection. That is, I want to be able to ask for params$VAT$Tobacco and get 'standard', or for params$VAT$Clothing$Children and get 0. This is how I did so.

I’m going to show my code below. I’ve annotated it in the same way I did in my previous posts, so all the explanation is in the code as comments. What I did was to write a function called multinest, correcting it a step at a time as I found cases it couldn’t handle. I built it up that way because I found it easier to explain. It seemed to be easier for the reader (or at least, for me acting as reader) to visualise than presenting a finished version of the function immediately and explaining how its recursion and base cases worked. I’m not sure why I found the latter harder, but it’s something to do with the fact that nest transforms an entire table in one go. If you then want to perform an inner transformation on the nested tables it inserts, you have to loop or recurse over them. That’s different from how a more traditional algorithm would work. Anyway, here’s my code. The final version of my function is at the end, multinest_4.

# try_multinest_4.R
#
# This is a version of multinest which
# I hope is easier to understand than
# that in try_multinest_3.R . I built
# it by successive enhancement, starting
# with nest() and adding one correction
# at a time.


library( tidyverse )


t <- tribble( ~l1, ~l2, ~val
            , 'A', 'a', 'Aa'
            , 'B', 'a', 'Ba'
            , 'A', 'b', 'Ab'
            , 'B', 'b', 'Bb'
            )
#
# I will try my functions on this
# tibble. I want to write a function
# that factors out the two name
# columns l1 and l2, converting it
# to the nested named list
#   list( A = list( a = "Aa", b = "Ab" )
#       , B = list( a = "Ba" ,b = "Bb" )
#       )
# So if this list is l, l$A$a must be
# "Aa", and l$B$b must be "Bb", etc.
     

# First, let's try nest() interactively.

t_nested <- t %>% group_by( l1 ) %>%
                 nest
#
# Gives a tibble where the second
# and third columns of t have been
# collapsed to inner, nested, tibbles:
#   # A tibble: 2 x 2
#        l1             data
#     <chr>           <list>
#   1     A <tibble [2 x 2]>
#   2     B <tibble [2 x 2]>

t_nested[[1,2]]
#
#   # A tibble: 2 x 2
#        l2   val
#     <chr> <chr>
#   1     a    Aa
#   2     b    Ab
# So the data column in t_nested
# is indeed tibbles.


# Now I want to "spread" t_nested, 
# flipping it into a list where the column-1
# elements "A" and "B" become the list's
# names, and the tibbles in column 2
# becomes the values associated with those
# names.

# In my blog post 
# http://www.j-paine.org/blog/2017/10/how-best-to-convert-a-names-values-tibble-to-a-named-list.html 
# I tried functions for doing this. I could
# use spread(). But it turns out I don't need
# to. setNames() can do the job much more 
# efficiently. Here's my function. See
# the post for why it's as it is.

spread_to_list <- function( t )
{
  setNames( as.list( t[[2]] ), t[[1]] ) 
}


# Let's try it.

t_nested_spread <- spread_to_list( t_nested )
#
#   $A
#   # A tibble: 2 x 2
#        l2   val
#     <chr> <chr>
#   1     a    Aa
#   2     b    Ab
#
#   $B
#   # A tibble: 2 x 2
#        l2   val
#     <chr> <chr>
#   1     a    Ba
#   2     b    Bb
# So that does what I wanted.
# Let's check by displaying one
# of the elements.

t_nested_spread$A
#
#   #  A tibble: 2 x 2
#        l2   val
#     <chr> <chr>
#   1     a    Aa
#   2     b    Ab
# Good.

# So we can convert the outer level of
# a table t to a named list whose keys are
# the names in t[,1].

# I should note that spread_to_list() 
# is safe because nest() will ensure 
# that spread_to_list()'s argument 
# never has a name more than
# once in column 1.


# Let's make this into a function.

multinest <- function( t )
{
  colname_1 <- as.name( names(t)[1] )

  t %>% group_by( !!colname_1 ) %>%
        nest %>%
        spread_to_list    
}

t_nested_spread_ <- multinest( t )
identical( t_nested_spread, t_nested_spread_ )
#
# Which gives the same result as above,
# so multinest() appears OK.


# Now I want to make multinest() recursive so
# that it factors out the inner tables
# in the same way. I'll assume all the name
# columns are at the front of the table. I
# do have to give an extra argument saying
# how many there are.

multinest_2 <- function( t, no_of_name_cols )
{
  colname_1 <- as.name( names(t)[1] )
  #
  # For as.name and !! , see e.g. Henrik's
  # answer to https://stackoverflow.com/questions/26724124/standard-evaluation-in-dplyr-summarise-on-variable-given-as-a-character-string ,
  # "[u]se as.name if you have a character 
  # string that gives a variable name".

  listed <- t %>% group_by( !!colname_1 ) %>%
                  nest %>%
                  spread_to_list    

  if ( no_of_name_cols > 1 ) 
    return( map( listed, multinest_2, no_of_name_cols-1 ) )
  else
    return( listed )
}

identical( multinest_2( t, 1 ), t_nested_spread )
#
# So this function called for just one column
# works as before.


# Let's see what happens with two columns.

t_nested_spread_2 <- multinest_2( t, 2 )

for ( X in c('A','B') ) 
  for( Y in c('a','b') ) 
    cat( t_nested_spread_2[[X]][[Y]] ) 
#
# But ...
#   Error in cat(listed_2[[X]][[Y]]) : 
#   argument 1 (type 'list') cannot be handled by 'cat'

t_nested_spread_2$A$a
#   # A tibble: 1 x 1
#       val
#     <chr>
#   1    Aa

# The innermost list elements are tibbles, not
# simple values. That would be OK if I wanted to
# carry more than one value cell with each
# innermost element, but it wastes space when
# I have only one. What's causing this is the
# inner call to nest(), which is converting
# a tibble such as 
#   # A tibble: 2 x 2
#        l2   val
#     <chr> <chr>
#   1     a    Aa
#   2     b    Ab
# into
#   # A tibble: 2 x 2
#       l2             data
#    <chr>           <list>
#  1     a <tibble [1 x 1]>
#  2     b <tibble [1 x 1]>


# Let's rewrite to avoid this. If there's
# only one column left, just convert to list
# without calling nest().

multinest_3 <- function( t, no_of_name_cols )
{
  colname_1 <- as.name( names(t)[1] )
 
  if ( no_of_name_cols > 1 )
    listed <- t %>% group_by( !!colname_1 ) %>%
                    nest %>%
                    spread_to_list %>% 
                    map( multinest_3, no_of_name_cols-1 )
  else if ( no_of_name_cols == 1 )
    listed <- t %>% spread_to_list

  listed
}

identical( multinest_3( t, 1 ), t_nested_listed )
#
# This actually returns FALSE. So asked to nest
# just one column, the function doesn't do the same
# as its predecessors. It returns the list
#   list(A = "a", B = "a", A = "b", B = "b")
# But that's reasonable, because it's treating
# the second column as the values. So I'll not
# amend it.


# What happens with two columns?

t_nested_spread_3 <- multinest_3( t, 2 )

for ( X in c('A','B') ) 
  for( Y in c('a','b') ) 
    cat( t_nested_spread_3[[X]][[Y]] ) 
#
# Displays AaAbBaBb.
# So that looks OK. 

dput(t_nested_spread_3)
#
# So does a dput():
#   structure(list(A = structure(list(a = "Aa", b = "Ab"), .Names = c("a", 
#   "b")), B = structure(list(a = "Ba", b = "Bb"), .Names = c("a", 
#   "b"))), .Names = c("A", "B"))


# Now let's try three columns. 
# I'll reassign t for this.

t <- tribble( ~l1, ~l2, ~l3, ~val
            , 'A', 'a', '1', 'Aa1'
            , 'A', 'a', '2', 'Aa2'
            , 'A', 'b', '1', 'Ab1' 
            , 'A', 'b', '2', 'Ab2'
            , 'B', 'a', '1', 'Ba1'
            , 'B', 'a', '2', 'Ba2'
            , 'B', 'b', '1', 'Bb1' 
            , 'B', 'b', '2', 'Bb2'
            )

t_nested_spread_3 <- multinest_3( t, 3 )

for ( X in c('A','B') ) 
  for( Y in c('a','b') )
    for( Z in c('1','2') ) 
      cat( t_nested_spread_3[[X]][[Y]][[Z]] )
#
# Displays Aa1Aa2Ab1Ab2Ba1Ba2Bb1Bb2. Good.


# Now let's try on some real data, a fragment 
# of our VAT parameters.

vat_rates <- tribble( ~l1,   ~l2         , ~l3         , ~val
                    , 'VAT', 'Cigarettes', '' 	       , 'standard'
                    , 'VAT', 'Tobacco'   , ''          , 'standard'
                    , 'VAT', 'Narcotics' , ''	       , 0
                    , 'VAT', 'Clothing'  , 'Adult'     , 'standard'
                    , 'VAT', 'Clothing'  , 'Children'  , 0
                    , 'VAT', 'Clothing'  , 'Protective', 0
                    )

vat_rates_nested_spread <- multinest_3( vat_rates, 3 )
#
# But this goes awry on the first three table rows,
# where it generates results such as 
#   $VAT$Narcotics
#   $VAT$Narcotics[[1]]
#   [1] "0"
# The value is wrapped in an unwanted list.

# This is because I didn't check for names
# where some of the final components are
# empty. R has come across the '' and done its
# own thing. 


# I need to fix this. I'll do so below. If 
# the table passed to multinest() has '' 
# as its first element, return the final 
# element, the corresponding value. I don't think such a table can 
# ever have more than one row. If it did, it 
# would have been part of a table such as
#   Pars VAT ''  20
#   Pars VAT Low 17
# which would have to translate into the list
#   Pars$VAT <- 20
#   Pars$VAT$Low <- 17
# This would mean that Pars$VAT is 
# simultaneously a terminal and a non-terminal,
# which makes no sense.

multinest_4 <- function( t, no_of_name_cols )
{
  if ( t[[1,1]] == '' )
    return( t[[ 1, ncol(t) ]] )
  else {
    colname_1 <- as.name( names(t)[1] )

    if ( no_of_name_cols > 1 )
      listed <- t %>% group_by( !!colname_1 ) %>%
                      nest %>%
                      spread_to_list %>% 
                      map( multinest_4, no_of_name_cols-1 )
    else if ( no_of_name_cols == 1 )
      listed <- t %>% spread_to_list

    return( listed )
  }
}

vat_rates_nested_spread <- multinest_4( vat_rates, 3 )
#
#   $VAT
#   $VAT$Cigarettes
#   [1] "standard"
#
#   $VAT$Tobacco
#   [1] "standard"
#
#   $VAT$Narcotics
#   [1] "0"
#
#   $VAT$Clothing
#   $VAT$Clothing$Adult
#   [1] "standard"
#
#   $VAT$Clothing$Children
#   [1] "0"
#
#   $VAT$Clothing$Protective
#   [1] "0"
#
# Good.

To leave a comment for the author, please follow the link and comment on their blog: R – Jocelyn Ireson-Paine's Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)