How Best to Convert a Names-Values Tibble to a Named List?

October 6, 2017
By

(This article was first published on R – Jocelyn Ireson-Paine's Blog, and kindly contributed to R-bloggers)

Here, in the spirit of my “Experiments with by_row()” post, are some experiments in writing and timing
a function spread_to_list that converts a two-column tibble such as:

x 1
y 2 
z 3
t 4

to a named list:

list( x=1, y=2, z=3, t=4 )

I need this for processing the parameter sheets shown in that
by_row post, and I’ll explain why later. In this post,
I’m just interested in how best to define spread_to_list.
The best implementation looks like either
spread_to_list_3 or spread_to_list_4 below.

# try_spread_to_list.R
#
# Consider a tibble t with two columns,
# where each cell in the second column
# represents the value associated with
# the string (assumed to be its name) 
# in the first column.
#
# I want to define a function spread_to_list() 
# which converts t into a named list whose names
# are the names in the first column, and whose
# values are the values in the second column.
#
# For example, if t is:
#        names no_of_days
#    DaysInMay         31
#   DaysInJune         30
# then
#   spread_to_list(t) 
# would be the list
#   list( DaysInMay = 31, DaysInJune = 30 )
#
# The code here tries various ways
# of implementing spread_to_list,
# and benchmarks them. Two are variants
# of one another, using spread() and converting
# its result. Another two take the values
# column as a list, and call setNames() to 
# convert to a named list.


library( tidyverse )
library( microbenchmark )
library( stringr )


t <- tribble( ~a , ~b 
            , 'x',  1
            , 'y',  2 
            , 'z',  3
            , 't',  4
            )
#
# I'm going to try various ways
# of implementing spread_to_list()
# on t.


# First, what happens if I call spread()?

s <- spread( t, a, b )
#
# s becomes a one-row tibble:
#         t     x     y     z
#   1     4     1     2     3


# How can I convert s to a named list?

# An obvious way is to call map().
# Let's see what the function argument to
# map() gets passed if I call map() on s.

map( s, show )
#
# Displays
#  [1] 4
#  [1] 1
#  [1] 2
#  [1] 3
# So it gets passed a column of the
# tibble as an atomic vector.

map( s, function(x)x )
#
# Returns a list of these elements:
#   list(t = 4, x = 1, y = 2, z = 3)
# That's because map() is defined to return 
# lists. So the call above uses it merely
# as a type converter.

map( s, identity )
#
# Does the same. identity() is a built-in
# identity function.


# But maybe I can avoid mapping. In my
# experiments with by_row(), 
# http://www.j-paine.org/blog/2017/10/experiments-with-by_row.html ,
# I discovered that as.list() will convert
# a tibble to a named list.

as.list( s )
#
# Also returns 
#   list(t = 4, x = 1, y = 2, z = 3).


# But can I avoid spread() altogether?
# Browsing the discussion groups gave me
# the idea of trying setNames().
# Let's try that, passing t's names as its
# second argunent and t's values as its
# first.

setNames( t[[2]], t[[1]] )
# 
# Gives me an atomic named vector. 

# I need to convert it to a list.
# One way is for the argument to setNames()
# to be a list, because that's specified to
# make it return a list.

setNames( as.list( t[[2]] ), t[[1]] )
#
# Returns the list
#   list(x = 1, y = 2, z = 3, t = 4)

# But I could convert the result instead.

as.list( setNames( t[[2]], t[[1]] ) )
#
# Also returns
#   list(x = 1, y = 2, z = 3, t = 4)


# Let's try these four implementations.
# First, define the functions.

spread_to_list_1 <- function( t )
{
  colname1 <- names( t )[[1]]
  colname2 <- names( t )[[2]]
  t %>% 
    spread( !!as.name(colname1), !!as.name(colname2) ) %>% 
    map( identity )
}

spread_to_list_2 <- function( t )
{
  colname1 <- names( t )[[1]]
  colname2 <- names( t )[[2]]
  t %>% 
    spread( !!as.name(colname1), !!as.name(colname2) ) %>% 
    as.list
}

spread_to_list_3 <- function( t )
{
  setNames( as.list( t[[2]] ), t[[1]] )
}

spread_to_list_4 <- function( t )
{
  as.list( setNames( t[[2]], t[[1]] ) )
}


# Now try them.

s1 <- spread_to_list_1( t )
s2 <- spread_to_list_2( t )
s3 <- spread_to_list_3( t )
s4 <- spread_to_list_4( t )

dput( s1 )
dput( s2 )
dput( s3 )
dput( s4 )

identical( s1, s2 )
identical( s1, s3 )
identical( s1, s4 )


# They all return named lists, but the order
# of elements is different for the spread()-based
# versions than from the as.list()-based ones.
# (That was obvious earlier, actually.)
# So I'll sort the lists, then test that
# they're identical. I'll also microbenchmark
# the functions.

sort_list <- function(l)
{ 
  sort( unlist( l ) )
}

identical( sort_list(s1), sort_list(s2) )%>%show
identical( sort_list(s1), sort_list(s3) )%>%show
identical( sort_list(s1), sort_list(s4) )%>%show

mbres <- microbenchmark( spread_to_list_1( t )
                       , spread_to_list_2( t )
                       , spread_to_list_3( t )
                       , spread_to_list_4( t )
                       )
print( mbres )


# Now let's microbenchmark the functions applied
# to bigger tibbles. I'll generate random name-value
# tibbles of sizes n, where n is defined by the
# vector in the 'for' condition.

for ( n in c(10,30,100,300) ) {

  cat( "Trying ", n, "row tibble\n" )

  names <- replicate( n, str_c(sample(letters,5,replace=FALSE),collapse="") )
  #
  # Generate n random alphabetic strings.
  # From Dirk Eddelbuettel's answer to 
  # https://stackoverflow.com/questions/1439513/creating-a-sequential-list-of-letters-with-r .

  values <- runif( n, 1, 100 )
  #
  # Generate n random values.

  t <- tibble( names=names
             , values=values
             )
  #
  # Use these to make a random tibble with
  # two columns and n rows.

  identical( sort_list(s1), sort_list(s2) )%>%show
  identical( sort_list(s1), sort_list(s3) )%>%show
  identical( sort_list(s1), sort_list(s4) )%>%show

  mbres <- microbenchmark( spread_to_list_1( t )
                         , spread_to_list_2( t )
                         , spread_to_list_3( t )
                         , spread_to_list_4( t )
                         )
  print( mbres )
}


# Here are the microbenchmark results for the
# 300-row tibble:
#   Unit: microseconds
#                  expr       min        lq    
#   spread_to_list_1(t) 17738.631 17928.946 
#   spread_to_list_2(t) 14582.521 14805.888 
#   spread_to_list_3(t)    36.223    40.901   
#   spread_to_list_4(t)    35.317    40.146 
#                  expr        mean    median         uq
#   spread_to_list_1(t) 18668.44595 18221.133 19532.8080
#   spread_to_list_2(t) 15386.05835 15050.685 16355.4180
#   spread_to_list_3(t)    46.67244    47.089    51.4655
#   spread_to_list_4(t)    45.77894    45.882    51.6165
#         max neval
#   21477.003   100
#   17314.838   100
#      64.294   100
#      63.087   100

# So the two spread() versions are much slower.
# Converting the spread() result with mapping is
# slower than with as.list(), probably unsurprisingly.
# The two setNames() versions are much faster. 
# It doesn't seem to matter whether we type-convert
# to list by making setNames()'s first argument
# a list, or by making its result one.

To leave a comment for the author, please follow the link and comment on their blog: R – Jocelyn Ireson-Paine's Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)