How Best to Convert a Names-Values Tibble to a Named List?
[This article was first published on R – Jocelyn Ireson-Paine's Blog, and kindly contributed to R-bloggers.]
Here, in the spirit of my “Experiments with by_row()” post, are some experiments in writing and timing a function spread_to_list that converts a two-column tibble such as:
x 1
y 2
z 3
t 4
to a named list:
list( x=1, y=2, z=3, t=4 )
I need this for processing the parameter sheets shown in that by_row post, and I’ll explain why later. In this post, I’m just interested in how best to define spread_to_list.
The best implementation looks like either spread_to_list_3 or spread_to_list_4 below.
# try_spread_to_list.R
#
# Consider a tibble t with two columns,
# where each cell in the second column
# represents the value associated with
# the string (assumed to be its name)
# in the first column.
#
# I want to define a function spread_to_list()
# which converts t into a named list whose names
# are the names in the first column, and whose
# values are the values in the second column.
#
# For example, if t is:
# names no_of_days
# DaysInMay 31
# DaysInJune 30
# then
# spread_to_list(t)
# would be the list
# list( DaysInMay = 31, DaysInJune = 30 )
#
# The code here tries various ways
# of implementing spread_to_list,
# and benchmarks them. Two are variants
# of one another, using spread() and converting
# its result. Another two take the values
# column as a list, and call setNames() to
# convert to a named list.
library( tidyverse )
library( microbenchmark )
library( stringr )
t <- tribble( ~a , ~b
, 'x', 1
, 'y', 2
, 'z', 3
, 't', 4
)
#
# I'm going to try various ways
# of implementing spread_to_list()
# on t.
# First, what happens if I call spread()?
s <- spread( t, a, b )
#
# s becomes a one-row tibble, with its
# columns in alphabetical order:
#       t     x     y     z
# 1     4     1     2     3
# How can I convert s to a named list?
# An obvious way is to call map().
# Let's see what the function argument to
# map() gets passed if I call map() on s.
map( s, show )
#
# Displays
# [1] 4
# [1] 1
# [1] 2
# [1] 3
# So it gets passed a column of the
# tibble as an atomic vector.
map( s, function(x)x )
#
# Returns a list of these elements:
# list(t = 4, x = 1, y = 2, z = 3)
# That's because map() is defined to return
# lists. So the call above uses it merely
# as a type converter.
map( s, identity )
#
# Does the same. identity() is a built-in
# identity function.
# But maybe I can avoid mapping. In my
# experiments with by_row(),
# http://www.j-paine.org/blog/2017/10/experiments-with-by_row.html ,
# I discovered that as.list() will convert
# a tibble to a named list.
as.list( s )
#
# Also returns
# list(t = 4, x = 1, y = 2, z = 3).
# But can I avoid spread() altogether?
# Browsing the discussion groups gave me
# the idea of trying setNames().
# Let's try that, passing t's names as its
# second argument and t's values as its
# first.
setNames( t[[2]], t[[1]] )
#
# Gives me an atomic named vector.
# I need to convert it to a list.
# One way is to pass setNames() a list as
# its first argument: it returns that object
# with the names attached, so a list stays a list.
setNames( as.list( t[[2]] ), t[[1]] )
#
# Returns the list
# list(x = 1, y = 2, z = 3, t = 4)
# But I could convert the result instead.
as.list( setNames( t[[2]], t[[1]] ) )
#
# Also returns
# list(x = 1, y = 2, z = 3, t = 4)
# Let's try these four implementations.
# First, define the functions.
spread_to_list_1 <- function( t )
{
  colname1 <- names( t )[[1]]
  colname2 <- names( t )[[2]]
  t %>%
    spread( !!as.name(colname1), !!as.name(colname2) ) %>%
    map( identity )
}
spread_to_list_2 <- function( t )
{
  colname1 <- names( t )[[1]]
  colname2 <- names( t )[[2]]
  t %>%
    spread( !!as.name(colname1), !!as.name(colname2) ) %>%
    as.list
}
spread_to_list_3 <- function( t )
{
  setNames( as.list( t[[2]] ), t[[1]] )
}
spread_to_list_4 <- function( t )
{
  as.list( setNames( t[[2]], t[[1]] ) )
}
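# A further variant, not included in the timings
# below. This is only a sketch, assuming tidyr
# 1.1.0 or later, where pivot_wider() supersedes
# spread(). The names spread_to_list_5 and demo
# are hypothetical, introduced for illustration.
# pivot_wider() is documented to order its new
# columns by first appearance unless names_sort
# is set, so the list should come back in row
# order rather than alphabetical order.
library( tidyr )
library( tibble )
spread_to_list_5 <- function( t )
{
  colname1 <- names( t )[[1]]
  colname2 <- names( t )[[2]]
  wide <- pivot_wider( t
                     , names_from = all_of( colname1 )
                     , values_from = all_of( colname2 )
                     )
  as.list( wide )
}
demo <- tribble( ~a , ~b
               , 'x', 1
               , 'y', 2
               , 'z', 3
               , 't', 4
               )
names( spread_to_list_5( demo ) )
#
# Shows the order in which the names come back.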
# Now try them.
s1 <- spread_to_list_1( t )
s2 <- spread_to_list_2( t )
s3 <- spread_to_list_3( t )
s4 <- spread_to_list_4( t )
dput( s1 )
dput( s2 )
dput( s3 )
dput( s4 )
identical( s1, s2 )
identical( s1, s3 )
identical( s1, s4 )
# They all return named lists, but the order
# of elements is different for the spread()-based
# versions than for the setNames()-based ones,
# because spread() sorts its new columns by name.
# (That was obvious earlier, actually.)
# So I'll sort the lists, then test that
# they're identical. I'll also microbenchmark
# the functions.
sort_list <- function(l)
{
  sort( unlist( l ) )
}
identical( sort_list(s1), sort_list(s2) )%>%show
identical( sort_list(s1), sort_list(s3) )%>%show
identical( sort_list(s1), sort_list(s4) )%>%show
mbres <- microbenchmark( spread_to_list_1( t )
, spread_to_list_2( t )
, spread_to_list_3( t )
, spread_to_list_4( t )
)
print( mbres )
# Now let's microbenchmark the functions applied
# to bigger tibbles. I'll generate random name-value
# tibbles of sizes n, where n is defined by the
# vector in the 'for' condition.
for ( n in c(10,30,100,300) ) {
  cat( "Trying", n, "row tibble\n" )
  names <- replicate( n, str_c(sample(letters,5,replace=FALSE),collapse="") )
  #
  # Generate n random alphabetic strings.
  # From Dirk Eddelbuettel's answer to
  # https://stackoverflow.com/questions/1439513/creating-a-sequential-list-of-letters-with-r .
  values <- runif( n, 1, 100 )
  #
  # Generate n random values.
  t <- tibble( names=names
             , values=values
             )
  #
  # Use these to make a random tibble with
  # two columns and n rows.
  s1 <- spread_to_list_1( t )
  s2 <- spread_to_list_2( t )
  s3 <- spread_to_list_3( t )
  s4 <- spread_to_list_4( t )
  #
  # Recompute the four lists for this tibble,
  # so that the checks below test it rather
  # than the four-row tibble from earlier.
  identical( sort_list(s1), sort_list(s2) )%>%show
  identical( sort_list(s1), sort_list(s3) )%>%show
  identical( sort_list(s1), sort_list(s4) )%>%show
  mbres <- microbenchmark( spread_to_list_1( t )
                         , spread_to_list_2( t )
                         , spread_to_list_3( t )
                         , spread_to_list_4( t )
                         )
  print( mbres )
}
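# A caveat about the random names, sketched in
# base R under the assumption that a name might
# repeat: sample() makes a collision between two
# rows unlikely, but not impossible. spread()
# stops with an error when a key repeats in a
# two-column tibble like this, while setNames()
# accepts duplicates silently, and $ then returns
# only the first match. The helper
# check_unique_names() and the list dup are
# hypothetical names, introduced for illustration.
check_unique_names <- function( nms )
{
  dup_at <- anyDuplicated( nms )
  if ( dup_at > 0 )
    stop( "Duplicate name: ", nms[[dup_at]] )
  invisible( TRUE )
}
dup <- setNames( as.list( c(1,2) ), c('x','x') )
dup$x
#
# Returns 1: the second 'x' is shadowed.
check_unique_names( c('x','y','z') )
#
# Passes silently. Calling it on names( dup )
# would stop with "Duplicate name: x".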
# Here are the microbenchmark results for the
# 300-row tibble:
# Unit: microseconds
#                 expr       min        lq        mean    median         uq       max neval
#  spread_to_list_1(t) 17738.631 17928.946 18668.44595 18221.133 19532.8080 21477.003   100
#  spread_to_list_2(t) 14582.521 14805.888 15386.05835 15050.685 16355.4180 17314.838   100
#  spread_to_list_3(t)    36.223    40.901    46.67244    47.089    51.4655    64.294   100
#  spread_to_list_4(t)    35.317    40.146    45.77894    45.882    51.6165    63.087   100
# So the two spread() versions are much slower:
# their median times are over 300 times those of
# the setNames() versions. Converting the spread()
# result with mapping is slower than with
# as.list(), probably unsurprisingly.
# Between the two setNames() versions, it doesn't
# seem to matter whether we type-convert
# to list by making setNames()'s first argument
# a list, or by making its result one.
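One more route that the script above does not time is tibble's own deframe(), which turns a two-column data frame into a named vector, taking names from the first column and values from the second. What follows is only a sketch, assuming a tibble version that exports deframe(). The names params and spread_to_list_deframe are hypothetical, made up for this illustration, and how its speed compares with the setNames() versions has not been measured here.

```r
library( tibble )

# Hypothetical example data, mirroring the
# four-row tibble used in the script.
params <- tribble( ~a , ~b
                 , 'x', 1
                 , 'y', 2
                 , 'z', 3
                 , 't', 4
                 )

# deframe() gives a named vector here, because
# the value column is atomic. as.list() then
# converts that to a named list.
spread_to_list_deframe <- function( t )
{
  as.list( deframe( t ) )
}

l <- spread_to_list_deframe( params )
dput( l )

# Compare with the setNames() approach.
identical( l, setNames( as.list( params[[2]] ), params[[1]] ) )
```

Like the setNames() versions, this reads the two columns by position, so the column names themselves don't matter.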