Converting Lists of Lists of Lists to Data.Frames (or Tibbles)

Posted on February 6, 2021 by rstats-tips.net in R bloggers | 0 Comments

[This article was first published on rstats-tips.net, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I’m working on an R-package to access the data of a web service. So I have to handle large data I get back from an API call. The result is encoded in JSON-format which itself results in a large list of lists (of lists).

But I want to convert these lists of lists into a data.frame or tibble. Sounds easy …. Here are the caveats I came across.

Some sample data

Let’s first create some sample data. As I mentioned before the result of an API-call is a list of lists. The inner lists contain lists, too.

So here’s a function that generates such data.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15


options(tidyverse.quiet = TRUE)
library(tidyverse)

set.seed(21)

list_entry <- function(dummy) {
  list(
    sample(letters, 1),
    sample(1:100, 1),
    list(paste0("Sublist ", round(runif(1, 0, 100))), 
         paste0("Sublist ", round(runif(1, 0, 100)))
         ),
    runif(1, 0, 100)
  )
}

Let’s build a small list and another one we can use for performance measurement.

1
2
3
4
5


n_small <- 5
list_of_list <-lapply(as.list(1:n_small), list_entry)

n_large <- 100000
long_list_of_list <-lapply(as.list(1:n_large), list_entry)

Tidyverse

First we will use the tidyverse functions to convert the data.

Naive Way

1

list_of_list %>% as_tibble()

1

## Error: Columns 1, 2, 3, 4, and 5 must be named.

Uhh, the error occurs because the list does not contain named values. So we need to generate (or repair) the names.

1

list_of_list %>% as_tibble(.name_repair = "universal")

1
2
3
4
5
6


## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5

1
2
3
4
5
6
7


## # A tibble: 4 x 5
##   ...1       ...2       ...3       ...4       ...5      
##   <list>     <list>     <list>     <list>     <list>    
## 1 <chr [1]>  <chr [1]>  <chr [1]>  <chr [1]>  <chr [1]> 
## 2 <int [1]>  <int [1]>  <int [1]>  <int [1]>  <int [1]> 
## 3 <list [2]> <list [2]> <list [2]> <list [2]> <list [2]>
## 4 <dbl [1]>  <dbl [1]>  <dbl [1]>  <dbl [1]>  <dbl [1]>

Okay, that’s the wrong way. It generates the transposed version of what we want.

So use a version of all of these fantastic map-function from the purrr-package: (Read here for other fantastic stuff of purrr.)

1

list_of_list %>% map_dfr(as_tibble, .name_repair = "universal")

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25


## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13


## # A tibble: 10 x 4
##    ...1   ...2 ...3       ...4
##    <chr> <int> <list>    <dbl>
##  1 a         3 <chr [1]>  91.9
##  2 a         3 <chr [1]>  91.9
##  3 p        21 <chr [1]>  66.8
##  4 p        21 <chr [1]>  66.8
##  5 j        94 <chr [1]>  52.5
##  6 j        94 <chr [1]>  52.5
##  7 b        67 <chr [1]>  77.6
##  8 b        67 <chr [1]>  77.6
##  9 j        46 <chr [1]>  77.8
## 10 j        46 <chr [1]>  77.8

Look’s nice. But the original list consists of 5 rows. The result above is twice as long. So what happend? as_tibble generates for each entry of the list of the inner lists a row. That’s not what I’ve expected. I would prefer to get a another column for each entry.

Flatten the Inner List

When we apply as.data.frame to an inner list the list of each inner list is converted into two columns:

1
2
3
4
5


column_names <- letters[1:5]
list_of_list %>%
  map(function(x) {x <- as.data.frame(x); colnames(x) <- column_names; x}) %>%
  bind_rows() %>% 
  as_tibble()

1
2
3
4
5
6
7
8


## # A tibble: 5 x 5
##   a         b c          d              e
##   <chr> <int> <chr>      <chr>      <dbl>
## 1 a         3 Sublist 18 Sublist 96  91.9
## 2 p        21 Sublist 99 Sublist 85  66.8
## 3 j        94 Sublist 17 Sublist 4   52.5
## 4 b        67 Sublist 1  Sublist 61  77.6
## 5 j        46 Sublist 79 Sublist 57  77.8

Yeah, that’s great. That’s what we want. We can even do it with map_dfr:

1
2
3


column_names <- letters[1:5]
list_of_list %>% map_dfr(function(x) {x <- as.data.frame(x); colnames(x) <- column_names; x}) %>% 
  as_tibble()

1
2
3
4
5
6
7
8


## # A tibble: 5 x 5
##   a         b c          d              e
##   <chr> <int> <chr>      <chr>      <dbl>
## 1 a         3 Sublist 18 Sublist 96  91.9
## 2 p        21 Sublist 99 Sublist 85  66.8
## 3 j        94 Sublist 17 Sublist 4   52.5
## 4 b        67 Sublist 1  Sublist 61  77.6
## 5 j        46 Sublist 79 Sublist 57  77.8

Performance

But how performant is this code. Let’s check it with the long_list_of_list and measere the time with the package tictoc.

1
2
3
4
5
6


library(tictoc)

tic()
column_names <- letters[1:5]
long_list_of_list %>% map_dfr(function(x) {x <- as.data.frame(x); colnames(x) <- column_names; x}) %>% 
  as_tibble()

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14


## # A tibble: 100,000 x 5
##    a         b c          d              e
##    <chr> <int> <chr>      <chr>      <dbl>
##  1 w        71 Sublist 93 Sublist 51  74.5
##  2 p        98 Sublist 65 Sublist 34  50.8
##  3 a        49 Sublist 51 Sublist 6   15.1
##  4 u        92 Sublist 77 Sublist 41  87.0
##  5 e        91 Sublist 63 Sublist 83  13.0
##  6 b        68 Sublist 7  Sublist 3   92.2
##  7 x        63 Sublist 13 Sublist 82  92.3
##  8 l        50 Sublist 53 Sublist 97  52.8
##  9 m        85 Sublist 92 Sublist 23  92.8
## 10 t        53 Sublist 99 Sublist 6   86.3
## # … with 99,990 more rows

1

toc()

1

## 57.617 sec elapsed

WHAT? One minute for 100,000 rows? That’s long. What can we change?

1
2
3
4
5


library(tictoc)
tic()
column_names <- letters[1:5]
long_list_of_list %>% map_dfr(function(x) {x = unlist(x); x = set_names(x, column_names); x}) %>% 
  as_tibble()

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14


## # A tibble: 100,000 x 5
##    a     b     c          d          e               
##    <chr> <chr> <chr>      <chr>      <chr>           
##  1 w     71    Sublist 93 Sublist 51 74.5060192188248
##  2 p     98    Sublist 65 Sublist 34 50.765589158982 
##  3 a     49    Sublist 51 Sublist 6  15.1016460731626
##  4 u     92    Sublist 77 Sublist 41 87.0233365567401
##  5 e     91    Sublist 63 Sublist 83 13.0497927311808
##  6 b     68    Sublist 7  Sublist 3  92.2474714694545
##  7 x     63    Sublist 13 Sublist 82 92.2626771498471
##  8 l     50    Sublist 53 Sublist 97 52.8331008506939
##  9 m     85    Sublist 92 Sublist 23 92.7664576098323
## 10 t     53    Sublist 99 Sublist 6  86.2593436148018
## # … with 99,990 more rows

1

toc()

1

## 2.7 sec elapsed

Three seconds – that’s better!

But the downside is we lost the different types of the columns. All columns are converted to character.

Don’t lose the types

The solution is still within purrr: We can use the functionflatten:

1
2
3
4
5


library(tictoc)
tic()
column_names <- letters[1:5]
long_list_of_list %>% map_dfr(function(x) {x = flatten(x); x = set_names(x, column_names); x}) %>% 
  as_tibble()

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14


## # A tibble: 100,000 x 5
##    a         b c          d              e
##    <chr> <int> <chr>      <chr>      <dbl>
##  1 w        71 Sublist 93 Sublist 51  74.5
##  2 p        98 Sublist 65 Sublist 34  50.8
##  3 a        49 Sublist 51 Sublist 6   15.1
##  4 u        92 Sublist 77 Sublist 41  87.0
##  5 e        91 Sublist 63 Sublist 83  13.0
##  6 b        68 Sublist 7  Sublist 3   92.2
##  7 x        63 Sublist 13 Sublist 82  92.3
##  8 l        50 Sublist 53 Sublist 97  52.8
##  9 m        85 Sublist 92 Sublist 23  92.8
## 10 t        53 Sublist 99 Sublist 6   86.3
## # … with 99,990 more rows

1

toc()

1

## 2.661 sec elapsed

To leave a comment for the author, please follow the link and comment on their blog: rstats-tips.net.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

Converting Lists of Lists of Lists to Data.Frames (or Tibbles)

Some sample data

Tidyverse

Naive Way

Flatten the Inner List

Performance

Don’t lose the types

Related

Some sample data

Tidyverse

Naive Way

Flatten the Inner List

Performance

Don’t lose the types

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)