A package for tidying nested lists

[This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Data == knowledge! Much of the data we use, whether it be from
government repositories, social media, GitHub, or e-commerce sites comes
from public-facing APIs. The quantity of data available is truly
staggering, but munging JSON output into a format that is easily
analyzable in R is an equally staggering undertaking. When JSON is
turned into an R object, it usually becomes a deeply nested list riddled
with missing values that is difficult to untangle into a tidy format.
Moreover, every API presents its own challenges; code you’ve written to
clean up data from GitHub isn’t necessarily going to work on Twitter
data, as each API spews data out in its own unique, headache-inducing
nested list structure. To ease and generalize this process, Amanda
Dobbyn proposed
an
unconf18 project for a general API response tidier! Welcome roomba,
our first stab at easing the process of tidying nested lists!


Drawing

roomba will eventually be able to walk nested lists in a variety of
different structures from JSON output, replace NULL or .empty values
with NAs or a user-specified value, and return a tibble with names
matching a user-specified list. Of course, in two days we haven’t
fully achieved this vision, but we’re off to a promising start.

The birth of roomba

It was clear Amanda was on to something good by the lively discussion in
the #runconf18 issues
repository leading up to the unconf. Thanks to input from Jenny Bryan,
Jim Hester, Carl Boettinger, Scott Chamberlain, Bob Rudis, and Noam
Ross, we had a lot of ideas to work with when the unconf began.
Fortunately, Jim already had a function called dfs_idx()
(here)
written to perform depth-first searches of nested lists from the GitNub
GraphQL API
. With the core
list-traversal code out of the way, we split our efforts between
developing a usable interface, stockpiling .JSON files to test on, and
developing a Shiny app.

What’s working

We’ve got the basic structure of roomba sorted out, and you should
install it from GitHub to try out! Here are a few of the examples we’ve
put together.

library(roomba)
#load twitter data example
data(twitter_data)

#roomba-fy!
roomba(twitter_data, c("created_at", "name"))

## # A tibble: 24 x 2
##    name                 created_at                    
##                                             
##  1 Code for America     Mon Aug 10 18:59:29 +0000 2009
##  2 Ben Lorica     Mon Dec 22 22:06:18 +0000 2008
##  3 Dan Sholler          Thu Apr 03 20:09:24 +0000 2014
##  4 Code for America     Mon Aug 10 18:59:29 +0000 2009
##  5 FiveThirtyEight      Tue Jan 21 21:39:32 +0000 2014
##  6 Digital Impact       Wed Oct 07 21:10:53 +0000 2009
##  7 Drew Williams        Thu Aug 07 18:41:29 +0000 2014
##  8 joe                  Fri May 29 13:25:25 +0000 2009
##  9 Data Analysts 4 Good Wed May 07 16:55:33 +0000 2014
## 10 Ryan Frederick       Sun Mar 01 19:06:53 +0000 2009
## # ... with 14 more rows

And just the first element of the twitter_data list will show you
that roomba has simplified this process quite a bit.

twitter_data[[1]]

## $created_at
## [1] "Mon May 21 17:58:09 +0000 2018"
## 
## $id
## [1] 9.98624e+17
## 
## $id_str
## [1] "998623997397876743"
## 
## $text
## [1] "Could a program like food stamps have a Cambridge Analytica moment? How do we allow for the innovation that data pl… https://t.co/7tVf1qmNmq"
## 
## $truncated
## [1] TRUE
## 
## $entities
## $entities$hashtags
## list()
## 
## $entities$symbols
## list()
## 
## $entities$user_mentions
## list()
## 
## $entities$urls
## $entities$urls[[1]]
## $entities$urls[[1]]$url
## [1] "https://t.co/7tVf1qmNmq"
## 
## $entities$urls[[1]]$expanded_url
## [1] "https://twitter.com/i/web/status/998623997397876743"
## 
## $entities$urls[[1]]$display_url
## [1] "twitter.com/i/web/status/9…"
## 
## $entities$urls[[1]]$indices
## $entities$urls[[1]]$indices[[1]]
## [1] 117
## 
## $entities$urls[[1]]$indices[[2]]
## [1] 140
## 
## 
## 
## 
## 
## $source
## [1] "TweetDeck"
## 
## $in_reply_to_status_id
## NULL
## 
## $in_reply_to_status_id_str
## NULL
## 
## $in_reply_to_user_id
## NULL
## 
## $in_reply_to_user_id_str
## NULL
## 
## $in_reply_to_screen_name
## NULL
## 
## $user
## $user$id
## [1] 64482503
## 
## $user$id_str
## [1] "64482503"
## 
## $user$name
## [1] "Code for America"
## 
## $user$screen_name
## [1] "codeforamerica"
## 
## $user$location
## [1] "San Francisco, California"
## 
## $user$description
## [1] "Government can work for the people, by the people, in the 21st century. Help us make it so."
## 
## $user$url
## [1] "https://t.co/l9lokka0rJ"
## 
## $user$entities
## $user$entities$url
## $user$entities$url$urls
## $user$entities$url$urls[[1]]
## $user$entities$url$urls[[1]]$url
## [1] "https://t.co/l9lokka0rJ"
## 
## $user$entities$url$urls[[1]]$expanded_url
## [1] "http://codeforamerica.org"
## 
## $user$entities$url$urls[[1]]$display_url
## [1] "codeforamerica.org"
## 
## $user$entities$url$urls[[1]]$indices
## $user$entities$url$urls[[1]]$indices[[1]]
## [1] 0
## 
## $user$entities$url$urls[[1]]$indices[[2]]
## [1] 23
## 
## 
## 
## 
## 
## $user$entities$description
## $user$entities$description$urls
## list()
## 
## 
## 
## $user$protected
## [1] FALSE
## 
## $user$followers_count
## [1] 49202
## 
## $user$friends_count
## [1] 1716
## 
## $user$listed_count
## [1] 2659
## 
## $user$created_at
## [1] "Mon Aug 10 18:59:29 +0000 2009"
## 
## $user$favourites_count
## [1] 4490
## 
## $user$utc_offset
## [1] -25200
## 
## $user$time_zone
## [1] "Pacific Time (US & Canada)"
## 
## $user$geo_enabled
## [1] TRUE
## 
## $user$verified
## [1] TRUE
## 
## $user$statuses_count
## [1] 15912
## 
## $user$lang
## [1] "en"
## 
## $user$contributors_enabled
## [1] FALSE
## 
## $user$is_translator
## [1] FALSE
## 
## $user$is_translation_enabled
## [1] FALSE
## 
## $user$profile_background_color
## [1] "EBEBEB"
## 
## $user$profile_background_image_url
## [1] "http://abs.twimg.com/images/themes/theme7/bg.gif"
## 
## $user$profile_background_image_url_https
## [1] "https://abs.twimg.com/images/themes/theme7/bg.gif"
## 
## $user$profile_background_tile
## [1] FALSE
## 
## $user$profile_image_url
## [1] "http://pbs.twimg.com/profile_images/615534833645678592/iAO_Lytr_normal.jpg"
## 
## $user$profile_image_url_https
## [1] "https://pbs.twimg.com/profile_images/615534833645678592/iAO_Lytr_normal.jpg"
## 
## $user$profile_banner_url
## [1] "https://pbs.twimg.com/profile_banners/64482503/1497895952"
## 
## $user$profile_link_color
## [1] "CF1B41"
## 
## $user$profile_sidebar_border_color
## [1] "FFFFFF"
## 
## $user$profile_sidebar_fill_color
## [1] "F3F3F3"
## 
## $user$profile_text_color
## [1] "333333"
## 
## $user$profile_use_background_image
## [1] FALSE
## 
## $user$has_extended_profile
## [1] FALSE
## 
## $user$default_profile
## [1] FALSE
## 
## $user$default_profile_image
## [1] FALSE
## 
## $user$following
## [1] TRUE
## 
## $user$follow_request_sent
## [1] FALSE
## 
## $user$notifications
## [1] FALSE
## 
## $user$translator_type
## [1] "none"
## 
## 
## $geo
## NULL
## 
## $coordinates
## NULL
## 
## $place
## NULL
## 
## $contributors
## NULL
## 
## $is_quote_status
## [1] FALSE
## 
## $retweet_count
## [1] 0
## 
## $favorite_count
## [1] 0
## 
## $favorited
## [1] FALSE
## 
## $retweeted
## [1] FALSE
## 
## $possibly_sensitive
## [1] FALSE
## 
## $possibly_sensitive_appealable
## [1] FALSE
## 
## $lang
## [1] "en"

We created a Shiny app too, which in its current state allows you to
select a .Rda or .JSON file, pick two variables, and create a
scatterplot of them.

Run the app like this:

shiny_roomba()

What’s not

Of course, in two days we weren’t able to build a magical
one-size-fits-all solution to every API response data headache. Right
now, the main barrier to usability is that both the roomba() function
and shiny_roomba() app only work on sub-list items of the same length
and same data type stored at the same depth. To illustrate on the
twitter_data:

#This doesn't work because "user" has data of different types and lengths
roomba(twitter_data, c("user"))

## # A tibble: 1,007 x 1
##    user      
##        
##  1  
##  2  
##  3  
##  4  
##  5  
##  6  
##  7  
##  8 
##  9  
## 10  
## # ... with 997 more rows

#This doesn't work because "name" and "retweet_count" are at different depths.
roomba(twitter_data, c("name","retweet_count"))

## # A tibble: 0 x 0

In addition, we’ve got some features we want to add, such as handling a
larger variety of column names (i.e. passing a string for a single
column name, keeping all values even if they are all NULL). We would
love your feedback on other things we can add (open an issue in our Git repository)!

The team

Amanda Dobbyn
Job: Data Scientist at Earlybird Software
Project contributions: initial GH issue, package name, wrapper for
dfs_idx()

Jim Hester
Job: Software Engineer at RStudio
Project
contributions: dfs_idx() and remove_nulls() functions, package
building, README, and debugging

Christine Stawitz
Job: Postdoctoral researcher at University of Washington’s School
of Aquatic and Fishery Sciences
Project contributions: Shiny app,
README and blog post writing

Laura DeCicco
Job: Data Scientist at U.S. Geological Survey

Project contributions: Fixing merge conflicts 🙂

Isabella Velasquez
Job: Data Analyst at the Bill & Melinda
Gates Foundation
Project contributions: hex sticker!

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)