Working with tree-based hierarchies using data.tree

[This article was first published on rstats-tips.net, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Lately I tried to visualize an hierarchy with Tableau Desktop. The problem was that the hierarchy had a variable depth because it was tree-based. Each row had an id and a parent_id. Normally hierarchies in Tableau are defined by pulling some fields together, such as product category, product group and product id.

Handling tree-based hierarchies seems to be a lot more complex. I found a plugin at https://github.com/tableau/extension-hierarchy-navigator-sandboxed but this only works online.

So I asked myself how I can handle this using R. I found the R-package data.tree at https://github.com/gluc/data.tree. I want to describe how I use this package to preprocess my data.

Some tree-based data

Let’s create some tree-based data.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
options(tidyverse.quiet = TRUE)
library(tidyverse)

geography <- tribble(
  ~id, ~parent_id, ~area, ~some_additional_data,
  1, NA, "Europe", "That's a continent",
  2, 1, "Germany", "That's a country",
  3, 1, "France",  "Oh, yeah, another country",
  4, 1, "Denmark", "Oh dear, how many contries do you know",
  5, 2, "Hesse",   "ah, that's called Bundesland",
  6, 5, "Wiesbaden", "and that's the capital of Hesse",
  7, 2, "Berlin", "that's the capital of Germany and also a Bundesland",
  8, 5, "Main-Kinzig-Kreis", "that's a Landkreis",
  9, 8, "Gelnhausen", "and the Landkreis does have a capital called Kreisstadt",
 10, 3, "Paris", "The city of Love",
 11, 2, "Bavaria", "another Bundesland",
 12, 11, "Munich", "and its capital"
)

geography
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
## # A tibble: 12 x 4
##       id parent_id area            some_additional_data                         
##    <dbl>     <dbl> <chr>           <chr>                                        
##  1     1        NA Europe          That's a continent                           
##  2     2         1 Germany         That's a country                             
##  3     3         1 France          Oh, yeah, another country                    
##  4     4         1 Denmark         Oh dear, how many contries do you know       
##  5     5         2 Hesse           ah, that's called Bundesland                 
##  6     6         5 Wiesbaden       and that's the capital of Hesse              
##  7     7         2 Berlin          that's the capital of Germany and also a Bun…
##  8     8         5 Main-Kinzig-Kr… that's a Landkreis                           
##  9     9         8 Gelnhausen      and the Landkreis does have a capital called…
## 10    10         3 Paris           The city of Love                             
## 11    11         2 Bavaria         another Bundesland                           
## 12    12        11 Munich          and its capital

The Task

So the first two columns id and parent_id define the tree. How can we convert this format into a format where the hierarchy consists of n columns with the highest level in column Level_1, the second highest in Level_2 and so on?

The code should not depend on a given level of depth.

data.tree

Let’s take a look at data.tree:

1
2
3
library(data.tree)

FromDataFrameNetwork(geography)
1
2
##   levelName
## 1        NA

That doesn’t look promising. But wait, let’s replace NA by 0 and for further investigation let’s save id in another column called my_id:

1
2
3
4
geography %>% 
  replace_na(list(parent_id = 0)) %>%
  mutate(my_id = id) %>% 
  FromDataFrameNetwork()
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
##                levelName
## 1  0                    
## 2   °--1                
## 3       ¦--2            
## 4       ¦   ¦--5        
## 5       ¦   ¦   ¦--6    
## 6       ¦   ¦   °--8    
## 7       ¦   ¦       °--9
## 8       ¦   ¦--7        
## 9       ¦   °--11       
## 10      ¦       °--12   
## 11      ¦--3            
## 12      ¦   °--10       
## 13      °--4

Oh great, that’s the tree structure we had in mind. Now let’s build a data.frame:

1
2
3
4
5
geography %>% 
  replace_na(list(parent_id = 0)) %>%
  mutate(my_id = id) %>% 
  FromDataFrameNetwork() %>% 
  ToDataFrameTree("level", "area", "my_id")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
##                levelName level              area my_id
## 1  0                         1              <NA>    NA
## 2   °--1                     2            Europe     1
## 3       ¦--2                 3           Germany     2
## 4       ¦   ¦--5             4             Hesse     5
## 5       ¦   ¦   ¦--6         5         Wiesbaden     6
## 6       ¦   ¦   °--8         5 Main-Kinzig-Kreis     8
## 7       ¦   ¦       °--9     6        Gelnhausen     9
## 8       ¦   ¦--7             4            Berlin     7
## 9       ¦   °--11            4           Bavaria    11
## 10      ¦       °--12        5            Munich    12
## 11      ¦--3                 3            France     3
## 12      ¦   °--10            4             Paris    10
## 13      °--4                 3           Denmark     4

Now we have the level of the hierarchy stored in column level and the value or name of the level in area.

Pivoting this data.frame should do the rest:

1
2
3
4
5
6
7
8
geography %>% 
  replace_na(list(parent_id = 0)) %>%
  mutate(my_id = id) %>% 
  FromDataFrameNetwork() %>% 
  ToDataFrameTree("level", "area", "my_id") %>% 
  select(-levelName) %>%
  mutate(level = glue::glue("Level_{level}")) %>%
  pivot_wider(names_from = level, values_from = area)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
## # A tibble: 13 x 7
##    my_id Level_1 Level_2 Level_3 Level_4 Level_5           Level_6   
##    <dbl> <chr>   <chr>   <chr>   <chr>   <chr>             <chr>     
##  1    NA <NA>    <NA>    <NA>    <NA>    <NA>              <NA>      
##  2     1 <NA>    Europe  <NA>    <NA>    <NA>              <NA>      
##  3     2 <NA>    <NA>    Germany <NA>    <NA>              <NA>      
##  4     5 <NA>    <NA>    <NA>    Hesse   <NA>              <NA>      
##  5     6 <NA>    <NA>    <NA>    <NA>    Wiesbaden         <NA>      
##  6     8 <NA>    <NA>    <NA>    <NA>    Main-Kinzig-Kreis <NA>      
##  7     9 <NA>    <NA>    <NA>    <NA>    <NA>              Gelnhausen
##  8     7 <NA>    <NA>    <NA>    Berlin  <NA>              <NA>      
##  9    11 <NA>    <NA>    <NA>    Bavaria <NA>              <NA>      
## 10    12 <NA>    <NA>    <NA>    <NA>    Munich            <NA>      
## 11     3 <NA>    <NA>    France  <NA>    <NA>              <NA>      
## 12    10 <NA>    <NA>    <NA>    Paris   <NA>              <NA>      
## 13     4 <NA>    <NA>    Denmark <NA>    <NA>              <NA>

As we can see the first level Level_1 is always empty but we will need it later. Let’s solve this by reducing each level before pivoting.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
geopgraphy_data.frame <- geography %>% 
  replace_na(list(parent_id = 0)) %>%
  mutate(my_id = id) %>% 
  FromDataFrameNetwork() %>% 
  ToDataFrameTree("level", "area", "my_id") %>% 
  select(-levelName) %>%
  mutate(level = level -1) %>%
  mutate(level = glue::glue("Level_{level}")) %>%
  pivot_wider(names_from = level, values_from = area) 

geopgraphy_data.frame
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
## # A tibble: 13 x 7
##    my_id Level_0 Level_1 Level_2 Level_3 Level_4           Level_5   
##    <dbl> <chr>   <chr>   <chr>   <chr>   <chr>             <chr>     
##  1    NA <NA>    <NA>    <NA>    <NA>    <NA>              <NA>      
##  2     1 <NA>    Europe  <NA>    <NA>    <NA>              <NA>      
##  3     2 <NA>    <NA>    Germany <NA>    <NA>              <NA>      
##  4     5 <NA>    <NA>    <NA>    Hesse   <NA>              <NA>      
##  5     6 <NA>    <NA>    <NA>    <NA>    Wiesbaden         <NA>      
##  6     8 <NA>    <NA>    <NA>    <NA>    Main-Kinzig-Kreis <NA>      
##  7     9 <NA>    <NA>    <NA>    <NA>    <NA>              Gelnhausen
##  8     7 <NA>    <NA>    <NA>    Berlin  <NA>              <NA>      
##  9    11 <NA>    <NA>    <NA>    Bavaria <NA>              <NA>      
## 10    12 <NA>    <NA>    <NA>    <NA>    Munich            <NA>      
## 11     3 <NA>    <NA>    France  <NA>    <NA>              <NA>      
## 12    10 <NA>    <NA>    <NA>    Paris   <NA>              <NA>      
## 13     4 <NA>    <NA>    Denmark <NA>    <NA>              <NA>

That’s nice.

Filling missing values of the hierarchy levels

Now we need to fill the missing value by using the right value of the line above. But wait, we have to group by each level and do this for each level.

That’s a great example for using purrr::reduce. We apply a function fill_level for each level and put the result as argument of the next call with next level.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
# group by the parent level of the given level_name and fill the level
fill_level <- function(data, level_name) {
  grouping_level_id <- max(as.numeric(str_extract(level_name, "[0-9]+")) - 1, 0)
  previous_level <- glue::glue("Level_{grouping_level_id}")
  data %>%
    group_by(.data[[previous_level]]) %>%
    fill({{ level_name }}, .direction = "down")
}

# get all level names
level_names <- data.frame(names = names(geopgraphy_data.frame)) %>%
  filter(grepl("Level_", names)) %>%
  pull(names)

# use reduce to apply fill_level for each level to the result of the previous
# result
geopgraphy_data.frame <- reduce(level_names, fill_level, .init = geopgraphy_data.frame) %>%
  select(starts_with("Level_"), everything()) %>% 
  select(-Level_0)

geopgraphy_data.frame
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
## # A tibble: 13 x 6
## # Groups:   Level_4 [4]
##    Level_1 Level_2 Level_3 Level_4           Level_5    my_id
##    <chr>   <chr>   <chr>   <chr>             <chr>      <dbl>
##  1 <NA>    <NA>    <NA>    <NA>              <NA>          NA
##  2 Europe  <NA>    <NA>    <NA>              <NA>           1
##  3 Europe  Germany <NA>    <NA>              <NA>           2
##  4 Europe  Germany Hesse   <NA>              <NA>           5
##  5 Europe  Germany Hesse   Wiesbaden         <NA>           6
##  6 Europe  Germany Hesse   Main-Kinzig-Kreis <NA>           8
##  7 Europe  Germany Hesse   Main-Kinzig-Kreis Gelnhausen     9
##  8 Europe  Germany Berlin  <NA>              <NA>           7
##  9 Europe  Germany Bavaria <NA>              <NA>          11
## 10 Europe  Germany Bavaria Munich            <NA>          12
## 11 Europe  France  <NA>    <NA>              <NA>           3
## 12 Europe  France  Paris   <NA>              <NA>          10
## 13 Europe  Denmark <NA>    <NA>              <NA>           4

Adding other fancy columns

Now we can add all the other columns our original data has, currently just some_additional_data:

1
2
3
geopgraphy_data.frame %>% 
  left_join(geography %>% select(-parent_id, -area), by = c('my_id' = 'id')) %>% 
  filter(! is.na(Level_1)) # get rid of this line with all NAs
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
## # A tibble: 12 x 7
## # Groups:   Level_4 [4]
##    Level_1 Level_2 Level_3 Level_4    Level_5  my_id some_additional_data       
##    <chr>   <chr>   <chr>   <chr>      <chr>    <dbl> <chr>                      
##  1 Europe  <NA>    <NA>    <NA>       <NA>         1 That's a continent         
##  2 Europe  Germany <NA>    <NA>       <NA>         2 That's a country           
##  3 Europe  Germany Hesse   <NA>       <NA>         5 ah, that's called Bundesla…
##  4 Europe  Germany Hesse   Wiesbaden  <NA>         6 and that's the capital of …
##  5 Europe  Germany Hesse   Main-Kinz… <NA>         8 that's a Landkreis         
##  6 Europe  Germany Hesse   Main-Kinz… Gelnhau…     9 and the Landkreis does hav…
##  7 Europe  Germany Berlin  <NA>       <NA>         7 that's the capital of Germ…
##  8 Europe  Germany Bavaria <NA>       <NA>        11 another Bundesland         
##  9 Europe  Germany Bavaria Munich     <NA>        12 and its capital            
## 10 Europe  France  <NA>    <NA>       <NA>         3 Oh, yeah, another country  
## 11 Europe  France  Paris   <NA>       <NA>        10 The city of Love           
## 12 Europe  Denmark <NA>    <NA>       <NA>         4 Oh dear, how many contries…

So now we have a data.frame we can use in Tableau. Our script doesn’t depend on the overall depth of the hierarchy or the depth of one special branch.

Maybe there’s a simpler way to build this data.frame. If so, let me know, please. 🙂

To leave a comment for the author, please follow the link and comment on their blog: rstats-tips.net.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)