Cleaning the Table

[This article was first published on R on kieranhealy.org, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

While I’m talking about getting data into R this weekend, here’s another quick example that came up in class this week. The mortality data in the previous example were nice and clean coming in the door. That’s usually not the case. Data can be and usually is messy in all kinds of ways. One of the most common, particularly in the case of summary tables obtained from some source or other, is that the values aren’t directly usable. The following summary table was copied and pasted into Excel from an external source, saved as a CSV file, and arrived looking like this:

library(tidyverse)

rfm_tbl <- read_csv("data/rfm_table.csv")


## Parsed with column specification:
## cols(
##   SEGMENT = col_character(),
##   DESCRIPTION = col_character(),
##   R = col_character(),
##   F = col_character(),
##   M = col_character()
## )


rfm_tbl 


## # A tibble: 23 x 5
##    SEGMENT        DESCRIPTION                             R     F     M    
##    <chr>          <chr>                                   <chr> <chr> <chr>
##  1 <NA>           <NA>                                    <NA>  <NA>  <NA> 
##  2 Champions      Bought recently, buy often and spend t… 4– 5  4– 5  4– 5 
##  3 <NA>           <NA>                                    <NA>  <NA>  <NA> 
##  4 Loyal Custome… Spend good money. Responsive to promot… 2– 5  3– 5  3– 5 
##  5 <NA>           <NA>                                    <NA>  <NA>  <NA> 
##  6 Potential Loy… Recent customers, spent good amount, b… 3– 5  1– 3  1– 3 
##  7 <NA>           <NA>                                    <NA>  <NA>  <NA> 
##  8 New Customers  Bought more recently, but not often     4– 5  <= 1  <= 1 
##  9 <NA>           <NA>                                    <NA>  <NA>  <NA> 
## 10 Promising      Recent shoppers, but haven’t spent much 3– 4  <= 1  <= 1 
## # … with 13 more rows

This is messy and we can’t do anything with the values in R, F, and M. Ultimately we want a table with separate columns containing the low and high values for these variables. If no lower bound is shown, the lower bound is zero. We’re going to use a few tools, notably separate() to get where we want to be. I’ll step through this pipeline one piece at a time, so you can see how the table is being changed from start to finish.

First let’s clean clean the variable names and remove the entirely blank lines.

rfm_tbl %>% 
  janitor::clean_names() %>%
  filter_all(any_vars(!is.na(.))) 


## # A tibble: 11 x 5
##    segment        description                             r     f     m    
##    <chr>          <chr>                                   <chr> <chr> <chr>
##  1 Champions      Bought recently, buy often and spend t… 4– 5  4– 5  4– 5 
##  2 Loyal Custome… Spend good money. Responsive to promot… 2– 5  3– 5  3– 5 
##  3 Potential Loy… Recent customers, spent good amount, b… 3– 5  1– 3  1– 3 
##  4 New Customers  Bought more recently, but not often     4– 5  <= 1  <= 1 
##  5 Promising      Recent shoppers, but haven’t spent much 3– 4  <= 1  <= 1 
##  6 Need Attention Above average recency, frequency & mon… 2– 3  2– 3  2– 3 
##  7 About To Sleep Below average recency, frequency & mon… 2– 3  <= 2  <= 2 
##  8 At Risk        Spent big money, purchased often but l… <= 2  2– 5  2– 5 
##  9 Can’t Lose Th… Made big purchases and often, but long… <= 1  4– 5  4– 5 
## 10 Hibernating    Low spenders, low frequency, purchased… 1– 2  1– 2  1– 2 
## 11 Lost           Lowest recency, frequency & monetary s… <= 2  <= 2  <= 2

Next we start work on the values. I thought about different ways of doing this, notably working out a way to apply or map separate() to each of the columns I want to change. I got slightly bogged down doing this, and instead decided to lengthen the r, f, and m variables into a single key-value pair, do the recoding there, and then widen the result again. First, lengthen the data:

rfm_tbl %>% 
  janitor::clean_names() %>%
  filter_all(any_vars(!is.na(.))) %>%
  pivot_longer(cols = r:m)


## # A tibble: 33 x 4
##    segment         description                                  name  value
##    <chr>           <chr>                                        <chr> <chr>
##  1 Champions       Bought recently, buy often and spend the mo… r     4– 5 
##  2 Champions       Bought recently, buy often and spend the mo… f     4– 5 
##  3 Champions       Bought recently, buy often and spend the mo… m     4– 5 
##  4 Loyal Customers Spend good money. Responsive to promotions   r     2– 5 
##  5 Loyal Customers Spend good money. Responsive to promotions   f     3– 5 
##  6 Loyal Customers Spend good money. Responsive to promotions   m     3– 5 
##  7 Potential Loya… Recent customers, spent good amount, bought… r     3– 5 
##  8 Potential Loya… Recent customers, spent good amount, bought… f     1– 3 
##  9 Potential Loya… Recent customers, spent good amount, bought… m     1– 3 
## 10 New Customers   Bought more recently, but not often          r     4– 5 
## # … with 23 more rows

I’m quite sure that there’s an elegant way to use one of the map() functions to process the r, f, and m columns in sequence. But seeing as I couldn’t quickly figure it out, this alternative strategy works just fine. In fact, as a general approach I think it’s always worth remembering that the tidyverse really “wants” your data to be in long form, and lots of things that are awkward or conceptually tricky can suddenly become much easier if you get the data into the shape that the function toolbox wants it to be in. Lengthening the data you’re working with is very often the right approach, and you know you can widen it later on once you’re done cleaning or otherwise manipulating it.

With our table in long format we can now use separate() on the value column. The separate() function is very handy for pulling apart variables that should be in different columns. Its defaults are good, too. In this case I didn’t have to write a regular expression to specify the characters that are dividing up the values. In the function call we use convert = TRUE to turn the results into integers, and fill = "left" because there’s an implicit zero on the left of each entry that looks like e.g. <= 2.

rfm_tbl %>% 
  janitor::clean_names() %>%
  filter_all(any_vars(!is.na(.))) %>%
  pivot_longer(cols = r:m) %>% 
  separate(col = value, into = c("lo", "hi"), 
           remove = FALSE, convert = TRUE, 
           fill = "left") 


## # A tibble: 33 x 6
##    segment       description                        name  value    lo    hi
##    <chr>         <chr>                              <chr> <chr> <int> <int>
##  1 Champions     Bought recently, buy often and sp… r     4– 5      4     5
##  2 Champions     Bought recently, buy often and sp… f     4– 5      4     5
##  3 Champions     Bought recently, buy often and sp… m     4– 5      4     5
##  4 Loyal Custom… Spend good money. Responsive to p… r     2– 5      2     5
##  5 Loyal Custom… Spend good money. Responsive to p… f     3– 5      3     5
##  6 Loyal Custom… Spend good money. Responsive to p… m     3– 5      3     5
##  7 Potential Lo… Recent customers, spent good amou… r     3– 5      3     5
##  8 Potential Lo… Recent customers, spent good amou… f     1– 3      1     3
##  9 Potential Lo… Recent customers, spent good amou… m     1– 3      1     3
## 10 New Customers Bought more recently, but not oft… r     4– 5      4     5
## # … with 23 more rows

Before widening the data again we drop the value column. We don’t need it anymore. (It will mess up the widening if we keep it, too: try it and see what happens.)

rfm_tbl %>% 
  janitor::clean_names() %>%
  filter_all(any_vars(!is.na(.))) %>%
  pivot_longer(cols = r:m) %>% 
  separate(col = value, into = c("lo", "hi"), 
           remove = FALSE, convert = TRUE, 
           fill = "left") %>%
  select(-value) 


## # A tibble: 33 x 5
##    segment        description                             name     lo    hi
##    <chr>          <chr>                                   <chr> <int> <int>
##  1 Champions      Bought recently, buy often and spend t… r         4     5
##  2 Champions      Bought recently, buy often and spend t… f         4     5
##  3 Champions      Bought recently, buy often and spend t… m         4     5
##  4 Loyal Custome… Spend good money. Responsive to promot… r         2     5
##  5 Loyal Custome… Spend good money. Responsive to promot… f         3     5
##  6 Loyal Custome… Spend good money. Responsive to promot… m         3     5
##  7 Potential Loy… Recent customers, spent good amount, b… r         3     5
##  8 Potential Loy… Recent customers, spent good amount, b… f         1     3
##  9 Potential Loy… Recent customers, spent good amount, b… m         1     3
## 10 New Customers  Bought more recently, but not often     r         4     5
## # … with 23 more rows

Now we can widen the data, with pivot_wider().

rfm_tbl %>% 
  janitor::clean_names() %>%
  filter_all(any_vars(!is.na(.))) %>%
  pivot_longer(cols = r:m) %>% 
  separate(col = value, into = c("lo", "hi"), 
           remove = FALSE, convert = TRUE, 
           fill = "left") %>%
  select(-value) %>%
  pivot_wider(names_from = name, 
              values_from = lo:hi) 


## # A tibble: 11 x 8
##    segment     description               lo_r  lo_f  lo_m  hi_r  hi_f  hi_m
##    <chr>       <chr>                    <int> <int> <int> <int> <int> <int>
##  1 Champions   Bought recently, buy of…     4     4     4     5     5     5
##  2 Loyal Cust… Spend good money. Respo…     2     3     3     5     5     5
##  3 Potential … Recent customers, spent…     3     1     1     5     3     3
##  4 New Custom… Bought more recently, b…     4    NA    NA     5     1     1
##  5 Promising   Recent shoppers, but ha…     3    NA    NA     4     1     1
##  6 Need Atten… Above average recency, …     2     2     2     3     3     3
##  7 About To S… Below average recency, …     2    NA    NA     3     2     2
##  8 At Risk     Spent big money, purcha…    NA     2     2     2     5     5
##  9 Can’t Lose… Made big purchases and …    NA     4     4     1     5     5
## 10 Hibernating Low spenders, low frequ…     1     1     1     2     2     2
## 11 Lost        Lowest recency, frequen…    NA    NA    NA     2     2     2

Finally we put back those implicit zeros using replace_na() and reorder the columns to our liking. Using replace_na() is fine here because we know that every missing value should in fact be a zero.

rfm_tbl %>% 
  janitor::clean_names() %>%
  filter_all(any_vars(!is.na(.))) %>%
  pivot_longer(cols = r:m) %>% 
  separate(col = value, into = c("lo", "hi"), 
           remove = FALSE, convert = TRUE, 
           fill = "left") %>%
  select(-value) %>%
  pivot_wider(names_from = name, 
              values_from = lo:hi) %>%
  mutate_if(is.integer, replace_na, 0) %>%
  select(segment, 
         lo_r, hi_r, 
         lo_f, hi_f, 
         lo_m, hi_m, 
         description)


## # A tibble: 11 x 8
##    segment      lo_r  hi_r  lo_f  hi_f  lo_m  hi_m description             
##    <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>                   
##  1 Champions       4     5     4     5     4     5 Bought recently, buy of…
##  2 Loyal Cust…     2     5     3     5     3     5 Spend good money. Respo…
##  3 Potential …     3     5     1     3     1     3 Recent customers, spent…
##  4 New Custom…     4     5     0     1     0     1 Bought more recently, b…
##  5 Promising       3     4     0     1     0     1 Recent shoppers, but ha…
##  6 Need Atten…     2     3     2     3     2     3 Above average recency, …
##  7 About To S…     2     3     0     2     0     2 Below average recency, …
##  8 At Risk         0     2     2     5     2     5 Spent big money, purcha…
##  9 Can’t Lose…     0     1     4     5     4     5 Made big purchases and …
## 10 Hibernating     1     2     1     2     1     2 Low spenders, low frequ…
## 11 Lost            0     2     0     2     0     2 Lowest recency, frequen…

Much nicer.

To leave a comment for the author, please follow the link and comment on their blog: R on kieranhealy.org.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)