Recode values with character subsetting

January 9, 2018

(This article was first published on Higher Order Functions, and kindly contributed to R-bloggers)

Do you ever have to recode many values at once? It’s a frequent chore when
preparing data. For example, suppose we had to replace state abbreviations
with the full names:

abbs <- c("AL", "AK", "AZ", "AZ", "WI", "WS")

You could write several nested ifelse() calls.

ifelse(abbs == "AL", "Alabama", 
       ifelse(abbs == "AK", "Alaska", 
              ifelse(abbs == "AZ", "Arizona", 

Actually, never mind! That gets out of hand very quickly.

case_when() is nice, especially when the replacement rules are more complex
than 1-to-1 matching.

dplyr::case_when(
  # Syntax: logical test ~ value to use when test is TRUE
  abbs == "AL" ~ "Alabama",
  abbs == "AK" ~ "Alaska",
  abbs == "AZ" ~ "Arizona",
  abbs == "WI" ~ "Wisconsin",
  # a fallback/default value
  TRUE ~ "No match"
)
#> [1] "Alabama"   "Alaska"    "Arizona"   "Arizona"   "Wisconsin" "No match"

We could also use one of my very favorite R tricks: character subsetting.
We create a named vector where the names are the data we have and the values are
the data we want. I use the mnemonic old_value = new_value. In this case, we
make a lookup table like so:

lookup <- c(
  # Syntax: name = value
  "AL" = "Alabama",
  "AK" = "Alaska",
  "AZ" = "Arizona",
  "WI" = "Wisconsin")

For example, subsetting with the string "AL" will retrieve the value with the
name "AL".

lookup["AL"]
#>        AL 
#> "Alabama"

With a vector of names, we can look up the values all at once.

lookup[abbs]
#>          AL          AK          AZ          AZ          WI        <NA> 
#>   "Alabama"    "Alaska"   "Arizona"   "Arizona" "Wisconsin"          NA

If the names and the replacement values are stored in vectors, we can construct
the lookup table programmatically using setNames(). In our case, the datasets
package provides vectors with state names and state abbreviations.

full_lookup <- setNames(datasets::state.name, datasets::state.abb)
head(full_lookup)
#>           AL           AK           AZ           AR           CA 
#>    "Alabama"     "Alaska"    "Arizona"   "Arkansas" "California" 
#>           CO 
#>   "Colorado"

full_lookup[abbs]
#>          AL          AK          AZ          AZ          WI        <NA> 
#>   "Alabama"    "Alaska"   "Arizona"   "Arizona" "Wisconsin"          NA
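
The same construction works when the two vectors are columns of a data frame. Here is a minimal sketch of that case. The data frame state_table and its column names abb and name are made up for illustration; they are not part of the datasets package.

```r
# Hypothetical two-column table: the abbreviations we have
# and the full names we want
state_table <- data.frame(
  abb = c("AL", "AK", "AZ", "WI"),
  name = c("Alabama", "Alaska", "Arizona", "Wisconsin"),
  stringsAsFactors = FALSE
)

# setNames() takes the values first and the names second
table_lookup <- setNames(state_table$name, state_table$abb)

table_lookup[c("WI", "AL")]
#>          WI          AL 
#> "Wisconsin"   "Alabama"
```

The argument order is easy to flip by accident: the values (new data) go first and the names (old data) go second.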

One complication is that the character subsetting yields NA when the
lookup table doesn’t have a matching name. That’s what’s happening above with
the illegal abbreviation "WS". We can fix this by replacing the NA
values with some default value.

matches <- full_lookup[abbs]
matches[is.na(matches)] <- "No match"
matches
#>          AL          AK          AZ          AZ          WI        <NA> 
#>   "Alabama"    "Alaska"   "Arizona"   "Arizona" "Wisconsin"  "No match"

Finally, to clean away any traces of the matching process, we can unname() the
results.

unname(matches)
#> [1] "Alabama"   "Alaska"    "Arizona"   "Arizona"   "Wisconsin" "No match"
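
The lookup, the default value, and the unname() step can be bundled into one small helper. This is only a sketch: the function name recode_lookup and its default argument are my own invention for illustration, not something from base R or a package.

```r
# Hypothetical helper: recode x using a named lookup vector,
# falling back to `default` wherever the lookup yields NA
recode_lookup <- function(x, lookup, default = "No match") {
  matches <- unname(lookup[x])
  matches[is.na(matches)] <- default
  matches
}

abbs <- c("AL", "AK", "AZ", "AZ", "WI", "WS")
full_lookup <- setNames(datasets::state.name, datasets::state.abb)

recode_lookup(abbs, full_lookup)
#> [1] "Alabama"   "Alaska"    "Arizona"   "Arizona"   "Wisconsin" "No match"
```

One caveat with this sketch: it cannot tell an unmatched name apart from a missing input, so an NA in x also ends up as the default value.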

Many-to-one lookup tables

By the way, the lookup tables can be many-to-one. That is, different names can
retrieve the same value. For example, many-to-one matching lets us handle data
that has synonymous names and differences in capitalization.

lookup <- c(
  "python" = "Python", "r" = "R", "node" = "Javascript", 
  "js" = "Javascript", "javascript" = "Javascript")

languages <- c("JS", "js", "Node", "R", "Python", "r", "JAvascript")

# Use tolower() to normalize the language names so 
# e.g., "R" and "r" can both match R
lookup[tolower(languages)]
#>           js           js         node            r       python 
#> "Javascript" "Javascript" "Javascript"          "R"     "Python" 
#>            r   javascript 
#>          "R" "Javascript"
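
When there are many synonyms, it can be easier to write them grouped by the value they should produce and then build the many-to-one table from that. The following is a sketch of one way to do it; the list synonyms is a hypothetical structure introduced just for this example.

```r
# Hypothetical list: each element's name is the value we want, and each
# element holds the lowercase spellings that should map to that value
synonyms <- list(
  Javascript = c("node", "js", "javascript"),
  R = "r",
  Python = "python"
)

# Repeat each target value once per synonym, then use the synonyms as names
lookup <- setNames(
  rep(names(synonyms), lengths(synonyms)),
  unlist(synonyms, use.names = FALSE)
)

languages <- c("JS", "js", "Node", "R", "Python", "r", "JAvascript")

# Same lookup as before, without the names on the result
unname(lookup[tolower(languages)])
```

This yields the same values as the hand-written table above, and adding a new spelling only means touching one element of the list.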

Character by character string replacement

I’m motivated to write about character subsetting today because I used it in a
Stack Overflow answer.
Here is my paraphrasing of the problem.

Let’s say I have a long character string, and I’d like to use
stringr::str_replace_all to replace certain letters with others. According to
the documentation, str_replace_all can take a named vector and replaces the
name with the value. That works fine for 1 replacement, but for multiple, it
seems to do the replacements iteratively, so that one replacement can replace
another one.

library(tidyverse)
text_string = "developer"

# This works fine
text_string %>% 
  str_replace_all(c(e ="X")) 
#> [1] "dXvXlopXr"

# But this is not what I want
text_string %>% 
  str_replace_all(c(e ="p", p = "e"))
#> [1] "develoeer"

# Desired result would be "dpvploepr"

The iterative behavior here is that
str_replace_all("developer", c(e ="p", p = "e")) first replaces e with p
(yielding "dpvploppr") and then it applies the second rule on the output of
the first rule, replacing p with e (yielding "develoeer").
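
To make the two passes concrete, here is a sketch that applies the rules one at a time. It uses base R's gsub() as a stand-in for each pass, so it mimics the behavior described above rather than showing how stringr works internally.

```r
# First pass: replace every e with p
step_1 <- gsub("e", "p", "developer", fixed = TRUE)
step_1
#> [1] "dpvploppr"

# Second pass runs on the output of the first pass, so the p's that
# the first pass just created are turned into e's as well
step_2 <- gsub("p", "e", step_1, fixed = TRUE)
step_2
#> [1] "develoeer"
```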

When I read this question, the replacement rules looked a lot like the lookup
tables that I use in character subsetting, so I presented a function that
handles this problem by using character subsetting.

Let’s work through the question’s example. First, let’s break the string into
characters.

input <- "developer"
rules <- c(e = "p", p = "e")

chars <- unlist(strsplit(input, ""))
chars
#> [1] "d" "e" "v" "e" "l" "o" "p" "e" "r"

To avoid the issue of NAs, we create default rules so that every character in
the input is replaced by itself.

unique_chars <- unique(chars)
complete_rules <- setNames(unique_chars, unique_chars)
complete_rules
#>   d   e   v   l   o   p   r 
#> "d" "e" "v" "l" "o" "p" "r"

Now, we overwrite the default rules with the specific ones we are interested in.

# Find the default rules that share a name with one of the real rules
# and overwrite them with the real rules.
complete_rules[names(rules)] <- rules
complete_rules
#>   d   e   v   l   o   p   r 
#> "d" "p" "v" "l" "o" "e" "r"

Then a lookup with character subsetting effectively applies all the replacement
rules at once. We glue the characters back together to finish the transformation.

replaced <- unname(complete_rules[chars])
paste0(replaced, collapse = "")
#> [1] "dpvploepr"

Here is everything combined into a single function, with some additional steps
needed to handle multiple strings at once.

str_replace_chars <- function(string, rules) {
  # Expand rules to replace characters with themselves 
  # if those characters do not have a replacement rule
  chars <- unique(unlist(strsplit(string, "")))
  complete_rules <- setNames(chars, chars)
  complete_rules[names(rules)] <- rules

  # Split each string into characters, replace and unsplit
  for (string_i in seq_along(string)) {
    chars_i <- unlist(strsplit(string[string_i], ""))
    string[string_i] <- paste0(complete_rules[chars_i], collapse = "")
  }
  string
}

rules <- c(a = "X", p = "e", e = "p")
strings <- c("application", "developer")

str_replace_chars(strings, rules)
#> [1] "XeelicXtion" "dpvploepr"
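
As a cross-check, base R's chartr() also translates characters in a single pass, so it should agree with the function above whenever every rule maps one character to one character. This sketch is a different technique from character subsetting, shown only to confirm the result; the variables old and new are introduced here for illustration.

```r
rules <- c(a = "X", p = "e", e = "p")
strings <- c("application", "developer")

# chartr(old, new, x) translates each character in `old` to the
# character at the same position in `new`, all in one pass
old <- paste0(names(rules), collapse = "")
new <- paste0(rules, collapse = "")

chartr(old, new, strings)
#> [1] "XeelicXtion" "dpvploepr"
```

This shortcut assumes single-character names and values, because chartr() pairs the two strings up position by position.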
