Recode values with character subsetting

[This article was first published on Higher Order Functions, and kindly contributed to R-bloggers].

Do you ever have to recode many values at once? It’s a frequent chore when
preparing data. For example, suppose we had to replace state abbreviations
with the full names:

abbs <- c("AL", "AK", "AZ", "AZ", "WI", "WS")

You could write a chain of nested ifelse() calls.

ifelse(abbs == "AL", "Alabama",
       ifelse(abbs == "AK", "Alaska",
              ifelse(abbs == "AZ", "Arizona",

Actually, never mind! That gets out of hand very quickly.

case_when() is nice, especially when the replacement rules are more complex
than 1-to-1 matching.

dplyr::case_when(
  # Syntax: logical test ~ value to use when test is TRUE
  abbs == "AL" ~ "Alabama",
  abbs == "AK" ~ "Alaska",
  abbs == "AZ" ~ "Arizona",
  abbs == "WI" ~ "Wisconsin",
  # a fallback/default value
  TRUE ~ "No match"
)
#> [1] "Alabama"   "Alaska"    "Arizona"   "Arizona"   "Wisconsin" "No match"

We could also use one of my very favorite R tricks: character subsetting.
We create a named vector where the names are the data we have and the values are
the data we want. I use the mnemonic old_value = new_value. In this case, we
make a lookup table like so:

lookup <- c(
  # Syntax: name = value
  "AL" = "Alabama",
  "AK" = "Alaska",
  "AZ" = "Arizona",
  "WI" = "Wisconsin")

For example, subsetting with the string "AL" will retrieve the value with the
name "AL".

lookup["AL"]
#>        AL
#> "Alabama"

With a vector of names, we can look up the values all at once.

lookup[abbs]
#>          AL          AK          AZ          AZ          WI        <NA>
#>   "Alabama"    "Alaska"   "Arizona"   "Arizona" "Wisconsin"          NA

If the names and the replacement values are stored in vectors, we can construct
the lookup table programmatically using setNames(). In our case, the datasets
package provides vectors with state names and state abbreviations.

full_lookup <- setNames(datasets::state.name, datasets::state.abb)
head(full_lookup)
#>           AL           AK           AZ           AR           CA
#>    "Alabama"     "Alaska"    "Arizona"   "Arkansas" "California"
#>           CO
#>   "Colorado"

full_lookup[abbs]
#>          AL          AK          AZ          AZ          WI        <NA>
#>   "Alabama"    "Alaska"   "Arizona"   "Arizona" "Wisconsin"          NA
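
For comparison, the same lookup can be sketched with base R's match(), which returns the position of each abbreviation in state.abb (or NA when there is none). This is an alternative formulation rather than something the named-vector approach requires, and the names positions and matched_names are only illustrative.

```r
abbs <- c("AL", "AK", "AZ", "AZ", "WI", "WS")

# match() gives the index of each abbreviation in state.abb, NA if absent
positions <- match(abbs, datasets::state.abb)

# Index the full names by position; unmatched entries stay NA
matched_names <- datasets::state.name[positions]
matched_names
```

The result is already unnamed, but the named-vector version reads more directly as a table of old_value = new_value pairs.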

One complication is that character subsetting yields NA when the
lookup table doesn’t have a matching name. That’s what’s happening above with
the invalid abbreviation "WS". We can fix this by replacing the NA
values with some default value.

matches <- full_lookup[abbs]
matches[is.na(matches)] <- "No match"
matches
#>          AL          AK          AZ          AZ          WI        <NA>
#>   "Alabama"    "Alaska"   "Arizona"   "Arizona" "Wisconsin"  "No match"

Finally, to clean away any traces of the matching process, we can unname() the
results.

unname(matches)
#> [1] "Alabama"   "Alaska"    "Arizona"   "Arizona"   "Wisconsin" "No match"
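
If this pattern comes up often, the three steps (look up, fill in a default, drop the names) could be wrapped in a small helper. The function below is a minimal sketch; recode_with_lookup and its default argument are hypothetical names, not part of any package.

```r
# Hypothetical helper: recode `x` using a named lookup vector,
# falling back to `default` where no name matches.
recode_with_lookup <- function(x, lookup, default = NA_character_) {
  matches <- unname(lookup[x])
  matches[is.na(matches)] <- default
  matches
}

full_lookup <- setNames(datasets::state.name, datasets::state.abb)
recode_with_lookup(c("AL", "AK", "AZ", "AZ", "WI", "WS"), full_lookup, "No match")
#> [1] "Alabama"   "Alaska"    "Arizona"   "Arizona"   "Wisconsin" "No match"
```

One caveat of this sketch: a lookup value that is itself NA would also be replaced by the default, because the helper cannot tell it apart from a missing name.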

Many-to-one lookup tables

By the way, lookup tables can be many-to-one. That is, different names can
retrieve the same value. For example, many-to-one matching lets us handle the
following data, which has synonymous names and inconsistent capitalization.

lookup <- c(
  "python" = "Python", "r" = "R", "node" = "Javascript",
  "js" = "Javascript", "javascript" = "Javascript")

languages <- c("JS", "js", "Node", "R", "Python", "r", "JAvascript")

# Use tolower() to normalize the language names so
# e.g., "R" and "r" can both match R
lookup[tolower(languages)]
#>           js           js         node            r       python
#> "Javascript" "Javascript" "Javascript"          "R"     "Python"
#>            r   javascript
#>          "R" "Javascript"
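
When there are many synonyms, it may be easier to write the table the other way around, as one entry per target value listing its synonyms, and then flatten it into a named vector. The following is a sketch under that assumption; synonyms and lookup_from_list are illustrative names.

```r
# One entry per target value, listing the (lowercase) names that map to it
synonyms <- list(
  Javascript = c("node", "js", "javascript"),
  Python = "python",
  R = "r"
)

# Repeat each target value once per synonym, then name by the synonyms
lookup_from_list <- setNames(
  rep(names(synonyms), lengths(synonyms)),
  unlist(synonyms, use.names = FALSE)
)

languages <- c("JS", "js", "Node", "R", "Python", "r", "JAvascript")
recoded <- unname(lookup_from_list[tolower(languages)])
recoded
```

This yields the same recoded values as the hand-written lookup table, with each target value spelled out only once.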

Character by character string replacement

I’m motivated to write about character subsetting today because I used it in a
Stack Overflow answer.
Here is my paraphrasing of the problem.

Let’s say I have a long character string, and I’d like to use
stringr::str_replace_all to replace certain letters with others. According to
the documentation, str_replace_all can take a named vector and replaces the
name with the value. That works fine for 1 replacement, but for multiple, it
seems to do the replacements iteratively, so that one replacement can replace
another one.

library(tidyverse)
text_string = "developer"

# This works fine
text_string %>% 
  str_replace_all(c(e ="X")) 
#> [1] "dXvXlopXr"

# But this is not what I want
text_string %>% 
  str_replace_all(c(e ="p", p = "e"))
#> [1] "develoeer"

# Desired result would be "dpvploepr"

The iterative behavior here is that
str_replace_all("developer", c(e ="p", p = "e")) first replaces e with p
(yielding "dpvploppr") and then it applies the second rule on the output of
the first rule, replacing p with e (yielding "develoeer").
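
The two passes can be reproduced by hand with base R's gsub(). This only mimics the behavior described above, one rule at a time; it is not a claim about how stringr implements it internally.

```r
input <- "developer"

# Rule 1: replace every "e" with "p"
step1 <- gsub("e", "p", input, fixed = TRUE)
step1
#> [1] "dpvploppr"

# Rule 2 runs on the output of rule 1, so the new "p"s are replaced too
step2 <- gsub("p", "e", step1, fixed = TRUE)
step2
#> [1] "develoeer"
```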

When I read this question, the replacement rules looked a lot like the lookup
tables that I use in character subsetting, so I presented a function that
handles this problem by using character subsetting.

Let’s work through the question’s example. First, let’s break the string into
characters.

input <- "developer"
rules <- c(e = "p", p = "e")

chars <- unlist(strsplit(input, ""))
chars
#> [1] "d" "e" "v" "e" "l" "o" "p" "e" "r"

To avoid the issue of NAs, we create default rules so that every character in
the input is replaced by itself.

unique_chars <- unique(chars)
complete_rules <- setNames(unique_chars, unique_chars)
complete_rules
#>   d   e   v   l   o   p   r
#> "d" "e" "v" "l" "o" "p" "r"

Now, we overwrite the default rules with the specific ones we are interested in.

# Find default rules with the same names as the real rules.
# Replace them with the real rules.
complete_rules[names(rules)] <- rules
complete_rules
#>   d   e   v   l   o   p   r
#> "d" "p" "v" "l" "o" "e" "r"

Then a lookup with character subsetting effectively applies all the replacement
rules at once. We glue the characters back together to finish the transformation.

replaced <- unname(complete_rules[chars])
paste0(replaced, collapse = "")
#> [1] "dpvploepr"

Here is everything combined into a single function, with some additional steps
needed to handle multiple strings at once.

str_replace_chars <- function(string, rules) {
  # Expand rules to replace characters with themselves
  # if those characters do not have a replacement rule
  chars <- unique(unlist(strsplit(string, "")))
  complete_rules <- setNames(chars, chars)
  complete_rules[names(rules)] <- rules

  # Split each string into characters, replace and unsplit
  for (string_i in seq_along(string)) {
    chars_i <- unlist(strsplit(string[string_i], ""))
    string[string_i] <- paste0(complete_rules[chars_i], collapse = "")
  }
  string
}

rules <- c(a = "X", p = "e", e = "p")
strings <- c("application", "developer")

str_replace_chars(strings, rules)
#> [1] "XeelicXtion" "dpvploepr"
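
As a cross-check, base R's chartr() also translates characters simultaneously rather than rule by rule, so it should give the same answer for rules like these. The sketch below assumes every rule maps exactly one character to exactly one character; old_chars and new_chars are illustrative names.

```r
rules <- c(a = "X", p = "e", e = "p")
strings <- c("application", "developer")

# chartr(old, new, x): the i-th character of `old` maps to the i-th of `new`
old_chars <- paste(names(rules), collapse = "")  # "ape"
new_chars <- paste(rules, collapse = "")         # "Xep"

translated <- chartr(old_chars, new_chars, strings)
translated
#> [1] "XeelicXtion" "dpvploepr"
```

Unlike the lookup-table function, chartr() cannot express a rule whose replacement is longer or shorter than one character.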
