[This article was first published on Mostly Counting | Tim Tiefenbach, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

## Intro

Today’s blog post is all about recoding columns using a data dictionary and `dplyr::recode()`.

Many datasets, especially from surveys, come along with a proper documentation often in form of a so called “data dictionary”. A data dictionary contains at least three pieces of information: the (i) column names that are used in the dataset as well as corresponding (ii) numeric values and (iii) labels which translate those abstract numbers into meaningful terms.

At times, you may need to transform the raw values into their associated labels for tasks like reporting or plotting. Here, `dplyr::recode()` serves as an efficient tool to programmatically recode, and also rename, columns according to a data dictionary.

Recently I revamped an old ETL script I had written, which recoded multiple datasets based on a data dictionary. This script was from the pre-dplyr 1.0 era, so updating it provided a great opportunity to revisit this task, this time armed with the latest dplyr version. The resulting approach was such a significant improvement over my original method, that I felt compelled to share it here, serving as both a personal reference and a resource for anyone confronted with similar challenges.

We’ll start with a straightforward example to demonstrate the basic workflow. We then look at a more advanced scenario involving multiple datasets and a comprehensive data dictionary. Lastly, this blog concludes with some thoughts about the recent changes in dplyr version 1.1.0 and `recode`‘s new successor `case_match()`.

## Recoding one dataset

#### Setup

Lets start with a really simple dataset composed of three columms and fives rows:

```library(dplyr)

# dataset
dat <- tibble(a = 1:5,
b = c(10:14),
c = c(20:24))
dat
#> # A tibble: 5 × 3
#>       a     b     c
#>   <int> <int> <int>
#> 1     1    10    20
#> 2     2    11    21
#> 3     3    12    22
#> 4     4    13    23
#> 5     5    14    24
```

Assume we have a data dictionary that looks like this:

```# dictionary
dat_dict <- tibble(

col_nm = c(
rep("b", 5),
rep("c", 5)
),

value = c(
10:14,
20:24
),

label = c(
letters[1:5],
letters[6:10]
)
)

dat_dict
#> # A tibble: 10 × 3
#>    col_nm value label
#>    <chr>  <int> <chr>
#>  1 b         10 a
#>  2 b         11 b
#>  3 b         12 c
#>  4 b         13 d
#>  5 b         14 e
#>  6 c         20 f
#>  7 c         21 g
#>  8 c         22 h
#>  9 c         23 i
#> 10 c         24 j
```

The dictionary has three columns: `col_nm` records our dataset’s column names, `value` holds the values, as seen in our raw data `dat`, and `label` stores the corresponding labels.

Note, that our dataset contains a column `a` that is not part of the data dictionary, just to show that the dictionary doesn’t need to hold value-label pairs for all columns.

Typically, attached data dictionaries come in form of csv or xlsx files, but after reading them into R, they should be in a similar form as our mock dictionary above.

#### Preparations

Next, we need to prepare two things: (i) a custom function to recode a single column according to the dictionary and (ii) a vector of columns names we want to recode. We will then use both inside `dplyr::mutate(across(...))`.

Taking a closer look at our custom recode function, it accepts the column to recode, `x`, as its only argument. The function works in two steps. Initially, we create a named vector of matching label-value pairs. Next, we splice this vector as arguments into `dplyr::recode(x, ...)` using rlang’s triple bang operator `!!!`.

```# recode function to be used within `dplyr::mutate(across(...))`
recode_col <- function(x) {

recode_vec <- dat_dict |>
filter(col_nm == cur_column()) |>
pull(label, name = value)

dplyr::recode(x, !!! recode_vec)
}```

To create a named vector of matching label-value pairs we start with our data dictionary `dat_dict`. We filter the current column using `cur_column()`, which is possible since we’re going use this function inside `dplyr::across()` where `cur_column()` yields us the string name of the current column. Finally, we use `pull(label, name = value)` to get the labels as character vector, along with their matching values as names.

In the final step of our custom function, we “splice” this named vector of label-value pairs into `dplyr::recode()`. Despite “splicing” sounding rather technical, it’s essentially an early evaluation. We evaluate the vector `recode_vec` prior to processing the `dplyr::recode()` call. Assume we want to recode column `b` in `dat` and `recode_vec` looks like this:

`recode_vec <- c("10" = "a", "11" = "b", "12" = "c", "13" = "d", "14" = "e")`

Then …

`dplyr::recode(x, !!! recode_vec)`

… becomes:

`dplyr::recode(x, c("10" = "a", "11" = "b", "12" = "c", "13" = "d", "14" = "e"))`

Keep in mind, we’re presuming hat the data dictionary is available in the global environment, and that our dataset `dat` doesn’t contain a column with an identical name as our data dictionary `dat_dict`. We can bolster the safety of our approach by supplying the data dictionary as a second argument to our `recode_col()` function. For interested readers this is shown in the info box below.

If we want to make sure that our `recode_col()` function uses the correct dictionary then we can supply it as second argument:

```# recode function to be used within `dplyr::mutate(across(...))`
recode_col_safe <- function(x, dict) {

recode_vec <- dict |>
filter(col_nm == cur_column()) |>
pull(label, name = value)

dplyr::recode(x, !!! recode_vec)
}```

In that case, it is not enough to only supply the bare function `recode_col` to `across`. We need to create an anonymous function `\(x, dic) ...` that calls `recode_col`. Here we can use `.env\$dat_dict` to tell dplyr to look for an object `dat_dict` in a parent environment (and not inside our data.frame). This would prevent an error in case our data would contain an column with the same name as our data dictionary `dat_dict`.

```# vector of columns to recode
cols_vec <- unique(dat_dict\$col_nm)

# recoding defined columns
dat |>
mutate(across(all_of(cols_vec),
\(x, dic) recode_col_safe(x, .env\$dat_dict))
)
#> # A tibble: 5 × 3
#>       a b     c
#>   <int> <chr> <chr>
#> 1     1 a     f
#> 2     2 b     g
#> 3     3 c     h
#> 4     4 d     i
#> 5     5 e     j
```

With our custom recode function ready, the next thing we need is a vector of column names that we want to apply this function to. A straightforward way to do this is to get all unique column names from our data dictionary.

```# vector of columns to recode
cols_vec <- unique(dat_dict\$col_nm)
cols_vec
#> [1] "b" "c"
```

#### Recoding

Now we are all set, and the only thing left to do is to call `across()` on all of the column names in our data dictionary `all_of(col_vec)` and let our custom recode function `recode_col()` do its job.

```# recoding defined columns
dat |>
mutate(across(all_of(cols_vec),
recode_col)
)
#> # A tibble: 5 × 3
#>       a b     c
#>   <int> <chr> <chr>
#> 1     1 a     f
#> 2     2 b     g
#> 3     3 c     h
#> 4     4 d     i
#> 5     5 e     j
```

#### Renaming columns

Occasionally, data dictionaries offer not just corresponding values and labels, but also new, typically more descriptive, column names.

I’ve dedicated an entire blog post to the subject of renaming columns based on a lookup table. However, as the approach is slightly different when using a dictionary compared to a pure lookup table, and since this topic is quite relevant, let’s take a brief look at it.

Suppose our data dictionary, `dat_dict2` includes the original abbreviated column names `short_nm` and corresponding descriptive column names `long_nm`:

```#> # A tibble: 10 × 4
#>    short_nm long_nm value label
#>    <chr>    <chr>   <int> <chr>
#>  1 b        new_b      10 a
#>  2 b        new_b      11 b
#>  3 b        new_b      12 c
#>  4 b        new_b      13 d
#>  5 b        new_b      14 e
#>  6 c        new_c      20 f
#>  7 c        new_c      21 g
#>  8 c        new_c      22 h
#>  9 c        new_c      23 i
#> 10 c        new_c      24 j
```
```# dictionary
dat_dict2 <- tibble(

short_nm = c(
rep("b", 5),
rep("c", 5)
),

long_nm = c(
rep("new_b", 5),
rep("new_c", 5)
),

value = c(
10:14,
20:24
),

label = c(
letters[1:5],
letters[6:10]
)
)```

This time, we only require a named vector of corresponding old and new column names. To create that we use our data dictionary and filter it for distinct entries in `short_nm` and `long_nm`. Then we use again `pull()` together with its `name` argument, but note that the old and new values are in reverse positions compared to `recode()`.

```rename_vec <- dat_dict2 |>
distinct(short_nm, long_nm) |>
pull(short_nm, name = long_nm)

rename_vec
#> new_b new_c
#>   "b"   "c"
```

Once we have this named vector of corresponding short and long column names we use `all_of()` inside `rename()`:

```dat |>
rename(all_of(rename_vec))
#> # A tibble: 5 × 3
#>       a new_b new_c
#>   <int> <int> <int>
#> 1     1    10    20
#> 2     2    11    21
#> 3     3    12    22
#> 4     4    13    23
#> 5     5    14    24
```

As we’ve demonstrated above, it’s remarkably straightforward to not just recode, but also rename columns according to a data dictionary. If your work solely involves single datasets, you can skip the next section, which will expand upon the previous approach, demonstrating how to recode a list of datasets.

## Recoding many datasets

#### Setup and Reasoning

Initially, the necessity of having a list of datasets and one comprehensive data dictionary may not be evident. A plausible scenario, for instance, could be a customer survey program composed of numerous surveys featuring similar, yet not identical, survey items. These survey results are stored in a generic table within a database, with columns simply labeled `item1`, `item2`, etc. The number of survey items may vary among surveys. Each row contains a respondent ID to identify a respondent and a survey ID to indicate the specific customer journey under which a respondent was surveyed. As the surveys are similar but not identical, the same column (e.g., `item1`) may contain different values across different surveys. Even if column values are consistent, they could correspond to different labels.

Again, our objective is to recode—and while we’re at it, also rename—all columns listed in the data dictionary across all surveys.

To keep this example as straightforward as possible, we’ll use a minimal setup. In addition to `dat` from before, let’s construct another small toy dataset `dat2` and nest both within a `data.frame` consisting of two columns: the `id` of each dataset and the actual `data`.

```# another dataset
dat2 <- tibble(a = 1:5,
d = 10:14,
e = 7:11)

# a list of datasets
dat_ls <- tibble(id = c("dat1", "dat2"),
data = tibble::lst(dat, dat2))

dat_ls
#> # A tibble: 2 × 2
#>   id    data
#>   <chr> <named list>
#> 1 dat1  <tibble [5 × 3]>
#> 2 dat2  <tibble [5 × 3]>
```

This is how the “data” list-column looks like:

```dat_ls\$data
#> \$dat
#> # A tibble: 5 × 3
#>       a     b     c
#>   <int> <int> <int>
#> 1     1    10    20
#> 2     2    11    21
#> 3     3    12    22
#> 4     4    13    23
#> 5     5    14    24
#>
#> \$dat2
#> # A tibble: 5 × 3
#>       a     d     e
#>   <int> <int> <int>
#> 1     1    10     7
#> 2     2    11     8
#> 3     3    12     9
#> 4     4    13    10
#> 5     5    14    11
```

We assume once more that we have a data dictionary, `dat_dict3`, which contains old, short column names `short_nm`, new long column names `long_nm`, as well as a mapping between `value`s and `label`s. The only difference from the previous example is that we now have an additional `id` column, indicating to which dataset the value-label mappings belong.

```#> # A tibble: 20 × 5
#>    id    short_nm long_nm value label
#>    <chr> <chr>    <chr>   <int> <chr>
#>  1 dat1  b        new_b      10 a
#>  2 dat1  b        new_b      11 b
#>  3 dat1  b        new_b      12 c
#>  4 dat1  b        new_b      13 d
#>  5 dat1  b        new_b      14 e
#>  6 dat1  c        new_c      20 f
#>  7 dat1  c        new_c      21 g
#>  8 dat1  c        new_c      22 h
#>  9 dat1  c        new_c      23 i
#> 10 dat1  c        new_c      24 j
#> 11 dat2  d        new_d      10 f
#> 12 dat2  d        new_d      11 g
#> 13 dat2  d        new_d      12 h
#> 14 dat2  d        new_d      13 i
#> 15 dat2  d        new_d      14 j
#> 16 dat2  e        new_e       7 17
#> 17 dat2  e        new_e       8 18
#> 18 dat2  e        new_e       9 19
#> 19 dat2  e        new_e      10 20
#> 20 dat2  e        new_e      11 21
```
```# a data dictionary containing codes for several datasets
dat_dict3 <- tibble(

id = c(
rep("dat1", 10),
rep("dat2", 10)
),

short_nm = c(
rep("b", 5),
rep("c", 5),
rep("d", 5),
rep("e", 5)
),

long_nm = c(
rep("new_b", 5),
rep("new_c", 5),
rep("new_d", 5),
rep("new_e", 5)
),

value =
c(
10:14,
20:24,
10:14,
7:11
),

label = c(
letters[1:5],
letters[6:10],
letters[6:10],
17:21
)
)```

#### Preparations

Before the actual recoding can take place, we have to prepare two functions:

1. a custom function, `recode_col2()`, that recodes one column according to a specified dictionary and

2. another custom function, `recode_df()`, which applies `recode_col2()` inside `across()` to all specified columns.

The first function is pretty similar to what we have seen earlier, with the only difference being the use of two arguments, the column to recode, `x` and the dictionary according to which the column should be recoded `dict`:

```# recode function
recode_col2 <- function(x, dict) {

col_nm <- cur_column()

recode_vec <- dict |>
filter(short_nm == cur_column()) |>
pull(label, name = value)

dplyr::recode(x, !!! recode_vec)
}```

The second function, `recode_df`, basically wraps the actual recoding that we’ve used in the section above into a function. It takes two arguments: the dataset we want to recode, `dat`, and the `id` of the dataset as specified in the data dictionary (which should be the same as in our nested data.frame `dat_ls`).

```recode_df <- function(dat, dat_id) {

# 1. get the current entries
cur_dat_dict <- dat_dict3 |>
filter(id == dat_id)

# 2. vector of columns to recode
cols_vec <- unique(cur_dat_dict[["short_nm"]])

# 3. use across with both inputs and recode_col2
dat |>
mutate(across(all_of(cols_vec),
\(x) recode_col2(x, cur_dat_dict))
)
}```

This function will be applied iteratively to each individual dataset in our list of data.frames `dat_ls`. The function consists of three steps:

1. We subset our data dictionary with the supplied data ID to ensure only the value-label mappings of the current dataset remain. We call this subset of our dictionary `cur_dat_dict`.

2. We then create a vector of column names, `cols_vec`, that we want to recode. These will consist of all unique column names in the current dictionary `cur_dat_dict`.

3. Finally, we use `dplyr::across()` on `all_of` the columns in `cols_vec` and supply the current column `x` and the current dictionary `cur_dat_dict` to `recode_col2()`.

#### Recoding

The last step is to iteratively apply our new function `recode_df()` to our nested data.frame `dat_ls`. This requires us first to call `rowwise()` on `dat_ls`, which applies all subsequent dplyr functions row-by-row. We then overwrite our column holding the `data` with `list(recode_df(data, id))`. It’s important to note that our custom function must be wrapped in `list()`, as it returns a non-atomic vector (a list of `data.frames`).

```dat_ls |>
rowwise() |>
mutate(data = list(recode_df(data, id))) |>
pull(data) # <= for better printing
#> [[1]]
#> # A tibble: 5 × 3
#>       a b     c
#>   <int> <chr> <chr>
#> 1     1 a     f
#> 2     2 b     g
#> 3     3 c     h
#> 4     4 d     i
#> 5     5 e     j
#>
#> [[2]]
#> # A tibble: 5 × 3
#>       a d     e
#>   <int> <chr> <chr>
#> 1     1 f     17
#> 2     2 g     18
#> 3     3 h     19
#> 4     4 i     20
#> 5     5 j     21
```

#### Rename many datasets based on a data dicitonary

Analogous to `recode_df()`, we can create a function that renames all columns of a dataset, below called `rename_df()`. The function works in two steps. Initially, we create vector of old and new name pairs based on the distinct entries of our data dictionary that are relevant for this dataset `filter(id == dat_id)`. Next, we use this named vector within `rename(all_of()`:

```rename_df <- function(data, dat_id) {

# 1. create a vector of old and new name pairs ...
# ... based on the current dictionary
rename_vec <- dat_dict3 |>
filter(id == dat_id ) |>
distinct(short_nm, long_nm) |>
pull(short_nm, name = long_nm)

# 2. use this vector on the supplied data
data |>
rename(all_of(rename_vec))
}```

Applying this function to our nested `data.frame` of datasets is basically the same approach as outlined above:

```dat_ls |>
rowwise() |>
mutate(data = list(rename_df(data, id))) |>
pull(data) # <= for betters printing
#> [[1]]
#> # A tibble: 5 × 3
#>       a new_b new_c
#>   <int> <int> <int>
#> 1     1    10    20
#> 2     2    11    21
#> 3     3    12    22
#> 4     4    13    23
#> 5     5    14    24
#>
#> [[2]]
#> # A tibble: 5 × 3
#>       a new_d new_e
#>   <int> <int> <int>
#> 1     1    10     7
#> 2     2    11     8
#> 3     3    12     9
#> 4     4    13    10
#> 5     5    14    11
```

## Final Thoughts

Readers who follow me on Twitter might know that, although being a dplyr fan-boy, my feelings towards `dplyr::recode()` were less enthusiastic. My main issue was the unconventional order of arguments, which diverges from what we know from `rename()` or `mutate()`.

The order of arguments was in part a reason why `recode()` was flagged as “questioning” in dplyr version 1.0. Since dplyr version 1.1.0 `recode()` moved one stage further in its life cycle and is now “superseded” by `dplyr::case_match()`.

I’m not sure of the full implications of this development. If I understand the life cycle stages correctly, then “superseded” means that `dplyr::recode()` is not going away any time soon and will continue to be maintained, though it will not see new features.

However, if there’s a chance that `dplyr::recode()` might become deprecated in future major releases, we would need to think about a workaround, since `case_match()` doesn’t support the splicing of named vectors as arguments that we have used above. In that case, I will certainly update this blog post.

Despite my initial skepticism towards `dplyr::recode()`, I have to concede that, particularly in combination with `across()`, it provides a clear and straight-forward workflow. With its ability to be used programmatically and to handel complex cases, I hope that this blog post has convincingly shown the benefits of this approach.

My original dplyr workflow was much more convoluted. The curious reader can find it in the answers to my question on StackOverflow from a couple of years ago.

But even outside of dplyr I haven’t encountered a similarly seamless approach to recoding multiple columns across several datasets. If you are up for a challenge, I’d love to see what base R or data.table solutions you can come with to tackle this problem. Let me know in the comments or via Twitter or Mastodon if you have an alternative approach.

Session Info
```#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.1 (2022-06-23)
#>  os       macOS Big Sur ... 10.16
#>  system   x86_64, darwin17.0
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Europe/Berlin
#>  date     2023-06-30
#>  pandoc   2.19.2 @ /Applications/RStudio.app/Contents/MacOS/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package * version date (UTC) lib source
#>  dplyr   * 1.1.0   2023-01-29 [1] CRAN (R 4.2.0)
#>
#>  [1] /Library/Frameworks/R.framework/Versions/4.2/Resources/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
```
To leave a comment for the author, please follow the link and comment on their blog: Mostly Counting | Tim Tiefenbach.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

# Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts.(You will not see this message again.)

Click here to close (This popup will not appear again)