Converting Individual Binary vectors to a Value based on Column Names

May 7, 2019
By

(This article was first published on R – William Doane, and kindly contributed to R-bloggers)

When processing data downloaded from popular survey engines, it’s not uncommon for multiple choice questions to be represented as one column per possible response coded as 0/1. So, a question with just two responses might be downloaded as part of a CSV with one column for q1_1 and another for q1_2. If the responses are mutually exclusive, then (q1_1 == 0 iff q1_2 == 1) and (q1_1 == 1 iff q1_2 == 0). If the responses are part of a “choose all that apply” question, then it’s possible to have multiple 1s.

How can these individual binary indicator variables be reassembled into a single response variable?

First, let’s simulate some response data for non-mutually exclusive questions—each row represents one respondent’s choices:

df <- data.frame(
  q1_1 = round(runif(5), 0),
  q1_2 = round(runif(5), 0),
  q1_3 = round(runif(5), 0),
  
  q2_1 = round(runif(5), 0),
  q2_2 = round(runif(5), 0),
  
  q3_1 = round(runif(5), 0),
  q3_2 = round(runif(5), 0),
  q3_3 = round(runif(5), 0),
  q3_4 = round(runif(5), 0)
)

df

  q1_1 q1_2 q1_3 q2_1 q2_2 q3_1 q3_2 q3_3 q3_4
1    1    0    0    1    0    1    1    0    1
2    0    1    0    0    1    1    1    0    1
3    1    1    1    1    1    0    0    0    1
4    0    1    0    1    1    1    1    0    1
5    0    0    1    1    0    1    0    0    0

R’s dplyr package offers the coalesce function, which doesn’t suit my needs when the data contains 0s for non-selected response choices. Notice below in row 2, for example, that q1 and q2 select the first non-NA values, which is 0:

library(dplyr)
 
df %>%
  mutate(q1 = coalesce(q1_1, q1_2, q1_3)) %>%
  mutate(q2 = coalesce(q2_1, q2_2)) %>%
  mutate(q3 = coalesce(q3_1, q3_2, q3_3, q3_4))

  q1_1 q1_2 q1_3 q2_1 q2_2 q3_1 q3_2 q3_3 q3_4 q1 q2 q3
1    1    0    0    1    0    1    1    0    1  1  1  1
2    0    1    0    0    1    1    1    0    1  0  0  1
3    1    1    1    1    1    0    0    0    1  1  1  1
4    0    1    0    1    1    1    1    0    1  0  1  1
5    0    0    1    1    0    1    0    0    0  0  1  1

If you replace all 0s with NA, you can get closer to what you need:

df %>%
  mutate_all(~ifelse(. == 0, NA_real_, .)) %>%
  mutate(q1 = coalesce(q1_1, q1_2, q1_3)) %>%
  mutate(q2 = coalesce(q2_1, q2_2)) %>%
  mutate(q3 = coalesce(q3_1, q3_2, q3_3, q3_4))

  q1_1 q1_2 q1_3 q2_1 q2_2 q3_1 q3_2 q3_3 q3_4 q1 q2 q3
1    1   NA   NA    1   NA    1    1   NA    1  1  1  1
2   NA    1   NA   NA    1    1    1   NA    1  1  1  1
3    1    1    1    1    1   NA   NA   NA    1  1  1  1
4   NA    1   NA    1    1    1    1   NA    1  1  1  1
5   NA   NA    1    1   NA    1   NA   NA   NA  1  1  1

Unfortunately, the q1 vector here only tells us that there was some response by each respondent, not which response they gave for q1.

It would be nice to have a version of coalesce that gathered not the first non-NA value, but the column name of the first non-NA value. Here, I’ll use the structure of dplyr’s coalesce as a model:

coalesce_colname <-
  function(...) {
    if (missing(..1)) {
      abort("At least one argument must be supplied")
    }
   
    colnames <- as.character(as.list(match.call()))[-1]
   
    values <- list(...)
   
    x <- values[[1]]
    x[!is.na(x)] <- colnames[1]
   
    values <- values[-1]
    colnames <- colnames[-1]
   
    for (i in seq_along(values)) {
      x <- ifelse(is.na(x) & !is.na(values[[i]]), colnames[i], x)
    }
   
    x
  }

With this, you have a drop-in replacement for coalesce that captures the column name:

df %>%
  mutate_all(~ifelse(. == 0, NA_real_, .)) %>%
  mutate(q1 = coalesce_colname(q1_1, q1_2, q1_3)) %>%
  mutate(q2 = coalesce_colname(q2_1, q2_2)) %>%
  mutate(q3 = coalesce_colname(q3_1, q3_2, q3_3, q3_4))

  q1_1 q1_2 q1_3 q2_1 q2_2 q3_1 q3_2 q3_3 q3_4   q1   q2   q3
1    1   NA   NA    1   NA    1    1   NA    1 q1_1 q2_1 q3_1
2   NA    1   NA   NA    1    1    1   NA    1 q1_2 q2_2 q3_1
3    1    1    1    1    1   NA   NA   NA    1 q1_1 q2_1 q3_4
4   NA    1   NA    1    1    1    1   NA    1 q1_2 q2_1 q3_1
5   NA   NA    1    1   NA    1   NA   NA   NA q1_3 q2_1 q3_1

and with a little effort, you can wrangle the column name to extract the response value:

df %>%
  mutate_all(~ifelse(. == 0, NA_real_, .)) %>%
  mutate(q1 = coalesce_colname(q1_1, q1_2, q1_3)) %>%
  mutate(q2 = coalesce_colname(q2_1, q2_2)) %>%
  mutate(q3 = coalesce_colname(q3_1, q3_2, q3_3, q3_4)) %>%
  
  mutate_at(c("q1", "q2", "q3"), ~stringr::str_extract(., "\\d+$"))

  q1_1 q1_2 q1_3 q2_1 q2_2 q3_1 q3_2 q3_3 q3_4 q1 q2   q3
1    1   NA   NA    1   NA    1    1   NA    1  1  1    1
2   NA    1   NA   NA    1    1    1   NA    1  2  2    1
3    1    1    1    1    1   NA   NA   NA    1  1  1    4
4   NA    1   NA    1    1    1    1   NA    1  2  1    1
5   NA   NA    1    1   NA    1   NA   NA   NA  3  1    1

To leave a comment for the author, please follow the link and comment on their blog: R – William Doane.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)