Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

We recently saw a great recurring R question: “how do you use one column to choose a different value for each row?” That is: how do you use a column as an index? Please read on for some idiomatic base R, data.table, and dplyr solutions.

Let’s say we have some example data:

df <- data.frame(x = c(1, 2, 3, 4),
y = c(5, 6, 7, 8),
choice = c("x", "y", "x", "z"),
stringsAsFactors = FALSE)


Suppose our goal is to build a new column called “derived” that for each row chooses from the column named by the names in the column called “choice”. In our case the result would be “[1, 6, 3, NA]” (the last value being “NA” as “z” is not a column of df).

How does one do this?

In base-R we can use a matrix-index to specify the set of data cells we want (please see “help([)” for details):

dtmp <- df[,
intersect(df$choice, colnames(df)), drop = FALSE] dtmp[ cbind(seq_len(nrow(df)), match(df$choice, colnames(dtmp)))]
]
# [1]  1  6  3 NA


data.table has some known notation and solutions for this (please see “help(.SD)” for details):

library("data.table")
dt <- as.data.table(df)

dt[, derived := .SD[[choice]], by = choice][]

#    x y choice derived
# 1: 1 5      x       1
# 2: 2 6      y       6
# 3: 3 7      x       3
# 4: 4 8      z      NA


What we are doing is grouping by “choice”. In each group the choice column is a constant, and data.table supplies the name of the group as a scalar in the variable named choice. This lets us assign values from that column all at once (.SD, being data.table’s notation for the sub-data.table representing the group). Some discussion of the ideas can be found here.

What is going on: is we are reducing the choice column to a series of scalars- so we can then use the scalar column selection operators (such as “[[]]”) to access columns.

There is also a dplyr solution:

df %>%
mutate(c2 = ifelse(choice %in% colnames(df),
choice,
colnames(df)[[1]])) %>%
group_by(c2) %>%
mutate(derived = .data[[ c2[[1]] ]]) %>%
ungroup() %>%
mutate(derived = ifelse(choice %in% colnames(df),
derived,
NA)) %>%
select(-c2)
# # A tibble: 4 x 4
#       x     y choice derived
#
# 1     1     5 x            1
# 2     2     6 y            6
# 3     3     7 x            3
# 4     4     8 z           NA


It is essentially the same trick as the data.table solution: break the problem into groups where the choice is not varying (allowing us to use scalar column specifiers to do the work). There is some extra care to deal with mis-matches. Also, the “[[1]]” is converting the column of choice values (all of which are the same due to the grouping by choice) into a scalar. We call this trick a “pseudo aggregator”, and we often teach the technique in SQL (where it is often min() or max()).

We have wrapped base-R and data.table implementations into method a in rquery/rqdatatable here. It works like this:

library("rqdatatable")

lookup_by_column(df, "choice", "derived")

#    x y choice derived
# 1: 1 5      x       1
# 2: 2 6      y       6
# 3: 3 7      x       3
# 4: 4 8      z      NA


Also, rquery can perform this operation on database tables! (In database terms this looks a lot like a variation of a controlled coalesce.)

These are some ideas that will eventually be expanded and make it into our theory of coordinatized data (please also see here) and Practical Data Science in R 2nd Edition.