A Subtle Flaw in Some Popular R NSE Interfaces

September 23, 2018
By

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

It is no great secret: I like value oriented interfaces that preserve referential transparency. It is the side of the public debate I take in R programming.

"One of the most useful properties of expressions is that called by Quine referential transparency. In essence this means that if we wish to find the value of an expression which contains a sub-expression, the only thing we need to know about the sub-expression is its value."

Christopher Strachey, "Fundamental Concepts in Programming Languages", Higher-Order and Symbolic Computation, 13, 1149, 2000, Kluwer Academic Publishers (lecture notes written by Christopher Strachey for the International Summer School in Computer Programming at Copenhagen in August, 1967).

Please read on for discussion of a subtle bug shared by a few popular non-standard evaluation interfaces.

Most of my work on non-standard-evaluation (NSE or code capturing interfaces, alternatives to value oriented or referentially transparent interfaces) is to help eliminate and contain use of NSE. For instance wrapr::let() is designed to adapt existing NSE interfaces into standard value oriented interfaces, not to create new NSE interfaces. I pretty much feel composing one NSE interface into another NSE interface is cleaning up one mess by adding another mess. If wrapr::let() fails to clean up a mess: it can be evidence of a limitation of wrapr::let(), evidence of user error, or evidence the initial mess was already a problem.

Let’s take a look at a standard interface. For example we can use base-R to select a column from a data.frame by name (technically we are building a new data.frame with the selected column(s)). The clarity of the value oriented notation is most valuable when exposed to potentially unclear code (we most value help when we most need help). For example: suppose the user is selecting the column named x, and their selection is in a poorly-named variable called y (technically values are not "in" variables, but there are environments associating values with names).

y <- "x"
d <- data.frame(x = 1, y = 2)

d[, y, drop= FALSE] # returns a data.frame
##   x
## 1 1
d[[y]]  # returns a column
## [1] 1

And if the column we are looking for ("x") is not present we still get reliable and easily predictable results.

y <- "x"
d <- data.frame(y = 2)

d[, y, drop = FALSE] # exception, signalling no such column
## Error in `[.data.frame`(d, , y, drop = FALSE): undefined columns selected
d[[y]]  # NULL, signalling no such column
## NULL

Beyond the uninformative choice of variable name there was nothing confusing or dangerous about the above code.

Now let’s look at a non-standard or code capturing interface: "$".

y <- "x"
d <- data.frame(x = 1, y = 2)

d$y
## [1] 2
d <- data.frame(x = 1)
d$y
## NULL

Notice both of these operations used the name "y" (captured from user code) to try and retrieve a column named "y", regardless of what the value referred to by the variable "y" is. In these cases the interface is unambiguously taking the name from the user code. This non-standard interface is convenient and reliable, and has an obvious standard analogue in "[[]]", so it does not present problems.

Now let’s look at a more problematic non-standard interface base::subset()

y <- "x"
d <- data.frame(x = 1, y = 2)

subset(d, select = y)
##   y
## 1 2

subset() appears to be using the name found in the code to look-up columns. This is not quite the case: from help(subset) we see the select argument is defined as "expression, indicating columns to select from a data frame." What may not be obvious from a cursory read of the documentation is these expressions are evaluated in a new environment that maps the data.frame column names to integers (a clever way to replace column names with column indices) but in an enclosure by the current execution environment (via a call to parent.frame()). This subtlety means that any names that are not column names of the data.frame are evaluated as expressions in the caller’s environment. This in turn means they are replaced by values from this environment, which leads to very confusing outcomes such as the following.

y <- "x"
d <- data.frame(x = 1)

subset(d, select = y)
##   x
## 1 1

Notice the column "x" was returned, (not a NULL). Essentially each specified column name gets two chances to resolve to a data.frame column name or index: first from an environment mapping data.frame names to indices, and second from the calling environment.

The above issue would be fixed if the expression evaluation occurred in a limited environment where only a few symbols (such as "-" and "c" were defined). Such a function is given below:

subset_strict <- function(x, subset, select, drop = FALSE) {
  r <- if (missing(subset))
    rep_len(TRUE, nrow(x))
  else {
    e <- substitute(subset)
    r <- eval(e, x, parent.frame())
    if (!is.logical(r))
      stop("'subset' must be logical")
    r & !is.na(r)
  }
  vars <- if (missing(select))
    TRUE
  else {
    nl <- as.list(seq_along(x))
    names(nl) <- names(x)
    env <- new.env(parent = emptyenv())
    for(sym in c("-", "c", ":")) {
      assign(sym, get(sym, envir = parent.frame()), envir = env)
    }
    eval(substitute(select), nl, env)
  }
  x[r, vars, drop = drop]
}

This function is stricter than base::subset(), reliably signalling errors when non-columns are requested.

y <- "x"
d <- data.frame(x = 1)

subset_strict(d, select = y)
## Error in eval(substitute(select), nl, env): object 'y' not found
d <- data.frame(x = 1, y = 2, z = 3)
subset_strict(d, select = y)
##   y
## 1 2
subset_strict(d, select = -y)
##   x z
## 1 1 3
subset_strict(d, select = c(x, y))
##   x y
## 1 1 2
subset_strict(d, select = c(x:z))
##   x y z
## 1 1 2 3

Roughly I would say the above issue is a reason to not use base::subset() when you are not around to check if it is correct (i.e. in scripts or packages). And help(subset) has some text about this:

Warning

This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences.

If you don’t use base::subset() you would think this is a non-issue. Except a similar mal-feature seems to be also in dplyr::select(), and dplyr documentation doesn’t currently seem to advise not using dplyr in non-interactive environments.

help(select, package = "dplyr") tells us that select() takes positional argument from "..." and they are to be "One or more unquoted expressions separated by commas. You can treat variable names like they are positions." What is unclear is: what environment "..." is actually evaluated in.

R.version.string
## [1] "R version 3.5.0 (2018-04-23)"
library("dplyr")
packageVersion("dplyr")
## [1] '0.7.6'
packageVersion("rlang")
## [1] '0.2.2'
packageVersion("tidyselect")
## [1] '0.2.4'

Now consider the following code:

y <- "x"

data.frame(x = 1) %>% 
  select(y)

My theory is the typical dplyr user expectation is that the above should throw an exception (as that is what select(data.frame(x = 1), z) does in our clean environment). My opinion is: the above expresses that we are asking for a column named "y", and there is no such column in the data. The unfortunate coincidence that "y" has a value in our environment should be irrelevant to a select(). "y"’s value might (or might not) be relevant in a complex expression, such as a mutate()– but not to a select().

We the result actually is as before: the value "y" refers to in the executing environment can unfortunately alter the result (depending on the columns present in the data.frame).

y <- "x"

data.frame(x = 1) %>% 
  select(y)
##   x
## 1 1

Notice we got back a data.frame with a column named "x". There was no warning or error indicated. This wrong value could poison later results (especially if we had used pull() to convert the data into an unnamed vector). Maybe your code has never run into this, or maybe it already quietly has (as it is not a signalling error).

Notice how the above result differs from the following:

y <- "x"

data.frame(x = 1, y = 2) %>% 
  select(y)
##   y
## 1 2

The name/string version of this issue possibly goes back to dplyr 0.7.0 (the introduction of rlang). A numeric-index version of the issue can be exhibited for dplyr 0.5.0 and dplyr 0.4.0 (and possibly earlier, dplyr 0.4.0 is the earliest version of dplyr that is convenient to install in R 3.5.0). Notice how results vary with data changes below.

# restart R
install.packages("dplyr_0.4.0.tar.gz", repos = NULL)
# restart R
packageVersion("dplyr")
# [1] ‘0.4.0’
library("dplyr")


y <- "x"

data.frame(x = 1, y = 2) %>% 
   select(y)
#   y
# 1 2

data.frame(x = 1) %>% 
  select(y)
# Error: All select() inputs must resolve to integer column positions.
# The following do not:
# *  y

y <- 1L

data.frame(x = 1, y = 2) %>% 
   select(y)
#   y
# 1 2

data.frame(x = 1) %>% 
  select(y)
#   x
# 1 1
  

# restart R
install.packages("dplyr")
# restart R

It is possible this is in fact designed or intentional behavior. However, if that is the case it still violates some principles of safety through isolation and least astonishment.

In dplyr 0.7.6: select(.data[[y]]) shows the same issue, while select(.data[[!!y]]) does not (and select(.data$y) appears to always use the name "y", which seems right). I suspect ".data[y]", while commonly taught, may not be well-formed dplyr code (i.e. the user should not type that so the results are possibly user error, and not package error).

In conclusion:

  • base::subset() in addition to having known issues with the non-standard evaluation of its subset argument has a dangerous double look-up mal-feature in its select argument.
  • dplyr::select() has variations of a similar mal-feature. In the presence of this issue the only safe way to use name-capturing code with dplyr::select() appears to be select("y") or select(.data$y), which are both inconvenient enough to defeat the purpose of non-standard evaluation (the eliding of quotation marks in this case).
  • It is possible select(.data[[y]]) is not well-formed dplyr code, or is ambiguous (is it meant to operate on names, on values, or both?), or has a similar bug.

I think the above shows a small bit of how NSE interfaces can be confusing and hard to get right both for users and package developers. This is why I advise avoiding introducing NSE interfaces unless they are buying you something much larger than leaving out a few quote marks.

"Those who would give up essential Referential Transparency, to purchase a little temporary Notational Conveneince, deserve neither Referential Transparency nor Notational Conveneince."

(not) Ben Franklin

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers


Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)