Non-standard-evaluation and standard evaluation in dplyr

June 12, 2016

(This article was first published on Clean Code, and kindly contributed to R-bloggers)

I love the dplyr package and all of its functions. However, if you use dplyr in the normal way inside the functions of your own package, R CMD check will give you a NOTE: "No visible binding for global variable NAME OF YOUR VARIABLE" (see note 1). The functions do work and everything behaves normally, but if you submit your package to CRAN, such a NOTE is not acceptable. A workaround is to add globalVariables() to one of your scripts, for instance:

globalVariables(c("var1", "var2", "varyourmother"))

This works, but it is not necessary.

NSE

dplyr (and some other packages and functions) works with non-standard evaluation (NSE). One example is library(magrittr) vs library("magrittr"): both work. But install.packages(magrittr) vs install.packages("magrittr") is different: there you need the quotes. In almost all functions in R you need quotes when you refer to something by name, but in some functions you don’t. They are designed to work in a non-standard way, and some do not even offer a standard alternative.
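To make the difference concrete, here is a minimal base R sketch. The helpers show_nse() and show_se() are made up for this illustration and are not part of any package; they only show what "capturing the name as typed" means compared with ordinary evaluation.

```r
# Hypothetical helper: captures the argument as typed, without evaluating it
show_nse <- function(x) {
  deparse(substitute(x))
}

# Hypothetical helper: evaluates its argument the standard way
show_se <- function(x) {
  x
}

show_nse(magrittr)     # the unquoted name is captured and turned into text
show_se("magrittr")    # a string is an ordinary value, so this just works
# show_se(magrittr)    # fails unless an object called magrittr exists
```

library() behaves roughly like the first helper, install.packages() like the second.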

I will focus on the dplyr functions only; a general introduction to non-standard evaluation might come later.

Under the hood the dplyr functions work just like other functions. In fact, all of them use normal (standard) evaluation, but for interactive use there is a non-standard evaluation version, which saves you typing. The interactive version is first evaluated with the lazyeval package and is then sent to the SE version.
There is even a naming scheme (see note 2):
> Every function that uses NSE should have a standard evaluation (SE) escape hatch that does the actual computation. The SE-function name should end with _.

Therefore each verb comes in two versions: select() and select_(), mutate() and mutate_(), etc. Under the hood select() is evaluated with the lazyeval package and sent to select_().
In functions you should use the SE versions, not only to stop the NOTEs from appearing, but also because it gives you extra options.
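The NSE-wrapper-plus-SE-escape-hatch pattern can be sketched in base R. This is not how dplyr implements it (dplyr uses lazyeval); pick() and pick_() are hypothetical names chosen only to mirror the underscore naming scheme.

```r
# SE version (hypothetical): takes column names as strings and does the work
pick_ <- function(data, cols) {
  data[, cols, drop = FALSE]
}

# NSE version (hypothetical): captures the unquoted names, turns them into
# strings, and hands them to the SE version
pick <- function(data, ...) {
  cols <- vapply(as.list(substitute(list(...)))[-1], deparse, character(1))
  pick_(data, cols)
}

head(pick(mtcars, mpg, wt), 3)            # interactive style, no quotes
head(pick_(mtcars, c("mpg", "wt")), 3)    # programmatic style, with strings
```

Inside your own functions you would call pick_(), because there the column names usually arrive as strings in a variable.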

From NSE (the standard interactive use) to SE (standard evaluation within functions)

So this is a list of things I regularly do with NSE, and their translation into SE.

I will use a data file about students in higher education in the Netherlands (duo2015_tidy in the examples below).

Background

There are basically three ways to quote variables that dplyr/lazyeval understands:

  • with a formula ~mean(mpg)
  • with quote() quote(mean(mpg))
  • as a string "mean(mpg)"
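A small base R sketch of what these three forms are. It evaluates each of them by hand against mtcars; dplyr and lazyeval do their own processing internally, so this only illustrates that all three carry the same expression.

```r
f <- ~mean(mpg)          # formula: also remembers the environment it was created in
q <- quote(mean(mpg))    # quoted call
s <- "mean(mpg)"         # string: has to be parsed before it can be evaluated

# Evaluate each one with mtcars as the data
from_formula <- eval(f[[2]], mtcars)   # f[[2]] is the right-hand side of a one-sided formula
from_quote   <- eval(q, mtcars)
from_string  <- eval(parse(text = s)[[1]], mtcars)

# All three give the same value as mean(mtcars$mpg)
all.equal(from_formula, from_quote)
all.equal(from_quote, from_string)
```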

Select()

Example of the select function from dplyr.

library(dplyr)
# first the normal NSE version
select(duo2015_tidy, OPLEIDINGSNAAM.ACTUEEL, FREQUENCY)
# standard evaluation
select_(duo2015_tidy, ~OPLEIDINGSNAAM.ACTUEEL)
# one formula per column works; a comma or + inside a single formula doesn't
select_(duo2015_tidy, ~OPLEIDINGSNAAM.ACTUEEL, ~FREQUENCY)
# select_(duo2015_tidy, quote(OPLEIDINGSNAAM.ACTUEEL, FREQUENCY)) # nope: quote() takes a single expression
select_(duo2015_tidy, quote(OPLEIDINGSNAAM.ACTUEEL), quote(FREQUENCY)) # yes!
select_(duo2015_tidy, "OPLEIDINGSNAAM.ACTUEEL", "FREQUENCY", "YEAR", "OPLEIDINGSFASE.ACTUEEL") # works

Output:

Source: local data frame [24,150 x 2]

   OPLEIDINGSNAAM.ACTUEEL FREQUENCY
                    (chr)     (int)
1     B Aarde en Economie       121
2     B Aarde en Economie        54
3     B Aarde en Economie       140
4     B Aarde en Economie        52
5     B Aarde en Economie       132
6     B Aarde en Economie        55
7     B Aarde en Economie       144

Filter()

Then the filter function (I also use the select function here).

# ways that work
filter(duo2015_tidy, YEAR == 2015) %>% select(OPLEIDINGSNAAM.ACTUEEL, FREQUENCY)
filter_(duo2015_tidy, ~YEAR == 2015) %>% select_(~OPLEIDINGSNAAM.ACTUEEL, ~FREQUENCY)
filter_(duo2015_tidy, quote(YEAR == 2015)) %>% select_(~OPLEIDINGSNAAM.ACTUEEL, ~FREQUENCY)
filter_(duo2015_tidy, "YEAR == 2015") %>% select_(~OPLEIDINGSNAAM.ACTUEEL, ~FREQUENCY)
# or pass a list of columns to .dots
dotsfilter <- list(~OPLEIDINGSNAAM.ACTUEEL, ~FREQUENCY)
filter_(duo2015_tidy, "YEAR == 2015") %>% select_(.dots = dotsfilter)

Output:

Source: local data frame [4,830 x 2]

         OPLEIDINGSNAAM.ACTUEEL FREQUENCY
                          (chr)     (int)
1           B Aarde en Economie       151
2           B Aarde en Economie        60
3           B Aardwetenschappen         0
4           B Aardwetenschappen       149
5           B Aardwetenschappen       335
6           B Aardwetenschappen         0
7           B Aardwetenschappen        83

Group_by() & Summarize()

Group_by and summarize examples; see also the NSE vignette of dplyr (see note 3).

group_by(duo2015_tidy, GENDER) %>% summarise(total = n())
# group by in SE, and summarize with NSE
group_by_(duo2015_tidy, ~GENDER) %>% summarise(total = sum(FREQUENCY))
# both SE, pass list of arguments to .dots
# group_by_(duo2015_tidy, ~GENDER) %>% summarise_(.dots = list(~total = sum(FREQUENCY))) # does not work: not valid R syntax
group_by_(duo2015_tidy, ~GENDER) %>% summarise_(.dots = list(~sum(FREQUENCY))) # does work
dots <- list(~sum(FREQUENCY))
group_by_(duo2015_tidy, ~GENDER) %>% summarise_(.dots = dots)
group_by_(duo2015_tidy, ~GENDER) %>% summarise_(.dots = setNames(dots, "total"))
group_by_(duo2015_tidy, ~GENDER) %>% summarise_("sum(FREQUENCY)")
group_by_(duo2015_tidy, ~GENDER) %>% summarise_(~sum(FREQUENCY))

Output:

Source: local data frame [2 x 2]

  GENDER sum(FREQUENCY)
   (chr)          (int)
1    MAN         609755
2  VROUW         639609

Mutate() and slightly more advanced use

You want to add up two columns, but you don’t yet know which columns they will be (example from Paul Hiemstra, see note 4).

# normal interactive use  
library(dplyr)
mtcars %>% mutate(new_column = mpg + wt)

So you would like a function that does something like this:

f <- function(col1, col2, new_col_name) {
    mtcars %>% mutate(new_col_name = col1 + col2)
}

The problem is that R will search for columns named col1 and col2, which don’t exist.
Furthermore, the name of the end result will be new_col_name, and not the content of new_col_name. To get around non-standard evaluation, you can use the lazyeval package. The following function does what we expect:

f <- function(col1, col2, new_col_name) {
    mutate_call <- lazyeval::interp(~ a + b, a = as.name(col1), b = as.name(col2))
    mtcars %>% mutate_(.dots = setNames(list(mutate_call), new_col_name))
}

You first create a call that will be evaluated by mutate_(). The call is first interpolated, so that mutate_() sees the final, correct column names.
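To see what this interpolation step does, here is a base R analogue. It uses substitute() instead of lazyeval::interp(), so it is a sketch of the idea and not the lazyeval implementation; the variable names col1, col2, sum_call and new_values are only for illustration.

```r
# Column names arrive as strings, as they would inside a function
col1 <- "mpg"
col2 <- "wt"

# Replace the placeholders a and b with the real column names
sum_call <- substitute(a + b, list(a = as.name(col1), b = as.name(col2)))
sum_call    # the unevaluated expression, now written in terms of mpg and wt

# Evaluating the finished call inside the data gives the new column
new_values <- eval(sum_call, mtcars)
head(new_values)
```

The placeholders are filled in before anything is evaluated, which is why the finished call can be handed to a function that evaluates it inside a data frame.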

Of course, if you already know which variables you will use, there is no need for interpolation, and something like this would work:

mtcars %>% mutate_(.dots = setNames(list(~mpg+wt), "sum mpg wt"))
mtcars %>% mutate_(.dots = list(~mpg+wt)) # if you don't need the name specified

NSE in context

So if you want to use the dplyr functions in your own functions these are some variants that you could use. See the list of References and Notes for more information.

References:

question on stack overflow
using mutate inside a function, shows excellent use of mutate function, r-bloggers

fun standardizing NSE (he has a particular kind of fun…)
advanced r chapter about NSE – hadley wickham
on r, I have not read this one

NOTES

  1. An issue that demonstrates the R CMD check NOTE: https://github.com/Rdatatable/data.table/issues/850

  2. Wow, the package was updated only yesterday, but this vignette describes the naming scheme: https://cran.r-project.org/web/packages/lazyeval/vignettes/lazyeval-old.html

  3. NSE in dplyr https://cran.r-project.org/web/packages/dplyr/vignettes/nse.html

  4. This example comes from Paul Hiemstra on his numbertheory blog that I found via r-bloggers. http://www.numbertheory.nl/2015/09/23/using-mutate-from-dplyr-inside-a-function-getting-around-non-standard-evaluation/ With the reference to the r-bloggers version in the links above.

Non-standard-evaluation and standard evaluation in dplyr was originally published at Clean Code on June 13, 2016.
