Neat New seplyr Feature: String Interpolation

[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers].

The R package seplyr has a neat new feature: the function seplyr::expand_expr() which implements what we call “the string algebra” or string expression interpolation. The function takes an expression of mixed terms, including: variables referring to names, quoted strings, and general expression terms. It then “de-quotes” all of the variables referring to quoted strings and “dereferences” variables thought to be referring to names. The entire expression is then returned as a single string.


This provides a powerful way to easily work complicated expressions into the seplyr data manipulation methods.

The method is easiest to see with an example:

library("seplyr")
## Loading required package: wrapr
ratio <- 2
compCol1 <- "Sepal.Width"
expr <- expand_expr("Sepal.Length" >= ratio * compCol1)
print(expr)
## [1] "Sepal.Length >= ratio * Sepal.Width"

expand_expr works by capturing the user-supplied expression unevaluated, performing some transformations, and returning the entire expression as a single quoted string (essentially returning new source code).

Notice that in the example above one layer of quoting was removed from "Sepal.Length", and the name referred to by “compCol1” was substituted into the expression. “ratio” was left alone because its value is not a string and hence cannot be a name (unbound or free variables are also left alone). So the substitution performed depends on what values are present in the environment.
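To make the capture, transform, and return-as-string steps concrete, here is a minimal base-R sketch of the idea. It is not the seplyr source code: sketch_expand() and its inner helpers are hypothetical names invented for this illustration, and the real expand_expr() may handle many cases this sketch ignores.

```r
# Hypothetical helper for illustration only; this is NOT seplyr's implementation.
sketch_expand <- function(expr, env = parent.frame()) {
  force(env)
  # Parse a string as R source: "Sepal.Width" becomes a name,
  # "'setosa'" becomes the string literal "setosa".
  strip_layer <- function(s) parse(text = s, keep.source = FALSE)[[1]]
  walk <- function(e) {
    if (is.character(e) && length(e) == 1) {
      return(strip_layer(e))           # quoted string: remove one layer of quoting
    }
    if (is.name(e)) {
      nm <- as.character(e)
      if (exists(nm, envir = env)) {
        v <- get(nm, envir = env)
        if (is.character(v) && length(v) == 1) {
          return(strip_layer(v))       # variable holding a string: de-reference it
        }
      }
      return(e)                        # unbound or non-string variable: leave alone
    }
    if (is.call(e)) {
      # Keep the function or operator, transform only its arguments.
      return(as.call(c(list(e[[1]]), lapply(as.list(e)[-1], walk))))
    }
    e                                  # numbers and other constants pass through
  }
  # substitute() captures the argument unevaluated; deparse() turns it into text.
  paste(deparse(walk(substitute(expr))), collapse = " ")
}

ratio <- 2
compCol1 <- "Sepal.Width"
sketch_expand("Sepal.Length" >= ratio * compCol1)
## [1] "Sepal.Length >= ratio * Sepal.Width"
```

The sketch treats a literal string and a variable holding a string the same way: the text is parsed as code, which is what strips exactly one layer of quoting.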

If you want to be stricter in your specification, you could add quotes around any symbol you do not want de-referenced. For example:

expand_expr("Sepal.Length" >= "ratio" * compCol1)
## [1] "Sepal.Length >= ratio * Sepal.Width"

After the substitution the returned quoted expression is exactly in the form seplyr expects. For example:

resCol1 <- "Sepal_Long"

datasets::iris %.>%
  mutate_se(., 
            resCol1 := expr) %.>%
  head(.)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal_Long
## 1          5.1         3.5          1.4         0.2  setosa      FALSE
## 2          4.9         3.0          1.4         0.2  setosa      FALSE
## 3          4.7         3.2          1.3         0.2  setosa      FALSE
## 4          4.6         3.1          1.5         0.2  setosa      FALSE
## 5          5.0         3.6          1.4         0.2  setosa      FALSE
## 6          5.4         3.9          1.7         0.4  setosa      FALSE

The %.>% (dot pipe) and := (named map builder) operators are supplied by the wrapr package, which seplyr loads. The idea is: seplyr::mutate_se(., "Sepal_Long" := "Sepal.Length >= ratio * Sepal.Width") should be equivalent to dplyr::mutate(., Sepal_Long = Sepal.Length >= ratio * Sepal.Width).
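One way to see what "a string that is source code" means is to evaluate the string by hand against the data frame in base R. This is only a conceptual sketch and an assumption about the general idea; seplyr itself hands the work to dplyr rather than calling eval() like this.

```r
# Conceptual sketch only: treat the string as R source and evaluate it
# with the data frame's columns in scope.
ratio <- 2
expr <- "Sepal.Length >= ratio * Sepal.Width"
d <- datasets::iris
# Columns are looked up in d first; other names (here ratio) are found
# in the enclosing environment.
d[["Sepal_Long"]] <- eval(parse(text = expr), envir = d, enclos = environment())
head(d)
```

Column names resolve inside the data frame, while ratio is picked up from the calling environment, which mirrors how the mutate() form above reads.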

seplyr also provides a number of seplyr::*_nse() convenience forms wrapping all of these steps into one operation. For example:

datasets::iris %.>%
  mutate_nse(., 
             resCol1 := "Sepal.Length" >= ratio * compCol1) %.>%
  head(.)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal_Long
## 1          5.1         3.5          1.4         0.2  setosa      FALSE
## 2          4.9         3.0          1.4         0.2  setosa      FALSE
## 3          4.7         3.2          1.3         0.2  setosa      FALSE
## 4          4.6         3.1          1.5         0.2  setosa      FALSE
## 5          5.0         3.6          1.4         0.2  setosa      FALSE
## 6          5.4         3.9          1.7         0.4  setosa      FALSE

To use string literals you merely need one extra layer of quoting:

"is_setosa" := expand_expr(Species == "'setosa'")
##               is_setosa 
## "Species == \"setosa\""
datasets::iris %.>%
  transmute_nse(., 
                "is_setosa" := Species == "'setosa'") %.>%
  summary(.)
##  is_setosa      
##  Mode :logical  
##  FALSE:100      
##  TRUE :50

The purpose of all of the above is to mix names that are known while we are writing the code (these are quoted) with names that may not be known until later (i.e., column names supplied as parameters). This allows the easy creation of useful generic functions such as:

countMatches <- function(data, columnName, targetValue) {
  # add one extra layer of quotes so targetValue is kept as a string value
  # instead of being de-referenced as a name
  targetSym <- paste0('"', targetValue, '"') 
  data %.>%
    transmute_nse(., "match" := columnName == targetSym) %.>%
    group_by_se(., "match") %.>%
    summarize_se(., "count" := "n()")
}

countMatches(datasets::iris, "Species", "setosa")
## # A tibble: 2 x 2
##   match count
##   <lgl> <int>
## 1 FALSE   100
## 2  TRUE    50

The purpose of the seplyr string system is to pull off quotes and de-reference indirect variables. So, you need to remember to add enough extra quotation marks to prevent this where you do not want it.
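The "one layer of quoting" rule can be illustrated with the base-R parser alone. The helper one_layer() below is a hypothetical name used only for this sketch; it mimics the quote-stripping step and is not a seplyr function.

```r
# Hypothetical helper: strip one quoting layer by parsing the string as R code.
one_layer <- function(s) parse(text = s, keep.source = FALSE)[[1]]

# No inner quotes: what remains is a name (for example, a column name).
as_name <- one_layer("Species")
# Inner quotes added, as in countMatches(): what remains is a string value.
as_value <- one_layer(paste0('"', "setosa", '"'))

is.name(as_name)        # TRUE
is.character(as_value)  # TRUE
```

So a bare string ends up as a name, and a string carrying its own inner quotes ends up as a literal value.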
