Parametric variable names and dplyr

December 3, 2016
By

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

When writing reusable code or packages you often do not know the names of the columns or variables you need to work over. This is what I call “parametric treatment of variables.” This can be a problem when using R libraries that assume you know the variable names. The R data manipulation library dplyr currently supports parametric treatment of variables through “underbar forms” (methods of the form dplyr::*_), but their use can get tricky.


NewImage
Rube Goldberg machine 1931 (public domain).

Better support for parametric treatment of variable names would be a boon to dplyr users. To this end the replyr package now has a method designed to re-map parametric variable names to known concrete variable names. This allows concrete dplyr code to be used as if it was parametric.

The Problem

dplyr is a library that prefers you know the name of the column you want to work with. This is great when performing a specific analysis, but somewhat painful when supplying re-usable functions or packages. dplyr has a complete parametric interface with the “underbar forms” (for example: using dplyr::filter_ instead of dplyr::filter). However, the underbar notation (and the related necessary details around specifying lazy evaluation of formulas) rapidly becomes difficult.

As an attempted work-around replyr now supplies an adapter that applies a mapping from column names you have (which can be supplied parametrically) to concrete column names you wish you had (which would allow you to write dplyr code simply in terms of known or assumed column names).

It is easier to show than explain.

An Example

First we set up our libraries and type in some notional data as our example:

# install.packages('devtools')                 # Run this if you don't already have devtools
# devtools::install_github('WinVector/replyr') # Run this if you don't already have replyr
library('dplyr')
library('replyr')

d <- data.frame(Sepal_Length=c(5.8,5.7),Sepal_Width=c(4.0,4.4),
                Species='setosa',rank=c(1,2))
print(d)

 #   Sepal_Length Sepal_Width Species rank
 # 1          5.8         4.0  setosa    1
 # 2          5.7         4.4  setosa    2

Then we rename the columns to standard values while restricting to only the named columns (this is the magic step):

nmap <- c('GroupColumn','ValueColumn','RankColumn')
names(nmap) <-  c('Species','Sepal_Length','rank')
print(nmap)

 #       Species  Sepal_Length          rank 
 # "GroupColumn" "ValueColumn"  "RankColumn" 

dtmp <- replyr_renameRestrictCols(d,nmap)
print(dtmp)

 #   GroupColumn ValueColumn RankColumn
 # 1      setosa         5.8          1
 # 2      setosa         5.7          2

At this point you do know the column names (they are the ones you picked) and can write nice neat dplyr. You can then do your work:

# pretend this block is a huge sequence of complicated and expensive operations.
dtmp %>% mutate(RankColumn=RankColumn-1) -> dtmp # start ranks at zero

Notice we were able to use dplyr::mutate without needing to use dplyr::mutate_ (and without needing to go to Stack Overflow to lookup the lazy-eval notation yet again).

Once you have your desired result you restore the original names of our restricted column set:

invmap <- names(nmap)
names(invmap) <- as.character(nmap)
print(invmap)

 #    GroupColumn    ValueColumn     RankColumn 
 #      "Species" "Sepal_Length"         "rank" 

replyr_renameRestrictCols(dtmp,invmap)

 #   Species Sepal_Length rank
 # 1  setosa          5.8    0
 # 2  setosa          5.7    1

If you haven’t worked a lot with dplyr this won’t look that interesting. If you do work a lot with dplyr you may have been asking for something like this for quite a while. If you use dplyr::*_ you will love replyr::replyr_renameRestrictCols. Be aware: replyr::replyr_renameRestrictCols is a bit of a hack; it mutates all of the columns it is working with, which is unlikely to be a cheap operation.

A Proposal

I feel the replyr::replyr_renameRestrictCols interface represents the correct design for a better dplyr based adapter.

I’ll call this the “column view stack proposal.” I would suggest the addition of two operators to dplyr:

  1. view_as(df,columnNameMap) takes a data item and returns a data item reference that behaves as if the column names have been re-mapped.
  2. unview() removes the view_as annotation.

Obviously there is an issue of nested views, I would suggest maintaining the views as a stack while using the composite transformation implied by the stack of mapping specifications. I am assuming dplyr does not currently have such a facility. Another possibility is a term-rewriting engine to re-map formulas from standard names to target names, but this is what the lazy-eval notations are already attempting (and frankly it isn’t convenient or pretty).

I would also suggest that dplyr::arrange be enhanced to have a visible annotation (just the column names it has arranged by) that allows the user to check if the data is believed to be ordered (crucial for window-function applications). With these two suggestions dplyr data sources would support three primary annotations:

  1. Groups: placed by dplyr::group_by, removed by dplyr::ungroup, and viewed by dplyr::groups.
  2. Orders: placed by dplyr::arrange, removed by Xdplyr::unarrange (just removes annotation, does not undo arrangement; annotation also removed by any operation that re-orders the data, such as join), and viewed by Xdplyr::arrangement.
  3. Column Views: placed by Xdplyr::view_as, removed by Xdplyr::unview, and viewed by Xdplyr::views.

The “Xdplyr::” items are the extensions that are being proposed.

An Alternative Proposal

Another possibility would be some sort of “let” statement that controls name bindings for the duration of a block of code. Using such a statement we would write our above example calculation as:

let(
 list(RankColumn='rank',GroupColumn='Species'),
 {
  # pretend this block is a huge sequence of complicated and expensive operations.
  d %>% mutate(RankColumn=RankColumn-1) -> dtmp # start ranks at zero
 }
)

The idea is the items 'rank' and 'Species' could be passed in parametrically (notice the let specification is essentially invmap, so we could just pass that in). This isn’t quite R‘s “with” statement as we are not binding names to values, but names to names. Essentially we are asking for macro facility that is compatible with dplyr remote data sources (and the non-standard evaluation methods used to capture variables names). I haven’t experimented with gtools yet, so I don’t know if this is feasible (some R macro examples can be found here).

If somebody has this working I would love to hear about it. The reason I worry is the following. Clever meta-programming depends on all of the rest of the code not being clever and not breaking referential transparency. And in this case dplyr is using non-standard evaluation to capture variable names, which is typically not referentially transparent. Also R functions are roughly Lisp fexprs, which signals to me R‘s design may not have a fully clean distinctions between functions and macros.

Conclusion

I feel the above methods will make working with parameterized variables in dplyr much easier.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)