Using replyr::let to Parameterize dplyr Expressions

December 6, 2016

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)


Imagine that in the course of your analysis, you regularly require summaries of numerical values. For some applications you want the mean of that quantity, plus/minus a standard deviation; for other applications you want the median, and perhaps an interval around the median based on the interquartile range (IQR). In either case, you may want the summary broken down with respect to groupings in the data. In other words, you want a table of values, something like this:

dist_intervals(iris, "Sepal.Length", "Species")

# A tibble: 3 × 7
     Species  sdlower  mean  sdupper iqrlower median iqrupper
      <fctr>    <dbl> <dbl>    <dbl>    <dbl>  <dbl>    <dbl>
1     setosa 4.653510 5.006 5.358490   4.8000    5.0   5.2000
2 versicolor 5.419829 5.936 6.452171   5.5500    5.9   6.2500
3  virginica 5.952120 6.588 7.223880   6.1625    6.5   6.8375

For a specific data frame, with known column names, such a table is easy to construct using dplyr::group_by and dplyr::summarize. But what if you want a function to calculate this table on an arbitrary data frame, with arbitrary quantity and grouping columns? To write such a function in dplyr can get quite hairy, quite quickly. Try it yourself, and see.
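For concreteness, here is a sketch of the hard-coded version for iris, with the column names Sepal.Length and Species written directly into the pipeline. The result name known_cols is only illustrative.

```r
library(dplyr)

# Column names are fixed in the source: fine for one data frame,
# but not reusable for other columns or groupings.
known_cols <- iris %>%
  group_by(Species) %>%
  summarize(sdlower  = mean(Sepal.Length) - sd(Sepal.Length),
            mean     = mean(Sepal.Length),
            sdupper  = mean(Sepal.Length) + sd(Sepal.Length),
            iqrlower = median(Sepal.Length) - 0.5 * IQR(Sepal.Length),
            median   = median(Sepal.Length),
            iqrupper = median(Sepal.Length) + 0.5 * IQR(Sepal.Length))

print(known_cols)
```

Switching to a different quantity or grouping column here means editing every line of the summarize call, which is the problem a parameterized function has to solve.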

Enter let, from our new package replyr.

replyr::let implements a mapping from the “symbolic” names used in a dplyr expression to the names of the actual columns in a data frame. This allows you to encapsulate complex dplyr expressions without the use of the lazyeval package, which is the currently recommended way to manage dplyr's use of non-standard evaluation. Thus, you could write the function to create the table above as:

# to install replyr: 
# devtools::install_github('WinVector/replyr')

library(dplyr)
library(replyr)  

#
# calculate mean +/- sd intervals and
#           median +/- 1/2 IQR intervals
# for arbitrary data frame column, with optional grouping
#
dist_intervals = function(dframe, colname, groupcolname=NULL) {
  mapping = list(col=colname)
  if(!is.null(groupcolname)) {
    dframe %>% group_by_(groupcolname) -> dframe
  }
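  # let() substitutes the column name for the symbol `col` and, as used
  # here, returns a function; the trailing () runs it immediately.
  let(alias=mapping,
      expr={
        dframe %>% summarize(sdlower = mean(col)-sd(col),
                             mean = mean(col),
                             sdupper = mean(col) + sd(col),
                             iqrlower = median(col)-0.5*IQR(col),
                             median = median(col),
                             iqrupper = median(col)+0.5*IQR(col))
      })()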
}

The mapping is specified as a list of assignments symname=colname, where symname is the name used in the dplyr expression, and colname is the name (as a string) of the corresponding column in the data frame. We can now call our dist_intervals on the iris dataset:

dist_intervals(iris, "Sepal.Length")

   sdlower     mean  sdupper iqrlower median iqrupper
1 5.015267 5.843333 6.671399     5.15    5.8     6.45

dist_intervals(iris, "Sepal.Length", "Species")
# A tibble: 3 × 7
     Species  sdlower  mean  sdupper iqrlower median iqrupper
      <fctr>    <dbl> <dbl>    <dbl>    <dbl>  <dbl>    <dbl>
1     setosa 4.653510 5.006 5.358490   4.8000    5.0   5.2000
2 versicolor 5.419829 5.936 6.452171   5.5500    5.9   6.2500
3  virginica 5.952120 6.588 7.223880   6.1625    6.5   6.8375

dist_intervals(iris, "Petal.Length", "Species")
# A tibble: 3 × 7
     Species  sdlower  mean  sdupper iqrlower median iqrupper
      <fctr>    <dbl> <dbl>    <dbl>    <dbl>  <dbl>    <dbl>
1     setosa 1.288336 1.462 1.635664   1.4125   1.50   1.5875
2 versicolor 3.790089 4.260 4.729911   4.0500   4.35   4.6500
3  virginica 5.000105 5.552 6.103895   5.1625   5.55   5.9375
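
As a sanity check on the ungrouped call, the same six quantities can be recomputed with base R alone. This is only a cross-check sketch; the names x and check are illustrative and not part of replyr.

```r
# Base R only: recompute the ungrouped Sepal.Length summary by hand.
x <- iris$Sepal.Length

check <- c(sdlower  = mean(x) - sd(x),
           mean     = mean(x),
           sdupper  = mean(x) + sd(x),
           iqrlower = median(x) - 0.5 * IQR(x),
           median   = median(x),
           iqrupper = median(x) + 0.5 * IQR(x))

print(round(check, 6))
```

The printed values should agree with the output of dist_intervals(iris, "Sepal.Length") shown earlier.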

The implementation of let is adapted from gtools::strmacro by Gregory R. Warnes. Its primary purpose is wrapping dplyr, but you can use it to parameterize other functions that take their arguments via non-standard evaluation, like ggplot2 functions. In other words, you can use replyr::let instead of ggplot2::aes_string, if you are feeling perverse. Because let creates a macro, you have to avoid variable collisions (for example, remapping x in ggplot2 will clobber both sides of aes(x=x)), and you should remember that any side effects of the expression will escape let's execution environment.
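
To make the macro idea and the collision caveat concrete, here is a toy sketch in base R. It is not replyr's implementation: toy_rewrite and toy_let are hypothetical helpers that rewrite the text of an expression and then evaluate it, which is only the general idea behind a string macro.

```r
# Hypothetical helpers for illustration only; not part of replyr or gtools.

# Replace each symbolic name in the expression text with its mapped name.
toy_rewrite <- function(alias, expr_text) {
  for (symname in names(alias)) {
    # \\b limits the match to whole words, so "col" does not hit "colname"
    pattern <- paste0("\\b", symname, "\\b")
    expr_text <- gsub(pattern, alias[[symname]], expr_text, perl = TRUE)
  }
  expr_text
}

# Rewrite the text, then parse and evaluate it in the caller's environment.
toy_let <- function(alias, expr_text, envir = parent.frame()) {
  eval(parse(text = toy_rewrite(alias, expr_text)), envir = envir)
}

d <- data.frame(Sepal.Length = c(1, 2, 3))
print(toy_let(list(col = "Sepal.Length"), "mean(d$col)"))

# The collision: every occurrence of x is replaced, including the
# argument name on the left-hand side.
print(toy_rewrite(list(x = "Sepal.Length"), "aes(x = x)"))
```

The second call shows why the symbolic names in the mapping should be chosen so that they do not coincide with argument names or other identifiers in the expression.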

The replyr package is available on GitHub. Its goal is to supply uniform dplyr-based methods for manipulating data frames and tbls both locally and on remote (dplyr-supported) back ends. This is a new package, and it is still going through growing pains as we figure out the best ways to implement desired functionality. We welcome suggestions for new functions, and more efficient or more general ways to implement the functionality that we supply.
