Simpler R coding with pipes > the present and future of the magrittr package

Posted on August 5, 2014 by Tal Galili in R bloggers | 0 Comments

[This article was first published on R-statistics blog » R, and kindly contributed to R-bloggers.]

This is a guest post by Stefan Milton, the author of the magrittr package which introduces the %>% operator to R programming.

Preface (by Tal Galili)

I was first introduced to the %>% (a.k.a. pipe) operator in R thanks to Hadley Wickham’s (fascinating) dplyr tutorial (link to the workshop’s material) at useR!2014. After several discussions during the conference (including one very influential conversation with RStudio’s Joe Cheng), I became convinced that the pipe operator is one of the most important innovations (if not THE most important) introduced to the R ecosystem this year.

Soon after, I contacted Stefan Milton (the author of the magrittr package), asking him to write about his implementation of the pipe operator. Stefan generously agreed, and what follows is what he had to share with the rest of us.

magrittr: The Difficult Crossing – by Stefan Milton Bache

fig1

Background

It has only been 7 months and a bit since my initial magrittr commit to GitHub on January 1st. It has had more success than I had anticipated, and it appears that I was not quite alone with the frustration which caused me to start the magrittr project. I am not easily frustrated with R, but after a few weeks working with F# at work, I felt it upon returning to R: I had gotten used to writing code in a different way, all nicely aligned with thought and order of execution. The forward pipe operator |> was so addictive that being unable to do something similar in R was more than mildly irritating. Reversing thought, deciphering nested function calls, and making excessive use of temporary variables almost became deal breakers! Surprisingly, I had never really noticed this before, but once I did, my return to R became a difficult crossing.

An amazing thing about R is that it is a very flexible language and the problem could be solved. The |> operator in F# is indeed very simple: it is defined as let (|>) x f = f x. However, the usefulness of this simplicity relies heavily on a concept that is not available in R: partial application. Furthermore, functions in F# almost always adhere to certain design principles which make the simple definition sufficient. Suppose that f is a function of two arguments, then in F# you may apply f to only the first argument and obtain a new function as the result — a function of the second argument alone. This is partial application, and works with any number of arguments, but application is always from left to right in the argument list. This is why the most important argument (and the one most likely to be a left-hand side object in the pipeline) is almost always the last argument, which in turn makes the simple definition of |> work. To illustrate, consider the following example:

some_value |> some_function other_value

Here, some_function is partially applied to other_value, creating a new function of a single argument, and by the simple definition of |>, this is applied to some_value.
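
To make the idea concrete in R terms, here is a minimal sketch of how partial application and the simple F# definition could be imitated with closures. The helper `partial`, the operator `%|>%`, and the function `divide` are hypothetical names made up for this illustration; they are not part of magrittr or base R.

```r
# Hypothetical helper: fix the leading arguments of f and
# return a new function of the remaining arguments.
partial <- function(f, ...) {
  fixed <- list(...)
  function(...) do.call(f, c(fixed, list(...)))
}

# A bare-bones forward pipe mirroring F#'s `let (|>) x f = f x`.
`%|>%` <- function(x, f) f(x)

# As in F#, the piped value has to be the LAST argument for this to work.
divide <- function(divisor, value) value / divisor

# Reads like `10 |> divide 2` would in F#: divide 10 by 2.
halved <- 10 %|>% partial(divide, 2)
```

The sketch only works because `divide` takes the piped value last, which is exactly the convention that most R functions do not follow.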

It was clear to me that because R is lacking native partial application and conventions on argument order, no simple solution would be satisfactory, although definitely possible, see e.g. here or here. I wanted to make something that would feel natural in R, and which would serve the main purpose of improving cognitive performance of those writing the code, and of those reading the code.

It turned out that while I was working on magrittr’s %>% operator, Hadley Wickham and Romain Francois were implementing a similar %.% operator in their dplyr package, which they announced on January 17. However, it was not quite as flexible, and we thought that piping functionality was better placed in its own more light-weight package. Hadley joined the magrittr project, and in dplyr 0.2 the %.% operator was deprecated; instead, %>% was imported from magrittr.

The basics

Although quite a few blogs have nice introductions to the magrittr package (there is also a vignette), I’ll provide a brief recap here to add some context to the thoughts presented above. Consider the example below (no claim of any scientific relevance, but it was a nice opportunity to try Hadley’s babynames package):

library(babynames) # data package
library(dplyr)     # provides data manipulating functions.
library(magrittr)  # ceci n'est pas un pipe
library(ggplot2)   # for graphics

babynames %>% 
    filter(name %>% substr(1, 3) %>% equals("Ste")) %>% 
    group_by(year, sex) %>% 
    summarize(total = sum(n)) %>%
    qplot(year, total, color = sex, data = ., geom = "line") %>%
    add(ggtitle('Names starting with "Ste"')) %>% 
    print

fig2

First, note that even without knowing much about magrittr (or even R), reading this chunk of code is pretty easy, like a recipe, and not a single temporary variable is needed. It’s almost like

1. take the baby data, then 
2.   filter it such that the name sub-string from character 1 to 3 equals "Ste", then
3.   group it by year and sex, then
4.   summarize it by computing total sum for each group, then
5.   plot the results, coloring by sex, then
6.   add a title, then 
7.   print it to the canvas.

Maybe even easier?! The order in which you’d think of these steps is the same as the order in which they are written, and as the order in which they are executed. The alternative would be either to use a bunch of temporary variables or to write a nasty string of nested function calls starting with print at the very left, babynames somewhere in the middle, and the remaining arguments and values scattered around.
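
For comparison, the three styles can be put side by side on a much smaller task. This is only an illustrative sketch: it uses the built-in mtcars data rather than babynames, and the variable names are made up for the example.

```r
library(magrittr)

# Nested calls: must be read from the inside out.
nested <- round(mean(subset(mtcars, cyl == 4)$mpg), 1)

# Temporary variables: readable, but clutters the workspace.
four_cyl   <- subset(mtcars, cyl == 4)
avg_mpg    <- mean(four_cyl$mpg)
with_temps <- round(avg_mpg, 1)

# Pipe chain: written in the order it is thought of and executed.
piped <- mtcars %>%
  subset(cyl == 4) %>%
  extract2("mpg") %>%   # magrittr alias for `[[`
  mean %>%
  round(1)
```

All three compute the same number; only the reading order differs.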

The example illustrates a few features of %>%. Firstly, the dplyr functions filter, group_by, and summarize all take a data object as their first argument, and by default this is where %>% will place its left-hand side. The babynames data is thus inserted as the first argument in the call to filter. When the filtering is done, the result is passed as the first argument to group_by, and similarly for summarize. However, one is not always so fortunate that a function is designed to accept the data (or whatever you might be piping along) as its first argument (the dplyr functions are designed with %>% operations in mind). qplot is one such function, but note the data = . argument. This tells %>% to place the left-hand side there, and not as the first argument. This is a simple and natural way to accommodate the lack of consistency of function signatures, and allows the left-hand side to go anywhere in the call on the right-hand side. You may also have noted that print is used without parentheses; this is to make the code even cleaner when only the left-hand side is needed as input. Finally, note that %>% can be used in a nested fashion (a separate chain is found within the filter call) and that magrittr has aliases for commonly used operators, such as add for + and equals for == used above. These make pipe chains more readable (not necessarily shorter).
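
The features just described can be tried in isolation. The following is a small sketch using only magrittr and base R; the data and variable names are chosen for illustration.

```r
library(magrittr)

# Default: the left-hand side becomes the first argument.
a <- c(4, 9, 16) %>% sqrt()   # same as sqrt(c(4, 9, 16))

# A bare function name works when the left-hand side is the only input.
b <- c(4, 9, 16) %>% sqrt

# Placeholder: lm() takes a formula first, so the data goes to `data = .`.
fit <- mtcars %>% lm(mpg ~ wt, data = .)

# Aliases: equals() stands in for `==` inside a chain.
starts_with_ste <- c("Stefan", "Stella", "Tal") %>%
  substr(1, 3) %>%
  equals("Ste")
```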

Outlook

The two main places to obtain magrittr are CRAN (using install.packages) and GitHub (using devtools::install_github). As usual, the first is the stable version and the latter is the development version, and at the time of this writing the latter has quite a lot of features that have not yet made it to the CRAN version. Examples are the tee operator %T>%, which works like %>% but returns the left-hand side after applying the right-hand side; the %$% operator, which exposes the contents/variables of the left-hand side to the right-hand side expression (so one can omit the verbose dataset$ in front of each); and a compound assignment pipe operator %<>%, which pipes the left-hand side symbol as usual, but rather than returning the result of the entire chain, overwrites the original symbol (which could also be e.g. dataset$variable instead of a simple symbol).

One reason that these features have not yet appeared in the CRAN version (although really useful) is that we give a lot of thought to the more general philosophy, and to how all these pieces fit best together in a coherent framework. In particular, one interesting concept that I think is promising is that of functional sequences (à la magrittr). Currently each right-hand side is viewed in isolation, independent of the others in the chain. But since they are all tied together in a linear fashion (one input, one output), one can view everything in the chain, except for the first argument (the initial left-hand side), as a function of a single argument: a functional sequence constructed from a sequence of magrittr-like right-hand sides.

Furthermore, %>% currently serves the purpose of building values, but a functional sequence is an analogue for building functions, and ties the concepts together. In the development version there is a first attempt to implement this, but it should still be considered experimental.
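
The three operators described above can be sketched as follows. This assumes a magrittr version that exports %T>%, %$%, and %<>% (they were development-only when this post was written, but did reach CRAN in later releases); the data and variable names are illustrative.

```r
library(magrittr)

# Tee: run the right-hand side for its side effect (printing),
# then carry the ORIGINAL left-hand side on to the next step.
total <- c(1, 2, 3) %T>% print %>% sum

# Exposition: refer to columns directly instead of mtcars$mpg, mtcars$wt.
r <- mtcars %$% cor(mpg, wt)

# Compound assignment: pipe x through sqrt and overwrite x with the result.
x <- c(9, 16, 25)
x %<>% sqrt
```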

I’ll illustrate the concept by an example. Consider an auction where participants submit the quantities they are willing to buy (or sell) at different prices. Given all the submitted bids, our task is to aggregate the demand and supply curves and visualize the crossing at which supply and demand meet to determine the price which clears the market.

Let’s first generate some (unrealistically uniform) artificial data:

set.seed(1) # reproducibility

# Utility function for sampling.
sample_with_replace <-
    function(v, n = 100) sample(v, size = n, replace = TRUE)

# Generate some auction data for the example.
auction.data <- 
    data.frame(
        Price    = 1:100 %>% sample_with_replace,
        Quantity = 1:10  %>% sample_with_replace,
        Type     = 
            0:1 %>%
            sample_with_replace %>%
            factor(labels = c("Buy", "Sell"))
    ) %T>%
    (lambda(x ~ x %>% head %>% print))
##   Price Quantity Type
## 1    27        7  Buy
## 2    38        4  Buy
## 3    58        3 Sell
## 4    91       10  Buy
## 5    21        7  Buy
## 6    90        3 Sell

Notice the use of both the tee operator and the experimental lambda syntax, which are currently only available in the development version.

The task is split into two steps. First, we use the functional sequence operator %,% to construct a function which is able to aggregate a supply (or demand) curve for sellers (or buyers). Second, we use this function in a chain which takes the data all the way to a visual.

# Define a function that aggregates the bid data for a type.
# Note that the sorting direction depends on type. 
# For each price level find the total volume which will be sold/bought.
aggregate_bids <- 
    group_by(Type, Price) %,%
    summarize(Quantity = sum(Quantity)) %,%
    ungroup %,%
    arrange(Price*(1 - 2*(Type == "Buy"))) %,%
    mutate(Quantity = Quantity %>% cumsum)

# Group the data, aggregate the bids, and plot the supply and demand curves.
auction.data %>%
    group_by(Type) %>%
    do(aggregate_bids(.)) %>%
    qplot(Quantity, Price, col = Type, geom = "step", data = .) %>%
    print

fig3

Note how the aggregate_bids function is built in a way completely analogous to a usual %>% chain, except that the %,% is used to signal that the result is a functional sequence and not a value. Another option is to use %>% here too and have a designated first left-hand side, e.g. . (suggested by Romain Francois, R-enthusiast and R/C++ hero).
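
The alternative with a designated first left-hand side can be sketched as below. This relies on the `. %>%` form as it exists in CRAN releases of magrittr from version 1.5 onward, as far as I know; it does not use the experimental %,% operator, and the function name `rms` is made up for the example.

```r
library(magrittr)

# Starting the chain with the placeholder yields a function, not a value.
rms <- . %>% raise_to_power(2) %>% mean %>% sqrt

# The resulting function can be called like any other ...
value <- rms(c(3, 4))

# ... and gives the same result as the equivalent value-building chain.
same <- c(3, 4) %>% raise_to_power(2) %>% mean %>% sqrt
```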

The functional sequence view of the pipe chain also opens up possibilities for optimization. Currently the %>% pipe is built for robustness and user-friendliness, in a sense similar to generic functions. It will figure out how to proceed given the structure of the right-hand side, which of course has a small overhead. In most situations this is negligible; in others (such as the one described here) one can restructure one’s code so that it becomes negligible. But granted, one might encounter realistic examples where a little performance boost would be nice. It is quite possible that integrating functional sequences in a way where it only needs to be clever about each step once would lead to good results. This could be particularly useful in situations like

result <- 
    looong_vector %>%
    lapply(
        one_action %,%
        another_action(requiring_x) %,%
        (lambda(. ~ finalizing_actions))
    )
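
A runnable variant of this pattern, again assuming the `. %>%` form from later CRAN releases of magrittr rather than the experimental %,% and lambda syntax, might look like this. The function name `clean` and the input strings are hypothetical.

```r
library(magrittr)

# Build the per-element function once, outside the loop.
clean <- . %>% trimws %>% toupper

# Then apply it to every element of the input.
result <- c(" ste ", "fan") %>% lapply(clean)
```

Whether this actually saves any overhead compared to a full chain per element depends on the magrittr version; the sketch only shows the shape of the code.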

Ce n’est qu’un au revoir

I want to thank Tal Galili for inviting me to write a post about magrittr. It has been interesting to see how the package has caught on in the community, and this was a good opportunity to give a few thoughts on its background and some thoughts on its future. There is definitely an interesting road ahead for magrittr; how it will turn out only time will tell.

Below, I have compiled a few tweet “testimonials”. Happy piping!

fig4

Final note (by Tal Galili):

While this post was written, other R bloggers wrote their own posts on magrittr, here is what they had to say:
