Practical Tidy Evaluation

[This article was first published on jessecambon-R, and kindly contributed to R-bloggers].

Tidy evaluation is a framework for controlling how expressions and
variables in your code are evaluated by
tidyverse functions. This framework,
housed in the rlang package, is a powerful
tool for writing more efficient and elegant code. In particular, you’ll
find it useful for passing variable names as inputs to functions that
use tidyverse packages like dplyr and
ggplot2.

The goal of this post is to offer accessible examples and intuition for
putting tidy evaluation to work in your own code. Because of this I will
keep conceptual explanations brief, but for more comprehensive
documentation you can refer to dplyr’s website, rlang’s website, the
‘Tidy Evaluation’ book by Lionel Henry and Hadley Wickham, and the
Metaprogramming section of the ‘Advanced R’ book by Hadley Wickham.

Motivating Example

To begin, let’s consider a simple example of calculating summary
statistics with the mtcars dataset. Below we calculate the minimum and
maximum horsepower (hp) by the number of cylinders (cyl) using the
group_by and summarize functions from dplyr.

library(dplyr)
hp_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarize(min_hp = min(hp),
            max_hp = max(hp))
cyl min_hp max_hp
4 52 113
6 105 175
8 150 335

Now let’s say we wanted to repeat this calculation multiple times while
changing which variable we group by. A brute force method to accomplish
this would be to copy and paste our code as many times as necessary and
modify the group by variable in each iteration. However, this is
inefficient, especially if our code gets more complicated, requires many
iterations, or requires further development.

To avoid this inelegant solution you might think to store the name of a
variable inside of another variable, like this: groupby_var <- "vs".
Then you could attempt to use your newly created “groupby_var” variable
in your code: group_by(groupby_var). However, if you try this you will
find it doesn’t work. The “group_by” function expects the name of the
variable you want to group by as an input, not the name of a variable
that contains the name of the variable you want to group by.
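To make the problem concrete, here is a minimal sketch of the naive attempt. The names `attempt` and `grouped_by_vs` are my own and are not part of the original example. Depending on your dplyr version, the call may raise an error or produce a grouping column that is not `vs`; either way, the check at the end comes out `FALSE`.

```r
library(dplyr)

# Store the column name as a plain string (the naive approach from the text)
groupby_var <- "vs"

# tryCatch() is used because some dplyr versions raise an error here;
# in that case `attempt` is NULL
attempt <- tryCatch(
  mtcars %>%
    group_by(groupby_var) %>%
    summarize(max_hp = max(hp)),
  error = function(e) NULL
)

# TRUE only if the result really contains a vs grouping column
grouped_by_vs <- !is.null(attempt) && "vs" %in% names(attempt)
grouped_by_vs
```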

This is the kind of headache that tidy evaluation can help you solve. In
the example below we use the quo function and the “bang-bang” !!
operator to set “vs” (engine shape, 0 = V-shaped, 1 = straight) as our
group by variable. The “quo” function allows us to store the variable
name in our “groupby_var” variable and “!!” extracts the stored
variable name.

groupby_var <- quo(vs)

hp_by_vs <- mtcars %>%
  group_by(!!groupby_var) %>%
  summarize(min_hp = min(hp),
            max_hp = max(hp))
vs min_hp max_hp
0 91 335
1 52 123

The code above provides a method for setting the group by variable by
modifying the input to the “quo” function when we define “groupby_var”.
This can be useful, particularly if we intend to reference the group by
variable multiple times. However, if we want to use code like this
repeatedly in a script then we should consider packaging it into a
function. This is what we will do next.
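Before moving on, here is a small sketch of the point about referencing the stored variable more than once. The `gear` column and the `cars_per_gear` name are my own choices for illustration; the same quosure is unquoted in both `group_by` and `arrange`.

```r
library(dplyr)

# Capture the column name once ...
groupby_var <- quo(gear)

# ... and unquote it with !! wherever it is needed
cars_per_gear <- mtcars %>%
  group_by(!!groupby_var) %>%
  summarize(n_cars = n()) %>%
  arrange(desc(!!groupby_var))

cars_per_gear
```

Changing the single `quo(gear)` line is enough to switch both the grouping and the sort order to another column.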

Making Functions with Tidy Evaluation

To use tidy evaluation in a function, we will still use the “!!”
operator as we did above, but instead of “quo” we will use the
enquo function. Our
new function below takes the group by variable and the measurement
variable as inputs so that we can now calculate maximum and minimum
values of any variable we want. Also note two new features I have
introduced in this function:

  • The as_label function extracts the string value of the
    “measure_var” variable (“hp” in this case). We use this to set the
    value of the “measure_var” column.
  • The “walrus operator” := is used to create a column named after the
    variable name stored in the “measure_var” argument (“hp” in the
    example). The walrus operator allows you to use strings and
    evaluated variables (such as “measure_var” in our example) on the
    left hand side of an assignment operation (where there would
    normally be a “=” operator) in functions such as “mutate” and
    “summarize”.
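These two features can also be tried in isolation. The sketch below uses a hypothetical helper, `add_doubled`, that is not part of the original post. It builds a column name from the captured variable with `as_label` and assigns to it with `:=`. The `rlang::` prefix is there only so the snippet does not depend on which helpers dplyr re-exports.

```r
library(dplyr)

# Hypothetical helper: adds a column named "<var>_doubled"
add_doubled <- function(data, var) {
  var <- enquo(var)
  # as_label() turns the captured variable into a string such as "hp"
  new_name <- paste0(rlang::as_label(var), "_doubled")
  # := lets the unquoted string serve as the new column's name
  data %>% mutate(!!new_name := 2 * !!var)
}

doubled <- add_doubled(head(mtcars, 3), hp)
doubled[, c("hp", "hp_doubled")]
```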

Below we define our function and use it to group by “am” (transmission
type, 0 = automatic, 1 = manual) and calculate summary statistics with
the “hp” (horsepower) variable.

car_stats <- function(groupby_var, measure_var) {
  groupby_var <- enquo(groupby_var)
  measure_var <- enquo(measure_var)
  return(mtcars %>%
    group_by(!!groupby_var) %>%
    summarize(min = min(!!measure_var),
              max = max(!!measure_var)) %>%
    mutate(measure_var = as_label(measure_var),
           !!measure_var := NA)
  )
}
hp_by_am <- car_stats(am, hp)
am min max measure_var hp
0 62 245 hp NA
1 52 335 hp NA

We now have a flexible function that contains a dplyr workflow. You can
experiment with modifying this function for your own purposes.
Additionally, as you might suspect, you could use the same tidy
evaluation functions we just used with tidyverse packages other than
dplyr.

As an example, below I’ve defined a function that builds a scatter plot
with ggplot2. The function takes a
dataset and two variable names as inputs. You will notice that the
dataset argument “df” needs no tidy evaluation. The
as_label function is
used to extract our variable names as strings to create a plot title
with the “ggtitle” function.

library(ggplot2)
library(stringr)  # provides str_c(), used to build the plot title
scatter_plot <- function(df, x_var, y_var) {
  x_var <- enquo(x_var)
  y_var <- enquo(y_var)

  return(ggplot(data = df, aes(x = !!x_var, y = !!y_var)) +
    geom_point() + theme_bw() +
    theme(plot.title = element_text(lineheight = 1, face = "bold", hjust = 0.5)) +
    geom_smooth() +
    ggtitle(str_c(as_label(y_var), " vs. ", as_label(x_var)))
  )
}
scatter_plot(mtcars, disp, hp)

As you can see, we’ve plotted the “hp” (horsepower) variable against
“disp” (displacement) and added a smoothed trend line (by default
“geom_smooth” fits a loess curve for a dataset this small rather than a
straight regression line). Now, instead of copying and pasting ggplot
code to create the same plot with different datasets and variables, we
can just call our function.
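The title logic can be checked without drawing anything. The helper below, `plot_title`, is a hypothetical stand-in that mirrors only the `ggtitle` step of the function above. It uses base R's `paste0` instead of `str_c` so that rlang is the only package needed.

```r
library(rlang)

# Hypothetical helper mirroring the title logic in scatter_plot()
plot_title <- function(x_var, y_var) {
  x_var <- enquo(x_var)
  y_var <- enquo(y_var)
  # The arguments are never evaluated; only their names are turned into text
  paste0(as_label(y_var), " vs. ", as_label(x_var))
}

plot_title(disp, hp)
```

Because the arguments are captured rather than evaluated, no dataset is needed for this call.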

The “Curly-Curly” Shortcut and Passing Multiple Variables

To wrap things up, I’ll cover a few additional tricks and shortcuts for
your tidy evaluation toolbox.

  • The “curly-curly” {{ }} operator directly extracts a stored variable
    name from “measure_var” in the example below. In the prior example
    we needed both “enquo” and “!!” to evaluate a variable like this, so
    the “curly-curly” operator is a convenient shortcut. However, note
    that if you want to extract the string variable name with the
    “as_label” function, you will still need “enquo”, as we have done
    below with “measure_name”.
  • The syms function and the “!!!” operator are used for passing a
    list of variables as a function argument. In prior examples “!!”
    was used to evaluate a single group by variable; we now use “!!!”
    to evaluate a list of group by variables. One quirk is that to use
    the “syms” function we will need to pass the variable names in
    quotes.
  • The walrus operator “:=” is again used to create new columns, but
    now the column names are defined with a combination of a variable
    name stored in a function argument and another string (“_min” and
    “_max” below). We use the “enquo” and “as_label” functions to
    extract the string variable name from “measure_var” and store it in
    “measure_name”, then use the “str_c” function from stringr to
    combine strings, unquoting the result with “!!” on the left hand
    side of “:=”. You can use similar code to build your own column
    names from variable name inputs and strings.
Our new function is defined below and is first called to group by the
“cyl” variable (summarizing “mpg”) and then called to group by the “am”
and “vs” variables (summarizing “gear”). Note that the “!!!” operator
and “syms” function can be used with either a character vector of
several names or a single string.

library(stringr)  # provides str_c(); harmless if already attached
get_stats <- function(data, groupby_vars, measure_var) {
  groupby_vars <- syms(groupby_vars)
  measure_name <- as_label(enquo(measure_var))
  return(
    data %>% group_by(!!!groupby_vars) %>%
      summarize(!!str_c(measure_name, "_min") := min({{ measure_var }}),
                !!str_c(measure_name, "_max") := max({{ measure_var }}))
  )
}
# Renamed from cyl_hp_stats: this call summarizes mpg, not hp
cyl_mpg_stats <- mtcars %>% get_stats("cyl", mpg)
gear_stats <- mtcars %>% get_stats(c("am", "vs"), gear)
cyl mpg_min mpg_max
4 21.4 33.9
6 17.8 21.4
8 10.4 19.2

am vs gear_min gear_max
0 0 3 3
0 1 3 4
1 0 4 5
1 1 4 5
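The “syms” and “!!!” step can also be tried on its own, outside a function. In this sketch the names `group_names`, `group_syms` and `cars_per_group` are mine. The `rlang::` prefix is used so the snippet does not rely on dplyr re-exporting `syms`.

```r
library(dplyr)

# Column names arrive as strings, e.g. from a config file or user input
group_names <- c("am", "vs")

# syms() turns each string into a symbol; !!! splices the whole list
# into group_by() as if we had typed group_by(am, vs)
group_syms <- rlang::syms(group_names)

cars_per_group <- mtcars %>%
  group_by(!!!group_syms) %>%
  summarize(n_cars = n()) %>%
  ungroup()

cars_per_group
```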

This concludes my introduction to tidy evaluation. Hopefully this serves
as a useful starting point for using these concepts in your own code.
