[This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers].

This blog post is an update to an older one I wrote in March. In the post from March, `dplyr` was at version 0.5.0, but since then a major update introduced changes that make some of the tips in that post obsolete. So here I revisit the blog post from March using `dplyr` 0.7.0.

Create new columns with `mutate()` and `case_when()`

The basics, such as selecting columns, renaming them, filtering, etc., did not change with this new version. What did change, however, is how new columns are created with `case_when()`. First, load `dplyr` and the `mtcars` dataset:

```library("dplyr")
data(mtcars)```

This was how it was done in version 0.5.0 (notice the `.$` before the variable `carb`):

```mtcars %>%
mutate(carb_new = case_when(.$carb == 1 ~ "one",
.$carb == 2 ~ "two",
.$carb == 4 ~ "four",
TRUE ~ "other")) %>%
head(5)
##    mpg cyl disp  hp drat    wt  qsec vs am gear carb carb_new
## 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4     four
## 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4     four
## 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1      one
## 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1      one
## 5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2      two```

This has been simplified to:

```mtcars %>%
mutate(carb_new = case_when(carb == 1 ~ "one",
carb == 2 ~ "two",
carb == 4 ~ "four",
TRUE ~ "other")) %>%
head(5)
##    mpg cyl disp  hp drat    wt  qsec vs am gear carb carb_new
## 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4     four
## 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4     four
## 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1      one
## 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1      one
## 5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2      two```

No need for `.$` anymore.
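To see how the conditions are matched, here is a minimal sketch on a plain vector rather than on `mtcars` (the vector `carb_values` is made up for illustration). As far as I can tell, each element gets the label of the first condition that is true for it, and `TRUE ~ "other"` catches whatever is left:

```r
library(dplyr)

# Hypothetical input vector, standing in for the carb column
carb_values <- c(1, 2, 3, 4, 8)

# Conditions are tried in order; TRUE acts as the catch-all
carb_labels <- case_when(carb_values == 1 ~ "one",
                         carb_values == 2 ~ "two",
                         carb_values == 4 ~ "four",
                         TRUE ~ "other")

carb_labels
## [1] "one"   "two"   "other" "four"  "other"
```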

Apply a function to certain columns only, by rows, with `purrrlyr`

`dplyr` wasn’t the only package to get an overhaul; `purrr` got the same treatment.

In the past, I applied a function to certain columns like this:

```mtcars %>%
select(am, gear, carb) %>%
purrr::by_row(sum, .collate = "cols", .to = "sum_am_gear_carb") -> mtcars2```

Now, `by_row()` does not exist in `purrr` anymore. Instead, a new package called `purrrlyr` was introduced, with functions that don’t really fit inside either `purrr` or `dplyr`:

```mtcars %>%
select(am, gear, carb) %>%
purrrlyr::by_row(sum, .collate = "cols", .to = "sum_am_gear_carb") -> mtcars2

head(mtcars2)
## # A tibble: 6 x 4
##      am  gear  carb sum_am_gear_carb
##   <dbl> <dbl> <dbl>            <dbl>
## 1     1     4     4                9
## 2     1     4     4                9
## 3     1     4     1                6
## 4     0     3     1                4
## 5     0     3     2                5
## 6     0     3     1                4```

Think of `purrrlyr` as `purrr`’s and `dplyr`’s love child.
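If `purrrlyr` is not installed, the same column can be obtained for this particular case with plain `mutate()`. This is only a sketch of an alternative, not a replacement for `by_row()` in general: it assumes the row-wise function is a simple sum that can be written out by hand.

```r
library(dplyr)

# Row-wise sum written out explicitly instead of using by_row()
mtcars2 <- mtcars %>%
  select(am, gear, carb) %>%
  mutate(sum_am_gear_carb = am + gear + carb)

head(mtcars2)
```

The first six values of `sum_am_gear_carb` should match the `by_row()` output above (9, 9, 6, 4, 5, 4).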

Using `dplyr` functions inside your own functions, or what is `tidyeval`

Programming with `dplyr` has been simplified a lot. Before version 0.7.0, one needed to use `dplyr` in conjunction with `lazyeval` to use `dplyr` functions inside one’s own functions. It was not always very easy, especially if you mixed columns and values inside your functions. Here’s the example from the March blog post:

```extract_vars <- function(data, some_string){

data %>%
select_(lazyeval::interp(~contains(some_string))) -> data

return(data)
}

extract_vars(mtcars, "spam")```

More examples are available in this other blog post.

I will revisit them now with `dplyr`’s new `tidyeval` syntax. I’d recommend you read the Tidy evaluation vignette here. This vignette is part of the `rlang` package, which gets used under the hood by `dplyr` for all your programming needs. Here is the function I called `simpleFunction()`, written with the old `dplyr` syntax:

```simpleFunction <- function(dataset, col_name){
dataset %>%
group_by_(col_name) %>%
summarise(mean_mpg = mean(mpg)) -> dataset
return(dataset)
}

simpleFunction(mtcars, "cyl")
## # A tibble: 3 x 2
##     cyl mean_mpg
##   <dbl>    <dbl>
## 1     4 26.66364
## 2     6 19.74286
## 3     8 15.10000```

With the new syntax, it must be rewritten a little bit:

```simpleFunction <- function(dataset, col_name){
col_name <- enquo(col_name)
dataset %>%
group_by(!!col_name) %>%
summarise(mean_mpg = mean(mpg)) -> dataset
return(dataset)
}

simpleFunction(mtcars, cyl)
## # A tibble: 3 x 2
##     cyl mean_mpg
##   <dbl>    <dbl>
## 1     4 26.66364
## 2     6 19.74286
## 3     8 15.10000```

What has changed? Forget the underscore versions of the usual functions such as `select_()`, `group_by_()`, etc. Now, you must quote the column name using `enquo()` (or just `quo()` if working interactively, outside a function), which returns a quosure. This quosure can then be unquoted by putting `!!` in front of it inside the usual `dplyr` functions, which then work with the column it refers to.
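The interactive case mentioned in the previous paragraph can be sketched as follows. The variable name `group_var` is mine, chosen for illustration; the rest is the same `group_by()` and `summarise()` pipeline as in the function:

```r
library(dplyr)

# Outside a function, quo() captures the bare column name as a quosure
group_var <- quo(cyl)

# !! unquotes the quosure so that group_by() sees the column cyl
by_cyl <- mtcars %>%
  group_by(!!group_var) %>%
  summarise(mean_mpg = mean(mpg))

by_cyl
```

This should give the same three rows as `simpleFunction(mtcars, cyl)`.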

Let’s look at another example:

```simpleFunction <- function(dataset, col_name, value){
filter_criteria <- lazyeval::interp(~y == x, .values=list(y = as.name(col_name), x = value))
dataset %>%
filter_(filter_criteria) %>%
summarise(mean_cyl = mean(cyl)) -> dataset
return(dataset)
}

simpleFunction(mtcars, "am", 1)
##   mean_cyl
## 1 5.076923```

As you can see, it’s a bit more complicated, as you needed to use `lazyeval::interp()` to make it work. With the improved `dplyr`, here’s how it’s done:

```simpleFunction <- function(dataset, col_name, value){
col_name <- enquo(col_name)
dataset %>%
filter((!!col_name) == value) %>%
summarise(mean_cyl = mean(cyl)) -> dataset
return(dataset)
}

simpleFunction(mtcars, am, 1)
##   mean_cyl
## 1 5.076923```

Much, much easier! There is something that you must pay attention to though. Notice that I’ve written:

`filter((!!col_name) == value)`

and not:

`filter(!!col_name == value)`

I have enclosed `!!col_name` inside parentheses. I struggled with this, but thanks to help from @dmi3k and @_lionelhenry I was able to understand what was happening (isn’t the #rstats community on twitter great?).
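My understanding of the problem is that it comes down to operator precedence: in R, `!` binds less tightly than `==`, so without the parentheses the comparison is built first and the two `!` are applied to its result. The sketch below only uses base R's `quote()` to inspect how each expression is parsed; nothing is evaluated, so `col_name` and `value` do not need to exist:

```r
# quote() captures the expressions without evaluating them
without_parens <- quote(!!col_name == value)
with_parens <- quote((!!col_name) == value)

# Without parentheses, the outermost call is `!` ...
as.character(without_parens[[1]])
## [1] "!"

# ... and the whole comparison sits underneath both `!`
deparse(without_parens[[2]][[2]])
## [1] "col_name == value"

# With parentheses, the outermost call is `==`, as intended
as.character(with_parens[[1]])
## [1] "=="
```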

One last thing: let’s make this function a bit more general. I hard-coded the variable `cyl` inside the body of the function, but maybe you’d like the mean of another variable? Easy:

```simpleFunction <- function(dataset, group_col, mean_col, value){
group_col <- enquo(group_col)
mean_col <- enquo(mean_col)
dataset %>%
filter((!!group_col) == value) %>%
summarise(mean((!!mean_col))) -> dataset
return(dataset)
}

simpleFunction(mtcars, am, cyl, 1)
##   mean((cyl))
## 1    5.076923```

«That’s very nice Bruno, but `mean((cyl))` in the output looks ugly as sin» you might think, and you’d be right. It is possible to set the name of the column in the output using `:=` instead of `=`:

```simpleFunction <- function(dataset, group_col, mean_col, value){
group_col <- enquo(group_col)
mean_col <- enquo(mean_col)
mean_name <- paste0("mean_", mean_col)[2]
dataset %>%
filter((!!group_col) == value) %>%
summarise(!!mean_name := mean((!!mean_col))) -> dataset
return(dataset)
}

simpleFunction(mtcars, am, cyl, 1)
##   mean_cyl
## 1 5.076923```

To get the name of the column I added this line:

`mean_name <- paste0("mean_", mean_col)[2]`

To see what it does, try the following inside an R interpreter (remember to use `quo()` instead of `enquo()` outside functions!):

```paste0("mean_", quo(cyl))
## [1] "mean_~"   "mean_cyl"```

`enquo()` quotes the input, and with `paste0()` it gets converted to a string that can be used as a column name. However, the `~` is in the way and the output of `paste0()` is a vector of two strings: the correct name is contained in the second element, hence the `[2]`. There might be a more elegant way of doing that, but for now this has been working well for me.
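One possibly more elegant way is `quo_name()`, which turns a quosure into a string directly. The sketch below is a variant of the function above under two assumptions: that `quo_name()` (from `rlang`) is re-exported by your installed `dplyr`, and that the quosure holds a simple column name.

```r
library(dplyr)

simpleFunction <- function(dataset, group_col, mean_col, value){
  group_col <- enquo(group_col)
  mean_col <- enquo(mean_col)
  # quo_name() returns the quoted column name as a string, without the "~"
  mean_name <- paste0("mean_", quo_name(mean_col))
  dataset %>%
    filter((!!group_col) == value) %>%
    summarise(!!mean_name := mean((!!mean_col)))
}

result <- simpleFunction(mtcars, am, cyl, 1)
result
```

The result should again be a single column called `mean_cyl`, with no need for the `[2]` indexing.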

That was it folks! I do recommend you read the Programming with dplyr vignette here as well as other blog posts, such as the one recommended to me by @dmi3k here.

Have fun with `dplyr` 0.7.0!
