# Using purrr’s map family functions in dplyr::mutate

[This article was first published on **R - yoshidk6’s blog**, and kindly contributed to R-bloggers].


The `map` family of functions in the `purrr` package is very useful for applying non-vectorized functions in a `dplyr::mutate` chain (see GitHub – jennybc/row-oriented-workflows: Row-oriented workflows in R with the tidyverse, or https://www.jimhester.com/2018/04/12/vectorize/).

I run into this need especially when dealing with nested data frames.

One drawback is that matching column names to input arguments becomes confusing when the function of interest takes more than two columns of your data frame, which requires the `pmap` family.
This post first briefly reviews how `mutate` works in combination with `map` or `map2`, then presents two approaches that avoid confusion around name assignments when using `pmap`.

- How mutate works with vectorized functions
- Non-vectorized function with one or two input arguments (map or map2)
- Non-vectorized function with three or more input arguments (pmap)

# How `mutate` works with vectorized functions

In most cases, the operation you want to perform in `mutate` is vectorized, and there is no need for a `map` family function.
This works because the output of the operation (column `c` in the example below) has the same length as the original data frame, so `mutate` only needs to append one column to the data frame.

    library(tidyverse)

    df0 <- tibble(a = 1:3, b = 4:6)
    df0 %>%
      mutate(c = a + b)

# Non-vectorized function with one or two input arguments (`map` or `map2`)

Imagine that we want to create a new column containing an arithmetic progression in each row [ref (in Japanese)].
Since the `seq` function is not vectorized, we cannot use it directly in a `mutate` chain.

    df1 <- tibble(a = c(1, 2), b = c(3, 6), c = c(8, 10))
    df1 %>%
      mutate(d = seq(a, b))
    # Error in mutate_impl(.data, dots) :
    #   Evaluation error: 'from' must be of length 1.

Instead, we can use a `map` family function here.
`map` family functions take list(s) or vector(s) as input and apply the function of interest to each element in turn.
Because a data frame in R is a list of columns, and each column is a vector that can be iterated over element by element, `map` works very well in combination with `mutate`.
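
To make the element-by-element behaviour concrete, here is a minimal sketch of `map` and `map2` used on their own, outside `mutate`. The data frame `df_demo` and the doubling function are made up for this illustration and are not part of the original examples.

```r
library(purrr)
library(tibble)

# Hypothetical two-row data frame, used only for this illustration
df_demo <- tibble(a = c(1, 2), b = c(3, 6))

# map() calls the function once per element of the column and returns a list
doubled <- map(df_demo$a, function(x) x * 2)

# map2() walks two columns in parallel: seq(1, 3), then seq(2, 6)
progressions <- map2(df_demo$a, df_demo$b, seq)

str(doubled)
str(progressions)
```

Both results are lists with one element per row, which is exactly the shape `mutate` needs for a new list-column.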

In the `seq` example, we want to provide two input arguments to the function, `from` and `to`.
`map2` is the appropriate function for this.

    df2 <- df1 %>%
      mutate(d = map2(a, b, seq))
    as.data.frame(df2)
    #   a b  c             d
    # 1 1 3  8       1, 2, 3
    # 2 2 6 10 2, 3, 4, 5, 6

The figure below shows how the `map` function handles this process in a `mutate` chain.

As with `map` outside `mutate`, we can use `map_dbl` or `map_chr` to create columns of type `double` or `character`.
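
As a rough sketch of the typed variants, the snippet below uses `map2_dbl` and `map2_chr`, the two-input counterparts of `map_dbl` and `map_chr`. The data frame `df_demo` and the column names `total` and `label` are assumptions made for this illustration only.

```r
library(dplyr)
library(purrr)
library(tibble)

# Hypothetical data frame, used only for this illustration
df_demo <- tibble(a = c(1, 2), b = c(3, 6))

df_typed <- df_demo %>%
  mutate(
    # one number per row, so the result is a plain double column
    total = map2_dbl(a, b, ~ sum(seq(.x, .y))),
    # one string per row, so the result is a plain character column
    label = map2_chr(a, b, ~ paste0(.x, " to ", .y))
  )

df_typed
```

Unlike `map2`, which always returns a list-column, these variants return atomic vectors and fail if the function does not return exactly one value of the expected type per row.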

If we want to refer to the arguments explicitly, `.x` and `.y` can be used.
See what happens with this:

    df2 <- mutate(df1, d = map2(a, b, ~seq(.y, .x)))
    as.data.frame(df2)
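
For readers who don't have R at hand, here is a small self-contained version of the same call with the result spelled out. The name `df_rev` is introduced only for this sketch.

```r
library(dplyr)
library(purrr)
library(tibble)

df1 <- tibble(a = c(1, 2), b = c(3, 6), c = c(8, 10))

# .x is bound to the first vector (a) and .y to the second (b),
# so seq(.y, .x) counts down from b to a
df_rev <- mutate(df1, d = map2(a, b, ~ seq(.y, .x)))

df_rev$d[[1]]  # 3 2 1
df_rev$d[[2]]  # 6 5 4 3 2
```

Swapping `.x` and `.y` reverses each progression, which shows that the shorthand follows the order of the vectors passed to `map2`, not their column names.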

# Non-vectorized function with three or more input arguments (`pmap`)

Assigning column names becomes confusing when using three or more columns, because shorthand like `.x` or `.y` is no longer available.
Let's take a look at the following example using the `rnorm` function.

## Case example

Generate a list of random numbers for each row with the `rnorm` function. Each row of the original data frame contains different values of `mean`, `sd`, and `n`.

We will first prepare a data frame with columns corresponding to `mean`, `sd`, and `n`, and then apply the `rnorm` function to each row using `pmap`.
Each element of the new column `data` contains a vector of random samples *1.
This type of structure is called a "nested data frame", and there are many resources on it, such as 25 Many models | R for Data Science.

## A simple case

If your data frame has __exactly the same column names and number of columns__ as the input arguments of the function of interest, a simple syntax like the one below works *2.

    df4 <- tribble(
      ~mean, ~sd, ~n,
      1, 0.03, 2,
      10, 0.1, 4,
      5, 0.1, 4
    )
    df4.2 <- df4 %>%
      mutate(data = pmap(., rnorm))
    as.data.frame(df4.2)

One caution is that a syntax like the one below doesn't work.
The names `n`, `mean`, and `sd` inside the formula refer to the whole columns rather than the current row's values, so `pmap` effectively calls `rnorm(df4$n, df4$mean, df4$sd)` for each row, and each element of the new column contains three random samples drawn from the same vectors of `mean` and `sd` *3.

    df4 %>%
      mutate(data = pmap(., ~rnorm(n, mean, sd))) %>%
      as.data.frame() # Wrong answer
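
Because the samples are random, the wrong answer is easy to miss by eye. One way to check, sketched below, is to compare the length of each generated vector with the `n` column. The names `right` and `wrong` are introduced only for this sketch.

```r
library(dplyr)
library(purrr)
library(tibble)

df4 <- tribble(
  ~mean, ~sd, ~n,
  1, 0.03, 2,
  10, 0.1, 4,
  5, 0.1, 4
)

# Columns are matched to rnorm()'s arguments by name, one row at a time
right <- df4 %>% mutate(data = pmap(., rnorm))

# n, mean and sd inside the formula are the whole columns, not row values
wrong <- df4 %>% mutate(data = pmap(., ~ rnorm(n, mean, sd)))

lengths(right$data)  # follows the n column: 2 4 4
lengths(wrong$data)  # one value per row of df4 in every element
```

If the lengths do not follow the `n` column, the arguments were not matched row by row.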

## Number of columns > Number of input arguments

In most cases, however, you will have more columns than input arguments.
`pmap` complains in this case, reporting an unused argument.

    df5 <- tribble(
      ~mean, ~sd, ~dummy, ~n,
      1, 0.03, "a", 2,
      10, 0.1, "b", 4,
      5, 0.1, "c", 4
    )
    df5 %>%
      mutate(data = pmap(., rnorm)) # Error
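
The sketch below reproduces the failure outside `mutate` and captures the error with `tryCatch`, so that the script keeps running. The exact wording of the message depends on the installed `purrr` version, so it is not shown here.

```r
library(purrr)
library(tibble)

df5 <- tribble(
  ~mean, ~sd, ~dummy, ~n,
  1, 0.03, "a", 2,
  10, 0.1, "b", 4,
  5, 0.1, "c", 4
)

# pmap() passes every column as a named argument, so rnorm() also
# receives dummy = "a", which it does not accept
result <- tryCatch(
  pmap(df5, rnorm),
  error = function(e) e
)

inherits(result, "error")  # TRUE
conditionMessage(result)
```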

There are two ways to avoid this error.

### Make a small list on the fly

The first method is to create a small list that contains only the necessary columns (Ref: Dplyr: Alternatives to rowwise - tidyverse - RStudio Community).

    df5.2 <- df5 %>%
      mutate(data = pmap(list(n=n, mean=mean, sd=sd), rnorm))
    as.data.frame(df5.2)

Here, `list(n=n, mean=mean, sd=sd)` creates a new list with three vectors named `n`, `mean`, and `sd`, which serves the same purpose as the `df4` data frame in the example above.

Note that if you don't name the elements of the new list, they are matched to the input arguments of `rnorm` by position.
My recommendation is to always name the list elements.

    df5 %>%
      mutate(data = pmap(list(n, mean, sd), rnorm)) # Correct but not recommended
    df5 %>%
      mutate(data = pmap(list(mean, sd, n), rnorm)) # Wrong answer
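
To see the difference between matching by name and matching by position, the sketch below compares the length of each generated vector. The names `by_name` and `by_position` are introduced only for this illustration.

```r
library(dplyr)
library(purrr)
library(tibble)

df5 <- tribble(
  ~mean, ~sd, ~dummy, ~n,
  1, 0.03, "a", 2,
  10, 0.1, "b", 4,
  5, 0.1, "c", 4
)

# Named elements: the order inside list() does not matter
by_name <- df5 %>%
  mutate(data = pmap(list(sd = sd, n = n, mean = mean), rnorm))

# Unnamed elements: mean is silently used as rnorm()'s first argument, n
by_position <- df5 %>%
  mutate(data = pmap(list(mean, sd, n), rnorm))

lengths(by_name$data)      # 2 4 4
lengths(by_position$data)  # 1 10 5
```

In the unnamed version the `mean` column (1, 10, 5) ends up as the sample size, which is why no error is raised even though the result is wrong.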

### Use `...` to ignore unused columns

The second method is to absorb unused columns with `...` (Ref: Map over multiple inputs simultaneously. — map2 • purrr).
A syntax like the one below works because `pmap` automatically matches the names of the input list to the argument names in `function()`.
In other words, the column names of the data frame must match the argument names in `function()`.

    df5.3 <- df5 %>%
      mutate(data = pmap(., function(n, mean, sd, ...) rnorm(n=n, mean=mean, sd=sd)))
    as.data.frame(df5.3)

The input arguments of `function()` and `rnorm()` are __not__ automatically matched by name. It is recommended to name the input arguments explicitly when calling the function of interest (`rnorm` in this case).

    df5 %>%
      mutate(data = pmap(., function(n, mean, sd, ...) rnorm(n, mean, sd))) # Correct but not recommended
    df5 %>%
      mutate(data = pmap(., function(n, mean, sd, ...) rnorm(mean, sd, n))) # Wrong answer
    df5 %>%
      mutate(data = pmap(., function(mean, sd, n, ...) rnorm(mean, sd, n))) # Wrong answer
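
A quick sanity check for these variants, sketched below, is to test whether each generated vector has the length given in the `n` column. The names `named` and `positional_wrong` are introduced only for this sketch.

```r
library(dplyr)
library(purrr)
library(tibble)

df5 <- tribble(
  ~mean, ~sd, ~dummy, ~n,
  1, 0.03, "a", 2,
  10, 0.1, "b", 4,
  5, 0.1, "c", 4
)

# Arguments passed to rnorm() by name
named <- df5 %>%
  mutate(data = pmap(., function(n, mean, sd, ...) rnorm(n = n, mean = mean, sd = sd)))

# Arguments passed to rnorm() by position, in the wrong order
positional_wrong <- df5 %>%
  mutate(data = pmap(., function(mean, sd, n, ...) rnorm(mean, sd, n)))

all(lengths(named$data) == df5$n)             # TRUE
all(lengths(positional_wrong$data) == df5$n)  # FALSE
```

The check only catches a misplaced `n`; swapping `mean` and `sd` would still need a closer look at the values.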

A syntax like the one below gives unexpected output, as you saw in the `df4` example.

    df5 %>%
      mutate(data = pmap(., function(...) rnorm(n=n, mean=mean, sd=sd))) # Wrong answer
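
If you do want a function that takes only `...`, one possible workaround (not from the original post) is to capture the dots into a named list and pull the row values out of it. The names `row_args` and `df5.4` are assumptions made for this sketch.

```r
library(dplyr)
library(purrr)
library(tibble)

df5 <- tribble(
  ~mean, ~sd, ~dummy, ~n,
  1, 0.03, "a", 2,
  10, 0.1, "b", 4,
  5, 0.1, "c", 4
)

df5.4 <- df5 %>%
  mutate(data = pmap(., function(...) {
    # the current row's values, named after the columns
    row_args <- list(...)
    rnorm(n = row_args$n, mean = row_args$mean, sd = row_args$sd)
  }))

lengths(df5.4$data)  # 2 4 4
```

This is more verbose than naming the arguments in `function()`, so it is mainly useful when the set of columns is not known in advance.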

## Column names are different from the input argument names

You can use either of the two approaches above.

    df6 <- tribble(
      ~mean1, ~sd1, ~dummy, ~n1,
      1, 0.03, "a", 2,
      10, 0.1, "b", 4,
      5, 0.1, "c", 4
    )

    df6.2 <- df6 %>%
      mutate(data = pmap(list(mean=mean1, sd=sd1, n=n1), rnorm))
    as.data.frame(df6.2)

    df6.3 <- df6 %>%
      mutate(data = pmap(., function(n1, mean1, sd1, ...) rnorm(n=n1, mean=mean1, sd=sd1)))
    as.data.frame(df6.3)
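
As a small check that the two approaches are interchangeable, the sketch below runs both with the same random seed and compares the results. The seed value and the names `via_list` and `via_dots` are arbitrary choices for this illustration.

```r
library(dplyr)
library(purrr)
library(tibble)

df6 <- tribble(
  ~mean1, ~sd1, ~dummy, ~n1,
  1, 0.03, "a", 2,
  10, 0.1, "b", 4,
  5, 0.1, "c", 4
)

# Approach 1: build a small named list on the fly
set.seed(42)
via_list <- df6 %>%
  mutate(data = pmap(list(mean = mean1, sd = sd1, n = n1), rnorm))

# Approach 2: absorb unused columns with ...
set.seed(42)
via_dots <- df6 %>%
  mutate(data = pmap(., function(n1, mean1, sd1, ...) rnorm(n = n1, mean = mean1, sd = sd1)))

identical(via_list$data, via_dots$data)  # TRUE
```

Both versions call `rnorm` once per row with the same named arguments, so resetting the seed makes them draw the same numbers.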

*1: In the examples below (and above), we also use the `as.data.frame` function to expose the actual numbers in the vectors.

*2: This works even if the order of the columns differs from the order of the input arguments.

*3: This happens because `rnorm` is actually vectorized. See `?rnorm`: *If length(n) > 1, the length is taken to be the number required.*
