The map family of functions in the purrr package is very useful for applying non-vectorized functions inside a dplyr::mutate chain (see GitHub – jennybc/row-oriented-workflows: Row-oriented workflows in R with the tidyverse or https://www.jimhester.com/2018/04/12/vectorize/).
I run into this need especially when dealing with nested data frames.
One drawback is that matching columns to input arguments becomes confusing when the function of interest needs more than two columns of your data frame (and you therefore use the pmap family).
This post first briefly reviews how mutate works in combination with map2, then presents two approaches to avoid confusion around name assignments when using pmap.
- How mutate works with vectorized functions
- Non-vectorized function with one or two input arguments (map or map2)
- Non-vectorized function with three or more input arguments (pmap)
How mutate works with vectorized functions
In most cases, the operation you want to perform in mutate is vectorized, and there is no need to use a map family function.
This works because the output of the function of interest (c in the example below) has the same length as the number of rows in the original data frame, so mutate only needs to append one column to the data frame.
library(tidyverse)

df0 <- tibble(a = 1:3, b = 4:6)
df0 %>% mutate(c = a + b)
Non-vectorized function with one or two input arguments (map or map2)
Imagine that we want to create a new column containing an arithmetic progression in each row [ref (in Japanese)].
Because the seq function is not vectorized, we cannot use it directly in mutate.
df1 <- tibble(a = c(1, 2), b = c(3, 6), c = c(8, 10))
df1 %>% mutate(d = seq(a, b))
# Error in mutate_impl(.data, dots) : Evaluation error: 'from' must be of length 1.
Instead, we can use a map family function here.
map family functions take lists or vectors as input arguments and apply the function of interest to each element of the given inputs.
Because a data frame is a list of columns and each column is itself a vector, map works very well in combination with mutate.
In this example, we want to provide two input arguments to the seq function, so map2 is the appropriate function for this.
df2 <- df1 %>% mutate(d = map2(a, b, seq))
as.data.frame(df2)
#   a b  c             d
# 1 1 3  8       1, 2, 3
# 2 2 6 10 2, 3, 4, 5, 6
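As a rough sketch of what map2 does here, the same call can be reproduced outside mutate: it walks along both vectors in parallel and calls seq once per pair. The variable names below (starts, ends, manual) are made up for illustration only.

```r
library(purrr)

starts <- c(1, 2)   # plays the role of column a
ends   <- c(3, 6)   # plays the role of column b

# map2 calls seq(starts[i], ends[i]) for each i and collects the results in a list
by_map2 <- map2(starts, ends, seq)

# The same thing written out by hand
manual <- list(seq(starts[1], ends[1]), seq(starts[2], ends[2]))

identical(by_map2, manual)
lengths(by_map2)
```

Because the result is a list with one element per input pair, mutate can store it as a list column with one entry per row.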
The figure below shows how the map function handles this process in mutate.
As with map outside mutate, we can use typed variants such as map_chr to create columns with a specific atomic type instead of a list column.
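A minimal sketch of typed variants inside mutate, assuming the two-input counterparts map2_int and map2_chr from purrr. The column names len and label are hypothetical and not part of the original example.

```r
library(dplyr)
library(purrr)
library(tibble)

df1 <- tibble(a = c(1, 2), b = c(3, 6), c = c(8, 10))

df_typed <- df1 %>%
  mutate(
    # one integer per row: the length of each progression
    len   = map2_int(a, b, ~ length(seq(.x, .y))),
    # one string per row: the progression collapsed into a label
    label = map2_chr(a, b, ~ paste(seq(.x, .y), collapse = "-"))
  )

df_typed
```

Both new columns are ordinary atomic vectors, so they print directly without converting the tibble with as.data.frame.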
If we want to refer to the arguments explicitly, .x and .y can be used.
See what happens with this:
df2 <- mutate(df1, d = map2(a, b, ~seq(.y, .x)))
as.data.frame(df2)
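For readers who want to check their guess, here is a small sketch that inspects the result. Since .x stands for a and .y for b, swapping them should make each sequence run from b down to a.

```r
library(dplyr)
library(purrr)
library(tibble)

df1 <- tibble(a = c(1, 2), b = c(3, 6), c = c(8, 10))

# .x is the element of a, .y is the element of b, so this calls seq(b, a)
df2 <- mutate(df1, d = map2(a, b, ~ seq(.y, .x)))

df2$d[[1]]  # descending from 3 to 1
df2$d[[2]]  # descending from 6 to 2
```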
Non-vectorized function with three or more input arguments (pmap)
Assigning columns to arguments becomes confusing when using three or more columns, because we no longer have shorthand like .x and .y.
Let's take a look at the following example using pmap.
Generate a list of random numbers for each row with the rnorm function. Each row of the original data frame contains different values of mean, sd, and n.
We will first prepare a data frame with columns corresponding to mean, sd, and n, and apply the rnorm function to each row using pmap.
Each element of the new column data contains a vector of random samples *1.
This type of structure is called a "nested data frame", and there are many resources on it, such as 25 Many models | R for Data Science.
A simple case
If your data frame has exactly the same column names and number of columns as the input arguments of the function of interest, a simple syntax like the one below works *2.
df4 <- tribble(
  ~mean, ~sd,  ~n,
  1,     0.03, 2,
  10,    0.1,  4,
  5,     0.1,  4
)
df4.2 <- df4 %>% mutate(data = pmap(., rnorm))
as.data.frame(df4.2)
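One way to convince yourself that the columns are matched by name rather than by position is to reorder them and compare results under the same random seed. This is only a sketch: the seed value and the names in_order and reordered are arbitrary choices for illustration.

```r
library(dplyr)
library(purrr)
library(tibble)

df4 <- tribble(
  ~mean, ~sd,  ~n,
  1,     0.03, 2,
  10,    0.1,  4,
  5,     0.1,  4
)

# Original column order: mean, sd, n
set.seed(1)
in_order <- pmap(df4, rnorm)

# Same data with the columns shuffled: n, sd, mean
set.seed(1)
reordered <- pmap(select(df4, n, sd, mean), rnorm)

identical(in_order, reordered)
lengths(in_order)  # should follow the n column
```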
One caution is that syntax like the one below doesn't work.
Inside the formula, n, mean, and sd are not the per-row values but the whole columns, so pmap effectively calls rnorm(df4$n, df4$mean, df4$sd) for each row, and each element of the new column contains three random samples drawn from the same vectors of mean and sd *3.
df4 %>%
  mutate(data = pmap(., ~rnorm(n, mean, sd))) %>%
  as.data.frame() # Wrong answer
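The problem is easy to spot by checking the length of each element. The sketch below also shows one possible fix that keeps the formula syntax by using purrr's positional placeholders ..1, ..2, ..3. Note that this alternative is not one of the approaches discussed in this post, and it depends on the column order (mean, sd, n).

```r
library(dplyr)
library(purrr)
library(tibble)

df4 <- tribble(
  ~mean, ~sd,  ~n,
  1,     0.03, 2,
  10,    0.1,  4,
  5,     0.1,  4
)

# n, mean and sd resolve to whole columns here, so every row gets 3 samples
wrong <- df4 %>% mutate(data = pmap(., ~ rnorm(n, mean, sd)))
lengths(wrong$data)

# Positional placeholders refer to the per-row values: ..1 = mean, ..2 = sd, ..3 = n
fixed <- df4 %>% mutate(data = pmap(., ~ rnorm(n = ..3, mean = ..1, sd = ..2)))
lengths(fixed$data)
```

The positional version breaks silently if the columns are reordered, which is one reason to prefer the name-based approaches.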
Number of columns > Number of input arguments
In most cases, however, you will have more columns than input arguments.
pmap complains in this case, saying that there is an unused argument.
df5 <- tribble(
  ~mean, ~sd,  ~dummy, ~n,
  1,     0.03, "a",    2,
  10,    0.1,  "b",    4,
  5,     0.1,  "c",    4
)
df5 %>% mutate(data = pmap(., rnorm)) # Error
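If you want to see the failure without stopping a script, the call can be wrapped in tryCatch, as in this sketch. The exact wording of the message depends on the installed dplyr and purrr versions, so it is printed rather than assumed.

```r
library(dplyr)
library(purrr)
library(tibble)

df5 <- tribble(
  ~mean, ~sd,  ~dummy, ~n,
  1,     0.03, "a",    2,
  10,    0.1,  "b",    4,
  5,     0.1,  "c",    4
)

# rnorm has no argument called dummy, so the call fails for the first row
res <- tryCatch(
  df5 %>% mutate(data = pmap(., rnorm)),
  error = function(e) e
)

inherits(res, "error")
conditionMessage(res)
```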
There are two ways to avoid this error.
Make a small list on the fly
The first method is to create a small list that contains only the necessary columns (Ref: Dplyr: Alternatives to rowwise - tidyverse - RStudio Community).
df5.2 <- df5 %>%
  mutate(data = pmap(list(n=n, mean=mean, sd=sd), rnorm))
as.data.frame(df5.2)
list(n=n, mean=mean, sd=sd) creates a new list with three vectors named n, mean, and sd, which serves the same purpose as the df4 data frame in the example above.
Note that if you don't give names to the elements of the new list, the order of the list items is used to match them to the input arguments of rnorm.
My recommendation is to always assign names to the list elements.
df5 %>% mutate(data = pmap(list(n, mean, sd), rnorm)) # Correct but not recommended
df5 %>% mutate(data = pmap(list(mean, sd, n), rnorm)) # Wrong answer
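The effect of list order can be seen from the number of samples in each element. In this sketch pmap is called outside mutate with df5$ columns to keep it short; named and unnamed_bad are illustrative names only.

```r
library(purrr)
library(tibble)

df5 <- tribble(
  ~mean, ~sd,  ~dummy, ~n,
  1,     0.03, "a",    2,
  10,    0.1,  "b",    4,
  5,     0.1,  "c",    4
)

# Named elements are matched to rnorm's arguments by name
named <- pmap(list(mean = df5$mean, sd = df5$sd, n = df5$n), rnorm)
lengths(named)        # follows the n column

# Unnamed elements are matched by position: mean is taken as n, sd as mean, n as sd
unnamed_bad <- pmap(list(df5$mean, df5$sd, df5$n), rnorm)
lengths(unnamed_bad)  # follows the mean column instead
```

With names, the order of the list elements no longer matters, which is why naming them is the safer habit.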
Use ... to ignore unused columns
The second method is to absorb unused columns with
... (Ref: Map over multiple inputs simultaneously. — map2 • purrr).
A syntax like the one below works because pmap automatically matches the names of the input list to the argument names in the function definition.
In other words, the column names of the data frame must match the argument names in the function definition.
df5.3 <- df5 %>%
  mutate(data = pmap(., function(n, mean, sd, ...) rnorm(n=n, mean=mean, sd=sd)))
as.data.frame(df5.3)
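As a sanity check, the two methods should produce the same samples when the random seed is reset before each call. This is a sketch with an arbitrary seed, run outside mutate for brevity; by_list and by_dots are illustrative names.

```r
library(purrr)
library(tibble)

df5 <- tribble(
  ~mean, ~sd,  ~dummy, ~n,
  1,     0.03, "a",    2,
  10,    0.1,  "b",    4,
  5,     0.1,  "c",    4
)

# Method 1: a small named list with only the needed columns
set.seed(42)
by_list <- pmap(list(n = df5$n, mean = df5$mean, sd = df5$sd), rnorm)

# Method 2: pass every column and let ... absorb dummy
set.seed(42)
by_dots <- pmap(df5, function(n, mean, sd, ...) rnorm(n = n, mean = mean, sd = sd))

identical(by_list, by_dots)
```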
The arguments passed to rnorm() inside the function body are not matched by name automatically; unless you name them, they are matched by position. I recommend explicitly naming the arguments of the function of interest (rnorm in this case).
df5 %>% mutate(data = pmap(., function(n, mean, sd, ...) rnorm(n, mean, sd))) # Correct but not recommended
df5 %>% mutate(data = pmap(., function(n, mean, sd, ...) rnorm(mean, sd, n))) # Wrong answer
df5 %>% mutate(data = pmap(., function(mean, sd, n, ...) rnorm(mean, sd, n))) # Wrong answer
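A syntax like the one below gives unexpected output, as you saw in the simple case above, because n, mean, and sd are not arguments of the function and therefore refer to the whole columns.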
df5 %>%
  mutate(data = pmap(., function(...) rnorm(n=n, mean=mean, sd=sd))) # Wrong answer
Column names are different from the input argument names
You can use either of the two approaches above.
df6 <- tribble(
  ~mean1, ~sd1, ~dummy, ~n1,
  1,      0.03, "a",    2,
  10,     0.1,  "b",    4,
  5,      0.1,  "c",    4
)

df6.2 <- df6 %>%
  mutate(data = pmap(list(mean=mean1, sd=sd1, n=n1), rnorm))
as.data.frame(df6.2)

df6.3 <- df6 %>%
  mutate(data = pmap(., function(n1, mean1, sd1, ...) rnorm(n=n1, mean=mean1, sd=sd1)))
as.data.frame(df6.3)
*1: In the examples below (and above), we additionally use the as.data.frame function to expose the actual numbers in the vectors.
*2: This works even if the order of the columns is different from the order of the input arguments.
*3: This happens because rnorm is actually vectorized. See ?rnorm: If length(n) > 1, the length is taken to be the number required.