# Vectorization, Purrr, and Mutate

Sometimes R is a bit too intuitive, and the other day I wondered what was wrong with my code. The problem had to do with vectorized functions within a `mutate` statement. I usually use the `paste` function and the `ifelse` function within `mutate`, so the vectorization is automatic. However, for a specific task at work, I was working with a non-vectorized function, and it took me a little while to figure out what was wrong with my code.

So I decided to write a little post as a reminder to myself of how vectorized functions work in `mutate`.

Let’s start with some sample data.

```r
library(dplyr)  # also provides the %>% pipe

sample_df <- dplyr::tibble(
  list_col = list(c("a", "b", "c"), c("a", "b"), "c", c("e", "f")),
  d = c(1, 2, 3, 4)
)
sample_df
## # A tibble: 4 × 2
##   list_col      d
##   <list>    <dbl>
## 1 <chr [3]>     1
## 2 <chr [2]>     2
## 3 <chr [1]>     3
## 4 <chr [2]>     4
```

The data frame above has two columns: a list column holding character vectors and a numeric (double) column. Now, we want to get the length of the vector in each row and store it in a new column. Naively, I tried something like this…

```r
sample_df %>%
  dplyr::mutate(
    length_vec = length(list_col)
  )
## # A tibble: 4 × 3
##   list_col      d length_vec
##   <list>    <dbl>      <int>
## 1 <chr [3]>     1          4
## 2 <chr [2]>     2          4
## 3 <chr [1]>     3          4
## 4 <chr [2]>     4          4
```

For my task at work, I was working with JSON data, but the example above demonstrates the problem I had. Instead of getting the length of each individual vector in the `list_col` rows, I was getting the length of the `list_col` list itself, which is the number of rows of the data frame. Now if I do …

```r
length(sample_df$list_col)
## [1] 4
```

… I get a scalar (a vector of length 1) back. R then recycles this single value to fill the whole `length_vec` column with 4s.

To illustrate this behavior, we can create a data frame like this:

```r
data.frame(
  a = 1,
  b = 1:2,
  c = 1:5,
  d = letters[1:10]
)
##    a b c d
## 1  1 1 1 a
## 2  1 2 2 b
## 3  1 1 3 c
## 4  1 2 4 d
## 5  1 1 5 e
## 6  1 2 1 f
## 7  1 1 2 g
## 8  1 2 3 h
## 9  1 1 4 i
## 10 1 2 5 j

dplyr::tibble(
  a = "letter:",
  d = letters[1:10]
)
## # A tibble: 10 × 2
##    a       d
##    <chr>   <chr>
##  1 letter: a
##  2 letter: b
##  3 letter: c
##  4 letter: d
##  5 letter: e
##  6 letter: f
##  7 letter: g
##  8 letter: h
##  9 letter: i
## 10 letter: j
```

With `tibble`, the first call would fail with an error instead, because tibbles only recycle values of size one. `data.frame` is more permissive, but it only recycles a column if it can be repeated a whole number of times to fill the data frame.

That’s what basically happened to me.
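The recycling rule for `data.frame` described above can be checked in base R. The snippet below is a minimal sketch with made-up column values: a shorter column is recycled when it fits a whole number of times, and the error raised otherwise is captured with `tryCatch`.

```r
# A length-2 column fits into 4 rows exactly twice, so it is recycled.
ok <- data.frame(a = 1:2, b = 1:4)
ok$a
## [1] 1 2 1 2

# A length-3 column does not fit into 4 rows a whole number of times,
# so data.frame() stops with an error instead of recycling.
msg <- tryCatch(
  {
    data.frame(a = 1:3, b = 1:4)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
msg  # holds the error message rather than "no error"
```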

# Fixing Vectorization with Purrr::map

To fix the issue, we can use a `purrr` map function inside `mutate` to get the length of each vector.

```r
sample_df %>%
  dplyr::mutate(
    length_vec = purrr::map_int(list_col, ~ length(.))
  )
## # A tibble: 4 × 3
##   list_col      d length_vec
##   <list>    <dbl>      <int>
## 1 <chr [3]>     1          3
## 2 <chr [2]>     2          2
## 3 <chr [1]>     3          1
## 4 <chr [2]>     4          2
```
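For this particular case, base R also has a vectorized counterpart to `length`: `lengths()` returns the length of every element of a list. Here is a minimal sketch, using a base `data.frame` named `demo_df` (a hypothetical stand-in for `sample_df`) so that it runs without any packages:

```r
# Stand-in for sample_df, built with base R only
demo_df <- data.frame(d = c(1, 2, 3, 4))
demo_df$list_col <- list(c("a", "b", "c"), c("a", "b"), "c", c("e", "f"))

# length() sees one list with four elements ...
length(demo_df$list_col)
## [1] 4

# ... while lengths() returns one value per list element
demo_df$length_vec <- lengths(demo_df$list_col)
demo_df$length_vec
## [1] 3 2 1 2
```

Inside `mutate`, writing `length_vec = lengths(list_col)` should give the same column as the `map_int` version. The `purrr` approach is the more general one, since it works for any function and not just `length`.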

To illustrate the problem further, consider the code below.

- The first function, `vec_fn_above_below`, is vectorized by hand with a `for` loop (column `cat`).
- The second function, `fn_above_below`, is vectorized with R's `Vectorize` function (column `cat_2`).
- In the `mutate` call, for `cat_3`, we use `ifelse`, which is vectorized by default in R.
- For `cat_4`, we vectorize the function by using `purrr::map_chr`.

```r
# Vectorized by hand: loop over every element of the input
vec_fn_above_below <- function(column_name) {
  res <- base::vector(mode = "character", length = length(column_name))
  for (i in seq_along(column_name)) {
    if (column_name[i] >= 0) {
      res[i] <- "above"
    } else {
      res[i] <- "below"
    }
  }
  return(res)
}

# Not vectorized: `if` expects a single TRUE or FALSE
fn_above_below <- function(column_name) {
  if (column_name >= 0) {
    res <- "above"
  } else {
    res <- "below"
  }
  return(res)
}
# Wrap it so that it is applied element by element
fn_above_below <- base::Vectorize(fn_above_below)

df <- dplyr::tibble(
  numbers = sample(-10:10, size = 10)
)

df %>%
  dplyr::mutate(
    cat = vec_fn_above_below(numbers),
    cat_2 = fn_above_below(numbers),
    cat_3 = ifelse(numbers >= 0, "above", "below"),
    cat_4 = purrr::map_chr(
      numbers,
      function(x) {
        if (x >= 0) {
          res <- "above"
        } else {
          res <- "below"
        }
        return(res)
      }
    ),
    # Single TRUE/FALSE that is recycled to every row
    cat_5 = sum(c(identical(cat, cat_2), identical(cat_2, cat_3), identical(cat_3, cat_4))) == 3
  )
## # A tibble: 10 × 6
##    numbers cat   cat_2 cat_3 cat_4 cat_5
##      <int> <chr> <chr> <chr> <chr> <lgl>
##  1      -8 below below below below TRUE
##  2       0 above above above above TRUE
##  3       9 above above above above TRUE
##  4       3 above above above above TRUE
##  5      -2 below below below below TRUE
##  6       7 above above above above TRUE
##  7      -9 below below below below TRUE
##  8     -10 below below below below TRUE
##  9      -7 below below below below TRUE
## 10      -3 below below below below TRUE
```

All four approaches give the same result, as the `cat_5` check confirms.
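To see why the plain `if` version needs wrapping at all, it helps to call it on a vector directly. The sketch below uses a hypothetical `scalar_above_below` (the same logic as the unwrapped `fn_above_below`) and catches whatever R signals: R 4.2.0 and later raise an error when an `if` condition has length greater than one, while older versions warned and used only the first element. It then shows `vapply` as a base R alternative to `Vectorize`.

```r
# Hypothetical scalar helper: same logic as the unwrapped fn_above_below
scalar_above_below <- function(x) {
  if (x >= 0) "above" else "below"
}

numbers <- c(-8, 0, 9)

# Calling the scalar function on a whole vector is signalled as a problem
# (an error in recent R versions, a warning in older ones)
direct <- tryCatch(
  scalar_above_below(numbers),
  error = function(e) "error",
  warning = function(w) "warning"
)
direct

# Vectorize() wraps the function so it is applied element by element
wrapped <- Vectorize(scalar_above_below)
wrapped(numbers)
## [1] "below" "above" "above"

# vapply() does the same and also checks that each result is one string
vapply(numbers, scalar_above_below, character(1))
## [1] "below" "above" "above"
```

`vapply` has the advantage of failing loudly if the function ever returns something other than a single string, which is similar in spirit to the type check that `purrr::map_chr` performs.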
