[This article was first published on R, and kindly contributed to R-bloggers].

Sometimes R is a bit too intuitive, and the other day I was left wondering what was wrong with my code. The problem had to do with vectorized functions within a `mutate` statement. I usually use functions such as `paste` and `ifelse` within `mutate`, which are already vectorized, so everything works automatically. For a specific task at work, however, I was working with a non-vectorized function, and it took me a little while to figure out what was going wrong.

So I decided to write a little post as a reminder to myself of how vectorized functions work in `mutate`.

```r
sample_df <- dplyr::tibble(
  list_col = list(c("a", "b", "c"), c("a", "b"), "c", c("e", "f")),
  d = c(1, 2, 3, 4)
)
sample_df

## # A tibble: 4 × 2
##   list_col      d
##   <list>    <dbl>
## 1 <chr [3]>     1
## 2 <chr [2]>     2
## 3 <chr [1]>     3
## 4 <chr [2]>     4
```

In the data frame above we have two columns: a list column holding character vectors and a numeric (double) column. Now we want to get the length of the vector in each row and store it in a new column. Naively, I tried something like this…

```r
library(magrittr)  # provides the %>% pipe used below

sample_df %>%
  dplyr::mutate(
    length_vec = length(list_col)
  )

## # A tibble: 4 × 3
##   list_col      d length_vec
##   <list>    <dbl>      <int>
## 1 <chr [3]>     1          4
## 2 <chr [2]>     2          4
## 3 <chr [1]>     3          4
## 4 <chr [2]>     4          4
```

For my task at work, I was working with JSON data, but the example above demonstrates the problem I had. Instead of getting the length of each individual vector in the `list_col` rows, I was getting the length of the `list_col` list itself, which is the number of rows of the data frame. Now if I do …

```r
length(sample_df$list_col)

## [1] 4
```

… I get a scalar (in R, a vector of length 1) back. Inside `mutate`, R recycles that single value to fill the whole `length_vec` column, which is why every row shows 4.

To illustrate this behavior, we can create a data frame like this:

```r
data.frame(
  a = 1,
  b = 1:2,
  c = 1:5,
  d = letters[1:10]
)

##    a b c d
## 1  1 1 1 a
## 2  1 2 2 b
## 3  1 1 3 c
## 4  1 2 4 d
## 5  1 1 5 e
## 6  1 2 1 f
## 7  1 1 2 g
## 8  1 2 3 h
## 9  1 1 4 i
## 10 1 2 5 j

dplyr::tibble(
  a = "letter:",
  d = letters[1:10]
)

## # A tibble: 10 × 2
##    a       d
##    <chr>   <chr>
##  1 letter: a
##  2 letter: b
##  3 letter: c
##  4 letter: d
##  5 letter: e
##  6 letter: f
##  7 letter: g
##  8 letter: h
##  9 letter: i
## 10 letter: j
```

With tibbles, the first call would fail with an error rather than run, because tibbles only recycle values of size one. Base `data.frame` is more permissive, but it also recycles a shorter vector only if it fits a whole number of times into the longest column.
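The "whole number of times" rule for base `data.frame` can be checked directly. The snippet below is a small sketch in base R only; the column values are made up for illustration, and `ok` and `msg` are hypothetical names.

```r
# 2 divides 10, so the shorter column is recycled five times
ok <- data.frame(b = 1:2, d = letters[1:10])
nrow(ok)

# 3 does not divide 10, so data.frame() refuses to recycle and
# signals an error; tryCatch() captures the message as a string
msg <- tryCatch(
  data.frame(b = 1:3, d = letters[1:10]),
  error = function(e) conditionMessage(e)
)
msg
```

Here `ok` ends up with 10 rows, while `msg` holds the error text instead of a data frame.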

That is basically what happened to me.

# Fixing Vectorization with purrr::map

To fix the issue, we can simply use `purrr::map_int` inside `mutate`, which applies `length` to each vector in the list column separately.

```r
sample_df %>%
  dplyr::mutate(
    length_vec = purrr::map_int(list_col, ~ length(.))
  )

## # A tibble: 4 × 3
##   list_col      d length_vec
##   <list>    <dbl>      <int>
## 1 <chr [3]>     1          3
## 2 <chr [2]>     2          2
## 3 <chr [1]>     3          1
## 4 <chr [2]>     4          2
```

To illustrate the problem more, consider the code below.

• For the first function, we use a for loop to vectorize the `vec_fn_above_below` function.
• The second function is vectorized by using the `Vectorize` function in R.
• In the `mutate` function, for `cat_3`, we use `ifelse`, which is vectorized by default in R.
• For `cat_4`, we vectorize the function by using `purrr::map_chr`.
```r
vec_fn_above_below <- function(column_name) {
  res <- base::vector(mode = 'character', length = length(column_name))
  for (i in seq_along(column_name)) {
    if (column_name[i] >= 0) {
      res[i] <- "above"
    } else {
      res[i] <- "below"
    }
  }
  return(res)
}

fn_above_below <- function(column_name) {
  if (column_name >= 0) {
    res <- "above"
  } else {
    res <- "below"
  }
  return(res)
}
fn_above_below <- base::Vectorize(fn_above_below)

df <- dplyr::tibble(
  numbers = sample(-10:10, size = 10)
)

df %>%
  dplyr::mutate(
    cat = vec_fn_above_below(numbers),
    cat_2 = fn_above_below(numbers),
    cat_3 = ifelse(numbers >= 0, "above", "below"),
    cat_4 = purrr::map_chr(
      numbers,
      function(x) {
        if (x >= 0) {
          res <- "above"
        } else {
          res <- "below"
        }
        return(res)
      }
    ),
    cat_5 = sum(c(identical(cat, cat_2), identical(cat_2, cat_3), identical(cat_3, cat_4))) == 3
  )

## # A tibble: 10 × 6
##    numbers cat   cat_2 cat_3 cat_4 cat_5
##      <int> <chr> <chr> <chr> <chr> <lgl>
##  1      -8 below below below below TRUE
##  2       0 above above above above TRUE
##  3       9 above above above above TRUE
##  4       3 above above above above TRUE
##  5      -2 below below below below TRUE
##  6       7 above above above above TRUE
##  7      -9 below below below below TRUE
##  8     -10 below below below below TRUE
##  9      -7 below below below below TRUE
## 10      -3 below below below below TRUE
```

All four approaches give the same result, as the `cat_5` column confirms.
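To see why the scalar version needs vectorizing in the first place, it helps to call it on a whole vector. The sketch below uses a hypothetical helper, `scalar_above_below`, that mirrors `fn_above_below` before `Vectorize` is applied. As far as I know, R 4.2.0 and later raise an error when `if` receives a condition longer than one, while older versions warn and use only the first element, so the `tryCatch` handles both cases.

```r
# Hypothetical scalar-only helper: if() expects a single TRUE/FALSE
scalar_above_below <- function(x) {
  if (x >= 0) "above" else "below"
}

numbers <- c(-2, 0, 5)

# Passing the whole vector hands if() a condition of length 3
outcome <- tryCatch(
  {
    scalar_above_below(numbers)
    "ran silently"
  },
  warning = function(w) "warning",
  error = function(e) "error"
)
outcome

# Applying the helper element by element gives one result per value
per_value <- vapply(numbers, scalar_above_below, character(1))
per_value
```

Depending on the R version, `outcome` is either `"warning"` or `"error"`, whereas the element-wise call returns one label for each number.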