# Vectorisation is your best friend: replacing many elements in a character vector

This article was first published on **NumberTheory » R stuff**, and kindly contributed to R-bloggers.


As with any programming language, R allows you to tackle the same problem in many different ways or styles. These styles differ in the amount of code, readability, and speed. In this post I want to illustrate this by tackling the following problem. We have a `data.frame` that contains an `ID` character column:

    n = 9e6
    df = data.frame(values = rnorm(n),
                    ID = rep(LETTERS[1:3], each = n/3),
                    stringsAsFactors = FALSE)
    > head(df)
          values ID
    1 -0.7355823  A
    2 -0.4729925  A
    3 -0.7417259  A
    4  1.7633367  A
    5 -0.3006790  A
    6  0.6785947  A

We want to replace all occurrences of `A` by `'Text for A'`, and the same for `B` and `C`. One approach is to use a combination of a `for`-loop and some `if` statements, in a style that looks more like C:

    translator_if_for = function(input_vector) {
      output_vector = input_vector
      for(index in seq_along(input_vector)) {
        if(input_vector[index] == 'A') {
          output_vector[index] = 'Text for A'
        } else if(input_vector[index] == 'B') {
          output_vector[index] = 'Text for B'
        } else if(input_vector[index] == 'C') {
          output_vector[index] = 'Text for C'
        }
      }
      return(output_vector)
    }
    dum_if_for = translator_if_for(df$ID)

This kind of *imperative* programming style is not typically R-like. The first response of an R-aficionado is to suggest using an `apply` loop. First we construct a helper function:

    translator_function = function(element) {
      switch(element,
             A = 'Text for A',
             B = 'Text for B',
             C = 'Text for C')
    }

which uses `switch` instead of the chain of `if` statements. Next we use `sapply` to call the helper function on each of the elements in `df$ID`:

    dum_switch_sapply = sapply(df$ID, translator_function)

The advantage here is that we use roughly half the amount of code to express the same functionality, and I find the code more readable (seeing its purpose at a glance). Readability, however, is in the eye of the beholder, and some people used to non-functional programming languages might prefer the more explicit `for`-loop and `if` statements.

Of course, R also supports vectorisation, which is of particular interest if you care about performance. For a vectorised solution, we first create a lookup vector:

    translator_vector = c(A = 'Text for A',
                          B = 'Text for B',
                          C = 'Text for C')

and subset this vector using `df$ID`:

    dum_vectorized = translator_vector[df$ID]

I encourage you to spend a little time figuring out what this subsetting trick does, as I think it is quite a nice trick. The code of this final solution is even shorter, although it does take some careful consideration on the part of the reader to understand what is happening. Careful naming of variables, or encapsulation in a function can solve this issue.
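To make the subsetting trick concrete, here is a minimal sketch on a tiny input. Indexing a named vector with a character vector looks up each element by name, so the result has one entry per element of the index. The names `lookup`, `ids` and `translate_ids` are hypothetical and used only for illustration; they do not appear in the code above.

```r
# Lookup table: the names are the keys, the elements are the replacements
lookup = c(A = 'Text for A', B = 'Text for B', C = 'Text for C')

# A small stand-in for df$ID
ids = c('B', 'A', 'C', 'A')

# Character subsetting matches each element of ids against names(lookup)
translated = lookup[ids]

# The result carries the keys as names; unname() drops them
unname(translated)

# Wrapping the trick in a function (hypothetical helper) documents the intent
translate_ids = function(ids, lookup) {
  unname(lookup[ids])
}
translate_ids(ids, lookup)
```

One thing to keep in mind with this approach: an ID that is not among the names of the lookup vector yields `NA` rather than being left unchanged.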

All three solutions yield the same result:

    all.equal(dum_if_for, dum_switch_sapply, check.attributes = FALSE)
    # TRUE
    all.equal(dum_vectorized, dum_switch_sapply, check.attributes = FALSE)
    # TRUE

but how long do they take? To find out, we benchmark the three solutions:

    library(rbenchmark)
    res = benchmark(if_for_solution = translator_if_for(df$ID),
                    function_solution = sapply(df$ID, translator_function),
                    vector_solution = translator_vector[df$ID],
                    replications = 10)
    res
                   test replications elapsed relative user.self sys.self
    2 function_solution           10 281.326   79.158   276.235    5.193
    1   if_for_solution           10 254.751   71.680   253.358    1.484
    3   vector_solution           10   3.554    1.000     3.052    0.504

The benchmark clearly shows that the performance of the vectorised solution is vastly superior to the other two, on the order of 70-80 times faster. In addition, the `sapply`-based solution is not faster than the `for`-loop based solution: in this run it is about a factor 1.10 slower. The take-home message: `apply`-loops are not inherently faster, and vectorisation is your friend!
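If you do not have `rbenchmark` installed, a rough comparison can be sketched with base R's `system.time()`. This is only an illustration: the input size `n_small` and the names `lookup` and `translate_one` are assumptions chosen so that the slow variant finishes quickly, and the timings will differ from machine to machine.

```r
# Smaller input than the 9e6 rows used in the post, so this runs in moments
n_small = 3e4
ids = rep(LETTERS[1:3], each = n_small / 3)

# Same lookup vector and helper as in the post, under hypothetical names
lookup = c(A = 'Text for A', B = 'Text for B', C = 'Text for C')
translate_one = function(element) {
  switch(element, A = 'Text for A', B = 'Text for B', C = 'Text for C')
}

# system.time() is base R; the "elapsed" entry is wall-clock seconds
time_sapply = system.time(res_sapply <- sapply(ids, translate_one))
time_vector = system.time(res_vector <- lookup[ids])

# Both approaches should give the same values
stopifnot(isTRUE(all.equal(res_sapply, res_vector, check.attributes = FALSE)))

time_sapply['elapsed']
time_vector['elapsed']
```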

PS: In this case, making the character vector a `factor` and simply replacing the `levels` is probably much faster still than using a vectorised substitution. However, the point of the post was to compare different coding styles, and this problem was just a convenient example.
