# Vectorisation is your best friend: replacing many elements in a character vector

January 25, 2014

(This article was first published on NumberTheory » R stuff, and kindly contributed to R-bloggers)

As with any programming language, R allows you to tackle the same problem in many different ways or styles. These styles differ in the amount of code, readability, and speed. In this post I want to illustrate this by tackling the following problem. We have a `data.frame` that contains an `ID` character column:

```
n = 9e6
df = data.frame(values = rnorm(n),
                ID = rep(LETTERS[1:3], each = n/3),
                stringsAsFactors = FALSE)
head(df)
#       values ID
# 1 -0.7355823  A
# 2 -0.4729925  A
# 3 -0.7417259  A
# 4  1.7633367  A
# 5 -0.3006790  A
# 6  0.6785947  A
```

We want to replace all occurrences of `A` by `'Text for A'`, and the same for `B` and `C`. One approach is to use a combination of a `for`-loop and some `if` statements, in a style that looks more like C:

```
translator_if_for = function(input_vector) {
  # Start from a copy, so elements that match no branch stay unchanged
  output_vector = input_vector
  for(index in seq_along(input_vector)) {
    if(input_vector[index] == 'A') {
      output_vector[index] = 'Text for A'
    } else if(input_vector[index] == 'B') {
      output_vector[index] = 'Text for B'
    } else if(input_vector[index] == 'C') {
      output_vector[index] = 'Text for C'
    }
  }
  return(output_vector)
}
dum_if_for = translator_if_for(df$ID)
```

This kind of imperative programming style is not typically R-like. The first response of an R-aficionado is to suggest using an `apply` loop. First we construct a helper function:

```
translator_function = function(element) {
  switch(element,
         A = 'Text for A',
         B = 'Text for B',
         C = 'Text for C')
}
```

which uses `switch` instead of the chain of `if` statements. Next we use `sapply` to call the helper function on each of the elements in `df$ID`:

`dum_switch_sapply = sapply(df$ID, translator_function)`

The advantage here is that we use roughly half the amount of code to express the same functionality, and I find the code more readable (its purpose is clear at a glance). Readability, however, is in the eye of the beholder, and some people used to non-functional programming languages might prefer the more explicit `for`-loop and `if` statements.

Of course, R also supports vectorisation, which is of particular interest if you care about performance. For a vectorised solution, we first create a lookup vector:

```
translator_vector = c(A = 'Text for A',
                      B = 'Text for B',
                      C = 'Text for C')
```

and subset this vector using `df$ID`:

`dum_vectorized = translator_vector[df$ID]`

I encourage you to spend a little time figuring out what this subsetting trick does, as I think it is quite a nice trick. The code of this final solution is even shorter, although it does take some careful consideration on the part of the reader to understand what is happening. Careful naming of variables, or encapsulation in a function can solve this issue.
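To make the trick concrete, here is a minimal sketch on a five-element stand-in for `df$ID`. The wrapper name `translate_ids` and the extra code `"D"` are my own illustrative assumptions, not part of the solutions benchmarked in this post. Subsetting a named vector with a character vector looks up each element by name, so the result has one entry per input element:

```r
# Small stand-in for df$ID; "D" is a hypothetical code with no translation
ids = c("A", "C", "B", "A", "D")

translator_vector = c(A = 'Text for A',
                      B = 'Text for B',
                      C = 'Text for C')

# Character subsetting: each element of ids selects the entry with that name
looked_up = translator_vector[ids]

# Hypothetical wrapper that names the intent and drops the names attribute
translate_ids = function(ids, lookup) {
  unname(lookup[ids])
}
translated = translate_ids(ids, translator_vector)
```

One difference to keep in mind: a value that is missing from the lookup vector, such as `"D"` here, comes back as `NA`, whereas the `for`-loop version leaves such elements unchanged.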

All three solutions yield the same result:

```
all.equal(dum_if_for, dum_switch_sapply, check.attributes = FALSE)
# TRUE
all.equal(dum_vectorized, dum_switch_sapply, check.attributes = FALSE)
# TRUE
```
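The `check.attributes = FALSE` argument matters because both `sapply` on a character vector and subsetting a named vector return a result that carries names, while the `for`-loop result does not. A small sketch on three elements (the variable names `plain` and `named` are only for illustration):

```r
ids = c("A", "B", "C")
translator_vector = c(A = 'Text for A',
                      B = 'Text for B',
                      C = 'Text for C')

plain = c('Text for A', 'Text for B', 'Text for C')  # like the for-loop result
named = translator_vector[ids]                        # carries the names A, B, C

# With attributes checked, the comparison fails on the names alone
equal_with_attributes    = isTRUE(all.equal(plain, named))
equal_without_attributes = isTRUE(all.equal(plain, named, check.attributes = FALSE))

# Dropping the names makes the two vectors identical
same_after_unname = identical(plain, unname(named))
```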

But how long do they take? To find out, we benchmark the three solutions:

```
library(rbenchmark)

res = benchmark(if_for_solution   = translator_if_for(df$ID),
                function_solution = sapply(df$ID, translator_function),
                vector_solution   = translator_vector[df$ID],
                replications = 10)
res
#                test replications elapsed relative user.self sys.self
# 2 function_solution           10 281.326   79.158   276.235    5.193
# 1   if_for_solution           10 254.751   71.680   253.358    1.484
# 3   vector_solution           10   3.554    1.000     3.052    0.504
```

The benchmark clearly shows that the performance of the vectorised solution is vastly superior to the other two, on the order of 70 to 80 times faster. In addition, the `sapply`-based solution is not faster than the `for`-loop based solution at all: in this benchmark it is about 1.10 times slower. The take home message: `apply`-loops are not inherently faster, and vectorisation is your friend!

PS: In this case, making the character vector a `factor` and simply replacing the `levels` is probably much faster still than the vectorised substitution. However, the point of the post was to compare different coding styles, and this problem was just a convenient example.
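For completeness, a minimal sketch of that `factor` idea on a small stand-in vector. I have not benchmarked it here, and whether it wins overall will depend on the cost of the conversion to `factor` itself, so treat it as an illustration under those assumptions:

```r
ids = rep(LETTERS[1:3], each = 4)   # small stand-in for df$ID

translator_vector = c(A = 'Text for A',
                      B = 'Text for B',
                      C = 'Text for C')

# Convert once, then rename the three levels instead of touching every element
ids_factor = factor(ids)
levels(ids_factor) = unname(translator_vector[levels(ids_factor)])

# Back to a character vector, if that is what the rest of the code expects
result = as.character(ids_factor)
```

The replacement now operates on three levels rather than on every element, which is where the expected gain comes from.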
