Vectorisation is your best friend: replacing many elements in a character vector

[This article was first published on NumberTheory » R stuff, and kindly contributed to R-bloggers.]

As with any programming language, R allows you to tackle the same problem in many different ways or styles. These styles differ in the amount of code, in readability, and in speed. In this post I want to illustrate this by tackling the following problem. We have a data.frame that contains an ID character column:

n = 9e6 
df = data.frame(values = rnorm(n), 
                ID = rep(LETTERS[1:3], each = n/3),
                stringsAsFactors = FALSE)
> head(df)
      values ID
1 -0.7355823  A
2 -0.4729925  A
3 -0.7417259  A
4  1.7633367  A
5 -0.3006790  A
6  0.6785947  A

We want to replace all occurrences of A by 'Text for A', and the same for B and C. One approach is to use a combination of a for-loop and some if statements, in a style that looks more like C:

translator_if_for = function(input_vector) {
    output_vector = input_vector
    for(index in seq_along(input_vector)) {
        if(input_vector[index] == 'A') {
            output_vector[index] = 'Text for A'
        } else if(input_vector[index] == 'B') {
            output_vector[index] = 'Text for B'
        } else if(input_vector[index] == 'C') {
            output_vector[index] = 'Text for C'
        }   
    }   
    return(output_vector)
}
dum_if_for = translator_if_for(df$ID)

This kind of imperative programming style is not typically R-like. The first response of an R-aficionado is to suggest using an apply loop. First we construct a helper function:

translator_function = function(element) {
    switch(element,
           A = 'Text for A',
           B = 'Text for B',
           C = 'Text for C')
}

which uses switch instead of the chain of if/else statements. Next we use sapply to call the helper function on each of the elements in df$ID:

dum_switch_sapply = sapply(df$ID, translator_function)

The advantage here is that we use roughly half the amount of code to express the same functionality, and I find the code more readable (its purpose is visible at a glance). Readability, however, is in the eye of the beholder, and some people used to non-functional programming languages might prefer the more explicit for-loop and if statements.
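One detail of the switch/sapply combination is worth a small illustration. The sketch below is not part of the original benchmark: the input `ids`, the variant `translator_with_default`, and the choice to keep unmatched values unchanged are all assumptions made for the example. It shows what happens when an ID has no translation, and one possible way to guard against that with vapply.

```r
# Helper as defined in the post
translator_function = function(element) {
    switch(element,
           A = 'Text for A',
           B = 'Text for B',
           C = 'Text for C')
}

# Hypothetical input: 'D' has no translation
ids = c('A', 'B', 'D')

# switch() returns NULL when nothing matches, so sapply() cannot
# simplify the results to a character vector and returns a list instead.
with_gap = sapply(ids, translator_function)
is.list(with_gap)

# Hypothetical variant: a final unnamed argument to switch() acts as the
# default; here it keeps the original value unchanged.
translator_with_default = function(element) {
    switch(element,
           A = 'Text for A',
           B = 'Text for B',
           C = 'Text for C',
           element)
}

# vapply() checks that every call returns exactly one string, and
# USE.NAMES = FALSE drops the names that sapply() would attach.
safe_result = vapply(ids, translator_with_default, character(1),
                     USE.NAMES = FALSE)
safe_result
```

Whether an unmatched ID should be kept, turned into NA, or treated as an error depends on the data; the default used here is only one option.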

Of course, R also supports vectorisation, which is of particular interest if you care about performance. For a vectorised solution, we first create a lookup vector:

translator_vector = c(A = 'Text for A',
                      B = 'Text for B',
                      C = 'Text for C')

and subset this vector using df$ID:

dum_vectorized = translator_vector[df$ID]

I encourage you to spend a little time figuring out what this subsetting does, as I think it is quite a nice trick: indexing a named vector with a character vector looks up each element by name. The code of this final solution is even shorter, although it does take some careful consideration on the part of the reader to understand what is happening. Careful naming of variables, or encapsulation in a function, can solve this issue.
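For readers who want to see the lookup at work on a small scale, here is a minimal sketch. The vectors `ids` and `ids_with_gap` are made up for illustration and are not part of the original data.

```r
# Lookup vector as defined in the post
translator_vector = c(A = 'Text for A',
                      B = 'Text for B',
                      C = 'Text for C')

# Hypothetical small input
ids = c('B', 'A', 'C', 'A')

# Each element of ids is used as a name to pick a value from
# translator_vector, so the result has the same length and order as ids.
translated = translator_vector[ids]

# The result carries the lookup keys as names; unname() drops them.
translated_plain = unname(translated)
translated_plain

# A key that is not in the lookup vector yields NA rather than an error.
ids_with_gap = c('A', 'D')
unname(translator_vector[ids_with_gap])
```

The names on the result are also why the comparisons between the solutions use check.attributes = FALSE.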

All three solutions yield the same result:

all.equal(dum_if_for, dum_switch_sapply, check.attributes = FALSE)
# TRUE
all.equal(dum_vectorized, dum_switch_sapply, check.attributes = FALSE)
# TRUE

But how long do they take? To find out, we benchmark the three solutions:

library(rbenchmark)

res = benchmark(if_for_solution   = translator_if_for(df$ID),
                function_solution = sapply(df$ID, translator_function),
                vector_solution   = translator_vector[df$ID],
                replications = 10)
res
               test replications elapsed relative user.self sys.self 
2 function_solution           10 281.326   79.158   276.235    5.193 
1   if_for_solution           10 254.751   71.680   253.358    1.484 
3   vector_solution           10   3.554    1.000     3.052    0.504

The benchmark clearly shows that the vectorised solution is vastly faster than the other two, on the order of 70 to 80 times. In addition, the sapply-based solution is no faster than the for-loop based solution; in this run it is actually about a factor 1.10 slower (281 versus 255 seconds elapsed). The take-home message: apply loops are not inherently faster, and vectorisation is your friend!

PS: In this case, converting the character vector to a factor and simply replacing the levels is probably much faster still than the vectorised substitution. However, the point of the post was to compare different coding styles, and this problem was just a convenient example.
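As a rough sketch of the factor idea, assuming the same lookup vector as in the post and a small made-up input `ids`, the replacement could look like this. It has not been benchmarked here, and the conversion to a factor has a cost of its own, so the speed-up remains an expectation rather than a measured result.

```r
# Lookup vector as defined in the post
translator_vector = c(A = 'Text for A',
                      B = 'Text for B',
                      C = 'Text for C')

# Hypothetical small input
ids = rep(LETTERS[1:3], each = 4)

# Convert to a factor; the levels are the unique IDs: A, B, C.
ids_factor = factor(ids)

# Replace only the levels (three strings) instead of every element.
levels(ids_factor) = unname(translator_vector[levels(ids_factor)])

# Convert back to a character vector if a factor is not wanted.
translated_via_factor = as.character(ids_factor)
head(translated_via_factor)
```

If df$ID were stored as a factor from the start, only the levels assignment would be needed.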
