*This article was first published on **R Programming – Thomas Bryce Kelly** and kindly contributed to R-bloggers.*

*This post is an elaboration of an iRKernel Notebook which can be found at http://misc.tkelly.org/Which_Witch.html. For a previous article on the iPython Notebook, see here.*

Every programming language comes pre-packaged with certain basic functions, or *methods*, that are considered standard. These generally include methods for sorting objects, conversions between basic types (e.g. integer, float, string), and so on, and are collectively referred to as the *Standard Library*. While functions in the standard library of any widely-used language can be expected to be reliable and robust, it isn't hard to imagine one of two things:

- A. Optimization is not always a priority when reliability is a must.

- B. If you make something general-purpose, it's easy to sacrifice the bit of performance that comes from specificity and assumptions.

While (A) is more of a gut feeling, (B) is almost always true, since any function that accepts a wide variety of inputs (or has to deal with **potential** variability) must branch and run checks. The same function without those checks or error-catching should, in theory, always be faster. As we all know, though, theory doesn't always hold up in the real world.
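To make (B) concrete, here is a hypothetical sketch (the function names and checks are mine, not from the post): a defensive version of a trivial computation next to a bare one that assumes clean numeric input.

```r
# Hypothetical example: a defensive mean with input checks (branches)
# versus a bare version that assumes clean numeric input.
safe_mean <- function(x) {
  if (!is.numeric(x)) stop("x must be numeric")  # type check -> branch
  if (anyNA(x)) x <- x[!is.na(x)]                # NA handling -> branch
  sum(x) / length(x)
}

bare_mean <- function(x) sum(x) / length(x)      # no checks, no branches

safe_mean(c(1, 2, 3))  # 2
bare_mean(c(1, 2, 3))  # 2
```

On clean input both return the same answer; the general-purpose version simply pays for the checks on every call.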

My goal here was to take the source code for a simple {base} function in R and try to make it faster. First let's take a look at a section of the source in the aptly named *which.R*. The which() function takes a logical statement involving a vector (or matrix^{(1)}) and returns the indices of the elements that qualify. So this is a long-winded way of saying:

```r
x = c(1:6)^2
y = which(x %% 2 == 0)
print(x)  # 1 4 9 16 25 36
print(y)  # 2 4 6
```

On to the *which.r* file.

```r
#  File src/library/base/R/which.R
#  Part of the R package, http://www.R-project.org
#
#  Copyright (C) 1995-2013 The R Core Team
#
#  This program is free software; you can redistribute it and/or modify
#  it under the terms of the GNU General Public License as published by
#  the Free Software Foundation; either version 2 of the License, or
#  (at your option) any later version.
#
#  This program is distributed in the hope that it will be useful,
#  but WITHOUT ANY WARRANTY; without even the implied warranty of
#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#  GNU General Public License for more details.
#
#  A copy of the GNU General Public License is available at
#  http://www.r-project.org/Licenses/

which <- function(x, arr.ind = FALSE, useNames = TRUE)
{
    wh <- .Internal(which(x))
    if (arr.ind && !is.null(d <- dim(x)))
        arrayInd(wh, d, dimnames(x), useNames = useNames)
    else wh
}
```

Ignoring all the comments, there is not a whole lot to the function. After computing the result via *.Internal*, it branches depending on whether you want array (row/column) indices or just the plain linear index. Since I'm not sure where to find the source for *.Internal* functions, my hands are tied for the time being.

For the first attempt, I decided to write the function from scratch using basic R commands that I know. Without being able to see the code for the internal which() function, this test would at least show me if I can even get close on my own.

```r
which2 = function(x) {
    c(1:length(x))[x < 0]  # note: the condition x < 0 is hard-coded here
}
```

Nothing magical here: the function builds a logical vector and returns the indices of the TRUE elements (with the test condition baked into the function itself). The other way to speed things up is to make a streamlined version of the built-in one.

```r
which3 <- function(x) {
    .Internal(which(x))
}
```
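Before timing anything, it is worth a quick sanity check that the three versions agree (the `which2`/`which3` definitions are repeated here so the snippet stands alone):

```r
# which2 bakes the condition x < 0 in; which3 is the stripped-down built-in.
which2 = function(x) { c(1:length(x))[x < 0] }
which3 <- function(x) { .Internal(which(x)) }

y <- sin(1:100) + 0.5
stopifnot(identical(which(y < 0), which2(y)))    # same indices
stopifnot(identical(which(y < 0), which3(y < 0)))
```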

Notice the *if* statement in the original; that is called a 'branch', and it is a sure-fire way to slow down the execution of a function. Every time there's a branch, the processor has to wait until all the cards are dealt before moving on. Take, for example, a statement like:

*if (dice roll == 2) then …*

In this case the computer has to wait for the dice roll before continuing.^{(2)} If the branch is removed, the processor can simply move forward.
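As a loose illustration (my own example, not from the post), compare a per-element loop full of branches with a branch-free arithmetic rewrite of the same clipping operation:

```r
# Hypothetical example: clip negative values to zero, two ways.
clip_branch <- function(x) {
  out <- numeric(length(x))
  for (i in seq_along(x)) {
    if (x[i] < 0) out[i] <- 0 else out[i] <- x[i]  # one branch per element
  }
  out
}

clip_nobranch <- function(x) x * (x > 0)  # pure arithmetic, no explicit branch

v <- c(-1, 2, -3, 4)
clip_branch(v)    # 0 2 0 4
clip_nobranch(v)  # 0 2 0 4
```

Both give the same answer; the second simply never asks the processor a question it has to wait on.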

So how do these three different functions compare? Using *microbenchmark* I tested the execution time for all three functions.

```r
library(microbenchmark)

x = c(1:100)
y = sin(x) + 0.5

results = microbenchmark(which(y < 0), which2(y), which3(y < 0))

plot(results, main = "Comparison of which() functions",
     ylab = "Time", xlab = "Function", ylim = c(0, 6000))

avg = summary(results)$median
text(x = 1, y = 0, paste0(format(avg[1]/avg[1], digits = 2), "x"))
text(x = 2, y = 0, paste0(format(avg[2]/avg[1], digits = 2), "x"))
text(x = 3, y = 0, paste0(format(avg[3]/avg[1], digits = 2), "x"))
```

Not surprisingly, the streamlined function is quicker by a small yet measurable amount. Also not surprisingly, the function I wrote is not nearly as fast. Without seeing the voodoo of the *.Internal* function I can only speculate, but I would bet that it is implemented in C rather than R.

Hope this was helpful and perhaps even an inspiration for your own investigations.

Notes

1. I'm not interested in using it for matrices.

2. This is also a good example of a different and more complex notion called branch prediction, but I'll leave that for another time.
