R tips and tricks – get the gist

[This article was first published on R – Eran Raviv, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In scientific programming speed is important. Functions written for general public use have a lot of control-flow checks which are not necessary if you are confident enough with your code.To quicken your code execution I suggest to strip run-of-the-mill functions to their bare bones. You can save serious wall-clock time by using only the laborers code lines. Below is a walk-through example of what I mean.

I use the quantile function for the example. There are many ways to compute the estimate of a quantile, and all those various ways are coded into the one quantile function. The function has the default argument type = 7 which indicates the particular way we wish to estimate our quantiles. Given that R is an open-source language you can easily find the code for any function, then you can “fish out” only the lines that you actually need. While the code for the quantile function is around 90 lines (given below), the real labor is carried out mainly by lines 49 to 58 – the main workhorse (for the type=7 default).

Now, let’s write our own version of the quantile function; call it lean_quantile. Then we make sure our lean_quantile does what its meant to do, and compare the execution time.

lean_quantile <- function(x, probs = seq(0, 1, 0.25)) {
  n <- length(x)
  np <- length(probs)
  index <- 1 + (n - 1) * probs
  lo <- floor(index)
  hi <- ceiling(index)
  x <- sort(x, partial = unique(c(lo, hi)))
  qs <- x[lo]
  i <- which(index > lo)
  h <- (index - lo)[i]
  qs[i] <- (1 - h) * qs[i] + h * x[hi[i]]
  qs
}

Check that our lean_quantile does what its meant to do:

tmpp <- rnorm(10)
all( quantile(tmpp) == lean_quantile(tmpp) )
[1] TRUE

Now we can compare the execution time (more on timing and profiling code):

library(microbenchmark)
# citation("microbenchmark")
bench <- microbenchmark(quantile(rnorm(10)), 
                        lean_quantile(rnorm(10)), 
                        times=10^4)
bench
# Unit: microseconds
#                     expr  min   lq mean median   uq  max neval
#      quantile(rnorm(10)) 79.1 84.3 96.3   86.1 89.0 3907 10000
# lean_quantile(rnorm(10)) 27.9 31.6 36.2   33.4 34.8 4741 10000

Execution time is reduced by over 60%. Also, we did not have to work very hard for it. We can do more, diving further and improve the sort function which our lean_quantile uses, but you get the idea.

Is it a free lunch? Of course not.

It takes long to master efficient programming, and the functions you find in the public domain are probably well scrutinized – before and after they go up there. When you mingle with the internals you risk making a mistake, erasing an important line or creating unintended consequences and messing up the original behavior. So meticulous checks are good to do.

While some functions are written so efficiently that you will find very little value in pulling out just the workhorse, with most functions written for the general public you will certainly be able to squeeze out some time-profit. As you can see this “get the gist” tip has excellent potential to save a lot of waiting time.

#  Part of the R package, https://www.R-project.org
#
#  Copyright (C) 1995-2014 The R Core Team
#
#  This program is free software; you can redistribute it and/or modify
#  it under the terms of the GNU General Public License as published by
#  the Free Software Foundation; either version 2 of the License, or
#  (at your option) any later version.
#
#  This program is distributed in the hope that it will be useful,
#  but WITHOUT ANY WARRANTY; without even the implied warranty of
#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#  GNU General Public License for more details.
#
#  A copy of the GNU General Public License is available at
#  https://www.R-project.org/Licenses/

quantile <- function(x, ...) UseMethod("quantile")

quantile.POSIXt <- function(x, ...)
    .POSIXct(quantile(unclass(as.POSIXct(x)), ...), attr(x, "tzone"))

quantile.default <-
    function(x, probs = seq(0, 1, 0.25), na.rm = FALSE, names = TRUE,
             type = 7, ...)
{
    if(is.factor(x)) {
        if(!is.ordered(x) || ! type %in% c(1L, 3L))
            stop("factors are not allowed")
        lx <- levels(x)
    } else lx <- NULL
    if (na.rm)
	x <- x[!is.na(x)]
    else if (anyNA(x))
	stop("missing values and NaN's not allowed if 'na.rm' is FALSE")
    eps <- 100*.Machine$double.eps
    if (any((p.ok <- !is.na(probs)) & (probs < -eps | probs > 1+eps)))
	stop("'probs' outside [0,1]")
    n <- length(x)
    if(na.p <- any(!p.ok)) { # set aside NA & NaN
        o.pr <- probs
        probs <- probs[p.ok]
        probs <- pmax(0, pmin(1, probs)) # allow for slight overshoot
    }
    np <- length(probs)
    if (n > 0 && np > 0) {
        if(type == 7) { # be completely back-compatible
            index <- 1 + (n - 1) * probs
            lo <- floor(index)
            hi <- ceiling(index)
            x <- sort(x, partial = unique(c(lo, hi)))
            qs <- x[lo]
	    i <- which(index > lo)
	    h <- (index - lo)[i] # > 0	by construction
##	    qs[i] <- qs[i] + .minus(x[hi[i]], x[lo[i]]) * (index[i] - lo[i])
##	    qs[i] <- ifelse(h == 0, qs[i], (1 - h) * qs[i] + h * x[hi[i]])
	    qs[i] <- (1 - h) * qs[i] + h * x[hi[i]]
        } else {
            if (type  j),		# type 1
			    ((nppm > j) + 1)/2, # type 2
			    (nppm != j) | ((j %% 2L) == 1L)) # type 3
            } else {
                ## Types 4 through 9 are continuous sample qs.
                switch(type - 3,
                   {a <- 0; b <- 1},    # type 4
                       a <- b <- 0.5,   # type 5
                       a <- b <- 0,     # type 6
                       a <- b <- 1,     # type 7 (unused here)
                       a <- b <- 1 / 3, # type 8
                       a <- b <- 3 / 8) # type 9
                ## need to watch for rounding errors here
                fuzz <- 4 * .Machine$double.eps
                nppm <- a + probs * (n + 1 - a - b) # n*probs + m
                j <- floor(nppm + fuzz) # m = a + probs*(1 - a - b)
                h <- nppm - j
                if(any(sml <- abs(h) < fuzz)) h[sml] <- 0
            }
            x <- sort(x, partial =
                      unique(c(1, j[j>0L & j0L & j 0L) {
	names(qs) <- format_perc(probs)
    }
    if(na.p) { # do this more elegantly (?!)
        o.pr[p.ok] <- qs
        names(o.pr) <- rep("", length(o.pr)) # suppress  names
        names(o.pr)[p.ok] <- names(qs)
        o.pr
    } else qs
}

Footnotes

As a side note, would be nice to do that in Python also, but the source code for the numpy quantile function is heavily “decorated”. Comment if you know how to create the Python counterpart.

Coding All-in-One For Dummies

To leave a comment for the author, please follow the link and comment on their blog: R – Eran Raviv.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)