Stop! (In the name of a sensible interface)

[This article was first published on 4D Pie Charts » R, and kindly contributed to R-bloggers.]

In my last post I talked about using the number of lines in a function as a guide to whether you need to break it down into smaller pieces. There are many other useful metrics for the complexity of a function, most notably cyclomatic complexity, which counts the number of independent paths that code can take. Calculating such a measure is non-trivial, and it seems that nothing is currently available to calculate it for R functions. (The internet is currently on the case.) For now, we’ll use an easier, simpler proxy for the complexity of a function: how many times if, ifelse or switch is called.
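To make this concrete, here’s a contrived toy function (invented for illustration) that contains one of each branching construct, so our crude measure would flag three branch points:

```r
classify_number <- function(x) {
  if (!is.numeric(x)) {                                     # branch 1
    return(NA_character_)
  }
  sign_label <- ifelse(x < 0, "negative", "non-negative")   # branch 2
  parity <- switch(as.character(x %% 2), "0" = "even", "odd")  # branch 3
  paste(sign_label, parity)
}

classify_number(4)   # "non-negative even"
classify_number(-3)  # "negative odd"
```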

Let’s take a look at how complex the contents of base R are. First, as in the previous post, we need to retrieve all the functions. Since I seem to be trying to do this regularly, I’m wrapping the code into a function.

get_all_fns <- function(pattern = ".+") {
  fn_names <- apropos(pattern)
  fns <- lapply(fn_names, get)
  names(fns) <- fn_names
  Filter(is.function, fns)
}
fns <- get_all_fns()

As before, we use deparse to turn the function’s body into an array of strings to examine. This time, we are looking for calls to if, ifelse or switch.

get_complexity <- function(fn) {
  body_lines <- deparse(body(fn))
  flow <- c("if", "ifelse", "switch")
  rx <- paste(flow, " *\\(", collapse = "|", sep = "")
  length(body_lines[grepl(rx, body_lines)])
}
complexity <- sapply(fns, get_complexity)
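As a quick sanity check, we can apply the measure to a made-up toy function (the helper is repeated here so the snippet runs on its own):

```r
get_complexity <- function(fn) {
  body_lines <- deparse(body(fn))
  flow <- c("if", "ifelse", "switch")
  rx <- paste(flow, " *\\(", collapse = "|", sep = "")
  length(body_lines[grepl(rx, body_lines)])
}

toy <- function(x) {
  if (x > 0) x <- sqrt(x)
  ifelse(x > 1, "big", "small")
}

get_complexity(toy)  # 2: one deparsed line matches if, one matches ifelse
```

Note that this counts matching lines rather than calls, so two branches on one line count once; it’s a rough heuristic, not a precise parse.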

Let’s take a look at the distribution of this complexity measure.

library(ggplot2)
hist_complexity <- ggplot(data.frame(complexity = complexity), aes(complexity)) +
  geom_histogram(binwidth = 3)
hist_complexity

Histogram of complexity, by number of calls to if, ifelse or switch

Zero cases is the most common, which is nice to see, but we have some serial offenders over on the right hand side of the plot. Let’s see who the culprits are.

head(sort(complexity, decreasing=TRUE))
       library          arima       read.DIF         coplot [<
            84             81             71             66             65             63

Hmm, it's the same set of functions as in the monster-function list from before. This is to be expected in some ways, though it would be nicer to have another measure for picking out dubious functions. One that springs to mind is the number of exceptions a function can throw. This is quite a subtle measure to read, since in general code should "fail early and fail often": you want lots of exceptions to catch any problems, and you want them thrown as soon as possible, so you don't waste time calculating things that were going to fail anyway. So more possible exceptions is generally better, except that if very many things can go wrong, your function is probably too complicated.
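As an aside, "fail early" in R typically means validating inputs with stop or stopifnot at the top of a function, before any real work happens. A minimal made-up example:

```r
weighted_total <- function(x, weights) {
  # Check the inputs up front, before doing any computation.
  stopifnot(is.numeric(x), is.numeric(weights))
  if (length(x) != length(weights)) {
    stop("x and weights must have the same length")
  }
  sum(x * weights)
}

weighted_total(1:3, c(0.5, 0.25, 0.25))  # 1.75
```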

Finding the number of possible exceptions works exactly the same as our previous example, only this time we look for calls to stop and stopifnot.

get_n_exceptions <- function(fn) {
  body_lines <- deparse(body(fn))
  exceptions <- c("stop", "stopifnot")
  rx <- paste(exceptions, " *\\(", collapse = "|", sep = "")
  length(body_lines[grepl(rx, body_lines)])
}
n_exceptions <- sapply(fns, get_n_exceptions)
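And a sanity check against a toy function with two exception sites (again, the helper is repeated so the snippet is self-contained):

```r
get_n_exceptions <- function(fn) {
  body_lines <- deparse(body(fn))
  exceptions <- c("stop", "stopifnot")
  rx <- paste(exceptions, " *\\(", collapse = "|", sep = "")
  length(body_lines[grepl(rx, body_lines)])
}

checked_sqrt <- function(x) {
  stopifnot(is.numeric(x))
  if (any(x < 0)) stop("x must be non-negative")
  sqrt(x)
}

get_n_exceptions(checked_sqrt)  # 2
```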

Once again we examine the distribution …

hist_exceptions <- ggplot(data.frame(n_exceptions = n_exceptions), aes(n_exceptions)) +
  geom_histogram(binwidth = 1)
hist_exceptions

Histogram of number of exceptions

and it seems that most functions contain no exception-throwing code at all. This is acceptable for non-user-facing functions, since user input is the biggest cause of problems.

head(sort(n_exceptions, decreasing=TRUE))
      read.DIF        library [<          arima         arima0
            17             16             15             14             13             13

The function with the most potential exceptions to throw is read.DIF. File handling is notoriously problematic, so that’s fair enough. The survival package provides a better example: its Surv function lets you define a censored vector, and it has an interface that’s either really clever or stupidly complicated. Because you can specify the censoring in many different ways, the error checking gets rather complicated, and it ends up requiring 20 calls to stop to prevent disaster.

So when you are writing a function and you see the 20th call to stop, that’s a hint that you may need to stop (if you want a sensible interface).

Tagged: complexity, programming-technique, r
