Stop! (In the name of a sensible interface)

[This article was first published on 4D Pie Charts » R, and kindly contributed to R-bloggers.]

In my last post I talked about using the number of lines in a function as a guide to whether you need to break it down into smaller pieces. There are many other useful metrics for the complexity of a function, most notably cyclomatic complexity, which counts the number of independent paths that code can take. Calculating such a measure is non-trivial, and it seems that nothing is currently available to calculate it for R functions. (The internet is currently on the case.) For now, we’ll use an easier, simpler proxy for the complexity of a function: how many times if, ifelse or switch is called.
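To make this concrete, here’s a contrived toy function (invented for illustration) that contains one of each branching construct, so our crude measure would flag three branch points:

```r
classify_number <- function(x) {
  if (!is.numeric(x)) {                                     # branch 1
    return(NA_character_)
  }
  sign_label <- ifelse(x < 0, "negative", "non-negative")   # branch 2
  parity <- switch(as.character(x %% 2), "0" = "even", "odd")  # branch 3
  paste(sign_label, parity)
}

classify_number(4)   # "non-negative even"
classify_number(-3)  # "negative odd"
```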

Let’s take a look at how complex the contents of base R are. First, as in the previous post, we need to retrieve all the functions. Since I seem to be trying to do this regularly, I’m wrapping the code into a function.

get_all_fns <- function(pattern = ".+") {
  fn_names <- apropos(pattern)
  fns <- lapply(fn_names, get)
  names(fns) <- fn_names
  Filter(is.function, fns)
}
fns <- get_all_fns()

As before, we use deparse to turn the function’s body into an array of strings to examine. This time, we are looking for calls to if, ifelse or switch.

get_complexity <- function(fn) {
  body_lines <- deparse(body(fn))
  flow <- c("if", "ifelse", "switch")
  rx <- paste(flow, " *\\(", collapse = "|", sep = "")
  length(body_lines[grepl(rx, body_lines)])
}
complexity <- sapply(fns, get_complexity)
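As a quick sanity check, we can apply the measure to a made-up toy function (the helper is repeated here so the snippet runs on its own):

```r
get_complexity <- function(fn) {
  body_lines <- deparse(body(fn))
  flow <- c("if", "ifelse", "switch")
  rx <- paste(flow, " *\\(", collapse = "|", sep = "")
  length(body_lines[grepl(rx, body_lines)])
}

toy <- function(x) {
  if (x > 0) x <- sqrt(x)
  ifelse(x > 1, "big", "small")
}

get_complexity(toy)  # 2: one deparsed line matches if, one matches ifelse
```

Note that this counts matching lines rather than calls, so two branches on one line count once; it’s a rough heuristic, not a precise parse.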

Let’s take a look at the distribution of this complexity measure.

library(ggplot2)
hist_complexity <- ggplot(data.frame(complexity = complexity), aes(complexity)) +
  geom_histogram(binwidth = 3)
hist_complexity

Histogram of complexity, by number of calls to if, ifelse or switch

Zero cases is the most common, which is nice to see, but we have some serial offenders over on the right hand side of the plot. Let’s see who the culprits are.

head(sort(complexity, decreasing=TRUE))
       library          arima       read.DIF         coplot [<
            84             81             71             66             65             63

Hmm, it's the same set of functions as in the monster-function list from before. This is to be expected in some ways, though it would be nicer to have another measure for picking out dubious functions. One that springs to mind is the number of exceptions a function can throw. This is quite a subtle measure to read, since in general code should "fail early and fail often": you want lots of exceptions to catch any problems, and you want them thrown as soon as possible, so you don't waste time calculating things that were going to fail anyway. So more possible exceptions is generally better, except that if very many things can go wrong, your function is probably too complicated.
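As an aside, "fail early" in R typically means validating inputs with stop or stopifnot at the top of a function, before any real work happens. A minimal made-up example:

```r
weighted_total <- function(x, weights) {
  # Check the inputs up front, before doing any computation.
  stopifnot(is.numeric(x), is.numeric(weights))
  if (length(x) != length(weights)) {
    stop("x and weights must have the same length")
  }
  sum(x * weights)
}

weighted_total(1:3, c(0.5, 0.25, 0.25))  # 1.75
```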

Finding the number of possible exceptions works exactly the same as our previous example, only this time we look for calls to stop and stopifnot.

get_n_exceptions <- function(fn) {
  body_lines <- deparse(body(fn))
  exceptions <- c("stop", "stopifnot")
  rx <- paste(exceptions, " *\\(", collapse = "|", sep = "")
  length(body_lines[grepl(rx, body_lines)])
}
n_exceptions <- sapply(fns, get_n_exceptions)
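And a sanity check against a toy function with two exception sites (again, the helper is repeated so the snippet is self-contained):

```r
get_n_exceptions <- function(fn) {
  body_lines <- deparse(body(fn))
  exceptions <- c("stop", "stopifnot")
  rx <- paste(exceptions, " *\\(", collapse = "|", sep = "")
  length(body_lines[grepl(rx, body_lines)])
}

checked_sqrt <- function(x) {
  stopifnot(is.numeric(x))
  if (any(x < 0)) stop("x must be non-negative")
  sqrt(x)
}

get_n_exceptions(checked_sqrt)  # 2
```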

Once again we examine the distribution …

hist_exceptions <- ggplot(data.frame(n_exceptions = n_exceptions), aes(n_exceptions)) +
  geom_histogram(binwidth = 1)
hist_exceptions

Histogram of number of exceptions

and it seems that most functions contain no exception-throwing code at all. This is acceptable for non-user-facing functions, since user input is the biggest cause of problems.

head(sort(n_exceptions, decreasing=TRUE))
      read.DIF        library [<          arima         arima0
            17             16             15             14             13             13

The function with the most potential exceptions to throw is read.DIF. File handling is notoriously problematic, so that’s fair enough. The survival package provides a better example: its Surv function lets you define a censored vector, and it has an interface that’s either really clever or stupidly complicated. Because you can specify the censoring in many different ways, the error checking gets rather complicated, and it ends up requiring 20 calls to stop to prevent disaster.

So when you are writing a function and you see the 20th call to stop, that’s a hint that you may need to stop (if you want a sensible interface).

Tagged: complexity, programming-technique, r
