
In my last post I talked about using the number of lines in a function as a guide to whether you need to break it down into smaller pieces. There are many other useful metrics for the complexity of a function, most notably cyclomatic complexity, which tracks the number of different routes that code can take. It’s non-trivial to calculate such a measure, and it seems that there is nothing currently available to calculate it for R functions. (The internet is currently on the case.) For now, we’ll use an easier, simpler measure of the complexity of a function: how many lines of its body contain a call to `if`, `ifelse` or `switch`.

Let’s take a look at how complex the contents of base R are. First, as in the previous post, we need to retrieve all the functions. Since I seem to be trying to do this regularly, I’m wrapping the code into a function.

```r
# Collect every function on the search path whose name matches `pattern`.
get_all_fns <- function(pattern = ".+")
{
  fn_names <- apropos(pattern)
  fns <- lapply(fn_names, get)
  names(fns) <- fn_names
  Filter(is.function, fns)
}
fns <- get_all_fns()
```
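Because `apropos` searches the whole search path, the result depends on which packages are attached. To survey one package at a time, a variant along these lines might help. This is only a sketch: `get_pkg_fns` is a hypothetical helper that is not part of the original code, and it assumes the package is already attached.

```r
# Hypothetical helper: collect the functions exported by one attached package.
get_pkg_fns <- function(pkg = "base")
{
  env <- as.environment(paste0("package:", pkg))
  fn_names <- ls(env)
  fns <- mget(fn_names, envir = env)
  Filter(is.function, fns)
}

# stats is attached by default in a standard R session
stats_fns <- get_pkg_fns("stats")
length(stats_fns)
```

The returned list has the same shape as the output of `get_all_fns`, so it can be passed to the same `sapply` calls.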

As before, we use `deparse` to turn the function’s body into an array of strings to examine. This time, we are looking for calls to `if`, `ifelse` or `switch`.

```r
# Count the lines of a function's body that contain a call to
# `if`, `ifelse` or `switch`.
get_complexity <- function(fn)
{
  body_lines <- deparse(body(fn))
  flow <- c("if", "ifelse", "switch")
  # Regex: a keyword, optional spaces, then an opening parenthesis
  rx <- paste(flow, " *\\(", collapse = "|", sep = "")
  sum(grepl(rx, body_lines))
}
complexity <- sapply(fns, get_complexity)
```
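Note that `get_complexity` counts matching lines of deparsed text, so two calls on one line count once. If you would rather count the calls themselves, one possible alternative is to inspect the parsed body with `all.names`, which lists every symbol once per occurrence. This is a sketch, not the method used for the results in this post; `count_flow_calls` is a hypothetical name, and a variable that happened to be called `switch` or `ifelse` would be counted too.

```r
# Hypothetical alternative: count symbol occurrences in the parsed body
# instead of matching lines of deparsed text.
count_flow_calls <- function(fn, targets = c("if", "ifelse", "switch"))
{
  # all.names() returns every name in the expression, with repeats
  nms <- all.names(body(fn))
  sum(nms %in% targets)
}

# A small made-up function with two `if` calls and one `ifelse` call
sign_label <- function(x)
{
  if (x > 0) "pos" else if (x < 0) "neg" else ifelse(is.na(x), NA, "zero")
}
count_flow_calls(sign_label)
```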

Let’s take a look at the distribution of this complexity measure.

```r
library(ggplot2)
hist_complexity <- ggplot(data.frame(complexity = complexity), aes(complexity)) +
  geom_histogram(binwidth = 3)
hist_complexity
```

Zero is the most common value, which is nice to see, but we have some serial offenders over on the right-hand side of the plot. Let’s see who the culprits are.

```
head(sort(complexity, decreasing = TRUE))
       library          arima    help.search       read.DIF         coplot [<-.data.frame
            84             81             71             66             65             63
```

Hmm, it's the same set of functions from the monster-function list before. This is to be expected in some ways, though it would be nicer if we had another measure to pick out dubious functions. One such measure that springs to mind is the number of exceptions that can be thrown. This is quite a subtle measure to read, since in general, code should "fail early and fail often". That is, you want lots of exceptions to catch any problems, and you want them to be thrown as soon as possible, so you don't waste time calculating things that were going to fail anyway. So more possible exceptions is generally better, but only up to a point: if a great many things can go wrong, that suggests the function is too complicated.
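As a small illustration of failing early, here is a made-up function that checks its input before doing any real work. The name `safe_log_mean` and its checks are purely hypothetical, chosen only to show the pattern.

```r
# Hypothetical example: validate the input first, then do the calculation.
safe_log_mean <- function(x)
{
  if (!is.numeric(x)) stop("x must be numeric")
  if (length(x) == 0) stop("x must not be empty")
  if (any(x <= 0, na.rm = TRUE)) stop("x must be strictly positive")
  mean(log(x), na.rm = TRUE)
}

safe_log_mean(c(1, 10, 100))

# Bad input is rejected by the first check, before any calculation happens
msg <- tryCatch(safe_log_mean("a"), error = function(e) conditionMessage(e))
msg
```

Three checks for one line of real work is already a fairly high ratio, which is the tension described above.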

Finding the number of possible exceptions works exactly the same as our previous example, only this time we look for calls to `stop` and `stopifnot`.

```r
# Count the lines of a function's body that contain a call to
# `stop` or `stopifnot`.
get_n_exceptions <- function(fn)
{
  body_lines <- deparse(body(fn))
  throwers <- c("stop", "stopifnot")
  # Regex: a function name, optional spaces, then an opening parenthesis
  rx <- paste(throwers, " *\\(", collapse = "|", sep = "")
  sum(grepl(rx, body_lines))
}
n_exceptions <- sapply(fns, get_n_exceptions)
```
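Since `get_complexity` and `get_n_exceptions` differ only in the names they search for, the two could be folded into one function. The following is a sketch of that refactoring; `count_lines_calling` and `demo` are hypothetical names introduced here for illustration.

```r
# Hypothetical generalisation: count the lines of a function's body that
# contain a call to any of the functions named in `fn_names`.
count_lines_calling <- function(fn, fn_names)
{
  body_lines <- deparse(body(fn))
  rx <- paste0(fn_names, " *\\(", collapse = "|")
  sum(grepl(rx, body_lines))
}

# A made-up function with one `if` and one `stop`
demo <- function(x)
{
  if (x < 0) stop("negative")
  sqrt(x)
}
n_flow <- count_lines_calling(demo, c("if", "ifelse", "switch"))
n_stop <- count_lines_calling(demo, c("stop", "stopifnot"))
```

Like the originals, this matches text rather than parsed code, so a string literal containing something like `stop(` would also be counted.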

Once again we examine the distribution …

```r
hist_exceptions <- ggplot(data.frame(n_exceptions = n_exceptions), aes(n_exceptions)) +
  geom_histogram(binwidth = 1)
hist_exceptions
```

and it seems that most functions contain no exception-throwing code. This is acceptable for non-user-facing functions, since user input is the biggest cause of problems.

```
head(sort(n_exceptions, decreasing = TRUE))
      read.DIF        library [<-.data.frame          arima         arima0        glm.fit
            17             16             15             14             13             13
```

The function with the most potential exceptions to throw is `read.DIF`. File handling is notoriously problematic, so that’s fair enough. Load the `survival` package for a better example. The `Surv` function lets you define a censored vector, and it has an interface that’s either really clever or stupidly complicated. You can specify the censoring in many different ways, so the error checking gets rather complicated, and then it requires 20 calls to `stop` to prevent disaster.
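To check the count for `Surv` on your own machine, something like the following should work, assuming the `survival` package is installed. It applies the same line-matching idea as `get_n_exceptions`, and the number you get may not be exactly 20, since it depends on the package version.

```r
# Sketch: count lines in Surv's body that call stop or stopifnot.
# Assumes survival is installed; otherwise the count is left as NA.
if (requireNamespace("survival", quietly = TRUE)) {
  surv_lines <- deparse(body(survival::Surv))
  n_surv_stops <- sum(grepl("stop *\\(|stopifnot *\\(", surv_lines))
} else {
  n_surv_stops <- NA_integer_
}
n_surv_stops
```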

So when you are writing a function and you see the 20th call to `stop`, that’s a hint that you may need to stop (if you want a sensible interface).
