Monster functions (Raaargh!)

Posted on August 12, 2011 by richierocks in R bloggers | 0 Comments

[This article was first published on 4D Pie Charts » R, and kindly contributed to R-bloggers].

It’s widely considered good programming practice to have lots of little functions rather than a few big functions. The reasons behind this are simple. When your program breaks, it’s much nicer to debug a five-line function than a five-hundred-line function. Additionally, by breaking up your code into little chunks, you often find that some of those chunks are reusable in other contexts, saving you from rewriting code in your next project. Breaking your code down into these smaller chunks is one form of refactoring.

The concept of a line of code is surprisingly fluid in R. Since you can add whitespace more or less where you like, the same code can take up one line in your editor or hundreds, if you so choose. Assuming that most programmers will write in a reasonably standard way, we can get a rough idea of how many lines there are in an R function by calling deparse on its body. deparse is less scary than it sounds. Parsing means turning a load of text into something meaningful; thus deparsing means turning something meaningful into a load of text. deparse essentially works like as.character for expressions. (Actually, you can call as.character on expressions, but the results are often dubious.)
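
Here is a small sketch of that round trip, using a toy function (fn and expr are names made up for this example). It also shows why as.character is the dubious option: it drops the closing brace and the indentation. Note that deparse regenerates the text from the parsed code, so the layout you get back is R's own, not necessarily the one that was typed.

```r
# Parsing: text -> language object
expr <- parse(text = "y <- x + 1")[[1]]
is.call(expr)           # TRUE

# Deparsing: language object -> text
deparse(expr)           # "y <- x + 1"

# A toy function with a two-statement body
fn <- function(x) {
  y <- x + 1
  y * 2
}

# One string per line of regenerated code, braces included
deparse(body(fn))       # "{" "    y <- x + 1" "    y * 2" "}"

# as.character gives one string per element of the call, with no closing brace
as.character(body(fn))  # "{" "y <- x + 1" "y * 2"
```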

A very interesting question is “how much of base R could do with refactoring into smaller pieces?”. To answer this, our first task is to get all the functions.

fn_names <- apropos(".+")
fns <- lapply(fn_names, get)
names(fns) <- fn_names
fns <- Filter(is.function, fns)

apropos returns the names of all the objects on your search path that match a regular expression (i.e., objects from all the packages that have been attached), and Filter then keeps only the ones that are functions. Try this code with a freshly loaded version of R, and again with all your packages loaded. The function below will do that for you.

load_all_packages <- function()
{
  invisible(sapply(
    rownames(installed.packages()),
    require,
    character.only = TRUE
  ))
}
load_all_packages()
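
Before counting lines, it is worth a quick check of why the Filter(is.function, ...) step matters. The sketch below assumes a default R session in which nobody has redefined pi or mean; it shows that apropos matches on names only and happily returns things that are not functions.

```r
# apropos matches names against a regular expression, whatever the object is
apropos("^pi$")           # includes "pi"

# pi is a number, so Filter(is.function, ...) would drop it
is.function(get("pi"))    # FALSE

# mean is a function, so it would be kept
is.function(get("mean"))  # TRUE
```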

The number of lines in each function is very straightforward to get from here.

n_lines_in_body <- function(fn)
{
  length(deparse(body(fn)))
}
n_lines <- sapply(fns, n_lines_in_body)
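
Two caveats about this count are easy to see with a sketch. First, primitive functions have no R-level body, so they all score one line. Second, deparse wraps long expressions at its width.cutoff argument, so the count depends on that setting as well as on the code. The function long_one_liner is a made-up example.

```r
n_lines_in_body <- function(fn)
{
  length(deparse(body(fn)))
}

# Primitives such as sum have a NULL body, which deparses to the string "NULL"
n_lines_in_body(sum)  # 1

# A single long expression with no braces
long_one_liner <- function(a, b, c) paste(a, b, c, "a fairly long string", "another fairly long string", sep = ", ")

# The same body, deparsed with the widest and narrowest allowed cutoffs
wide <- length(deparse(body(long_one_liner), width.cutoff = 500L))
narrow <- length(deparse(body(long_one_liner), width.cutoff = 20L))
c(wide = wide, narrow = narrow)
```

With the wide cutoff the body fits on one line; with the narrow one it is split over several, even though the code is identical.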

Let’s take a look at the distribution of those lengths.

library(ggplot2)
hist_line_count <- ggplot(data.frame(n_lines = n_lines), aes(n_lines)) +
  geom_histogram(binwidth = 5)
hist_line_count

So about half the functions are five lines or fewer, which is all well and good. Notice that the x-axis extends all the way over to 400 though, so there clearly are some monsters in there.
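
If you would rather put a number on "about half" than read it off the histogram, something like the following sketch works. To keep it runnable on its own it only looks at the base package (the names base_fns and base_n_lines are made up for this example), so the figures will not match the histogram exactly, and they will vary with your version of R.

```r
# All functions in the base package
base_fns <- Filter(is.function, mget(ls(baseenv()), envir = baseenv()))

# Deparsed line count for each one
base_n_lines <- vapply(
  base_fns,
  function(fn) length(deparse(body(fn))),
  integer(1)
)

# Proportion of functions that are five lines or fewer
mean(base_n_lines <= 5)

# Median, 90th and 99th percentiles of the line counts
quantile(base_n_lines, c(0.5, 0.9, 0.99))
```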

head(sort(n_lines, decreasing = TRUE))

      library         arima   help.search        coplot loadNamespace       plot.lm
          409           328           320           316           305           299

So library is the number one culprit for being overlong and complicated. In fairness to it though, it does mostly consist of sub-functions, so there clearly has been a lot of refactoring done on it; it’s just that the individual bits are contained within it rather than elsewhere. arima is more of a mess; it looks like the code is so old that no one dares touch it anymore. None of these functions are really bad though. To see a package that really needs some refactoring work, load up Hmisc and rerun this analysis. Now eight of the top ten longest functions come from this package. Quick challenge for you: hunting through other packages, can you find a function that beats Hmisc’s transcan at 591 lines?
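
If you take up the challenge, it helps to see which package each monster lives in. The sketch below is one possible way to do that, not necessarily how the numbers above were produced. The helper pkg_of is hypothetical: it assumes that a function's enclosing environment is its package namespace, and it labels primitives (which have no environment) as base. Functions built by other functions may come back with an empty name.

```r
# Hypothetical helper: guess the package from the function's environment
pkg_of <- function(fn)
{
  env <- environment(fn)
  if (is.null(env)) "base" else environmentName(env)
}

# Same approach as earlier: every function on the search path
fn_names <- apropos(".+")
fns <- lapply(fn_names, get)
names(fns) <- fn_names
fns <- Filter(is.function, fns)

# Deparsed line counts, then the ten longest
n_lines <- vapply(fns, function(fn) length(deparse(body(fn))), integer(1))
top10 <- head(sort(n_lines, decreasing = TRUE), 10)

# How many of the ten longest functions come from each package
table(vapply(fns[names(top10)], pkg_of, character(1)))
```

Attach Hmisc (or whatever else you have installed) before running it to see how the table changes.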
