Since 2000, I’ve been working on R/qtl, an R package for mapping the genetic loci (called quantitative trait loci, QTL) that contribute to variation in quantitative traits in experimental crosses. The Bioinformatics paper about it is my most cited; also see my 2014 JORS paper, “Fourteen years of R/qtl: Just barely sustainable.”
It’s a bit of a miracle that R/qtl works and gives the right answers, as it includes essentially no formal tests. The only regular tests are that the examples in the help files don’t produce any errors that halt the code.
I’ve recently been working on R/qtl2, a reimplementation of R/qtl to better handle high-dimensional data and more complex crosses, such as Diversity Outbred mice. In doing so, I’m trying to make use of the software engineering principles that I’ve learned over the last 15 years, which pretty much correspond to the ideas in “Best Practices for Scientific Computing” (Greg Wilson et al., PLOS Biology 12(1): e1001745, doi:10.1371/journal.pbio.1001745).
I’m still working on “Make names consistent, distinctive, and meaningful”, but I’m doing pretty well on writing shorter functions with less repeated code, and particularly importantly I’m writing extensive unit tests.
The basic idea is to break up your code into small functions that ideally do just one thing. (These are the “units”.) And then write tests that explicitly check whether those functions are giving the correct answers. That is, don’t just test that the code runs; you want to know that your code is giving the right answers!
Last week, I was working on implementing a basic genome scan function. I included the option of doing weighted least squares with known weights, because it’s easy enough to do, and R/qtl had done that. (I’ve never actually used that feature, but it’s been there.) And my results were matching those that R/qtl provided. Well, mostly: in the case of multiple phenotypes, with some phenotypes having missing values, and when doing a weighted analysis, the results from R/qtl and R/qtl2 were quite different.
I expect you may be questioning the logic of testing code by comparing it to software that has not been formally tested. I was thinking the same thing. And so I wrote some tests that compared results to those using R‘s
lm() function. And (surprise) the results from
lm() matched R/qtl2 but not R/qtl. So it was R/qtl that was the problem.
So that had me debugging the
scanone function in R/qtl, the R code for which is 831 lines long and was written according to the principle, “I’ll just add this one thing and then clean things up later,” with “later” really meaning “never”, and “this one thing” really meaning “all the things.”
I did, after resorting to a bunch of print/cat statements, find the bug. In subsetting the data to remove individuals with missing values, I’d forgotten to subset the weights. Back in 2012, I’d fixed a related bug, but apparently it was only a partial fix. In thinking about the bug again while writing this post, and in looking back at the 2012 bug fix, I realized that I probably had the same bug in
scantwo (which is 1443 lines long). Sure enough, and so now there’s yet another bug fix, though not yet on CRAN.
It’s not that we don’t test our code, it’s that we don’t store our tests so they can be re-run automatically.
This is my favorite comment on testing; it totally applies to (old) me.
The bottom line:
- Write unit tests.
- When you find a bug, first write a test for it and then fix the bug.
- When you find a bug, look for other possible instances of that bug, and fix them too.
To learn more about testing in R packages, see the Testing chapter in Hadley’s R packages book. Also see the slides for my lecture on testing and debugging in my Tools for Reproducible Research course.