If you were to ask any R-user for the reason for R’s success, you’re almost guaranteed to hear the words “open source”. As the second most popular open source language (behind Python), R has exploded in popularity in recent years. That openness, however, brings challenges of its own: anyone can contribute a package, so the quality of the code and of its testing varies widely.
Here at Mango we have written our own unit tests for many packages and bundled them together to create what we call ValidR. Essentially it creates a single fully tested instance of R, enabling R to be used in companies with strict regulatory requirements. It is, therefore, especially popular within the pharmaceutical industry.
As part of the ValidR process, we like to look at the tests written by package authors themselves. Through some simple coding in R (which I have included below), we find that the number of packages currently available to download from CRAN sits at a bewildering 9,231. Delving a little deeper, we find that these have been created by an equally bewildering 7,882 unique authors. It would be impressive if CRAN were able to ensure that all of these authors write their packages accurately and maintain them consistently, let alone include sufficient testing.
Unfortunately, this is not the case. If we search, using some beloved dplyr, for packages on CRAN that incorporate some form of unit testing, we find that a shockingly low 17% do so. For a language that is the predominant choice of users with statistical rather than software engineering backgrounds, perhaps this is not surprising, and perhaps many of us do not find it concerning. However, a lack of testing could lead people to believe that R cannot be trusted, harming its success and, inevitably, our own as R users. This is an issue that must be resolved, so what is being done?
Gabor Csardi, one of the Senior Consultants here at Mango, is currently writing a package named ‘goodPractice’ (a work in progress, but available on GitHub if you want to take a look). This package is designed to advise package authors on how best to write their code, whether that means flagging functions that shouldn’t be used or syntax that is best avoided.
Ideally, projects such as goodPractice and ValidR would not be necessary, and hopefully in the future they will become obsolete. For now, however, they play a key role in ensuring R’s long-term success and pave the way for improving R’s testing standards.
# Script counts the number of packages on CRAN that use a formal unit testing framework
library(dplyr)

# What's on CRAN?
download.file("http://cran.R-project.org/web/packages/packages.rds",
              "packages.rds", mode = "wb")
cranPacks <- readRDS("packages.rds")

# Number of unique package authors
authors <- cranPacks[, 17]
numAuthors <- length(unique(authors))

# Reduce the size down a bit to keep just the interesting columns
cranPacks <- as.data.frame(cranPacks)[, 1:7]

# Just the packages that use a formal testing framework: the framework is named
# in Depends, Imports or Suggests, or the package is itself one of the frameworks
unitTestFramework <- cranPacks %>%
  filter(grepl("testthat", Depends) | grepl("testthat", Imports) | grepl("testthat", Suggests) |
         grepl("RUnit", Depends) | grepl("RUnit", Imports) | grepl("RUnit", Suggests) |
         grepl("svUnit", Depends) | grepl("svUnit", Imports) | grepl("svUnit", Suggests) |
         grepl("testit", Depends) | grepl("testit", Imports) | grepl("testit", Suggests) |
         Package %in% c("testthat", "RUnit", "svUnit", "testit"))

# Proportion of packages that use a testing framework
nrow(unitTestFramework) / nrow(cranPacks)
## [1] 0.1712033
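To see how that overall figure might break down by framework, the matching step can be sketched on a small made-up table instead of the live CRAN file. This is only a sketch: the `samplePacks` rows and the `usesFramework` helper are hypothetical and are not part of the script above, and the helper matches whole words where the script uses a plain substring match.

```r
# Hypothetical rows mimicking the Package/Depends/Imports/Suggests columns
# of the CRAN package table; the package names pkgA..pkgD are invented.
samplePacks <- data.frame(
  Package  = c("pkgA", "pkgB", "pkgC", "pkgD", "testthat"),
  Depends  = c("R (>= 3.0.0)", NA, "RUnit", NA, "R (>= 3.1.0)"),
  Imports  = c("stats", "methods", NA, NA, "digest, crayon"),
  Suggests = c("testthat (>= 1.0.0)", "knitr", NA, "testit, knitr", NA),
  stringsAsFactors = FALSE
)

frameworks <- c("testthat", "RUnit", "svUnit", "testit")

# Hypothetical helper: TRUE for each package that names `framework` as a whole
# word in Depends, Imports or Suggests, or that is the framework package itself.
usesFramework <- function(packs, framework) {
  pattern <- paste0("\\b", framework, "\\b")
  fields <- c("Depends", "Imports", "Suggests")
  # grepl() returns FALSE for NA entries, so missing fields count as "no match"
  hits <- lapply(fields, function(f) grepl(pattern, packs[[f]], perl = TRUE))
  Reduce(`|`, hits) | packs$Package == framework
}

# One logical column per framework, one row per package
usage <- sapply(frameworks, function(fw) usesFramework(samplePacks, fw))

# Number of packages using each framework
perFramework <- colSums(usage)

# Proportion of packages using at least one framework
overall <- mean(rowSums(usage) > 0)

print(perFramework)
print(overall)
```

Passing the `cranPacks` data frame from the script in place of `samplePacks` should give the same kind of breakdown for the real data.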