Good R Packages
by Joseph Rickert
What makes for a good R package? With over 8,000 packages up on CRAN, the quantity of packages is clearly not an issue for R users. Developing an instinct to recognize quality, however, both requires and deserves some effort. I regularly spend time on Dirk Eddelbuettel’s CRANberries site investigating new packages and monitoring changes in old favorites in order to recommend packages for inclusion in MRAN’s Package Spotlight page. As a consequence, I think I’m getting a feel for quality, and I believe it comes down to this: A good R package clearly says what it does and then really does what it says.
It should not be surprising that documentation is the key. For an R package, the first obvious place for an author to provide quality documentation is the vignette. Hadley Wickham writes:
A vignette is a long-form guide to your package. . . A vignette is like a book chapter or an academic paper: it can describe the problem that your package is designed to solve, and then show the reader how to solve it. . . Vignettes are also useful if you want to explain the details of your package. For example, if you have implemented a complex statistical algorithm, you might want to describe all the details in a vignette so that users of your package can understand what’s going on under the hood, and be confident that you’ve implemented the algorithm correctly.
Unfortunately, fewer than 25% of all R packages have vignettes (about 23.5% in the data used for this post)!
As a person who is in the habit of writing, I find it astounding that anyone would do all the creative work to develop the contents of a package, make the effort to get it through the CRAN process and release it into the open source domain, and then not take the basic step of explaining its value, and maybe even discreetly singing its praises. Nevertheless, most package authors, even some who have otherwise done great work, balk at writing engaging documentation.
Although they are not visible in the histogram of vignette counts (produced by the code at the end of this post), there are a few packages whose authors have made extravagant attempts at documentation. The following table lists all packages with 10 or more vignettes.
   Name        Date        Vignettes
1  caschrono   2014-03-21         10
2  catdata     2014-11-11         45
3  copula      2015-10-26         13
4  gamclass    2015-08-20         11
5  ggvis       2015-06-06         10
6  HSAUR       2015-07-28         17
7  HSAUR2      2015-07-28         19
8  HSAUR3      2015-07-29         22
9  Sleuth2     2016-01-08         13
10 Sleuth3     2016-01-08         13
11 tigerstats  2015-09-23         20
(Yes, catdata does really have 45 vignettes, but the package is the documentation for a book.)
To be fair and thorough, though, I should mention that there are some pretty lame vignettes out there. I sometimes find myself putting a new package on the short list for a Spotlight evaluation just because it has a vignette, only to be disappointed when I get around to looking at it.
I should also note that writing a vignette is neither the only way nor the most sophisticated way to document an R package. Top-shelf packages implementing a statistical algorithm or new computational method are often complemented by a paper published in the Journal of Statistical Software or some other peer-reviewed publication. kernlab, for example, uses a version of its JSS paper as a vignette. Other packages, such as statnet, are backed up by informative websites, and it is also becoming common for package authors to provide links to their GitHub development sites. See data.table's GitHub page, for example, or Dirk's Rcpp CRAN page, which provides links to extensive documentation listing multiple vignettes as well as multiple websites. I find it particularly convenient that the package pdf links to these sites and also to the supporting JSS paper.
So, what about the other part of my definition: A good R package … really does what it says. I find that good documentation does correlate positively with good code, but beyond that the best way to make a quick assessment of the quality of a package is to see if it is included in any of the CRAN Task Views. These are lists of packages organized and curated by experts who make heroic efforts to keep them current and comprehensive, if not complete. Amazingly, 27% of CRAN packages are listed in at least one Task View! This is higher than the percentage of packages that have vignettes.
MRAN provides a lookup feature that is convenient for checking a package's quality potential. With one click you can see the names of any vignettes that may be associated with a package along with any Task Views in which it may be listed.
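For packages that are already installed locally, the vignette half of that lookup can be roughly approximated in base R. The sketch below is not the MRAN feature itself: `list_vignettes()` is a hypothetical helper name introduced here for illustration, and it assumes the package being queried (here `grid`, which ships with R) is installed.

```r
# Sketch: list the vignettes that an installed package ships with.
# list_vignettes() is a hypothetical helper, not part of any package.
list_vignettes <- function(pkg) {
  # With no topic, utils::vignette() returns a "packageIQR" object whose
  # $results matrix has one row per vignette, with columns
  # "Package", "LibPath", "Item" and "Title".
  info <- utils::vignette(package = pkg)$results
  data.frame(
    Package  = rep(pkg, nrow(info)),
    Vignette = unname(info[, "Item"]),
    Title    = unname(info[, "Title"]),
    stringsAsFactors = FALSE
  )
}

vigs <- list_vignettes("grid")
nrow(vigs)   # number of vignettes found for this package
vigs
```

A package with no vignettes simply yields a data frame with zero rows, so `nrow()` can serve as a quick first check before reading further.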
Of course, the crucial step for checking the quality and utility of a package is to try the package yourself. Doing this also provides an opportunity for you, the user, to contribute to the common good by just using the package in your work, talking about its merits, and maybe even providing constructive feedback to the author if you think that would be helpful.
The data for this post comes from a JSON file available on the MRAN website. The code below was used for the post.
library(jsonlite)
library(ggplot2)
 
# Read in package data as JSON file and form into a data frame
json_file <- "https://mran.revolutionanalytics.com/packagedata/allpackages.json"
json_data <- as.data.frame(fromJSON(paste(readLines(json_file),collapse="")))
 
dF <- json_data[,c(1,2,6,7)]
names(dF) <- c("Name","Date","Task_View","Vignettes")
dF$Vignettes <- ifelse(is.na(dF$Vignettes), 0, dF$Vignettes)
dF$Task_View <- ifelse(is.na(dF$Task_View), "None", dF$Task_View)
head(dF)
 
# Look at some summary statistics
summary(dF$Vignettes)
table(dF$Vignettes)
 
# Count packages that have at least one vignette
have_vig <- sum(dF$Vignettes>0)
have_vig #1961
 
pct_vig <- have_vig / length(dF$Vignettes)
pct_vig #0.2353859
 
# Find packages with >= 10 vignettes
dF[dF$Vignettes>=10,]
 
# Modify data frame for printing
dF10 <- dF[dF$Vignettes>=10, -3]
row.names(dF10) <- NULL
dF10
 
# Plot histogram
p <- ggplot(dF, aes(x=Vignettes))
p + geom_histogram(binwidth=1) +
  ggtitle("The sad tale but long tail of vignettes")
 
# Find the share of packages listed in at least one Task View
have_TV <- dF[dF$Task_View!="None",]
pct_TV <- nrow(have_TV) / nrow(dF)
pct_TV #0.2724763
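
The summary logic above can also be tried offline on a small, made-up data frame. This is only a sketch: the package names and values below are invented for illustration and are not real CRAN or MRAN metadata, and the columns are selected by name instead of by position.

```r
# Made-up records for illustration only; NOT real CRAN metadata.
toy <- data.frame(
  Name      = c("pkgA", "pkgB", "pkgC", "pkgD"),
  Task_View = c("Bayesian", NA, NA, "TimeSeries"),
  Vignettes = c(2, NA, 0, 12),
  stringsAsFactors = FALSE
)

# Treat missing values as "no vignettes" and "no Task View"
toy$Vignettes[is.na(toy$Vignettes)] <- 0
toy$Task_View[is.na(toy$Task_View)] <- "None"

# Share of packages with at least one vignette,
# and share listed in at least one Task View
pct_vig <- mean(toy$Vignettes > 0)
pct_TV  <- mean(toy$Task_View != "None")
pct_vig   # 0.5 for this toy data
pct_TV    # 0.5 for this toy data

# Packages with 10 or more vignettes
toy[toy$Vignettes >= 10, c("Name", "Vignettes")]
```

Taking the `mean()` of a logical vector gives the proportion of `TRUE` values directly, which avoids dividing one count by another.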
