In our last blog post we announced the addition of GitHub and Bioconductor R packages to Rdocumentation. For the more technical amongst you, I’ll give a short, high-level description of what’s under the hood at Rdocumentation. Along with that I’ll zoom in on some of the challenges encountered while adding GitHub and Bioconductor repositories.
Rdocumentation in a (technical) nutshell
In a nutshell, the Rdocumentation web server communicates with an R server that’s running in the background. Using a cron job, this R server executes the following steps on a daily basis:
- Check for all available packages and their version numbers using
- Compare these with the ones on Rdocumentation.
- Install/update the ones that are out of sync.
- Generate the documentation for the newly installed/updated packages and store it in a zip file.
The Rdocumentation web server then picks up the newly generated documentation from the R server, parses it, and stores it in its database.
This setup effectively creates a fully automated documentation service. However, installing all R packages on a single machine is by no means a trivial task. Many packages depend on certain (often linux-specific) libraries such as C++ header files from various development packages. These dependencies cause installation failure and require manual intervention. We hope to get to the point where we can run a setup procedure on a server to prepare it for installation of all R packages, but for now this is a work in progress. Another problem is that when R updates, many packages break on installation. We’ve opted to ignore this for now, and to not update packages that don’t install on R’s latest version.
Adding GitHub and Bioconductor repositories
The first version of Rdocumentation only included the packages available on CRAN. Our latest update expanded the package portfolio with the available packages on Bioconductor and GitHub.
Implementing Bioconductor packages was very similar to implementing CRAN packages, but with a few caveats. The biggest one to overcome was that Bioconductor packages sometimes download massive datasets (> 1GB) upon installation, which makes installing and updating a very time consuming and storage space consuming task. To overcome this, we used the `parallel` package to run package installations in threads that were killed (with a SIGKILL signal to the process) if they didn’t terminate after some time. This way we avoided cluttering our machine, and the few packages we loose with this technique is worth the performance gain.
Adding GitHub support was very different. Credits go to Hadley Wickham’s r-on-github script. His script uses the GitHub api to search for all R repositories and their details (owner, stars, latest update, etc.). We only made some minor changes to his script to filter repositories on the amount of stars they have, this to cut out the many test repositories. The following graph plots the amount of R repositories based on the amount of stars they have.
We decided that 3 or more stars was an acceptable metric to decide that a repository is “popular enough” for Rdocumentation. An arbitrary measure, but given the amounts shown in the graph above it seems that even taking 1 or more stars already discards the big majority of repositories. Once the repository information is collected,
devtools is used to install all of the packages on the server. After an initial install of all packages, only packages that have been updated/created within the last week on GitHub are considered for obvious performance reasons.
Any questions/remarks? Drop me a line at [email protected]