|The Homeless Econometrician|
The amazing growth and success of CRAN (the Comprehensive R Archive Network) is marked by the thousands of packages that have been developed and released by a highly active user base. Yet even so, Kurt Hornik, one of the founders and primary maintainers of CRAN, asks in the Austrian Journal of Statistics (2012), “Are There Too Many R Packages?”
As I understand it, some of the primary concerns regarding the immense proliferation of packages are: the lack of long term maintenance of many packages, the superabundance of packages, the inconsistent quality of individual packages, the lack of hierarchical dependency of packages, and insufficient meta package analysis.
1 Lack of long-term maintenance of packages. I have run into this challenge myself: I find an R package that I believe will solve my problem, only to discover that it has not been maintained at the same rate as the R base system.
And how could it be? The base system is updated several times a year, while there are thousands of packages. Updating each of those packages for minor changes in the base system seems foolish and excessive. However, as R is currently structured, packages that are not updated can stop working even though they previously ran without problems. I have experienced this myself and it is frankly very annoying.
One solution might be to limit the number of packages to those which have a sufficient developer base to ensure long term maintenance. However, this would likely stifle the creativity and productivity of the wide R developer base.
Another solution is to limit the number of base system updates in order to limit the likelihood that a package will become outdated and need updating.
A third option, which I believe is the most attractive, is to allow code to specify the version of R on which it is stable, and for R to run the commands in that package as though it were an earlier version of R. This idea is inspired by how Stata handles user-written commands. These commands simply specify the version number under which the command was written. No matter what later version of Stata is used, the command should still work.
I understand that such an implementation would require additional work from the R core team for each subsequent update. However, the investment may be worth it in the long run, as it would decrease the maintenance that package authors must do in response to R base updates.
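To make the third option a little more concrete, here is a minimal sketch of version-gated behavior. It is written in Python purely for illustration and is not how R or Stata actually implement anything. The class `VersionedFunction`, the function `round_value`, and the version numbers are all hypothetical. The idea is only that the runtime keeps older behaviors around and picks the one that matches the version a package declares.

```python
from bisect import bisect_right


class VersionedFunction:
    """Hypothetical registry that keeps every historical behavior of a function."""

    def __init__(self, name):
        self.name = name
        self._impls = {}      # maps version tuple -> implementation
        self._versions = []   # sorted version tuples

    def register(self, version, impl):
        """Record the behavior introduced in `version`, e.g. (2, 0)."""
        self._impls[version] = impl
        self._versions = sorted(self._impls)

    def resolve(self, declared_version):
        """Return the newest behavior that is not newer than `declared_version`."""
        idx = bisect_right(self._versions, declared_version)
        if idx == 0:
            raise LookupError(
                f"{self.name} did not exist in version {declared_version}"
            )
        return self._impls[self._versions[idx - 1]]


# Hypothetical example: the function truncated in 1.0 and rounds from 2.0 on.
round_value = VersionedFunction("round_value")
round_value.register((1, 0), lambda x: int(x))        # old behavior: truncate
round_value.register((2, 0), lambda x: int(x + 0.5))  # new behavior: round half up

legacy = round_value.resolve((1, 5))   # a package that declares version 1.5
current = round_value.resolve((2, 3))  # a package that declares version 2.3

print(legacy(2.7), current(2.7))
```

The cost mentioned above shows up directly in the sketch: every change in behavior adds one more entry that the core team has to keep working indefinitely.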
2 The superabundance of R packages. The concern is that there are so many packages that users might find it difficult to wade through them in order to find the right one. I don’t really see this as a problem. If someone wanted to learn to use all R packages, then of course this task would be nearly impossible. However, like most people, I believe, I learn to use new functions within packages to solve specific problems. I don’t really care how many packages there are out there. All I care about is that when I ask a question on Google or Stack Overflow about how to do x or y, someone can point me to the package and command combination necessary to accomplish the task.
3 The inconsistent quality of individual packages. It is not always clear whether user-written packages are really doing what they claim to be doing. I know that I personally am constantly on the lookout for checks to make sure my code is doing what I think it is doing, yet I still consistently find myself making small errors which only show up through painstaking experimentation and debugging.
CRAN has some automated procedures in which packages are tested to ensure that all of their functions work without errors under normal circumstances. However, as far as I know, there are no automated tests to ensure that commands are not failing silently by returning the wrong answer. This kind of error control is left entirely up to the authors and users. This concern comes to mind because one of my friends was recently running two different Bayesian estimation packages which were supposed to produce identical results, yet each returned distinctly different results, with one set having significant estimates and the other not. If he had not thought to try two different packages, he would never have suspected the potential errors in how the packages were written.
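The kind of check my friend did by hand can be sketched as a small cross-check between two independent implementations of the same quantity. This is only an illustration in Python, not a description of anything CRAN does. The function names, the toy datasets, and the tolerance are my own assumptions, and I use the sample variance as a stand-in because it is easy to verify.

```python
import math


def variance_two_pass(xs):
    """Sample variance: compute the mean first, then the squared deviations."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def variance_welford(xs):
    """Sample variance using Welford's one-pass update."""
    mean = 0.0
    m2 = 0.0
    for k, x in enumerate(xs, start=1):
        delta = x - mean
        mean += delta / k
        m2 += delta * (x - mean)
    return m2 / (len(xs) - 1)


def variance_buggy(xs):
    """Deliberately wrong: divides by n instead of n - 1, and never errors."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n


def cross_check(impl_a, impl_b, datasets, rel_tol=1e-9):
    """Return the datasets on which two implementations disagree."""
    disagreements = []
    for data in datasets:
        a, b = impl_a(data), impl_b(data)
        if not math.isclose(a, b, rel_tol=rel_tol, abs_tol=1e-12):
            disagreements.append((data, a, b))
    return disagreements


# Toy datasets, made up for illustration.
datasets = [[1.0, 2.0, 3.0, 4.0], [2.5, 2.5, 2.5], [10.0, 0.0]]

agree = cross_check(variance_two_pass, variance_welford, datasets)
differ = cross_check(variance_two_pass, variance_buggy, datasets)
print(len(agree), len(differ))
```

Note that the buggy version agrees with the correct one on the constant dataset, so a single test case would not have caught it. For simulation-based estimators such as the Bayesian ones above, the tolerance would also have to allow for Monte Carlo error.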
A solution to inconsistent package quality controls may be to have a multitiered package release structure in which packages are first released in “beta form” but require an independent reviewing group to check functionality and write up reports before attaining “full” release status. Such an independent package review structure may be accomplished by developing an open access R-journal specifically geared towards the review, release, and revision of R packages.
4 The lack of hierarchical dependencies. This is a major point in Kurt Hornik’s paper. He looks at package dependencies and finds that the majority of packages have no dependencies upon other packages. This indicates that while there are many packages out there, most are not building on the work of other packages. This produces the unfortunate situation in which many package developers seem to be recreating the work of other package developers. I am not really sure whether anything can be done about this, or whether it really is an issue.
It does not bother me that many users write similar or duplicate code, because I think writing such code helps the user better understand the R system, the user’s problem, and the user’s solution. There is, however, the issue that the more times a problem is coded, the more likely it is that someone will code an error. This brings us back to point 3, in which errors must be rigorously pursued and ruthlessly exterminated through use of an independent error detection system.
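The statistic behind this point is simple to compute once the dependency lists are available. Here is a rough sketch in Python using a made-up dependency map. The package names and the helper `dependency_summary` are hypothetical, and the numbers say nothing about CRAN itself.

```python
from collections import Counter


def dependency_summary(depends):
    """Summarize a mapping of package name -> list of packages it depends on."""
    no_deps = sorted(pkg for pkg, deps in depends.items() if not deps)
    # How many packages build on each package (reverse dependencies).
    reverse_counts = Counter(dep for deps in depends.values() for dep in deps)
    return {
        "share_without_dependencies": len(no_deps) / len(depends),
        "packages_without_dependencies": no_deps,
        "reverse_dependency_counts": dict(reverse_counts),
    }


# Made-up dependency map, for illustration only.
toy_depends = {
    "pkgA": [],
    "pkgB": [],
    "pkgC": ["pkgA"],
    "pkgD": [],
    "pkgE": ["pkgA", "pkgC"],
}

summary = dependency_summary(toy_depends)
print(summary["share_without_dependencies"])
print(summary["reverse_dependency_counts"])
```

In R itself, the matrix returned by `available.packages()` includes the declared dependency fields, so the same tabulation could presumably be run against the real repository.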
5 Insufficient Meta Package Analysis. A point that Kurt Hornik also raises is that there are a lot of R packages out there but not a lot of information about how those packages are being used. To help fill this gap, it might be useful to build into future releases of R the option to report usage statistics on which packages and functions are being used, and in combination with which other packages. Package developers might find such information useful when deciding which functions to update.
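As a rough idea of what such usage statistics could look like, the sketch below counts how often packages are loaded, and how often they are loaded together, from a list of sessions. It is written in Python for illustration only. The session data is invented, the function `co_usage_counts` is hypothetical, and I am assuming users would opt in to reporting which packages each session loaded.

```python
from collections import Counter
from itertools import combinations


def co_usage_counts(sessions):
    """Count sessions per package and sessions per pair of packages."""
    package_counts = Counter()
    pair_counts = Counter()
    for loaded in sessions:
        unique = sorted(set(loaded))  # ignore repeated loads within a session
        package_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    return package_counts, pair_counts


# Invented session logs: each inner list is the packages one session loaded.
sessions = [
    ["pkgA", "pkgB"],
    ["pkgA", "pkgB", "pkgC"],
    ["pkgC"],
    ["pkgB", "pkgA", "pkgA"],
]

package_counts, pair_counts = co_usage_counts(sessions)
print(package_counts.most_common())
print(pair_counts.most_common())
```

A table like this would tell a developer not only how widely a package is used but also which other packages it needs to keep working alongside.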
Overall, it is impossible not to recognize CRAN as a huge success. CRAN has been extremely effective at providing a database for the distribution of many R packages dealing with a myriad of user demands. In a way, this post and the article that inspired it are only presenting the problems associated with success. Yet, given the great success of CRAN, how should we move it forward? This post presents some possible solutions.
Finally, I would like to say thank you to all of the fantastic R developers who have released so many packages. I do not claim credit for any of the thoughts expressed in this post. As a newcomer to R, I am not personally aware of the many thoughtful dialogues that must have already transpired regarding the issues raised in this post. I am sure more thoughtful and considerate minds than mine have already given what they believe are the best solutions to the problems here raised.