Unifying packages to help users access R repositories
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
With over 11,000 packages in the main CRAN (Consolidated R Archive Network) collection, users of R face a challenge when looking for tools to carry out their statistical and computational tasks, especially in subject areas outside their main domain of expertise. Moreover, a lot of packages have similar or overlapping features. Some are, to be kind, not very well written, use methods which are outdated, or are simply inappropriate to a given situation.
In an attempt to address this, I joined with Julia Silge and Spencer Graves to organize a special session “Navigating the R package universe” at the UseR2017 meeting in Brussels. A lot of other people contributed to the session — see https://github.com/nashjc/Rnavpkg, and especially the wiki portion, https://github.com/nashjc/Rnavpkg/wiki. The presentation materials are there, along with links to various follow-on material such as notes and this and other blog posts. In particular, there is a general follow-up blog post at https://juliasilge.com/blog/navigating-packages/.
There are lots of ideas about how to help users find the right package, but we found that we could consider them as generally falling into three categories:
- unification of packages that have similar goals or features (of which more in this posting)
- guidance in the form of CRAN Task Views and similar infrastructure (see Julia’s posting https://juliasilge.com/blog/package-guidance/)
- search tools for R packages (pending posting by Spencer Graves)
There are overlaps in these categories, and we are certain some ideas have been left out. Let us know!
Approaches to unification of packages
In the breakout session on unification at UseR2017, many ideas were suggested, sometimes in rapid-fire fashion. Alice Daish kindly took notes on her cell phone, but I will apologize in advance if I have missed anything important.
My own view of unification with regard to the package universe is to reduce the effective span of the number of “things” the user must consider. If the rule of thumb for presentations is to have no more than (fill in your favourite single digit integer) bullets per screen, surely a user should not have to face thousands of choices? Following this objective, we can consider:
- developing wrappers that provide a common syntax or interface to multiple packages
- finding ways to identify packages that should not generally be used so they can be archived out of the immediate or first-level view of users who are searching for tools
- promoting collaboration between developers of similar packages with the aim of consolidating those packages
- encouraging those working on guidance and search to provide filtering of their output to reduce the choice burden on users.
None of these approaches can be considered “easy”.
Wrappers
My own efforts to unifying packages have been in the field of optimization, particularly function minimization. With Ravi Varadhan, I developed the optimx package to allow a common interface to a number of base and package tools. It is an indicator of the difficulty of doing this well that I later refactored optimx to optimr to allow for easier maintenance of the package. Note that because these packages call other packages, failures in the called functions may give rise to error messages in the unifying package. To avoid too many CRAN alerts, optimrx can be found on R-forge in the optimizer project https://r-forge.r-project.org/projects/optimizer/ and can call more solvers than optimr. I believe that optimrx makes a worthwhile step towards simplifying the user’s task in carrying out function minimization where some of the parameters can also be bounded or temporarily fixed. A similar wrapper for global optimizers has been written by Hans Werner Borchers. gloptim is so far only on R-forge in the optimizer project. As I mentioned, writing these codes takes a lot of effort. They are never truly finished because new solvers become available from time to time.
A sample of some other packages which unify a number of methods:
- mlr: Machine Learning in R
- caret: Classification And REgression Training for developing predictive models
- Matrix: Matrix methods for dense and sparse matrices (What would we do without them?)
We could also consider the nls() function in the stats package that is distributed with base R to unify several methods for nonlinear least squares. Unfortunately, the default method and derivative approximation are quite often mis-applied by non-experts (and even some experts!). For the Marquardt stabilization, users must look in packages like minpack.lm and nlsr. Duncan Murdoch and I wrote the last package to allow for analytic derivatives of expressions in nonlinear modelling, but it would be nice to be able to use them in conjunction with separable models (essentially those with linear structure with some nonlinear terms). While nlsr has a wrapnls() function to call nls(), we don’t yet have a way to access minpack.lm or some other packages with relevant features. There is always more work to do!
Archiving
During the breakout session on unification, there were a lot of comments about packages that should be merged, as well as a few about packages that should be abandoned. A subtext of the discussion that is relevant to R in general is that younger workers were quite clear that they felt they could not voice their opinions openly because some packages had been written by members of well-established groups who might provide employment opportunities or other perks. In this, the fact that two of the session organizers (JN and SG) are retired and the third (JS) does not work in academia was noted. We are relatively protected from backlash to honest commentary. Here I genuinely mean honest and not capriciously critical. We must recognize that many packages have been important in the life of R but now need renewal or replacement. I have been trying to get my own contributions to the optim() function deprecated for some time! However, there are workers who, I am sure, will for a variety of reasons be unwilling to discuss ways to reorganize the package repositories.
Ultimately, the users will vote with their feet (or rather mouse/touchpad). However, there will be a lot of wasted effort that does them and R no good whatever.
Filtered presentation
The CRAN repository is, at present, offered to users as a list alphabetically or by date on the different mirrors, (e.g., https://cloud.r-project.org/). While a way to see every package available is necessary, it is far from friendly to users, either novices or experts. Clearly it is feasible to restructure the list(s) for different audiences. Moreover, this restructuring or filtering need not be on the CRAN sites, but could be delegated to sites dealing with particular applications or methods. Here is a prime area of overlap with the guidance and search ideas for navigating the packages. In particular, it will be helpful if automated tools are developed that do the lion’s share of the work for this.
Another approach to filtering is to present R tools within a different framework. https://www.jamovi.org/ uses a spreadsheet paradigm. https://www.tidyverse.org/ collects tools that deal with data science and adds a streamlined command set. Unfortunately, there is a learning cost to using new frameworks like these. They simplify the use of R in one domain, but add to the general collection of packages and tools.
Directions
Unification of R packages with the aim to reduce user costs in finding the right tools is never going to be trivial. Moreover, it is not as directly rewarding as writing an (apparently) new package or method, and getting a paper published abut it. On the other hand, making R infrastructure more accessible to users is key to continued vitality of the overall resource and community.
John C. Nash 2017-9-3
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.