Quantifying R Package Dependency Risk

[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers.]

We recently commented on excess package dependencies as representing risk in the R package ecosystem.

The question remains: how much risk? Is low dependency a mere talisman, or is there evidence it is a good practice (or at least correlates with other good practices)?

Well, it turns out we can quantify it: each additional non-core package declared as an “Imports” or “Depends” is associated with an extra 11% relative chance of a package having an indicated issue on CRAN. With more than 5 non-core “Imports” plus “Depends”, a package has significantly elevated risk.
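To get a feel for how an 11% per-dependency increase accumulates, here is a minimal Python sketch. It assumes the increase compounds multiplicatively on some baseline quantity (as odds would in a logistic-style model). The article does not state whether "relative chance" refers to odds or to probability, and the function name is hypothetical.

```python
def relative_risk_multiplier(n_deps, per_dep_increase=0.11):
    """Multiplier relative to a zero-dependency baseline.

    Assumption: each extra dependency multiplies the baseline
    quantity by (1 + per_dep_increase), so the effect compounds.
    """
    if n_deps < 0:
        raise ValueError("n_deps must be non-negative")
    return (1.0 + per_dep_increase) ** n_deps


for n in (1, 5, 10):
    print(n, round(relative_risk_multiplier(n), 2))
# 1 1.11
# 5 1.69
# 10 2.84
```

Under this compounding assumption, five non-core dependencies correspond to roughly 1.7 times the baseline and ten to roughly 2.8 times. These are multipliers on the baseline, not issue probabilities.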

The relationship between the number of package dependencies and the modeled issue probability is summarized in the following graph.

[Figure: modeled issue probability versus number of non-core “Imports” plus “Depends”]

In the above graph, the dashed horizontal line is the overall rate at which packages have issues on CRAN. Notice the curve crosses the line well before 5 non-trivial dependencies.

In fact, packages importing more than 5 non-trivial dependencies have issues on CRAN at an empirical rate of 35% (above the model prediction at 5 dependencies), double the overall rate of 17%. Doubling a risk is considered very significant. And almost half of the packages using more than 10 non-trivial dependencies have known issues on CRAN.
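The "double" comparison is simple arithmetic on the two reported rates, and the empirical rates themselves are just grouped proportions. The Python sketch below shows both. The record list is made-up toy data for illustration only (it is not the CRAN data), and the function name is hypothetical.

```python
def issue_rate(records, more_than=None):
    """Fraction of packages with an issue.

    records: iterable of (n_deps, has_issue) pairs.
    more_than: if given, keep only packages with n_deps > more_than.
    Returns None when no package matches.
    """
    flags = [has_issue for n_deps, has_issue in records
             if more_than is None or n_deps > more_than]
    if not flags:
        return None
    return sum(flags) / len(flags)


# Ratio of the two rates reported in the article.
reported_overall = 0.17
reported_over_5 = 0.35
print(round(reported_over_5 / reported_overall, 2))  # 2.06

# Toy records (hypothetical, NOT the real CRAN table).
toy = [(0, False), (1, False), (2, False), (3, True),
       (6, True), (7, False), (8, False), (12, True)]
print(issue_rate(toy))               # 0.375
print(issue_rate(toy, more_than=5))  # 0.5
```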

A very short analysis deriving the above can be found here.

Obviously we are using lack of problems on CRAN as a rough approximation for package quality, and the number of non-trivial package Imports and Depends as a rough proxy for package complexity. It would be interesting to quantify and control for other measures (including package complexity and purpose).
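As an illustration of the complexity proxy, the following Python sketch counts non-core dependencies from the comma-separated Imports and Depends fields of a package DESCRIPTION. It assumes "core" means the packages shipped with base R plus the `R` version entry itself. The article's own definition of non-core may differ, and the function names here are hypothetical.

```python
import re

# Assumption: "core" = packages distributed with base R.
BASE_PACKAGES = {
    "base", "compiler", "datasets", "graphics", "grDevices", "grid",
    "methods", "parallel", "splines", "stats", "stats4", "tcltk",
    "tools", "utils",
}


def parse_dep_field(field):
    """Split a DESCRIPTION field such as 'dplyr (>= 1.0.0), stats'."""
    if not field:
        return []
    names = []
    for entry in field.split(","):
        # Drop version constraints like "(>= 1.0.0)" and whitespace.
        name = re.sub(r"\(.*?\)", "", entry, flags=re.S).strip()
        if name:
            names.append(name)
    return names


def count_non_core(imports, depends, core=BASE_PACKAGES):
    """Number of distinct non-core packages in Imports plus Depends."""
    deps = set(parse_dep_field(imports)) | set(parse_dep_field(depends))
    return len({d for d in deps if d != "R" and d not in core})


print(count_non_core("dplyr (>= 1.0.0), stats, utils,\n rlang",
                     "R (>= 3.5.0), methods"))  # 2
```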

Our theory is that the imports are not so much causing problems as serving as a “code smell” correlated with other package issues. We feel this is evidence that the principles that guide some package developers to prefer packages with well-defined purposes and low dependencies are part of a larger set of principles that lead to higher-quality software.

A table of all scored packages and their problem/risk estimate can be found here.
