Differences in the network structure of CRAN and BioConductor

[This article was first published on Revolutions, and kindly contributed to R-bloggers].

by Andrie de Vries

This week at JSM2015, the annual conference of the American Statistical Association, Joseph Rickert and I gave a presentation on the topic of “The network structure of CRAN and BioConductor”. Our work tested whether one can detect statistical differences in the network graphs formed by the dependencies between packages. In the dependency graph, each package is a vertex and each dependency is an edge connecting two vertices.
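To make the vertex-and-edge picture concrete, here is a minimal sketch in plain Python. It is not the code behind the presentation, which used R and the igraph package. The package names and the `deps` mapping are hypothetical, and the graph is treated as undirected for simplicity, although real dependencies have a direction.

```python
# Hypothetical package -> dependencies mapping (illustrative names only).
deps = {
    "pkgA": ["pkgB", "pkgC"],
    "pkgB": ["pkgC"],
    "pkgC": [],
    "pkgD": [],  # a package with no dependencies and no dependents
}


def build_graph(deps):
    """Return an undirected adjacency map.

    Each package is a vertex; each dependency is an edge between two vertices.
    """
    adj = {pkg: set() for pkg in deps}
    for pkg, required in deps.items():
        for dep in required:
            if dep == pkg:
                continue  # ignore self-references
            # A dependency may name a package that is not itself a key.
            adj.setdefault(dep, set()).add(pkg)
            adj[pkg].add(dep)
    return adj


graph = build_graph(deps)

# Every undirected edge appears in two adjacency sets, so halve the total.
n_edges = sum(len(neighbours) for neighbours in graph.values()) // 2
```

In this toy example `pkgD` ends up as an isolated vertex, the kind of unconnected package that matters for the degree distribution discussed further down.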

Building on previous work

This presentation combines earlier work that we have discussed in blog posts during the year:

The hypothesis

Before starting the work, we formed a hypothesis that CRAN and BioConductor have discernibly different package network structures. This hypothesis is based on the intuition that these two repositories have different management structures:
  • On CRAN, packages of almost any type are welcome. The CRAN maintainers have some strict policies on how a package should behave to get on CRAN (have documentation, have examples, build without warnings, etc.). However, CRAN does not prescribe anything about the subject matter or content of any package.
  • In contrast, BioConductor is more focused and centrally managed. Packages must add something to the topic of high-throughput genomic data. For a great introduction, read Peter Hickey’s contributed blog post, A Short Introduction to Bioconductor.

What we found

We used the igraph package to compute descriptive network statistics. Among these, we found the clustering coefficient and the degree distribution most illuminating.

First, we found that BioConductor has a higher clustering coefficient than CRAN. The clustering coefficient (also called transitivity) measures the probability that the adjacent vertices of a vertex are connected to each other. You can see this visually in the network graphs. It appears as if the BioConductor graph is more compact, while the CRAN graph has many packages on the perimeter that are only loosely connected to the rest of the graph.

[Figure: Cluster-diagram]

We used a simple bootstrapping algorithm to simulate the local clustering coefficient of induced subgraphs. In this plot, the distribution of the clustering coefficient for CRAN (in red) is much lower than for BioConductor (in blue).

[Figure: Bootstrap-cluster-coef]

The second statistical summary is the degree distribution. The degree of a node is the number of adjacent edges. Note in particular the fraction of nodes with degree zero, i.e. unconnected nodes. BioConductor has a much lower fraction of packages with zero connections. It seems that the BioConductor policy encourages package authors to re-use existing material and write packages that work better together.

[Figure: CRAN-BioC-degree-distribution]
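The statistics above can be illustrated on a toy graph. The following is a rough sketch in plain Python, not the igraph code used for the presentation. It makes two simplifying assumptions: it computes the global transitivity rather than the local clustering coefficient, and it reads "simple bootstrapping" as repeated random sampling of vertices without replacement. The toy graph and all function names are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations


def from_edges(vertices, edges):
    """Build an undirected adjacency map from a vertex list and an edge list."""
    adj = {v: set() for v in vertices}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def transitivity(adj):
    """Global clustering coefficient: closed triples / all connected triples."""
    closed = 0
    triples = 0
    for neighbours in adj.values():
        k = len(neighbours)
        triples += k * (k - 1) // 2  # pairs of neighbours around this vertex
        for a, b in combinations(neighbours, 2):
            if b in adj[a]:  # the pair is itself connected
                closed += 1
    return closed / triples if triples else float("nan")


def induced_subgraph(adj, vertices):
    """Keep only the given vertices and the edges between them."""
    keep = set(vertices)
    return {v: adj[v] & keep for v in keep}


def bootstrap_transitivity(adj, sample_size, n_rep, seed=0):
    """Transitivity of n_rep random induced subgraphs (assumed resampling scheme)."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    return [
        transitivity(induced_subgraph(adj, rng.sample(nodes, sample_size)))
        for _ in range(n_rep)
    ]


def degree_distribution(adj):
    """Map each degree to the fraction of vertices that have it."""
    counts = Counter(len(neighbours) for neighbours in adj.values())
    return {d: c / len(adj) for d, c in sorted(counts.items())}


# Toy graph: a triangle a-b-c, a pendant vertex d, and an isolated vertex e.
toy = from_edges("abcde", [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])

print(transitivity(toy))            # 0.6: one triangle, five connected triples
print(degree_distribution(toy)[0])  # 0.2: one of five vertices is unconnected
samples = bootstrap_transitivity(toy, sample_size=4, n_rep=10)
```

Comparing the `samples` from two graphs, for example with histograms, gives the kind of side-by-side distribution described above. Subgraphs without any connected triple yield `nan` and would need to be dropped first.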

On slideshare:

The presentation is available on slideshare.

Getting involved

The scripts we used are available on GitHub. We think this is an important topic to study, since it could help to discover:
  • Better search algorithms for finding packages that are useful to solve a specific problem
  • Recommendations for packages to use

