The network structure of CRAN

[This article was first published on Revolutions, and kindly contributed to R-bloggers.]

by Andrie de Vries

My experience of UseR!2015 drew to an end shortly after I gave a Kaleidoscope presentation on “The Network Structure of CRAN”.

My talk drew heavily on two previous blog posts, “Finding the essential R packages using the pagerank algorithm” and “Finding clusters of CRAN packages using igraph”.

In this talk, however, I went further and attempted to create a single visualization of all ~6,700 packages on CRAN. I did all the analysis in R, exported the result as a GraphML file, and used Gephi to create the network visualization.
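As a rough illustration of the export step, here is a minimal sketch using igraph. The package names and edges are made up for the example (they are not real CRAN data), and this is not necessarily how the original analysis built its graph; the real dependency list would have to come from CRAN metadata.

```r
library(igraph)

# Hypothetical edge list: each row says that "from" depends on "to".
# These packages and edges are invented purely for illustration.
deps <- data.frame(
  from = c("pkgA", "pkgB", "pkgD"),
  to   = c("pkgC", "pkgC", "pkgB"),
  stringsAsFactors = FALSE
)

# Build a directed graph: an edge points from a package
# to the package it depends on.
g <- graph_from_data_frame(deps, directed = TRUE)

# Write the graph as GraphML, a format that Gephi can open.
# A temporary file is used here so the sketch leaves nothing behind.
out_file <- file.path(tempdir(), "cran-toy.graphml")
write_graph(g, out_file, format = "graphml")
```

The resulting file can then be opened in Gephi for layout and styling.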

My first version of the graph was in a single colour, where each node is a package and each line segment is a dependency on another package. Although this graph indicates dense areas, it reveals little of the deeper structure of the network.

 

[Image: CRAN-bw]

To examine the structure more closely, I did two things:

  • Used the page.rank() algorithm to compute package importance, then scaled the font size so that more “important” packages have bigger labels
  • Used the walktrap.community() algorithm to assign colours to “clusters”. This algorithm uses short random walks to find clusters of densely connected nodes
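The two steps above can be sketched on a toy graph as follows. This assumes a recent igraph release, where page.rank() and walktrap.community() are available under the names page_rank() and cluster_walktrap(). The edges, the vertex attribute names and the steps = 4 setting are illustrative assumptions, not values taken from the talk.

```r
library(igraph)

# Made-up dependency edges ("from" depends on "to"), forming two
# loose groups around two hub packages. Not real CRAN data.
deps <- data.frame(
  from = c("a1", "a2", "a3", "a2", "b1", "b2", "b3", "b2", "a3"),
  to   = c("hubA", "hubA", "hubA", "a1", "hubB", "hubB", "hubB", "b1", "hubB"),
  stringsAsFactors = FALSE
)
g <- graph_from_data_frame(deps, directed = TRUE)

# PageRank: packages that many others depend on receive a higher score.
pr <- page_rank(g, directed = TRUE)$vector

# Walktrap: short random walks tend to stay inside densely connected
# groups, which is what the algorithm uses to detect communities.
wt <- cluster_walktrap(g, steps = 4)

# Store both results as vertex attributes so that they travel with the
# graph on export (attribute names here are arbitrary).
V(g)$pagerank <- pr
V(g)$cluster  <- as.integer(membership(wt))

# Packages ordered from most to least "important".
ranking <- sort(pr, decreasing = TRUE)
print(head(ranking, 3))
```

In a tool such as Gephi, a score like pagerank can be mapped to label size and cluster to node colour.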

[Image: CRAN-colour]

 

This image clearly highlights several clusters:

  • MASS, in yellow. This is a large cluster of packages that includes lattice and Matrix, together with many others that seem to expose statistical functionality
  • Rcpp, in light blue. Rcpp allows any package or script to call C++ code for high performance
  • ggplot2, in darker blue. This cluster, sometimes called the Hadleyverse, contains packages such as plyr, dplyr and their dependencies, e.g. scales and RColorBrewer.
  • sp, in green. This cluster contains a large number of packages that expose spatial statistics features, including spatstat, maps and mapproj 

It turns out that Rcpp has a slightly higher page rank than MASS. This made Dirk Eddelbuettel very happy:

[Image: Eddelbuettel-tweet]

You can find my slides on SlideShare and my source code on GitHub.

Finally, my thanks to Gabor Csardi, maintainer of the igraph package, who listened to my ideas and gave helpful hints prior to the presentation.
