by Andrie de Vries
My talk drew heavily on two previous blog posts, “Finding the essential R packages using the pagerank algorithm” and “Finding clusters of CRAN packages using igraph”.
However, in this talk I went further, attempting to create a single visualization of all ~6,700 packages on CRAN. To do this, I did all the analysis in R, then exported a GraphML file, and used Gephi to create a network visualization.
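The R-to-Gephi hand-off described above can be sketched roughly as follows. This is a minimal illustration rather than the code used for the talk: the `deps` edge list and its `pkgA`-style names are made-up toy data, and the calls are the current igraph names (`graph_from_data_frame()`, `write_graph()`).

```r
library(igraph)

# Hypothetical toy edge list: each row means "from" depends on "to".
# The real analysis would build these rows from CRAN package metadata.
deps <- data.frame(
  from = c("pkgA", "pkgA", "pkgB", "pkgC"),
  to   = c("pkgB", "pkgC", "pkgD", "pkgD"),
  stringsAsFactors = FALSE
)

# One node per package, one directed edge per dependency.
g <- graph_from_data_frame(deps, directed = TRUE)

# Write the graph as GraphML, a format that Gephi can open.
out_file <- file.path(tempdir(), "cran-toy.graphml")
write_graph(g, out_file, format = "graphml")
```

Opening the resulting file in Gephi gives the node and edge tables from which a layout can be computed.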
My first version of the graph used a single colour: each node is a package, and each edge is a dependency on another package. Although this graph shows where the dense areas are, it reveals little of the deeper structure of the network.
To examine the structure more closely, I did two things:
- Used the page.rank() algorithm to compute package importance, then scaled the label font size so that more “important” packages appear in a bigger font
- Used the walktrap.community() algorithm to assign colours to “clusters”. This algorithm uses random walks of a short length to find clusters of densely connected nodes
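A small sketch of these two steps, under some assumptions: the graph is made-up toy data, `page_rank()` and `cluster_walktrap()` are the current igraph names for `page.rank()` and `walktrap.community()`, and storing the results as vertex attributes called `pagerank` and `cluster` is just one way to carry them into a GraphML export.

```r
library(igraph)

# Hypothetical toy dependency graph: two tight groups joined by one edge.
deps <- data.frame(
  from = c("a1", "a2", "a3", "a2", "b1", "b2", "b3", "b2", "a3"),
  to   = c("hubA", "hubA", "hubA", "a1", "hubB", "hubB", "hubB", "b1", "hubB"),
  stringsAsFactors = FALSE
)
g <- graph_from_data_frame(deps, directed = TRUE)

# PageRank: packages that many others depend on score higher.
pr <- page_rank(g)$vector

# Walktrap: short random walks tend to stay inside densely connected groups.
wt <- cluster_walktrap(g, steps = 4)

# Attach the results to the nodes so they are written out with the graph.
V(g)$pagerank <- pr
V(g)$cluster <- as.integer(membership(wt))

# Show the highest-ranked packages first.
head(sort(pr, decreasing = TRUE), 3)
```

In a tool such as Gephi, the `pagerank` attribute could then drive label size and the `cluster` attribute could drive node colour.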
This image quite clearly highlights several clusters:
- MASS, in yellow. This is a large cluster of packages that includes lattice and Matrix, together with many others that seem to expose statistical functionality
- Rcpp, in light blue. Rcpp lets any package or script call C++ code for high performance
- ggplot2, in darker blue. This cluster, sometimes called the Hadleyverse, contains packages such as plyr, dplyr and their dependencies, e.g. scales and RColorBrewer.
- sp, in green. This cluster contains a large number of packages that expose spatial statistics features, including spatstat, maps and mapproj
It turns out that Rcpp has a slightly higher page rank than MASS. This made Dirk Eddelbuettel very happy: