The network structure of CRAN

July 8, 2015

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Andrie de Vries

My experience of UseR!2015 drew to an end shortly after I gave a Kaleidoscope presentation discussing "The Network Structure of CRAN".

My talk drew heavily on two previous blog posts, Finding the essential R packages using the pagerank algorithm and Finding clusters of CRAN packages using igraph.

However, in this talk I went further, attempting to create a single visualization of all ~6,700 packages on CRAN. To do this, I did all the analysis in R, then exported a GraphML file, and used Gephi to create a network visualization.
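As a rough illustration of the R-to-Gephi step, the following minimal sketch builds a directed dependency graph with igraph and writes it out as GraphML. It is not the code from the talk: the package names and the `deps` table are hypothetical toy data, and the output file name is an assumption. For real data, a dependency table could be derived from something like `available.packages()`.

```r
library(igraph)

# Hypothetical toy data: each row means `package` depends on `dependency`.
# These names are made up for illustration and are not real CRAN packages.
deps <- data.frame(
  package    = c("pkgA", "pkgB", "pkgC", "pkgB", "pkgD", "pkgE", "pkgE"),
  dependency = c("core1", "core1", "core1", "pkgA", "core2", "core2", "pkgD"),
  stringsAsFactors = FALSE
)

# Each node is a package; each directed edge points from a package
# to a package it depends on.
g <- graph_from_data_frame(deps, directed = TRUE)

# Write the graph to a GraphML file (assumed file name) that Gephi can open.
out_file <- file.path(tempdir(), "toy-cran.graphml")
write_graph(g, out_file, format = "graphml")
```

Vertex attributes set on `g` before the call to `write_graph()` are stored in the GraphML file, so values computed in R can be used for styling in Gephi.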

My first version of the graph was in a single colour, where each node is a package and each edge is a dependency on another package. Although this graph indicates dense areas, it reveals little of the deeper structure of the network.

 

CRAN-bw

To examine the structure more closely, I did two things:

  • Used the page.rank() algorithm to compute package importance, then changed the font size so that more "important" packages have a bigger font
  • Used the walktrap.community() algorithm to assign colours to "clusters". This algorithm uses short random walks to find clusters of densely connected nodes
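The two steps above can be sketched on a small example. This is an illustration under stated assumptions rather than the analysis from the talk: the graph is hypothetical toy data, the attribute names `importance` and `cluster` are made up, and the sketch uses `page_rank()` and `cluster_walktrap()`, the newer igraph names for `page.rank()` and `walktrap.community()`.

```r
library(igraph)

# Hypothetical toy dependency graph (made-up package names).
deps <- data.frame(
  package    = c("pkgA", "pkgB", "pkgC", "pkgB", "pkgD", "pkgE", "pkgE"),
  dependency = c("core1", "core1", "core1", "pkgA", "core2", "core2", "pkgD"),
  stringsAsFactors = FALSE
)
g <- graph_from_data_frame(deps, directed = TRUE)

# PageRank score per package: packages that many others depend on score higher.
pr <- page_rank(g, directed = TRUE)$vector

# Walktrap community detection based on short random walks
# (steps = 4 is the igraph default; edge directions are ignored).
wc <- cluster_walktrap(g, steps = 4)

# Store both results as vertex attributes (assumed names) so that a
# visualization tool can map them to label size and node colour.
V(g)$importance <- pr
V(g)$cluster    <- as.integer(membership(wc))

# Packages ordered from most to least "important".
sort(pr, decreasing = TRUE)
```

Keeping the scores and cluster ids as vertex attributes means they travel with the graph if it is later exported.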

[Figure: CRAN-colour]

This image quite clearly highlights several clusters:

  • MASS, in yellow. This is a large cluster of packages that includes lattice and Matrix, together with many others that seem to expose statistical functionality
  • Rcpp, in light blue. Rcpp allows any package or script to call C++ code for high performance
  • ggplot2, in darker blue. This cluster, sometimes called the Hadleyverse, contains packages such as plyr, dplyr and their dependencies, e.g. scales and RColorBrewer.
  • sp, in green. This cluster contains a large number of packages that expose spatial statistics features, including spatstat, maps and mapproj

It turns out that Rcpp has a slightly higher page rank than MASS. This made Dirk Eddelbuettel very happy:

[Figure: Eddelbuettel-tweet]

You can find my slides at SlideShare and my source code on github.

Finally, my thanks to Gabor Csardi, maintainer of the igraph package, who listened to my ideas and gave helpful hints prior to the presentation.
