Graphical Display of R Package Dependencies
In some work that I am currently involved in, we have to decide which GUI engine we should use. As an obvious starting point, we decided to have a look at what other people are using in their packages. While CRAN helpfully lists all the available R packages, it doesn't (as far as I can tell) give a nice summary of package dependencies. After clicking on a few dozen packages and examining their dependencies, I decided that a quick script was in order.
General idea
- Scrape the package names from the main CRAN packages page.
- For each package, scrape its web page and retrieve its dependencies.
For example, ADaCGH lists a large number of packages in its "DEPENDS" section.
Pre-processing
To make life easier, I made a few simplifications to the data:
- any dependencies on R, MASS, stats, methods and utils were removed when plotting;
- I removed any Bioconductor and Omegahat packages;
- version numbers in the DEPENDS section were ignored.
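In terms of code, the first of these simplifications is just a filter on the edge list built by the script at the end of this post (the dep_df object); the dep_df_plot name below is only mine:
## Drop the base-style dependencies before plotting
ignored = c("R", "MASS", "stats", "methods", "utils")
dep_df_plot = dep_df[!(dep_df$to %in% ignored), ]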
It should be stressed that I'm only picking up what is listed in the DEPENDS section. For example, suppose a package depends on both "ggplot2" and "plyr". Since "ggplot2" itself depends on "plyr", the package author may only list "ggplot2", so the dependency on "plyr" would not be counted.
Results
The top six packages based on the DEPENDS section are:
- lattice – 165 times
- survival – 107
- mvtnorm – 103
- tcltk – 76
- graphics – 76
- grid – 60
You could argue that I should remove "graphics" by the same arbitrary criterion I used when removing "MASS". The total number of packages referred to in the DEPENDS section is 782 (out of a possible 3000 packages). The following graph plots each package against the number of times it appears in the DEPENDS section of another package. There is a clear exponential decay, highlighting a few key packages.
In fact, the top 40 packages account for 50% of all dependencies, and that is after the dependencies on R, utils, methods, etc. were removed.
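For reference, here is a rough sketch of how these numbers can be pulled out of the filtered edge list from the sketch above (dep_df_plot); the plotting call is just an illustration, not necessarily the one I used:
## How often does each package appear as a dependency of another package?
dep_counts = sort(table(dep_df_plot$to), decreasing=TRUE)
head(dep_counts, 6)
## Share of all dependency edges covered by the top 40 packages
cumsum(dep_counts)[40] / sum(dep_counts)
## A simple plot of the (rapidly decaying) counts
plot(as.vector(dep_counts), type="h",
     xlab="Package rank", ylab="Times listed in DEPENDS")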
I also constructed a graphical network using Cytoscape. However, it's quite large (~2MB), so you can download the network separately. To construct this network, I only used packages that had three or more dependencies. There were also a dozen or so smaller disconnected graphs that I pruned.
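As an illustration only (the pruning itself was done in Cytoscape; the igraph package here is just one way to sketch the same idea in R):
require("igraph")
## Keep only packages that list three or more dependencies
n_deps = table(dep_df$from)
dep_net = dep_df[dep_df$from %in% names(n_deps)[n_deps >= 3], ]
## Build a directed graph and keep the largest connected component,
## mirroring the manual pruning of the small disconnected graphs
g = graph.data.frame(dep_net, directed=TRUE)
cl = clusters(g)
g_main = induced.subgraph(g, which(cl$membership == which.max(cl$csize)))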
R Details
- To scrape the web pages I used regular expressions. Yes, I know you shouldn't parse HTML with regular expressions and should use a proper HTML parser, but:
  - the web pages were all well formed, since they are generated from each package's DESCRIPTION file;
  - I needed the practice with regular expressions.
- The R code is at the end of this post.
- You can download a csv file of the edge list from here.
require("stringr")
####################
## Get dependencies
####################
getDependencies = function(pkg_name) {
    url_st = "http://cran.r-project.org/web/packages"
    url_end = "index.html"
    url = paste(url_st, pkg_name, url_end, sep="/")
    cran_web = paste(readLines(url), collapse="")
    ## No Depends section on the package page
    if(regexpr("<td valign=top>Depends:</td><td>", cran_web) == -1)
        return(NULL)
    ## Get the table
    hrefs = gsub('(.*<td valign=top>Depends:</td><td>)', "", cran_web)
    ## Clean the td & tr tags
    hrefs = gsub('</td></tr>.*', "", hrefs)
    ## Remove R from dependencies
    hrefs = gsub('R .*?<', "<", hrefs)
    ## Remove versions
    hrefs = gsub("\\(&[ge; 0-9\\.\\-]*\\)", "", hrefs)
    ## Remove Bioconductor
    hrefs = gsub("<a href=\"http://www.bioconductor.org/packages/release/[a-z]*/html/([a-zA-Z0-9\\.]*)\"><span class=\"BioC\">([A-Za-z0-9\\.]*)</span></a>",
                 "", hrefs)
    ## Remove Omegahat
    hrefs = gsub("<a href=\"http://www.omegahat.org/[A-Za-z0-9]*\"><span class=\"Ohat\">[A-Za-z0-9]*</span></a>",
                 "", hrefs)
    ## Get dependencies
    depends_on = gsub("<a href=\"../([0-9A-Za-z\\.]*)/index.html\">[0-9A-Za-z\\.]*</a>", "\\1", hrefs)
    ## Unlist and remove white space
    depends_on = strsplit(depends_on, ",")[[1]]
    depends_on = as.vector(sapply(depends_on, str_trim))
    depends_on = depends_on[sapply(depends_on, nchar) > 0]
    return(depends_on)
}
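## Quick check of the function (the package name is just an example;
## the exact result depends on what the CRAN page lists at the time):
## getDependencies("ggplot2")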
####################
## Main page: get the list of all CRAN packages
####################
url = "http://cran.r-project.org/web/packages/"
cran_web_page = paste(readLines(url), collapse="")
main_table = gsub('.*<table summary="Available CRAN packages.">(.*)</table>.*', "\\1", cran_web_page)
main_table = gsub('<tr id="available-packages-[A-Z]"/>', "", main_table)
## Extract the package names from the table rows
depends_on = gsub('<tr valign="top"><td><a href=\"../../web/packages/([0-9A-Za-z\\.]*)/index.html\">[0-9A-Za-z\\.]*</a></td><td>.*?</td></tr>',
                  "\\1 ", main_table)
cran_packages = unlist(strsplit(depends_on, " "))
## Pre-allocate the edge list: from = package, to = one of its dependencies
from = vector("character", 10000)
to = vector("character", 10000)
j = 1
for(i in 1:length(cran_packages)) {
    dependencies = getDependencies(cran_packages[i])
    cat(i, ":", dependencies, "\n")
    if(!is.null(dependencies) && length(dependencies) > 0) {
        l = length(dependencies) - 1
        from[j:(j+l)] = cran_packages[i]
        to[j:(j+l)] = dependencies
        j = j + l + 1
    }
}
## j points at the first unused slot, so keep rows 1:(j-1)
dep_df = data.frame(from=from, to=to)
dep_df = dep_df[1:(j-1), ]
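Finally, the csv file of edges mentioned above can be written out along these lines (the file name is just a placeholder):
## Save the edge list for download / Cytoscape import
write.csv(dep_df, "cran_dependency_edges.csv", row.names=FALSE)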
