In some work that I am currently involved in, we have to decide which GUI engine we should use. As an obvious starter, we decided to have a look at what other people are using in their packages. While CRAN helpfully displays all the R packages that are available, it doesn’t, as far as I know, give a nice summary of the package dependencies. After clicking on a few dozen packages and examining their dependencies, I decided that a quick script was in order.
General idea
- Scrape the package names from the main CRAN package web-site.
- For each package, scrape the associated web-page and retrieve its dependencies.
For example, ADaCGH has a large number of packages under the “DEPENDS” section.
Pre-processing
To make life easier, I made a few simplifications to the data:
- any dependencies on R, MASS, stats, methods and utils were removed when plotting;
- I removed any Bioconductor and Omegahat packages;
- version numbers in the DEPENDS section were ignored.
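The simplifications above can be sketched on a single DEPENDS string. This is only an illustration: `clean_depends` and `ignore_pkgs` are hypothetical names, the input string is made up, and the Bioconductor and Omegahat filtering is not covered because it relies on the links in the web-page rather than on the package names.

```r
# Packages that are dropped (hypothetical helper vector)
ignore_pkgs <- c("R", "MASS", "stats", "methods", "utils")

# Turn one DEPENDS string into a vector of package names
clean_depends <- function(depends_field, ignore = ignore_pkgs) {
  pkgs <- strsplit(depends_field, ",", fixed = TRUE)[[1]]
  pkgs <- sub("\\(.*\\)", "", pkgs)  # ignore version numbers in parentheses
  pkgs <- trimws(pkgs)               # remove surrounding white space
  pkgs <- pkgs[nzchar(pkgs)]         # drop empty entries
  setdiff(pkgs, ignore)              # drop R, MASS, stats, methods, utils
}

# Made-up example string
cleaned <- clean_depends("R (>= 2.10), MASS, ggplot2 (>= 0.9), plyr")
print(cleaned)
```

Note that in the post these packages were only removed when plotting, so in practice the `ignore` step could be applied later.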
It should be stressed that I’m only picking up what is listed in the DEPENDS section. For example, suppose a package depends on both “ggplot2” and “plyr”. Since “ggplot2” depends on “plyr”, the package author may only list “ggplot2”, and the dependency on “plyr” would then be missed.
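To make the difference between listed and implied dependencies concrete, here is a small sketch on a toy edge list. The package `myPkg`, the data frame `edges` and the function `all_dependencies` are hypothetical; only the ggplot2 and plyr relationship comes from the example above.

```r
# Toy edge list: "from" depends on "to"
edges <- data.frame(
  from = c("myPkg", "ggplot2"),
  to   = c("ggplot2", "plyr"),
  stringsAsFactors = FALSE
)

# Follow the edges breadth-first to collect direct and indirect dependencies
all_dependencies <- function(pkg, edges) {
  seen <- character(0)
  queue <- pkg
  while (length(queue) > 0) {
    current <- queue[1]
    queue <- queue[-1]
    direct <- edges$to[edges$from == current]
    new_pkgs <- setdiff(direct, seen)  # only visit each package once
    seen <- c(seen, new_pkgs)
    queue <- c(queue, new_pkgs)
  }
  seen
}

listed  <- edges$to[edges$from == "myPkg"]   # what the DEPENDS section shows
implied <- all_dependencies("myPkg", edges)  # what is actually needed
print(listed)
print(implied)
```

The counts in this post correspond to `listed`, not `implied`.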
Results
The top six packages based on the DEPENDS section are:
- lattice – 165 times
- survival – 107
- mvtnorm – 103
- tcltk – 76
- graphics – 76
- grid – 60
You could argue that I should remove “graphics” by the same arbitrary criterion I used when removing “MASS”. The total number of packages that are referred to in the DEPENDS section is just over 782 (out of a possible 3000 packages). The following graph plots the package name against the number of times it appears in the DEPENDS section of another package. There is a clear exponential decay highlighting a few key packages.
In fact the top 40 packages account for 50% of all dependencies, and that’s after the dependencies on R, utils, methods, etc. were removed.
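A ranking and a cumulative share of this kind can be computed from the “to” column of an edge list. The sketch below uses a tiny made-up vector rather than the real CRAN data, so the numbers are purely illustrative; `to`, `counts`, `cum_share` and `n_for_half` are names chosen for this example.

```r
# Toy "to" column: one entry per appearance in a DEPENDS section
to <- c("lattice", "lattice", "lattice", "survival", "survival", "grid")

# How often each package is depended on, most popular first
counts <- sort(table(to), decreasing = TRUE)

# Cumulative share of all dependencies covered by the top k packages
cum_share <- cumsum(as.vector(counts)) / length(to)
names(cum_share) <- names(counts)

# Smallest number of packages that covers at least half of the dependencies
n_for_half <- which(cum_share >= 0.5)[1]

print(counts)
print(cum_share)
print(n_for_half)
```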
I also constructed a graphical network using Cytoscape. However, it’s quite large (~2MB). You can download the network separately. To construct this network, I only used packages that had three or more dependencies. There were a dozen or so smaller graphs that I pruned.
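One way to do this kind of filtering on an edge list is sketched below. It assumes that “three or more dependencies” means a package lists at least three entries in its own DEPENDS section, which is only one reading of the sentence above. The data are made up, and the removal of the small disconnected graphs is not shown.

```r
# Toy edge list: "from" depends on "to" (hypothetical packages pkgA, pkgB, pkgC)
edges <- data.frame(
  from = c("pkgA", "pkgA", "pkgA", "pkgB", "pkgC", "pkgC", "pkgC", "pkgC"),
  to   = c("lattice", "grid", "survival", "lattice",
           "grid", "tcltk", "mvtnorm", "lattice"),
  stringsAsFactors = FALSE
)

# Number of dependencies listed by each package
n_deps <- table(edges$from)

# Keep only packages that list three or more dependencies
keep <- names(n_deps)[as.vector(n_deps) >= 3]
pruned <- edges[edges$from %in% keep, ]

# Write the pruned edge list to a csv file (a temporary file in this sketch)
csv_path <- tempfile(fileext = ".csv")
write.csv(pruned, csv_path, row.names = FALSE)
print(pruned)
```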
R Details
- To scrape the web-pages I used regular expressions. Yes, I know you shouldn’t use regular expressions for parsing HTML, and should use a proper HTML parser, but
  - the web-pages were all well formed since they were generated from the package DESCRIPTION file
  - I needed practice with regular expressions
- The R code is at the end of this post
- You can download a csv file of the edge list from here
## library() stops with an error if stringr is missing
library("stringr")
####################
## Get dependencies
####################
getDependencies = function(pkg_name) {
  ## Build the URL of the package web-page and read it as a single string
  url_st = "http://cran.r-project.org/web/packages"
  url_end = "index.html"
  url = paste(url_st, pkg_name, url_end, sep="/")
  cran_web = paste(readLines(url), collapse="")
  ## No Depends row on the page: nothing to return
  if(regexpr("<td valign=top>Depends:</td><td>", cran_web) == -1)
    return(NULL)
  ## Get the table
  hrefs = gsub('(.*<td valign=top>Depends:</td><td>)',"", cran_web)
  ## Clean the td & tr tags
  hrefs = gsub('</td></tr>.*',"", hrefs)
  ## Remove R from dependencies
  hrefs = gsub('R .*?<',"<", hrefs)
  ## Remove versions, e.g. (&ge; 2.10)
  hrefs = gsub("\\(&[ge; 0-9\\.\\-]*\\)", "", hrefs)
  ## Remove Bioconductor
  hrefs = gsub("<a href=\"http://www.bioconductor.org/packages/release/[a-z]*/html/([a-zA-Z0-9\\.]*)\"><span class=\"BioC\">([A-Za-z0-9\\.]*)</span></a>",
               "", hrefs)
  ## Remove Omegahat
  hrefs =
    gsub("<a href=\"http://www.omegahat.org/[A-Za-z0-9]*\"><span class=\"Ohat\">[A-Za-z0-9]*</span></a>",
         "", hrefs)
  ## Get dependencies: replace each link with the package name
  depends_on = gsub("<a href=\"../([0-9A-Za-z\\.]*)/index.html\">[0-9A-Za-z\\.]*</a>", "\\1", hrefs)
  ## Split on commas, remove white space and drop empty entries
  depends_on = strsplit(depends_on, ",")[[1]]
  depends_on = str_trim(depends_on)
  depends_on = depends_on[nchar(depends_on) > 0]
  return(depends_on)
}
####################
## Main page
####################
url = "http://cran.r-project.org/web/packages/"
cran_web_page = paste(readLines(url), collapse="")
main_table = gsub('.*<table summary="Available CRAN packages.">(.*)</table>.*', "\\1", cran_web_page)
main_table = gsub('<tr id="available-packages-[A-Z]"/>', "", main_table)
depends_on =
  gsub('<tr valign="top"><td><a href=\"../../web/packages/([0-9A-Za-z\\.]*)/index.html\">[0-9A-Za-z\\.]*</a></td><td>.*?</td></tr>',
       "\\1 ", main_table)
cran_packages = unlist(strsplit(depends_on, " "))
## Pre-allocate the edge list; j is the next free position
from = vector("character", 10000)
to = vector("character", 10000)
j = 1
for(i in seq_along(cran_packages)) {
  dependencies = getDependencies(cran_packages[i])
  cat(i, ":", dependencies, "\n")
  if(!is.null(dependencies) &&
     length(dependencies) > 0) {
    l = length(dependencies) - 1
    from[j:(j+l)] = cran_packages[i]
    to[j:(j+l)] = dependencies
    j = j + l + 1
  }
}
dep_df = data.frame(from=from, to=to, stringsAsFactors=FALSE)
## j points one past the last filled row, so keep rows 1 to j-1
dep_df = dep_df[seq_len(j - 1),]
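As an alternative to scraping, the same kind of edge list can be built from a package matrix with “Package” and “Depends” columns, which is the shape returned by `available.packages()` in the utils package. This is a different technique from the regular-expression approach above, not the method used for the results in this post. The function name `depends_edges` and the toy matrix are made up for illustration, and only the dependency on R itself is dropped here.

```r
# Build a from/to edge list from a character matrix with
# "Package" and "Depends" columns
depends_edges <- function(pkg_db) {
  edge_list <- lapply(seq_len(nrow(pkg_db)), function(i) {
    field <- pkg_db[i, "Depends"]
    if (is.na(field)) return(NULL)               # package lists no dependencies
    deps <- strsplit(field, ",", fixed = TRUE)[[1]]
    deps <- trimws(sub("\\(.*\\)", "", deps))    # ignore version numbers
    deps <- deps[nzchar(deps) & deps != "R"]     # drop empty entries and R
    if (length(deps) == 0) return(NULL)
    data.frame(from = unname(pkg_db[i, "Package"]), to = deps,
               row.names = NULL, stringsAsFactors = FALSE)
  })
  edge_list <- Filter(Negate(is.null), edge_list)
  do.call(rbind, edge_list)
}

# Toy matrix standing in for the real package database
toy_db <- matrix(
  c("pkgA", "pkgB", "pkgC",
    "R (>= 2.10), lattice, survival (>= 2.0)", NA, "lattice"),
  ncol = 2,
  dimnames = list(c("pkgA", "pkgB", "pkgC"), c("Package", "Depends"))
)
toy_edges <- depends_edges(toy_db)
print(toy_edges)
```

With a network connection, `depends_edges(available.packages())` should give the edge list for whichever repositories are configured in the R session.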