Graphical Display of R Package Dependencies

March 23, 2011

(This article was first published on Why? » R, and kindly contributed to R-bloggers)

In some work that I am currently involved in, we have to decide which GUI engine we should use. As an obvious starter, we decided to have a look at what other people are using in their packages. While CRAN helpfully displays all the R packages that are available, it doesn’t (I don’t think) give a nice summary of the package dependencies. After clicking on a few dozen packages and examining their dependencies, I decided that a quick script was in order.

General idea

  1. Scrape the package names from the main CRAN package web page.
  2. For each package, scrape the associated web page and retrieve its dependencies.

For example, ADaCGH has a large number of packages under the “DEPENDS” section.
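As a rough illustration of step 2, the sketch below pulls package names out of a single "Depends" table row. The HTML fragment is hypothetical, modelled on the patterns matched by the regular expressions in the script at the end of this post, and `extract_depends()` is an illustrative helper rather than part of that script.

```r
# Hypothetical HTML fragment, shaped like the rows the full script expects.
html <- paste0(
  '<tr><td valign=top>Depends:</td><td>R (&ge; 2.10), ',
  '<a href="../lattice/index.html">lattice</a>, ',
  '<a href="../plyr/index.html">plyr</a></td></tr>'
)

# Illustrative helper: return the packages linked from the "Depends" cell.
extract_depends <- function(page) {
  # No "Depends" row means there is nothing to extract.
  if (!grepl("<td valign=top>Depends:</td><td>", page, fixed = TRUE)) {
    return(character(0))
  }
  # Isolate the table cell that follows the "Depends:" label.
  cell <- sub(".*<td valign=top>Depends:</td><td>", "", page)
  cell <- sub("</td></tr>.*", "", cell)
  # Find every relative link of the form ../<package>/index.html ...
  pattern <- 'href="\\.\\./([^/]+)/index\\.html"'
  links <- regmatches(cell, gregexpr(pattern, cell))[[1]]
  # ... and keep only the package name captured from each link.
  sub(pattern, "\\1", links)
}

print(extract_depends(html))
```

Only linked packages are returned here, so the dependency on R itself (which is not a link in this fragment) drops out without special handling.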

Pre-processing

To make life easier, I made a few simplifications to the data:

  • any dependencies on R, MASS, stats, methods and utils were removed when plotting;
  • I removed any Bioconductor and Omegahat packages;
  • version numbers in the DEPENDS section were ignored.
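A minimal sketch of two of these simplifications, applied to a plain-text DEPENDS field rather than to the HTML handled by the full script. The helper `clean_depends()`, the `ignored` vector and the example field are all made up for illustration; Bioconductor and Omegahat filtering is not covered here.

```r
# Packages that are dropped, as listed above (illustrative variable name).
ignored <- c("R", "MASS", "stats", "methods", "utils")

# Illustrative helper: turn a raw DEPENDS string into a clean package vector.
clean_depends <- function(field, drop = ignored) {
  pkgs <- strsplit(field, ",", fixed = TRUE)[[1]]
  pkgs <- gsub("\\([^)]*\\)", "", pkgs)  # strip version constraints in parentheses
  pkgs <- trimws(pkgs)                   # remove surrounding white space
  pkgs <- pkgs[nzchar(pkgs)]             # discard empty entries
  setdiff(pkgs, drop)                    # remove the ignored packages
}

# Hypothetical DEPENDS field.
print(clean_depends("R (>= 2.10), lattice, MASS (>= 7.3), plyr"))
```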

It should be stressed that I’m only picking up what is listed in the DEPENDS section. For example, suppose a package depends on both “ggplot2” and “plyr”. Since “ggplot2” depends on “plyr”, the package author may only list “ggplot2”, in which case the dependency on “plyr” is not counted.
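To make the distinction concrete, here is a small sketch that contrasts direct dependencies with the full set reached by following the edge list. The package `mypkg`, the toy edge list and the helper `all_depends()` are hypothetical; none of this is part of the analysis in this post, which counts direct entries only.

```r
# Toy edge list: "mypkg" lists only ggplot2, and ggplot2 lists plyr.
edges <- data.frame(
  from = c("mypkg", "ggplot2"),
  to   = c("ggplot2", "plyr"),
  stringsAsFactors = FALSE
)

# Illustrative helper: breadth-first walk over the edge list.
all_depends <- function(pkg, edges) {
  found <- character(0)
  queue <- pkg
  while (length(queue) > 0) {
    current <- queue[1]
    queue <- queue[-1]
    direct <- edges$to[edges$from == current]
    unseen <- setdiff(direct, found)  # skip packages already visited
    found <- c(found, unseen)
    queue <- c(queue, unseen)
  }
  found
}

print(edges$to[edges$from == "mypkg"])  # direct dependencies only
print(all_depends("mypkg", edges))      # direct and indirect dependencies
```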

Results

The top six packages based on the DEPENDS section are:

  • lattice – 165 times
  • survival – 107
  • mvtnorm – 103
  • tcltk – 76
  • graphics – 76
  • grid – 60

You could argue that I should remove “graphics” by the same arbitrary criterion I used when removing “MASS”. The total number of packages that are referred to in the DEPENDS section is just over 782 (out of a possible 3000 packages). The following graph plots the package name against the number of times it appears in the DEPENDS section of another package. There is a clear exponential decay highlighting a few key packages.

In fact, the top 40 packages account for 50% of all dependencies, and that’s after the dependencies on R, utils, methods, etc. were removed.
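Counts and cumulative shares of this kind can be computed from an edge list along the following lines. This is a sketch on a made-up data frame with the same `from` and `to` columns as the script at the end of this post; the package names `pkgA` to `pkgF` and the numbers are not the real data.

```r
# Toy edge list: each row says that `from` lists `to` in its DEPENDS section.
dep_df <- data.frame(
  from = c("pkgA", "pkgB", "pkgC", "pkgD", "pkgE", "pkgF"),
  to   = c("lattice", "lattice", "lattice", "survival", "survival", "grid"),
  stringsAsFactors = FALSE
)

# How often does each package appear as a dependency?
counts <- sort(table(dep_df$to), decreasing = TRUE)

# Cumulative share of all dependencies covered by the top-ranked packages.
cum_share <- cumsum(as.vector(counts)) / sum(counts)
names(cum_share) <- names(counts)

# Smallest number of top packages that covers at least half of the edges.
n_for_half <- which(cum_share >= 0.5)[1]

print(counts)
print(cum_share)
print(n_for_half)
```

Passing `counts` to `barplot()` would give a rank plot similar in spirit to the graph described above.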

I also constructed a graphical network using Cytoscape. However, it’s quite large (~2MB). You can download the network separately. To construct this network, I only used packages that had three or more dependencies. There were a dozen or so smaller graphs that I pruned.
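One way such a filter might look is sketched below. It assumes that "three or more dependencies" means a package lists at least three others in its DEPENDS section; the toy edge list and the output file name are made up, and the removal of small disconnected graphs is not shown.

```r
# Toy edge list with the same from/to layout as the full script.
edges <- data.frame(
  from = c("pkgA", "pkgA", "pkgA", "pkgB",
           "pkgC", "pkgC", "pkgC", "pkgC"),
  to   = c("lattice", "grid", "survival", "lattice",
           "grid", "mvtnorm", "tcltk", "lattice"),
  stringsAsFactors = FALSE
)

# Number of dependencies listed by each package.
n_deps <- table(edges$from)

# Keep only edges that start at a package with three or more dependencies.
keep <- names(n_deps)[as.vector(n_deps) >= 3]
pruned <- edges[edges$from %in% keep, ]

# Write a delimited edge list to a temporary file (hypothetical file name).
out_file <- file.path(tempdir(), "cran_depends_edges.csv")
write.csv(pruned, out_file, row.names = FALSE, quote = FALSE)

print(pruned)
```

A two-column source/target file like this is a common format for loading a network into a graph tool.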

R Details

  • To scrape the web pages I used regular expressions. Yes, I know you shouldn’t use regular expressions for parsing HTML, and should use a proper HTML parser, but
    • the web-pages were all well formed since they were generated from the package DESCRIPTION file
    • I needed practice with regular expressions
    • the R code is at the end of this post
  • You can download a CSV file of the edge list from here

library(stringr)  ## provides str_trim()

####################
## Get dependencies
####################
getDependencies = function(pkg_name) {
  url_st = "http://cran.r-project.org/web/packages"
  url_end = "index.html"
  url = paste(url_st, pkg_name, url_end, sep="/")

  cran_web = paste(readLines(url), collapse="")

  ## Packages without a "Depends" row have nothing to report
  if(regexpr("<td valign=top>Depends:</td><td>", cran_web) == -1)
    return(NULL)

  ## Get the table
  hrefs = gsub('(.*<td valign=top>Depends:</td><td>)',"", cran_web)

  ## Clean the td & tr tags
  hrefs = gsub('</td></tr>.*',"", hrefs)

  ## Remove R from dependencies
  hrefs = gsub('R .*?<',"<", hrefs)
  ## Remove version constraints such as (&ge; 2.10)
  hrefs = gsub("\\(&[ge; 0-9\\.\\-]*\\)", "", hrefs)

  ## Remove Bioconductor
  hrefs = gsub("<a href=\"http://www.bioconductor.org/packages/release/[a-z]*/html/([a-zA-Z0-9\\.]*)\"><span class=\"BioC\">([A-Za-z0-9\\.]*)</span></a>",
       "", hrefs)

  ## Remove Omegahat
  hrefs =
  gsub("<a href=\"http://www.omegahat.org/[A-Za-z0-9]*\"><span class=\"Ohat\">[A-Za-z0-9]*</span></a>",
  "", hrefs)

  ## Get dependencies
  depends_on = gsub("<a href=\"../([0-9A-Za-z\\.]*)/index.html\">[0-9A-Za-z\\.]*</a>",  "\\1", hrefs)

  ##Unlist and remove white space
  depends_on = strsplit(depends_on, ",")[[1]]
  depends_on = as.vector(sapply(depends_on, str_trim))
  depends_on = depends_on[sapply(depends_on, nchar)>0]
  return(depends_on)
}

###########
#Main Page
url = "http://cran.r-project.org/web/packages/"
cran_web_page = paste(readLines(url), collapse="")

main_table = gsub('.*<table summary="Available CRAN packages.">(.*)</table>.*', "\\1", cran_web_page)
main_table = gsub('<tr id="available-packages-[A-Z]"/>', "", main_table)

depends_on =
  gsub('<tr valign="top"><td><a href=\"../../web/packages/([0-9A-Za-z\\.]*)/index.html\">[0-9A-Za-z\\.]*</a></td><td>.*?</td></tr>',
       "\\1 ", main_table)

cran_packages = unlist(strsplit(depends_on, " "))
from = vector("character", 10000)
to = vector("character", 10000)

j = 1  ## index of the next free slot in from/to
for(i in seq_along(cran_packages)) {
  dependencies = getDependencies(cran_packages[i])
  cat(i, ":", dependencies, "\n")
  if(!is.null(dependencies) &&
     length(dependencies) > 0) {
    l = length(dependencies) - 1
    from[j:(j+l)] = cran_packages[i]
    to[j:(j+l)] = dependencies
    j = j + l + 1
  }
}

## j points one past the last edge, so keep rows 1 to j - 1
dep_df = data.frame(from=from, to=to)
dep_df = dep_df[seq_len(j - 1),]
