Graphical Display of R Package Dependencies

[This article was first published on Why? » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In some work that I am currently involved in, we have to decide which GUI engine we should use. As an obvious starter, we decided to have a look at what other people are using in their packages. While cran helpfully displays all the R packages that are available, it doesn’t (I don’t think), give a nice summary of the package dependencies. After clicking on a few dozen packages and examining their dependencies, I decided that a quick script was in order.

General idea

  1. Scrap the package names the main cran package web-site
  2. For each package, scrap the associated web-page and retrieve its dependencies.

For example, ADaCGH has a large number of packages under the “DEPENDS” section.

Pre-processing

To make life easier, I made a few simplifications to the data:

  • any dependencies on R, MASS, stats, methods and utils were removed when plotting;
  • I removed any bioconductor and omega hat packages;
  • version numbers in the DEPENDS section were ignored.

It should be stressed that I’m only picking up what is listed in the DEPENDS section. For example, suppose a package depends on both”ggplot2″ and “plyr”. Since “ggplot2″ depends on “plyr” the package author may only list “ggplot2″

Results

The top six packages based on the DEPENDS section are:

  • lattice – 165 times
  • survival – 107
  • mvtnorm – 103
  • tcltk – 76
  • graphics – 76
  • grid – 60

You could argue that I should remove “graphics” by the same arbitrary criteria I used when removing “MASS”. The total number of packages that are referred to in the DEPENDS section is just over 782 (out of a possible 3000 packages).  The following graph plots the package name against the number of times it appears in the DEPENDS section of another package. There is a clear exponential decay highlighting a few key packages.

In fact the top 40 packages, account for 50% of all dependencies, and that’s after the dependencies on R, utils, methods,.. were removed.

I also constructed a graphical network using cytoscape. However, it’s quite large (~2MB). You can download the network separately. To construct this network, I only used packages that had three or more dependencies. There were a dozen or so smaller graphs that I pruned.

R Details

  • To scape the web-pages I used regular expressions. Yes, I know you shouldn’t use regular expressions for parsing html, and should use a proper html parser, but
    • the web-pages were all well formed since they were generated from the package DESCRIPTION file
    • I needed practice with regular expressions
    • the R code is at the end of this post
  • You can download a csv file of the list edges from here
require("stringr")

####################
## Get dependencies
####################
getDependencies = function(pkg_name) {
  url_st = "http://cran.r-project.org/web/packages"
  url_end = "index.html"
  url = paste(url_st, pkg_name, url_end, sep="/")

  cran_web = paste(readLines(url), collapse="")

  if(regexpr("Depends:", cran_web) == -1)
    return()

  ## Get the table
  hrefs = gsub('(.*Depends:)',"", cran_web)

  ## Clean the td & tr tags
  hrefs = gsub('.*',"", hrefs)

  ## Remove R from dependencies
  hrefs = gsub('R .*?<',"<", hrefs)
  ## Remove versions
  hrefs = gsub("\\(&[ge; 0-9\\.\\-]*)", "", hrefs)

  ## Remove Bioconductor
  hrefs = gsub("([A-Za-z0-9\\.]*)",
       "", hrefs)

  ## Remove Omegahat
  hrefs =
  gsub("[A-Za-z0-9]*",
  "", hrefs)

  ## Get dependencies
  depends_on = gsub("[0-9A-Za-z\\.]*",  "\\1", hrefs)

  ##Unlist and remove white space
  depends_on = strsplit(depends_on, ",")[[1]]
  depends_on = as.vector(sapply(depends_on, str_trim))
  depends_on = depends_on[sapply(depends_on, nchar)>0]
  return(depends_on)
}

###########
#Main Page
url = "http://cran.r-project.org/web/packages/"
cran_web_page = paste(readLines(url), collapse="")

main_table = gsub('.*(.*)
.*', "\\1", cran_web_page) main_table = gsub('', "", main_table) depends_on = gsub('[0-9A-Za-z\\.]*.*?', "\\1 ", main_table) cran_packages = unlist(strsplit(depends_on, " ")) from = vector("character", 10000) to = vector("character", 10000) j = 1 for(i in 1:length(cran_packages)) { dependencies = getDependencies(cran_packages[i]) cat(i, ":", dependencies, "\n") if(!is.null(dependencies) && length(dependencies) > 0) { l = length(dependencies) - 1 from[j:(j+l)] = cran_packages[i] to[j:(j+l)] = dependencies j = j + l + 1 } } dep_df = data.frame(from=from, to=to) dep_df = dep_df[1:j,]

To leave a comment for the author, please follow the link and comment on their blog: Why? » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)