Exploring R Packages with cranly


In a previous post, I showed a very simple example of using the R function tools::CRAN_package_db() to analyze information about CRAN packages. CRAN_package_db() extracts the metadata CRAN stores on all of its 12,000-plus packages and arranges it into a “database”: in reality, a complicated data frame in which some columns hold vectors or lists as entries.

It’s simple to run the function, and it doesn’t take very long on my MacBook Air.

p_db <- tools::CRAN_package_db()

The following gives some insight into what’s contained in the data frame.

dim(p_db)
## [1] 12635    65
matrix(names(p_db), ncol = 2)
##       [,1]                      [,2]                 
##  [1,] "Package"                 "Collate.windows"    
##  [2,] "Version"                 "Contact"            
##  [3,] "Priority"                "Copyright"          
##  [4,] "Depends"                 "Date"               
##  [5,] "Imports"                 "Description"        
##  [6,] "LinkingTo"               "Encoding"           
##  [7,] "Suggests"                "KeepSource"         
##  [8,] "Enhances"                "Language"           
##  [9,] "License"                 "LazyData"           
## [10,] "License_is_FOSS"         "LazyDataCompression"
## [11,] "License_restricts_use"   "LazyLoad"           
## [12,] "OS_type"                 "MailingList"        
## [13,] "Archs"                   "Maintainer"         
## [14,] "MD5sum"                  "Note"               
## [15,] "NeedsCompilation"        "Packaged"           
## [16,] "Additional_repositories" "RdMacros"           
## [17,] "Author"                  "SysDataCompression" 
## [18,] "Authors@R"               "SystemRequirements" 
## [19,] "Biarch"                  "Title"              
## [20,] "BugReports"              "Type"               
## [21,] "BuildKeepEmpty"          "URL"                
## [22,] "BuildManual"             "VignetteBuilder"    
## [23,] "BuildResaveData"         "ZipData"            
## [24,] "BuildVignettes"          "Published"          
## [25,] "Built"                   "Path"               
## [26,] "ByteCompile"             "X-CRAN-Comment"     
## [27,] "Classification/ACM"      "Reverse depends"    
## [28,] "Classification/ACM-2012" "Reverse imports"    
## [29,] "Classification/JEL"      "Reverse linking to" 
## [30,] "Classification/MSC"      "Reverse suggests"   
## [31,] "Classification/MSC-2010" "Reverse enhances"   
## [32,] "Collate"                 "MD5sum"             
## [33,] "Collate.unix"            "Package"

Looking at a few rows and columns gives a feel for how complicated its structure is.

p_db[1:10, c(1,2,4,5)]
##        Package Version                                             Depends
## 1           A3   1.0.0                      R (>= 2.15.0), xtable, pbapply
## 2       abbyyR   0.5.4                                        R (>= 3.2.0)
## 3          abc     2.1 R (>= 2.10), abc.data, nnet, quantreg, MASS, locfit
## 4     abc.data     1.0                                         R (>= 2.10)
## 5      ABC.RAP   0.9.0                                        R (>= 3.1.0)
## 6  ABCanalysis   1.2.1                                         R (>= 2.10)
## 7     abcdeFBA     0.4              Rglpk,rgl,corrplot,lattice,R (>= 2.10)
## 8     ABCoptim  0.15.0                                                <NA>
## 9        ABCp2     1.2                                                MASS
## 10       abcrf     1.7                                           R(>= 3.1)
##                                                                   Imports
## 1                                                                    <NA>
## 2                                  httr, XML, curl, readr, plyr, progress
## 3                                                                    <NA>
## 4                                                                    <NA>
## 5                                                  graphics, stats, utils
## 6                                                                 plotrix
## 7                                                                    <NA>
## 8                                            Rcpp, graphics, stats, utils
## 9                                                                    <NA>
## 10 readr, MASS, matrixStats, ranger, parallel, stringr, Rcpp (>=\n0.11.2)

So, having spent a little time learning how vexing working with this data can be, I was delighted when I discovered Ioannis Kosmidis’ cranly package during my March “Top 40” review. cranly is a very impressive package, built along tidy principles, that is helpful for learning about individual packages, analyzing the structure of package and author relationships, and searching for packages.
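
Both cranly and the tidyverse are on CRAN, so if they are not already installed, a one-time install.packages() call takes care of it:

install.packages(c("cranly", "tidyverse"))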

library(cranly)
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
## ✔ tibble  1.4.2     ✔ dplyr   0.7.5
## ✔ tidyr   0.8.1     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## ── Conflicts ─────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

The first really impressive feature is a “one button” clean function that does an amazing job of getting the data into shape to work with. In my preliminary work, I struggled just to get the author data clean: in the approach I took, getting rid of text like [aut, cre] to get a count of authors required more regular-expression work than I wanted to deal with. But clean_CRAN_db() does a good job of cleaning up the whole database. Note that the helper function clean_up_author() contains a considerable number of hard-coded text strings that must have taken hours to get right.
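
For a sense of the manual work this replaces, here is a rough sketch of my own (not cranly code) of the kind of regular-expression cleanup the raw Author field calls for:

# A crude manual pass at the Author field: strip role tags like [aut, cre],
# e-mail addresses in angle brackets, and parenthetical comments
raw_authors <- p_db$Author
no_roles    <- gsub("\\[[^]]*\\]", "", raw_authors)  # [aut, cre], [ctb], ...
no_emails   <- gsub("<[^>]*>", "", no_roles)         # <name@domain>
no_comments <- gsub("\\([^)]*\\)", "", no_emails)    # (comments, ORCID ids)
head(trimws(no_comments))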

package_db <- clean_CRAN_db(p_db)

Once you have the clean data, it is easy to run some pretty interesting analyses. This first example, straight out of the package vignette, builds the network of package relationships based on which packages import which, and then plots a summary for the top 20 most imported packages.

package_network <- build_network(package_db)
package_summaries <- summary(package_network)
plot(package_summaries, according_to = "n_imported_by", top = 20)
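
As a rough cross-check that does not use cranly, counting how often each package appears in the raw Imports field should give a broadly similar leader board. This is my own sketch; it simply splits the comma-separated Imports strings and drops the version constraints:

# Count how often each package is named in other packages' Imports fields
imp <- strsplit(p_db$Imports, ",\\s*")
imp <- lapply(imp, function(x) sub("\\s*\\(.*$", "", trimws(x)))  # drop "(>= x.y.z)"
head(sort(table(unlist(imp)), decreasing = TRUE), 20)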

There is also a built-in function to compute the importance, or relevance, of a package using the PageRank algorithm.

plot(package_summaries, according_to = "page_rank", top = 20)
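
For readers who want to see that computation spelled out, here is an independent sketch (mine, not cranly’s internals) that builds a directed graph from the raw Imports field and runs igraph’s page_rank() on it, assuming the igraph package is installed. The ordering may differ somewhat from cranly’s summary, which is built from more relationship types than Imports alone:

library(igraph)

# Edge list: importing package -> imported package
imp <- strsplit(p_db$Imports, ",\\s*")
imp <- lapply(imp, function(x) sub("\\s*\\(.*$", "", trimws(x)))
edges <- data.frame(from = rep(p_db$Package, lengths(imp)),
                    to   = unlist(imp),
                    stringsAsFactors = FALSE)
edges <- edges[!is.na(edges$to) & edges$to != "", ]

g <- graph_from_data_frame(edges)
head(sort(page_rank(g)$vector, decreasing = TRUE), 20)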

The build_network function also offers the opportunity to investigate the collaboration of package authors by building a network from the authors’ perspective.

author_network <- build_network(object = package_db, perspective = "author")

Here, we look at J.J. Allaire’s network. Setting exact = FALSE tells cranly to use partial matching of the author name rather than requiring an exact match.

plot(author_network, author = "JJ Allaire", exact = FALSE)
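
To see which packages sit behind the nodes in that plot, a quick base-R check against the raw Author field also works; this is just a crude sketch on the uncleaned metadata, so it picks up anyone named Allaire rather than J.J. specifically:

# Packages whose raw Author field mentions Allaire
sort(p_db$Package[which(grepl("Allaire", p_db$Author, fixed = TRUE))])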

It is also possible to study individual packages. Here, I plot the very simple dependency tree for the time series package xts. There is a very good argument to be made that the simpler the dependency tree, the more stable and reliable the package.

xts_tree <- build_dependence_tree(package_network, "xts")
plot(xts_tree)
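
For comparison, base R can report the same kind of information: tools::package_dependencies() computes recursive dependencies from the standard available.packages() metadata rather than from cranly’s network:

# Recursive hard dependencies (Depends, Imports, LinkingTo) of xts
ap <- available.packages()
tools::package_dependencies("xts", db = ap, recursive = TRUE)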

As a final example, consider how the package_with() function might be used to search for Bayesian packages by looking for packages with “Bayes” or “MCMC” in their names. I don’t believe that this exhausts the possibilities of cranly, but it should be clear that the package is a very useful tool for looking into the mysteries of CRAN.

Bayesian_packages <- package_with(package_network, name = c("Bayes", "MCMC"))
plot(package_network, package = Bayesian_packages, legend = FALSE)
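
Since package_with() matches on package names, a complementary sweep over the Description field can be run directly on the raw metadata; here is a quick sketch:

# Packages whose Description mentions Bayes or MCMC (case-insensitive)
desc_hits <- p_db$Package[which(grepl("Bayes|MCMC", p_db$Description, ignore.case = TRUE))]
length(desc_hits)
head(desc_hits)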
