Fishing for packages in CRAN

June 18, 2015
(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Joseph Rickert

It is incredibly challenging to keep up to date with R packages. As of today (6/16/15), there are 6,789 listed on CRAN. Of course, the CRAN Task Views are probably the best resource for finding what's out there. A tremendous amount of work goes into maintaining and curating these pages, and we should all be grateful for the expertise, dedication and efforts of the task view maintainers. But R continues to grow at a tremendous rate. (Have a look at the growth curve in Bob Muenchen's 5/22/15 post "R Now Contains 150 Times as Many Commands as SAS".) CRANberries, a site that tracks new packages and package updates, indicates that over the last few months the list of R packages has been growing by about 100 packages per month. How can anybody hope to keep current?

So, on any given day, expect that finding out which R packages pertain to a particular topic will require some work. What follows is a beginner's guide to fishing for packages in CRAN. This example looks for "Bayesian" packages using some simple web page scraping and elementary text mining.

The Bayesian Inference Task View lists 144 packages. This is probably everything that is really important, but let's see what else is to be found that has anything at all to do with Bayesian inference. In the first block of code, R's available.packages() function fetches the list of packages available from my Windows PC. (This is an extremely interesting function and I don't do justice to it here.) Then, this list is used to scrape the package descriptions from the individual package web pages. The loop takes some time to run, so I saved the package descriptions both in a csv file and in an .RData workspace.

library(svTools)
library(RCurl)
library(tm)
#-----------------------------------------
# TWO HELPER FUNCTIONS
# Function to get package description from CRAN package page
getDesc <- function(package){
  l1 <- regexpr("",package)
  ind1 <- as.integer(l1[[1]]) + 9
  l2 <- regexpr("Version",package)
  ind2 <- as.integer(l2[[1]]) - (46 + nchar("package"))
  desc <- substring(package,ind1,ind2)
  return(desc)
}
 
# Function to get CRAN package page
getPackage <- function(name){
  url <- paste("http://cran.r-project.org/web/packages/",name,"/index.html",sep="")
  txt <- getURL(url,ssl.verifypeer=FALSE)
  return(txt)
}
#--------------------------------------------
# SCRAPE PACKAGE DATA FROM CRAN
# Get the list of R packages
packages <- as.data.frame(available.packages())
head(packages)
dim(packages)
 
pkgNames <- rownames(packages)
rm(packages)           # Don't need this any more
# Preallocate the result and fetch one description per package
pkgDesc <- character(length(pkgNames))
for (i in seq_along(pkgNames)){
  pkgDesc[i] <- getDesc(getPackage(pkgNames[i]))
}

length(pkgDesc) #6598
 
#----------------------------------------------
# SOME HOUSEKEEPING
# cranP <- data.frame(pkgNames,pkgDesc)
# write.csv(cranP,"C:/DATA/CRAN/CRAN_pkgs_6_15_15")
# save.image("pkgs.RData")
# load("pkgs.RData")
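
The fixed character offsets in getDesc() only work while the CRAN page layout stays exactly as it was. As a hedged alternative, the sketch below pulls out the first paragraph that follows the page heading with a regular expression. It assumes the description sits in the first <p> element after the closing </h2> tag. The sample HTML and the name getDescRegex are made up for illustration and are not part of the original code.

```r
# Hypothetical helper: extract the first <p>...</p> after the </h2> heading.
# Assumes the package description is that paragraph; returns NA if not found.
getDescRegex <- function(html) {
  # (?s) lets "." match newlines, since descriptions can span several lines
  m <- regmatches(html, regexec("(?s)</h2>\\s*<p>(.*?)</p>", html, perl = TRUE))[[1]]
  if (length(m) < 2) return(NA_character_)
  # Collapse line breaks and repeated spaces inside the description
  trimws(gsub("\\s+", " ", m[2]))
}

# Made-up page fragment standing in for the result of getPackage()
sampleHtml <- paste0(
  "<h2>fakepkg: A Made-Up Package</h2>\n",
  "<p>Tools for Bayesian\nanalysis of toy data.</p>\n",
  "<table><tr><td>Version:</td><td>0.1</td></tr></table>"
)

desc <- getDescRegex(sampleHtml)
print(desc)
```

Returning NA for pages that do not match makes failed scrapes easy to spot afterwards with is.na(), rather than leaving a silently wrong substring in the vector.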

When I did this a few days ago, 6,598 packages were available. The next section of code turns the vector of package descriptions into a document corpus and creates a document term matrix with a row for each package and a column for each of 20,781 terms. Taking the transpose of the term matrix makes it easier to see what is going on. The matrix is extremely sparse: in the small portion shown, only a single 1 shows up, and the terms in that portion are pretty much useless. Removing the sparse terms cuts the matrix down to only 372 terms.

# SOME SIMPLE TEXT MINING
# Make a corpus  out of package descriptions
pCorpus <- VCorpus(VectorSource(pkgDesc))
pCorpus
inspect(pCorpus[1:3])
 
# Function to prepare corpus
prepC <- function(corpus){
  c <- tm_map(corpus, stripWhitespace)
  c <- tm_map(c,content_transformer(tolower))
  c <- tm_map(c,removeWords,stopwords("english"))
  c <- tm_map(c,removePunctuation)
  c <- tm_map(c,removeNumbers)
  return(c)
}
 
pCorpusPrep <- prepC(pCorpus)
 
#------------------------------------------------------------
# Create the document term matrix
dtm <- DocumentTermMatrix(pCorpusPrep)
dtm
# <<DocumentTermMatrix (documents: 6598, terms: 20781)>>
#   Non-/sparse entries: 142840/136970198
# Sparsity           : 100%
# Maximal term length: 83
# Weighting          : term frequency (tf)
 
 
# Work with the transpose to list keywords as rows
inspect(t(dtm[100:105,90:105]))
 
# Docs
# Terms          100 101 102 103 104 105
# accomodated    0   0   0   0   0   0
# accompanied    0   0   0   0   0   0
# accompanies    0   0   0   0   0   0
# accompany      0   0   0   0   0   0
# accompanying   0   0   0   0   0   0
# accomplished   0   0   0   0   0   0
# accomplishes   0   0   0   0   0   0
# accordance     0   0   0   0   0   0
# according      0   0   1   0   0   0
# accordingly    0   0   0   0   0   0
# accordinglyp   0   0   0   0   0   0
# account        0   0   0   0   0   0
# accounted      0   0   0   0   0   0
# accounting     0   0   0   0   0   0
# accountp       0   0   0   0   0   0
# accounts       0   0   0   0   0   0
 
 
# Reduce the number of sparse terms
dtms <- removeSparseTerms(dtm,0.99)
 
dim(dtms)  # 6598  372
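
To see what the 0.99 threshold does: removeSparseTerms() drops terms whose sparsity (the share of documents in which the term does not occur) exceeds the threshold. As I read tm's behaviour, a term survives only if it occurs in more than (1 - 0.99) x 6598 = 65.98 documents, that is, in at least 66 package descriptions. The base R sketch below mimics that rule on a made-up toy corpus. It illustrates the idea and is not tm's implementation; the documents and the helper name keepDenseTerms are hypothetical.

```r
# Toy "descriptions" (made up for illustration)
docs <- c("bayesian model fitting",
          "fast bayesian sampling",
          "plotting tools",
          "bayesian plotting")

# Build a small document-term count matrix in base R:
# one row per document, one column per term
tokens <- strsplit(docs, " ", fixed = TRUE)
terms  <- sort(unique(unlist(tokens)))
dtmToy <- t(sapply(tokens, function(tk) table(factor(tk, levels = terms))))

# Hypothetical helper: keep terms that occur in more than
# nrow * (1 - sparse) documents, mirroring the rule described above
keepDenseTerms <- function(m, sparse) {
  docFreq <- colSums(m > 0)
  m[, docFreq > nrow(m) * (1 - sparse), drop = FALSE]
}

# With 4 documents and sparse = 0.6, a term must occur in more than 1.6 documents
dense <- keepDenseTerms(dtmToy, 0.6)
print(colnames(dense))
```

The same arithmetic explains the luck involved in the next step: "bayesian" is only kept if it shows up in enough descriptions to clear the cut.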

I am pretty much counting on some luck here, hoping that "Bayesian" will be one of the remaining 372 terms. This last bit of code finds 229 packages associated with the keyword "Bayesian".

# Find the Bayesian packages
dtmsT <- t(dtms)
keywords <- row.names(dtmsT)                 
bi <- which(keywords == "bayesian")  # Find the index of an interesting keyword
 
bayes <- as.matrix(dtmsT)[bi,]       # as.matrix() extracts the row without printing to the console
bayes_packages_index <- names(bayes[bayes==1])
 
# Here are the "Bayesian" packages
bayes_packages <- pkgNames[as.numeric(bayes_packages_index)]
length(bayes_packages) #229
 
# Here are the descriptions of the "Bayesian" packages
bayes_pkgs_desc <- pkgDesc[bayes==1]
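
Because the match above depends on "bayesian" surviving the sparse-term cut, a direct pattern match on the raw descriptions is a useful cross-check. The sketch below is a minimal illustration on made-up data; toyNames and toyDesc are hypothetical stand-ins for pkgNames and pkgDesc. Matching on the stem "bayes" also catches variants such as "Bayes", so on the real data the count need not agree with the 229 found above.

```r
# Hypothetical stand-ins for pkgNames and pkgDesc
toyNames <- c("pkgA", "pkgB", "pkgC", "pkgD")
toyDesc  <- c("Bayesian model averaging for linear models.",
              "Tools for reading spreadsheets.",
              "Empirical Bayes shrinkage estimators.",
              "Fast plotting of large data sets.")

# Case-insensitive match on the stem "bayes"
hits <- grepl("bayes", toyDesc, ignore.case = TRUE)

# Collect the matching package names and their descriptions
bayesToy <- data.frame(package = toyNames[hits],
                       description = toyDesc[hits],
                       stringsAsFactors = FALSE)
print(bayesToy$package)
```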

Here is the list of packages found.

[Image: list of the "Bayesian" packages found]

Not all of these "fish" are going to be worth keeping, but at least we have reduced the search to something manageable. In 10 or 15 minutes of fishing you might catch something interesting.
