CRAN Packages on GitHub (and some CRAN DESCRIPTION observations)


Just about a week ago @thosjleeper posited something on twitter w/r/t how many CRAN packages had associations with GitHub (i.e. how many used GitHub for development). The DESCRIPTION file (which comes with every R package) has some fields that can house this information, and most folks who do use GitHub for R package development seem to use the URL field to hold the repository URL, which looks something like this:

URL: http://github.com/ropenscilabs/geoparser

(that’s from @ma_salmon’s ++good geoparser package)
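If you want to peek at that field for a package you have installed, utils::packageDescription() reads the installed DESCRIPTION; a quick sketch, assuming geoparser is installed locally:

# assumes the geoparser package is installed on this machine
packageDescription("geoparser")$URL
## should include "http://github.com/ropenscilabs/geoparser"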

There may be traces of GitHub URLs in other fields, but I took this initial naïve assumption and added a step to my daily home CRAN mirror scripts (what, you don’t have your own CRAN mirror at home, too?) to generate two files that you, the R community, can use whenever you want to inspect GitHub R packages:

- http://public-r-data.s3-website-us-east-1.amazonaws.com/ghcran.json
- http://public-r-data.s3-website-us-east-1.amazonaws.com/ghcran.Rdata

You can use:

load(url("http://public-r-data.s3-website-us-east-1.amazonaws.com/ghcran.Rdata"))

or

jsonlite::fromJSON("http://public-r-data.s3-website-us-east-1.amazonaws.com/ghcran.json")

to read these files, but they only change daily, so you might want to download.file() them rather than waste bandwidth re-reading them intra-day.
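For instance, one way to cache the JSON flavor locally before parsing it (the local file name here is just my choice):

# fetch once…
download.file("http://public-r-data.s3-website-us-east-1.amazonaws.com/ghcran.json",
              "ghcran.json")

# …then re-read from disk as often as you like
ghcran_json <- jsonlite::fromJSON("ghcran.json")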

As of this post there were 1,544 packages meeting my naïve criteria.

One interesting side discovery of this effort was that there are 122 “distinct” DESCRIPTION fields in use, but some of those are just mixed-case versions of each other (118 unique field names once case is ignored). Plus, there doesn’t seem to be as hard-and-fast a rule on the fields as one might think. Some examples:

You can see the usage counts for all the fields in the table below:
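As a rough sketch, that tally could be reproduced from the loaded data, assuming ghcran.Rdata loads a data frame named ghcran with one column per DESCRIPTION field encountered:

# assumes `ghcran` has one column per DESCRIPTION field; NA means "field absent"
field_usage <- sort(colSums(!is.na(ghcran)), decreasing = TRUE)
head(field_usage, 10)

# fields that are distinct only because of letter case
length(unique(names(ghcran)))          # 122 as of this post
length(unique(tolower(names(ghcran)))) # 118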

But, I digress.

Who has the most CRAN packages with associated GitHub repositories? The code below mostly answers this. I say “mostly” since I don’t handle edge cases in the URL field (look at it to see what I mean). It’s also possible that there are traces of GitHub in other fields, and I’ll address those in my local CRAN parser at some point. Feel free to post your findings, fixes or enhancements in the comments.

library(dplyr)
library(tidyr)
library(stringi)
library(DT)

# grab the daily snapshot and load the `ghcran` data frame
download.file("http://public-r-data.s3-website-us-east-1.amazonaws.com/ghcran.Rdata",
              "ghcran.Rdata")

load("ghcran.Rdata")

ghcran$URL %>%
  stri_match_first_regex("://github.com/(.*?/[[:alnum:]_\\-\\.]+)") %>% # extract "owner/repo"
  as.data.frame(stringsAsFactors=FALSE) %>%
  setNames(c("url", "repos")) %>%
  filter(!is.na(repos)) %>%                                    # keep only GitHub-linked packages
  separate(repos, c("author", "repo"), "/", extra="drop") %>%  # split into author & repo
  count(author) %>%                                            # packages per GitHub author
  arrange(desc(n)) %>%
  datatable()                                                  # interactive table of the results
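One possible enhancement, for example (a sketch, not part of the original parser): catch every GitHub URL when a package lists more than one in its URL field, and fold GitHub usernames to lower case before counting, since they are case-insensitive:

# pull *all* GitHub "owner/repo" fragments from each URL field, not just the first
all_repos <- unlist(lapply(
  stri_match_all_regex(ghcran$URL, "://github.com/(.*?/[[:alnum:]_\\-\\.]+)"),
  function(m) m[, 2]
))

data.frame(repos = all_repos, stringsAsFactors = FALSE) %>%
  filter(!is.na(repos)) %>%
  separate(repos, c("author", "repo"), "/", extra = "drop") %>%
  mutate(author = stri_trans_tolower(author)) %>%  # GitHub usernames are case-insensitive
  count(author) %>%
  arrange(desc(n)) %>%
  datatable()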
