Analyzing data on CRAN packages

[This article was first published on Revolutions, and kindly contributed to R-bloggers.]

There's a handy new function in R 3.4.0 for anyone interested in data about CRAN packages. It's not documented, but it's pretty simple:

tools::CRAN_package_db()

returns a data frame with one row for every package on CRAN and 65 columns of data on those packages, as shown below.

> names(tools::CRAN_package_db())
 [1] "Package"                 "Version"                 "Priority"               
 [4] "Depends"                 "Imports"                 "LinkingTo"              
 [7] "Suggests"                "Enhances"                "License"                
[10] "License_is_FOSS"         "License_restricts_use"   "OS_type"                
[13] "Archs"                   "MD5sum"                  "NeedsCompilation"       
[16] "Additional_repositories" "Author"                  "Authors@R"              
[19] "Biarch"                  "BugReports"              "BuildKeepEmpty"         
[22] "BuildManual"             "BuildResaveData"         "BuildVignettes"         
[25] "Built"                   "ByteCompile"             "Classification/ACM"     
[28] "Classification/ACM-2012" "Classification/JEL"      "Classification/MSC"     
[31] "Classification/MSC-2010" "Collate"                 "Collate.unix"           
[34] "Collate.windows"         "Contact"                 "Copyright"              
[37] "Date"                    "Description"             "Encoding"               
[40] "KeepSource"              "Language"                "LazyData"               
[43] "LazyDataCompression"     "LazyLoad"                "MailingList"            
[46] "Maintainer"              "Note"                    "Packaged"               
[49] "RdMacros"                "SysDataCompression"      "SystemRequirements"     
[52] "Title"                   "Type"                    "URL"                    
[55] "VignetteBuilder"         "ZipData"                 "Published"              
[58] "Path"                    "X-CRAN-Comment"          "Reverse depends"        
[61] "Reverse imports"         "Reverse linking to"      "Reverse suggests"       
[64] "Reverse enhances"        "MD5sum"   
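Note that "MD5sum" appears twice in the listing above, which can trip up code that expects unique column names. The following is a minimal sketch of a first look at such a data frame. So that it runs offline, it uses a small made-up stand-in (the package names and values are hypothetical); in a live session with network access you would assign `tools::CRAN_package_db()` to `db` instead.

```r
# Hypothetical stand-in for the real data; in a live session use:
#   db <- tools::CRAN_package_db()
db <- data.frame(
  Package = c("pkgA", "pkgB", "pkgC"),
  MD5sum  = c("aaa", "bbb", "ccc"),
  License = c("GPL-3", "MIT + file LICENSE", "GPL-3"),
  MD5sum  = c("aaa", "bbb", "ccc"),
  check.names = FALSE,        # keep the duplicated column name, as in the real data
  stringsAsFactors = FALSE
)

# Which column names occur more than once?
dup_cols <- names(db)[duplicated(names(db))]
dup_cols

# Keep only the first occurrence of each column name
db_unique <- db[, !duplicated(names(db)), drop = FALSE]
dim(db_unique)                # rows = packages, columns = fields

# Tabulate a field, most common value first
license_counts <- sort(table(db_unique$License), decreasing = TRUE)
head(license_counts, 10)
```

Dropping the repeated column up front is one simple choice; renaming it with `make.unique()` would be an alternative if both copies need to be kept.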

In a recent blog post, Julia Silge analyzes this database and turns up some interesting statistics on CRAN packages. For example, mining the Description field of each package shows which words (other than stopwords like “the” and “and”) are used most often. Unsurprisingly, most relate to data and data analysis methods:

[Figure: most common words in CRAN package descriptions]
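A word count of this kind can be approximated in base R. The sketch below is not necessarily the method used in Julia's post: the helper `count_words()`, its short stopword list, and the toy descriptions are all assumptions made for illustration. With the real data you would pass the `Description` column of the data frame instead.

```r
# Hypothetical helper: count non-stopword words across a character vector.
# The default stopword list is a tiny illustrative sample, not a complete one.
count_words <- function(descriptions,
                        stopwords = c("the", "and", "of", "a", "for",
                                      "to", "in", "with")) {
  # Drop missing descriptions, lower-case, and join into one string
  text  <- tolower(paste(descriptions[!is.na(descriptions)], collapse = " "))
  # Split on anything that is not a lower-case letter
  words <- unlist(strsplit(text, "[^a-z]+"))
  # Remove empty strings and stopwords
  words <- words[nzchar(words) & !(words %in% stopwords)]
  sort(table(words), decreasing = TRUE)
}

# Made-up descriptions standing in for the real Description column
toy_desc <- c("Tools for the analysis of data.",
              "Functions to fit regression models to data.",
              NA)

word_counts <- count_words(toy_desc)
head(word_counts, 5)
```

Splitting on non-letters is deliberately crude: it discards digits and breaks hyphenated terms apart, so a dedicated tokenizer would give cleaner results on the full set of descriptions.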

Julia also analyzed which packages include tests (via the testthat or RUnit packages), provide a link for filing bug reports, or provide vignettes (dynamically created documentation). By this analysis, more than 60% of packages provide none of these:

[Figure: CRAN package practices (tests, bug report links, vignettes)]
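One way to derive flags like these from the data is sketched below. The exact criteria are assumptions rather than a confirmed reproduction of Julia's analysis: a package counts as tested if its Suggests field mentions testthat or RUnit, as having a bug report link if BugReports is present, and as having vignettes if VignetteBuilder is present. The helper `flag_practices()` and the toy rows are hypothetical.

```r
# Hypothetical helper: flag three practices per package from DESCRIPTION fields.
flag_practices <- function(db) {
  # grepl() returns FALSE for NA input, so a missing Suggests field counts as "no tests"
  uses_tests    <- grepl("testthat|RUnit", db$Suggests)
  has_bugs      <- !is.na(db$BugReports)
  has_vignettes <- !is.na(db$VignetteBuilder)
  data.frame(
    Package       = db$Package,
    uses_tests    = uses_tests,
    has_bugs      = has_bugs,
    has_vignettes = has_vignettes,
    none          = !(uses_tests | has_bugs | has_vignettes),
    stringsAsFactors = FALSE
  )
}

# Made-up rows standing in for tools::CRAN_package_db()
toy_db <- data.frame(
  Package         = c("pkgA", "pkgB", "pkgC"),
  Suggests        = c("testthat, knitr", NA_character_, "RUnit"),
  BugReports      = c("https://example.org/issues", NA_character_, NA_character_),
  VignetteBuilder = c("knitr", NA_character_, NA_character_),
  stringsAsFactors = FALSE
)

practices <- flag_practices(toy_db)
practices
mean(practices$none)   # share of packages with none of the three
```

Because the flags rely only on DESCRIPTION fields, a package with a plain tests folder and no testthat or RUnit dependency would be counted as untested here.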

This probably understates how many packages are tested, though: many packages (especially older ones) include a tests folder that doesn't rely on testthat or RUnit, but it's not apparent how to identify them from the data. Tests can also be included in the help files for package functions. In particular, the CRAN maintainers will reject any package that includes no tests (except those that don't require them, like data packages), so this estimate doesn't look right to me. Nonetheless, the new CRAN_package_db function provides a useful data source for exploring the rich world of CRAN packages. You can see further examples in Julia's blog post, linked below.

data science ish: Mining CRAN DESCRIPTION Files
