About URLs in DESCRIPTION

[This article was first published on Posts on R-hub blog, and kindly contributed to R-bloggers.]

Among the usual DESCRIPTION fields is the free-text URL field, where package authors can store various links: to the development website, docs, upstream tool, etc. In this post, we shall explain why storing URLs in DESCRIPTION is important, where else you should add URLs and what kind of URLs are stored in CRAN packages these days.

Why put URLs in DESCRIPTION?

In the following we’ll assume your package has some sort of online development repository (GitHub? GitLab? R-Forge?) and a documentation website (handily created via pkgdown?). Adding URLs to your package’s online homes is extremely useful for several reasons.

As a side note: Yes, you can store several URLs under URL, even if the field name is singular. See for instance rhub's DESCRIPTION:

URL: https://github.com/r-hub/rhub, https://r-hub.github.io/rhub/
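To check how R itself sees such a multi-URL field, here is a minimal sketch. It writes a throwaway DESCRIPTION to a temporary file (the Version value is a placeholder), reads it back with read.dcf(), and splits the field on commas. The splitting rule is an assumption that suits comma-separated fields like the one above.

```r
# Write a minimal DESCRIPTION to a temporary file.
# The Version value is a placeholder; the URL line is the one shown above.
desc_file <- tempfile()
writeLines(
  c(
    "Package: rhub",
    "Version: 0.0.1",
    "URL: https://github.com/r-hub/rhub, https://r-hub.github.io/rhub/"
  ),
  desc_file
)

# read.dcf() returns a character matrix with one column per requested field.
url_field <- read.dcf(desc_file, fields = "URL")[1, "URL"]

# Split on commas and surrounding whitespace to get one URL per element.
urls <- strsplit(url_field, "\\s*,\\s*")[[1]]
urls
```

For an installed package, utils::packageDescription("somepkg")$URL should give the same raw field to start from.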

Here are the reasons:

  • It will help your users find your package’s pretty documentation from the CRAN page, instead of just the less pretty PDF manual.

  • Likewise, from the CRAN page your contributors can directly find where to submit patches.

  • If your package has a package-level man page, and it should (e.g. as drafted by usethis::use_package_doc() and then generated by roxygen2), then after typing say library("rhub") and then ?rhub, your users will find the useful links.

  • Other tools such as helpdesk and the pkgsearch RStudio addin can help surface the URLs you store in DESCRIPTION.

  • Indirectly, having a link to the docs website and development repo will increase their page rank, see useful comments in this discussion, so potential users and contributors find them more easily by simply searching for your package.

Quick tip: you can add GitHub URLs (URL and BugReports) to DESCRIPTION by running usethis::use_github_links().
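As a rough illustration of what ends up in those two fields, the sketch below builds the values by hand for a GitHub repository. It does not call usethis. The helper name github_links() is hypothetical, and the assumption is that URL points to the repository and BugReports to its issue tracker.

```r
# Hypothetical helper: derive DESCRIPTION links from a GitHub owner and repo.
github_links <- function(owner, repo) {
  repo_url <- sprintf("https://github.com/%s/%s", owner, repo)
  c(URL = repo_url, BugReports = paste0(repo_url, "/issues"))
}

links <- github_links("r-hub", "rhub")

# Print the two lines in "Field: value" form, as in a DESCRIPTION file.
cat(sprintf("%s: %s", names(links), links), sep = "\n")
```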

Where else should you put your URLs?

For the same reasons as previously, you should make the most of all places that can store your package’s URL(s). Have you put your package’s docs URL

Have you used any of your package’s URLs

Don’t miss any opportunity to point users and contributors in the right direction!

What URLs do people use in DESCRIPTION files of CRAN packages?

In the following, we shall parse the URL field of the CRAN packages database.

db <- tools::CRAN_package_db()

db <- tibble::as_tibble(db[, c("Package", "URL")])
db <- dplyr::distinct(db)

There are 15315 packages on CRAN at the time of writing, of which 8040 have something written in the URL field. We can parse this data.
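The two counts can be obtained along these lines. To keep the sketch runnable offline, it uses a tiny made-up data frame (toy_db, with invented package names and URLs) in place of the real db. With the real data, the same two expressions would apply.

```r
# Toy stand-in for the CRAN database; package names and URLs are invented.
toy_db <- data.frame(
  Package = c("pkgA", "pkgB", "pkgC"),
  URL = c("https://example.org/pkgA", NA, "https://example.org/pkgC"),
  stringsAsFactors = FALSE
)

# Total number of packages.
n_packages <- nrow(toy_db)

# Number of packages with a non-missing URL field.
n_with_url <- sum(!is.na(toy_db$URL))

c(packages = n_packages, with_url = n_with_url)
```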

db <- db[!is.na(db$URL),]

library("magrittr")

# function from https://github.com/r-hub/pkgsearch/blob/26c4cc24b9296135b6238adc7631bc5250509486/R/addin.R#L490-L496

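# Match http(s) URLs up to the next whitespace, comma, semicolon or ">".
url_regex <- function() "(https?://[^\\s,;>]+)"

# Return the unique URLs found in txt, wrapped in a list (list(NULL) if none).
find_urls <- function(txt) {
  mch <- gregexpr(url_regex(), txt, perl = TRUE)
  res <- regmatches(txt, mch)[[1]]

  if (length(res) == 0) {
    return(list(NULL))
  } else {
    list(unique(res))
  }
}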

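# Extract the URLs of each package, then split every URL into its components.
parsed_db <- db %>%
  dplyr::group_by(Package) %>%
  # one list element per package, holding its unique URLs
  dplyr::mutate(actual_url = find_urls(URL)) %>%
  dplyr::ungroup() %>%
  # one row per (package, URL) pair
  tidyr::unnest(actual_url) %>%
  dplyr::group_by(Package, actual_url) %>%
  # URL components (scheme, domain, path, ...) from urltools::url_parse()
  dplyr::mutate(url_parts = list(urltools::url_parse(actual_url))) %>%
  dplyr::ungroup() %>%
  tidyr::unnest(url_parts) %>%
  dplyr::mutate(scheme = trimws(scheme))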
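To see what the extraction step does on a single field, here is a small sketch that repeats the url_regex() and find_urls() definitions from above and applies them to two sample strings. The second string is made up to show the case without any URL.

```r
url_regex <- function() "(https?://[^\\s,;>]+)"

find_urls <- function(txt) {
  mch <- gregexpr(url_regex(), txt, perl = TRUE)
  res <- regmatches(txt, mch)[[1]]
  if (length(res) == 0) {
    list(NULL)
  } else {
    list(unique(res))
  }
}

# A field with two comma-separated URLs yields a list holding both.
two_urls <- find_urls("https://github.com/r-hub/rhub, https://r-hub.github.io/rhub/")
two_urls[[1]]

# A field without any http(s) link yields list(NULL).
no_url <- find_urls("see the package vignette")
is.null(no_url[[1]])
```

If I read tidyr's defaults correctly, rows whose list element is NULL are dropped by tidyr::unnest(), which would explain why packages without a parsable URL do not appear in parsed_db.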

There are 7192 packages with at least one valid URL.

What are the packages with most links?

mostlinks <- dplyr::count(parsed_db, Package, sort = TRUE)
mostlinks

## # A tibble: 7,192 x 2
##    Package           n
##    <chr>         <int>
##  1 RcppAlgos         7
##  2 BIFIEsurvey       5
##  3 BigQuic           5
##  4 dendextend        5
##  5 PGRdup            5
##  6 vwline            5
##  7 ammistability     4
##  8 augmentedRCBD     4
##  9 dcGOR             4
## 10 dialr             4
## # … with 7,182 more rows

The package with the most links in URL is RcppAlgos.

What is the most popular scheme, http or https?

dplyr::count(parsed_db, scheme, sort = TRUE)

## # A tibble: 2 x 2
##   scheme     n
##   <chr>  <int>
## 1 https   5910
## 2 http    2496

A bit less than one third of the links use http.
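The share can be checked directly from the two counts in the table above. This is a quick sketch in base R; the variable names are mine.

```r
# Counts taken from the table above.
scheme_counts <- c(https = 5910, http = 2496)

# Share of each scheme among all extracted links, in percent.
scheme_share <- round(100 * scheme_counts / sum(scheme_counts), 1)
scheme_share
```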

Can we identify popular domains?

dplyr::count(parsed_db, domain, sort = TRUE)

## # A tibble: 1,855 x 2
##    domain                    n
##    <chr>                 <int>
##  1 github.com             4660
##  2 www.r-project.org       164
##  3 cran.r-project.org      143
##  4 r-forge.r-project.org    82
##  5 bitbucket.org            67
##  6 sites.google.com         54
##  7 arxiv.org                52
##  8 gitlab.com               44
##  9 docs.ropensci.org        38
## 10 www.github.com           32
## # … with 1,845 more rows

GitHub seems to be the most popular development platform, at least from this sample of CRAN packages that indicate a URL. It is also possible that some developers set up their own GitLab server with their own domain. Many packages link to www.r-project.org, which is not very informative, or to their own CRAN page, which can be informative.
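The table lists github.com and www.github.com separately. If one wanted to merge such variants, a possible sketch (not part of the analysis above) is to strip a leading www. before counting. The example uses only the two GitHub counts from the table.

```r
# Counts for the two GitHub host names, taken from the table above.
domain_counts <- c("github.com" = 4660, "www.github.com" = 32)

# Drop a leading "www." so both spellings map to the same host.
normalized <- sub("^www\\.", "", names(domain_counts))

# Sum the counts per normalized host.
merged <- tapply(domain_counts, normalized, sum)
merged
```

Whether stripping www. is appropriate for every domain is a judgment call, as the two host names need not always serve the same content.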

Other relatively popular domains are sites.google.com and arxiv.org. There are probably links to venues for scientific publications other than arxiv.org. What about doi.org?

dplyr::filter(parsed_db, domain %in% c("doi.org", "dx.doi.org")) %>%
  dplyr::select(Package, actual_url)

## # A tibble: 44 x 2
##    Package                actual_url                                    
##    <chr>                  <chr>                                         
##  1 abcrlda                https://dx.doi.org/10.1109/LSP.2019.2918485   
##  2 adwave                 https://doi.org/10.1534/genetics.115.176842   
##  3 ammistability          https://doi.org/10.5281/zenodo.1344756        
##  4 anMC                   https://doi.org/10.1080/10618600.2017.1360781 
##  5 ANOVAreplication       https://dx.doi.org/10.17605/OSF.IO/6H8X3      
##  6 AssocAFC               https://doi.org/10.1093/bib/bbx107            
##  7 augmentedRCBD          https://doi.org/10.5281/zenodo.1310011        
##  8 CorrectOverloadedPeaks http://dx.doi.org/10.1021/acs.analchem.6b02515
##  9 dataMaid               https://doi.org/10.18637/jss.v090.i06         
## 10 disclapmix             http://dx.doi.org/10.1016/j.jtbi.2013.03.009  
## # … with 34 more rows

The “earlier but no longer preferred” dx.doi.org is still in use.
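If you wanted to update such links, a minimal sketch is to rewrite the dx.doi.org prefix to https://doi.org/ with a regular expression. The two input URLs come from the table above; the variable names are mine.

```r
# Two URLs taken from the table above.
doi_urls <- c(
  "http://dx.doi.org/10.1016/j.jtbi.2013.03.009",
  "https://doi.org/10.18637/jss.v090.i06"
)

# Rewrite the older dx.doi.org resolver (http or https) to https://doi.org/.
# URLs that already use doi.org are left unchanged.
normalized_doi <- sub("^https?://dx\\.doi\\.org/", "https://doi.org/", doi_urls)
normalized_doi
```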

The rOpenSci docs server also makes an appearance.

Note that you could do a similar analysis of the BugReports field. We’ll leave that as an exercise to the reader.
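For readers who want a starting point, here is a sketch in base R. It runs on a small invented data frame (toy_bugs) so that it works offline. The assumption is that the real input would be the Package and BugReports columns of the database returned by tools::CRAN_package_db().

```r
# Toy stand-in for the Package and BugReports columns; values are invented.
toy_bugs <- data.frame(
  Package = c("pkgA", "pkgB", "pkgC", "pkgD"),
  BugReports = c(
    "https://github.com/someone/pkgA/issues",
    "https://gitlab.com/someone/pkgB/issues",
    NA,
    "https://github.com/someone/pkgD/issues"
  ),
  stringsAsFactors = FALSE
)

# Keep rows with a BugReports entry.
toy_bugs <- toy_bugs[!is.na(toy_bugs$BugReports), ]

# Extract the host: drop the scheme, then everything from the first "/".
host <- sub("/.*$", "", sub("^https?://", "", toy_bugs$BugReports))

# Count hosts, most frequent first.
bug_hosts <- sort(table(host), decreasing = TRUE)
bug_hosts
```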

Conclusion

In this note, we explained why having URLs in the DESCRIPTION of your package can help users and contributors find the right venues for their needs, and we had a look at URLs currently stored in the DESCRIPTION files of CRAN packages, in particular discussing currently popular domains. How do you ensure the users of your package can find its best online home(s)? How do you look for online home(s) of the packages you use?
