How to get help with R package development? R-package-devel and beyond

April 10, 2019

No matter how good your docs-reading and search-engine-querying skills
are, sometimes as an R package developer you'll need to ask questions
of your peers. Where to find them? R-hub has its own feedback and
discussion venues, but what about R package development in general? In
this blog post, we'll have a look at the oldest channel dedicated to R
package development help, the R-package-devel mailing list. We'll also
mention other, younger fora for such questions.

Brief presentation of R-package-devel

R-package-devel is one of the official mailing lists of the R project.
In the words of its creators, "This list is to get help about package
development in R. The goal of the list is to provide a forum for
learning about the package development process. We hope to build a
community of R package developers who can help each other solve
problems, and reduce some of the burden on the CRAN maintainers. If you
are having problems developing a package or passing R CMD check, this
is the place to ask!". It was born on May 22nd, 2015, which is fairly
recent; before then, R package development questions were asked on
R-devel, which is now "intended for questions and discussion about code
development in R".

To participate, you need to subscribe, and your first post will be
manually moderated. After that, you're free to post to ask for, or
offer, help!

Now, this is all good, but how about getting a more data-driven
overview of the mailing list by downloading its archives?

Building a data.frame of R-package-devel archives

R-package-devel archives are online, organized by thread/date/author
within quarters. To get a quick glimpse at them, we first downloaded
and parsed them.

Downloading all threads

As a first step, we politely scraped the archives to extract the
filenames we would then download.

session <- polite::bow("https://stat.ethz.ch/pipermail/r-package-devel/",
            user_agent = "Maëlle Salmon https://masalmon.eu/")

library("magrittr")

polite::scrape(session) %>%
  rvest::xml_nodes("a") %>%
  xml2::xml_attr("href") %>%
  .[grepl("\\.txt\\.gz", .)] -> filenames

We then downloaded each file, pausing for 5 seconds between downloads,
still with the goal of remaining polite.

fs::dir_create("archives")

download_one <- function(filename){
  message(filename)
  Sys.sleep(5)
  download.file(glue::glue("https://stat.ethz.ch/pipermail/r-package-devel/{filename}"),
                file.path("archives", filename))

}


purrr::walk(filenames, download_one)

We did this at the beginning of April 2019, so the last complete
quarter we got was the first quarter of 2019. Below is what the
archives held:

fs::dir_tree("archives")
## archives
## ├── 2015q2.txt.gz
## ├── 2015q3.txt.gz
## ├── 2015q4.txt.gz
## ├── 2016q1.txt.gz
## ├── 2016q2.txt.gz
## ├── 2016q3.txt.gz
## ├── 2016q4.txt.gz
## ├── 2017q1.txt.gz
## ├── 2017q2.txt.gz
## ├── 2017q3.txt.gz
## ├── 2017q4.txt.gz
## ├── 2018q1.txt.gz
## ├── 2018q2.txt.gz
## ├── 2018q3.txt.gz
## ├── 2018q4.txt.gz
## ├── 2019q1.txt.gz
## └── 2019q2.txt.gz

Parsing the threads

To parse the emails we used the tm.plugin.mail R package, plus one
home-baked function derived from it to remove citation lines starting
with | rather than > (see e.g. this email). We weren't too sure how
best to parse emails into data.frames, so we asked for help on the
rOpenSci forum; thanks Scott for the good advice!

We first converted each quarterly mbox file into the eml format that
tm.plugin.mail reads.

# archives holds all the txt.gz files
filenames <- fs::dir_ls("archives")
folders <- gsub("archives\\/", "", filenames)

purrr::walk2(filenames, folders,
             tm.plugin.mail::convert_mbox_eml)

We wrote our own | citation-removing function.

# adapted from tm.plugin.mail source code
removeCitation2 <-
  function(x)
  {
    citations <- grep("^[[:blank:]]*\\|", x, useBytes = TRUE)

    headers <- grep("wrote:$|writes:$", x)
    ## A quotation header must immediately precede a quoted part,
    ## possibly with one empty line in between.
    headers <- union(headers[(headers + 1L) %in% citations],
                     headers[(headers + 2L) %in% citations])
    citations <- union(headers, citations)


    if (length(citations)) x[-citations] else x
  }

We then wrote functions to rectangle emails (which live within
threads), threads (which live within folders), and folders themselves,
and applied them to the whole archive.

rectangle_email <- function(email){
  email <- tm.plugin.mail::removeCitation(email, removeQuoteHeader = TRUE)
  email <- tm.plugin.mail::removeMultipart(email)
  email <- tm.plugin.mail::removeSignature(email)

  if(is.null(email$meta$heading)){
    return(NULL)
  }
  email$content <- removeCitation2(email$content)
  tibble::tibble(author = email$meta$author,
                 datetime = as.POSIXct(email$meta$datetimestamp),
                 subject = email$meta$heading,
                 content = as.character(
                   glue::glue_collapse(email$content,
                                               "\n")))
}

rectangle_thread <- function(thread, ID, folder){
  df <- purrr::map_df(as.list(thread), rectangle_email)

  if(nrow(df) == 0){
    return(NULL)
  }

  df$thread <- paste(folder, ID, sep = "-")
  return(df)
}

rectangle_folder <- function(folder){
  emails <- tm::VCorpus(tm::DirSource(folder),
                        readerControl = list(reader = tm.plugin.mail::readMail))

  threads <- tm.plugin.mail::threads(emails)
  purrr::map2_df(split(emails, threads$ThreadID),
                 unique(threads$ThreadID),
                 rectangle_thread, folder = folder)
}

emails <- purrr::map_df(folders, rectangle_folder)

fs::dir_create("data")
readr::write_csv(emails, file.path("data", "emails.csv"))
fs::dir_delete(folders)

The emails.csv file held a gigantic data.frame with one line per
email, which was our initial goal. 🎉

emails <- readr::read_csv("data/emails.csv")
str(emails)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 3708 obs. of  5 variables:
##  $ author  : chr  "edd at debian.org (Dirk Eddelbuettel)" "maechler at lynne.stat.math.ethz.ch (Martin Maechler)" "Dan.Kelley at Dal.Ca (Daniel Kelley)" "maechler at lynne.stat.math.ethz.ch (Martin Maechler)" ...
##  $ datetime: POSIXct, format: "2015-05-22 11:38:22" "2015-05-22 14:00:13" ...
##  $ subject : chr  "[R-pkg-devel] Welcome all!" "[R-pkg-devel] Welcome all!" "[R-pkg-devel] how to call PROJ.4 C code in a package?" "[R-pkg-devel] how to call PROJ.4 C code in a package?" ...
##  $ content : chr  "\nThanks to all (176 as of now) of you for subscribing.  We hope this will turn\ninto a useful forum.\n\nI woul"| __truncated__ "\n\n\nYes, the list is subscriber-only.  If you are not subscribed,\nyour post is rejected immediately.\n\nHowe"| __truncated__ "The ?oce? package (for oceanographic analysis) presently includes PROJ.4 C-language source code, as a way to wo"| __truncated__ "\n\n\n\n... Well, not quite: That's not R, that was you, when you\ninstalled the CRAN package called  'proj4'.\"| __truncated__ ...
##  $ thread  : chr  "2015q2.txt.gz-1" "2015q2.txt.gz-1" "2015q2.txt.gz-2" "2015q2.txt.gz-2" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   author = col_character(),
##   ..   datetime = col_datetime(format = ""),
##   ..   subject = col_character(),
##   ..   content = col_character(),
##   ..   thread = col_character()
##   .. )

Rough exploratory data analysis (EDA) of R-package-devel-archives

Activity of the list

In total, the archives we parsed hold 3708 emails in 1104 threads,
which we define by subject (length(unique(emails$subject))). Over time,
there is no clear trend in activity, either upwards or downwards.

library("ggplot2")
library("magrittr")
library("ggalt")

emails %>%
  dplyr::mutate(week = lubridate::round_date(datetime, unit = "week")) %>%
  ggplot() +
  geom_bar(aes(week)) +
  hrbrthemes::theme_ipsum(base_size = 16,
                          axis_title_size = 16) +
  xlab("Time (weeks)") +
  ylab("Number of emails")

R-package-devel weekly number of emails over time

In this context of R package development help at least, email is not
dead!
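
For a coarser view of the same trend, one could also tally the emails
per calendar year; this is only a quick sketch on top of the emails
data.frame, not part of the original plot.

emails %>%
  dplyr::mutate(year = lubridate::year(datetime)) %>%
  dplyr::count(year)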

Who?

As mentioned earlier, the oldest email was sent on 2015-05-22 10:56:21
(min(emails$datetime)). Even without cleaning email addresses too
much, it's clear that there are some super-posters around:

dplyr::count(emails, author, sort = TRUE)
## # A tibble: 670 x 2
##    author                                                           n
##    <chr>                                                        <int>
##  1 edd at debian.org (Dirk Eddelbuettel)                          270
##  2 murdoch.duncan at gmail.com (Duncan Murdoch)                   234
##  3 ligges at statistik.tu-dortmund.de (Uwe Ligges)                185
##  4 [email protected]@n @ending from [email protected]@com (Duncan Murdoch)          68
##  5 h.wickham at gmail.com (Hadley Wickham)                         60
##  6 [email protected] @ending from @[email protected]@[email protected]@de (Uwe Ligges)       46
##  7 edd @ending from [email protected]@org (Dirk Eddelbuettel)                 45
##  8 glennmschultz at me.com (Glenn Schultz)                         43
##  9 csardi.gabor at gmail.com (=?UTF-8?B?R8OhYm9yIENzw6FyZGk=?=)    42
## 10 i.ucar86 at gmail.com (=?UTF-8?B?ScOxYWtpIMOaY2Fy?=)            42
## # … with 660 more rows

The above shows that there are clearly duplicates, which we won't try
to resolve in this post. With this uncleaned dataset, we find 670
distinct authors. The true number is lower than that, but still quite
large!
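
One possible, admittedly rough, de-duplication step would be to count
by the display name shown in parentheses rather than by the raw
address; the regular expression below is our own hypothetical sketch,
not something done in the analysis above.

emails %>%
  # keep only the display name between the final parentheses;
  # addresses without parentheses end up as NA
  dplyr::mutate(name = stringr::str_match(author, "\\(([^)]*)\\)\\s*$")[, 2]) %>%
  dplyr::count(name, sort = TRUE)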

What?

Of particular interest is trying to summarize the topics discussed,
because they'll reflect common hurdles that could be addressed with
tooling (is it difficult to reproduce issues uncovered by CRAN
platforms? Yay for R-hub!) or documentation (questions about R-hub are
thus gems!). Apart from reading all the old threads, which is more
addictive than one might think, there are some ways to extract
information from the data, a few of which we'll present here.

What are the most mentioned URL domains?

urls <- unlist(qdapRegex::rm_url(emails$content, extract = TRUE))
urls <- urls[!is.na(urls)]
urls <- urltools::url_parse(urls)
dplyr::count(urls, domain, sort = TRUE)
## # A tibble: 364 x 2
##    domain                        n
##    <chr>                     <int>
##  1 github.com                  437
##  2 cran.r-project.org          321
##  3 stat.ethz.ch                238
##  4 win-builder.r-project.org   158
##  5 stackoverflow.com            38
##  6 www.pfeg.noaa.gov            38
##  7 dirk.eddelbuettel.com        36
##  8 www.r-project.org            36
##  9 www.keittlab.org             25
## 10 hughparsonage.github.io      22
## # … with 354 more rows

Some of these are links shared to convey information (e.g. links to
GitHub or CRAN); others are actually URLs from signatures, which we
therefore haven't been able to completely remove, too bad!

What are the subjects of emails in which R-hub ended up being
mentioned? To assess that, we wrote and tested an R-hub regular
expression, "[Rr](-)?( )?[Hh]ub".

grepl("[Rr](-)?( )?[Hh]ub",
      c("R hub", "Rhub", "R-hub"))
## [1] TRUE TRUE TRUE
rhub <- emails[grepl("[Rr](-)?( )?[Hh]ub", emails$content)|
              grepl("[Rr](-)?( )?[Hh]ub", emails$subject),]

set.seed(42)
sample(unique(rhub$subject), 7)
## [1] "[R-pkg-devel] ORCID disappearing in auto-generated Authors: field?"                      
## [2] "[R-pkg-devel]  Error appearing only with check_win_devel() - could be ggplot2 R version?"
## [3] "[R-pkg-devel] Build of PDF vignette fails on r-oldrel-osx-x86_64"                        
## [4] "[R-pkg-devel] r-hub failing?"                                                            
## [5] "[R-pkg-devel] Replicate solaris errors"                                                  
## [6] "[R-pkg-devel] registering native routines"                                               
## [7] "[R-pkg-devel] Cannot reproduce errors for an already-accepted package"

Having read these threads, we can say that some of them contain actual
promotion of R-hub services, and others feature links to R-hub builder
logs, which is quite cool, as well as questions about R-hub that we
want to have covered in the docs.
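
Out of curiosity, here is a small additional sketch, not part of the
original exploration, counting R-hub-mentioning emails per quarter by
re-using the quarter encoded in the thread identifier.

rhub %>%
  # thread IDs look like "2015q2.txt.gz-1": strip everything from ".txt.gz"
  dplyr::mutate(quarter = sub("\\.txt\\.gz.*$", "", thread)) %>%
  dplyr::count(quarter)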

What are the most frequent words?

stopwords <- rcorpora::corpora("words/stopwords/en")$stopWords

word_counts <- emails %>%
  dplyr::select(subject) %>%
  dplyr::mutate(subject = trimws(
    gsub(
      "\\[R\\-pkg\\-devel\\]", "", subject))) %>%
  unique() %>%
  tidytext::unnest_tokens(word, subject, token = "tweets") %>%
  dplyr::filter(!word %in% stopwords) %>%
  dplyr::count(word, sort = TRUE) %>%
  dplyr::mutate(word = reorder(word, n))

ggplot(word_counts[1:15,]) +
  geom_lollipop(aes(word, n),
                size = 2, col = "salmon") +
  hrbrthemes::theme_ipsum(base_size = 16,
                          axis_title_size = 16) +
  coord_flip()

Most common words in R package devel archives

Nothing too surprising here, especially the clear dominance of
"package", followed by "cran", "check" and "error"! Now, we could apply
the same script again and again to show the most frequent bigrams
(pairs of words appearing together), trigrams, etc., but we'll take a
stab at a different approach in the next section: topic modeling.

To end this EDA before attempting to model the topics of the archives,
here's the longest thread of all time for you!

dplyr::count(emails, subject) %>%
  dplyr::arrange(- n) %>%
  dplyr::top_n(1)
## # A tibble: 1 x 2
##   subject                                       n
##   <chr>                                     <int>
## 1 [R-pkg-devel] tibbles are not data frames    50

You can read it online; it's quite interesting! 🔥

Topic modeling of R-package-devel archives

Two blog posts of Julia Silge's motivated us to try out topic modeling,
which, according to the "Tidy text mining" book by Julia Silge and
David Robinson, is "a method for unsupervised classification of such
documents, similar to clustering on numeric data, which finds natural
groups of items even when we're not sure what we're looking for". In
this section, we simply adapted the code of one of Julia Silge's blog
posts, "Training, evaluating, and interpreting topic models", to which
you should refer for more, and very good, explanations of the topic
(hehe). A special thanks to Julia for answering our questions about the
choice of the number of topics!

We first cleaned up the data a bit and re-formatted it. When
tokenizing, i.e. splitting by word, we used the Twitter tokenizer
because, as explained by Julia in her blog post, "it often performs the
most sensibly with text from online forums".
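
As a tiny illustration of what that tokenizer does differently (a
made-up example, not data from the archives), it keeps URLs and
@mentions as single tokens instead of splitting them on punctuation:

tibble::tibble(text = "see https://builder.r-hub.io/ and ping @maelle") %>%
  tidytext::unnest_tokens(word, text, token = "tweets")

The actual cleaning and tokenization of the threads looked as follows.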

threads <- emails %>%
  dplyr::mutate(subject = gsub("\\[R\\-pkg\\-devel\\]", "", subject),
                subject = trimws(subject)) %>%
  dplyr::group_by(subject) %>%
  dplyr::summarise(text = glue::glue_collapse(content, sep = "\n")) %>%
  dplyr::mutate(text = paste(subject, text, sep = "\n")) %>%
  dplyr::mutate(text = stringr::str_replace_all(text, "&#x27;|&quot;|&#x2F;", "'"), ## weird encoding
                text = stringr::str_replace_all(text, "<a(.*?)>", " "),             ## links
                text = stringr::str_replace_all(text, "&gt;|&lt;|&amp;", " "),      ## html yuck
                text = stringr::str_replace_all(text, "&#[:digit:]+;", " "),        ## html yuck
                text = stringr::str_remove_all(text, "<[^>]*>"),                    ## html yuck
                text = stringr::str_remove_all(text, "\\[\\[alternative HTML version deleted\\]\\]"), ## mmmmm, more html yuck
                threadID = dplyr::row_number())

tidy_threads <- threads %>%
  tidytext::unnest_tokens(word, text, token = "tweets") %>%
  dplyr::anti_join(tidytext::get_stopwords()) %>%
  dplyr::filter(!stringr::str_detect(word, "[0-9]+")) %>%
  dplyr::add_count(word) %>%
  dplyr::filter(n > 100) %>%
  dplyr::select(-n)

threads_sparse <- tidy_threads %>%
  dplyr::count(threadID, word) %>%
  tidytext::cast_sparse(threadID, word, n)

We then performed the topic modeling itself, with the number of topics
ranging from 5 to 50, and plotted the model diagnostics.

library("stm")
library(furrr)
plan(multiprocess)

many_models <- tibble::tibble(K = seq(5, 50, by = 1)) %>%
  dplyr::mutate(topic_model = future_map(K, ~stm(threads_sparse, K = .,
                                          verbose = FALSE)))
heldout <- make.heldout(threads_sparse)

k_result <- many_models %>%
  dplyr::mutate(exclusivity = purrr::map(topic_model, exclusivity),
         semantic_coherence = purrr::map(topic_model, semanticCoherence, threads_sparse),
         eval_heldout = purrr::map(topic_model, eval.heldout, heldout$missing),
         residual = purrr::map(topic_model, checkResiduals, threads_sparse),
         bound =  purrr::map_dbl(topic_model, function(x) max(x$convergence$bound)),
         lfact = purrr::map_dbl(topic_model, function(x) lfactorial(x$settings$dim$K)),
         lbound = bound + lfact,
         iterations = purrr::map_dbl(topic_model, function(x) length(x$convergence$bound)))

library("ggplot2")
k_result %>%
  dplyr::transmute(K,
                   `Lower bound` = lbound,
                   Residuals = purrr::map_dbl(residual, "dispersion"),
                   `Semantic coherence` = purrr::map_dbl(semantic_coherence, mean),
                   `Held-out likelihood` = purrr::map_dbl(eval_heldout, "expected.heldout")) %>%
  tidyr::gather(Metric, Value, -K) %>%
  ggplot(aes(K, Value, color = Metric)) +
  geom_line(size = 1.5, alpha = 0.7, show.legend = FALSE) +
  facet_wrap(~Metric, scales = "free_y") +
  labs(x = "K (number of topics)",
       y = NULL,
       title = "Model diagnostics by number of topics",
       subtitle = "We should use domain knowledge to choose a good number of topics :-)") +
  hrbrthemes::theme_ipsum(base_size = 16,
                          axis_title_size = 16)

Model diagnostics by number of topics. Held-out likelihood and lower bound keep increasing while residuals keep decreasing with the number of topics, but semantic coherence also decreases with the number of topics.

The model diagnostics plot didn't help a ton because there was no clear
best number of topics. We chose to go with 20 topics, because semantic
coherence had not declined too much yet at that number.
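
To make that judgment call slightly more concrete, here is a small
sketch, added on top of the k_result data.frame above, comparing mean
semantic coherence and held-out likelihood for a few candidate values
of K.

k_result %>%
  dplyr::filter(K %in% c(10, 20, 30, 40, 50)) %>%
  dplyr::transmute(K,
                   mean_coherence = purrr::map_dbl(semantic_coherence, mean),
                   heldout_likelihood = purrr::map_dbl(eval_heldout, "expected.heldout"))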

We then proceeded as Julia did to obtain a plot summarizing the topics:

topic_model <- k_result %>%
  dplyr::filter(K == 20) %>%
  dplyr::pull(topic_model) %>%
  .[[1]]

topic_model
## A topic model with 20 topics, 1032 documents and a 391 word dictionary.
td_beta <- tidytext::tidy(topic_model)

td_gamma <- tidytext::tidy(topic_model, matrix = "gamma",
                           document_names = rownames(threads_sparse))

top_terms <- td_beta %>%
  dplyr::arrange(beta) %>%
  dplyr::group_by(topic) %>%
  dplyr::top_n(7, beta) %>%
  dplyr::arrange(-beta) %>%
  dplyr::select(topic, term) %>%
  dplyr::summarise(terms = list(term)) %>%
  dplyr::mutate(terms = purrr::map(terms, paste, collapse = ", ")) %>%
  tidyr::unnest()

gamma_terms <- td_gamma %>%
  dplyr::group_by(topic) %>%
  dplyr::summarise(gamma = mean(gamma)) %>%
  dplyr::arrange(dplyr::desc(gamma)) %>%
  dplyr::left_join(top_terms, by = "topic") %>%
  dplyr::mutate(topic = paste0("Topic ", topic),
         topic = reorder(topic, gamma))

gamma_terms %>%
  dplyr::top_n(20, gamma) %>%
  ggplot(aes(topic, gamma, label = terms, fill = topic)) +
  geom_col(show.legend = FALSE) +
  geom_text(hjust = 0, nudge_y = 0.0005, size = 3,
            family = "IBMPlexSans") +
  coord_flip() +
  scale_y_continuous(expand = c(0,0),
                     limits = c(0, 0.5),
                     labels = scales::percent_format()) +
  hrbrthemes::theme_ipsum() +
  labs(x = NULL, y = expression(gamma),
       title = "20 topics by prevalence in the r-pkg-devel archives",
       subtitle = "With the top words that contribute to each topic")

20 topics by prevalence in the r-pkg-devel archives with the top words that contribute to each topic

A first thing we notice about the topics is that some of them contain
the signatures of super-posters: Topic 7 ("cran, package, uwe, best,
version, ligges, check") features Uwe Ligges and CRAN; Topic 9 ("dirk,
library, c, use, package, rcpp, thanks") puts Dirk Eddelbuettel
together with a package he maintains, Rcpp, as well as with the
language C, which makes sense (although it might be C++?). Some topics'
representative words look like fragments of code (e.g. Topic 5:
"double, +, int, c, =, package, using"), which is due to the emails
containing both text and output logs from R CMD check, without any
special markup separating the code. Still, one could dive into Topic 15
("file, files, vignettes, help, documentation, rd") to find discussions
around package documentation, and Topic 4 ("windows, r, version,
rdevel, directory, using, linux") seems to cover check results across R
versions and operating systems, which is a good topic for R-hub. So
some of these topics might be useful, and might help with mining the
archives of, let us remind you, 1104 threads.
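
To look more closely at any single topic, one can also pull its
highest-beta terms straight from td_beta; the sketch below does this
for Topic 9, going beyond the seven words shown in the plot.

td_beta %>%
  dplyr::filter(topic == 9) %>%
  dplyr::arrange(dplyr::desc(beta)) %>%
  # ten terms with the highest probability within Topic 9
  dplyr::slice(1:10)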

Here's how we would extract the subjects of the threads whose most
probable topic is Topic 4. It is not a very good method, since Topic 4
could be the most probable topic for a document without being much more
probable than the other topics, but it's a start.

td_gamma %>%
  dplyr::group_by(document) %>%
  dplyr::filter(gamma[topic == 4] == max(gamma)) %>%
  dplyr::pull(document) %>%
  unique() -> ids

set.seed(42)
unique(threads$subject[threads$threadID %in% ids]) %>%
  sample(7)
## [1] "Version of make on CRAN Windows build machines"                           
## [2] "Windows binaries"                                                         
## [3] "Error appearing only with check_win_devel() - could be ggplot2 R version?"
## [4] "robust download function in R (similar to wget)?"                         
## [5] "Question about selective platform for my R package"                       
## [6] "object 'nativeRoutines' not found"                                        
## [7] "R CMD check yielding different results for me than CRAN reviewer"

And here’s the same for Topic 15.

td_gamma %>%
  dplyr::group_by(document) %>%
  dplyr::filter(gamma[topic == 15] == max(gamma)) %>%
  dplyr::pull(document) %>%
  unique() -> ids

set.seed(42)
unique(threads$subject[threads$threadID %in% ids]) %>%
  sample(7)
## [1] "Roxygen: function documentation to get \\item{...} in .rd file"
## [2] "separate Functions: and Datasets: indices?"                    
## [3] "documentation of generic '['"                                  
## [4] "R-devel problem with temporary files or decompression?"        
## [5] "No reference output from knitr vignettes"                      
## [6] "Indexing HTML vignette topics"                                 
## [7] "Problem installing built vignettes"
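
A stricter, hypothetical variant of the same idea would keep only the
threads where a topic clearly dominates, i.e. where its gamma exceeds
an arbitrary threshold such as 0.5; here it is sketched for Topic 4.

td_gamma %>%
  dplyr::filter(topic == 4, gamma > 0.5) %>%
  dplyr::pull(document) %>%
  unique() -> strict_ids

# subjects of the threads where Topic 4 clearly dominates
unique(threads$subject[threads$threadID %in% strict_ids])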

All in all, our rough topic modelling could help us explore the threads
to identify common bottlenecks and showstoppers for R package
developers.

Conclusion: R-package-devel and beyond

We'd be thrilled if you built on our rudimentary analysis here to
produce a more thorough analysis of R-package-devel; please tell us if
you do! As far as analyses of R mailing lists are concerned, we only
know of the talk "Network Text Analysis of R Mailing Lists" given at
UseR! 2009 in Rennes by Angela Bohn, of a search tool for R-help by
Romain François, and of a blog post by David Smith.

R-package-devel is definitely an active venue dedicated to R package
development questions; thanks a lot to its maintainers and
contributors! Now, there are at least two alternatives that are actual
discussion forums built on Discourse, namely the rOpenSci discuss forum
and the RStudio Community forum, where threads are arguably easier to
browse since all answers to a topic appear below one another, and
thanks to Markdown-based formatting of code, rendering of URL cards in
the presence of metadata, etc.

Besides, short R package development questions most probably have their
place on Stack Overflow.

In all three of these venues you can't answer by email, but you can
subscribe to categories/tags and turn on email notifications, so no
worries if you like email a lot. 😉

Find your own favorite venue(s) for asking and answering (but beware of
cross-posting!), and advertise them in your networks! In the R-hub
docs, a whole topic is dedicated to getting help. We hope you find
answers to all your R-hub and R package development questions!
