Rectangling onboarding
Our onboarding reviews, which ensure that packages contributed by the community undergo a transparent, constructive, non-adversarial and open review process, take place in the issue tracker of a GitHub repository. Development of the packages we onboard also takes place in the open, most often in GitHub repositories.
Therefore, to give a data-driven overview of our onboarding system, my mission was to extract data from GitHub and git repositories and to put it into nice rectangles (as defined by Jenny Bryan) ready for analysis. You might call that the first step of a “tidy git analysis”, using the term coined by Simon Jackson. So, how did I collect data?
A side-note about GitHub
In the following, I’ll mention repositories. All of them are git repositories, which means they’re folders under version control where, roughly speaking, all changes are saved via commits whose messages (more or less) describe what’s been changed. On top of that, these repositories live on GitHub, which means they get to enjoy some infrastructure such as issue trackers, milestones, starring by admirers, etc. If that ecosystem is brand new to you, I recommend reading this book, especially its big picture chapter.
Package review processes: weaving the threads
Each package submission is an issue thread in our onboarding repository, see an example here. The first comment in that issue is the submission itself, followed by many comments by the editor, reviewers and authors. On top of all the data that’s saved there, mostly text data, we have a private Airtable workspace where we have a table of reviewers and their reviews, with direct links to the issue comments that are reviews.
Getting issue threads
Unsurprisingly, the first step here was to “get issue threads”. What do I mean? I wanted a table of all issue threads, one row per comment, with columns indicating the time at which something was written, and columns digesting the data from the issue itself, e.g. guessing the role of the commenter from other information: the first user of the issue is the “author”.
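To make the target shape concrete, here is a minimal sketch in base R. The nested list, user names and field names are made up for illustration and are not the actual API output; the sketch also assumes comments arrive in chronological order. Each thread is flattened into one row per comment, and the first commenter is flagged as the author.

```r
# Toy nested data: one element per issue, each holding its comments.
# All values are invented for illustration.
threads <- list(
  list(issue = 1,
       comments = list(
         list(user = "alice", created_at = "2015-03-10 23:22:45"),
         list(user = "bob",   created_at = "2015-03-11 19:29:32"))),
  list(issue = 2,
       comments = list(
         list(user = "carol", created_at = "2015-04-01 17:30:51")))
)

# Turn one thread into a data frame with one row per comment
flatten_thread <- function(thread) {
  users <- vapply(thread$comments, function(x) x$user, character(1))
  times <- vapply(thread$comments, function(x) x$created_at, character(1))
  data.frame(
    issue = thread$issue,
    user = users,
    created_at = as.POSIXct(times, tz = "UTC"),
    # the first user of the thread is taken to be the author
    is_author = users == users[1],
    stringsAsFactors = FALSE
  )
}

# Stack all threads into a single rectangle
rectangle <- do.call(rbind, lapply(threads, flatten_thread))
print(rectangle)
```

The real function works on JSON returned by the API rather than on a hand-written list, but the resulting rectangle has the same one-row-per-comment layout.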
I used to use GitHub API V3 and then heard about GitHub API V4, which blew my mind. As if I weren’t impressed enough by the mere existence of this API and its advantages, I discovered that the rOpenSci ghql package allows one to interact with such an API and that its docs actually use GitHub API V4 as an example! Carl Boettiger told me about his way to rectangle JSON data using jq, a language for processing JSON, via a dedicated rOpenSci package, jqr.
I have nothing against GitHub API V3 and gh and purrr workflows, but I was curious and really enjoyed learning these new tools and writing this code. I had written gh/purrr code for getting the same information and it felt clumsier, but it might just be because I wasn’t perfectionist enough when writing it! I managed to write the correct GitHub V4 API query to get just what I needed by using its online explorer. I then succeeded in transforming the JSON output into a rectangle by reading Carl’s post but also by taking advantage of another online explorer, jq play, where I pasted my output via writeClipboard. That’s nearly always the way I learn about query tools: using some sort of explorer and then pasting the code into a script. When I am more experienced, I can skip the explorer part.
The first function I wrote was one for getting the issue number of the last onboarding issue, because then I looped/mapped over all issues.
    library("ghql")
    library("httr")
    library("magrittr")

    # function to get number of last issue
    get_last_issue <- function(){
      query <- '{
      repository(owner: "ropensci", name: "onboarding") {
        issues(last: 1) {
          edges{
            node{
              number
            }
          }
        }
      }
    }'
      token <- Sys.getenv("GITHUB_GRAPHQL_TOKEN")
      cli <- GraphqlClient$new(
        url = "https://api.github.com/graphql",
        headers = add_headers(Authorization = paste0("Bearer ", token))
      )

      ## define query
      ### create a query class first
      qry <- Query$new()
      qry$query('issues', query)
      last_issue <- cli$exec(qry$queries$issues)
      last_issue %>%
        jqr::jq('.data.repository.issues.edges[].node.number') %>%
        as.numeric()
    }

    get_last_issue()
    ## [1] 201
Then I wrote a function for getting all the precious info I needed from an issue thread. At the time it lived on its own in an R script; now it’s been included in my gs package as get_issue_thread, so you can check out the code there, along with other useful recipes for analyzing GitHub data.
Then I launched this code to get all data! It was very satisfying.
    # get all threads
    issues <- purrr::map_df(1:get_last_issue(), get_issue_thread)

    # for the one(s) with 101 comments get the 100 last comments
    long_issues <- issues %>%
      dplyr::count(issue) %>%
      dplyr::filter(n == 101) %>%
      dplyr::pull(issue)

    issues2 <- purrr::map_df(long_issues, get_issue_thread, first = FALSE)
    all_issues <- dplyr::bind_rows(issues, issues2)
    all_issues <- unique(all_issues)

    readr::write_csv(all_issues, "data/all_threads_v4.csv")
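The de-duplication at the end works because the two fetches of a long thread overlap: stacking them and calling unique() drops the rows fetched twice. Here is a toy sketch of that idea in base R, with an invented page size of 3 instead of the real limit and made-up data.

```r
# Toy thread with 5 comments; pretend the API returns at most 3 per call.
# (Page size and data are invented for illustration.)
thread <- data.frame(issue = 7, comment_id = 1:5)
page_size <- 3

first_page <- head(thread, page_size)  # comments 1, 2, 3
last_page  <- tail(thread, page_size)  # comments 3, 4, 5

# Stack both fetches and drop the rows that were fetched twice
combined <- unique(rbind(first_page, last_page))
rownames(combined) <- NULL
print(combined)
```

Note that this trick only recovers a whole thread when it holds at most twice the page size; a longer thread would need proper pagination.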
Digesting them and complementing them with Airtable data
In the previous step we got a rectangle of all threads, with information from the first issue comment (such as labels) distributed to all the comments of the threads.
    issues <- readr::read_csv("data/all_threads_v4.csv")
    issues <- janitor::clean_names(issues)
    issues <- dplyr::rename(issues, user = author)
    issues <- dplyr::select(issues, - dplyr::contains("topic"))

    issues %>%
      head() %>%
      dplyr::select(- body) %>%
      knitr::kable()
title | author_association | assignee | created_at | closed_at | user | comment_url | package | pulled | issue | meta | x6_approved | out_of_scope | x4_review_s_in_awaiting_changes | x0_presubmission | question | x3_reviewer_s_assigned | holding | legacy | x1_editor_checks | x5_awaiting_reviewer_s_response | x2_seeking_reviewer_s |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
rrlite | OWNER | sckott | 2015-03-10 23:22:45 | 2015-03-31 00:16:28 | richfitz | NA | TRUE | TRUE | 1 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
rrlite | OWNER | sckott | 2015-03-10 23:26:11 | 2015-03-31 00:16:28 | richfitz | https://github.com/ropensci/onboarding/issues/1#issuecomment-78170639 | TRUE | TRUE | 1 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
rrlite | OWNER | sckott | 2015-03-11 19:29:32 | 2015-03-31 00:16:28 | karthik | https://github.com/ropensci/onboarding/issues/1#issuecomment-78351979 | TRUE | TRUE | 1 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
rrlite | OWNER | sckott | 2015-03-11 21:08:59 | 2015-03-31 00:16:28 | sckott | https://github.com/ropensci/onboarding/issues/1#issuecomment-78372187 | TRUE | TRUE | 1 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
rrlite | OWNER | sckott | 2015-03-11 21:13:11 | 2015-03-31 00:16:28 | karthik | https://github.com/ropensci/onboarding/issues/1#issuecomment-78373054 | TRUE | TRUE | 1 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
rrlite | OWNER | sckott | 2015-03-11 21:33:45 | 2015-03-31 00:16:28 | richfitz | https://github.com/ropensci/onboarding/issues/1#issuecomment-78377124 | TRUE | TRUE | 1 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Now we need a few steps more:
transforming NA into FALSE for variables corresponding to labels,
getting the package name from Airtable since the titles of issues are not uniformly formatted,
knowing which comment is a review,
deducing the role of the user writing the comment (author/editor/reviewer/community manager/other).
Below binary variables are transformed and only rows corresponding to approved packages are kept.
    # labels
    replace_1 <- function(x){
      !is.na(x[1])
    }

    # binary variables
    ncol_issues <- ncol(issues)
    issues <- dplyr::group_by(issues, issue) %>%
      dplyr::arrange(created_at) %>%
      dplyr::mutate_at(9:(ncol_issues-1), replace_1) %>%
      dplyr::ungroup()

    # keep only issues that are finished
    issues <- dplyr::filter(issues, package, !x0_presubmission, !out_of_scope, !legacy, !x1_editor_checks, x6_approved)
    issues <- dplyr::select(issues, - dplyr::starts_with("x"), - package, - out_of_scope, - legacy, - meta, - holding, - pulled, - question)
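As a dependency-free illustration of the label step, the sketch below redoes it in base R on toy data. The column names mimic the table above but the values are invented, and ave() stands in for the grouped mutate: a label counts as set for a whole issue when the first comment of that issue carries a non-missing value.

```r
# Toy data: label columns hold a value on labelled issues and NA otherwise.
# (Values are invented for illustration.)
issues <- data.frame(
  issue = c(1, 1, 2, 2),
  created_at = as.POSIXct(c("2015-03-10 23:22:45", "2015-03-11 19:29:32",
                            "2015-04-01 17:30:51", "2015-04-02 03:36:09"),
                          tz = "UTC"),
  x6_approved = c(TRUE, TRUE, NA, NA),
  legacy = c(NA, NA, TRUE, TRUE)
)

# Same idea as replace_1(): only the first value of the group matters
first_not_na <- function(x) rep(!is.na(x[1]), length(x))

issues <- issues[order(issues$created_at), ]
for (label in c("x6_approved", "legacy")) {
  issues[[label]] <- as.logical(
    ave(issues[[label]], issues$issue, FUN = first_not_na)
  )
}

# Keep approved, non-legacy issues
finished <- issues[issues$x6_approved & !issues$legacy, ]
print(finished)
```

After this step the label columns contain only TRUE and FALSE, which is what makes the plain logical filter possible.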
Then, thanks to the airtabler package, we can add the name of the package and identify review comments.
    # airtable data
    airtable <- airtabler::airtable("appZIB8hgtvjoV99D", "Reviews")
    airtable <- airtable$Reviews$select_all()
    airtable <- dplyr::mutate(airtable,
                              issue = as.numeric(stringr::str_replace(onboarding_url, ".*issues\\/", "")))

    # we get the name of the package
    # and we know which comments are reviews
    reviews <- dplyr::select(airtable, review_url, issue, package) %>%
      dplyr::mutate(is_review = TRUE)

    issues <- dplyr::left_join(issues, reviews, by = c("issue", "comment_url" = "review_url"))
    issues <- dplyr::mutate(issues, is_review = !is.na(is_review))
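The two ideas in that chunk, extracting the issue number from a URL and flagging the comments whose URL is a known review URL, can be sketched in base R on toy data. The URLs and package names below are invented, and the lookup with match() is a simplification of the join used above, not the same code.

```r
# Toy stand-ins for the Airtable table and the comment table.
# URLs and names are invented for illustration.
airtable <- data.frame(
  onboarding_url = c("https://github.com/ropensci/onboarding/issues/6",
                     "https://github.com/ropensci/onboarding/issues/24"),
  review_url = c("https://github.com/ropensci/onboarding/issues/6#issuecomment-1",
                 "https://github.com/ropensci/onboarding/issues/24#issuecomment-9"),
  package = c("pkgA", "pkgB"),
  stringsAsFactors = FALSE
)
comments <- data.frame(
  issue = c(6, 6, 24),
  comment_url = c("https://github.com/ropensci/onboarding/issues/6#issuecomment-1",
                  "https://github.com/ropensci/onboarding/issues/6#issuecomment-2",
                  "https://github.com/ropensci/onboarding/issues/24#issuecomment-9"),
  stringsAsFactors = FALSE
)

# Everything up to and including "issues/" is dropped, leaving the number
airtable$issue <- as.numeric(sub(".*issues/", "", airtable$onboarding_url))

# A comment is a review when its URL appears among the review URLs
comments$is_review <- comments$comment_url %in% airtable$review_url

# Package name looked up by issue number
comments$package <- airtable$package[match(comments$issue, airtable$issue)]
print(comments[, c("issue", "package", "is_review")])
```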
Finally, the inelegant code below attributes a role to each user (commenter is its more precise version, which differentiates reviewer 1 from reviewer 2). I could have used dplyr case_when.
    # non elegant code to guess role
    issues <- dplyr::group_by(issues, issue)
    issues <- dplyr::arrange(issues, created_at)
    issues <- dplyr::mutate(issues, author = user[1])
    issues <- dplyr::mutate(issues, package = unique(package[!is.na(package)]))
    issues <- dplyr::mutate(issues, assignee = assignee[1])
    issues <- dplyr::mutate(issues, reviewer1 = ifelse(!is.na(user[is_review][1]), user[is_review][1], ""))
    issues <- dplyr::mutate(issues, reviewer2 = ifelse(!is.na(user[is_review][2]), user[is_review][2], ""))
    issues <- dplyr::mutate(issues, reviewer3 = ifelse(!is.na(user[is_review][3]), user[is_review][3], ""))
    issues <- dplyr::ungroup(issues)
    issues <- dplyr::group_by(issues, issue, created_at, user)

    # regexp because in at least 1 case assignee = 2 names glued together
    issues <- dplyr::mutate(issues, commenter = ifelse(stringr::str_detect(assignee, user), "editor", "other"))
    issues <- dplyr::mutate(issues, commenter = ifelse(user == author, "author", commenter))
    issues <- dplyr::mutate(issues, commenter = ifelse(user == reviewer1, "reviewer1", commenter))
    issues <- dplyr::mutate(issues, commenter = ifelse(user == reviewer2, "reviewer2", commenter))
    issues <- dplyr::mutate(issues, commenter = ifelse(user == reviewer3, "reviewer3", commenter))
    issues <- dplyr::mutate(issues, commenter = ifelse(user == "stefaniebutland", "community_manager", commenter))
    issues <- dplyr::ungroup(issues)

    issues <- dplyr::mutate(issues, role = commenter,
                            role = ifelse(stringr::str_detect(role, "reviewer"), "reviewer", role))
    issues <- dplyr::select(issues, - author, - reviewer1, - reviewer2, - reviewer3, - assignee, - author_association, - comment_url)

    readr::write_csv(issues, "data/clean_data.csv")
The role “other” corresponds to anyone chiming in, while the community manager’s role is to plan blog posts with the package author. We indeed have a series of guest blog posts from package authors that illustrate the review process as well as their onboarded packages.
Here is the final table. I drop “body” because formatting in the text could break the output here, but I do have the text corresponding to each comment.
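For the curious, here is a rough sketch of what a case_when version of the role guessing might look like. It runs on invented data with only two reviewers, and "manager" is a placeholder login, so it is an illustration of the approach rather than a drop-in replacement for the code above.

```r
library("dplyr")

# Toy comments for one issue; all user names are invented,
# and "manager" stands in for the community manager's login.
comments <- data.frame(
  user      = c("ann", "ed", "rev_a", "rev_b", "passerby", "manager"),
  author    = "ann",
  assignee  = "ed",
  reviewer1 = "rev_a",
  reviewer2 = "rev_b",
  stringsAsFactors = FALSE
)

comments <- mutate(
  comments,
  # case_when() keeps the first condition that is TRUE, so the checks are
  # listed from highest to lowest priority (the reverse of chained
  # ifelse() calls, where the last assignment wins).
  commenter = case_when(
    user == "manager" ~ "community_manager",
    user == reviewer2 ~ "reviewer2",
    user == reviewer1 ~ "reviewer1",
    user == author    ~ "author",
    # substring match because assignee can hold several names glued together
    mapply(grepl, user, assignee,
           MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE) ~ "editor",
    TRUE ~ "other"
  ),
  # collapse reviewer1/reviewer2 into a single role
  role = if_else(grepl("reviewer", commenter), "reviewer", commenter)
)
print(comments[, c("user", "commenter", "role")])
```

The main thing to watch when converting is the ordering: the checks have to be written in the opposite order from the sequential overrides to give the same result.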
    issues %>%
      dplyr::select(- body) %>%
      head() %>%
      knitr::kable()
title | created_at | closed_at | user | issue | package | is_review | commenter | role |
---|---|---|---|---|---|---|---|---|
rrlite | 2015-03-31 00:25:14 | 2015-04-13 23:26:38 | richfitz | 6 | rrlite | FALSE | author | author |
rrlite | 2015-04-01 17:30:51 | 2015-04-13 23:26:38 | sckott | 6 | rrlite | FALSE | editor | editor |
rrlite | 2015-04-01 17:36:03 | 2015-04-13 23:26:38 | karthik | 6 | rrlite | FALSE | other | other |
rrlite | 2015-04-02 03:36:09 | 2015-04-13 23:26:38 | jeroen | 6 | rrlite | FALSE | reviewer2 | reviewer |
rrlite | 2015-04-02 03:50:43 | 2015-04-13 23:26:38 | gaborcsardi | 6 | rrlite | FALSE | other | other |
rrlite | 2015-04-02 03:53:57 | 2015-04-13 23:26:38 | richfitz | 6 | rrlite | FALSE | author | author |
There are 2521 comments, corresponding to 70 onboarded packages.
Submitted repositories: down to a few metrics
As mentioned earlier, onboarded packages are most often developed on GitHub. After onboarding they live in the ropensci GitHub organization; previously some of them were onboarded into ropenscilabs, but they should all be transferred soon. In any case, their being on GitHub means it’s possible to get their history and have a glimpse at the work represented by onboarding!
Getting all onboarded repositories
Using the rOpenSci git2r package, I cloned all onboarded repositories into a “repos” folder. Since I didn’t know which package was in ropensci or ropenscilabs, I tried both.
    airtable <- airtabler::airtable("appZIB8hgtvjoV99D", "Reviews")
    airtable <- airtable$Reviews$select_all()

    safe_clone <- purrr::safely(git2r::clone)

    # github link either ropensci or ropenscilabs
    clone_repo <- function(package_name){
      print(package_name)
      url <- paste0("https://github.com/ropensci/", package_name, ".git")
      local_path <- paste0(getwd(), "/repos/", package_name)
      clone_from_ropensci <- safe_clone(url = url,
                                        local_path = local_path,
                                        progress = FALSE)
      if(is.null(clone_from_ropensci$result)){
        url <- paste0("https://github.com/ropenscilabs/", package_name, ".git")
        clone_from_ropenscilabs <- safe_clone(url = url,
                                              local_path = local_path,
                                              progress = FALSE)
        if(is.null(clone_from_ropenscilabs$result)){
          message("OUILLE")
        }
      }
    }

    pkgs <- unique(airtable$package)
    pkgs <- pkgs[!pkgs %in% fs::dir_ls()]
    pkgs <- pkgs[pkgs != "rrricanes"]
    purrr::walk(pkgs, clone_repo)
I didn’t clone “rrricanes” because it was too big!
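The "try ropensci, then ropenscilabs" logic can be written as a small generic helper. The sketch below is an assumption-laden illustration: first_working_url(), candidate_urls() and fake_fetch() are hypothetical names, and a fake fetcher replaces the real cloning so that the example runs without network access.

```r
# Try each candidate URL in turn and return the first one that works.
# `fetch` is a placeholder for the real cloning function.
first_working_url <- function(urls, fetch) {
  for (url in urls) {
    ok <- tryCatch({
      fetch(url)
      TRUE
    }, error = function(e) FALSE)
    if (ok) return(url)
  }
  NA_character_
}

# Build one candidate URL per organization, in order of preference
candidate_urls <- function(package_name,
                           orgs = c("ropensci", "ropenscilabs")) {
  paste0("https://github.com/", orgs, "/", package_name, ".git")
}

# Fake fetcher: pretends only the ropenscilabs repository exists
fake_fetch <- function(url) {
  if (!grepl("ropenscilabs", url)) stop("repository not found")
  invisible(url)
}

found <- first_working_url(candidate_urls("somepkg"), fake_fetch)
print(found)
```

Returning NA when every candidate fails plays the part of the "OUILLE" message: the caller can tell which packages could not be found anywhere.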
Getting commits reports
I then got the commit logs of each repo for two reasons:
commits themselves show how much code and documentation editing was done during review,
I wanted to be able to git reset --hard the repo to its state at submission, for which I needed the commit logs.
I used the gitsum package to get commit logs because its dedicated high-level functions made it easier than with git2r.
    library("magrittr")

    get_report <- function(package_name){
      message(package_name)
      local_path <- paste0(getwd(), "/repos/", package_name)
      if(length(fs::dir_ls(local_path)) != 0){
        gitsum::init_gitsum(local_path, over_write = TRUE)
        report <- gitsum::parse_log_detailed(local_path)
        report <- dplyr::select(report, - nested)
        report$package <- package_name
        if(!"datetime" %in% names(report)){
          report <- dplyr::mutate(report,
                                  hour = as.numeric(stringr::str_sub(timezone, 1, 3)),
                                  minute = as.numeric(stringr::str_sub(timezone, 4, 5)),
                                  datetime = date + lubridate::hours(-1 * hour) + lubridate::minutes(-1 * minute))
          report <- dplyr::select(report, - hour, - minute, - timezone)
        }
        report <- dplyr::select(report, - date)
        return(report)
      }else{
        return(NULL)
      }
    }

    packages <- fs::dir_ls("repos")
    packages <- stringr::str_replace_all(packages, "repos\\/", "")
    purrr::map_df(packages, get_report) %>%
      readr::write_csv("output/gitsum_reports.csv")
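The timezone handling deserves a closer look. Below is a small base R sketch of converting a local commit time plus a git-style offset such as "+0200" to UTC; to_utc() is a hypothetical helper and the inputs are invented. Unlike the code above, which takes signed hours and unsigned minutes, the sketch applies the sign to both parts, which as far as I can tell only matters for negative offsets with a non-zero minute part such as "-0330".

```r
# Convert a local commit time plus a git-style offset ("+0200", "-0330")
# to UTC. The function name and inputs are illustrative assumptions.
to_utc <- function(local_time, offset) {
  sign    <- ifelse(substr(offset, 1, 1) == "-", -1, 1)
  hours   <- as.numeric(substr(offset, 2, 3))
  minutes <- as.numeric(substr(offset, 4, 5))
  # the sign applies to hours and minutes alike
  offset_seconds <- sign * (hours * 3600 + minutes * 60)
  local_time - offset_seconds
}

# Wall-clock times as written by the committer, stored with a UTC label
local_time <- as.POSIXct(c("2017-06-01 12:00:00", "2017-06-01 12:00:00"),
                         tz = "UTC")
utc_time <- to_utc(local_time, c("+0200", "-0330"))
print(utc_time)
```

Noon at +0200 becomes 10:00 UTC, and noon at -0330 becomes 15:30 UTC.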
Getting repositories as at submission
Crossing information from the issue threads and from commit logs, I could find the latest commit before submission and create a copy of each repo before resetting it to this state. This is the closest to a Time-Turner that I have!
    library("magrittr")

    # get issues opening datetime
    issues <- readr::read_csv("data/clean_data.csv")
    issues <- dplyr::group_by(issues, package)
    issues <- dplyr::summarise(issues, opened = min(created_at))

    # now for each package keep only commits before that
    commits <- readr::read_csv("output/gitsum_reports.csv")
    commits <- dplyr::left_join(commits, issues, by = "package")
    commits <- dplyr::group_by(commits, package)
    commits <- dplyr::filter(commits, datetime <= opened)

    # and from them keep the latest one,
    # that's the latest commit before submission!
    commits <- dplyr::filter(commits, datetime == max(datetime), !is_merge)
    commits <- dplyr::summarize(commits, hash = hash[1])

    # small helper function
    get_sha <- function(commit){
      commit@sha
    }

    set_archive <- function(package_name, commit){
      message(package_name)
      # copy the entire repo to another location
      local_path <- paste0(getwd(), "/repos/", package_name)
      local_path_archive <- paste0(getwd(), "/repos_at_submission/", package_name)
      fs::dir_copy(local_path, local_path_archive)

      # get all commits -- it's fast which is why I don't use gitsum report here
      commits <- git2r::commits(git2r::repository(local_path_archive))

      # get their sha
      sha <- purrr::map_chr(commits, get_sha)

      # all of this to extract the commit with the sha of the latest commit before submission
      # in other words the latest commit before submission
      commit <- commits[sha == commit][[1]]

      # do a hard reset at that commit
      git2r::reset(commit, reset_type = "hard")
    }

    purrr::walk2(commits$package, commits$hash, set_archive)
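The selection of the latest commit before submission is the core of this step, so here it is once more as a base R sketch on toy data. Hashes, dates and package names are invented, and merge commits are ignored for simplicity.

```r
# Toy commit log and submission dates; hashes and times are invented.
commits <- data.frame(
  package  = c("pkgA", "pkgA", "pkgA", "pkgB", "pkgB"),
  hash     = c("a1", "a2", "a3", "b1", "b2"),
  datetime = as.POSIXct(c("2017-01-01 10:00:00", "2017-02-01 10:00:00",
                          "2017-03-01 10:00:00", "2017-01-15 10:00:00",
                          "2017-04-01 10:00:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)
submissions <- data.frame(
  package = c("pkgA", "pkgB"),
  opened  = as.POSIXct(c("2017-02-15 00:00:00", "2017-02-01 00:00:00"),
                       tz = "UTC"),
  stringsAsFactors = FALSE
)

# Attach the submission time to every commit, keep commits made before it
commits <- merge(commits, submissions, by = "package")
before <- commits[commits$datetime <= commits$opened, ]

# Within each package, the most recent of those is the state at submission
before <- before[order(before$package, before$datetime), ]
at_submission <- before[!duplicated(before$package, fromLast = TRUE),
                        c("package", "hash")]
rownames(at_submission) <- NULL
print(at_submission)
```

The resulting table has one hash per package, which is the shape needed for walking over packages and hashes in parallel.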
Outlook: getting even more data? Or analyzing this dataset
There’s more data to be collected or prepared! From GitHub issues, using GitHub archive one could get the labelling history: when did an issue go from “editor-checks” to “seeking-reviewers” for instance? It’d help characterize the usual speed of the process. One could also try to investigate the formal and less formal links between the onboarded repository and the review: did commits and issues mention the onboarding review (with words), or even actually put a link to it? How active are the actors in the process on GitHub otherwise, e.g. could we see that some reviewers create or revive their GitHub account especially for reviewing?
Rather than enlarging my current dataset, I’ll present its analysis in two further blog posts answering the questions “How much work is rOpenSci onboarding?” and “How to characterize the social weather of rOpenSci onboarding?”. In case you’re too impatient, in the meantime you can dive into this blog post by Augustina Ragwitz about measuring open-source influence beyond commits and this one by rOpenSci co-founder Scott Chamberlain about exploring git commits with git2r.