
Exploring Github Topics

[This article was first published on R on Chemometrics & Spectroscopy using R, and kindly contributed to R-bloggers]. (You can report an issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

As the code driving FOSS for Spectroscopy has matured, I began to think about how to explore Github in a systematic way for additional repositories with tools for spectroscopy. It turns out that a Github repo can have topics assigned to it, and you can use the Github API to search them. Wait, what? I didn’t know one could add topics to a repo, even though there is a little invite right there under the repo name:

Naturally I turned to StackOverflow to find out how to do this, and quickly encountered this question. It was asked when the topics feature was new, so one needs to do things just a bit differently now, but there is a way forward.

Before we get to implementation, let’s think about limitations:

Let’s get to it! First, create a Github access token on your local machine using the instructions in this gist. Next, load the needed libraries:

set.seed(123)
library("httr")
library("knitr")
library("kableExtra")

Specify your desired search terms, and create a list structure to hold the results:

search_terms <- c("NMR", "infrared", "raman", "ultraviolet", "visible", "XRF", "spectroscopy")
results <- list()

Create the string needed to access the Github API, then GET the results, and stash them in the list we created:

nt <- length(search_terms) # nt = no. of search terms
for (i in seq_len(nt)) {
  search_string <- paste0("https://api.github.com/search/repositories?q=topic:", search_terms[i])
  # github_token is the access token created by following the gist instructions above
  request <- GET(search_string, config(token = github_token))
  stop_for_status(request) # converts HTTP errors to R errors
  results[[i]] <- content(request)
}
names(results) <- search_terms
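One thing to keep in mind with the loop above: the Github search API returns results in pages (30 items per page by default, at most 100 per page, and no more than 1000 results per search). The following is a minimal sketch, not part of the original post, of how the query string could be built with explicit paging. The helper names `build_search_url` and `pages_needed` are hypothetical, and only base R is used.

```r
# Hypothetical helper: build a topic search URL with explicit paging.
# per_page and page are query parameters of the Github search API.
build_search_url <- function(term, per_page = 100L, page = 1L) {
  paste0(
    "https://api.github.com/search/repositories?q=topic:",
    utils::URLencode(term, reserved = TRUE), # escape characters unsafe in a URL
    "&per_page=", per_page,
    "&page=", page
  )
}

# Hypothetical helper: number of pages to request, given the total_count
# field of the first response and the 1000-result cap on searches.
pages_needed <- function(total_count, per_page = 100L, max_results = 1000L) {
  ceiling(min(total_count, max_results) / per_page)
}

url_page1 <- build_search_url("NMR")
n_pages <- pages_needed(250)
```

With something like this, each search term would be requested page by page and the `items` from every page appended, rather than keeping only the first page of results.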

Figure out how many results we have found, set up a data frame and then put the results into the table. The i, j, and k counters required a little experimentation to get right, as content(request) returns a deeply nested list and only certain items are desired.

nr <- 0L # nr = no. of responses
for (i in seq_len(nt)) { # compute total number of results/items found
  nr <- nr + length(results[[i]]$items)
}

DF <- data.frame(term = rep(NA_character_, nr),
  repo_name = rep(NA_character_, nr),
  repo_url = rep(NA_character_, nr),
  stringsAsFactors = FALSE)

k <- 1L
for (i in seq_len(nt)) {
  ni <- length(results[[i]]$items) # ni = no. of items
  for (j in seq_len(ni)) { # seq_len() skips the loop when a term returns no items
    DF$term[k] <- names(results)[[i]]
    DF$repo_name[k] <- results[[i]]$items[[j]]$name
    DF$repo_url[k] <- results[[i]]$items[[j]]$html_url
    k <- k + 1L
  }
}
# remove duplicated repos, which result when repos have several of our
# search terms of interest. Logical indexing is used because
# DF[-which(...), ] returns zero rows when there are no duplicates.
DF <- DF[!duplicated(DF$repo_name), ]
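For comparison, here is a sketch of the same flattening step done without explicit counters. It is an alternative, not the code used in the post. The list `mock_results` is a hypothetical stand-in with the same nesting as the list built from `content(request)`, and the function name `flatten_results` is made up for illustration. Note that this version removes duplicates by URL rather than by name, so two different repos that happen to share a name are both kept.

```r
# Hypothetical stand-in for the nested list of API results
mock_results <- list(
  NMR = list(items = list(
    list(name = "repoA", html_url = "https://github.com/user1/repoA"),
    list(name = "repoB", html_url = "https://github.com/user2/repoB")
  )),
  raman = list(items = list(
    list(name = "repoA", html_url = "https://github.com/user1/repoA")
  )),
  XRF = list(items = list()) # a term with no hits
)

flatten_results <- function(results) {
  # one small data frame per search term
  rows <- lapply(names(results), function(term) {
    items <- results[[term]]$items
    data.frame(
      term = rep(term, length(items)),
      repo_name = vapply(items, function(x) x$name, character(1)),
      repo_url = vapply(items, function(x) x$html_url, character(1)),
      stringsAsFactors = FALSE
    )
  })
  DF <- do.call(rbind, rows)
  DF[!duplicated(DF$repo_url), ] # keep the first occurrence of each repo
}

DF_flat <- flatten_results(mock_results)
```

Because `vapply()` returns a zero-length vector for an empty `items` list, a search term with no hits simply contributes no rows.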

Now put it all in a table we can inspect manually, send to a web page so it’s clickable, or potentially write out as a CSV file (for a CSV you should probably structure the results a bit differently). In this case I want the results as a table in a web page so I can click the repo links and go straight to them.

namelink <- paste0("[", DF$repo_name, "](", DF$repo_url, ")")
DF2 <- data.frame(DF$term, namelink, stringsAsFactors = FALSE)
names(DF2) <- c("Search Term", "Link to Repo")
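If a CSV file is preferred, one option is to skip the markdown links and write the plain name and URL columns instead. The sketch below assumes a data frame shaped like `DF` above. The rows in `DF_demo` are hypothetical placeholders, and a temporary file is used so nothing is written to the working directory.

```r
# Hypothetical rows with the same columns as DF
DF_demo <- data.frame(
  term = c("NMR", "raman"),
  repo_name = c("repoA", "repoB"),
  repo_url = c("https://github.com/user1/repoA", "https://github.com/user2/repoB"),
  stringsAsFactors = FALSE
)

csv_path <- tempfile(fileext = ".csv")
write.csv(DF_demo, csv_path, row.names = FALSE) # drop row names from the file

# read the file back to confirm the round trip
DF_back <- read.csv(csv_path, stringsAsFactors = FALSE)
```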

We’ll show just 10 random rows as an example:

keep <- sample(1:nrow(DF2), 10)
options(knitr.kable.NA = '')
kable(DF2[keep, ]) %>%
  kable_styling(c("striped", "bordered"))
Row   Search Term    Link to Repo
31    infrared       pycroscopy
79    ultraviolet    woudc-data-registry
51    infrared       ir-repeater
14    NMR            spectra-data
67    raman          Raman-spectra
42    infrared       PrecIR
50    infrared       esp32-room-control-panel
118   spectroscopy   LiveViewLegacy
43    infrared       arduino-primo-tutorials
101   XRF            web_geochemistry

Obviously, these results must be inspected carefully as terms like “infrared” will pick up projects that deal with infrared remote control of robots and so forth. As far as my case goes, I have a lot of new material to look through…
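Some of that inspection could be given a head start by flagging likely false positives. The sketch below is only an illustration and was not part of the original search. It assumes the `description` field of each repository item has been collected alongside the name, and the repo names, descriptions, and exclusion words shown are all hypothetical.

```r
# Hypothetical repos; in practice description would come from
# results[[i]]$items[[j]]$description, which can be missing for some repos
repos <- data.frame(
  repo_name = c("robot-remote", "ftir-tools", "no-description"),
  description = c("Infrared remote control for a robot",
                  "Preprocessing of FTIR spectra",
                  NA),
  stringsAsFactors = FALSE
)

# words suggesting the repo is about electronics rather than spectroscopy
exclude <- c("remote", "arduino", "esp32")
pattern <- paste(exclude, collapse = "|")

# flag, rather than drop, so the final decision stays with the reader
repos$suspect <- !is.na(repos$description) &
  grepl(pattern, repos$description, ignore.case = TRUE)
```

Flagged rows still deserve a look, since a word like "remote" can also appear in a legitimate spectroscopy project.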

A complete .Rmd file that carries out the search described above, and has a few enhancements, can be found at this gist.

To leave a comment for the author, please follow the link and comment on their blog: R on Chemometrics & Spectroscopy using R.
