In praise of Commonmark: wrangle (R)Markdown files without regex


You might have read my blog post analyzing the social weather of rOpenSci onboarding, based on a text analysis of GitHub issues. I extracted text out of Markdown-formatted threads with regular expressions. I basically hammered away at the issues using tools I was familiar with until it worked! Now I know there’s a much better and cleaner way, that I’ll present in this note. Read on if you want to extract insights about text, code, links, etc. from R Markdown reports, Hugo website sources, GitHub issues… without writing messy and smelly code!

Introduction to Markdown rendering and parsing

This note will appear to you, dear reader, as an html page, either here on ropensci.org or on R-Bloggers, but I’m writing it as an R Markdown document, using Markdown syntax. I’ll knit it to Markdown and then Hugo’s Markdown processor, Blackfriday, will transform it to html. Elements such as # blabla thus get transformed to <h1>blabla</h1>. Awesome!
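Blackfriday runs outside of R, but the commonmark package, which we’ll meet below, can illustrate the same kind of transformation in one line:

commonmark::markdown_html("# blabla")
## [1] "<h1>blabla</h1>\n"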

The rendering of Markdown to html or XML can also be used as a way to parse it, which is what the spelling package does in order to identify the text segments of R Markdown files before spell checking them (and only them, not the code). I had an aha moment when seeing this spelling strategy: why did I ever use regex to parse Markdown for text analysis?! Transforming it to XML first, and then using XPath, would be much cleaner!
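To make that concrete, here is the kind of XML commonmark produces for a tiny snippet (a sketch; the node names come from the CommonMark XML format):

# prints an XML document whose elements mirror the Markdown structure:
# a <document> root with <paragraph>, <emph>, <link> and <text> children
cat(commonmark::markdown_xml("Hello *world*, see [the docs](https://example.org)."))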

As a side note, realizing how to simplify my old code made me think of Jenny Bryan’s inspiring useR! keynote talk about code smells. I asked her whether code full of regular expressions instead of dedicated parsing tools was a code smell. Sadly it doesn’t have a specific name, but she confirmed my feeling that not using dedicated purpose-built tools might mean you’ll end up “re-inventing all of that logic yourself, in hacky way.” If you have code falling under that definition, maybe try to re-factor it and, if needed, get help.

From Markdown to XML

In this note I’ll use my local fork of rOpenSci’s website source, with all the Markdown sources of blog posts as example data. The chunk below is therefore not portable, sorry about that.

roblog <- "C:\\Users\\Maelle\\Documents\\ropensci\\roweb2\\content\\blog"

all_posts <- fs::dir_ls(roblog, regexp = "*.md")
all_posts <- all_posts[all_posts != "_index.md"]

My fork’s master branch isn’t entirely synced. It has 202 posts.

The code below uses the commonmark package to render Markdown to XML. Commonmark is a standardized specification of Markdown syntax by John MacFarlane. The commonmark R package by Jeroen Ooms wraps the official cmark library and is used by e.g. GitHub to render issues and readmes. Note that my function still has a hacky element: it uses an unexported blogdown function to strip the YAML header of posts! If you know a better way, feel free to answer my question over at the RStudio Community discussion forum.

library("magrittr")
get_one_xml <- function(md){
  md %>%
    readLines(encoding = "UTF-8") %>%
    # strip the YAML header (the hacky, unexported-function part)
    blogdown:::split_yaml_body() %>%
    .$body %>%
    # render the Markdown body to XML...
    commonmark::markdown_xml(extensions = TRUE) %>%
    # ...and read that XML into an xml2 document
    xml2::read_xml()
}
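(As an aside, one regex-free alternative, not blogdown’s API, could be a small helper that drops everything up to the second --- delimiter, assuming posts start with a YAML block; strip_yaml below is a made-up name, just a sketch.)

strip_yaml <- function(lines) {
  delims <- which(lines == "---")
  # drop the YAML block, i.e. everything up to and including the second ---
  if (length(delims) >= 2 && delims[1] == 1) lines[-seq_len(delims[2])] else lines
}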

See what it gives me for one post.

get_one_xml(all_posts[42])
## {xml_document}
## <document xmlns="http://commonmark.org/xml/1.0">
## (output truncated here: the document’s children are <paragraph>, <heading>
## and <code_block> nodes holding the post’s prose and code chunks)

Headings, code blocks… all properly delimited and one XPath query away from us, as the quick sketch below shows! After that, let me convert all posts before diving into the parsing examples.
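A minimal sketch of such a query, pulling the headings of that single post (d1 is the placeholder prefix that xml2::xml_ns() assigns to commonmark’s default namespace):

get_one_xml(all_posts[42]) %>%
  # select the heading nodes, whatever their level
  xml2::xml_find_all(".//d1:heading", xml2::xml_ns(.)) %>%
  xml2::xml_text()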

all_posts %>%
  purrr::map(get_one_xml) -> blog_xml

Parsing the XML

URL parsing

Let’s say I want to find out which domains are most often linked to from rOpenSci’s blog. No need for any regular expression, thanks to commonmark, xml2 and urltools!

get_urls <- function(post_xml){
  post_xml %>%
    # select all link nodes
    xml2::xml_find_all(xpath = './/d1:link', xml2::xml_ns(post_xml)) %>%
    # their destination attribute holds the URL
    xml2::xml_attr("destination") %>%
    # split each URL into scheme, domain, path, etc.
    urltools::url_parse()
}
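urltools::url_parse() is what removes the need for regex here: it returns a data frame with one row per URL and columns such as scheme, domain and path, so counting domains becomes a plain dplyr::count(). A quick standalone example:

# a one-row data frame with columns scheme, domain, port, path, parameter, fragment
urltools::url_parse("https://ropensci.org/blog/")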

# URLs
blog_xml %>%
  purrr::map_df(get_urls) %>%
  dplyr::count(domain, sort = TRUE) %>%
  head(n = 10) %>%
  knitr::kable()
|domain                |    n|
|:---------------------|----:|
|github.com            | 1111|
|ropensci.org          |  272|
|twitter.com           |  167|
|cran.r-project.org    |  130|
|en.wikipedia.org      |   60|
|ropensci.github.io    |   29|
|doi.org               |   27|
|bioconductor.org      |   15|
|unconf17.ropensci.org |   15|
|www.gbif.org          |   15|

More Twitter than CRAN! We could probably do with less linking to our own domain, since relative links starting with / would get readers there too.

R code parsing

Remember that cool post by Matt Dancho analyzing David Robinson’s code? In theory you could clone any of your favorite blogs (David Robinson’s blog, Julia Silge’s blog, etc.) to analyze them, no need to even webscrape first! Note that you can git clone from R using the git2r package.
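For instance, cloning the source of this very blog from R might look like this (a sketch; pick whatever destination folder suits you):

# clone the roweb2 repository into a temporary folder
git2r::clone("https://github.com/ropensci/roweb2",
             local_path = file.path(tempdir(), "roweb2"))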

get_functions <- function(post_xml){
  post_xml %>%
    # select all code chunks
    xml2::xml_find_all(xpath = './/d1:code_block', xml2::xml_ns(.)) %>%
    # select chunks with language info
    .[xml2::xml_has_attr(., "info")] %>%
    # select R chunks
    .[xml2::xml_attr(., "info") == "r"] %>%
    # get the content of these chunks
    xml2::xml_text() %>%
    glue::glue_collapse(sep = "\n") -> code_text
  
  # Base R code parsing tools
  parsed_code <- try(parse(text = code_text,
        keep.source = TRUE) %>%
    utils::getParseData(),
    silent = TRUE)
  
  if(is(parsed_code, "try-error")){
    # this happens because of output sometimes
    # stored in R chunks when not using R Markdown
    return(NULL)
  }
  
  if(is.null(parsed_code)){
    return(NULL)
  }
    
  dplyr::filter(parsed_code,
                grepl("FUNCTION", token))
    
}
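In case you have never met getParseData(): it returns a data frame with one row per token, and its token column labels the function keyword as FUNCTION and called functions as SYMBOL_FUNCTION_CALL, which is why the grepl("FUNCTION", token) filter above catches both. A tiny standalone example:

parsed <- parse(text = "summarise_mean <- function(x) mean(x, na.rm = TRUE)",
                keep.source = TRUE)
tokens <- utils::getParseData(parsed)
# keep the FUNCTION keyword and the call to mean()
tokens[grepl("FUNCTION", tokens$token), c("token", "text")]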

blog_xml %>%
  purrr::map_df(get_functions) %>%
  dplyr::count(text, sort = TRUE) %>%
  head(n = 10) %>%
  knitr::kable()
|text             |   n|
|:----------------|---:|
|library          | 263|
|c                | 210|
|aes              | 106|
|filter           |  71|
|mutate           |  64|
|ggplot           |  58|
|function         |  53|
|install.packages |  50|
|install_github   |  38|
|select           |  38|

Function definitions (function), basic stuff (c, library) and tidyverse functions seem to be the most popular on the blog!

Text parsing

After complementing our commonmark-xml2 combo with urltools and with base R’s code parsing facilities… let’s pair it with tidytext! What are the words most commonly used in rOpenSci’s blog posts?

get_text <- function(post_xml){
  # select all text nodes, i.e. the prose rather than the code
  xml2::xml_find_all(post_xml,
                     xpath = './/d1:text', xml2::xml_ns(post_xml)) %>%
    xml2::xml_text(trim = TRUE) %>%
    # collapse them into one string per post
    glue::glue_collapse(sep = " ") %>%
    as.character() -> text
  
  tibble::tibble(text = text)
}
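Note that the .//d1:text query only selects the text leaves of the document (paragraph text, headings, emphasis, link labels); code blocks live in code_block nodes, so they stay out of the word counts. A toy check of that (a sketch):

md <- "# Title\n\nSome *text* to keep.\n\n```r\nignored <- TRUE\n```"
md %>%
  commonmark::markdown_xml(extensions = TRUE) %>%
  xml2::read_xml() %>%
  # only the prose ends up in text nodes; "ignored <- TRUE" does not
  xml2::xml_find_all(".//d1:text", xml2::xml_ns(.)) %>%
  xml2::xml_text()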

blog_xml %>%
  purrr::map_df(get_text) %>%
  tidytext::unnest_tokens(word, text, token = "words") %>%
  dplyr::filter(!word %in% tidytext::stop_words$word) %>%
  dplyr::count(word, sort = TRUE) %>%
  head(n = 10) %>%
  knitr::kable()
|word      |    n|
|:---------|----:|
|data      | 1969|
|package   | 1097|
|ropensci  |  569|
|packages  |  486|
|time      |  412|
|community |  394|
|code      |  377|
|github    |  358|
|software  |  302|
|science   |  297|

This beats my old code! There’s really something to be said for
purpose-built tools.

Conclusion

I hope this note will inspire you to use commonmark and xml2 when analyzing Markdown files. As mentioned earlier, Hugo or Jekyll website sources are Markdown files, and so are GitHub issue threads, so this should open up quite a lot of data! If you’re new to XPath, I’d recommend reading this introduction. The results of the XML parsing are also better handled without your writing regular expressions: I’ve shown urltools for URL parsing, noted that base R has code parsing tools (parse, getParseData), and used tidytext for text analysis.

Note that if you’re into blog analysis, don’t forget you can also get information out of the YAML header using… the yaml package, not regular expressions!
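A minimal sketch of that, assuming posts start with a --- delimited YAML block (read_yaml_header is a made-up helper name):

read_yaml_header <- function(path) {
  lines <- readLines(path, encoding = "UTF-8")
  delims <- which(lines == "---")
  # parse only the lines between the first two --- delimiters
  yaml::yaml.load(paste(lines[(delims[1] + 1):(delims[2] - 1)], collapse = "\n"))
}

read_yaml_header(all_posts[42])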

As a bonus, maybe seeing this wrangling inspired you to modify Markdown files programmatically? E.g. what if I wanted to automatically replace all level 1 headers with level 2 headers? We’re working on that, so stay tuned, and if you want, follow this GitHub thread!

Thanks to Jeroen Ooms, Jenny Bryan and Jim Hester for answering my XML parsing (meta) questions.
