In praise of Commonmark: wrangle (R)Markdown files without regex

[This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers].

You might have read my blog post analyzing the social weather of
based on a text analysis of GitHub issues. I extracted text out of
Markdown-formatted threads with regular expressions. I basically
hammered away at the issues using tools I was familiar with until it
worked! Now I know there’s a much better and cleaner way, which I’ll
present in this note. Read on if you want to extract insights about
text, code, links, etc. from R Markdown reports, Hugo website sources,
GitHub issues… without writing messy and smelly code!

Introduction to Markdown rendering and parsing

This note will appear to you, dear reader, as an html page, either here
or on R-Bloggers, but I’m writing it as an R Markdown
document, using Markdown syntax. I’ll knit it to Markdown and then
Hugo’s Markdown processor, Blackfriday, will transform
it to html. Elements such as # blabla thus get transformed to
<h1>blabla</h1>. Awesome!

The rendering of Markdown to html or XML can also be used as a way to
parse it, which is what the spelling package does in order to
identify the text of R Markdown files, so that it spell checks only the
text, not the code. I had an
aha moment when seeing this spelling strategy: why did I ever use
regex to parse Markdown for text analysis?! Transforming it to XML
first, and then using XPath, would be much cleaner!
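To make the idea concrete, here is a minimal sketch on a tiny Markdown document that I made up for illustration (the object names md, doc, prose and code are mine, not from any package). It renders the Markdown to XML with commonmark, reads it with xml2, and then picks text nodes and code blocks apart with two XPath queries. The d1: prefix is the name xml2 gives to the default namespace of the document.

```r
# A tiny Markdown document, made up for this illustration
# (tilde fences are used for the code chunk; they are valid CommonMark)
md <- c(
  "# A title",
  "",
  "Some *emphasised* text.",
  "",
  "~~~r",
  "x <- 1 + 1",
  "~~~"
)

# Render Markdown to CommonMark XML, then read it as an XML document
doc <- xml2::read_xml(
  commonmark::markdown_xml(paste(md, collapse = "\n"))
)

# Namespace definitions of the document (the default one is called d1)
ns <- xml2::xml_ns(doc)

# Text nodes only: the content of the code block is not included
prose <- xml2::xml_text(xml2::xml_find_all(doc, ".//d1:text", ns))

# Code blocks only
code <- xml2::xml_text(xml2::xml_find_all(doc, ".//d1:code_block", ns))
```

Text and code end up in separate vectors without a single regular expression, which is the whole point of the approach.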

As a side-note, realizing how to simplify my old code made me think of
Jenny Bryan’s inspiring useR! keynote talk about code smells. I asked her
whether code full of regular expressions instead of dedicated parsing
tools was a code smell. Sadly it doesn’t have a specific name, but she
confirmed my feeling that not using purpose-built tools
might mean you’ll end up “re-inventing all of that logic yourself, in
hacky way.” If you have code that fits this description, maybe
try to refactor it.

From Markdown to XML

In this note I’ll use my local fork of rOpenSci’s website source, and
use all the Markdown sources of blog posts as example data. The chunk
below is therefore not portable, sorry about that.

roblog <- "C:\\Users\\Maelle\\Documents\\ropensci\\roweb2\\content\\blog"

# "*.md" is a glob pattern, not a regular expression
all_posts <- fs::dir_ls(roblog, glob = "*.md")
all_posts <- all_posts[all_posts != ""]

My fork master branch isn’t entirely synced. It has 202 posts.

The code below uses the commonmark package to render Markdown to
XML. CommonMark is a standardized specification for Markdown syntax by
John MacFarlane. The commonmark R
package by Jeroen Ooms wraps the official
cmark library, which is also what
e.g. GitHub uses to render issues and READMEs. Note that my function still
has a hacky element: it uses an unexported blogdown function to strip
the YAML header of posts! If you know a better way, feel free to answer
my question over at RStudio Community.

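# the pipe comes from magrittr
library("magrittr")

get_one_xml <- function(md){
  md %>%
    readLines(encoding = "UTF-8") %>%
    # unexported blogdown helper: separates the YAML header from the body
    blogdown:::split_yaml_body() %>%
    .$body %>%
    commonmark::markdown_xml(extensions = TRUE) %>%
    # read the XML string as an xml_document
    xml2::read_xml()
}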

See what it gives me for one post.


## {xml_document}
## <document xmlns="">
##  [1] <paragraph>\n  <text>We just released a new version of </text>\n  < ...
##  [2] <heading level="2">\n  <text>First, install and load taxize</text>\ ...
##  [3] <code_block info="r">install.packages("rgbif")\n</code_block>
##  [4] <code_block info="r">library(taxize)\n</code_block>
##  [5] <heading level="2">\n  <text>New things</text>\n</heading>
##  [6] <heading level="3">\n  <text>New functions: class2tree</text>\n</he ...
##  [7] <paragraph>\n  <text>Sometimes you just want to have a visual of th ...
##  [8] <paragraph>\n  <text>Define a species list</text>\n</paragraph>
##  [9] <code_block info="r">spnames &lt;- c("Latania lontaroides", "Randia ...
## [10] <paragraph>\n  <text>Then collect taxonomic hierarchies for each ta ...
## [11] <code_block info="r">out &lt;- classification(spnames, db = "ncbi", ...
## [12] <paragraph>\n  <text>Use </text>\n  <code>class2tree</code>\n  <tex ...
## [13] <code_block info="r">tr &lt;- class2tree(out)\nplot(tr, no.margin = ...
## [14] <paragraph>\n  <image destination="/assets/blog-images/2014-02-19-t ...
## [15] <heading level="3">\n  <text>New functions: get_gbfid</text>\n</hea ...
## [16] <paragraph>\n  <text>The Global Biodiversity Information Facility ( ...
## [17] <paragraph>\n  <text>We added a similar function to our </text>\n   ...
## [18] <code_block info="r">get_gbifid(sciname = "Poa annua", verbose = FA ...
## [19] <code_block>##         1\n## "2704179"\n## attr(,"class")\n## [1] " ...
## [20] <code_block info="r">get_gbifid(sciname = "Pinus contorta", verbose ...
## ...

Headings, code blocks… all properly delimited and one XPath query away
from us! Let me convert all posts before diving into parsing examples.

all_posts %>%
  purrr::map(get_one_xml) -> blog_xml

Parsing the XML

URLs parsing

Let’s say I want to find out which domains are the most often linked
from rOpenSci’s blog. No need for any regular expression thanks to
commonmark, xml2 and urltools!

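get_urls <- function(post_xml){
  post_xml %>%
    # select all links
    xml2::xml_find_all(xpath = './/d1:link', xml2::xml_ns(post_xml)) %>%
    # get their target URLs
    xml2::xml_attr("destination") %>%
    # split URLs into scheme, domain, path, etc. (one row per URL)
    urltools::url_parse()
}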

# URLs
blog_xml %>%
  purrr::map_df(get_urls) %>%
  dplyr::count(domain, sort = TRUE) %>%
  head(n = 10)

Top 10 counts (n) by domain, in decreasing order: 1111, 272, 167, 130, 60, 29, 27, 15, 15, 15

More Twitter than CRAN! We could probably link to our own domain less
often, since a path starting with / would get us here too.
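A word on the d1: prefix used in the XPath query of get_urls: the XML produced by commonmark declares a default namespace, and xml2 names it d1. The sketch below, on a one-line Markdown string of my own, shows two ways of dealing with it that I believe are equivalent here: qualifying node names with d1:, or stripping the namespace with xml2::xml_ns_strip() and using bare node names. Note that xml_ns_strip() modifies the document in place.

```r
# One line of Markdown with two links, made up for this illustration
doc <- xml2::read_xml(commonmark::markdown_xml(
  "See [CRAN](https://cran.r-project.org/) and [the blog](https://ropensci.org/blog/)."
))

# The default namespace of the document is exposed under the name "d1"
ns <- xml2::xml_ns(doc)
ns_names <- names(ns)

# Option 1: qualify node names with the d1 prefix
dest_prefixed <- xml2::xml_attr(
  xml2::xml_find_all(doc, ".//d1:link", ns),
  "destination"
)

# Option 2: strip the default namespace (in place), then use bare names
xml2::xml_ns_strip(doc)
dest_bare <- xml2::xml_attr(
  xml2::xml_find_all(doc, ".//link"),
  "destination"
)
```

Both queries return the same two URLs; which style to use is mostly a matter of taste, as long as it is consistent across a script.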

R code parsing

Remember that cool post by Matt Dancho analyzing David Robinson’s code?
In theory you could clone any of your favorite blogs (David Robinson’s,
Julia Silge’s, etc.) to analyze
them, no need to even webscrape first! Note that you can git clone from
R using the git2r package.

get_functions <- function(post_xml){
  post_xml %>%
    # select all code chunks
    xml2::xml_find_all(xpath = './/d1:code_block', xml2::xml_ns(.)) %>%
    # select chunks with language info
    .[xml2::xml_has_attr(., "info")] %>%
    # select R chunks
    .[xml2::xml_attr(., "info") == "r"] %>%
    # get the content of these chunks
    xml2::xml_text() %>%
    glue::glue_collapse(sep = "\n") -> code_text

  # Base R code parsing tools: one row per token of the parsed code
  parsed_code <- try(parse(text = code_text,
        keep.source = TRUE) %>%
    utils::getParseData(),
    silent = TRUE)

  if(inherits(parsed_code, "try-error")){
    # this happens because of output sometimes
    # stored in R chunks when not using R Markdown
    return(NULL)
  }

  if(is.null(parsed_code)){
    # no R code in this post
    return(NULL)
  }

  # keep function calls and function definitions
  dplyr::filter(parsed_code,
                grepl("FUNCTION", token))
}

blog_xml %>%
  purrr::map_df(get_functions) %>%
  dplyr::count(text, sort = TRUE) %>%
  head(n = 10)
text n
library 263
c 210
aes 106
filter 71
mutate 64
ggplot 58
function 53
install.packages 50
install_github 38
select 38

Function definitions (function), basic stuff (c, library) and
tidyverse functions seem to be the most popular on the blog!
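To see why both function definitions and function calls show up in the counts, here is a small sketch of what base R's parse data looks like, on a code string I made up. The parsed code is never evaluated, so the packages it mentions need not be installed. In the parse data, calls carry the token SYMBOL_FUNCTION_CALL and the function keyword carries the token FUNCTION, so matching on "FUNCTION" keeps both.

```r
# A made-up snippet of R code, stored as text (it is parsed, not run)
code_text <- paste(
  "library(dplyr)",
  "square <- function(x) x^2",
  "mutate(mtcars, mpg2 = square(mpg))",
  sep = "\n"
)

# One row per token, with columns such as token and text
parsed <- utils::getParseData(parse(text = code_text, keep.source = TRUE))

# Keep function calls (SYMBOL_FUNCTION_CALL) and definitions (FUNCTION)
is_fun <- grepl("FUNCTION", parsed$token)
calls <- parsed$text[is_fun]
tokens <- unique(parsed$token[is_fun])
```

The name square appears twice in the snippet, but only its use in square(mpg) is counted: on the left-hand side of the assignment it is a plain SYMBOL token.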

Text parsing

After complementing our commonmark-xml2 combo with urltools and
with base R’s code parsing facilities… let’s pair it with
tidytext! What are the words most
commonly used in rOpenSci’s blog posts?

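get_text <- function(post_xml){
  # select all text nodes (code is stored in other node types)
  xml2::xml_find_all(post_xml,
                     xpath = './/d1:text', xml2::xml_ns(post_xml)) %>%
    xml2::xml_text(trim = TRUE) %>%
    # one string per post
    glue::glue_collapse(sep = " ") %>%
    as.character() -> text
  tibble::tibble(text = text)
}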

blog_xml %>%
  purrr::map_df(get_text) %>%
  tidytext::unnest_tokens(word, text, token = "words") %>%
  dplyr::filter(!word %in% tidytext::stop_words$word) %>%
  dplyr::count(word, sort = TRUE) %>%
  head(n = 10)
word n
data 1969
package 1097
ropensci 569
packages 486
time 412
community 394
code 377
github 358
software 302
science 297

This beats my old code! There’s really something to be said for
purpose-built tools.


I hope this note will inspire you to use commonmark and xml2 when
analyzing Markdown files. As mentioned earlier, Hugo or Jekyll website
sources are Markdown files, and so are GitHub issue threads, so it should
open up quite a lot of data! If you’re new to XPath, I’d recommend
reading an introduction to it. The
results of XML parsing are also better handled without regular
expressions of your own writing: I have shown urltools for URL parsing and base
R’s code parsing tools (parse, getParseData), and I’ve used
tidytext for text.

Note that if you’re into blog analysis, don’t forget you can also get
information out of the YAML header using… the yaml package,
not regular expressions!
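As a minimal sketch of that last point, assuming a post whose header is delimited by two lines containing exactly --- (the post content and the names post, delims, header_lines and meta are made up for illustration), the header can be located by exact comparison and handed over to yaml::yaml.load():

```r
# A made-up post: YAML header between two "---" lines, then the body
post <- c(
  "---",
  "title: A made-up post",
  "tags:",
  "  - xml",
  "  - markdown",
  "---",
  "",
  "Body text."
)

# Positions of the delimiter lines (exact match, no regular expression)
delims <- which(post == "---")

# Lines strictly between the first two delimiters form the header
header_lines <- post[(delims[1] + 1):(delims[2] - 1)]

# Parse the header into a named list
meta <- yaml::yaml.load(paste(header_lines, collapse = "\n"))

# Everything after the second delimiter is the Markdown body
body <- post[-seq_len(delims[2])]
```

This sketch assumes the header is present and well-formed; real posts may need a check that two delimiters were actually found.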

As a bonus, maybe seeing this wrangling inspired you to modify
Markdown files programmatically? E.g. what if I wanted to automatically
replace all level 1 headers with level 2 headers? We’re working on that,
stay tuned!

Thanks to Jeroen Ooms, Jenny and Jim for answering my XML parsing
(meta) questions.
