tinkr: editing Markdown documents using XML tools

(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

Remember our recent post showing that one can wrangle Markdown files
programmatically without regex
? That
tech note showed how to convert Markdown bodies to XML in order to
extract information from them. Now, this post goes one step further and
presents tinkr, a package for
converting .md and .Rmd files to XML, editing them, and… writing them
back as Markdown
!

General tinkr workflow

The goal of tinkr is to convert Markdown files to XML and back to
allow their editing with xml2 (XPath!) instead of numerous complicated
regular expressions. The XML represents the full Markdown syntax tree
(or AST)
. If new
to XPath refer to this great
intro
. The package
offers two functions, to_xml and to_md.

The current workflow for editing .md and .Rmd with tinkr is

  1. use to_xml to obtain XML from Markdown (based on
    commonmark::markdown_xml and blogdown:::split_yaml_body
    ).

  2. edit the XML using xml2.

  3. use to_md to save back the resulting Markdown (this uses a XSLT
    stylesheet, and the xslt
    package
    ).

Maybe there could be shortcuts functions for some operations in 2, maybe
not.

The md to XML to md loop on which tinkr is based is slightly lossy
because of Markdown syntax redundancy. For instance

  • lists can be created with either “+”, “-” or “*“. When using
    tinkr, the md after editing will only use”-” for lists.

  • Links built like [word][smallref] and bottom [smallref]: URL
    become [word](URL).

  • Characters are escaped (e.g. “[” when not for a link).

  • Block quotes lines all get “>” whereas in the input only the
    first could have a “>” at the beginning of the first line.

  • Tables are no longer pretty, i.e. each header is followed by three
    dashes no matter the length of the longest string in the column. See
    this
    cheatsheet
    .

Such losses make your Markdown file different, and the git diff a bit
harder to parse, but should not change the documents your .md or .Rmd
is rendered to. If it does, report a bug in the issue
tracker
! And in any case,
as mentioned in the docs of the namer
package
, such programmatic Markdown
editing is best paired with version
control
.

A solution to not loose your Markdown style, e.g. your preferring “*“
over “-” for lists is to tweak our XSL
stylesheet
and provide its filepath as
stylesheet_path argument to to_md.

If you want to read more technical details, head over to that
section
. But we’ll show examples first!

A few tinkr examples

Since tinkr is still a toddler package, it’s available from GitHub
only
.

remotes::install_github("ropenscilabs/tinkr")

Use tinkr to change header level

We read “example1.md”, change all headers 3 to headers 1, and save it
back to md.

# From Markdown to XML
path <- system.file("extdata", "example1.md", package = "tinkr")
(yaml_xml_list <- tinkr::to_xml(path))
## $yaml
##  [1] "---"                                                                                                                                                                                                                                                                             
##  [2] "title: \"What have these birds been studied for? Querying science outputs with R\""                                                                                                                                                                                              
##  [3] "slug: birds-science"                                                                                                                                                                                                                                                             
##  [4] "authors:"                                                                                                                                                                                                                                                                        
##  [5] "  - name: Maëlle Salmon"                                                                                                                                                                                                                                                         
##  [6] "    url: https://masalmon.eu/"                                                                                                                                                                                                                                                   
##  [7] "date: 2018-09-11"                                                                                                                                                                                                                                                                
##  [8] "topicid: 1347"                                                                                                                                                                                                                                                                   
##  [9] "preface: The blog post series corresponds to the material for a talk Maëlle will give at the [Animal Movement Analysis summer school in Radolfzell, Germany on September the 12th](http://animove.org/animove-2019-evening-keynotes/), in a Max Planck Institute of Ornithology."
## [10] "tags:"                                                                                                                                                                                                                                                                           
## [11] "- rebird"                                                                                                                                                                                                                                                                        
## [12] "- birder"                                                                                                                                                                                                                                                                        
## [13] "- fulltext"                                                                                                                                                                                                                                                                      
## [14] "- dataone"                                                                                                                                                                                                                                                                       
## [15] "- EML"                                                                                                                                                                                                                                                                           
## [16] "- literature"                                                                                                                                                                                                                                                                    
## [17] "output:"                                                                                                                                                                                                                                                                         
## [18] "  md_document:"                                                                                                                                                                                                                                                                  
## [19] "    variant: markdown_github"                                                                                                                                                                                                                                                    
## [20] "    preserve_yaml: true"                                                                                                                                                                                                                                                         
## [21] "---"                                                                                                                                                                                                                                                                             
## 
## $body
## {xml_document}
## 
##  [1] \n  In the \n  \n  Getting a list of ...
##  [3] \n  For more details about th ...
##  [4] # polygon for filtering\n ...
##  [5] \n  For the sake of simplicit ...
##  [6] species <- ebd %>%\ ...
##  [7] \n  The species are Carrion C ...
##  [8] \n  Querying the scie ...
##  [9] \n  Just like rOpenSci has a  ...
## [10] \n  We shall use \n  < ...
## [11] \n  We first define a functio ...
## [12] \n  We use \n  .get_papers <- functio ...
## [14] ##  [1] "Great spotted cuckoo nest ...
## [15] \n  If we were working on a s ...
## [16] \n  We then apply this functi ...
## [17] get_papers <- ratelimi ...
## [18] ## [1] 522\n
## [19] all_papers <- unique(a ...
## [20] ## [1] 378\n
## ...
library("magrittr")
# transform level 3 headers into level 1 headers
body <- yaml_xml_list$body
body %>%
  xml2::xml_find_all(xpath = './/d1:heading',
                     xml2::xml_ns(.)) %>%
  .[xml2::xml_attr(., "level") == "3"] -> headers3

xml2::xml_set_attr(headers3, "level", 1)

(yaml_xml_list$body <- body)
## {xml_document}
## 
##  [1] \n  In the \n  \n  Getting a list of ...
##  [3] \n  For more details about th ...
##  [4] # polygon for filtering\n ...
##  [5] \n  For the sake of simplicit ...
##  [6] species <- ebd %>%\ ...
##  [7] \n  The species are Carrion C ...
##  [8] \n  Querying the scie ...
##  [9] \n  Just like rOpenSci has a  ...
## [10] \n  We shall use \n  < ...
## [11] \n  We first define a functio ...
## [12] \n  We use \n  .get_papers <- functio ...
## [14] ##  [1] "Great spotted cuckoo nest ...
## [15] \n  If we were working on a s ...
## [16] \n  We then apply this functi ...
## [17] get_papers <- ratelimi ...
## [18] ## [1] 522\n
## [19] all_papers <- unique(a ...
## [20] ## [1] 378\n
## ...
# Back to Markdown
tinkr::to_md(yaml_xml_list, "newmd.md")

Use tinkr to wrangle chunks

Because to_xml parses chunk options into node attributes, one can use
tinkr to programmatically change options. In the example below, we set
the “echo” option to FALSE in all chunks (in real life we might prefer
deleting all echo options in chunks, and set it to FALSE in the setup
chunk).

# from Rmd to XML
path <- system.file("extdata", "example2.Rmd",
                    package = "tinkr")
yaml_xml_list <- tinkr::to_xml(path)
library("magrittr")

# identify code blocks
body <- yaml_xml_list$body
blocks <- body %>%
xml2::xml_find_all(xpath = './/d1:code_block',
                   xml2::xml_ns(.))

# Change echo attribute
xml2::xml_set_attr(blocks, "echo", "FALSE")
blocks
## {xml_nodeset (4)}
## [1] # Show fig.cap
xml2::xml_attr(blocks, "fig.cap")
## [1] NA                NA                "\"pretty plot\"" NA
# save back to Rmd
yaml_xml_list$body <- body
tinkr::to_md(yaml_xml_list, "newmd.Rmd")

Bonus: use tinkr::to_xml to analyse chunk options

Even only tinkr::to_xml on its own can be powerful, thanks to its
parsing chunk options to node attributes. In the example below, we find
all code chunks of the “R for data science” book by Garrett Grolemund
and Hadley Wickham
that have the
error=TRUE chunk option. It’s the option you can set to show wrong
code in your R Markdown document. Let’s find the wrong examples of that
book!

Note that the path below is where the clone of the r4ds repo lives on
my computer.

book_path <- "C:\\Users\\Maelle\\Documents\\ropensci\\r4ds"

rmds <- fs::dir_ls(book_path, regexp = "*.Rmd")

There are 32 R Markdown documents.

library("magrittr")

get_chunks <- function(xml){
  xml %>%
    xml2::xml_find_all(xpath = './/d1:code_block',
                       xml2::xml_ns(.))
}

get_error_chunks_code <- function(xml){
  xml %>%
    get_chunks()%>%
    .[xml2::xml_has_attr(., "error")] %>%
    # note that logicals are character
    .[xml2::xml_attr(., "error") == "TRUE"] %>%
    xml2::xml_text() %>%
    as.character() }

purrr::map(rmds, tinkr::to_xml) %>%
  purrr::map("body") -> bodies

bodies %>%
  purrr::map(get_chunks) -> all_chunks

length(unlist(all_chunks))
## [1] 1776
bodies %>%
  purrr::map(get_error_chunks_code) %>%
  unlist() -> buggy_code

There are 11 chunks to show errors out of 1776 chunks. Few enough to
show all of them (using the results="asis" chunk option!)…

glue::glue("```r\n {unname(buggy_code)} \n ```")
if (c(TRUE, FALSE)) {}

if (NA) {}
 
wt_mean <- function(x, w, na.rm = FALSE) {
  stopifnot(is.logical(na.rm), length(na.rm) == 1)
  stopifnot(length(x) == length(w))
  
  if (na.rm) {
    miss <- is.na(x) | is.na(w)
    x <- x[!miss]
    w <- w[!miss]
  }
  sum(w * x) / sum(w)
}
wt_mean(1:6, 6:1, na.rm = "foo")
 
issues %>% map_chr(c("pull_request", "html_url"))
 
tibble(x = "e") %>% 
  add_predictions(mod2)
 
mtcars %>% 
  group_by(cyl) %>% 
  summarise(q = quantile(mpg))
 
# Ok, because y and z have the same number of elements in
# every row
df1 <- tribble(
  ~x, ~y,           ~z,
   1, c("a", "b"), 1:2,
   2, "c",           3
)
df1
df1 %>% unnest(y, z)

# Doesn't work because y and z have different number of elements
df2 <- tribble(
  ~x, ~y,           ~z,
   1, "a",         1:2,  
   2, c("b", "c"),   3
)
df2
df2 %>% unnest(y, z)
 
tryCatch(stop("!"), error = function(e) "An error")

stop("!") %>% 
  tryCatch(error = function(e) "An error")
 
filter(flights, month = 1)
 
tibble(x = 1:4, y = 1:2)

tibble(x = 1:4, y = rep(1:2, 2))

tibble(x = 1:4, y = rep(1:2, each = 2))
 
x[c(1, -1)]
 
my_variable <- 10
my_variable
 

The book explains in more detail why these
chunks create errors… or you can use tinkr to extract more information
out of the R Markdown source!

If you’re not interested into technical details, hop over the next
section and read the conclusion directly!

tinkr under the hood

This section features a bit more technical details about to_xml and
to_md, for those interested in such stuff, out of curiosity (cool!) or
out of a desire to become a contributor (yay!).

From Markdown to XML: to_xml

The to_xml function uses internal code from blogdown to split lines
of the md between YAML header and Markdown body. The Markdown body is
further processed using commonmark::markdown_xml with
extensions=TRUE and a homegrown code to transform code chunks info
from string ““`{r setup, include=FALSE, eval = TRUE}” to XML node
attributes (called “language”, “name”, “include”, “eval” for this
example). This allows easier editing of code chunks.

Side-note: This piece of tinkr worked quite well right away but we
noticed that the XML conversion lost the alignment attributes of table
cells, so I opened an issue in the repo of the github fork of
cmark
. rOpenSci post-doc
hacker Jeroen Ooms helped me find where to
open this issue: cmark is the C library wrapped by the R package
commonmark. There’s the parent repo commonmark/cmark, but the one
wrapped by commonmark is the github/cmark fork because it’s the one
supporting GitHub-flavored Markdown extensions such as tables and
strike-through text. A new version of the github/cmark library was
released
by Ashe Connor and
commonmark
was updated by Jeroen to use this newest version.

to_xml returns a list containing the YAML header as a character
vector, and the body as XML. YAML metadata could be edited using the
yaml package, which is not the
goal of tinkr.

From XML to Markdown: to_md

After editing the XML body, one needs to convert the list containing the
YAML and the body back to Markdown. to_md does so by:

  • packing the chunk options back from node attributes to a string.

  • using the rOpenSci package
    xslt
    (bindings to
    libxslt) for conversion of the body
    to Markdown, using an XSLT stylesheet as Rosetta stone. In Jeroen’s
    words, “XSLT is just a mini programming language to run loops over
    an XML, that is written itself in XML”. Things such as
    \n R Markdown\n thus
    becomes ## R Markdown

  • using writeLines on the YAML character vectors and on the body.

Another side-note: The most crucial part here was obtaining an XSLT
stylesheet. Jeroen opened an issue in the commonmark/cmark
repo
and Nick
Wellnhofer
“whipped up” an XSLT
stylesheet. If you followed correctly, the commonmark/cmark library
doesn’t support GitHub-flavoured Markdown extensions, so Nick Wellnhofer
understandably didn’t include templates for those in his stylesheet. I
miraculously was able to write a bit of XSLT myself, and submitted a
stylesheet to the github/cmark
repo
. It imports Nick
Wellnhofer’s stylesheet, and defines templates for tables and
strike-through text. Copies of these two stylesheets live in tinkr and
are therefore available after installing it.

to_md exposes the path to the stylesheet as argument, so you could
provide your own in order to preserve your style preferences
(e.g. writing lists with “*” rather than “-”)

Future plans

The goal of tinkr was to experiment with the idea of Commonmark/XML
conversion, and then to go from there. Now that to_xml and to_md
allow the programmatic XML editing of Markdown files, albeit slightly
lossy, maybe tinkr could keep growing! Here are two main possible
discussion topics with you, dear reader:

  • Are you an XSL wizard? If so, would you like to help improve the
    XSLT stylesheet, in particular to prettify tables? Head to the
    GitHub repository, in particular to this
    issue
    , and
    volunteer!

  • tinkr was called after the made-up but accepted word “tink” that
    means “unknit”. tinkr is currently more about tinkering with
    than tinking R Markdown files, but why not unearth the common
    discussion of how to collaborate with people who’d rather edit a
    Google doc/docx than R Markdon source? Could one convert to and from
    Open Office XML and help track changes for instance?

And of course, to finish on a more down-to-earth note, if you do edit
Markdown files with tinkr, please report your experience and what
could be improved, in the comments below or the GitHub
repo
!

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers


Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)