tinkr: editing Markdown documents using XML tools
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Remember our recent post showing that one can wrangle Markdown files
programmatically without regex? That
tech note showed how to convert Markdown bodies to XML in order to
extract information from them. Now, this post goes one step further and
presents tinkr
, a package for
converting .md and .Rmd files to XML, editing them, and… writing them
back as Markdown!
General tinkr
workflow
The goal of tinkr
is to convert Markdown files to XML and back to
allow their editing with xml2
(XPath!) instead of numerous complicated
regular expressions. The XML represents the full Markdown syntax tree
(or AST). If new
to XPath refer to this great
intro. The package
offers two functions, to_xml
and to_md
.
The current workflow for editing .md and .Rmd with tinkr
is
use
to_xml
to obtain XML from Markdown (based oncommonmark::markdown_xml
andblogdown:::split_yaml_body
).edit the XML using
xml2
.use
to_md
to save back the resulting Markdown (this uses a XSLT stylesheet, and thexslt
package).
Maybe there could be shortcuts functions for some operations in 2, maybe not.
The md to XML to md loop on which tinkr
is based is slightly lossy
because of Markdown syntax redundancy. For instance
lists can be created with either “+”, “-” or “*“. When using
tinkr
, the md after editing will only use”-” for lists.Links built like
[word][smallref]
and bottom[smallref]: URL
become[word](URL)
.Characters are escaped (e.g. “[” when not for a link).
Block quotes lines all get “>” whereas in the input only the first could have a “>” at the beginning of the first line.
Tables are no longer pretty, i.e. each header is followed by three dashes no matter the length of the longest string in the column. See this cheatsheet.
Such losses make your Markdown file different, and the git diff a bit
harder to parse, but should not change the documents your .md or .Rmd
is rendered to. If it does, report a bug in the issue
tracker! And in any case,
as mentioned in the docs of the namer
package, such programmatic Markdown
editing is best paired with version
control.
A solution to not loose your Markdown style, e.g. your preferring “*“
over “-” for lists is to tweak our XSL
stylesheet and provide its filepath as
stylesheet_path
argument to to_md
.
If you want to read more technical details, head over to that section. But we’ll show examples first!
A few tinkr
examples
Since tinkr
is still a toddler package, it’s available from GitHub
only.
remotes::install_github("ropenscilabs/tinkr")
Use tinkr
to change header level
We read “example1.md”, change all headers 3 to headers 1, and save it back to md.
# From Markdown to XML path <- system.file("extdata", "example1.md", package = "tinkr") (yaml_xml_list <- tinkr::to_xml(path)) ## $yaml ## [1] "---" ## [2] "title: \"What have these birds been studied for? Querying science outputs with R\"" ## [3] "slug: birds-science" ## [4] "authors:" ## [5] " - name: Maëlle Salmon" ## [6] " url: https://masalmon.eu/" ## [7] "date: 2018-09-11" ## [8] "topicid: 1347" ## [9] "preface: The blog post series corresponds to the material for a talk Maëlle will give at the [Animal Movement Analysis summer school in Radolfzell, Germany on September the 12th](http://animove.org/animove-2019-evening-keynotes/), in a Max Planck Institute of Ornithology." ## [10] "tags:" ## [11] "- rebird" ## [12] "- birder" ## [13] "- fulltext" ## [14] "- dataone" ## [15] "- EML" ## [16] "- literature" ## [17] "output:" ## [18] " md_document:" ## [19] " variant: markdown_github" ## [20] " preserve_yaml: true" ## [21] "---" ## ## $body ## {xml_document} ## <document xmlns="http://commonmark.org/xml/1.0"> ## [1] <paragraph>\n <text xml:space="preserve">In the </text>\n <link d ... ## [2] <heading level="3">\n <text xml:space="preserve">Getting a list of ... ## [3] <paragraph>\n <text xml:space="preserve">For more details about th ... ## [4] <code_block info="r" xml:space="preserve"># polygon for filtering\n ... ## [5] <paragraph>\n <text xml:space="preserve">For the sake of simplicit ... ## [6] <code_block info="r" xml:space="preserve">species <- ebd %>%\ ... ## [7] <paragraph>\n <text xml:space="preserve">The species are Carrion C ... ## [8] <heading level="3">\n <text xml:space="preserve">Querying the scie ... ## [9] <paragraph>\n <text xml:space="preserve">Just like rOpenSci has a ... ## [10] <paragraph>\n <text xml:space="preserve">We shall use </text>\n < ... ## [11] <paragraph>\n <text xml:space="preserve">We first define a functio ... ## [12] <paragraph>\n <text xml:space="preserve">We use </text>\n <code x ... ## [13] <code_block info="r" xml:space="preserve">.get_papers <- functio ... ## [14] <code_block xml:space="preserve">## [1] "Great spotted cuckoo nest ... ## [15] <paragraph>\n <text xml:space="preserve">If we were working on a s ... ## [16] <paragraph>\n <text xml:space="preserve">We then apply this functi ... ## [17] <code_block info="r" xml:space="preserve">get_papers <- ratelimi ... ## [18] <code_block xml:space="preserve">## [1] 522\n</code_block> ## [19] <code_block info="r" xml:space="preserve">all_papers <- unique(a ... ## [20] <code_block xml:space="preserve">## [1] 378\n</code_block> ## ... library("magrittr") # transform level 3 headers into level 1 headers body <- yaml_xml_list$body body %>% xml2::xml_find_all(xpath = './/d1:heading', xml2::xml_ns(.)) %>% .[xml2::xml_attr(., "level") == "3"] -> headers3 xml2::xml_set_attr(headers3, "level", 1) (yaml_xml_list$body <- body) ## {xml_document} ## <document xmlns="http://commonmark.org/xml/1.0"> ## [1] <paragraph>\n <text xml:space="preserve">In the </text>\n <link d ... ## [2] <heading level="1">\n <text xml:space="preserve">Getting a list of ... ## [3] <paragraph>\n <text xml:space="preserve">For more details about th ... ## [4] <code_block info="r" xml:space="preserve"># polygon for filtering\n ... ## [5] <paragraph>\n <text xml:space="preserve">For the sake of simplicit ... ## [6] <code_block info="r" xml:space="preserve">species <- ebd %>%\ ... ## [7] <paragraph>\n <text xml:space="preserve">The species are Carrion C ... ## [8] <heading level="1">\n <text xml:space="preserve">Querying the scie ... ## [9] <paragraph>\n <text xml:space="preserve">Just like rOpenSci has a ... ## [10] <paragraph>\n <text xml:space="preserve">We shall use </text>\n < ... ## [11] <paragraph>\n <text xml:space="preserve">We first define a functio ... ## [12] <paragraph>\n <text xml:space="preserve">We use </text>\n <code x ... ## [13] <code_block info="r" xml:space="preserve">.get_papers <- functio ... ## [14] <code_block xml:space="preserve">## [1] "Great spotted cuckoo nest ... ## [15] <paragraph>\n <text xml:space="preserve">If we were working on a s ... ## [16] <paragraph>\n <text xml:space="preserve">We then apply this functi ... ## [17] <code_block info="r" xml:space="preserve">get_papers <- ratelimi ... ## [18] <code_block xml:space="preserve">## [1] 522\n</code_block> ## [19] <code_block info="r" xml:space="preserve">all_papers <- unique(a ... ## [20] <code_block xml:space="preserve">## [1] 378\n</code_block> ## ... # Back to Markdown tinkr::to_md(yaml_xml_list, "newmd.md")
Use tinkr
to wrangle chunks
Because to_xml
parses chunk options into node attributes, one can use
tinkr
to programmatically change options. In the example below, we set
the “echo” option to FALSE in all chunks (in real life we might prefer
deleting all echo options in chunks, and set it to FALSE in the setup
chunk).
# from Rmd to XML path <- system.file("extdata", "example2.Rmd", package = "tinkr") yaml_xml_list <- tinkr::to_xml(path) library("magrittr") # identify code blocks body <- yaml_xml_list$body blocks <- body %>% xml2::xml_find_all(xpath = './/d1:code_block', xml2::xml_ns(.)) # Change echo attribute xml2::xml_set_attr(blocks, "echo", "FALSE") blocks ## {xml_nodeset (4)} ## [1] <code_block xml:space="preserve" language="r" name="setup" include=" ... ## [2] <code_block xml:space="preserve" language="r" name="" eval="TRUE" ec ... ## [3] <code_block xml:space="preserve" language="python" name="" fig.cap=" ... ## [4] <code_block xml:space="preserve" language="python" name="" echo="FAL ...
Note that logical chunk options are handled as character. Actual string
chunk options such as fig.cap
keep their quotes. This is all to make
sure the Rmd we write back has valid options.
# Show fig.cap xml2::xml_attr(blocks, "fig.cap") ## [1] NA NA "\"pretty plot\"" NA # save back to Rmd yaml_xml_list$body <- body tinkr::to_md(yaml_xml_list, "newmd.Rmd")
Bonus: use tinkr::to_xml
to analyse chunk options
Even only tinkr::to_xml
on its own can be powerful, thanks to its
parsing chunk options to node attributes. In the example below, we find
all code chunks of the “R for data science” book by Garrett Grolemund
and Hadley Wickham that have the
error=TRUE
chunk option. It’s the option you can set to show wrong
code in your R Markdown document. Let’s find the wrong examples of that
book!
Note that the path below is where the clone of the r4ds repo lives on my computer.
book_path <- "C:\\Users\\Maelle\\Documents\\ropensci\\r4ds" rmds <- fs::dir_ls(book_path, regexp = "*.Rmd")
There are 32 R Markdown documents.
library("magrittr") get_chunks <- function(xml){ xml %>% xml2::xml_find_all(xpath = './/d1:code_block', xml2::xml_ns(.)) } get_error_chunks_code <- function(xml){ xml %>% get_chunks()%>% .[xml2::xml_has_attr(., "error")] %>% # note that logicals are character .[xml2::xml_attr(., "error") == "TRUE"] %>% xml2::xml_text() %>% as.character() } purrr::map(rmds, tinkr::to_xml) %>% purrr::map("body") -> bodies bodies %>% purrr::map(get_chunks) -> all_chunks length(unlist(all_chunks)) ## [1] 1776 bodies %>% purrr::map(get_error_chunks_code) %>% unlist() -> buggy_code
There are 11 chunks to show errors out of 1776 chunks. Few enough to
show all of them (using the results="asis"
chunk option!)…
glue::glue("```r\n {unname(buggy_code)} \n ```") if (c(TRUE, FALSE)) {} if (NA) {} wt_mean <- function(x, w, na.rm = FALSE) { stopifnot(is.logical(na.rm), length(na.rm) == 1) stopifnot(length(x) == length(w)) if (na.rm) { miss <- is.na(x) | is.na(w) x <- x[!miss] w <- w[!miss] } sum(w * x) / sum(w) } wt_mean(1:6, 6:1, na.rm = "foo") issues %>% map_chr(c("pull_request", "html_url")) tibble(x = "e") %>% add_predictions(mod2) mtcars %>% group_by(cyl) %>% summarise(q = quantile(mpg)) # Ok, because y and z have the same number of elements in # every row df1 <- tribble( ~x, ~y, ~z, 1, c("a", "b"), 1:2, 2, "c", 3 ) df1 df1 %>% unnest(y, z) # Doesn't work because y and z have different number of elements df2 <- tribble( ~x, ~y, ~z, 1, "a", 1:2, 2, c("b", "c"), 3 ) df2 df2 %>% unnest(y, z) tryCatch(stop("!"), error = function(e) "An error") stop("!") %>% tryCatch(error = function(e) "An error") filter(flights, month = 1) tibble(x = 1:4, y = 1:2) tibble(x = 1:4, y = rep(1:2, 2)) tibble(x = 1:4, y = rep(1:2, each = 2)) x[c(1, -1)] my_variable <- 10 my_variable
The book explains in more detail why these
chunks create errors… or you can use tinkr
to extract more information
out of the R Markdown source!
If you’re not interested into technical details, hop over the next section and read the conclusion directly!
tinkr
under the hood
This section features a bit more technical details about to_xml
and
to_md
, for those interested in such stuff, out of curiosity (cool!) or
out of a desire to become a contributor (yay!).
From Markdown to XML: to_xml
The to_xml
function uses internal code from blogdown
to split lines
of the md between YAML header and Markdown body. The Markdown body is
further processed using commonmark::markdown_xml
with
extensions=TRUE
and a homegrown code to transform code chunks info
from string “```{r setup, include=FALSE, eval = TRUE}” to XML node
attributes (called “language”, “name”, “include”, “eval” for this
example). This allows easier editing of code chunks.
Side-note: This piece of tinkr
worked quite well right away but we
noticed that the XML conversion lost the alignment attributes of table
cells, so I opened an issue in the repo of the github fork of
cmark
. rOpenSci post-doc
hacker Jeroen Ooms helped me find where to
open this issue: cmark
is the C library wrapped by the R package
commonmark
. There’s the parent repo commonmark/cmark, but the one
wrapped by commonmark
is the github/cmark fork because it’s the one
supporting GitHub-flavored Markdown extensions such as tables and
strike-through text. A new version of the github/cmark library was
released
by Ashe Connor and
commonmark
was updated by Jeroen to use this newest version.
to_xml
returns a list containing the YAML header as a character
vector, and the body as XML. YAML metadata could be edited using the
yaml
package, which is not the
goal of tinkr
.
From XML to Markdown: to_md
After editing the XML body, one needs to convert the list containing the
YAML and the body back to Markdown. to_md
does so by:
packing the chunk options back from node attributes to a string.
using the rOpenSci package
xslt
(bindings to libxslt) for conversion of the body to Markdown, using an XSLT stylesheet as Rosetta stone. In Jeroen’s words, “XSLT is just a mini programming language to run loops over an XML, that is written itself in XML”. Things such as<heading level="2">\n <text>R Markdown</text>\n</heading>
thus becomes## R Markdown
using
writeLines
on the YAML character vectors and on the body.
Another side-note: The most crucial part here was obtaining an XSLT
stylesheet. Jeroen opened an issue in the commonmark/cmark
repo and Nick
Wellnhofer “whipped up” an XSLT
stylesheet. If you followed correctly, the commonmark/cmark library
doesn’t support GitHub-flavoured Markdown extensions, so Nick Wellnhofer
understandably didn’t include templates for those in his stylesheet. I
miraculously was able to write a bit of XSLT myself, and submitted a
stylesheet to the github/cmark
repo. It imports Nick
Wellnhofer’s stylesheet, and defines templates for tables and
strike-through text. Copies of these two stylesheets live in tinkr
and
are therefore available after installing it.
to_md
exposes the path to the stylesheet as argument, so you could
provide your own in order to preserve your style preferences
(e.g. writing lists with “*” rather than “-”)
Future plans
The goal of tinkr
was to experiment with the idea of Commonmark/XML
conversion, and then to go from there. Now that to_xml
and to_md
allow the programmatic XML editing of Markdown files, albeit slightly
lossy, maybe tinkr
could keep growing! Here are two main possible
discussion topics with you, dear reader:
Are you an XSL wizard? If so, would you like to help improve the XSLT stylesheet, in particular to prettify tables? Head to the GitHub repository, in particular to this issue, and volunteer!
tinkr
was called after the made-up but accepted word “tink” that means “unknit”.tinkr
is currently more about tinkering with than tinking R Markdown files, but why not unearth the common discussion of how to collaborate with people who’d rather edit a Google doc/docx than R Markdon source? Could one convert to and from Open Office XML and help track changes for instance?
And of course, to finish on a more down-to-earth note, if you do edit
Markdown files with tinkr
, please report your experience and what
could be improved, in the comments below or the GitHub
repo!
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.