RStudio Pandoc – HTML To Markdown

December 14, 2018
By

[This article was first published on R on YIHAN WU, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The knitr and rmarkdown packages are used in conjunction with pandoc to convert R code and figures to a variety of formats including PDF, and word. Here, I’m exploring how to convert HTML back to markdown format. This post came about when I was searching how to convert XML to markdown, which I still haven’t found an easy way to do. Pandoc is not the only way to convert HTML to markdown (see turndown, html2text)

Pandoc is packaged within RStudio and on Windows, the executables are located within Program Files/RStudio/bin/pandoc. The rmarkdown package contains wrapper functions for using pandoc within RStudio.

Here, I am trying to convert this example HTML page back to markdown using the function pandoc_convert. First, pandoc_convert requires an actual file which means it does not accept a quoted string of HTML code in its input argument.

The example html:




Enter a title, displayed at the top of the window.



Enter the main heading, usually the same as the title.

Be bold in stating your key points. Put them in a list:

  • The first item in your list
  • The second item; italicize key words

Improve your image by including an image.

A Great HTML Resource

Add a link to your favorite Web site. Break up your page with a horizontal rule or two.


Finally, link to another page in your own Web site.

© Wiley Publishing, 2011

I saved the HTML example here as example.html.

html_page <- readLines("../../static/files/example.html")

We can print the object in R.

cat(html_page)
##    Enter a title, displayed at the top of the window.    

Enter the main heading, usually the same as the title.

Be bold in stating your key points. Put them in a list:

  • The first item in your list
  • The second item; italicize key words

Improve your image by including an image.

A Great HTML Resource

Add a link to your favorite Web site. Break up your page with a horizontal rule or two.


Finally, link to another page in your own Web site.

© Wiley Publishing, 2011

pandoc can convert between many different formats, and for markdown, it has multiple variants including the github flavored variant (for Github), and php markdown extra (the variant used by WordPress sites).

The safest variant to pick is markdown_strict which is the original markdown variant.

Pandoc requires the file path which in my case, is located in a different directory rather than my working directory.

library(rmarkdown)
file_path <- "../../static/files/example.html"
pandoc_convert(file_path, to = "markdown_strict")
Enter the main heading, usually the same as the title.
======================================================

Be **bold** in stating your key points. Put them in a list:

-   The first item in your list
-   The second item; *italicize* key words

Improve your image by including an image.

![A Great HTML Resource](http://www.mygifs.com/CoverImage.gif)

Add a link to your favorite [Web site](https://www.dummies.com/). Break
up your page with a horizontal rule or two.

------------------------------------------------------------------------

Finally, link to [another page](page2.html) in your own Web site.

© Wiley Publishing, 2011

Notice that heading 1 is formatted with ==== rather than the # that RMarkdown seems to favor. We can require pandoc to use the # during the conversion by adding an argument.

pandoc_convert(file_path, to = "markdown_strict", options = c("--atx-headers"))
# Enter the main heading, usually the same as the title.

Be **bold** in stating your key points. Put them in a list:

-   The first item in your list
-   The second item; *italicize* key words

Improve your image by including an image.

![A Great HTML Resource](http://www.mygifs.com/CoverImage.gif)

Add a link to your favorite [Web site](https://www.dummies.com/). Break
up your page with a horizontal rule or two.

------------------------------------------------------------------------

Finally, link to [another page](page2.html) in your own Web site.

© Wiley Publishing, 2011

Right now, the output is being piped to the console. A file can be created instead with:

pandoc_convert(file_path, to = "markdown_strict", output = "example.md")

Pandoc has a multitude of styling extensions for markdown variants, all listed on the manual page.

Pandoc ignores everything enclosed in . When converting from markdown to HTML, these comments are usually directly placed as is in the HTML document but the opposite does not seem to be true.

Lastly, this was tested using pandoc version 1.19.2.1. Pandoc 2.5 was released last month.

To leave a comment for the author, please follow the link and comment on their blog: R on YIHAN WU.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)