Plain Text, Papers, Pandoc

[This article was first published on Category: R | Kieran Healy, and kindly contributed to R-bloggers.]

Over the past few months, I’ve had several people ask me about the tools I use to put papers together. For several years I’ve maintained a page of resources somewhat grandiosely headed “Writing and Presenting Social Science”. Really it just makes public my configuration files and templates for my text editor and related tools. Things have changed a little recently (which led to people asking the questions), so I will try to lay out the current setup here. I will also try to avoid veering off into generalized noodling about the nature of writing or creativity. (That’s fine for Merlin.) This is mostly because although I am not a bad writer, I am an excellent procrastinator, and it is quite frankly embarrassing to write about how to write papers when you could be actually writing papers. My excuse today is that I have a head cold.

So, first I will say a little bit about the general problem, and then I will tell you something quite specific: how I take the draft of a scholarly paper, typically including bibliographical references, figures, and the results of some data analysis, and turn it into nice-looking PDF and HTML output. The hopefully redeeming thing about this discussion is that it will help you use the various resources I make available for doing this. If you want to copy what I do, you should be able to. But I am not saying you ought to. Nor am I making any claim that what I do is right, rational, efficient, productive, or psychologically healthy. As in an earlier discussion of mine on this topic, the chief counterexample to taking anything here as advice about writing or productivity is my wife, who—as I type this—is seated opposite me at the dining room table, putting the final touches to a book written in Microsoft Word. I think MS Word is unpleasant to use for all kinds of reasons, and perhaps you agree. On the other hand, she just wrote a book using it that will be published later this year by Oxford University Press. On this side of the table, meanwhile, is this blog post.

What’s the problem?

The problem is that the business of doing scholarly work is intrinsically a mess. There’s the annoying business of getting ideas and writing them down, of course, but also everything before, during, and around it: data analysis and all that comes with it, and the tedious but unavoidable machinery of scholarly papers, especially citations and references. There is a lot to keep track of, a lot to get right, and a lot to draw together at the time of writing. Academic papers are by no means the only form of writing subject to these sorts of constraints. Consider this extremely sensible discussion by Dr Drang, a consulting engineer and blogger you should be reading:

I don’t write fiction, but I can imagine that a lot of fiction writing can be done without any reference materials whatsoever. Similarly, a lot of editorials and opinion pieces are remarkably fact-free; these also can spring directly from the writer’s head. But the type of writing I typically do—mostly for work, but also here—is loaded with facts. I am constantly referring to photographs, drawings, experimental test results, calculations, reports written by others, textbooks, journal articles, and so on. These are not distractions; they are essential to the writing process.

And it’s not just reference material. Quite often I need to make my own graphs and drawings to include in a report. Because the text and the graphics are all part of a coherent whole, I need to go back and forth between the two; the words inform the pictures and the pictures inform the words. This is not the Platonic ideal of a clean writing environment—a cup of coffee on an empty desk in a white room—that you see in videos for distraction-free editors.

Some of the popularity of these editors is part of the backlash against multitasking, but people are confusing themselves with their computers. When I’m writing a report, that is my single task, and I bring to bear whatever tools are necessary to complete it. That my computer is multitasking by running many programs simultaneously isn’t a source of confusion or distraction, it’s the natural and efficient way for me to get my one task done.

A lot of academic writing is just like this. It is difficult to manage. It’s even worse when you have collaborators and other contributors. So, what to do?

The Office Model and the Engineering Model

Let me make a crude distinction. There are “Office Type” solutions to this problem, and there are “Engineering Type” solutions. Please don’t get hung up on the distinction or the labels. Office solutions tend towards a cluster of tools where something like Microsoft Word is at the center of your work. A Word file or set of files is the most “real” thing in your project. Changes to your work are tracked inside that file or files. Citation and reference managers plug into them. The outputs of data analyses—tables, figures—get dropped into them or kept alongside them. The master document may be passed around from person to person or edited and updated in turn. The final output is exported from it, perhaps to PDF or to HTML, but maybe most often the final output just is the .docx file, cleaned up and with the track changes feature turned off.

In the Engineering model, meanwhile, plain text files are at the center of your work. The most “real” thing in your project will either be those files or, more likely, the Git, Mercurial, or SVN repository that controls the project. Changes are tracked outside the files. Data analysis is managed in code that produces outputs in (ideally) a known and reproducible manner. Citation and reference management will likely also be done in plain text, as with a BibTeX .bib file. Final outputs are assembled from the plain text and turned to .tex, .html, or .pdf using some kind of typesetting or conversion tool. Very often, because of some unavoidable facts about the world, the final output of this kind of solution is also a .docx file.

This distinction is meant to capture a tendency in organization, not a rigid divide (and still less a sort of personality). Applications like Scrivener, for example, combine elements of the two models. Scrivener embraces the “bittyness” of large writing projects in an effective way, and can spit out clean copy in a variety of formats. Scrivener is built for people writing lengthy fiction (or qualitative non-fiction) rather than anything with data analysis, so I have never used it extensively—though I bet I could make a go of it if I tried. Microsoft Word, meanwhile, still rules large swathes of the Humanities and the Social Sciences. The two most recent papers I had a hand in were both co-authored. The first was written mostly in plain text. My co-author was far away in either France or California for most of the process, and so we worked in Editorially, a very nice service that allows people to humanely collaborate on documents written in Markdown format. The second paper was written with a colleague whose office is upstairs from mine. It was a Word file from beginning to end, because that was just easier to manage given how my coauthor organizes his work.

When I write things by myself, or co-author with someone I can imperiously boss around, I write everything in plain text. In the past—e.g. for my dissertation, and my book—I wrote everything in LaTeX, which led to the early resources posted on my page—some custom LaTeX templates and style files meant to produce good-looking PDF files. These days I try to write in Markdown, because in principle it is simpler and more easily convertible to many different formats. Which brings me, finally, to the nominally useful part of this post.

What I want to do

I write sociology papers. Those papers cite books and articles. They often incorporate tables and figures created in R. What I want to do is quickly turn a markdown file containing things like that into a properly formatted scholarly paper, without giving up any of the typographical quality or necessary scholarly apparatus (on the output side) or the convenience and convertibility of markdown (on the input side). Most directly, I want to easily produce good-looking output from the same source in both HTML and PDF formats. And I want to do that with an absolute minimum of (ideally, no) post-processing of the output beyond that basic conversion step.

For transforming a mixture of R code and text into a processable markdown file, everyone’s tool of choice is Yihui Xie’s knitr. For converting markdown to HTML and PDF, the best thing available is John MacFarlane’s superb Pandoc. Pandoc can convert plain text in several markup formats into many output formats. Now, managing citations, especially, has long been the Achilles heel of plain-text workflows. It was one of the few places where LaTeX really worked much better. Markdown is not designed for academic papers or scholarly books. This fact kept me from being able to use it as extensively as I’d like. But thanks to John’s (and other contributors’) continuing and stellar work on pandoc, the balance has really begun to shift.

There are still limitations to what markdown and pandoc can conveniently do. But being able to produce good HTML, LaTeX, and PDF in one step from the same source is a very attractive prospect. In the next section I describe how I have pandoc set up to do this smoothly, citations and other material included. I’ll also describe how I have R and the knitr library set up to produce markdown files from .Rmd sources, and provide links to some templates and configuration files that make this possible. Describing this all at once will probably make it sound a little crazy, but if you are like me you will be at the point where you have most or all of these tools installed anyway, and you are using them separately for different things. The thing is just to connect them a little.

I assume you have Apple’s developer tools (Xcode or just the command-line tools) installed, along with R, knitr, pandoc, and a TeX distribution. Here is the document flow we want:

I promise this is less insane than it appears.
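Before wiring things together, it can help to confirm that the command-line pieces are actually on your PATH. The following is only a sketch: the list of commands (R, pandoc, pdflatex, make) is my guess at what the setup above implies, the function name is made up, and it does not check for R packages such as knitr.

```shell
#!/bin/sh
# check_tools NAME...: report which commands are on the PATH.
# Returns the number of missing commands (0 means everything was found).
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      printf 'found    %s\n' "$tool"
    else
      printf 'MISSING  %s\n' "$tool"
      missing=$((missing + 1))
    fi
  done
  return "$missing"
}

# The tool list is an assumption; adjust it to match your own setup.
check_tools R pandoc pdflatex make ||
  echo "Install whatever is marked MISSING before continuing."
```

If you use XeLaTeX rather than pdfLaTeX, swap xelatex into the list.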

How I almost do it

I write everything in Emacs, but that doesn’t matter. Use whatever text editor you like and just learn the hell out of it. However, here is my Emacs Starter Kit for the Social Sciences. It sets up Emacs to be aware of the tools discussed here. For present purposes, one of its nice features is that it turns on RefTeX mode for markdown files, and lets you easily cite items from your .bib file in the format pandoc expects.
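For readers not using Emacs: the format pandoc expects is a bracketed key, as in [@healy02digittechnculturgoods] from the sample file later in this post. The sketch below is one rough way to spot keys cited in a draft that have no entry in the .bib file. It is a plain text match, not a BibTeX parser, it recognizes only a subset of the characters pandoc allows in keys, and the function names, file names, and the placeholder bibliography entry are all invented for the demonstration.

```shell
#!/bin/sh
# cited_keys FILE: print the unique citation keys used in a markdown file.
# An "@" that follows a letter or digit (as in an email address) is ignored.
cited_keys() {
  grep -oE '(^|[^[:alnum:]])@[[:alnum:]_][[:alnum:]_:.-]*' "$1" |
    sed -e 's/^[^@]*@//' -e 's/[:.-]*$//' |
    sort -u
}

# missing_keys FILE BIB: print cited keys with no "{key," entry in BIB.
missing_keys() {
  cited_keys "$1" | while IFS= read -r key; do
    grep -qF "{$key," "$2" || printf '%s\n' "$key"
  done
}

# Demonstration with throwaway files (the .bib entry is a placeholder).
tmp=$(mktemp -d)
printf '%s\n' 'A claim [@fourcade13classsituat]. Another [@healy02digittechnculturgoods].' > "$tmp/draft.md"
printf '%s\n' '@article{healy02digittechnculturgoods,' '  title = {Placeholder}' '}' > "$tmp/refs.bib"
missing_keys "$tmp/draft.md" "$tmp/refs.bib"   # prints: fourcade13classsituat
rm -r "$tmp"
```

Anything this prints is a key that pandoc will probably fail to resolve when it builds the reference list.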

First there’s the custom LaTeX stuff. I have a GitHub repository of various LaTeX style files that can be used to write nice-looking LaTeX files directly, but which also provide the skeleton for the pandoc conversion process. In particular, the heavy lifting is done by the org-preamble-pdflatex.sty and memoir-article-styles. If you install the custom latex stuff where LaTeX can find it (i.e., you can compile a LaTeX document made from this template), then you are good to go. I originally made these templates when I was writing directly in .tex, but now they just do their work in the background. My BibTeX database is also available, but you will probably want to use your own.

Second, there’s the custom pandoc stuff. Here is the repository for that. Much of the material there is designed to go in the ~/.pandoc/ directory, which is where pandoc expects to find its configuration files.

Let’s take an example

Inside the pandoc-templates repository there’s a folder with some examples. Let’s say you have the software installed and the various pieces are working separately. We can begin just by converting a markdown file with citations only: no R code yet, so nothing above the article.md line in the picture. We are just looking at this piece:

The sample article-markdown.md file looks like this:

    ---
    title: A Pandoc Markdown Article Starter
    author:
    - name: Kieran Healy
      affiliation: Duke University
      email: [email protected]
    - name: Joe Bloggs
      affiliation: University of North Carolina, Chapel Hill
      email: [email protected]
    date: January 2014
    abstract: Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enimad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
    bibliography: <!-- \bibliography{/Users/kjhealy/Documents/bibs/socbib-pandoc.bib} This is a hack for Emacs users so that RefTeX knows where your bibfile is, and you can use RefTeX citation completion in your .md files. -->
    ...

    # Introduction

    Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua [@fourcade13classsituat]. Notice that citation there [@healy02digittechnculturgoods]. Ut enimad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.

    # Theory

    Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enimad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enimad minim veniam, quis nostrud

The bit at the top is YAML metadata, which pandoc understands. The HTML and latex templates in the pandoc-templates repository are set up to use the YAML metadata properly. Pandoc will also take care of the citations directly. The Makefile in the examples directory will convert any markdown files to HTML, .tex, and PDF output. Just type make at the terminal. If things are working properly, the HTML output from the example will look like this:

The PDF output, meanwhile, can be viewed here. Both look quite nice. The relevant sections of the Makefile show the pandoc commands that generate the output files from the markdown input. The Makefile section for producing PDF output looks like this:

    pandoc -r markdown+simple_tables+table_captions+yaml_metadata_block -s -S --latex-engine=pdflatex --template=$(PREFIX)/templates/latex.template --filter pandoc-citeproc --csl=$(PREFIX)/csl/$(CSL).csl --bibliography=$(BIB)

This contains some variables that are set at the top of the Makefile. On my computer, the command as actually executed looks like this:

    pandoc -r markdown+simple_tables+table_captions+yaml_metadata_block -s -S --latex-engine=pdflatex --template=/Users/kjhealy/.pandoc/templates/latex.template --filter pandoc-citeproc --csl=/Users/kjhealy/.pandoc/csl/apsr.csl --bibliography=/Users/kjhealy/Documents/bibs/socbib-pandoc.bib

Your version would vary depending on the location of the templates and bibliography files.
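To make the substitution concrete, here is a small shell sketch that assembles the same command from the three variables. The defaults mirror the paths shown above (with $HOME standing in for my home directory). The input and output file names, the -o flag, the function name, and the DRY_RUN switch are my additions for illustration, since the Makefile excerpt does not show them.

```shell
#!/bin/sh
# Defaults mirror the post; override them in the environment for your machine.
PREFIX="${PREFIX:-$HOME/.pandoc}"
CSL="${CSL:-apsr}"
BIB="${BIB:-$HOME/Documents/bibs/socbib-pandoc.bib}"

# make_pdf INPUT.md OUTPUT.pdf
# With DRY_RUN set, the command is printed instead of executed.
make_pdf() {
  ${DRY_RUN:+echo} pandoc \
    -r markdown+simple_tables+table_captions+yaml_metadata_block \
    -s -S --latex-engine=pdflatex \
    "--template=$PREFIX/templates/latex.template" \
    --filter pandoc-citeproc \
    "--csl=$PREFIX/csl/$CSL.csl" \
    "--bibliography=$BIB" \
    -o "$2" "$1"
}

# Dry run; drop DRY_RUN=1 to run pandoc for real.
DRY_RUN=1 make_pdf article-markdown.md article-markdown.pdf
```

A caution: these are the flags from the pandoc version used in this post. In pandoc 2.0 and later, -S was replaced by the +smart extension and --latex-engine was renamed --pdf-engine, and since pandoc 2.11 the built-in --citeproc option takes the place of the pandoc-citeproc filter.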

The pandoc latex.template and xelatex.template files differ mainly in the way they set up typefaces. The beginning of the latex.template file has the following lines:

    \documentclass[11pt,article,oneside]{memoir}
    \usepackage[minion]{org-preamble-pdflatex}
    \input{vc}

If you do not have the Minion Pro fonts installed and available to LaTeX, remove the [minion] option from the \usepackage line. If you do not use vc.sty, then comment out or delete the third line. Similarly, in xelatex.template change the font declarations after the \begin{document} line to typefaces you have installed.
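If you would rather not edit latex.template by hand, the two changes just described can be scripted. This sketch uses sed and assumes the template lines match the three shown above exactly; the function name is made up for illustration.

```shell
#!/bin/sh
# strip_optional_deps: read a LaTeX template on stdin, write an edited copy
# to stdout with the [minion] option removed and \input{vc} commented out.
strip_optional_deps() {
  sed -e 's/^\\usepackage\[minion\]{org-preamble-pdflatex}/\\usepackage{org-preamble-pdflatex}/' \
      -e 's/^\\input{vc}/% &/'
}

# Demonstration on the three lines quoted in the post.
printf '%s\n' \
  '\documentclass[11pt,article,oneside]{memoir}' \
  '\usepackage[minion]{org-preamble-pdflatex}' \
  '\input{vc}' | strip_optional_deps
```

On a real template you would redirect a file through it, for example `strip_optional_deps < latex.template > latex-plain.template`, where the second file name is whatever you like, and then point pandoc's --template option at the new copy.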

The examples directory also includes a sample .Rmd file. The code chunks in the file provide examples of how to generate tables and figures in the document, and some useful options that can be passed to knitr. Consult the knitr project page for extensive documentation and many more examples. To produce output from the article-knitr.Rmd file, launch R in the working directory, load knitr, and process the file. You will also need the ascii and memisc libraries to be available.

    library(knitr)
    knit("article-knitr.Rmd")

If things are working properly, then a markdown file called article-knitr.md will be produced. Because of the way some options are set in the .Rmd file, knitr produces both PNG and PDF versions of whatever figures are generated by R. That prepares the way for easy conversion to HTML and LaTeX. Once the article-knitr.md file is produced, HTML, .tex, and PDF versions of it can be produced as before, by typing make at the command line. You can also run the pandoc commands manually, of course, or run pandoc from inside R via knitr’s pandoc helper function, or set your editor up to run make for you as needed, if it can do that.
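The same two steps can also be run without opening an interactive R session. This is a sketch rather than part of the examples directory: it assumes Rscript and make are on your PATH and that you run it from the directory containing article-knitr.Rmd and the Makefile. The function name and the DRY_RUN switch are my own inventions, there so you can see the commands before running them.

```shell
#!/bin/sh
# knit_and_make FILE.Rmd: knit the file to markdown, then let make build
# the HTML, .tex, and PDF outputs. File names containing a single quote
# are not handled. With DRY_RUN set, commands are printed, not executed
# (the printed form loses the shell quoting around the R expression).
knit_and_make() {
  ${DRY_RUN:+echo} Rscript -e "library(knitr); knit('$1')" &&
    ${DRY_RUN:+echo} make
}

# Dry run; drop DRY_RUN=1 to run for real.
DRY_RUN=1 knit_and_make article-knitr.Rmd
```

Because the two commands are joined with &&, make only runs if knitting succeeded, so a broken code chunk stops the build before pandoc is called.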

Using Marked

In everyday use, I find Brett Terpstra’s Marked.app to be a very useful way of previewing text while writing. Marked supports pandoc as a custom processor. Essentially, you tell it to run a pandoc command like the one above to generate its previews, instead of its built-in markdown processor. You do this in the “Behavior” tab of Marked’s preferences.

The “Path” box contains the full path to pandoc, and the “Args” box contains all the relevant command switches. In my case, much as above, these are:

    -r markdown+simple_tables+table_captions+yaml_metadata_block -w html -S --template=/Users/kjhealy/.pandoc/templates/html.template --filter pandoc-citeproc --bibliography=/Users/kjhealy/Documents/bibs/socbib-pandoc.bib

When editing your markdown file in your favorite text editor, you point Marked at the file and get a live preview. Like this:

Marked comes with some nice CSS files. You can add the CSS files in the pandoc-templates repo to Marked’s list of CSS files. As with the LaTeX templates, if you do not have the fonts installed, change the relevant lines of the CSS (or don’t use it).

The upshot of all of this is powerful editing using Emacs, ESS, R, and other tools; flexible conversion using pandoc; quick and easy previewing via HTML and Marked; and high-quality PDF typesetting at the same time (or whenever needed). All of it comes from plain text, and it covers almost everything that most of the scholarly papers I write need to include.

Envoi

Writing academic papers is a pain. The tools for processing documents and integrating data, code, text, and reference material are by now extremely powerful. The main stumbling block is figuring out how to join these tools together while preserving the things academic papers need to have included. I am not the sort of person who codes tools like this. Rather, I’m the sort of user who gets a bee in his bonnet about getting the output to look just so. Hence the resources page. Now you, too, dear reader, are empowered to set up your writing environment in an excessively picky fashion, should you irrationally so desire. As I think Andy Warhol remarked, it takes a lot of work to figure out how to look this good.
