Fixing APA citations from Pandoc with stringr

This article was first published on Higher Order Functions and kindly contributed to R-bloggers.

Pandoc is awesome. It’s the universal translator for plain-text
documents. I especially like that it can do inline citations. I write
@Jones2005 proved aliens exist and pandoc produces “Jones (2005) proved
aliens exist”.

But it doesn’t quite do APA style citations correctly. A citation
like @SimpsonFlanders2006 found... renders as “Simpson & Flanders (2006)
found…”. Inline citations are not supposed to have an ampersand. It should be
“Simpson and Flanders (2006) found…”.

In the grand scheme of writing and revising, these errors are tedious low-level
stuff. But I have colleagues who will read a draft of a manuscript and
write unnecessary comments about how to cite stuff in APA. And the problem is
just subtle and pervasive enough that it doesn’t make sense to manually fix
the citations each time I generate my manuscript. My current project has 15 of
these ill-formatted citations. That number is just big enough to make manual
corrections an error-prone process: it is easy to miss 1 in 15.

Find and replace

I wrote a quick R function that replaces all those inlined ampersands with “and”.

library("stringr")

fix_inline_citations <- function(text) {

Let’s assume that an inline citation ends with an author’s last name followed
by a parenthesized year: SomeKindOfName (2001). We encode these assumptions
into regular expression patterns, prefixed with re_.

The year is pretty easy. If it looks weird, it’s because I prefer to escape
special punctuation like ( using brackets like [(]. Otherwise, a year is
just four digits: \\d{4}.

  re_inline_year <- "[(]\\d{4}[)]"

What’s in a name? Here we have to stick our necks out a little bit more about
our assumptions. I’m going to assume a last name is any combination of letters,
hyphens and spaces (spaces needed for von Name).

  re_author <- "[[:alpha:]- ]+"
  re_author_year <- paste(re_author, re_inline_year)
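As a quick sanity check outside the function, the sketch below rebuilds the two sub-patterns and tries them on a few made-up strings (the test strings are mine, not from the manuscript). Note that paste() joins its arguments with a single space by default, which is what puts the space between the name and the parenthesized year.

```r
library("stringr")

re_inline_year <- "[(]\\d{4}[)]"
re_author <- "[[:alpha:]- ]+"
# paste() inserts a single space between its arguments by default
re_author_year <- paste(re_author, re_inline_year)
re_author_year
#> [1] "[[:alpha:]- ]+ [(]\\d{4}[)]"

# Made-up strings: only a parenthesized four-digit year should match
year_hits <- str_detect(c("(2005)", "2005", "(05)"), re_inline_year)
year_hits
#> [1]  TRUE FALSE FALSE

# Made-up strings: a name followed by a space and a parenthesized year
name_hits <- str_detect(
  c("Name (2005)", "Hyphen-Name (2005)", "Name, 2005"),
  re_author_year)
name_hits
#> [1]  TRUE  TRUE FALSE
```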

We define the ampersand.

  re_ampersand <- " & "

Lookaround, lookaround. Our last regular expression trick is positive lookahead.
Suppose we want just the word “hot” from the larger word “hotdog”.
Using just hot would match too many things, like the “hot” in “hoth”. Using
hotdog would match the whole word “hotdog”, which is more than we asked for.
Lookaround patterns allow us to impose more constraints on a pattern.
In the “hotdog” example, positive lookahead hot(?=dog) says find “hot” if it
precedes “dog”.
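The “hotdog” example can be tried directly with stringr. This is only a small illustration of lookahead on made-up words, separate from the citation function itself.

```r
library("stringr")

words <- c("hotdog", "hoth", "hot")

# Plain "hot" matches in all three strings
plain_hits <- str_detect(words, "hot")
plain_hits
#> [1] TRUE TRUE TRUE

# With positive lookahead, "hot" only matches when "dog" follows
lookahead_hits <- str_detect(words, "hot(?=dog)")
lookahead_hits
#> [1]  TRUE FALSE FALSE

# The lookahead text is checked but is not part of the match
matched <- str_extract("hotdog", "hot(?=dog)")
matched
#> [1] "hot"
```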

We use positive lookahead to find only the ampersands followed by an author name
and a year. We replace the strings that match this pattern with and’s.

  re_ampersand_author_year <- sprintf("%s(?=%s)", re_ampersand, re_author_year)
  str_replace_all(text, re_ampersand_author_year, " and ")
}

We can now test our function on a variety of names that it should and should not fix.

do_fix <- c(
  "Jones & Name (2005) found...",
  "Jones & Hyphen-Name (2005) found...",
  "Jones & Space Name (2005) found...",
  "Marge, Maggie, & Lisa (2005) found...")

fix_inline_citations(do_fix)
#> [1] "Jones and Name (2005) found..."
#> [2] "Jones and Hyphen-Name (2005) found..."
#> [3] "Jones and Space Name (2005) found..."
#> [4] "Marge, Maggie, and Lisa (2005) found..."

do_not_fix <- c(
  "...have been found (Jones & Name, 2005)",
  "...have been found (Jones & Hyphen-Name, 2005)",
  "...have been found (Jones & Space Name, 2005)",
  "...have been found (Marge, Maggie, & Lisa, 2005)")

fix_inline_citations(do_not_fix)
#> [1] "...have been found (Jones & Name, 2005)"
#> [2] "...have been found (Jones & Hyphen-Name, 2005)"
#> [3] "...have been found (Jones & Space Name, 2005)"
#> [4] "...have been found (Marge, Maggie, & Lisa, 2005)"

By the way, our final regular expression re_ampersand_author_year is
& (?=[[:alpha:]- ]+ [(]\d{4}[)]). It’s not very readable or comprehensible in
that form, so that’s why we built it up step by step from easier sub-patterns
like re_author and re_inline_year. (Which is a micro-example of the strategy
of managing complexity by combining/composing simpler primitives.)
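To see the composed pattern for yourself, the pieces can be rebuilt outside the function and printed. This sketch only repeats the assignments from the function body; writeLines() is used so the pattern appears as the regex engine sees it, with a single backslash.

```r
re_inline_year <- "[(]\\d{4}[)]"
re_author <- "[[:alpha:]- ]+"
re_author_year <- paste(re_author, re_inline_year)
re_ampersand <- " & "
re_ampersand_author_year <- sprintf("%s(?=%s)", re_ampersand, re_author_year)

# Note the leading space: the pattern starts with " & "
writeLines(re_ampersand_author_year)
#>  & (?=[[:alpha:]- ]+ [(]\d{4}[)])
```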

Steps towards production

These are complications that arose as I tried to use the function on my actual manuscript.

Placing it in a build pipeline. My text starts with an RMarkdown file
that is knitted into a markdown file and rendered into other formats by
pandoc. Because this function post-processes output from pandoc, I can’t
just hit the “Knit” button in RStudio. I had to make a separate script to
do rmarkdown::render to convert my .Rmd file into a .md file which can then be
processed by this function.

Don’t fix too much. When pandoc does your references for you, it also does
a bibliography section. But it would be wrong to fix the ampersands there. So
I have to do a bit of fussing around by finding the line "## References" and
processing just the text up until that line.
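One way to sketch that fussing around is below. The helper name fix_before_references and the sample lines are hypothetical, and the sketch assumes the heading appears on a line by itself exactly as "## References". The function from earlier is repeated so the block runs on its own.

```r
library("stringr")

fix_inline_citations <- function(text) {
  re_author_year <- paste("[[:alpha:]- ]+", "[(]\\d{4}[)]")
  re_ampersand_author_year <- sprintf("%s(?=%s)", " & ", re_author_year)
  str_replace_all(text, re_ampersand_author_year, " and ")
}

# Hypothetical helper: apply the fix only to lines above the heading
fix_before_references <- function(lines, heading = "## References") {
  ref_line <- which(lines == heading)
  # No bibliography heading found, so treat every line as body text
  if (length(ref_line) == 0) {
    return(fix_inline_citations(lines))
  }
  body <- seq_len(ref_line[1] - 1)
  lines[body] <- fix_inline_citations(lines[body])
  lines
}

# Made-up document: the last line stands in for a bibliography entry
doc <- c(
  "Jones & Name (2005) found...",
  "",
  "## References",
  "",
  "Jones & Name (2005). A made-up entry.")

fixed_doc <- fix_before_references(doc)
```

Only the first line of fixed_doc changes; everything from the heading onward is returned as it came in.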

Accounting for encoding. I use readr::read_lines and
stringi::stri_write_lines to read and write the text file to preserve the
encoding of characters. (readr just released its own write_lines today
actually, so I can’t vouch for it yet.)
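A minimal sketch of the read, fix, and write step might look like the following. The wrapper name fix_markdown_file and its path arguments are placeholders, and the .md file is assumed to already exist from a separate rmarkdown::render step. The function from earlier is repeated so the block runs on its own.

```r
library("stringr")

fix_inline_citations <- function(text) {
  re_author_year <- paste("[[:alpha:]- ]+", "[(]\\d{4}[)]")
  re_ampersand_author_year <- sprintf("%s(?=%s)", " & ", re_author_year)
  str_replace_all(text, re_ampersand_author_year, " and ")
}

# Hypothetical wrapper: read a markdown file, fix it, write the result
fix_markdown_file <- function(path_in, path_out) {
  lines <- readr::read_lines(path_in)          # one string per line
  fixed <- fix_inline_citations(lines)
  stringi::stri_write_lines(fixed, path_out)   # writes UTF-8 by default
  invisible(path_out)
}
```

Writing to a separate output path, rather than overwriting the input, keeps pandoc's original output around for comparison.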

False matches are still possible. Suppose I’m citing a publication by an
organization, like Johnson & Johnson, where that ampersand is part of the name.
That citation would wrongly be corrected. I have yet to face
that issue in practice though.
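The false match is easy to reproduce. The sketch below also shows one possible workaround, which is my own assumption and not something I use in practice: temporarily hide the ampersand in a list of known names, run the fix, then restore it. The helper name fix_except and the "<<AMP>>" placeholder are hypothetical.

```r
library("stringr")

fix_inline_citations <- function(text) {
  re_author_year <- paste("[[:alpha:]- ]+", "[(]\\d{4}[)]")
  re_ampersand_author_year <- sprintf("%s(?=%s)", " & ", re_author_year)
  str_replace_all(text, re_ampersand_author_year, " and ")
}

# The organization's own ampersand is wrongly replaced
wrong <- fix_inline_citations("Johnson & Johnson (2005) reported...")
wrong
#> [1] "Johnson and Johnson (2005) reported..."

# Hypothetical workaround: protect known names with a placeholder
fix_except <- function(text, protected) {
  hidden <- str_replace_all(protected, fixed(" & "), " <<AMP>> ")
  for (i in seq_along(protected)) {
    text <- str_replace_all(text, fixed(protected[i]), hidden[i])
  }
  text <- fix_inline_citations(text)
  for (i in seq_along(protected)) {
    text <- str_replace_all(text, fixed(hidden[i]), protected[i])
  }
  text
}

kept <- fix_except(
  "Johnson & Johnson (2005) and Jones & Name (2005) agree...",
  protected = "Johnson & Johnson")
kept
#> [1] "Johnson & Johnson (2005) and Jones and Name (2005) agree..."
```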
