Using Regular Expressions in R: Case Study in Cleaning a BibTeX Database

March 9, 2010

(This article was first published on Jeromy Anglim's Blog: Psychology and Statistics, and kindly contributed to R-bloggers)

I recently had to clean up a BibTeX database containing around 1,000 references. One of the clean-up tasks was to ensure that page numbers were separated by en-dashes rather than hyphens. This post sets out how I used regular expressions in R to complete the task and check the results. I also hope to highlight the general power of string manipulation in R.

Problem: There are four main kinds of dash-like characters: the hyphen, the en-dash, the em-dash, and the minus sign. Wikipedia discusses the differences between them. LaTeX produces the first three from one, two, or three hyphens (- for a hyphen, -- for an en-dash, --- for an em-dash) and a minus sign from $-$ in math mode. Ranges of numbers (e.g., pages 96–100) should use an en-dash. However, all of my page numbers used only a single hyphen.
Thus, I wanted to replace "-" with "--", but only in page numbers.
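
To see why the replacement has to be restricted to the pages field, here is a minimal sketch on two made-up lines. The sample text and the names sample_lines and naive are illustrative assumptions, not part of the original database. A blanket replacement doubles every hyphen, including the one in the title:

```r
# Hypothetical sample lines (not taken from the original database)
sample_lines <- c(
  " title = {Self-regulation in skill acquisition},",
  " pages = {90-138},"
)

# Replace every literal hyphen with two hyphens, with no filtering
naive <- gsub("-", "--", sample_lines, fixed = TRUE)

# The pages line is now correct, but the title hyphen was doubled too
print(naive)
```

This is the behaviour to avoid: the title should keep its single hyphen, so the replacement needs to be limited to lines that hold a page range.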

The initial text file looked a little like this, but with around 1,000 more references:
@ARTICLE{Reder1987CP,
author = {Reder, L. M.},
title = {Strategy selection in question answering},
journal = {Cognitive Psychology},
year = {1987},
volume = {19},
pages = {90-138},
endnotereftype = {Journal Article},
shorttitle = {Strategy selection in question answering}
}

@ARTICLE{Reder1982PR,
author = {Reder, L. M.},
title = {Plausability judgments versus fact retrieval: Strategies for sentence
verification},
journal = {Psychological Review},
year = {1982},
volume = {89(3)},
pages = {248-278},
endnotereftype = {Journal Article},
shorttitle = {Plausability judgments versus fact retrieval: Strategies for sentence
verification}
}

I wanted something like this (note the double hyphen in each "pages = {...}" field):
@ARTICLE{Reder1987CP,
author = {Reder, L. M.},
title = {Strategy selection in question answering},
journal = {Cognitive Psychology},
year = {1987},
volume = {19},
pages = {90--138},
endnotereftype = {Journal Article},
shorttitle = {Strategy selection in question answering}
}

@ARTICLE{Reder1982PR,
author = {Reder, L. M.},
title = {Plausability judgments versus fact retrieval: Strategies for sentence
verification},
journal = {Psychological Review},
year = {1982},
volume = {89(3)},
pages = {248--278},
endnotereftype = {Journal Article},
shorttitle = {Plausability judgments versus fact retrieval: Strategies for sentence
verification}
}


Solution: The natural choice was to use regular expressions. Many programming languages (and some text editors) support regular expressions. Because I'm most familiar with R, I tend to use R to process regular expressions. It's probably not the most obvious choice, but it does allow me to get feedback about how the patterns are matched and replaced. And it means that I can leverage my skills in R to use regular expressions. It also means that when I need to use string manipulation for data analysis, I am familiar with the tools.

Overview of regular expressions: For readers unfamiliar with regular expressions, they are an extremely powerful tool for finding and replacing text. Information about support for regular expressions in R can be found by typing ?regex. Additional information about the actual search and replace functions can be found by looking at the help for one of the string manipulation functions such as ?gsub. Data Manipulation with R has a chapter on string manipulation in R that I found helpful. RegularExpression.Info also has a tutorial.
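
For readers who want a feel for these functions before reading the main code, the following is a small sketch of the base R functions mentioned in ?regex and ?gsub. The sample lines and variable names are made up for illustration:

```r
# Hypothetical sample lines for illustration
demo_lines <- c(" year = {1987},", " pages = {90-138},", " volume = {19},")

# A number range: one or more digits, a hyphen, one or more digits
range_pattern <- "[[:digit:]]+-[[:digit:]]+"

# grepl() returns TRUE/FALSE for each element
has_range <- grepl(range_pattern, demo_lines)

# grep() returns the indices of the matching elements
idx <- grep(range_pattern, demo_lines)

# sub() replaces only the first match in each element;
# gsub() replaces every match
first_only <- sub("-", "--", "1-2-3", fixed = TRUE)
all_matches <- gsub("-", "--", "1-2-3", fixed = TRUE)

# regmatches() with regexpr() extracts the matched text itself
matched <- regmatches(demo_lines, regexpr(range_pattern, demo_lines))
```

The distinction between sub() and gsub() matters for the task here: only the first hyphen on a pages line needs replacing, so sub() is sufficient.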


Copy of the R Code

# Read the BibTeX database from the clipboard
# ("clipboard-128" is the Windows form; a file name would also work here).
# The result is a character vector where each line is an element.
x <- readLines("clipboard-128")

# The initial filter reads:
# "^" start of the line
# " pages = " literal text
# "[{]" a literal open brace; placing it in square brackets
# ensures it is not treated as a special character
# "[[:digit:]]" any digit from 0 to 9
# "+" one or more of the preceding item
# (i.e., one or more digits)
# "-" a literal hyphen
# "[[:digit:]]" any digit from 0 to 9
# "+" one or more of the preceding item
# (i.e., one or more digits)
initialFilter <- "^ pages = [{][[:digit:]]+-[[:digit:]]+"

myPattern <- "-"
myReplacement <- "--"
xOutput <- x

# Apply initial filter
xSubset <- grep(initialFilter, x)

# Replace the first hyphen in each line that passed the filter
xOutput[xSubset] <- sub(pattern = myPattern,
                        replacement = myReplacement, x = x[xSubset])

# Basic check that it worked:
# show the original and replaced text side by side
cbind(x[x != xOutput], xOutput[x != xOutput])

# Print the full result
xOutput

# Write the replaced text to a file
writeLines(xOutput, "xOutput.txt")
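
As an aside, the filter and the replacement can probably be combined into a single sub() call using capture groups and backreferences. The following is a sketch of that alternative, not the approach used above. The sample lines and the names x_demo, pagePattern, and x_fixed are hypothetical, and the pattern assumes that any amount of whitespace may precede "pages" and surround the equals sign:

```r
# Hypothetical sample lines covering the three cases of interest
x_demo <- c(
  " title = {Plan-based strategies},",  # hyphen in a title: should be left alone
  " pages = {248-278},",                # single hyphen: should be changed
  " pages = {90--138},"                 # already a double hyphen: should be left alone
)

# Group 1 captures everything up to the first page number;
# group 2 captures the second page number.
pagePattern <- paste0(
  "^([[:space:]]*pages[[:space:]]*=[[:space:]]*[{][[:digit:]]+)",
  "-([[:digit:]]+)"
)

# "\\1" and "\\2" re-insert the captured groups around the double hyphen.
# Lines that do not match the pattern are returned unchanged.
x_fixed <- sub(pagePattern, "\\1--\\2", x_demo)

print(x_fixed)
```

Because non-matching lines pass through untouched, this version could be applied to the whole character vector at once. The two-step version above has the advantage that xSubset records exactly which lines were candidates for replacement.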

Copy of the R Output from the Check:
The following shows the first few lines of the check. The first column shows the original text and second column shows the replaced text:

>   cbind(x[x != xOutput], xOutput[x != xOutput]) 
     [,1]                  [,2]
[1,] " pages = {598-614}," " pages = {598--614},"
[2,] " pages = {883-901}," " pages = {883--901},"
[3,] " pages = {360-364}," " pages = {360--364},"
[4,] " pages = {288-318}," " pages = {288--318},"
[5,] " pages = {3-27},"    " pages = {3--27},"
[6,] " pages = {567-589}," " pages = {567--589},"
[7,] " pages = {259-290}," " pages = {259--290},"
[8,] " pages = {270-304}," " pages = {270--304},"
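
Beyond eyeballing the side-by-side output, a few automated checks may be worth adding. The sketch below uses small made-up vectors (x_before and x_after) standing in for x and xOutput; the specific checks are suggestions rather than part of the original workflow:

```r
# Hypothetical before/after vectors standing in for x and xOutput
x_before <- c(" title = {Self-report measures},", " pages = {3-27},", " year = {1987},")
x_after  <- c(" title = {Self-report measures},", " pages = {3--27},", " year = {1987},")

# Which lines were altered by the replacement?
changed <- x_before != x_after

# Check 1: every altered line should be a pages line
all_changed_are_pages <- all(grepl("^[[:space:]]*pages", x_before[changed]))

# Check 2: no pages line should still hold a single-hyphen number range
single_hyphen <- "^[[:space:]]*pages[[:space:]]*=[[:space:]]*[{][[:digit:]]+-[[:digit:]]+"
remaining <- grep(single_hyphen, x_after, value = TRUE)

# Check 3: the number of lines should be unchanged
same_length <- length(x_before) == length(x_after)

# Stop with an error if any check fails
stopifnot(all_changed_are_pages, length(remaining) == 0, same_length)
```

Inspecting remaining is also a quick way to find pages fields in other formats that a stricter filter would have skipped.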

Main points that I take away from this:

  • R has powerful string manipulation tools; they're worth learning if you use R.
  • R has a habit of introducing users to powerful tools hidden from the typical Windows setup.
  • R, LaTeX, BibTeX, Sweave, and regular expressions are all text-driven systems, in contrast to largely menu-driven systems such as SPSS, MS Word, and EndNote. Their textual nature makes it easy for them to work together.
  • Running checks on replacement operations in regular expressions is important.
