Over the past few months, I’ve developed a set of functions for automating model estimation and interpretation using Mplus, an outstanding latent variable modeling program with unparalleled flexibility for complex models (e.g., factor mixture models). I recently rolled these functions into an R package called MplusAutomation. Because the package focuses on extracting parameters from text output files, I’ve learned a lot about regular expressions, particularly Perl-compatible regular expressions (PCRE). R provides a handful of useful regular expression routines that support Perl-compatible syntax (by passing perl=TRUE as an argument), and I’ve made frequent use of grep, regexpr, and gregexpr.
The problem with regexpr and gregexpr, in particular, is that their output is awkward: they return the starting positions and lengths of the matches rather than the matched text itself, which does not lend itself to easy string manipulation. The rest of the post will focus on gregexpr, which works like regexpr except that it returns all matches for a regular expression, whereas regexpr returns only the first. So, if you’re searching for all instances of the letter “a” in the line “abcacdabb”, regexpr would match only the first a, whereas gregexpr would find all three.
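To make the difference concrete, here is a minimal sketch using the “abcacdabb” line from above. The variable names are purely illustrative. It shows that both functions report positions and a match.length attribute, and that one way to recover the matched text is to feed those numbers to substring.

```r
line <- "abcacdabb"

# regexpr reports only the first match: a start position
# plus a "match.length" attribute.
first <- regexpr("a", line, perl = TRUE)
first_pos <- as.integer(first)
first_len <- attr(first, "match.length")

# gregexpr returns a list with one element per input string;
# each element holds the start positions of every match.
all_matches <- gregexpr("a", line, perl = TRUE)
all_pos <- as.integer(all_matches[[1]])
all_len <- attr(all_matches[[1]], "match.length")

# Convert positions and lengths back into the matched text.
matched <- substring(line, all_pos, all_pos + all_len - 1)

print(first_pos)  # position of the first "a"
print(all_pos)    # positions of every "a"
print(matched)    # the matched substrings
```

Having to do this position-and-length bookkeeping by hand for every match is what makes the raw output cumbersome.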
Let’s take a simple example. Suppose we want R to extract all HTML tags from a text file that has been read into a character vector using the scan function. To keep things easy to follow, I’ve simply defined a character vector containing a small HTML example.
> exampleText <- c(“< head>”, “
This is a title.”, “”, “”, “
This is an example header.
And here is some basic text.”, “A line without any tags.”, “”, “