Regular expressions for everyone else

[This article was first published on 4D Pie Charts » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Regular expressions are an amazing tool for working with character data, but they are also painful to read and write.  Even after years of working with them, I struggle to remember the syntax for negative lookahead, or which way round the start and end anchor symbols go.

Consequently, I’ve created the regex package for human readable regular expression generation.  It’s currently only on github (CRAN version arriving as soon as you give me feedback), so you can get it with:

library(devtools)
install_github("regex", "richierocks")

Before, if I wanted to find the names of all the operators in the base package, my workflow would be something like:

I need ls, with a pattern that matches punctuation.  So I open the ?regex help page and look for the character class for punctuation.  My first attempt is then:

ls(baseenv(), pattern = "[:punct:]")

Ok, wait, the class has to be wrapped in square brackets itself.

ls(baseenv(), pattern = "[[:punct:]]")

Better, but that’s matching S3 classes and some functions too.  I want to match only where there’s punctuation at the start.  What’s the anchor for the start?  Back to reading ?regex.  Sod it, there’s too much text here; it’s probably a dollar sign.

ls(baseenv(), pattern = "$[[:punct:]]")

Hmm, nope.  Must be a caret.

ls(baseenv(), pattern = "^[[:punct:]]")

Hurrah!  Still, it took me 5 minutes for a simple example.  For something more complicated like matching email addresses or telephone numbers or particular time formats, building regular expressions this way can become time consuming and frustrating.  Here’s the equivalent syntax using regex.

ls(baseenv(), pattern = START %c% punct())

START; is just a constant that returns a caret. The %c% operator is a wrapper to paste0, and punct is a function returning a group of punctuation.  You can pass it argument to match multiple punctuation.  For example punct(3, 5) matches between 3 and 5 punctuation characters.

You also get lower-level functions.  punct(3, 5) is a convenience wrapper for repeated(group(PUNCT), 3, 5).

As a more complicated example, you can match an email address like:

one_or_more(group(ASCII_ALNUM %c% "._%+-")) %c%
  "@" %c%
  one_or_more(group(ASCII_ALNUM %c% ".-")) %c%
  DOT %c%
  ascii_alpha(2, 4)

This reads Match one or more letters, numbers, dots, underscores, percents, plusses or hyphens. Then match an ‘at’ symbol. Then match one or more letters, numbers, dots, or hyphens. Then match a dot. Then match two to four letters.

There are also functions for tokenising, capturing, and lookahead/lookbehind, and an operator for alternation.  I’m already rather excited about how much easier regular expressions have become for me to use.


Tagged: r, regex, regular-expressions

To leave a comment for the author, please follow the link and comment on their blog: 4D Pie Charts » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)