RegEx: Named Capture in R
[This article was first published on Odd Hypothesis, and kindly contributed to R-bloggers].
I consider myself a decent RegEx user. References to famous quotes about RegEx aside, I find it intuitive, and I like its speed and that it makes my code simpler (more so than the alternative, anyhow). Thus, I use RegEx where I can in the growing grab bag of languages I consider myself proficient in:
- *nix command line / shell scripts
- Javascript
- PHP
- Matlab
- Python
- R
To get a sense of R’s named capture inadequacy, here’s a simple scenario …
The Problem:
You are given a list of files with names like:
- chA_0001
- chA_0002
- chA_0003
- chB_0001
- chB_0002
- chB_0003
and you need to pull the channel letter (A or B) and the four-digit ID out of each name.
The regular expression with named capture to do this is quite simple:
ch(?<ch>[A-Z])\_(?<id>[0-9]{4})
which, given the list of file names, should return a structure with property:value pairs of the sort:
- ch : A, A, A, B, B, B
- id : 0001, 0002, 0003, 0001, 0002, 0003
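As a quick check of the pattern in R (a minimal sketch, not the original listing): the pattern has to be written as a string literal with the backslash doubled, and `(?<name>...)` groups need `perl = TRUE`. The variable names `src`, `pat`, `matched`, and `target` are chosen here for illustration.

```r
# File names from the problem statement
src <- c("chA_0001", "chA_0002", "chA_0003",
         "chB_0001", "chB_0002", "chB_0003")

# The pattern as an R string literal: the backslash is doubled
pat <- "ch(?<ch>[A-Z])\\_(?<id>[0-9]{4})"

# perl = TRUE selects the PCRE engine, which understands (?<name>...)
matched <- grepl(pat, src, perl = TRUE)
stopifnot(all(matched))

# The structure we would like to get back: one vector per named group
target <- list(
  ch = c("A", "A", "A", "B", "B", "B"),
  id = c("0001", "0002", "0003", "0001", "0002", "0003")
)
str(target)
```

This only confirms that every file name matches and spells out the desired result; it does not yet extract anything.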
The Solutions:
Here’s some Matlab code that basically does this in one line:
which would result in the following console output:
Now here’s the equivalent R code:
There is a lot of work here! To help explain what’s going on, here’s the corresponding console output:
Here’s what’s happening:
1. regexpr(…, perl=T) is used to create a regular expression result with named capture, which is placed in the $result item of the output list:

      $result
      [1] 1 1 1 1 1 1
      attr(,"match.length")
      [1] 8 8 8 8 8 8
      attr(,"useBytes")
      [1] TRUE
      attr(,"capture.start")
           ch id
      [1,]  3  5
      [2,]  3  5
      [3,]  3  5
      [4,]  3  5
      [5,]  3  5
      [6,]  3  5
      attr(,"capture.length")
           ch id
      [1,]  1  4
      [2,]  1  4
      [3,]  1  4
      [4,]  1  4
      [5,]  1  4
      [6,]  1  4
      attr(,"capture.names")
      [1] "ch" "id"

   This result is pretty unusable, since all of the important captured information is buried in attributes.
2. To do anything with the output from regexpr(), the result from #1 has to have its attributes probed using attr() (via a for loop) to get:
   - captured group names
   - start locations within the strings of the captured groups
   - lengths of the captured groups (oddly/depressingly, end positions are not returned)

   These are then passed to substr() to extract the actual match strings from the input list:

      rex$names[[.name]] = substr(rex$src,
                                  attr(rex$result, 'capture.start')[,.name],
                                  attr(rex$result, 'capture.start')[,.name] +
                                    attr(rex$result, 'capture.length')[,.name] - 1)

3. The above steps are encapsulated in a much easier-to-use function, re.capture(), that allows for one-line-ish extraction:

      > src
      [1] "chA_0001" "chA_0002" "chA_0003" "chB_0001" "chB_0002" "chB_0003"
      > pat
      [1] "ch(?<ch>[A-Z])\\_(?<id>[0-9]{4})"
      > re.capture(pat, src)$names$ch
      [1] "A" "A" "A" "B" "B" "B"
      > re.capture(pat, src)$names$id
      [1] "0001" "0002" "0003" "0001" "0002" "0003"
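The steps above can be sketched as a single function. This is a reconstruction from the description, not the original listing: the list fields `src`, `result`, and `names` follow the `rex$src`, `rex$result`, and `rex$names` references in the text, while the helper variables `starts` and `lens` are illustrative. It assumes every capture group in the pattern is named.

```r
# Sketch of a re.capture()-style helper (assumes all groups are named)
re.capture <- function(pattern, x, ...) {
  rex <- list(
    src    = x,
    result = regexpr(pattern, x, perl = TRUE, ...),  # step 1
    names  = list()
  )

  # Step 2: pull the capture metadata out of the attributes
  starts <- attr(rex$result, "capture.start")
  lens   <- attr(rex$result, "capture.length")

  for (.name in attr(rex$result, "capture.names")) {
    # End position is not returned, so compute it as start + length - 1
    rex$names[[.name]] <- substr(rex$src,
                                 starts[, .name],
                                 starts[, .name] + lens[, .name] - 1)
  }
  rex
}

src <- c("chA_0001", "chA_0002", "chA_0003",
         "chB_0001", "chB_0002", "chB_0003")
pat <- "ch(?<ch>[A-Z])\\_(?<id>[0-9]{4})"

parsed <- re.capture(pat, src)
parsed$names$ch
parsed$names$id
```

Because regexpr() reports only the first match per string, this sketch returns at most one value per group per input element; strings that do not match yield an empty string for each group.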
Summary
All told, it takes three functions and a for loop to get a user-friendly named capture result! While I was able to make a one-liner function out of the ordeal, it’s a shame that someone on the R development team couldn’t build this into the return values for regexpr() and gregexpr(). Granted, I’m not the first to wish for something better. Perhaps this is something to look forward to in R 2.16?
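For readers on newer versions of R: since R 3.4.0 the utils package includes strcapture(), which returns captured groups as data frame columns. It is not named capture in the strict sense, because the column names come from a prototype data frame (`proto`) rather than from `(?<name>...)` groups in the pattern. A minimal sketch, with `src` and `parsed_df` as illustrative names:

```r
# Requires R >= 3.4.0 (utils::strcapture)
src <- c("chA_0001", "chA_0002", "chA_0003",
         "chB_0001", "chB_0002", "chB_0003")

# Plain (unnamed) groups; names and types are taken from `proto`
parsed_df <- utils::strcapture(
  pattern = "ch([A-Z])_([0-9]{4})",
  x       = src,
  proto   = data.frame(ch = character(), id = character())
)

parsed_df$ch
parsed_df$id
```

Keeping `id` as character preserves the leading zeros; a prototype column of `integer()` would convert the IDs to numbers instead.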