RegEx: Named Capture in R (Round 2)

[This article was first published on Odd Hypothesis, and kindly contributed to R-bloggers.]

Previously, I came up with a solution to R’s less-than-ideal handling of named capture in regular expressions with my re.capture() function. A little more than a year later, the problem is rearing its ugly – albeit subtly different – head again.

I now have a single character string:

x = '`a` + `[b]` + `[1c]` + `[d] e`'

from which I need to pull every match: in the case above, anything encapsulated in backticks. Since my original re.capture() function was based on R’s regexpr() function, it would only return the first match:

> re.capture('`(?<tok>.*?)`', x)$names
$tok
[1] "a"
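
The first-match-only behaviour comes from base R itself, not from re.capture(). A minimal sketch using only base functions (the variable names `pattern`, `first` and `all_matches` are illustrative, not from the original post):

```r
x <- '`a` + `[b]` + `[1c]` + `[d] e`'
pattern <- '`(?<tok>.*?)`'

# regexpr() stops at the first match in each input string
first <- regmatches(x, regexpr(pattern, x, perl = TRUE))

# gregexpr() finds every match; regmatches() then returns a list
# with one element per input string
all_matches <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]

print(first)
print(all_matches)
```

Note that regmatches() returns the full matches, backticks included, rather than the contents of the named group, which is why the capture attributes still have to be handled separately.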

Simply switching the underlying regexpr() to gregexpr() wasn’t straightforward, because gregexpr() returns a list:

> str(gregexpr('`(?<tok>.*?)`', x, perl=T))
List of 1
 $ : atomic [1:4] 1 7 15 24
  ..- attr(*, "match.length")= int [1:4] 3 5 6 7
  ..- attr(*, "useBytes")= logi TRUE
  ..- attr(*, "capture.start")= int [1:4, 1] 2 8 16 25
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : NULL
  .. .. ..$ : chr "tok"
  ..- attr(*, "capture.length")= int [1:4, 1] 1 3 4 5
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : NULL
  .. .. ..$ : chr "tok"
  ..- attr(*, "capture.names")= chr "tok"

which happens to be as long as the input character vector against which the regex pattern is matched:

> x = '`a` + `[b]` + `[1c]` + `[d] e`'
> z = '`f` + `[g]` + `[1h]` + `[i] j`'
> str(gregexpr('`(?<tok>.*?)`', c(x,z) , perl=T), max.level=0)
List of 2

each element of which is a regex match object with its own set of attributes. Thus the solution was to write a new function that walks the list generated by gregexpr(), extracting the tokens of each named capture group:

gregexcap = function(pattern, x, ...) {
  # named capture needs the PCRE engine, so always force perl=TRUE
  args = list(...)
  args[['perl']] = TRUE

  re = do.call(gregexpr, c(list(pattern, x), args))

  # walk the match objects and their source strings in parallel
  mapply(function(re, x){

    # build one vector of tokens per named capture group
    cap = sapply(attr(re, 'capture.names'), function(n, re, x){
      start = attr(re, 'capture.start')[, n]
      len   = attr(re, 'capture.length')[, n]
      end   = start + len - 1
      tok   = substr(rep(x, length(start)), start, end)

      return(tok)
    }, re, x, simplify=FALSE, USE.NAMES=TRUE)

    return(cap)
  }, re, x, SIMPLIFY=FALSE)

}

thereby returning my R coding universe to one-liner bliss:

> gregexcap('`(?<tok>.*?)`', x)
[[1]]
[[1]]$tok
[1] "a"     "[b]"   "[1c]"  "[d] e"

> gregexcap('`(?<tok>.*?)`', c(x,z))
[[1]]
[[1]]$tok
[1] "a"     "[b]"   "[1c]"  "[d] e"

[[2]]
[[2]]$tok
[1] "f"     "[g]"   "[1h]"  "[i] j"
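
Because gregexcap() iterates over the capture.names attribute, a pattern with more than one named group should produce one list element per group. The following is a minimal sketch of that case, assuming a made-up key=value input string and pattern (neither appears in the original post); the function is repeated, with `<-` assignment, so the block runs on its own:

```r
# gregexcap() as defined above, repeated so this block is self-contained
gregexcap <- function(pattern, x, ...) {
  args <- list(...)
  args[['perl']] <- TRUE
  re <- do.call(gregexpr, c(list(pattern, x), args))

  mapply(function(re, x) {
    sapply(attr(re, 'capture.names'), function(n, re, x) {
      start <- attr(re, 'capture.start')[, n]
      len   <- attr(re, 'capture.length')[, n]
      substr(rep(x, length(start)), start, start + len - 1)
    }, re, x, simplify = FALSE, USE.NAMES = TRUE)
  }, re, x, SIMPLIFY = FALSE)
}

# hypothetical input: two key=value pairs in one string
kv  <- 'a=1, b=2'
res <- gregexcap('(?<key>\\w+)=(?<val>\\w+)', kv)

print(res[[1]]$key)  # "a" "b"
print(res[[1]]$val)  # "1" "2"
```

Each named group keeps its tokens in match order, so the i-th key lines up with the i-th value.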
