# RegEx: Named Capture in R (Round 2)

Previously, I came up with a solution to R’s less than ideal handling of named capture in regular expressions with my `re.capture()` function. A little more than a year later, the problem is rearing its ugly – albeit subtly different – head again.

I now have a single character string:

```r
x = '`a` + `[b]` + `[1c]` + `[d] e`'
```

from which I need to pull every match: in the case above, anything encapsulated in backticks. Since my original `re.capture()` function was based on R’s `regexpr()` function, it would only return the first match:

```
> re.capture('`(?<tok>.*?)`', x)$names
$tok
[1] "a"
```

Simply switching the underlying `regexpr()` to `gregexpr()` wasn’t straightforward, as `gregexpr()` returns a list:

```
> str(gregexpr('`(?<tok>.*?)`', x, perl=T))
List of 1
 $ : atomic [1:4] 1 7 15 24
  ..- attr(*, "match.length")= int [1:4] 3 5 6 7
  ..- attr(*, "useBytes")= logi TRUE
  ..- attr(*, "capture.start")= int [1:4, 1] 2 8 16 25
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : NULL
  .. .. ..$ : chr "tok"
  ..- attr(*, "capture.length")= int [1:4, 1] 1 3 4 5
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : NULL
  .. .. ..$ : chr "tok"
  ..- attr(*, "capture.names")= chr "tok"
```

which happens to be as long as the input character vector against which the regex pattern is matched:

```
> x = '`a` + `[b]` + `[1c]` + `[d] e`'
> z = '`f` + `[g]` + `[1h]` + `[i] j`'
> str(gregexpr('`(?<tok>.*?)`', c(x,z) , perl=T), max.level=0)
List of 2
```

each element of which is a regex match object with its own set of attributes. Thus the new solution was to write a **new** function that walks the `list()` generated by `gregexpr()` looking for named capture tokens:

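```r
gregexcap = function(pattern, x, ...) {
  # Named capture needs Perl-compatible regexes, so always force perl=TRUE
  args = list(...)
  args[['perl']] = TRUE

  # gregexpr() returns one match object per element of x
  re = do.call(gregexpr, c(list(pattern, x), args))

  # Walk each match object together with the string it was matched against
  mapply(function(re, x) {
    # For every named group, cut all of its captured tokens out of x
    cap = sapply(attr(re, 'capture.names'), function(n, re, x) {
      start = attr(re, 'capture.start')[, n]
      len   = attr(re, 'capture.length')[, n]
      end   = start + len - 1
      tok   = substr(rep(x, length(start)), start, end)
      return(tok)
    }, re, x, simplify=FALSE, USE.NAMES=TRUE)

    return(cap)
  }, re, x, SIMPLIFY=FALSE)
}
```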

thereby returning my R coding universe to one-liner bliss:

```
> gregexcap('`(?<tok>.*?)`', x)
[[1]]
[[1]]$tok
[1] "a"     "[b]"   "[1c]"  "[d] e"


> gregexcap('`(?<tok>.*?)`', c(x,z))
[[1]]
[[1]]$tok
[1] "a"     "[b]"   "[1c]"  "[d] e"


[[2]]
[[2]]$tok
[1] "f"     "[g]"   "[1h]"  "[i] j"
```
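
For anyone curious about what happens inside the function for a single element of that list, here is a minimal sketch of the extraction step in isolation. The variable names (`m`, `start`, `len`, `by_offset`, `by_regmatches`) are mine and only illustrative. The cross-check with base R's `regmatches()` is not part of the original approach: it returns the whole match, backticks included, so the sketch strips them afterwards.

```r
x <- '`a` + `[b]` + `[1c]` + `[d] e`'

# Match object for the first (and only) element of x
m <- gregexpr('`(?<tok>.*?)`', x, perl = TRUE)[[1]]

# Offsets and lengths of the named group 'tok', one entry per match
start <- attr(m, 'capture.start')[, 'tok']
len   <- attr(m, 'capture.length')[, 'tok']

# Cut each token out of x using its start and end position
by_offset <- substring(x, start, start + len - 1)

# Cross-check (an assumption of this sketch, not the post's method):
# regmatches() gives whole matches, so remove the surrounding backticks
whole         <- regmatches(x, list(m))[[1]]
by_regmatches <- gsub('`', '', whole, fixed = TRUE)

print(by_offset)
print(identical(unname(by_offset), by_regmatches))
```

The offset-based route generalises to several named groups in one pattern, since each group gets its own column in `capture.start` and `capture.length`; the `regmatches()` route only ever sees the full match.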

