(This article was first published on **Odd Hypothesis**, and kindly contributed to R-bloggers)

Previously, I came up with a solution to R’s less-than-ideal handling of named capture in regular expressions with my `re.capture()` function. A little more than a year later, the problem is rearing its ugly – albeit subtly different – head again.

I now have a single character string:

```r
x = '`a` + `[b]` + `[1c]` + `[d] e`'
```

from which I need to pull multiple matches: in this case, anything encapsulated in backticks. Since my original `re.capture()` function was based on R’s `regexpr()` function, it would only return the first match:

```
> re.capture('`(?<tok>.*?)`', x)$names
$tok
[1] "a"
```
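The same behavior can be seen with base R alone. The sketch below does not use `re.capture()`, whose source is not shown in this post; it only compares what `regexpr()` and `gregexpr()` report for the same pattern, then pulls the first `tok` capture from the attributes that `perl = TRUE` attaches to the result. The variable names are illustrative.

```r
x <- "`a` + `[b]` + `[1c]` + `[d] e`"
pattern <- "`(?<tok>.*?)`"

# regexpr() stops at the first match; gregexpr() keeps scanning the string
first_match <- regexpr(pattern, x, perl = TRUE)
all_matches <- gregexpr(pattern, x, perl = TRUE)[[1]]

n_first <- length(first_match)  # number of matches reported by regexpr()
n_all <- length(all_matches)    # number of matches reported by gregexpr()

# With perl = TRUE, capture positions are stored as attributes of the result
start <- attr(first_match, "capture.start")[, "tok"]
len <- attr(first_match, "capture.length")[, "tok"]
first_tok <- substr(x, start, start + len - 1)

print(c(n_first, n_all))
print(first_tok)
```

Only one of the four backtick-delimited tokens is visible through `regexpr()`, which is why a function built on it cannot return the rest.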

Simply switching the underlying `regexpr()` to `gregexpr()` wasn’t straightforward, as `gregexpr()` returns a list:

```
> str(gregexpr('`(?<tok>.*?)`', x, perl=T))
List of 1
 $ : atomic [1:4] 1 7 15 24
  ..- attr(*, "match.length")= int [1:4] 3 5 6 7
  ..- attr(*, "useBytes")= logi TRUE
  ..- attr(*, "capture.start")= int [1:4, 1] 2 8 16 25
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : NULL
  .. .. ..$ : chr "tok"
  ..- attr(*, "capture.length")= int [1:4, 1] 1 3 4 5
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : NULL
  .. .. ..$ : chr "tok"
  ..- attr(*, "capture.names")= chr "tok"
```

which happens to be as long as the input character vector against which the regex pattern is matched:

```
> x = '`a` + `[b]` + `[1c]` + `[d] e`'
> z = '`f` + `[g]` + `[1h]` + `[i] j`'
> str(gregexpr('`(?<tok>.*?)`', c(x,z), perl=T), max.level=0)
List of 2
```

each element of which is a regex match object with its own set of attributes. Thus the new solution was to write a **new** function that walks the `list()` generated by `gregexpr()` looking for named capture tokens:

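```r
gregexcap = function(pattern, x, ...) {
  # Force PCRE: capture attributes are only reported when perl = TRUE
  args = list(...)
  args[['perl']] = TRUE

  # One match object per element of x
  re = do.call(gregexpr, c(list(pattern, x), args))

  # Pair each match object with the string it was matched against
  mapply(function(re, x) {
    # For each named group, cut every captured token out of x
    cap = sapply(attr(re, 'capture.names'), function(n, re, x) {
      start = attr(re, 'capture.start')[, n]
      len   = attr(re, 'capture.length')[, n]
      end   = start + len - 1
      tok   = substr(rep(x, length(start)), start, end)
      return(tok)
    }, re, x, simplify=FALSE, USE.NAMES=TRUE)

    return(cap)
  }, re, x, SIMPLIFY=FALSE)
}
```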

thereby returning my R coding universe to one-liner bliss:

```
> gregexcap('`(?<tok>.*?)`', x)
[[1]]
[[1]]$tok
[1] "a" "[b]" "[1c]" "[d] e"

> gregexcap('`(?<tok>.*?)`', c(x,z))
[[1]]
[[1]]$tok
[1] "a" "[b]" "[1c]" "[d] e"

[[2]]
[[2]]$tok
[1] "f" "[g]" "[1h]" "[i] j"
```
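Because the function loops over every entry in `capture.names`, it should also cope with patterns that have more than one named group. The following is a minimal sketch under that assumption: `gregexcap()` is restated from above only so the block runs on its own, and the `key=value` input string and the group names `key` and `val` are hypothetical, chosen purely for illustration.

```r
# gregexcap() restated from the post so this block is self-contained
gregexcap <- function(pattern, x, ...) {
  args <- list(...)
  args[["perl"]] <- TRUE
  re <- do.call(gregexpr, c(list(pattern, x), args))
  mapply(function(re, x) {
    sapply(attr(re, "capture.names"), function(n, re, x) {
      start <- attr(re, "capture.start")[, n]
      len <- attr(re, "capture.length")[, n]
      substr(rep(x, length(start)), start, start + len - 1)
    }, re, x, simplify = FALSE, USE.NAMES = TRUE)
  }, re, x, SIMPLIFY = FALSE)
}

# Hypothetical input: comma-separated key=value pairs
s <- "a=1, b=22"
res <- gregexcap("(?<key>\\w+)=(?<val>\\d+)", s)

# One list per input string, one character vector per named group
print(names(res[[1]]))
print(res[[1]]$key)
print(res[[1]]$val)
```

Each named group comes back as its own character vector, so related captures can be lined up by position.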

