RegEx: Named Capture in R (Round 2)

October 9, 2013
(This article was first published on Odd Hypothesis, and kindly contributed to R-bloggers)

Previously, I came up with a solution to R’s less-than-ideal handling of named capture in regular expressions with my re.capture() function. A little more than a year later, the problem is rearing its ugly – albeit subtly different – head again.

I now have a single character string:

x = '`a` + `[b]` + `[1c]` + `[d] e`'

from which I need to pull every match: in this case, anything enclosed in backticks. Since my original re.capture() function was based on R’s regexpr() function, it would only return the first match:

> re.capture('`(?<tok>.*?)`', x)$names
$tok
[1] "a"
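
The first-match-only behaviour comes from regexpr() itself, so it can be reproduced without re.capture(). The following is a minimal sketch using only base R; the variable names m, start, len and first_tok are chosen here for illustration and are not from the original function:

```r
x = '`a` + `[b]` + `[1c]` + `[d] e`'

# regexpr() stops at the first match, so only one capture position is reported
m = regexpr('`(?<tok>.*?)`', x, perl=TRUE)

# The named group shows up as a column in the capture.* attribute matrices
start = attr(m, 'capture.start')[, 'tok']
len   = attr(m, 'capture.length')[, 'tok']

first_tok = substr(x, start, start + len - 1)
print(first_tok)  # "a"
```

The remaining three backtick-delimited tokens never appear in the attributes of m, which is why a gregexpr()-based approach is needed.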

Simply switching the underlying regexpr() to gregexpr() wasn’t straightforward, as gregexpr() returns a list:

> str(gregexpr('`(?<tok>.*?)`', x, perl=T))
List of 1
 $ : atomic [1:4] 1 7 15 24
  ..- attr(*, "match.length")= int [1:4] 3 5 6 7
  ..- attr(*, "useBytes")= logi TRUE
  ..- attr(*, "capture.start")= int [1:4, 1] 2 8 16 25
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : NULL
  .. .. ..$ : chr "tok"
  ..- attr(*, "capture.length")= int [1:4, 1] 1 3 4 5
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : NULL
  .. .. ..$ : chr "tok"
  ..- attr(*, "capture.names")= chr "tok"

which happens to be as long as the input character vector against which the regex pattern is matched:

> x = '`a` + `[b]` + `[1c]` + `[d] e`'
> z = '`f` + `[g]` + `[1h]` + `[i] j`'
> str(gregexpr('`(?<tok>.*?)`', c(x,z) , perl=T), max.level=0)
List of 2

each element of which is a regex match object with its own set of attributes. Thus the new solution was to write a new function that walks the list() generated by gregexpr() looking for named capture tokens:

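gregexcap = function(pattern, x, ...) {
  # Force perl=TRUE: named capture groups are only supported by the PCRE engine
  args = list(...)
  args[['perl']] = TRUE

  # One match object per element of x
  re = do.call(gregexpr, c(list(pattern, x), args))

  # Walk the match objects in parallel with the input strings
  mapply(function(re, x){

    # For each named group, cut every captured substring out of x
    cap = sapply(attr(re, 'capture.names'), function(n, re, x){
      start = attr(re, 'capture.start')[, n]
      len   = attr(re, 'capture.length')[, n]
      end   = start + len - 1
      tok   = substr(rep(x, length(start)), start, end)

      return(tok)
    }, re, x, simplify=FALSE, USE.NAMES=TRUE)

    return(cap)
  }, re, x, SIMPLIFY=FALSE)

}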

thereby returning my R coding universe to one-liner bliss:

> gregexcap('`(?<tok>.*?)`', x)
[[1]]
[[1]]$tok
[1] "a" "[b]" "[1c]" "[d] e"

> gregexcap('`(?<tok>.*?)`', c(x,z))
[[1]]
[[1]]$tok
[1] "a" "[b]" "[1c]" "[d] e"

[[2]]
[[2]]$tok
[1] "f" "[g]" "[1h]" "[i] j"
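
Because the function loops over every entry in capture.names, it should also cope with patterns that contain more than one named group. The sketch below repeats the definition in condensed form so that it runs on its own; the key/value pattern and the input string kv are made-up examples rather than anything from the original problem:

```r
# gregexcap() as defined above, condensed so this block is self-contained
gregexcap = function(pattern, x, ...) {
  args = list(...)
  args[['perl']] = TRUE
  re = do.call(gregexpr, c(list(pattern, x), args))
  mapply(function(re, x) {
    sapply(attr(re, 'capture.names'), function(n, re, x) {
      start = attr(re, 'capture.start')[, n]
      len   = attr(re, 'capture.length')[, n]
      substr(rep(x, length(start)), start, start + len - 1)
    }, re, x, simplify=FALSE, USE.NAMES=TRUE)
  }, re, x, SIMPLIFY=FALSE)
}

# Hypothetical input with two named groups, 'key' and 'val'
kv  = 'a=1, b=22, c=333'
res = gregexcap('(?<key>[a-z]+)=(?<val>[0-9]+)', kv)

print(res[[1]]$key)  # "a" "b" "c"
print(res[[1]]$val)  # "1" "22" "333"
```

Each named group becomes its own element of the inner list, and the i-th entries of key and val belong to the same match.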
