RegEx: Named Capture in R (Round 2)
Previously, I came up with a solution to R’s less-than-ideal handling of named capture in regular expressions with my re.capture() function. A little more than a year later, the problem is rearing its ugly, albeit subtly different, head again.
I now have a single character string:
```r
x = '`a` + `[b]` + `[1c]` + `[d] e`'
```
from which I need to pull every match, in this case anything encapsulated in backticks. Since my original re.capture() function was based on R’s regexpr() function, it only returns the first match:
```r
> re.capture('`(?<tok>.*?)`', x)$names
$tok
[1] "a"
```
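The single-match behavior can be reproduced with base R alone. The sketch below does not use re.capture() itself; it only reads the capture attributes that regexpr() attaches to its result when perl = TRUE, and the variable names (m, start, len, first_tok) are made up for illustration.

```r
x <- '`a` + `[b]` + `[1c]` + `[d] e`'

# regexpr() reports at most one match per input string
m <- regexpr('`(?<tok>.*?)`', x, perl = TRUE)

# With perl = TRUE the result carries capture.start / capture.length
# matrices: one row per input string, one column per capture group
start <- attr(m, 'capture.start')[, 'tok']
len   <- attr(m, 'capture.length')[, 'tok']

# Cut the captured token out of the source string
first_tok <- substr(x, start, start + len - 1)
print(first_tok)
```

Only the first backtick-delimited token comes back, because regexpr() stops after the first match in each string.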
Simply switching the underlying regexpr() to gregexpr() wasn’t straightforward, as gregexpr() returns a list:
```r
> str(gregexpr('`(?<tok>.*?)`', x, perl=T))
List of 1
 $ : atomic [1:4] 1 7 15 24
  ..- attr(*, "match.length")= int [1:4] 3 5 6 7
  ..- attr(*, "useBytes")= logi TRUE
  ..- attr(*, "capture.start")= int [1:4, 1] 2 8 16 25
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : NULL
  .. .. ..$ : chr "tok"
  ..- attr(*, "capture.length")= int [1:4, 1] 1 3 4 5
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : NULL
  .. .. ..$ : chr "tok"
  ..- attr(*, "capture.names")= chr "tok"
```
which happens to be as long as the input character vector against which the regex pattern is matched:
```r
> x = '`a` + `[b]` + `[1c]` + `[d] e`'
> z = '`f` + `[g]` + `[1h]` + `[i] j`'
> str(gregexpr('`(?<tok>.*?)`', c(x,z) , perl=T), max.level=0)
List of 2
```
each element of which is a regex match object with its own set of attributes. Thus the new solution was to write a new function that walks the list() generated by gregexpr(), looking for named-capture tokens:
```r
gregexcap = function(pattern, x, ...) {
  # Named capture needs Perl-compatible regexes, so force perl = TRUE
  args = list(...)
  args[['perl']] = TRUE

  # One match object per element of x
  re = do.call(gregexpr, c(list(pattern, x), args))

  # Walk the match objects alongside their source strings
  mapply(function(re, x) {

    # For each named group, cut its tokens out of the source string
    cap = sapply(attr(re, 'capture.names'), function(n, re, x) {
      start = attr(re, 'capture.start')[, n]
      len   = attr(re, 'capture.length')[, n]
      end   = start + len - 1
      tok   = substr(rep(x, length(start)), start, end)

      return(tok)
    }, re, x, simplify = FALSE, USE.NAMES = TRUE)

    return(cap)
  }, re, x, SIMPLIFY = FALSE)

}
```
thereby returning my R coding universe to one-liner bliss:
```r
> gregexcap('`(?<tok>.*?)`', x)
[[1]]
[[1]]$tok
[1] "a" "[b]" "[1c]" "[d] e"

> gregexcap('`(?<tok>.*?)`', c(x,z))
[[1]]
[[1]]$tok
[1] "a" "[b]" "[1c]" "[d] e"

[[2]]
[[2]]$tok
[1] "f" "[g]" "[1h]" "[i] j"
```
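The same function also copes with more than one named group, since it loops over every entry in capture.names. As a small sketch, the pattern and input string below are invented for illustration and are not part of the original problem; gregexcap() is repeated in compact form so the block runs on its own.

```r
# Compact copy of gregexcap() from above, so this example is self-contained
gregexcap <- function(pattern, x, ...) {
  args <- list(...)
  args[['perl']] <- TRUE
  re <- do.call(gregexpr, c(list(pattern, x), args))
  mapply(function(re, x) {
    sapply(attr(re, 'capture.names'), function(n, re, x) {
      start <- attr(re, 'capture.start')[, n]
      len   <- attr(re, 'capture.length')[, n]
      substr(rep(x, length(start)), start, start + len - 1)
    }, re, x, simplify = FALSE, USE.NAMES = TRUE)
  }, re, x, SIMPLIFY = FALSE)
}

# Hypothetical input: key=value pairs, with two named groups in the pattern
s   <- 'a=1, b=22'
res <- gregexcap('(?<key>\\w+)=(?<val>\\d+)', s)

# One list element per input string, one named vector per capture group
print(res[[1]]$key)
print(res[[1]]$val)

# The per-string result lines up naturally as columns of a data frame
df <- as.data.frame(res[[1]], stringsAsFactors = FALSE)
print(df)
```

Because every group yields one token per match, the vectors for a given input string all have the same length, which is what makes the data frame conversion work.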
Written with StackEdit.