Properly “internationalized” regular expressions in R

[This article was first published on Rexamine » Blog/R-bloggers, and kindly contributed to R-bloggers.]

We should pay special attention to writing truly portable code that behaves the same way under different locales and character encodings. Currently, R has two regex engines: ERE (extended regular expressions, via TRE) and PRE (Perl-like regular expressions, via PCRE). Surprisingly, they can give different results depending on the operating system and the native character encoding in use!

UPDATE@2013-07-10: check out our stringi package to get rid of such problems forever!

PCRE often outperforms the ERE engine (TRE) and has a more powerful syntax. Moreover, it is built into R with Unicode support. Since UTF-8 can represent almost all printable characters used around the world, it is a good idea to always use PRE on normalized character vectors, i.e. to convert them from the native encoding to UTF-8 with enc2utf8() and then, after matching, to convert the results back with enc2native().
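As a minimal sketch of the round trip described above: normalize to UTF-8, match with perl=TRUE, then convert the results back to the native encoding. The helper name extract_letter_runs is hypothetical and not part of the original post, and the test string is written with \u escapes so that the source stays ASCII-only.

```r
# Hypothetical helper (not from the original post): extract runs of letters
# in a way that does not depend on the native encoding.
extract_letter_runs <- function(x) {
  x_utf8 <- enc2utf8(x)                          # native encoding -> UTF-8
  m <- gregexpr("\\p{L}+", x_utf8, perl = TRUE)  # PCRE with Unicode properties
  lapply(regmatches(x_utf8, m), enc2native)      # matched text -> native encoding
}

# The string below reads "zażółć 123 gęślą" (two words around a number).
runs <- extract_letter_runs("za\u017c\u00f3\u0142\u0107 123 g\u0119\u015bl\u0105")
length(runs[[1]])  # number of letter runs found in the first string
```

Only the matching step needs the UTF-8 copy; the original vector can stay in its native encoding.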

Here’s an example on matching some character classes in three different locales. The string where matches were sought consisted of all ASCII characters (codes 1–127) and Polish letters (ą, ę, ł, ś, ż, and so on).

Locale
Pattern pl_PL.UTF-8
(GNU/Linux)
pl_PL.iso-8859-2
(GNU/Linux)
Polish_Poland.1250
(Windows)
ERE-Native
[[:alpha:]] AB...Zab...z ĄĆĘŁŃÓŚŹŻąćęłńóśźż AB...Zab...zĆĘŃÓćęńó
[[:digit:]] 0123456789 0123456789ął
[[:lower:]] ab...ząćęłńóśźż ab...zćęńó
[[:upper:]] AB...ZĄĆĘŁŃÓŚŹŻ AB...ZĆĘŃÓ
[[:punct:]] !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ ĄŁŚŹŻąłśźż !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ ĄŁŻąłż
[A-Z] AB...Z
[a-z] ab...z
PCRE-Native
[[:alpha:]] AB...Zab...z AB...Zab...z ĄĆĘŁŃÓŚŹŻąćęłńóśźż
[[:digit:]] 0123456789
[[:lower:]] ab...z ab...ząćęłńóśźż
[[:upper:]] AB...Z AB...ZĄĆĘŁŃÓŚŹŻ
[[:punct:]] !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
[A-Z] AB...Z
[a-z] ab...z
ERE-UTF-8 normalized
[[:alpha:]] AB...Zab...z ĄĆĘŁŃÓŚŹŻąćęłńóśźż AB...Zab...zÓó
[[:digit:]] 0123456789
[[:lower:]] ab...ząćęłńóśźż ab...zó
[[:upper:]] AB...ZĄĆĘŁŃÓŚŹŻ AB...ZÓ
[[:punct:]] !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
[A-Z] AB...Z
[a-z] ab...z
PCRE-UTF-8 normalized
\p{L} AB...Zab...zĄĆĘŁŃÓŚŹŻąćęłńóśźż
\p{N} 0123456789
\p{Ll} ab...ząćęłńóśźż
\p{Lu} AB...ZĄĆĘŁŃÓŚŹŻ
\p{P} !"#%&'()*,-./:;?@[\]_{}
\p{S} $+<=>^`|~
\p{S}|\p{P} !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
[A-Z] AB...Z
[a-z] ab...z

We see that PCRE, after “normalization” with enc2utf8(), gives correct results in all three locales.
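A check of this kind can be sketched roughly as follows. This is an illustration under stated assumptions, not the script used to produce the results above: the helper name match_class is hypothetical, and the Polish letters are written as \u escapes so that the source stays ASCII-only regardless of the platform encoding.

```r
# Test string: all ASCII characters with codes 1-127, followed by the
# Polish letters (upper case, then lower case).
polish <- paste0(
  "\u0104\u0106\u0118\u0141\u0143\u00d3\u015a\u0179\u017b",
  "\u0105\u0107\u0119\u0142\u0144\u00f3\u015b\u017a\u017c"
)
text <- paste0(rawToChar(as.raw(1:127)), polish)

# Hypothetical helper: all characters of `text` matched by `pattern`
# (PCRE on UTF-8 normalized input), pasted into a single string.
match_class <- function(pattern, text) {
  x <- enc2utf8(text)
  m <- gregexpr(enc2utf8(pattern), x, perl = TRUE)
  paste(regmatches(x, m)[[1]], collapse = "")
}

match_class("\\p{Lu}", text)  # upper-case letters, ASCII and Polish
match_class("\\p{Ll}", text)  # lower-case letters, ASCII and Polish
match_class("\\p{N}", text)   # digits
```

Running the same calls with perl=FALSE, or without enc2utf8(), is one way to compare the engines on a given platform.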

An example:

gregexpr(enc2utf8(pattern), enc2utf8(text), perl=TRUE)

With the stringr package you may use e.g.:

str_extract_all(enc2utf8(text), perl(enc2utf8(pattern)))

Note that regexec() (and str_match_all() from stringr) currently do not support PRE. However, you may use gregexpr() instead.

str_match_all.perl <- function(s, p)
{
   require("stringr")
   m <- gregexpr(enc2utf8(p), enc2utf8(s), perl=TRUE) # PCRE-NORMALIZED
   # note that normalization is needed only for regex-matching


   out <- vector("list", length(s))

   # vectorized over s
   for (j in seq_along(s))
   {
      nmatch <- length(m[[j]])
      ncapt  <- length(attr(m[[j]], "capture.names"))

      # skip strings with no match (gregexpr() signals this with a single -1)
      if (length(m[[j]]) == 1 && m[[j]][1] == -1) next

      out[[j]] <- matrix(str_sub(s[[j]], m[[j]],
         m[[j]]+attr(m[[j]], "match.length")-1),
         nrow=nmatch, ncol=ncapt+1)

      if (ncapt > 0) {
         cs <- as.integer(attr(m[[j]], "capture.start"))
         cl <- as.integer(attr(m[[j]], "capture.length"))
         out[[j]][,-1] <- str_sub(s[j], cs, cs+cl-1)

         if (any(str_length(attr(m[[j]], "capture.names")) > 0))
             colnames(out[[j]]) <- c("", attr(m[[j]], "capture.names"))
      }
   }

   out
}


# test:

print(str_match_all.perl(c(
   "rocznik99='nie', skrzynia='auto', osób='5', foo=bar",
   "kolor='ŻÓŁTY', silnik='3.0l TURBO++', skrzynia='manual'"
), "(?<attr>\\p{L}+)='(?<val>[^']*)'")) # two named groups

print(str_match_all.perl(c(
   "kolor='czerwony', świece='nówka', skrzynia='manual'",
   "rocznik99='skądże', skrzynia='auto', osób='5', foo=bar"
), "\\p{L}+='[^']*'")) # no groups

print(str_match_all.perl("123", "\\p{L}+")) # no matches at all

UPDATE@2013-07-10: our stringi package works the same in each platform’s locale and encoding. See the stri_locate_all_regex and stri_match_all_regex functions.
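A minimal sketch of the stringi route, assuming the stringi package is installed (the code is skipped otherwise). The input strings are made up for illustration, and unnamed capture groups are used to keep the example simple.

```r
# Assumes stringi is installed; nothing is run if it is not.
if (requireNamespace("stringi", quietly = TRUE)) {
  x <- c("kolor='czerwony', skrzynia='manual'", "123")
  res <- stringi::stri_match_all_regex(x, "(\\p{L}+)='([^']*)'")
  print(res[[1]])  # one row per match: full match, then each capture group
  print(res[[2]])  # no match: a single row of NA values by default
}
```

No enc2utf8() call is needed here, since stringi handles the encoding conversion internally.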

Marek Gągolewski
