Properly “internationalized” regular expressions in R

April 5, 2013
(This article was first published on Rexamine » Blog/R-bloggers, and kindly contributed to R-bloggers)

We should take special care to write truly portable code that behaves the same way under different locales and character encodings. R currently has two regex engines: ERE (extended regular expressions, via the TRE library) and PRE (Perl-style regular expressions, via PCRE). Surprisingly, they may give different results depending on the operating system and the native character encoding in use!

UPDATE@2013-07-10: check out our stringi package to get rid of such problems forever!

PRE often outperforms ERE and has a more powerful syntax. Moreover, PCRE is built into R with Unicode support. Since UTF-8 can represent almost all printable characters used around the world, it is a good idea to always use PRE on normalized character vectors, i.e. ones converted from the native encoding to UTF-8 with enc2utf8() and, after regexing, converted back with enc2native().
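A minimal sketch of that round trip, assuming a PCRE build with Unicode property support (the variable names x, x_utf8, y_utf8 and y are only illustrative, and the Polish word is written with \u escapes so that the snippet does not depend on the encoding of the source file):

```r
# "kolor: żółty" written with \u escapes (pure-ASCII source)
x <- "kolor: \u017c\u00f3\u0142ty"

# 1. normalize: native encoding -> UTF-8
x_utf8 <- enc2utf8(x)

# 2. regex with PRE; \p{L} matches any Unicode letter,
#    so each run of letters gets wrapped in angle brackets
y_utf8 <- gsub("(\\p{L}+)", "<\\1>", x_utf8, perl = TRUE)

# 3. convert back to the native encoding for further use;
#    characters that the native encoding cannot represent
#    may be shown as <U+xxxx> escapes
y <- enc2native(y_utf8)
print(y)
```

Only the regex step needs the UTF-8 form; the rest of the program can keep working with native strings.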

Here’s an example of matching some character classes in three different locales. The string being searched consisted of all ASCII characters (codes 1–127) and the Polish letters (ą, ę, ł, ś, ż, and so on).

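             Locale
Pattern      pl_PL.UTF-8 (GNU/Linux)                      pl_PL.iso-8859-2 (GNU/Linux)                 Polish_Poland.1250 (Windows)

ERE-Native
[[:alpha:]]  AB...Zab...z ĄĆĘŁŃÓŚŹŻąćęłńóśźż              AB...Zab...zĆĘŃÓćęńó                         AB...Zab...zĆĘŃÓćęńó
[[:digit:]]  0123456789                                   0123456789                                   0123456789ął
[[:lower:]]  ab...ząćęłńóśźż                              ab...zćęńó                                   ab...zćęńó
[[:upper:]]  AB...ZĄĆĘŁŃÓŚŹŻ                              AB...ZĆĘŃÓ                                   AB...ZĆĘŃÓ
[[:punct:]]  !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~             !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ ĄŁŚŹŻąłśźż  !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ ĄŁŻąłż
[A-Z]        AB...Z                                       AB...Z                                       AB...Z
[a-z]        ab...z                                       ab...z                                       ab...z

PCRE-Native
[[:alpha:]]  AB...Zab...z                                 AB...Zab...z ĄĆĘŁŃÓŚŹŻąćęłńóśźż              AB...Zab...z ĄĆĘŁŃÓŚŹŻąćęłńóśźż
[[:digit:]]  0123456789                                   0123456789                                   0123456789
[[:lower:]]  ab...z                                       ab...ząćęłńóśźż                              ab...ząćęłńóśźż
[[:upper:]]  AB...Z                                       AB...ZĄĆĘŁŃÓŚŹŻ                              AB...ZĄĆĘŁŃÓŚŹŻ
[[:punct:]]  !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~             !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~             !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
[A-Z]        AB...Z                                       AB...Z                                       AB...Z
[a-z]        ab...z                                       ab...z                                       ab...z

ERE-UTF-8 normalized
[[:alpha:]]  AB...Zab...z ĄĆĘŁŃÓŚŹŻąćęłńóśźż              AB...Zab...z ĄĆĘŁŃÓŚŹŻąćęłńóśźż              AB...Zab...zÓó
[[:digit:]]  0123456789                                   0123456789                                   0123456789
[[:lower:]]  ab...ząćęłńóśźż                              ab...ząćęłńóśźż                              ab...zó
[[:upper:]]  AB...ZĄĆĘŁŃÓŚŹŻ                              AB...ZĄĆĘŁŃÓŚŹŻ                              AB...ZÓ
[[:punct:]]  !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~             !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~             !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
[A-Z]        AB...Z                                       AB...Z                                       AB...Z
[a-z]        ab...z                                       ab...z                                       ab...z

PCRE-UTF-8 normalized
\p{L}        AB...Zab...zĄĆĘŁŃÓŚŹŻąćęłńóśźż               AB...Zab...zĄĆĘŁŃÓŚŹŻąćęłńóśźż               AB...Zab...zĄĆĘŁŃÓŚŹŻąćęłńóśźż
\p{N}        0123456789                                   0123456789                                   0123456789
\p{Ll}       ab...ząćęłńóśźż                              ab...ząćęłńóśźż                              ab...ząćęłńóśźż
\p{Lu}       AB...ZĄĆĘŁŃÓŚŹŻ                              AB...ZĄĆĘŁŃÓŚŹŻ                              AB...ZĄĆĘŁŃÓŚŹŻ
\p{P}        !"#%&'()*,-./:;?@[\]_{}                      !"#%&'()*,-./:;?@[\]_{}                      !"#%&'()*,-./:;?@[\]_{}
\p{S}        $+<=>^`|~                                    $+<=>^`|~                                    $+<=>^`|~
\p{S}|\p{P}  !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~             !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~             !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
[A-Z]        AB...Z                                       AB...Z                                       AB...Z
[a-z]        ab...z                                       ab...z                                       ab...z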
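
The "PCRE-UTF-8 normalized" rows above could be reproduced with something along these lines. This is only a sketch: the helper name matched() and the way the test string is assembled are assumptions, not the code originally used to build the table.

```r
# Polish letters, upper case then lower case, as \u escapes
polish_upper <- "\u0104\u0106\u0118\u0141\u0143\u00d3\u015a\u0179\u017b"
polish_lower <- "\u0105\u0107\u0119\u0142\u0144\u00f3\u015b\u017a\u017c"

# all ASCII characters with codes 1..127
ascii <- rawToChar(as.raw(1:127))

# the string to be searched, normalized to UTF-8
test_string <- enc2utf8(paste0(ascii, polish_upper, polish_lower))

# hypothetical helper: paste together every match of `pattern`
matched <- function(pattern, perl = TRUE) {
   m <- gregexpr(enc2utf8(pattern), test_string, perl = perl)
   paste(regmatches(test_string, m)[[1]], collapse = "")
}

print(matched("\\p{Lu}"))   # upper-case letters
print(matched("\\p{N}"))    # digits
```

Calling matched("[[:upper:]]") or matched("[[:upper:]]", perl = FALSE) instead shows what the POSIX classes give on the current platform; as the table suggests, that output may differ between systems.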

We see that only PCRE applied after a “normalization” with enc2utf8(), together with Unicode property classes such as \p{L}, gives correct results in all three locales.

An example:

gregexpr(enc2utf8(pattern), enc2utf8(text), perl=TRUE)
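
gregexpr() returns match positions only. One possible way to get the matched substrings themselves is base R's regmatches(); the sample inputs below are made up for illustration:

```r
# illustrative inputs; the second string contains no letters
text    <- c("kolor='\u017c\u00f3\u0142ty'", "123")
pattern <- "\\p{L}+"

text_utf8 <- enc2utf8(text)
m <- gregexpr(enc2utf8(pattern), text_utf8, perl = TRUE)

# a list with one character vector of matches per input string;
# strings without a match yield character(0)
words <- regmatches(text_utf8, m)
print(words)
```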

With the stringr package you may use e.g.:

str_extract_all(enc2utf8(text), perl(enc2utf8(pattern)))

Note that regexec() (and str_match_all() from stringr) currently do not support PRE. However, you may use gregexpr() instead:

str_match_all.perl <- function(s, p)
{
   require("stringr")
   m <- gregexpr(enc2utf8(p), enc2utf8(s), perl=TRUE) # PCRE-NORMALIZED
   # note that normalization is needed only for regex-matching


   out <- vector("list", length(s))

   # vectorized over s
   for (j in seq_along(s))
   {
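      nmatch <- length(m[[j]])
      ncapt  <- length(attr(m[[j]], "capture.names"))

      # no match in s[j]: gregexpr() returns a single -1 for that element
      # (test the j-th element, not the length of the whole list)
      if (length(m[[j]]) == 1 && m[[j]][1] == -1) next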

      out[[j]] <- matrix(str_sub(s[[j]], m[[j]],
         m[[j]]+attr(m[[j]], "match.length")-1),
         nrow=nmatch, ncol=ncapt+1)

      if (ncapt > 0) {
         cs <- as.integer(attr(m[[j]], "capture.start"))
         cl <- as.integer(attr(m[[j]], "capture.length"))
         out[[j]][,-1] <- str_sub(s[j], cs, cs+cl-1)

         if (any(str_length(attr(m[[j]], "capture.names")) > 0))
             colnames(out[[j]]) <- c("", attr(m[[j]], "capture.names"))
      }
   }

   out
}


# test:

print(str_match_all.perl(c(
   "rocznik99='nie', skrzynia='auto', osób='5', foo=bar",
   "kolor='ŻÓŁTY', silnik='3.0l TURBO++', skrzynia='manual'"
), "(?<attr>\\p{L}+)='(?<val>[^']*)'")) # two named groups

print(str_match_all.perl(c(
   "kolor='czerwony', świece='nówka', skrzynia='manual'",
   "rocznik99='skądże', skrzynia='auto', osób='5', foo=bar"
), "\\p{L}+='[^']*'")) # no groups

print(str_match_all.perl("123", "\\p{L}+")) # no matches at all

UPDATE@2013-07-10: our stringi package works the same way on every platform, regardless of the locale and native encoding. See the stri_locate_all_regex and stri_match_all_regex functions.
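
A short sketch of those two functions, assuming the stringi package is installed (the sample string and the names s, res and pos are illustrative only):

```r
# run only if stringi is available
if (requireNamespace("stringi", quietly = TRUE)) {
   s <- "kolor='\u017c\u00f3\u0142ty', skrzynia='manual'"

   # one matrix per input string: whole match, then one column per group
   res <- stringi::stri_match_all_regex(s, "(\\p{L}+)='([^']*)'")
   print(res[[1]])

   # start and end positions (in characters) of every run of letters
   pos <- stringi::stri_locate_all_regex(s, "\\p{L}+")
   print(pos[[1]])
}
```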

Marek Gągolewski
