Properly “internationalized” regular expressions in R

April 5, 2013

(This article was first published on Rexamine » Blog/R-bloggers, and kindly contributed to R-bloggers)

We should pay special attention to writing truly portable code that works the same way under different locales and character encodings. Currently, R has two regex engines: ERE (extended regular expressions, via TRE) and PRE (Perl-like regular expressions, via PCRE). Surprisingly, they may give different results depending on the operating system and the native character encoding in use!

Update: check out our stringi package to get rid of such problems forever!

PCRE often outperforms ERE and has a more powerful syntax. Moreover, it was built into R with Unicode support. Since UTF-8 can represent almost all printable characters used around the world, it is a good idea to always use PRE on normalized character vectors, i.e. vectors converted from the native encoding to UTF-8 via enc2utf8() and then, after the regex operation, converted back with enc2native().
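
The round trip described above could look roughly as follows. This is only a sketch under stated assumptions: the helper name gsub_utf8 is hypothetical (it is not part of base R or of the original post), and the Polish letters are written as \u escapes so that the example does not depend on the encoding of the source file.

```r
# Hypothetical helper, for illustration only:
# convert everything to UTF-8, substitute with PCRE, convert the result back.
gsub_utf8 <- function(pattern, replacement, x) {
   res <- gsub(enc2utf8(pattern), enc2utf8(replacement), enc2utf8(x), perl=TRUE)
   enc2native(res) # back to the native encoding of the session
}

# "zażółć gęślą jaźń 123", written with \u escapes
x <- "za\u017c\u00f3\u0142\u0107 g\u0119\u015bl\u0105 ja\u017a\u0144 123"

# replace every run of Unicode letters with a single "X"
y <- gsub_utf8("\\p{L}+", "X", x)
print(y)
```

Only the regex step works on UTF-8 here; the caller passes in and gets back strings in the native encoding.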

Here’s an example on matching some character classes in three different locales. The string where matches were sought consisted of all ASCII characters (codes 1–127) and Polish letters (ą, ę, ł, ś, ż, and so on).

Pattern | pl_PL.UTF-8

[[:alpha:]] | AB...Zab...z ĄĆĘŁŃÓŚŹŻąćęłńóśźż | AB...Zab...zĆĘŃÓćęńó
[[:digit:]] | 0123456789 | 0123456789ął
[[:lower:]] | ab...ząćęłńóśźż | ab...zćęńó
[[:upper:]] | AB...ZĄĆĘŁŃÓŚŹŻ | AB...ZĆĘŃÓ
[[:punct:]] | !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ | !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ ĄŁŚŹŻąłśźż | !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ ĄŁŻąłż
[A-Z] | AB...Z
[a-z] | ab...z

[[:alpha:]] | AB...Zab...z | AB...Zab...z ĄĆĘŁŃÓŚŹŻąćęłńóśźż
[[:digit:]] | 0123456789
[[:lower:]] | ab...z | ab...ząćęłńóśźż
[[:upper:]] | AB...Z | AB...ZĄĆĘŁŃÓŚŹŻ
[[:punct:]] | !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
[A-Z] | AB...Z
[a-z] | ab...z

ERE-UTF-8 normalized
[[:alpha:]] | AB...Zab...z ĄĆĘŁŃÓŚŹŻąćęłńóśźż | AB...Zab...zÓó
[[:digit:]] | 0123456789
[[:lower:]] | ab...ząćęłńóśźż | ab...zó
[[:upper:]] | AB...ZĄĆĘŁŃÓŚŹŻ | AB...ZÓ
[[:punct:]] | !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
[A-Z] | AB...Z
[a-z] | ab...z

PCRE-UTF-8 normalized
\p{L} | AB...Zab...zĄĆĘŁŃÓŚŹŻąćęłńóśźż
\p{N} | 0123456789
\p{Ll} | ab...ząćęłńóśźż
\p{P} | !"#%&'()*,-./:;?@[\]_{}
\p{S} | $+<=>^`|~
\p{S}|\p{P} | !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
[A-Z] | AB...Z
[a-z] | ab...z

We see that PCRE after a “normalization” with enc2utf8() gives correct results in all the locales.
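
A few of the Unicode property rows from the table can be checked along these lines. This is a minimal sketch using base R only; the helper matched() and the variable names are illustrative assumptions, and the test string is built as described earlier (ASCII codes 1–127 followed by the Polish letters, written as \u escapes).

```r
# Polish letters: ĄĆĘŁŃÓŚŹŻ followed by ąćęłńóśźż
polish <- paste0(
   "\u0104\u0106\u0118\u0141\u0143\u00d3\u015a\u0179\u017b",
   "\u0105\u0107\u0119\u0142\u0144\u00f3\u015b\u017a\u017c")

# all ASCII characters with codes 1-127, then the Polish letters
text <- paste0(rawToChar(as.raw(1:127)), polish)

# Illustrative helper: collapse all PCRE matches in x into one string.
matched <- function(pattern, x) {
   x <- enc2utf8(x)
   m <- gregexpr(enc2utf8(pattern), x, perl=TRUE)
   paste(regmatches(x, m)[[1]], collapse="")
}

letters_found <- matched("\\p{L}", text)  # any letter
digits_found  <- matched("\\p{N}", text)  # any number
lower_found   <- matched("\\p{Ll}", text) # lowercase letters

print(letters_found)
print(digits_found)
print(lower_found)
```

Because both the pattern and the text are converted to UTF-8 first, the output should not depend on the locale of the session.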

An example:

gregexpr(enc2utf8(pattern), enc2utf8(text), perl=TRUE)

With the stringr package you may use e.g.:

str_extract_all(enc2utf8(text), perl(enc2utf8(pattern)))

Note that regexec() (and str_match_all() from stringr) currently do not support PRE. However, you may use gregexpr() instead, as in the following function.

library(stringr) # provides str_sub() and str_length()

str_match_all.perl <- function(s, p) {
   m <- gregexpr(enc2utf8(p), enc2utf8(s), perl=TRUE) # PCRE-NORMALIZED
   # note that normalization is needed only for regex-matching

   out <- vector("list", length(s))

   # vectorized over s
   for (j in seq_along(s)) {
      nmatch <- length(m[[j]])
      ncapt  <- length(attr(m[[j]], "capture.names"))

      # no match in s[j]: leave out[[j]] as NULL
      if (nmatch == 1 && m[[j]][1] == -1) next

      # first column: the whole matches
      out[[j]] <- matrix(str_sub(s[[j]], m[[j]],
         m[[j]]+attr(m[[j]], "match.length")-1),
         nrow=nmatch, ncol=ncapt+1)

      if (ncapt > 0) {
         # remaining columns: one per capture group
         cs <- as.integer(attr(m[[j]], "capture.start"))
         cl <- as.integer(attr(m[[j]], "capture.length"))
         out[[j]][,-1] <- str_sub(s[j], cs, cs+cl-1)

         if (any(str_length(attr(m[[j]], "capture.names")) > 0))
             colnames(out[[j]]) <- c("", attr(m[[j]], "capture.names"))
      }
   }

   out
}

# test:

print(str_match_all.perl(c(
   "rocznik99='nie', skrzynia='auto', osób='5', foo=bar",
   "kolor='ŻÓŁTY', silnik='3.0l TURBO++', skrzynia='manual'"
), "(?<attr>\\p{L}+)='(?<val>[^']*)'")) # two named groups

print(str_match_all.perl(c(
   "kolor='czerwony', świece='nówka', skrzynia='manual'",
   "rocznik99='skądże', skrzynia='auto', osób='5', foo=bar"
), "\\p{L}+='[^']*'")) # no groups

print(str_match_all.perl("123", "\\p{L}+")) # no matches at all

Update: our stringi package works the same in each platform’s locale and encoding. See the stri_locate_all_regex and stri_match_all_regex functions.
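
A minimal sketch of stri_match_all_regex on input similar to the test above, assuming that the stringi package is installed (the code is skipped otherwise). The input strings and the variable names are illustrative, and numbered rather than named capture groups are used to keep the example simple.

```r
res <- NULL
if (requireNamespace("stringi", quietly=TRUE)) {
   x <- c("kolor='\u017b\u00d3\u0141TY', skrzynia='manual'", "123")
   # one character matrix per input string:
   # column 1 holds the whole match, columns 2 and 3 the capture groups
   res <- stringi::stri_match_all_regex(x, "(\\p{L}+)='([^']*)'")
   print(res)
}
```

Unlike the gregexpr()-based function, no enc2utf8() calls are needed here.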

Marek Gągolewski
