
(String/text processing)++: stringi 0.2-3 released

[This article was first published on Rexamine » Blog/R-bloggers, and kindly contributed to R-bloggers.]

A new release of the stringi package is available on CRAN (please wait a few days for Windows and OS X binary builds).

stringi is a package providing (but definitely not limited to) replacements for nearly all the character string processing functions known from base R. While developing the package we had high performance and portability of its facilities in mind.

stringi's user interface is inspired by and consistent with that of Hadley Wickham's great stringr package. Quoting its README, stringr (and “hence” stringi):

  • Processes factors and characters in the same way,
  • Gives functions consistent names and arguments,
  • Simplifies string operations by eliminating options that you don't need 95% of the time,
  • Produces outputs that can easily be used as inputs. This includes ensuring that missing inputs result in missing outputs, and zero length inputs result in zero length outputs,
  • Completes R's string handling functions with useful functions from other programming languages.
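
The conventions listed above can be checked directly in stringi. The following is a minimal sketch, assuming a current stringi release is installed; the input values are made up for illustration.

```r
library(stringi)

# Missing inputs give missing outputs.
stri_trans_toupper(c("abc", NA))          # "ABC" NA

# Zero-length inputs give zero-length outputs.
stri_trans_toupper(character(0))          # character(0)

# Factors are processed like character vectors.
stri_trans_toupper(factor(c("x", "y")))   # "X" "Y"
```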

Some problems with base R functions

While base R as well as stringr functions are great for simple text processing tasks, dealing with more complex ones (such as natural language processing) may be a bit problematic.

First of all, some time ago we mentioned in our blog post that a regex search may give different results on different platforms. For example, Polish letters such as ą, ę, ś etc. are correctly captured with [[:alpha:]] by the (default) ERE engine on Linux (native encoding=UTF-8), while on Windows the results are quite surprising. (A year ago my students initially got very bad marks for a Polish text processing task just because they had written their R scripts on Windows while I ran them on Linux.) 🙂

Secondly, natural language processing relies on a set of very complex, locale-specific rules. However, the rules available (via e.g. glibc) in base R string functions may sometimes give incorrect results. For example, when we convert the German ß (es-zett/double small s) character to upper case, we would expect SS in the result rather than:

toupper("groß") # GROSS? No...

## [1] "GROß"

Moreover, let's assume that we are asked to sort a character vector according to the rules specific to the Slovak language. Here, quite interestingly, the word hladný (hungry) can be found in a dictionary before the word chladný (cold). Of course, as not everyone works in a Slovak locale, we don't expect to obtain a proper order immediately:

sort(c("hladný", "chladný"))

## [1] "chladný" "hladný"

In order to obtain a proper order, we should temporarily switch to a Slovak “environment”:

oldlocale <- Sys.getlocale("LC_COLLATE")
Sys.setlocale("LC_COLLATE", "sk_SK")
sort(c("hladný", "chladný"))

## [1] "hladný"  "chladný"

Sys.setlocale("LC_COLLATE", oldlocale)

This code works on my Linux machine, but it is not portable, because:

  1. Other Linux users may not have the Slovak locale installed (and not everyone is able to install it on their own).
  2. Windows does not use POSIX-style locale names such as sk_SK. There, the locale name Slovak_Slovakia.1250 is appropriate.

And so on.
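
To make the portability problem concrete, here is a minimal base R sketch of what a "portable" version of the code above would have to do: try several platform-specific locale names and fall back to the current locale when none is available. The helper name sort_with_collate and the list of candidate names are hypothetical, chosen only for illustration; which names are accepted depends entirely on the operating system.

```r
# Hypothetical helper (not part of base R or stringi): try several
# platform-specific locale names and sort under the first one accepted.
sort_with_collate <- function(x, candidates) {
  old <- Sys.getlocale("LC_COLLATE")
  # Restore the original collation rules however the function exits.
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  for (loc in candidates) {
    # Sys.setlocale() returns "" (with a warning) if the request fails.
    res <- suppressWarnings(Sys.setlocale("LC_COLLATE", loc))
    if (isTRUE(nzchar(res))) return(sort(x))
  }
  warning("no candidate locale available; sorting in the current locale")
  sort(x)
}

words <- c("hladný", "chladný")
sorted <- sort_with_collate(
  words,
  c("sk_SK.UTF-8", "sk_SK", "Slovak_Slovakia.1250")  # assumed candidate names
)
sorted
```

Even with such a wrapper, the result still depends on what the operating system provides, which is the limitation described above.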

stringi facilities

In order to overcome such problems we decided to reimplement each string processing function from scratch (of course, purely in C++). Internationalization and globalization support, as well as many string processing facilities (like regex searching), are provided by IBM's well-known and established ICU4C library (refer to ICU's website for more details).
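
Because the locale data comes from ICU rather than from the operating system, it can be inspected from within R. The following is a small sketch, assuming a current stringi release; the exact list of locales and the reported versions depend on the ICU build that stringi was compiled against.

```r
library(stringi)

# Locale identifiers known to the ICU library that stringi uses.
locales <- stri_locale_list()
head(locales)

# Is a Slovak locale among them? (The answer depends on the ICU build.)
"sk_SK" %in% locales

# Short summary of the stringi build, including ICU and Unicode versions.
stri_info(short = TRUE)
```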

Here is a very general list of the most important features available in the current version of stringi:

and many more.

Showcase

Here's a bunch of examples.

library(stringi)
stri_length(c("aaa", NA, ""))

## [1]  3 NA  0

stri_replace_all_fixed(c("aba", "bab"), c("a", "b"), c("c", "d"))  # 1-1-1 and 2-2-2

## [1] "cbc" "dad"

stri_replace_all_fixed(c("aba", "bab"), "a", "c")  # 1-1-1 and 2-1-1

## [1] "cbc" "bcb"

stri_replace_all_fixed("aba", c("a", "b"), "c")  # 1-1-1 and 1-2-1

## [1] "cbc" "aca"

stri_replace_all_fixed("aba", "a", c("c", "d"))  # 1-1-1 and 1-1-2

## [1] "cbc" "dbd"

(All the functions are vectorized with respect to most of their arguments.)

stri_sort(c("hladný", "chladný"), opts = stri_opts_collator(locale = "sk_SK"))

## [1] "hladný"  "chladný"

stri_trans_toupper("Groß")

## [1] "GROSS"
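
Returning to the regex portability issue with Polish letters mentioned earlier, the sketch below shows how the same task might look with stringi, where matching is handled by ICU rather than by the platform's regex library. The sample sentence is our own choice and is not taken from the original task; it is written with Unicode escapes so that the script does not depend on the encoding of the source file.

```r
library(stringi)

# The string reads "zażółć gęślą jaźń" (a Polish pangram fragment).
txt <- "za\u017c\u00f3\u0142\u0107 g\u0119\u015bl\u0105 ja\u017a\u0144"

# \p{L} is the Unicode property "letter", so Polish diacritics are
# treated as letters regardless of the native encoding.
words <- stri_extract_all_regex(txt, "\\p{L}+")[[1]]
length(words)       # 3
stri_length(words)  # 6 5 4
```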

In our upcoming blog posts we will present some exciting features of stringi. They definitely deserve to be discussed separately! Stay tuned.

Performance

And some benchmarks.

set.seed(123L)
library(microbenchmark)
x <- stri_rand_strings(1e+05, 10)  # 100000 random ASCII 'words' of length 10 each
head(x, 5)

## [1] "HmPsw2WtYS" "xSgZ6tF2Kx" "tgdzehXaH9" "xtgn1TlDJE" "8PPM98ESGr"

microbenchmark(sort(x), stri_sort(x))

## Unit: milliseconds
##          expr    min     lq median   uq    max neval
##       sort(x) 1050.4 1062.8 1076.1 1110 1176.6   100
##  stri_sort(x)  234.2  239.7  243.5  250  303.7   100

microbenchmark(paste(x, collapse = ", "), stri_paste(x, collapse = ", "))

## Unit: milliseconds
##                            expr   min    lq median    uq    max neval
##       paste(x, collapse = ", ") 45.21 45.70  46.64 53.15 244.28   100
##  stri_paste(x, collapse = ", ") 10.14 10.44  10.70 16.36  18.71   100

set.seed(123L)
y <- stri_rand_strings(10000, 10, "[ACGT]")  # 10000 random 'genomes' of length 10
head(y, 5)

## [1] "CTCTTAGTGC" "TCGGATAACT" "TGGTGGGGCA" "TTGTACTACA" "ACCCAAACCT"

microbenchmark(
    grepl("ACCA", y),
    grepl("ACCA", y, fixed = TRUE),
    grepl("ACCA", y, perl = TRUE),
    stri_detect_fixed(y, "ACCA"),
    stri_detect_regex(y, "ACCA")
)

## Unit: microseconds
##                            expr    min     lq median     uq     max neval
##                grepl("ACCA", y) 4928.0 4968.9 4987.0 5008.9 12723.2   100
##  grepl("ACCA", y, fixed = TRUE)  899.0  906.9  912.0  919.2  2441.2   100
##   grepl("ACCA", y, perl = TRUE) 2145.7 2155.5 2162.8 2174.6  9707.1   100
##    stri_detect_fixed(y, "ACCA")  514.9  523.0  532.2  558.6   893.4   100
##    stri_detect_regex(y, "ACCA") 3720.2 3750.8 3805.6 3891.6  7411.8   100

microbenchmark(substr(y, 2, 4), stri_sub(y, 2, 4))

## Unit: microseconds
##               expr   min    lq median     uq  max neval
##    substr(y, 2, 4) 908.8 915.4  920.3  945.4 3640   100
##  stri_sub(y, 2, 4) 924.4 945.4  955.4 1007.5 2476   100

As a rule of thumb, stringi functions are often faster than their base R counterparts for long ASCII and UTF-8 strings, and often slower for short strings in 8-bit encodings.
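
The rule of thumb above is easy to check on one's own machine. The sketch below is only an illustration under assumed vector sizes and string lengths (picked arbitrarily); it uses base R's system.time() instead of microbenchmark to avoid an extra dependency, and the timings will differ between machines, R versions and stringi builds.

```r
library(stringi)

set.seed(123L)
short_x <- stri_rand_strings(10000, 5)     # many short ASCII strings
long_x  <- stri_rand_strings(100, 10000)   # a few long ASCII strings

# Both implementations should agree on the result for ASCII input...
identical(substr(long_x, 2, 4), stri_sub(long_x, 2, 4))

# ...so only the timings differ.
system.time(for (i in 1:20) substr(short_x, 2, 4))
system.time(for (i in 1:20) stri_sub(short_x, 2, 4))
system.time(for (i in 1:20) substr(long_x, 2, 4))
system.time(for (i in 1:20) stri_sub(long_x, 2, 4))
```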

More information

For more information check out the stringi package website and its on-line documentation.

For bug reports and feature requests visit our GitHub profile.

In future versions of stringi we plan to include:

Any comments and suggestions are warmly welcome.

Have fun!

Marek Gagolewski

Change-log

Notable changes since the previous CRAN release (v0.1-25):

Refer to NEWS for a complete list of changes, new features and bug fixes.
