stringi is a package providing (but definitely not limited to) replacements for nearly all the character string processing functions known from base R. While developing the package we had high performance and portability of its facilities in mind.
- Processes factors and characters in the same way,
- Gives functions consistent names and arguments,
- Simplifies string operations by eliminating options that you don't need 95% of the time,
- Produces outputs that can easily be used as inputs. This includes ensuring that missing inputs result in missing outputs, and zero length inputs result in zero length outputs,
- Completes R's string handling functions with useful functions from other programming languages.
Some problems with base R functions
While base R as well as stringr functions are great for simple text processing tasks, dealing with more complex ones (such as natural language processing) may be a bit problematic.
First of all, some time ago we mentioned in our blog post that regex search may provide different outputs on different platforms. For example, Polish letters such as ą, ę, ś etc. are correctly captured with [[:alpha:]] by the (default) ERE engine on Linux (native encoding=UTF-8), while on Windows the results are quite surprising. (A year ago my students initially got very bad marks on a Polish text processing task just because they had written their R scripts on Windows while I ran them on Linux.) 🙂
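As a minimal sketch of the platform-independent alternative, the snippet below matches the same Polish letters with stringi's ICU regex engine. It assumes a recent stringi release; the Unicode property `\p{L}` (any letter) is used in place of `[[:alpha:]]`, and the letters are written as `\u` escapes so that the result does not depend on the encoding of the script file.

```r
library(stringi)

# Polish letters as \u escapes: a-ogonek, e-ogonek, s-acute
polish <- c("\u0105", "\u0119", "\u015b")

# \p{L} is the Unicode "Letter" property in ICU regular expressions
is_letter <- stri_detect_regex(polish, "^\\p{L}$")
print(is_letter)

# Extract runs of letters from a string that also contains digits
words <- stri_extract_all_regex("g\u0119\u015bl\u0105 ja\u017a\u0144 123", "\\p{L}+")
print(words)
```

Because ICU carries its own character database, the outcome should not change between Linux and Windows.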
Secondly, natural language processing relies on a set of very complex, locale-specific rules. However, the rules available (via e.g. glibc) in base R string functions may sometimes give incorrect results. For example, when we convert the German ß (Eszett, or sharp s) character to upper case, we expect SS in the result rather than:
toupper("groß") # GROSS? No...
## [1] "GROß"
Moreover, let's assume that we are asked to sort a character vector according to the rules specific to the Slovak language. Here, quite interestingly, the word hladný (hungry) can be found in a dictionary before the word chladný (cold). Of course, as not everyone works in a Slovak locale, we don't expect to obtain the proper order immediately:
sort(c("hladný", "chladný"))
## [1] "chladný" "hladný"
In order to obtain a proper order, we should temporarily switch to a Slovak “environment”:
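oldlocale <- Sys.getlocale("LC_COLLATE") # remember the current collation locale
Sys.setlocale("LC_COLLATE", "sk_SK")
sort(c("hladný", "chladný"))
## [1] "hladný"  "chladný"
Sys.setlocale("LC_COLLATE", oldlocale) # switch back to the previous locale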
This code works on my Linux machine, but it is not portable, because:
- Other Linux users may not have the Slovak rule base installed (and not everyone is able to install it on their own).
- Windows users don't use BCP 47-based locale names. There, LCID
And so on.
In order to overcome such problems we decided to reimplement each string processing function from scratch (of course, purely in C++). The internationalization and globalization support, as well as many string processing facilities (like regex searching), are provided by IBM's well-known and established ICU4C library (refer to ICU's website for more details).
Here is a very general list of the most important features available in the current version of stringi:
- string searching:
  - with ICU (Java-like) regular expressions,
  - ICU USearch-based locale-aware string searching (quite slow, but works properly e.g. for non-Unicode normalized strings),
  - very fast, locale-independent byte-wise pattern matching;
- joining and duplicating strings;
- extracting and replacing substrings;
- string trimming, padding, and text wrapping (e.g. with Knuth's dynamic word wrap algorithm);
- text transliteration;
- text collation (comparing, sorting);
- text boundary analysis (e.g. for extracting individual words);
- random string generation;
- Unicode normalization;
- character encoding conversion and detection;
and many more.
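To make a few of the listed features concrete, here is a small sketch covering Unicode normalization, padding, and trimming. It assumes a recent stringi release in which `stri_trans_nfc`, `stri_pad_left`, and `stri_trim_both` are exported under these names; the input strings are made up for illustration.

```r
library(stringi)

# Unicode normalization: "a" followed by a combining ogonek (2 code points)
decomposed <- "a\u0328"
composed <- stri_trans_nfc(decomposed)   # single precomposed letter U+0105
len_before <- stri_length(decomposed)    # 2
len_after <- stri_length(composed)       # 1

# Padding on the left with a chosen code point
padded <- stri_pad_left("7", 3, pad = "0")   # "007"

# Trimming white space (spaces and a tab) from both ends
trimmed <- stri_trim_both("  text \t")       # "text"

print(c(len_before, len_after))
print(c(padded, trimmed))
```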
Here's a bunch of examples.
library(stringi)
stri_length(c("aaa", NA, ""))
## [1]  3 NA  0
- “Deep” vectorization:
stri_replace_all_fixed(c("aba", "bab"), c("a", "b"), c("c", "d")) # 1-1-1 and 2-2-2
## [1] "cbc" "dad"
stri_replace_all_fixed(c("aba", "bab"), "a", "c") # 1-1-1 and 2-1-1
## [1] "cbc" "bcb"
stri_replace_all_fixed("aba", c("a", "b"), "c") # 1-1-1 and 1-2-1
## [1] "cbc" "aca"
stri_replace_all_fixed("aba", "a", c("c", "d")) # 1-1-1 and 1-1-2
## [1] "cbc" "dbd"
(all the functions are vectorized w.r.t. most of their arguments)
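The same recycling behavior can be sketched with other functions. The calls below are illustrative only and assume a recent stringi release; each argument is recycled to the length of the longest one.

```r
library(stringi)

# One string, several (from, to) pairs: three sliding windows
windows <- stri_sub("abcdef", from = 1:3, to = 3:5)

# Element-wise repetition counts
dups <- stri_dup(c("a", "b"), 1:2)

# Element-wise joining of two vectors
joined <- stri_paste(c("x", "y"), 1:2, sep = "-")

# Zero-length input gives zero-length output
empty <- stri_length(character(0))

print(windows)   # "abc" "bcd" "cde"
print(dups)      # "a" "bb"
print(joined)    # "x-1" "y-2"
```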
- Easy-to-use, portable locale selection:
stri_sort(c("hladný", "chladný"), opts = stri_opts_collator(locale = "sk_SK"))
## [1] "hladný"  "chladný"
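Locale selection is not limited to sorting. As a hedged sketch, the snippet below compares and orders the same two words under Slovak and English rules. It assumes a recent stringi release, where `locale` can be passed directly to `stri_cmp` and `stri_order` (the argument spelling differs from the `opts =` form used in the call above), and an ICU build that ships Slovak collation data.

```r
library(stringi)

# hladny and chladny with y-acute, written as \u escapes
w <- c("hladn\u00fd", "chladn\u00fd")

# -1 means the first string sorts before the second
cmp_sk <- stri_cmp(w[1], w[2], locale = "sk_SK")   # "h" precedes "ch" in Slovak
cmp_en <- stri_cmp(w[1], w[2], locale = "en_US")   # "c" precedes "h" in English

# Ordering permutation under Slovak rules
ord_sk <- stri_order(w, locale = "sk_SK")

print(c(cmp_sk, cmp_en))
print(ord_sk)
```

No call to `Sys.setlocale()` is involved, so the session's own locale settings stay untouched.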
- Proper transliteration rules:
## [1] "GROSS"
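The call that produced this output is not shown here. One way to obtain it, assuming a recent stringi release, is `stri_trans_toupper` with a German locale; the sketch below puts it next to base R's `toupper` for comparison.

```r
library(stringi)

word <- "gro\u00df"   # "gross" spelled with the sharp s

# ICU case mapping expands the sharp s to "SS"
upper_icu <- stri_trans_toupper(word, locale = "de_DE")
print(upper_icu)

# Base R result, which depends on the platform's C library
upper_base <- toupper(word)
print(upper_base)
```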
In our upcoming blog posts we will present some exciting features of stringi. They definitely deserve to be discussed separately! Stay tuned.
And some benchmarks.
- String sorting:
set.seed(123L)
library(microbenchmark)
x <- stri_rand_strings(1e+05, 10) # 100000 random ASCII 'words' of length 10 each
head(x, 5)
## [1] "HmPsw2WtYS" "xSgZ6tF2Kx" "tgdzehXaH9" "xtgn1TlDJE" "8PPM98ESGr"
microbenchmark(sort(x), stri_sort(x))
## Unit: milliseconds
##          expr    min     lq median   uq    max neval
##       sort(x) 1050.4 1062.8 1076.1 1110 1176.6   100
##  stri_sort(x)  234.2  239.7  243.5  250  303.7   100
- String joining:
microbenchmark(paste(x, collapse = ", "), stri_paste(x, collapse = ", "))
## Unit: milliseconds
##                            expr   min    lq median    uq    max neval
##       paste(x, collapse = ", ") 45.21 45.70  46.64 53.15 244.28   100
##  stri_paste(x, collapse = ", ") 10.14 10.44  10.70 16.36  18.71   100
- Searching for a fixed pattern:
set.seed(123L)
y <- stri_rand_strings(10000, 10, "[ACGT]") # 10000 random 'genomes' of length 10
head(y, 5)
## [1] "CTCTTAGTGC" "TCGGATAACT" "TGGTGGGGCA" "TTGTACTACA" "ACCCAAACCT"
microbenchmark(grepl("ACCA", y),
               grepl("ACCA", y, fixed = TRUE),
               grepl("ACCA", y, perl = TRUE),
               stri_detect_fixed(y, "ACCA"),
               stri_detect_regex(y, "ACCA"))
## Unit: microseconds
##                            expr    min     lq median     uq     max neval
##                grepl("ACCA", y) 4928.0 4968.9 4987.0 5008.9 12723.2   100
##  grepl("ACCA", y, fixed = TRUE)  899.0  906.9  912.0  919.2  2441.2   100
##   grepl("ACCA", y, perl = TRUE) 2145.7 2155.5 2162.8 2174.6  9707.1   100
##    stri_detect_fixed(y, "ACCA")  514.9  523.0  532.2  558.6   893.4   100
##    stri_detect_regex(y, "ACCA") 3720.2 3750.8 3805.6 3891.6  7411.8   100
- Determining a substring:
microbenchmark(substr(y, 2, 4), stri_sub(y, 2, 4))
## Unit: microseconds
##               expr   min    lq median     uq  max neval
##    substr(y, 2, 4) 908.8 915.4  920.3  945.4 3640   100
##  stri_sub(y, 2, 4) 924.4 945.4  955.4 1007.5 2476   100
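Timings like these are only meaningful if the compared calls return the same thing. The following sketch, which is not part of the original benchmarks, checks that on a smaller made-up sample; it assumes a recent stringi release and plain ASCII input, as in the benchmarks.

```r
library(stringi)

set.seed(123L)
# 1000 random 'genomes' of length 10 (a smaller sample than in the benchmarks)
y <- stri_rand_strings(1000, 10, "[ACGT]")

# Fixed-pattern detection: base R vs stringi
same_detect <- identical(grepl("ACCA", y, fixed = TRUE),
                         stri_detect_fixed(y, "ACCA"))

# Substring extraction: base R vs stringi
same_sub <- identical(substr(y, 2, 4), stri_sub(y, 2, 4))

# Joining with a separator: base R vs stringi
same_paste <- identical(paste(y, collapse = ", "),
                        stri_paste(y, collapse = ", "))

print(c(same_detect, same_sub, same_paste))
```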
As a rule of thumb: stringi functions should often be faster than the R ones for long ASCII and UTF-8 strings. They often have poorer performance for short 8-bit encoded ones.
For bug reports and feature requests visit our GitHub profile.
In future versions of stringi we plan to include:
- rule-based number formatting (number spell-out, e.g. 123 -> one hundred twenty three);
- date and time formatting/parsing;
- access to the Unicode Character database;
- functions to read/write text files (with automatic encoding detection);
- and many more.
Any comments and suggestions are warmly welcome.
Notable changes since the previous CRAN release (v0.1-25):
stri_cmp* now do not allow for passing opts_collator=NA. From now on, stri_cmp_neq, and the new operators %stri!==% are locale-independent operations, which are based on code point comparisons. New functions stri_cmp_nequiv (and from now on also %stri!=%) test for canonical equivalence.
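The difference between code point comparison and canonical equivalence can be sketched with one letter stored in two forms. The snippet assumes a recent stringi release; `stri_cmp_eq` and `stri_cmp_equiv` are the positive counterparts of the two functions named above.

```r
library(stringi)

precomposed <- "\u0105"    # a-ogonek as a single code point
decomposed <- "a\u0328"    # "a" followed by a combining ogonek

# Code point comparison: the two sequences differ
eq <- stri_cmp_eq(precomposed, decomposed)
neq <- stri_cmp_neq(precomposed, decomposed)

# Canonical equivalence: both represent the same letter
equiv <- stri_cmp_equiv(precomposed, decomposed)
nequiv <- stri_cmp_nequiv(precomposed, decomposed)

print(c(eq = eq, neq = neq, equiv = equiv, nequiv = nequiv))
```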
stri_*_fixed search functions now perform a locale-independent exact (bytewise, of course after conversion to UTF-8) pattern search. All the Collator-based, locale-dependent search routines are now available via stri_*_coll. The reason for this is that ICU USearch currently has very poor performance, and in many search tasks exact pattern matching is in fact sufficient.
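A rough sketch of where the two search families differ, assuming a recent stringi release: the haystack below stores a letter in decomposed form, while the needle uses the precomposed form. The bytewise search cannot match, because the byte sequences differ. The Collator-based search is the variant intended for such non-normalized text, so its result is printed rather than asserted here.

```r
library(stringi)

haystack <- "ma\u0328ka"   # Polish "maka" with the a-ogonek decomposed
needle <- "\u0105"          # precomposed a-ogonek

# Exact bytewise search: different bytes, so no match
found_fixed <- stri_detect_fixed(haystack, needle)
print(found_fixed)

# Collator-based, locale-aware search
found_coll <- stri_detect_coll(haystack, needle)
print(found_coll)

# Normalizing first makes the bytewise search applicable as well
found_after_nfc <- stri_detect_fixed(stri_trans_nfc(haystack), needle)
print(found_after_nfc)
```

Normalizing once and then using the fast bytewise search is one way to avoid the USearch overhead when locale-specific matching is not required.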
stri_enc_isnf* function families have been renamed to stri_trans_isnf*, respectively. This is because they deal with text transforming, and not with character encoding. Moreover, all such operations may be performed by ICU's Transliterator (see below).
stri_*_charclass search functions now rely solely on ICU's UnicodeSet patterns. All previously accepted charclass identifiers became invalid. However, new patterns should now be more familiar to the users (they are regex-like). Moreover, we observe a very nice performance gain.
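For illustration, a few UnicodeSet patterns in use. This is a sketch against a recent stringi release; the inputs are invented, and `merge = TRUE` in the last call is spelled out only to make clear that adjacent matches are joined.

```r
library(stringi)

# Count letters using a Unicode property
n_letters <- stri_count_charclass("abc 123", "\\p{L}")

# Detect digits using a regex-like set
has_digit <- stri_detect_charclass(c("abc", "123"), "[0-9]")

# Extract runs of letters, merging consecutive matching code points
letter_runs <- stri_extract_all_charclass("abc 123 def", "\\p{L}", merge = TRUE)

print(n_letters)
print(has_digit)
print(letter_runs)
```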
[IMPORTANT CHANGE] stri_sort now does not include NAs in output vectors by default, for compatibility with sort(). Moreover, currently none of the input vector's attributes are preserved.
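A short sketch of this behavior, assuming a recent stringi release in which the `na_last` argument controls where (and whether) missing values appear:

```r
library(stringi)

x <- c("b", NA, "a")

# Default: missing values are dropped, as in sort()
dropped <- stri_sort(x)

# Keep missing values and put them at the end
kept_last <- stri_sort(x, na_last = TRUE)

# Keep missing values and put them at the beginning
kept_first <- stri_sort(x, na_last = FALSE)

print(dropped)
print(kept_last)
print(kept_first)
```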
stri_trans_list gives access to ICU's Transliterator: it may be used to perform very general text transforms.
stri_split_boundaries utilizes ICU's BreakIterator to split strings at specific text boundaries. Moreover, stri_locate_boundaries indicates positions of these boundaries.
stri_extract_words uses ICU's BreakIterator to extract all words from a text. Additionally, stri_locate_words locates start and end positions of words in a text.
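As a hedged illustration of word extraction: in the stringi releases I am familiar with, the function is exported as `stri_extract_all_words` rather than under the name used above, so the sketch uses that spelling. The sentence is made up.

```r
library(stringi)

text <- "Hello, world! How are you?"

# Word boundaries come from ICU's BreakIterator; punctuation is skipped
words <- stri_extract_all_words(text)[[1]]
print(words)

# Number of words found
n_words <- length(words)
print(n_words)
```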
stri_pad_both pads a string with a specific code point.
stri_wrap breaks paragraphs of text into lines. Two algorithms (greedy and minimal-raggedness) are available.
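A minimal sketch of both wrapping modes, assuming a recent stringi release where the algorithm is selected through `cost_exponent` (a value of 0 picks the greedy variant, the default picks the minimal-raggedness one). The sample sentence and width are arbitrary.

```r
library(stringi)

text <- "the quick brown fox jumps over the lazy dog"

# Default: dynamic (minimal-raggedness) wrapping
lines_dynamic <- stri_wrap(text, width = 20)

# Greedy wrapping
lines_greedy <- stri_wrap(text, width = 20, cost_exponent = 0)

cat(lines_dynamic, sep = "\n")
cat(lines_greedy, sep = "\n")
```

Both calls return a character vector with one element per output line, so the result can be passed straight to `cat()` or `writeLines()`.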
stri_unique extracts unique elements from a character vector.
stri_duplicated_any determine duplicate elements in a character vector.
NAs in a character vector with a given string, useful for emulating e.g. R's
stri_rand_shuffle generates a random permutation of code points in a string.
stri_rand_strings generates random strings.
[NEW FUNCTIONS] New functions and binary operators for string comparison:
stri_enc_mark reads declared encodings of character strings as seen by
stri_enc_tonative(str) is an alias to stri_encode(str, NULL, NULL).
stri_sort now have an additional argument
merge arg (defaults to FALSE for backward-compatibility). It may be used to e.g. replace sequences of white spaces with a single space.
stri_enc_toutf8 now has a new validate arg (defaults to FALSE for backward-compatibility). It may be used in the (rare) case in which a user wants to fix an invalid UTF-8 byte sequence.
stri_length (among others) now detects invalid UTF-8 byte sequences.
[NEW FEATURE] All binary operators %???% now also have aliases
stri_*_fixed now use a tweaked Knuth-Morris-Pratt search algorithm, which improves the search performance drastically.
Significant performance improvements in stri_trans_to*, and others.
Refer to NEWS for a complete list of changes, new features and bug fixes.