
Pull the (character) strings with stringi 0.5-2

[This article was first published on Rexamine » Blog/R-bloggers, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A reliable string processing toolkit is a must-have for any data scientist.

A new release of the stringi package is available on CRAN (please allow a few days for the Windows and OS X binary builds). As of now, about 850 CRAN packages depend on stringi, either directly or recursively. Quite recently, the package was also listed among the most downloaded R extensions.

# install.packages("stringi") or update.packages()
library("stringi")
stri_info(TRUE)
## [1] "stringi_0.5.2; en_US.UTF-8; ICU4C 55.1; Unicode 7.0"
apkg <- available.packages(contriburl="http://cran.rstudio.com/src/contrib")
length(tools::dependsOnPkgs('stringi', installed=apkg, recursive=TRUE))
## [1] 845

If you compile stringi from source (which mostly concerns Linux users), refer to the INSTALL file for more details.

Here’s a list of changes in version 0.5-2. The release brings major new features (such as date and time processing), many minor ones, enhancements, and bug fixes. We also focused on giving users of the stringr package an even better string processing experience, since stringr has been powered by stringi as of its 1.0.0 release.
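
Because stringr delegates to stringi from its 1.0.0 release on, the two packages should agree on results. Here is a minimal sketch of that relationship, assuming stringr >= 1.0.0 is installed; the sample vector and the pattern are made up for illustration:

```r
library("stringi")

# Hypothetical sample data, including a missing value
x <- c("stringi", "stringr", "ICU", NA)

# Regex detection directly through stringi
via_stringi <- stri_detect_regex(x, "^string")
print(via_stringi)

# The same query through stringr's wrapper, if that package is available
if (requireNamespace("stringr", quietly = TRUE)) {
   via_stringr <- stringr::str_detect(x, "^string")
   print(identical(via_stringi, via_stringr))
}
```

Both calls propagate the missing value rather than treating it as a non-match.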

# Translate individual characters, much like base R's chartr()
stri_trans_char("id.123", ".", "_")
## [1] "id_123"
stri_trans_char("babaab", "ab", "01")
## [1] "101001"
# Width of each string, i.e., the number of text columns it occupies
stri_width(LETTERS[1:5])
## [1] 1 1 1 1 1
nchar(stri_trans_nfkd("\u0105"), "width") # provides incorrect information
## [1] 0
stri_width(stri_trans_nfkd("\u0105")) # a and combining ogonek (width = 1)
## [1] 1
stri_width( # Full-width equivalents of ASCII characters:
   stri_enc_fromutf32(as.list(c(0x3000, 0xFF01:0xFF5E)))
)
##  [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [36] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [71] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
x <- stri_flatten(c(
   stri_dup(LETTERS, 2),
   stri_enc_fromutf32(as.list(0xFF21:0xFF3a))
), collapse=' ')
# Note that your web browser may have problems with properly aligning
# this (try it in RStudio)
cat(stri_wrap(x, 11), sep='\n')
## AA BB CC DD
## EE FF GG HH
## II JJ KK LL
## MM NN OO PP
## QQ RR SS TT
## UU VV WW XX
## YY ZZ Ａ Ｂ
## Ｃ Ｄ Ｅ Ｆ
## Ｇ Ｈ Ｉ Ｊ
## Ｋ Ｌ Ｍ Ｎ
## Ｏ Ｐ Ｑ Ｒ
## Ｓ Ｔ Ｕ Ｖ
## Ｗ Ｘ Ｙ Ｚ
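# Fixed-pattern search on 100 random DNA-like strings of 10000 characters each
# (note: the stri_detect_fixed() call uses a different pattern than the grepl() calls)
x <- stri_rand_strings(100, 10000, "[actg]")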
microbenchmark::microbenchmark(
   stri_detect_fixed(x, "acgtgaa"),
   grepl("actggact", x),
   grepl("actggact", x, perl=TRUE),
   grepl("actggact", x, fixed=TRUE)
)
## Unit: microseconds
##                                expr       min        lq       mean
##     stri_detect_fixed(x, "acgtgaa")   349.153   354.181   381.2391
##                grepl("actggact", x) 14017.923 14181.416 14457.3996
##   grepl("actggact", x, perl = TRUE)  8280.282  8367.426  8516.0124
##  grepl("actggact", x, fixed = TRUE)  3599.200  3637.373  3726.6020
##      median         uq       max neval  cld
##    362.7515   391.0655   681.267   100 a   
##  14292.2815 14594.4970 15736.535   100    d
##   8463.4490  8570.0080  9564.503   100   c 
##   3686.6690  3753.4060  4402.397   100  b
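
The date and time processing mentioned among the major new features can be sketched as follows. This is only a minimal illustration, assuming the `stri_datetime_create()`, `stri_datetime_add()`, `stri_datetime_format()`, and `stri_datetime_parse()` functions behave as described in the stringi manual; the date, the time zone, and the format pattern are arbitrary choices for the example and do not come from the release notes.

```r
library("stringi")

# Build a date-time from its components (an arbitrary example date)
t0 <- stri_datetime_create(2015, 6, 22, 12, 0, 0, tz = "UTC")

# Calendar arithmetic: move 10 days forward
t1 <- stri_datetime_add(t0, 10, units = "days", tz = "UTC")

# Format using an ICU date/time pattern
formatted <- stri_datetime_format(t1, "uuuu-MM-dd HH:mm", tz = "UTC")
print(formatted)

# Parse the string back with the same pattern, then re-format it
parsed <- stri_datetime_parse(formatted, "uuuu-MM-dd HH:mm", tz = "UTC")
roundtrip <- stri_datetime_format(parsed, "uuuu-MM-dd HH:mm", tz = "UTC")
print(identical(formatted, roundtrip))
```

Passing `tz` explicitly in every call keeps the result independent of the local time zone of the machine running the code.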

Enjoy! Any comments and suggestions are welcome.

To leave a comment for the author, please follow the link and comment on their blog: Rexamine » Blog/R-bloggers.
