
Faster, easier, and more reliable character string processing with stringi 0.3-1

[This article was first published on Rexamine » Blog/R-bloggers, and kindly contributed to R-bloggers.]

A new release of the stringi package is available on CRAN (please wait a few days for Windows and OS X binary builds).

# install.packages("stringi") or update.packages()
library("stringi")

stringi is an R package providing (but definitely not limited to) equivalents of nearly all the character string processing functions known from base R. While developing the package, we kept high performance and portability of its facilities in mind.
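
As a rough illustration of what "equivalents of base R functions" means, the sketch below pairs a few stringi calls with their approximate base R counterparts. The input vector `x` is made up for this example, and the pairings in the comments are only approximate, since the two families differ in details such as the handling of missing values.

```r
library("stringi")

x <- c("stringi", "base R")  # hypothetical input, not taken from the release notes

len    <- stri_length(x)                 # cf. nchar(x)
upper  <- stri_trans_toupper(x)          # cf. toupper(x)
head3  <- stri_sub(x, 1, 3)              # cf. substr(x, 1, 3)
joined <- stri_paste(x, collapse=", ")   # cf. paste(x, collapse=", ")
has_R  <- stri_detect_fixed(x, "R")      # cf. grepl("R", x, fixed=TRUE)

print(len)
## [1] 7 6
print(head3)
## [1] "str" "bas"
print(has_R)
## [1] FALSE  TRUE
```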

We implemented each string processing function from scratch. The internationalization and globalization support, as well as many string processing facilities (like regex searching), are provided by IBM's well-known ICU4C library.
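
One practical consequence of relying on ICU is that regex patterns can use Unicode properties. The following is a minimal sketch, assuming a test string chosen for this example; it uses ICU's `\p{L}` (any letter) and `\p{Nd}` (decimal digit) classes, with non-ASCII letters written as `\u` escapes so the snippet does not depend on the file encoding.

```r
library("stringi")

# "zażółć gęślą 123", written with escapes (hypothetical sample text)
s <- "za\u017c\u00f3\u0142\u0107 g\u0119\u015bl\u0105 123"

# Runs of letters, including letters outside ASCII
words <- stri_extract_all_regex(s, "\\p{L}+")[[1]]

# Number of decimal digits in the string
n_digits <- stri_count_regex(s, "\\p{Nd}")

print(stri_length(words))
## [1] 6 5
print(n_digits)
## [1] 3
```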

Here is a very general list of the most important features available in the current version of stringi:

and many more.

Here’s a list of changes in version 0.3-1:

test <- "The\u00a0above-mentioned    features are very useful. Warm thanks to their developers. 123 456 789"
stri_split_boundaries(test, stri_opts_brkiter(type="word", skip_word_none=TRUE, skip_word_number=TRUE)) # cf. stri_extract_words
## [[1]]
##  [1] "The"        "above"      "mentioned"  "features"   "are"       
##  [6] "very"       "useful"     "Warm"       "thanks"     "to"        
## [11] "their"      "developers"
stri_split_boundaries(test, stri_opts_brkiter(type="sentence")) # extract sentences
## [[1]]
## [1] "The above-mentioned    features are very useful. "
## [2] "Warm thanks to their developers. "                
## [3] "123 456 789"
stri_split_boundaries(test, stri_opts_brkiter(type="character")) # extract characters
## [[1]]
##  [1] "T" "h" "e" " " "a" "b" "o" "v" "e" "-" "m" "e" "n" "t" "i" "o" "n"
## [18] "e" "d" " " " " " " " " "f" "e" "a" "t" "u" "r" "e" "s" " " "a" "r"
## [35] "e" " " "v" "e" "r" "y" " " "u" "s" "e" "f" "u" "l" "." " " "W" "a"
## [52] "r" "m" " " "t" "h" "a" "n" "k" "s" " " "t" "o" " " "t" "h" "e" "i"
## [69] "r" " " "d" "e" "v" "e" "l" "o" "p" "e" "r" "s" "." " " "1" "2" "3"
## [86] " " "4" "5" "6" " " "7" "8" "9"

By the way, the last call also works correctly for strings not in the Unicode Normalization Form C:

stri_split_boundaries(stri_trans_nfkd("zażółć gęślą jaźń"), stri_opts_brkiter(type="character"))
## [[1]]
##  [1] "z" "a" "ż"  "ó"  "ł" "ć"  " " "g" "ę"  "ś"  "l" "ą"  " " "j" "a" "ź"  "ń"
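
To make the normalization point more concrete, here is a small sketch (not part of the original release notes) that decomposes a single precomposed letter and checks that the character break iterator still reports one user-perceived character. The variable names are hypothetical, and `opts_brkiter` is passed by name so the call does not depend on argument order.

```r
library("stringi")

nfc  <- "\u0105"               # LATIN SMALL LETTER A WITH OGONEK, precomposed
nfkd <- stri_trans_nfkd(nfc)   # "a" followed by U+0328 COMBINING OGONEK

# The decomposed form has two code points and is no longer in NFC
n_nfc  <- stri_length(nfc)
n_nfkd <- stri_length(nfkd)
is_nfc <- stri_trans_isnfc(nfkd)

# The character break iterator keeps the base letter and its combining mark together
chars <- stri_split_boundaries(nfkd,
   opts_brkiter=stri_opts_brkiter(type="character"))[[1]]

print(c(n_nfc, n_nfkd))
## [1] 1 2
print(is_nfc)
## [1] FALSE
print(length(chars))
## [1] 1
```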
stri_count_words("Have a nice day!")
## [1] 4
stri_startswith_fixed(c("a1o", "a2g", "b3a", "a4e", "c5a"), "a")
## [1]  TRUE  TRUE FALSE  TRUE FALSE
stri_replace_all_fixed("The quick brown fox jumped over the lazy dog.",
     c("quick", "brown", "fox"), c("slow",  "black", "bear"), vectorize_all=FALSE)
## [1] "The slow black bear jumped over the lazy dog."
# Compare the results:
stri_replace_all_fixed("The quicker brown fox jumped over the lazy dog.",
     c("quick", "brown", "fox"), c("slow",  "black", "bear"), vectorize_all=FALSE)
## [1] "The slower black bear jumped over the lazy dog."
stri_replace_all_regex("The quicker brown fox jumped over the lazy dog.",
     "\\b"%s+%c("quick", "brown", "fox")%s+%"\\b", c("slow",  "black", "bear"), vectorize_all=FALSE)
## [1] "The quicker black bear jumped over the lazy dog."
stri_subset_regex(c("john@office.company.com", "steve1932@g00gl3.eu", "no email here"),
   "^[A-Za-z0-9._%+-]+@([A-Za-z0-9-]+\\.)+[A-Za-z]{2,4}$")
## [1] "john@office.company.com" "steve1932@g00gl3.eu"
stri_split_fixed(c("ab_c", "d_ef_g", "h", ""), "_", n_max=1, tokens_only=TRUE, omit_empty=TRUE)
## [[1]]
## [1] "ab"
## 
## [[2]]
## [1] "d"
## 
## [[3]]
## [1] "h"
## 
## [[4]]
## character(0)
stri_split_fixed(c("ab_c", "d_ef_g", "h", ""), "_", n_max=2, tokens_only=TRUE, omit_empty=TRUE)
## [[1]]
## [1] "ab" "c" 
## 
## [[2]]
## [1] "d"  "ef"
## 
## [[3]]
## [1] "h"
## 
## [[4]]
## character(0)
stri_split_fixed(c("ab_c", "d_ef_g", "h", ""), "_", n_max=3, tokens_only=TRUE, omit_empty=TRUE)
## [[1]]
## [1] "ab" "c" 
## 
## [[2]]
## [1] "d"  "ef" "g" 
## 
## [[3]]
## [1] "h"
## 
## [[4]]
## character(0)
stri_list2matrix(stri_split_fixed(c("ab_c", "d_ef_g", "h", ""), "_", n_max=3, tokens_only=TRUE, omit_empty=TRUE))
##      [,1] [,2] [,3] [,4]
## [1,] "ab" "d"  "h"  NA  
## [2,] "c"  "ef" NA   NA  
## [3,] NA   "g"  NA   NA
stri_split_fixed("a_b_c__d", "_", omit_empty=FALSE)
## [[1]]
## [1] "a" "b" "c" ""  "d"
stri_split_fixed("a_b_c__d", "_", omit_empty=TRUE)
## [[1]]
## [1] "a" "b" "c" "d"
stri_split_fixed("a_b_c__d", "_", omit_empty=NA)
## [[1]]
## [1] "a" "b" "c" NA  "d"
stri_split_fixed(c("ab,c", "d,ef,g", ",h", ""), ",", omit_empty=TRUE, simplify=TRUE)
##      [,1] [,2] [,3]
## [1,] "ab" "c"  NA  
## [2,] "d"  "ef" "g" 
## [3,] "h"  NA   NA  
## [4,] NA   NA   NA
stri_split_fixed(c("ab,c", "d,ef,g", ",h", ""), ",", omit_empty=FALSE, simplify=TRUE)
##      [,1] [,2] [,3]
## [1,] "ab" "c"  NA  
## [2,] "d"  "ef" "g" 
## [3,] ""   "h"  NA  
## [4,] ""   NA   NA
stri_split_fixed(c("ab,c", "d,ef,g", ",h", ""), ",", omit_empty=NA, simplify=TRUE)
##      [,1] [,2] [,3]
## [1,] "ab" "c"  NA  
## [2,] "d"  "ef" "g" 
## [3,] NA   "h"  NA  
## [4,] NA   NA   NA
cat(sapply(
   stri_wrap(stri_rand_lipsum(3), 80, simplify=FALSE),
   stri_flatten, collapse="\n"), sep="\n\n")
## Lorem ipsum dolor sit amet, eu turpis pellentesque est, lectus, vestibulum.
## Iaculis et nam ad eu morbi, ultrices enim pellentesque est fusce. Etiam
## ipsum varius, maecenas dapibus. Netus molestie non adipiscing netus,
## aptent sed malesuada, placerat suscipit. A, sed eu luctus imperdiet odio
## tempor. In velit ut vel feugiat felis eros risus. Sed sapien, facilisis
## ullamcorper, senectus efficitur sit id sociis sed purus. Ipsum, a, blandit
## faucibus. In vivamus, duis et sed sollicitudin maximus. Sodales magnis
## ac senectus facilisis, dolor faucibus a. Cursus in cum, cubilia egestas
## ut platea turpis. Maximus sit vel cursus nec in vel, eu, lacinia in ut.
## 
## Libero maximus potenti penatibus amet nisl non ut. Commodo nullam rhoncus,
## bibendum quisque sem aliquam sed, quam enim et, sed. Lacinia netus inceptos
## sapien nostra tincidunt facilisis montes nascetur non pharetra convallis
## id. Netus diam nulla montes nec tincidunt facilisis eros porttitor nisl urna
## cubilia. Aliquet egestas mus nisl, nisi vehicula, ac mauris rutrum, felis
## aenean tristique magna. Ante maecenas phasellus id class. Finibus iaculis purus
## volutpat posuere phasellus magna class blandit augue morbi torquent. Taciti
## ullamcorper venenatis at nulla eget auctor ante neque metus sed metus. Dolor,
## platea sit sed pellentesque ipsum. Dapibus sed nisi vestibulum ex integer.
## 
## Duis iaculis sapien habitasse, facilisi habitasse leo nam. Egestas,
## libero tempor purus in. Aliquam himenaeos conubia egestas cum vestibulum
## nec. Sociosqu mauris cum mus non lobortis eu et dapibus vel integer.
## Blandit quis inceptos cursus vel pellentesque lectus amet egestas.
## Pharetra ac eros nisi. Finibus nec, ac congue in molestie sed.
## Tincidunt faucibus a interdum facilisis, sed nulla, tortor, felis,
## sociis. Sem porttitor himenaeos pharetra nec eu torquent elementum.
stri_trans_totitle("GOOD-OLD cOOkiE mOnSTeR IS watCHinG You. Here HE comes!",
    stri_opts_brkiter(type="word")) # default boundary
## [1] "Good-Old Cookie Monster Is Watching You. Here He Comes!"
stri_trans_totitle("GOOD-OLD cOOkiE mOnSTeR IS watCHinG You. Here HE comes!",
    stri_opts_brkiter(type="sentence"))
## [1] "Good-old cookie monster is watching you. Here he comes!"

Enjoy! Any comments and suggestions are welcome.

To leave a comment for the author, please follow the link and comment on their blog: Rexamine » Blog/R-bloggers.
