stringi 0.4-1 released – fast, portable, consistent character string processing

December 14, 2014

(This article was first published on Rexamine » Blog/R-bloggers, and kindly contributed to R-bloggers)

A new release of the stringi package is available on CRAN (please wait a few days for Windows and OS X binary builds).

# install.packages("stringi") or update.packages()
library("stringi")

Here’s a list of changes in version 0.4-1. In the current release, we particularly focused on making the package’s interface more consistent with that of the well-known stringr package. For a general overview of stringi’s facilities and base R string processing issues, see e.g. here.

  • (IMPORTANT CHANGE) n_max argument in stri_split_*() has been renamed n.
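
As a quick illustration of the renamed argument (a minimal sketch; the input string is made up for this example), n caps the number of pieces and the last piece keeps the unsplit remainder:

```r
library("stringi")
# n = 2: split at the first comma only; the rest stays in the second piece
parts <- stri_split_fixed("a,b,c,d", ",", n = 2)
print(parts)
## [[1]]
## [1] "a"     "b,c,d"
```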

  • (IMPORTANT CHANGE) simplify=TRUE in stri_extract_all_*() and stri_split_*() now calls stri_list2matrix() with fill="". fill=NA_character_ may be obtained by using simplify=NA.
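
The two fill values can be compared side by side. This is a small sketch with a made-up input vector, assuming a current stringi where the matrix form is requested with simplify=TRUE (empty-string fill) or simplify=NA (missing-value fill):

```r
library("stringi")
x <- c("a,b", "a,b,c")
# simplify = TRUE pads the shorter row with ""
m_empty <- stri_split_fixed(x, ",", simplify = TRUE)
# simplify = NA pads the shorter row with NA_character_
m_na <- stri_split_fixed(x, ",", simplify = NA)
print(m_empty)
print(m_na)
```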

  • (IMPORTANT CHANGE, NEW FUNCTIONS) #120: stri_extract_words has been renamed stri_extract_all_words, stri_locate_boundaries has been renamed stri_locate_all_boundaries, and stri_locate_words has been renamed stri_locate_all_words. New functions are now available: stri_locate_first_boundaries, stri_locate_last_boundaries, stri_locate_first_words, stri_locate_last_words, stri_extract_first_words, stri_extract_last_words.

# uses ICU's locale-dependent word break iterator
stri_extract_all_words("stringi: THE string processing package for R")
## [[1]]
## [1] "stringi"    "THE"        "string"     "processing" "package"   
## [6] "for"        "R"
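
The new first/last variants return a character vector with one element per input string instead of a list. A short sketch, reusing the sentence from the example above:

```r
library("stringi")
s <- "stringi: THE string processing package for R"
# only the first and the last word found by the word break iterator
stri_extract_first_words(s)
## [1] "stringi"
stri_extract_last_words(s)
## [1] "R"
```
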
  • (IMPORTANT CHANGE) #111: opts_regex, opts_collator, opts_fixed, and opts_brkiter can now be supplied individually via .... In other words, you may now simply call e.g.
stri_detect_regex(c("stringi", "STRINGI"), "stringi", case_insensitive=TRUE)
## [1] TRUE TRUE

instead of:

stri_detect_regex(c("stringi", "STRINGI"), "stringi", opts_regex=stri_opts_regex(case_insensitive=TRUE))
## [1] TRUE TRUE
  • (NEW FEATURE) #110: The fixed pattern search engine’s settings can now be supplied via the opts_fixed argument in stri_*_fixed(), see stri_opts_fixed(). Simple (not suitable for natural language processing) yet very fast case-insensitive pattern matching can now be performed with case_insensitive=TRUE. stri_extract_*_fixed is available again.

  • (NEW FEATURE) #23: stri_extract_all_fixed, stri_count, and stri_locate_all_fixed may now also look for overlapping pattern matches, see ?stri_opts_fixed.

stri_extract_all_fixed("abaBAaba", "ABA", case_insensitive=TRUE, overlap=TRUE)
## [[1]]
## [1] "aba" "aBA" "aba"
  • (NEW FEATURE) #129: stri_match_*_regex gained a cg_missing argument.
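
A minimal sketch of what cg_missing controls, assuming it sets the string reported for a capture group that did not take part in the match (the pattern and input here are made up for illustration):

```r
library("stringi")
# the optional group (b)? does not participate in the match on "a"
m_default <- stri_match_first_regex("a", "(a)(b)?")
m_empty   <- stri_match_first_regex("a", "(a)(b)?", cg_missing = "")
print(m_default)  # third column is NA
print(m_empty)    # third column is ""
```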

  • (NEW FEATURE) #117: stri_extract_all_*(), stri_locate_all_*(), stri_match_all_*() gained a new argument: omit_no_match. Setting it to TRUE makes these functions compatible with their stringr equivalents.
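
The difference shows up only for strings with no match at all. A small sketch with a made-up input:

```r
library("stringi")
# default: a non-matching string yields a single NA
no_match_default <- stri_extract_all_regex("abc", "[0-9]")
# omit_no_match = TRUE: a non-matching string yields an empty vector
no_match_omitted <- stri_extract_all_regex("abc", "[0-9]", omit_no_match = TRUE)
print(no_match_default)
print(no_match_omitted)
```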

  • (NEW FEATURE) #118: stri_wrap() gained indent, exdent, initial, and prefix arguments. Moreover, Knuth’s dynamic word wrapping algorithm now assumes that the cost of printing the last line is zero, see #128.

cat(stri_wrap(stri_rand_lipsum(1), 40, 2.0), sep="\n")
## Lorem ipsum dolor sit amet, et et diam
## vitae est ut. At tristique, tincidunt
## taciti, ac egestas vestibulum magna.
## Volutpat nisl non sed ultricies nisl
## nibh magna. Nullam rhoncus ut phasellus
## sed. Congue enim libero congue massa
## eget. Ligula, quis est amet velit.
## Accumsan amet nunc ad. Porttitor,
## sed vestibulum diam vestibulum quis
## sed gravida ultrices. Per urna enim.
## Scelerisque interdum sed vestibulum
## rhoncus quis imperdiet pharetra. Sapien
## iaculis, lacinia ac cras ante, sed
## vitae inceptos dis tristique dignissim.
## Venenatis volutpat lectus sodales,
## hac feugiat molestie mollis. A, urna
## pellentesque ante himenaeos ante at
## potenti in.
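
The new stri_wrap() arguments can be tried on a fixed string. This is only a sketch: the text, width, and prefix are arbitrary choices, and the output is not reproduced here:

```r
library("stringi")
txt <- "stringi is a package for fast, portable, and consistent character string processing in R."
# indent: extra spaces on the first line; exdent: extra spaces on later lines;
# prefix: prepended to every output line (initial defaults to prefix)
lines <- stri_wrap(txt, width = 30, indent = 2, exdent = 4, prefix = "> ")
cat(lines, sep = "\n")
```
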
  • (NEW FEATURE) #122: stri_subset() gained an omit_na argument.
stri_subset_fixed(c("abc", NA, "def"), "a")
## [1] "abc" NA
stri_subset_fixed(c("abc", NA, "def"), "a", omit_na=TRUE)
## [1] "abc"
  • (NEW FEATURE) stri_list2matrix() gained an n_min argument.
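
A short sketch of n_min, assuming it sets the minimal number of columns when byrow=TRUE (and of rows otherwise); the input list is made up for illustration:

```r
library("stringi")
x <- list("a", c("b", "c"))
# byrow = TRUE: one row per list element; n_min = 3 forces at least 3 columns,
# and the extra cells get the default fill, NA
m <- stri_list2matrix(x, byrow = TRUE, n_min = 3)
print(m)
```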

  • (NEW FEATURE) #126: stri_split() now is also able to act just like stringr::str_split_fixed().

stri_split_regex(c("bab", "babab"), "a", n = 3, simplify=TRUE)
##      [,1] [,2] [,3]
## [1,] "b"  "b"  ""  
## [2,] "b"  "b"  "b"
  • (NEW FEATURE) #119: stri_split_boundaries() now has n, tokens_only, and simplify arguments. Additionally, stri_extract_all_words() is now equipped with a simplify argument.

  • (NEW FEATURE) #116: stri_paste() gained a new argument: ignore_null. Setting it to TRUE makes this function more compatible with paste().

for (test in c(TRUE, FALSE))
   print(stri_paste("a", if (test) 1:9, ignore_null=TRUE))
## [1] "a1" "a2" "a3" "a4" "a5" "a6" "a7" "a8" "a9"
## [1] "a"

Enjoy! Any comments and suggestions are welcome.
