Pull the (character) strings with stringi 0.5-2

Marek Gągolewski

7 years ago

[This article was first published on Rexamine » Blog/R-bloggers, and kindly contributed to R-bloggers.]

A reliable string processing toolkit is a must-have for any data scientist.

A new release of the stringi package is available on CRAN (please wait a few days for Windows and OS X binary builds). As of now, about 850 CRAN packages depend (either directly or recursively) on stringi. Quite recently, the package was also listed among the most downloaded R extensions.

# install.packages("stringi") or update.packages()
library("stringi")
stri_info(TRUE)
## [1] "stringi_0.5.2; en_US.UTF-8; ICU4C 55.1; Unicode 7.0"
apkg <- available.packages(contriburl="http://cran.rstudio.com/src/contrib")
length(tools::dependsOnPkgs('stringi', installed=apkg, recursive=TRUE))
## [1] 845

Refer to the INSTALL file for more details if you compile stringi from sources (Linux users mostly).

Here’s a list of changes in version 0.5-2. There are many major (like date and time processing) and minor new features and enhancements, as well as bug fixes. In this release we also focused on giving users of the stringr package an even better string processing experience, since stringr has been powered by stringi as of its 1.0.0 release.

[BACKWARD INCOMPATIBILITY] The second argument to stri_pad_*() has been renamed width.
[GENERAL] #69: stringi is now bundled with ICU4C 55.1.
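
As a quick illustration of the renamed stri_pad_*() argument, here is a minimal sketch that pads a few identifiers to a fixed width. The sample values are made up for illustration only.

```r
library("stringi")

# Hypothetical sample identifiers, used only for illustration
ids <- c("7", "42", "123")

# The second argument is now called `width`
padded <- stri_pad_left(ids, width=5, pad="0")
print(padded)
```

Calls that pass the second argument by position should be unaffected; only code that spelled out the old argument name needs updating.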

[NEW FUNCTIONS] #137: date-time formatting/parsing (note that this is a draft API and it may change in future stringi releases; any comments are welcome):

stri_timezone_list() – lists all known time zone identifiers

sample(stri_timezone_list(), 10)
##  [1] "Etc/GMT+12"                  "Antarctica/Macquarie"       
##  [3] "Atlantic/Faroe"              "Antarctica/Troll"           
##  [5] "America/Fort_Wayne"          "PLT"                        
##  [7] "America/Goose_Bay"           "America/Argentina/Catamarca"
##  [9] "Africa/Juba"                 "Africa/Bissau"

stri_timezone_set(), stri_timezone_get() – manage current default time zone
stri_timezone_info() – basic information on a given time zone

str(stri_timezone_info('Europe/Warsaw'))
## List of 6
##  $ ID              : chr "Europe/Warsaw"
##  $ Name            : chr "Central European Standard Time"
##  $ Name.Daylight   : chr "Central European Summer Time"
##  $ Name.Windows    : chr "Central European Standard Time"
##  $ RawOffset       : num 1
##  $ UsesDaylightTime: logi TRUE
stri_timezone_info('Europe/Warsaw', locale='de_DE')$Name
## [1] "Mitteleuropäische Normalzeit"
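
The stri_timezone_set() and stri_timezone_get() pair listed above might be used as in the following sketch. It assumes that the identifier "Europe/Warsaw" is known to the ICU build in use (it appears in the example above), and it restores the previous default at the end.

```r
library("stringi")

old_tz <- stri_timezone_get()        # remember the current default
stri_timezone_set("Europe/Warsaw")   # switch the default time zone
new_tz <- stri_timezone_get()
print(new_tz)
stri_timezone_set(old_tz)            # restore the previous setting
```

Since this is a draft API, saving and restoring the default explicitly, as done here, avoids relying on what stri_timezone_set() itself returns.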

stri_datetime_symbols() – localizable date-time formatting data

stri_datetime_symbols()
## $Month
##  [1] "January"   "February"  "March"     "April"     "May"      
##  [6] "June"      "July"      "August"    "September" "October"  
## [11] "November"  "December" 
## 
## $Weekday
## [1] "Sunday"    "Monday"    "Tuesday"   "Wednesday" "Thursday"  "Friday"   
## [7] "Saturday" 
## 
## $Quarter
## [1] "1st quarter" "2nd quarter" "3rd quarter" "4th quarter"
## 
## $AmPm
## [1] "AM" "PM"
## 
## $Era
## [1] "Before Christ" "Anno Domini"
stri_datetime_symbols("th_TH_TRADITIONAL")$Month
##  [1] "มกราคม"  "กุมภาพันธ์"    "มีนาคม"    "เมษายน"  "พฤษภาคม" "มิถุนายน"    "กรกฎาคม"
##  [8] "สิงหาคม"   "กันยายน"   "ตุลาคม"    "พฤศจิกายน" "ธันวาคม"
stri_datetime_symbols("he_IL@calendar=hebrew")$Month
##  [1] "תשרי"   "חשון"   "כסלו"   "טבת"    "שבט"    "אדר א׳" "אדר"   
##  [8] "ניסן"   "אייר"   "סיון"   "תמוז"   "אב"     "אלול"   "אדר ב׳"

stri_datetime_now() – return current date-time
stri_datetime_fstr() – convert a strptime-like format string to an ICU date/time format string
stri_datetime_format() – convert date/time to string

stri_datetime_format(stri_datetime_now(), "datetime_relative_medium")
## [1] "today, 6:21:45 PM"
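
stri_datetime_fstr() and stri_datetime_format() can also be combined. The sketch below assumes the default time zone is left unchanged between creating and formatting the object; the date itself is an arbitrary example.

```r
library("stringi")

# An arbitrary example date in the default time zone
d <- stri_datetime_create(2015, 6, 23)

# Translate a strptime-like format into an ICU pattern, then use it
icu_fmt <- stri_datetime_fstr("%Y-%m-%d")
formatted <- stri_datetime_format(d, icu_fmt)
print(formatted)
```

This lets existing strptime-style format strings be reused without rewriting them as ICU patterns by hand.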

stri_datetime_parse() – convert string to date/time object

stri_datetime_parse(c("2015-02-28", "2015-02-29"), "yyyy-MM-dd")
## [1] "2015-02-28 18:21:45 CET" NA
stri_datetime_parse(c("2015-02-28", "2015-02-29"), stri_datetime_fstr("%Y-%m-%d"))
## [1] "2015-02-28 18:21:45 CET" NA
stri_datetime_parse(c("2015-02-28", "2015-02-29"), "yyyy-MM-dd", lenient=TRUE)
## [1] "2015-02-28 18:21:45 CET" "2015-03-01 18:21:45 CET"
stri_datetime_parse("19 lipca 2015", "date_long", locale="pl_PL")
## [1] "2015-07-19 18:21:45 CEST"

stri_datetime_create() – construct date-time objects from numeric representations

stri_datetime_create(2015, 12, 31, 23, 59, 59.999)
## [1] "2015-12-31 23:59:59 CET"
stri_datetime_create(5775, 8, 1, locale="@calendar=hebrew") # 1 Nisan 5775 -> 2015-03-21
## [1] "2015-03-21 12:00:00 CET"
stri_datetime_create(2015, 02, 29)
## [1] NA
stri_datetime_create(2015, 02, 29, lenient=TRUE)
## [1] "2015-03-01 12:00:00 CET"

stri_datetime_fields() – get values for date-time fields

stri_datetime_fields(stri_datetime_now())
##   Year Month Day Hour Minute Second Millisecond WeekOfYear WeekOfMonth
## 1 2015     6  23   18     21     45          52         26           4
##   DayOfYear DayOfWeek Hour12 AmPm Era
## 1       174         3      6    2   2
stri_datetime_fields(stri_datetime_now(), locale="@calendar=hebrew")
##   Year Month Day Hour Minute Second Millisecond WeekOfYear WeekOfMonth
## 1 5775    11   6   18     21     45          56         40           2
##   DayOfYear DayOfWeek Hour12 AmPm Era
## 1       272         3      6    2   1
stri_datetime_symbols(locale="@calendar=hebrew")$Month[
   stri_datetime_fields(stri_datetime_now(), locale="@calendar=hebrew")$Month
]
## [1] "Tamuz"

stri_datetime_add() – add specific number of date-time units to a date-time object

x <- stri_datetime_create(2015, 12, 31, 23, 59, 59.999)
stri_datetime_add(x, units="months") <- 2
print(x)
## [1] "2016-02-29 23:59:59 CET"
stri_datetime_add(x, -2, units="months")
## [1] "2015-12-29 23:59:59 CET"

[NEW FUNCTIONS] stri_extract_*_boundaries() extract text between text boundaries.
[NEW FUNCTION] #46: stri_trans_char() is a stringi-flavoured chartr() equivalent.

stri_trans_char("id.123", ".", "_")
## [1] "id_123"
stri_trans_char("babaab", "ab", "01")
## [1] "101001"
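
For the stri_extract_*_boundaries() family mentioned above, a minimal sketch could look as follows. The sample sentence is made up, and the type argument is assumed to be forwarded to the break iterator options, as it is in the other boundary-based functions.

```r
library("stringi")

txt <- "stringi: THE string processing package for R"

# Split at line-break opportunities (the default boundary type)
pieces <- stri_extract_all_boundaries(txt)[[1]]
print(pieces)

# Word-type boundaries are selected via the break iterator options
words <- stri_extract_all_boundaries(txt, type="word")[[1]]
print(words)
```

With the default options nothing is skipped, so in both cases the extracted chunks partition the input: concatenating them gives back the original string.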

[NEW FUNCTION] #8: stri_width() approximates the width of a string in a more Unicodish fashion than nchar(..., "width")

stri_width(LETTERS[1:5])
## [1] 1 1 1 1 1
nchar(stri_trans_nfkd("\u0105"), "width") # provides incorrect information
## [1] 0
stri_width(stri_trans_nfkd("\u0105")) # A and ogonek (width = 1)
## [1] 1
stri_width( # Full-width equivalents of ASCII characters:
   stri_enc_fromutf32(as.list(c(0x3000, 0xFF01:0xFF5E)))
)
##  [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [36] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [71] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

[NEW FEATURE] #149: stri_pad() and stri_wrap() are now by default based on code point widths instead of the number of code points. Moreover, by default stri_wrap() no longer gets rid of non-breaking, zero-width, and similar spaces.

x <- stri_flatten(c(
   stri_dup(LETTERS, 2),
   stri_enc_fromutf32(as.list(0xFF21:0xFF3a))
), collapse=' ')
# Note that your web browser may have problems with properly aligning
# this (try it in RStudio)
cat(stri_wrap(x, 11), sep='\n')
## AA BB CC DD
## EE FF GG HH
## II JJ KK LL
## MM NN OO PP
## QQ RR SS TT
## UU VV WW XX
## YY ZZ Ａ Ｂ
## Ｃ Ｄ Ｅ Ｆ
## Ｇ Ｈ Ｉ Ｊ
## Ｋ Ｌ Ｍ Ｎ
## Ｏ Ｐ Ｑ Ｒ
## Ｓ Ｔ Ｕ Ｖ
## Ｗ Ｘ Ｙ Ｚ
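
The same width-based logic applies to padding. The following sketch, which assumes the default settings of stri_pad_left(), compares an ASCII letter with its full-width counterpart.

```r
library("stringi")

# One ASCII letter and its full-width counterpart (U+FF21)
x <- c("A", stri_enc_fromutf32(0xFF21))

padded <- stri_pad_left(x, width=4, pad=".")
print(stri_width(padded))   # total display width of each padded string
print(stri_length(padded))  # number of code points in each padded string
```

Both results occupy four columns, but the full-width letter receives one fewer padding character because it is two columns wide by itself.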

[NEW FEATURE] #133: stri_wrap() silently allows for width <= 0 (for compatibility with strwrap()).
[NEW FEATURE] #139: stri_wrap() gained a new argument: whitespace_only.
[GENERAL] #144: Performance improvements in handling ASCII strings (these affect stri_sub(), stri_locate() and other string index-based operations)
[GENERAL] #143: Searching for short fixed patterns (stri_*_fixed()) now relies on the current libC’s implementation of strchr() and strstr(). This is very fast e.g. on glibc utilizing the SSE2/3/4 instruction set.

x <- stri_rand_strings(100, 10000, "[actg]")
microbenchmark::microbenchmark(
   stri_detect_fixed(x, "acgtgaa"),
   grepl("actggact", x),
   grepl("actggact", x, perl=TRUE),
   grepl("actggact", x, fixed=TRUE)
)
## Unit: microseconds
##                                expr       min        lq       mean
##     stri_detect_fixed(x, "acgtgaa")   349.153   354.181   381.2391
##                grepl("actggact", x) 14017.923 14181.416 14457.3996
##   grepl("actggact", x, perl = TRUE)  8280.282  8367.426  8516.0124
##  grepl("actggact", x, fixed = TRUE)  3599.200  3637.373  3726.6020
##      median         uq       max neval  cld
##    362.7515   391.0655   681.267   100 a   
##  14292.2815 14594.4970 15736.535   100    d
##   8463.4490  8570.0080  9564.503   100   c 
##   3686.6690  3753.4060  4402.397   100  b

[GENERAL] #141: a local copy of icudt*.zip may be used on package install; see the INSTALL file for more information.
[GENERAL] #165: the ./configure option --disable-icu-bundle forces the use of system ICU when building the package.
[BUGFIX] locale specifiers are now normalized in a more intelligent way: e.g. @calendar=gregorian expands to DEFAULT_LOCALE@calendar=gregorian.
[BUGFIX] #134: stri_extract_all_words() did not accept simplify=NA.
[BUGFIX] #132: incorrect behavior in stri_locate_regex() for matches of zero lengths.
[BUGFIX] stringr/#73: stri_wrap() returned CHARSXP instead of STRSXP on empty string input with simplify=FALSE argument.
[BUGFIX] #164: libicu-dev usage used to fail on Ubuntu.
[BUGFIX] #135: C++11 is now used by default (see the INSTALL file, however) to build stringi from sources. This is because ICU4C uses the long long type which is not part of the C++98 standard.
[BUGFIX] #154: Dates and other objects with a custom class attribute were not coerced to the character type correctly.
[BUGFIX] #168: Build now fails if icudt is not available.
[BUGFIX] Force ICU u_init() call on stringi dynlib load.
[BUGFIX] #157: many overfull hboxes in the package PDF manual have been corrected.
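
To illustrate the coercion fix from #154, here is a minimal sketch that passes a Date object to a stringi function. The date is an arbitrary example, and the behavior shown assumes a stringi version that includes this fix.

```r
library("stringi")

d <- as.Date("2015-06-23")

# The Date is converted via its usual character representation
msg <- stri_paste("Released on ", d)
print(msg)
```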

Enjoy! Any comments and suggestions are welcome.
