Strung Out On String Ops – A Brief Comparison of stringi and stringr

Posted on February 5, 2017 by hrbrmstr in R bloggers | 0 Comments

[This article was first published on R – rud.is, and kindly contributed to R-bloggers].


I made a promise to someone that my next blog post would be about stringi vs stringr and I intend to keep said promise.

stringr and stringi do “string operations”: find, replace, match, extract, convert, transform, etc.

The stringr package is now part of the tidyverse, is 100% focused on string processing, and is pretty much a wrapper package for stringi. The stringi package wraps chunks of the icu4c library, but the framing of the stringi API was actually based on the patterns in the stringr package API. stringr did not wrap stringi at the time but does now, and stringi strays a bit (on occasion) from string processing since the entire icu4c library is at its disposal. Confused? Good! There’s more!

The impetus for asking me to blog about this is that I’m known to say “just use stringi” in situations where someone has taken a stringr “shortcut”. Let me explain why.

Readers Digest

First, you need to read pages 4-5 of the stringi manual [PDF] and then the stringr vignette. I’m not duplicating the information on those pages. The TL;DR on them is:

stringr makes some (valid) assumptions about defaults for the stringi calls it wraps
stringr is much easier to initially grok as it’s very focused and has far fewer functions
they both use ICU regular expressions
stringi includes more than string processing and has far more total functions
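If you want to check the "far more total functions" claim on your own machine, here is a minimal sketch. It assumes both packages are installed and simply counts what each namespace exports; the numbers depend on the installed versions, so none are quoted here.

```r
# Count the objects each package exports.
# getNamespaceExports() loads the namespace if needed, so no library() call is required.
n_stringi <- length(getNamespaceExports("stringi"))
n_stringr <- length(getNamespaceExports("stringr"))

cat("stringi exports:", n_stringi, "\n")
cat("stringr exports:", n_stringr, "\n")
```

Note that the export lists include operators and re-exports, so treat the counts as a rough comparison rather than an exact function tally.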

As noted, stringr wraps stringi calls (for the most part), and some of the stringr functions reference more than one stringi function.
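One rough way to see the wrapping for yourself is to scan the body of each exported stringr function for symbols that start with stri_. This is only a sketch: it assumes stringr is installed, it looks at direct references only (not at internal helpers a function may call), and the result varies with the stringr version.

```r
# For every object stringr exports, list the stringi functions
# (names starting with "stri_") mentioned directly in its body.
stringr_exports <- getNamespaceExports("stringr")

stri_refs <- lapply(stringr_exports, function(name) {
  obj <- getExportedValue("stringr", name)
  if (!is.function(obj)) return(character(0))
  symbols <- all.names(body(obj), unique = TRUE)
  symbols[startsWith(symbols, "stri_")]
})
names(stri_refs) <- stringr_exports

# Keep only the stringr functions that reference more than one stringi function
multi <- Filter(function(refs) length(refs) > 1, stri_refs)
print(multi[order(names(multi))])
```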

That’s my primary defense for “just use stringi” — stringr “just uses” it and you are forced to install stringi on every system stringr is on, so why introduce another dependency into your code?

All Wrapped Up

These are the stringi functions (listed by their stringi names) that have a 1:~1 correspondence to stringr functions:

stri_c stri_conv stri_count stri_detect stri_dup stri_extract stri_extract_all stri_join stri_length stri_locate stri_locate_all stri_match stri_match_all stri_order stri_pad stri_replace stri_replace_all stri_replace_na stri_sort stri_split stri_split_fixed stri_sub stri_sub<- stri_subset stri_trim stri_wrap

I used 1:~1 since at the heart of the string processing capabilities of both packages lies the concept of granular control of matching behaviour. Specifically, there are four modes (so it’s really 1:4?):

fixed: Compare literal bytes in the string. This is very fast, but not usually what you want for non-ASCII character sets
coll: Compare strings respecting standard collation rules
regex: The default. Uses ICU regular expressions
boundary: Match boundaries between things
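To make the four modes concrete, here is a small sketch using the explicit stringi variants. The sample strings are made up for illustration, and the collator option shown (strength = 1, which ignores case differences) is just one of several settings you could pass.

```r
library(stringi)

x <- c("a.b", "axb")

# fixed: "." is treated as a literal character
fixed_hits <- stri_detect_fixed(x, ".")

# regex: "." matches any single character
regex_hits <- stri_detect_regex(x, "a.b")

# coll: collation-aware matching; strength = 1 ignores case differences
coll_hits <- stri_detect_coll("Lorem Ipsum", "IPSUM", strength = 1)

# boundary: split on ICU text boundaries (sentence boundaries here)
sentences <- stri_split_boundaries("One fish. Two fish.", type = "sentence")

print(fixed_hits)
print(regex_hits)
print(coll_hits)
print(sentences)
```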

stringr has function modifiers around pattern to handle those whereas stringi requires explicit function calls. So, you’d do the following to replace a fixed char/byte sequence in each package:

library(stringi)
library(stringr)
stri_replace_all_fixed("Lorem i.sum dolor sit amet, conse.tetur adipisicing elit.", ".", "#")
str_replace_all("Lorem i.sum dolor sit amet, conse.tetur adipisicing elit.", fixed("."), "#")

In that case there’s not much in the way of keystroke savings. The default mode of stringr is regex replacement, though, so there you do save both an i and a _regex, but you add one more function call between you and your goal. When you work with multi-gigabyte character structures (as I do), those milliseconds often add up. If keystrokes > milliseconds in your workflow, you may want to stick with stringr.
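Here is a rough sketch of the regex-default case and of how you might measure the wrapper overhead yourself. The vector size is an arbitrary choice for illustration, and the timings will differ by machine and package version, so no numbers are claimed.

```r
library(stringi)
library(stringr)

# An arbitrary test vector; scale it up to resemble your own workload
x <- rep("Lorem i.sum dolor sit amet, conse.tetur adipisicing elit.", 1e4)

# stringr treats the pattern as a regex by default, so the dot must be escaped
via_stringr <- str_replace_all(x, "\\.", "#")
via_stringi <- stri_replace_all_regex(x, "\\.", "#")

# Both routes should give the same answer
print(identical(via_stringr, via_stringi))

# Rough wall-clock comparison of the wrapper and the direct call
print(system.time(str_replace_all(x, "\\.", "#")))
print(system.time(stri_replace_all_regex(x, "\\.", "#")))
```

For anything more than a quick look, a dedicated benchmarking package would give steadier numbers than a single system.time() call.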

Treasure Hunting in stringi

If you take some time to look at what’s in stringi you’ll find quite a bit (I excluded the fixed/coll/regex/boundary versions for brevity):

That’s an SVG, so zoom in as much as you need to to read it.

These are stringi gems:

stri_stats_general (stats about a character vector)
stri_trans_totitle (For When You Want Title Case)
stri_flatten (paste0 but better defaults)
stri_rand_strings (random strings)
stri_rand_lipsum (random Lorem Ipsum lines!)
stri_count_words, stri_extract_all_words, stri_extract_first_words, stri_extract_last_words
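A quick tour of those gems, as a minimal sketch with made-up inputs. The random functions return different values on each run, so only their shape is inspected.

```r
library(stringi)

txt <- c("Lorem ipsum dolor", "sit amet")

# Line and character counts for a character vector
stats <- stri_stats_general(txt)

# Title case
titled <- stri_trans_totitle("for when you want title case")

# Collapse a vector into a single string
flat <- stri_flatten(c("a", "b", "c"), collapse = ", ")

# Three random strings of length 8, and two paragraphs of filler text
ids <- stri_rand_strings(3, 8)
filler <- stri_rand_lipsum(2)

# Word-level helpers
n_words <- stri_count_words("The quick brown fox")
first_word <- stri_extract_first_words("The quick brown fox")
last_word <- stri_extract_last_words("The quick brown fox")

print(stats)
print(titled)
print(flat)
print(ids)
print(n_words)
```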

Plus it has some helpful operators:

%s!=%, %s!==%, %s+%, %s<%, %s<=%, %s==%, %s===%, %s>%, %s>=%, %stri!=%, %stri!==%, %stri+%, %stri<%, %stri<=%, %stri==%, %stri===%, %stri>%, %stri>=%

Of those, %s+% is ++handy for string concatenation.
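A small sketch of %s+% in action. The variable names are just for illustration; the point worth noticing is that, unlike paste0, a missing value on either side gives a missing result.

```r
library(stringi)

# Plain concatenation
greeting <- "Hello, " %s+% "world"

# Vectorised, with the shorter side recycled
labels <- c("a", "b") %s+% "_suffix"

# NA propagates instead of becoming the text "NA"
with_na <- "x" %s+% NA_character_

print(greeting)
print(labels)
print(with_na)
```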

Prior to readr, these were my go-to line/raw readers/writer: stri_read_raw, stri_read_lines, and stri_write_lines.
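For reference, a round trip through those three functions might look like the sketch below. It writes to a temporary file, and the arguments are passed positionally because the argument names have changed between stringi versions.

```r
library(stringi)

path <- tempfile(fileext = ".txt")
lines_out <- c("alpha", "beta", "gamma")

# Write the lines as UTF-8, then read them back
stri_write_lines(lines_out, path, encoding = "UTF-8")
lines_in <- stri_read_lines(path, encoding = "UTF-8")

# The same file as a raw vector of bytes
raw_bytes <- stri_read_raw(path)

print(lines_in)
print(length(raw_bytes))

unlink(path)
```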

It also handles gnarly character encoding operations in a cross-platform, predictable manner.
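As one example of the encoding side, here is a minimal sketch that converts a UTF-8 string to ISO-8859-1 bytes and back, then transliterates it to plain ASCII. The sample word is my own choice; which encodings you actually need depends on your data.

```r
library(stringi)

utf8_text <- "caf\u00e9"

# Encode to ISO-8859-1 and keep the result as raw bytes
latin1_bytes <- stri_encode(utf8_text, from = "UTF-8", to = "ISO-8859-1",
                            to_raw = TRUE)[[1]]

# Decode those bytes back to UTF-8
round_trip <- stri_encode(latin1_bytes, from = "ISO-8859-1", to = "UTF-8")

# Transliterate accented Latin characters to plain ASCII
ascii_text <- stri_trans_general(utf8_text, "Latin-ASCII")

print(latin1_bytes)
print(round_trip)
print(ascii_text)
```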

FIN

To do a full comparison justice would have required writing a mini-book which is something I can’t spare cycles on, so my primary goals were to make sure folks knew stringr wrapped stringi and to show that stringi has much more to offer than you probably knew. If you start to get hooked on some of the more “fun” or utilitarian functions in stringi it’s probably worth switching to it. If string ops are ancillary operations to you and you normally work in regex-land, then you’re not missing out on anything and can save a few keystrokes here and there by using stringr.

Comments are extremely encouraged for this post as I’m curious if you knew about stringi before and when/where/how you use it vs stringr (or, why you don’t).
