Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I’ve tweeted this:
Just changed like 100 grepl calls to stringi::stri_detect and my pipeline now runs 4 times faster #RStats
— Bruno Rodrigues (@brodriguesco) July 20, 2022
much discussed ensued. Some people were surprised, because in their experience, grepl()
was faster than alternatives, especially if you set the perl parameter in grepl() to TRUE.
My use case was quite simple; I have a relatively large data set (half a million lines) with
one column with several misspelling of city names. So I painstakingly wrote some code
to correct the spelling of the major cities (those that came up often enough to matter. Minor
cities were set to “Other”. Sorry, Wiltz!)
So in this short blog post, I benchmark some code to see if what I did the other day was a fluke.
Maybe something weird with my R installation on my work laptop running Windows 10 somehow
made stri_detect() run faster than grepl()? I don’t even know if something like that is
possible. I’m writing these lines on my Linux machine, unlike the code I run at work.
So maybe if I find some differences, they could be due to the different OS running.
I don’t want to have to deal with Windows on my days off (for my blood pressure’s sake),
so I’m not running this benchmark on my work laptop. So that part we’ll never know.
Anyways, let’s start by getting some data. I’m not commenting the code below, because that’s not the point of this post.
library(dplyr)
library(stringi)
library(stringr)
library(re2)
adult <- vroom::vroom(
  "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
)
adult_colnames <- readLines(
  "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names"
)
adult_colnames <- adult_colnames[97:110] %>%
  str_extract(".*(?=:)") %>%
  str_replace_all("-", "_")
adult_colnames <- c(adult_colnames, "wage")
colnames(adult) <- adult_colnames
adult
## # A tibble: 32,560 × 15
##      age workclass    fnlwgt educa…¹ educa…² marit…³ occup…⁴ relat…⁵ race  sex  
##    <dbl> <chr>         <dbl> <chr>     <dbl> <chr>   <chr>   <chr>   <chr> <chr>
##  1    50 Self-emp-no…  83311 Bachel…      13 Marrie… Exec-m… Husband White Male 
##  2    38 Private      215646 HS-grad       9 Divorc… Handle… Not-in… White Male 
##  3    53 Private      234721 11th          7 Marrie… Handle… Husband Black Male 
##  4    28 Private      338409 Bachel…      13 Marrie… Prof-s… Wife    Black Fema…
##  5    37 Private      284582 Masters      14 Marrie… Exec-m… Wife    White Fema…
##  6    49 Private      160187 9th           5 Marrie… Other-… Not-in… Black Fema…
##  7    52 Self-emp-no… 209642 HS-grad       9 Marrie… Exec-m… Husband White Male 
##  8    31 Private       45781 Masters      14 Never-… Prof-s… Not-in… White Fema…
##  9    42 Private      159449 Bachel…      13 Marrie… Exec-m… Husband White Male 
## 10    37 Private      280464 Some-c…      10 Marrie… Exec-m… Husband Black Male 
## # … with 32,550 more rows, 5 more variables: capital_gain <dbl>,
## #   capital_loss <dbl>, hours_per_week <dbl>, native_country <chr>, wage <chr>,
## #   and abbreviated variable names ¹education, ²education_num, ³marital_status,
## #   ⁴occupation, ⁵relationship
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
Let’s now write the functions used for benchmarking. There will be 5 of them:
- One using grepl()without any fancy options;
- One using grepl()whereperlis set toTRUE;
- One that uses stringi::stri_detect();
- One that uses stringr::str_detect();
- One that uses re2::re2_detect().
Below you can read the functions. They’re all pretty much the same, only the function
looking for the string changes. These functions look for a string in the marital_status
variable and create a new variable with a corresponding integer.
with_grepl <- function(dataset){
  dataset |>
    mutate(married = case_when(
             grepl("Married", marital_status) ~ 1,
             grepl("married", marital_status) ~ 2,
             TRUE ~ 3)
           )
}
with_grepl_perl <- function(dataset){
  dataset |>
    mutate(married = case_when(
             grepl("Married", marital_status, perl = TRUE) ~ 1,
             grepl("married", marital_status, perl = TRUE) ~ 2,
             TRUE ~ 3)
           )
}
with_stringi <- function(dataset){
  dataset |>
    mutate(married = case_when(
             stri_detect(marital_status, regex = "Married") ~ 1,
             stri_detect(marital_status, regex = "married") ~ 2,
             TRUE ~ 3)
           )
}
with_stringr <- function(dataset){
  dataset |>
    mutate(married = case_when(
             str_detect(marital_status, "Married") ~ 1,
             str_detect(marital_status, "married") ~ 2,
             TRUE ~ 3)
           )
}
with_re2 <- function(dataset){
  dataset |>
    mutate(married = case_when(
             re2_detect(marital_status, "Married") ~ 1,
             re2_detect(marital_status, "married") ~ 2,
             TRUE ~ 3)
           )
}
Now I make extra sure these functions actually return the exact same thing. So for this
I’m running them once on the data and use testthat::expect_equal(). It’s a bit
unwieldy, so if you have a better way of doing this, please let me know.
run_grepl <- function(){
  with_grepl(adult) %>%
    count(married, marital_status)
}
one <- run_grepl()
run_grepl_perl <- function(){
  with_grepl_perl(adult) %>%
    count(married, marital_status)
}
two <- run_grepl_perl()
run_stringi <- function(){
  with_stringi(adult) %>%
    count(married, marital_status)
}
three <- run_stringi()
run_stringr <- function(){
  with_stringr(adult) %>%
    count(married, marital_status)
}
four <- run_stringr()
run_re2 <- function(){
  with_re2(adult) %>%
    count(married, marital_status)
}
five <- run_re2()
one_eq_two <- testthat::expect_equal(one, two)
one_eq_three <- testthat::expect_equal(one, three)
three_eq_four <- testthat::expect_equal(three, four)
testthat::expect_equal(
            one_eq_two,
            one_eq_three
          )
testthat::expect_equal(
            one_eq_three,
            three_eq_four
          )
testthat::expect_equal(
            one,
            five)
testthat::expect_equal() does not complain, so I’m pretty sure my functions, while different,
return the exact same thing. Now, we’re ready for the benchmark itself. Let’s run these
function 500 times using {microbenchmark}:
microbenchmark::microbenchmark(
     run_grepl(),
     run_grepl_perl(),
     run_stringi(),
     run_stringr(),
     run_re2(),
     times = 500
)
## Unit: milliseconds
##              expr      min       lq     mean   median       uq      max neval
##       run_grepl() 24.37832 24.89573 26.64820 25.50033 27.05967 115.0769   500
##  run_grepl_perl() 19.03446 19.41323 20.91045 19.89093 21.16683 104.3917   500
##     run_stringi() 23.01141 23.40151 25.00304 23.82441 24.83598 104.8065   500
##     run_stringr() 22.98317 23.44332 25.32851 23.92721 25.18168 145.5861   500
##         run_re2() 22.22656 22.60817 24.07254 23.05895 24.22048 108.6825   500
There you have it folks! The winner is grepl() with perl = TRUE, and then it’s
pretty much tied between stringi(), stringr() and re2() (maybe there’s a slight edge
for re2()) and grepl() without perl = TRUE is last. But don’t forget that this is running
on my machine with Linux installed on it; maybe you’ll get different results on different
hardware and OSs! So if you rely a lot on grepl() and other such string manipulation
function, maybe run a benchmark on your hardware first. How come switching from grepl()
(without perl = TRUE though) to stri_detect() made my pipeline at work run 4 times
faster I don’t know. Maybe it has also to do with the size of the data, and the complexity
of the regular expression used to detect the problematic strings?
Hope you enjoyed! If you found this blog post useful, you might want to follow me on twitter for blog post updates and buy me an espresso or paypal.me, or buy my ebook on Leanpub. You can also watch my videos on youtube. So much content for you to consoom!
