What’s the fastest way to search and replace strings in a data frame?

Posted on July 22, 2022 by Econometrics and Free Software in R bloggers | 0 Comments

[This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I’ve tweeted this:

Just changed like 100 grepl calls to stringi::stri_detect and my pipeline now runs 4 times faster #RStats
— Bruno Rodrigues (@brodriguesco) July 20, 2022

much discussed ensued. Some people were surprised, because in their experience, grepl() was faster than alternatives, especially if you set the perl parameter in grepl() to TRUE. My use case was quite simple; I have a relatively large data set (half a million lines) with one column with several misspelling of city names. So I painstakingly wrote some code to correct the spelling of the major cities (those that came up often enough to matter. Minor cities were set to “Other”. Sorry, Wiltz!)

So in this short blog post, I benchmark some code to see if what I did the other day was a fluke. Maybe something weird with my R installation on my work laptop running Windows 10 somehow made stri_detect() run faster than grepl()? I don’t even know if something like that is possible. I’m writing these lines on my Linux machine, unlike the code I run at work. So maybe if I find some differences, they could be due to the different OS running. I don’t want to have to deal with Windows on my days off (for my blood pressure’s sake), so I’m not running this benchmark on my work laptop. So that part we’ll never know.

Anyways, let’s start by getting some data. I’m not commenting the code below, because that’s not the point of this post.

library(dplyr)
library(stringi)
library(stringr)
library(re2)

adult <- vroom::vroom(
  "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
)

adult_colnames <- readLines(
  "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names"
)

adult_colnames <- adult_colnames[97:110] %>%
  str_extract(".*(?=:)") %>%
  str_replace_all("-", "_")

adult_colnames <- c(adult_colnames, "wage")

colnames(adult) <- adult_colnames

adult
## # A tibble: 32,560 × 15
##      age workclass    fnlwgt educa…¹ educa…² marit…³ occup…⁴ relat…⁵ race  sex  
##    <dbl> <chr>         <dbl> <chr>     <dbl> <chr>   <chr>   <chr>   <chr> <chr>
##  1    50 Self-emp-no…  83311 Bachel…      13 Marrie… Exec-m… Husband White Male 
##  2    38 Private      215646 HS-grad       9 Divorc… Handle… Not-in… White Male 
##  3    53 Private      234721 11th          7 Marrie… Handle… Husband Black Male 
##  4    28 Private      338409 Bachel…      13 Marrie… Prof-s… Wife    Black Fema…
##  5    37 Private      284582 Masters      14 Marrie… Exec-m… Wife    White Fema…
##  6    49 Private      160187 9th           5 Marrie… Other-… Not-in… Black Fema…
##  7    52 Self-emp-no… 209642 HS-grad       9 Marrie… Exec-m… Husband White Male 
##  8    31 Private       45781 Masters      14 Never-… Prof-s… Not-in… White Fema…
##  9    42 Private      159449 Bachel…      13 Marrie… Exec-m… Husband White Male 
## 10    37 Private      280464 Some-c…      10 Marrie… Exec-m… Husband Black Male 
## # … with 32,550 more rows, 5 more variables: capital_gain <dbl>,
## #   capital_loss <dbl>, hours_per_week <dbl>, native_country <chr>, wage <chr>,
## #   and abbreviated variable names ¹education, ²education_num, ³marital_status,
## #   ⁴occupation, ⁵relationship
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

Let’s now write the functions used for benchmarking. There will be 5 of them:

One using grepl() without any fancy options;
One using grepl() where perl is set to TRUE;
One that uses stringi::stri_detect();
One that uses stringr::str_detect();
One that uses re2::re2_detect().

Below you can read the functions. They’re all pretty much the same, only the function looking for the string changes. These functions look for a string in the marital_status variable and create a new variable with a corresponding integer.

with_grepl <- function(dataset){
  dataset |>
    mutate(married = case_when(
             grepl("Married", marital_status) ~ 1,
             grepl("married", marital_status) ~ 2,
             TRUE ~ 3)
           )
}

with_grepl_perl <- function(dataset){
  dataset |>
    mutate(married = case_when(
             grepl("Married", marital_status, perl = TRUE) ~ 1,
             grepl("married", marital_status, perl = TRUE) ~ 2,
             TRUE ~ 3)
           )
}

with_stringi <- function(dataset){
  dataset |>
    mutate(married = case_when(
             stri_detect(marital_status, regex = "Married") ~ 1,
             stri_detect(marital_status, regex = "married") ~ 2,
             TRUE ~ 3)
           )
}

with_stringr <- function(dataset){
  dataset |>
    mutate(married = case_when(
             str_detect(marital_status, "Married") ~ 1,
             str_detect(marital_status, "married") ~ 2,
             TRUE ~ 3)
           )
}

with_re2 <- function(dataset){
  dataset |>
    mutate(married = case_when(
             re2_detect(marital_status, "Married") ~ 1,
             re2_detect(marital_status, "married") ~ 2,
             TRUE ~ 3)
           )
}

Now I make extra sure these functions actually return the exact same thing. So for this I’m running them once on the data and use testthat::expect_equal(). It’s a bit unwieldy, so if you have a better way of doing this, please let me know.

run_grepl <- function(){
  with_grepl(adult) %>%
    count(married, marital_status)
}

one <- run_grepl()

run_grepl_perl <- function(){
  with_grepl_perl(adult) %>%
    count(married, marital_status)
}

two <- run_grepl_perl()

run_stringi <- function(){
  with_stringi(adult) %>%
    count(married, marital_status)
}

three <- run_stringi()

run_stringr <- function(){
  with_stringr(adult) %>%
    count(married, marital_status)
}

four <- run_stringr()

run_re2 <- function(){
  with_re2(adult) %>%
    count(married, marital_status)
}

five <- run_re2()

one_eq_two <- testthat::expect_equal(one, two)
one_eq_three <- testthat::expect_equal(one, three)
three_eq_four <- testthat::expect_equal(three, four)

testthat::expect_equal(
            one_eq_two,
            one_eq_three
          )

testthat::expect_equal(
            one_eq_three,
            three_eq_four
          )

testthat::expect_equal(
            one,
            five)

testthat::expect_equal() does not complain, so I’m pretty sure my functions, while different, return the exact same thing. Now, we’re ready for the benchmark itself. Let’s run these function 500 times using {microbenchmark}:

microbenchmark::microbenchmark(
     run_grepl(),
     run_grepl_perl(),
     run_stringi(),
     run_stringr(),
     run_re2(),
     times = 500
)
## Unit: milliseconds
##              expr      min       lq     mean   median       uq      max neval
##       run_grepl() 24.37832 24.89573 26.64820 25.50033 27.05967 115.0769   500
##  run_grepl_perl() 19.03446 19.41323 20.91045 19.89093 21.16683 104.3917   500
##     run_stringi() 23.01141 23.40151 25.00304 23.82441 24.83598 104.8065   500
##     run_stringr() 22.98317 23.44332 25.32851 23.92721 25.18168 145.5861   500
##         run_re2() 22.22656 22.60817 24.07254 23.05895 24.22048 108.6825   500

There you have it folks! The winner is grepl() with perl = TRUE, and then it’s pretty much tied between stringi(), stringr() and re2() (maybe there’s a slight edge for re2()) and grepl() without perl = TRUE is last. But don’t forget that this is running on my machine with Linux installed on it; maybe you’ll get different results on different hardware and OSs! So if you rely a lot on grepl() and other such string manipulation function, maybe run a benchmark on your hardware first. How come switching from grepl() (without perl = TRUE though) to stri_detect() made my pipeline at work run 4 times faster I don’t know. Maybe it has also to do with the size of the data, and the complexity of the regular expression used to detect the problematic strings?

Hope you enjoyed! If you found this blog post useful, you might want to follow me on twitter for blog post updates and buy me an espresso or paypal.me, or buy my ebook on Leanpub. You can also watch my videos on youtube. So much content for you to consoom!

Buy me an Espresso

To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

What’s the fastest way to search and replace strings in a data frame?

Related

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)