String Manipulation with Stringr

[This article was first published on Blog on Data Solutions | Dedicated to helping businesses making data-driven decisions, and kindly contributed to R-bloggers.]
    What is a string?

    In coding, strings, also known as character strings, are sequences of characters surrounded by quotation marks. The characters are often letters but can also include digits, spaces, and punctuation. Character strings are commonly used for names and for categorizing data, so it is extremely useful to learn how to work with this type easily.

    Basic string manipulation

    "hello world" is an example of a basic character string. It contains two words surrounded by quotation marks. Typically we work with more than one string at a time, stored in a vector or in one or more columns of a dataframe. The main package we will use to manipulate strings is stringr. There are of course other ways to do these operations, but stringr is the tidyverse approach to string manipulation, so its grammar and structure are consistent with other tidyverse packages.

    First we will create a vector of strings called fruits that contains the names of 5 fruits.

    library(dplyr)
    ## 
    ## Attaching package: 'dplyr'
    ## The following objects are masked from 'package:stats':
    ## 
    ##     filter, lag
    ## The following objects are masked from 'package:base':
    ## 
    ##     intersect, setdiff, setequal, union
    library(stringr)
    fruits <- c("Apple", "Banana", "Kiwi", "Pineapple", "Grape")

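    If we want some summary information about our vector, we can use str_length(), which returns the number of characters in each string. str_count() counts the matches of a pattern in each string; called without a pattern it matches every character, so here it gives the same result.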

    str_count(fruits)
    ## [1] 5 6 4 9 5
    str_length(fruits)
    ## [1] 5 6 4 9 5
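The two functions only agree when str_count() is called without a pattern. As a minimal sketch (the pattern "p" and the variable name p_counts are chosen here purely for illustration), this is how str_count() behaves once a pattern is supplied:

```r
library(stringr)

fruits <- c("Apple", "Banana", "Kiwi", "Pineapple", "Grape")

# With an explicit pattern, str_count() counts the matches of that pattern
# in each string. Matching is case sensitive, so the capital "P" in
# "Pineapple" is not counted.
p_counts <- str_count(fruits, "p")
p_counts
## [1] 2 0 0 2 1

# str_length() always returns the number of characters in each string.
str_length(fruits)
## [1] 5 6 4 9 5
```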

    Now, say we want to answer a few questions:

    ## Which strings end with the letter e?
    str_ends(fruits, "e")
    ## [1]  TRUE FALSE FALSE  TRUE  TRUE
    ## Which strings start with the letter A? (matching is case sensitive)
    str_starts(fruits, "A")
    ## [1]  TRUE FALSE FALSE FALSE FALSE
    ## Which strings contain "pple"?
    str_detect(fruits, "pple")
    ## [1]  TRUE FALSE FALSE  TRUE FALSE
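Because these functions are case sensitive, a lowercase pattern will not match a capitalized fruit name. The sketch below shows two ways to relax that, assuming you want to ignore case: wrapping the pattern in regex(ignore_case = TRUE), or lowercasing the strings first. The variable names strict, relaxed, and lowered are illustrative only.

```r
library(stringr)

fruits <- c("Apple", "Banana", "Kiwi", "Pineapple", "Grape")

# Case-sensitive: no fruit starts with a lowercase "p".
strict <- str_starts(fruits, "p")
strict
## [1] FALSE FALSE FALSE FALSE FALSE

# Option 1: tell the regex engine to ignore case.
relaxed <- str_starts(fruits, regex("p", ignore_case = TRUE))
relaxed
## [1] FALSE FALSE FALSE  TRUE FALSE

# Option 2: convert the strings to lowercase before matching.
lowered <- str_starts(str_to_lower(fruits), "p")
```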

    You can also quickly convert all letters to the same case with str_to_lower() or str_to_upper(), which is handy for making strings uniform so they are easier to match or group by later.

    str_to_lower(fruits)
    ## [1] "apple"     "banana"    "kiwi"      "pineapple" "grape"
    str_to_upper(fruits)
    ## [1] "APPLE"     "BANANA"    "KIWI"      "PINEAPPLE" "GRAPE"
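To see why uniform case helps with grouping, here is a small sketch using a made-up vector (messy is hypothetical data, not part of the original example) in which the same fruit is spelled with different capitalization:

```r
library(stringr)

# Hypothetical input with inconsistent capitalization.
messy <- c("apple", "Apple", "APPLE", "kiwi", "Kiwi")

# Without cleaning, every spelling counts as its own category.
length(unique(messy))
## [1] 5

# After lowercasing, the spellings collapse into two groups.
cleaned <- str_to_lower(messy)
counts <- table(cleaned)
counts
## cleaned
## apple  kiwi 
##     3     2
```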

    Now let's create a new vector of day labels. We will:

    1. Replace the first part of each string with the word “sample”
    2. Split each string into two separate strings
    3. Pull out the number from each string

    library(purrr)
    myString <- c("Day_01", "Day_02", "Day_03", "Day_04")
    
    myString %>% 
      str_replace(pattern = "Day", replacement = "sample") %>% 
      str_split(pattern = "_") %>% 
      map(2)
    ## [[1]]
    ## [1] "01"
    ## 
    ## [[2]]
    ## [1] "02"
    ## 
    ## [[3]]
    ## [1] "03"
    ## 
    ## [[4]]
    ## [1] "04"

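    Notice that because stringr functions take the string as their first argument, as is the convention in the tidyverse, I was able to pipe each function one after the other. The result is a list because str_split() returns one character vector per input string, and map(2) then picks the second piece of each.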
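If a plain character vector is more convenient than a list, there are a couple of options. This is a sketch rather than the author's method: map_chr() from purrr returns a character vector instead of a list, and str_extract() with a digit pattern pulls the number out directly without splitting. The names via_split, via_extract, and day_number are illustrative.

```r
library(stringr)
library(purrr)

myString <- c("Day_01", "Day_02", "Day_03", "Day_04")

# Same split as before, but map_chr() simplifies the result to a character vector.
via_split <- myString %>%
  str_split(pattern = "_") %>%
  map_chr(2)
via_split
## [1] "01" "02" "03" "04"

# Alternative: extract the first run of digits from each string.
via_extract <- str_extract(myString, "[0-9]+")

# Convert to integers if the day number is needed for arithmetic or sorting.
day_number <- as.integer(via_extract)
day_number
## [1] 1 2 3 4
```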

    Advanced string manipulation

    Now we are going to work with strings that are in a column in a dataframe and we will learn how to subset a dataframe by the strings we want, how to change strings, and how to find where certain strings are.

    We will use the murders data set from the dslabs package, which contains the population and the total number of gun murders in 2010 for each US state and the District of Columbia.

    library(dslabs)
    data("murders")
    head(murders)
    ##        state abb region population total
    ## 1    Alabama  AL  South    4779736   135
    ## 2     Alaska  AK   West     710231    19
    ## 3    Arizona  AZ   West    6392017   232
    ## 4   Arkansas  AR  South    2915918    93
    ## 5 California  CA   West   37253956  1257
    ## 6   Colorado  CO   West    5029196    65

    This dataframe has two character columns (state and abb), and the region column is a factor that can easily be converted to character. Often you will want to pull out only the rows of a dataframe that meet certain criteria. To do that based on strings, we combine filter() with str_detect(), which returns TRUE for every string that matches a pattern. The pattern is a case-sensitive regular expression, so | means "or". Below we pull out the rows for several groups of states:

    ## States whose names contain a capital A (here, exactly the states that start with A)
    murders %>% 
      filter(str_detect(string = state, pattern = "A"))
    ##      state abb region population total
    ## 1  Alabama  AL  South    4779736   135
    ## 2   Alaska  AK   West     710231    19
    ## 3  Arizona  AZ   West    6392017   232
    ## 4 Arkansas  AR  South    2915918    93
    ## States whose names contain a capital A or C (anywhere in the name)
    murders %>% 
      filter(str_detect(string = state, pattern = "A|C"))
    ##                   state abb    region population total
    ## 1               Alabama  AL     South    4779736   135
    ## 2                Alaska  AK      West     710231    19
    ## 3               Arizona  AZ      West    6392017   232
    ## 4              Arkansas  AR     South    2915918    93
    ## 5            California  CA      West   37253956  1257
    ## 6              Colorado  CO      West    5029196    65
    ## 7           Connecticut  CT Northeast    3574097    97
    ## 8  District of Columbia  DC     South     601723    99
    ## 9        North Carolina  NC     South    9535483   286
    ## 10       South Carolina  SC     South    4625364   207
    ## States that are in states.of.interest
    states.of.interest <- c("Texas", 
                            "Louisiana", 
                            "Mississippi", 
                            "Alabama", 
                            "Florida")
    states.of.interest <- paste(states.of.interest, collapse="|") 
    ## collapse the vector into a single pattern, with the | symbol ("or") between the state names
    states.of.interest
    ## [1] "Texas|Louisiana|Mississippi|Alabama|Florida"
    murders %>% 
      filter(str_detect(string = state, pattern = states.of.interest))
    ##         state abb region population total
    ## 1     Alabama  AL  South    4779736   135
    ## 2     Florida  FL  South   19687653   669
    ## 3   Louisiana  LA  South    4533372   351
    ## 4 Mississippi  MS  South    2967297   120
    ## 5       Texas  TX  South   25145561   805
    ## States whose names contain neither a capital A nor a capital C
    murders %>% 
      filter(str_detect(string = state, pattern = "A|C", negate = TRUE))
    ##            state abb        region population total
    ## 1       Delaware  DE         South     897934    38
    ## 2        Florida  FL         South   19687653   669
    ## 3        Georgia  GA         South    9920000   376
    ## 4         Hawaii  HI          West    1360301     7
    ## 5          Idaho  ID          West    1567582    12
    ## 6       Illinois  IL North Central   12830632   364
    ## 7        Indiana  IN North Central    6483802   142
    ## 8           Iowa  IA North Central    3046355    21
    ## 9         Kansas  KS North Central    2853118    63
    ## 10      Kentucky  KY         South    4339367   116
    ## 11     Louisiana  LA         South    4533372   351
    ## 12         Maine  ME     Northeast    1328361    11
    ## 13      Maryland  MD         South    5773552   293
    ## 14 Massachusetts  MA     Northeast    6547629   118
    ## 15      Michigan  MI North Central    9883640   413
    ## 16     Minnesota  MN North Central    5303925    53
    ## 17   Mississippi  MS         South    2967297   120
    ## 18      Missouri  MO North Central    5988927   321
    ## 19       Montana  MT          West     989415    12
    ## 20      Nebraska  NE North Central    1826341    32
    ## 21        Nevada  NV          West    2700551    84
    ## 22 New Hampshire  NH     Northeast    1316470     5
    ## 23    New Jersey  NJ     Northeast    8791894   246
    ## 24    New Mexico  NM          West    2059179    67
    ## 25      New York  NY     Northeast   19378102   517
    ## 26  North Dakota  ND North Central     672591     4
    ## 27          Ohio  OH North Central   11536504   310
    ## 28      Oklahoma  OK         South    3751351   111
    ## 29        Oregon  OR          West    3831074    36
    ## 30  Pennsylvania  PA     Northeast   12702379   457
    ## 31  Rhode Island  RI     Northeast    1052567    16
    ## 32  South Dakota  SD North Central     814180     8
    ## 33     Tennessee  TN         South    6346105   219
    ## 34         Texas  TX         South   25145561   805
    ## 35          Utah  UT          West    2763885    22
    ## 36       Vermont  VT     Northeast     625741     2
    ## 37      Virginia  VA         South    8001024   250
    ## 38    Washington  WA          West    6724540    93
    ## 39 West Virginia  WV         South    1852994    27
    ## 40     Wisconsin  WI North Central    5686986    97
    ## 41       Wyoming  WY          West     563626     5

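    In the last example, the negate = TRUE argument returns the rows that do not match the pattern, here the states whose names contain neither a capital A nor a capital C. Note that the patterns "A" and "A|C" match those letters anywhere in the name, not only at the start, which is why District of Columbia, North Carolina, and South Carolina appear in the second result. Sometimes it is easier to tell R which rows you don’t want rather than which ones you do want, and this is where the negate argument is useful.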
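To restrict a match to the start of the name, the pattern can be anchored with ^. The sketch below illustrates this on state.name, a vector of the 50 state names that ships with base R; it is used here instead of the murders data only to keep the example self-contained, and unlike murders it does not include the District of Columbia. It also shows that a regular expression matches partial strings, so %in% is the safer choice when exact names are wanted.

```r
library(stringr)

# Unanchored: a capital A or C anywhere in the name.
unanchored <- str_subset(state.name, "A|C")

# Anchored: the name must start with A or C.
anchored <- str_subset(state.name, "^(A|C)")
anchored
## [1] "Alabama"     "Alaska"      "Arizona"     "Arkansas"    "California"
## [6] "Colorado"    "Connecticut"

# The names that only the unanchored pattern picks up.
setdiff(unanchored, anchored)
## [1] "North Carolina" "South Carolina"

# negate = TRUE keeps the names that do NOT start with A or C.
not_a_or_c <- str_subset(state.name, "^(A|C)", negate = TRUE)
length(not_a_or_c)
## [1] 43

# Partial versus exact matching.
partial <- str_subset(state.name, "Virginia")
partial
## [1] "Virginia"      "West Virginia"
exact <- state.name[state.name %in% "Virginia"]
exact
## [1] "Virginia"
```

The same anchored pattern can be passed to str_detect() inside filter() when working with a dataframe column.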

    You may also need to change the strings in a column. For example, in the murders dataframe we will rename the regions so that each is a single lowercase word. To do this we combine mutate() with str_replace(), which replaces the first match in each string, or str_replace_all(), which replaces every match and also accepts a named vector of pattern = replacement pairs.

    murders %>% 
      distinct(region)
    ##          region
    ## 1         South
    ## 2          West
    ## 3     Northeast
    ## 4 North Central
    # to just change one region
    murders %>% 
      mutate(region = str_replace(string = region, 
                                  pattern = "South", 
                                  replacement = "south")) %>% 
      head()
    ##        state abb region population total
    ## 1    Alabama  AL  south    4779736   135
    ## 2     Alaska  AK   West     710231    19
    ## 3    Arizona  AZ   West    6392017   232
    ## 4   Arkansas  AR  south    2915918    93
    ## 5 California  CA   West   37253956  1257
    ## 6   Colorado  CO   West    5029196    65
    # to change them all at the same time
    murders %>% 
      mutate(region = str_replace_all(string = region, c("South" = "south",
                                                         "West" = "west",
                                                         "North Central" = "north_central",
                                                         "Northeast" = "northeast"))) %>% 
      head(n = 20)
    ##                   state abb        region population total
    ## 1               Alabama  AL         south    4779736   135
    ## 2                Alaska  AK          west     710231    19
    ## 3               Arizona  AZ          west    6392017   232
    ## 4              Arkansas  AR         south    2915918    93
    ## 5            California  CA          west   37253956  1257
    ## 6              Colorado  CO          west    5029196    65
    ## 7           Connecticut  CT     northeast    3574097    97
    ## 8              Delaware  DE         south     897934    38
    ## 9  District of Columbia  DC         south     601723    99
    ## 10              Florida  FL         south   19687653   669
    ## 11              Georgia  GA         south    9920000   376
    ## 12               Hawaii  HI          west    1360301     7
    ## 13                Idaho  ID          west    1567582    12
    ## 14             Illinois  IL north_central   12830632   364
    ## 15              Indiana  IN north_central    6483802   142
    ## 16                 Iowa  IA north_central    3046355    21
    ## 17               Kansas  KS north_central    2853118    63
    ## 18             Kentucky  KY         south    4339367   116
    ## 19            Louisiana  LA         south    4533372   351
    ## 20                Maine  ME     northeast    1328361    11
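When the rule is the same for every value (lowercase everything and replace spaces with underscores), listing each replacement by hand is not strictly necessary. A minimal sketch of a rule-based alternative, using a standalone regions vector that mirrors the four values returned by distinct(region) and an illustrative name tidy_regions:

```r
library(stringr)

# The four region labels from the murders data, as a plain character vector.
regions <- c("South", "West", "Northeast", "North Central")

# Lowercase every label, then replace each space with an underscore.
tidy_regions <- regions %>%
  str_to_lower() %>%
  str_replace_all(pattern = " ", replacement = "_")
tidy_regions
## [1] "south"         "west"          "northeast"     "north_central"
```

The named-vector form is still the better fit when each value needs its own custom replacement.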

    Lastly, if you want to find which rows contain certain strings, you can use the str_which() function, which returns the positions of the matching elements. For example, to find the row indices of the states in the South region:

    murders %>% 
      ## creating a new column with the state name and region together just for the example
      mutate(state_region = paste(state, region, sep = "_")) %>% 
      pull(state_region) %>% 
      str_which(pattern = "_South")
    ##  [1]  1  4  8  9 10 11 18 19 21 25 34 37 41 43 44 47 49
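str_which() is closely related to two other stringr functions, which the following sketch compares on a short made-up vector in the same state_region format (labels is hypothetical data). It also shows why the underscore in the pattern matters: without it, a state whose own name contains "South" would match too.

```r
library(stringr)

# Hypothetical labels in the same "state_region" format as above.
labels <- c("Alabama_South", "Alaska_West",
            "South Dakota_North Central", "Arkansas_South")

# Positions of the elements whose region part is South.
idx <- str_which(labels, "_South")
idx
## [1] 1 4

# str_which() is equivalent to which() applied to str_detect().
same <- which(str_detect(labels, "_South"))

# Without the underscore, "South Dakota" is matched as well.
too_broad <- str_which(labels, "South")
too_broad
## [1] 1 3 4

# str_subset() returns the matching strings themselves instead of positions.
matched <- str_subset(labels, "_South")
matched
## [1] "Alabama_South"  "Arkansas_South"
```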
