# Locating parts of a string with `stringr`

This article was first published on **R on Jorge Cimentada** and kindly contributed to R-bloggers.


I was wandering the realms of Stack Overflow answering some questions when I encountered one that asked how to extract parts of a string based on a regex. I thought I knew how to do this with the package `stringr` using, for example, `str_sub`, but I found it a bit difficult to map out how `str_locate` complements `str_sub`.

`str_locate` and `str_locate_all` give back the locations of your regex inside the desired string as a `matrix` or a `list` respectively. That didn’t look very intuitive to pass on to `str_sub`, which (I thought) only accepted numeric vectors with the start and end indices of the parts of the strings that you want to extract. To my surprise, `str_sub` accepts not only numeric vectors but also a matrix with two columns, which is precisely the result of `str_locate`.
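To make the idea concrete before moving to the random strings, here is a minimal sketch on three toy strings. The vector `fruit` and the pattern `"an|pp"` are illustrative assumptions, not part of the original example.

```r
library(stringr)

# Hypothetical toy strings, chosen only for illustration
fruit <- c("apple pie", "banana split", "cherry")

# One row per string: start and end of the first match (NA when none)
loc <- str_locate(fruit, "an|pp")
loc

# A two-column matrix can be passed directly as `start`...
from_matrix <- str_sub(fruit, loc)

# ...or the two columns can be passed separately as `start` and `end`
from_columns <- str_sub(fruit, loc[, "start"], loc[, "end"])

from_matrix
from_columns
```

Both calls should return the same character vector, with `NA` for the string that has no match.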

Let’s create a set of random strings from which we want to extract the word `special*word`, where `*` represents a random number.

    library(stringr)

    test_string <- replicate(
      100,
      paste0(
        sample(c(letters, LETTERS, paste0("special", sample(1:10, 1), "word")), 15),
        collapse = ""
      )
    )

    head(test_string)

    ## [1] "pZTQHcDVObnaCFS"             "qBxfbIHjauyEmgspecial10word"
    ## [3] "TKgbmQAEFoJHOVh"             "VoBdUAuzfPrmCGX"
    ## [5] "dGgJOspecial5wordiFpbvXzUD"  "WOfLjNospecial4wordEeGkyTA"

`str_locate` returns a matrix with the position of the first match in **every string**, one row per string. This is what’s called a **vectorised** function in R.

    location_matrix <- str_locate(test_string, pattern = "special[0-9]word")
    head(location_matrix)

    ##      start end
    ## [1,]    NA  NA
    ## [2,]    NA  NA
    ## [3,]    NA  NA
    ## [4,]    NA  NA
    ## [5,]     6  17
    ## [6,]     8  19
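Note that row 2 is `NA` even though the second string contains `special10word`: `[0-9]` matches exactly one digit, so a two-digit number is not picked up. The sketch below, using that string copied from the output above, contrasts the original pattern with a variant that adds the `+` quantifier. The variant is only an illustration and is not used in the rest of the post.

```r
library(stringr)

# Second string from the head(test_string) output above
x <- "qBxfbIHjauyEmgspecial10word"

# [0-9] matches a single digit, so "10" between the words does not match
loc_single <- str_locate(x, "special[0-9]word")
loc_single

# [0-9]+ matches one or more digits
loc_multi <- str_locate(x, "special[0-9]+word")
loc_multi

str_sub(x, loc_multi)
```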

Each string here contains at most one match, so it isn’t needed for this example, but I was also interested in checking how the result of `str_locate_all` would fit in this workflow. `str_locate_all` does the same as `str_locate` but, since it can find **more** than one match per string, it returns a list with one element per string in `test_string`, each holding a matrix with the indices of the matches. Strings without `special*word` come back as a matrix with zero rows, so we need to replace those with a row of `NA` before binding everything into a single matrix:

    location_list <-
      str_locate_all(test_string, pattern = "special[0-9]word") %>%
      lapply(function(.x) if (all(is.na(.x))) matrix(c(NA, NA), ncol = 2) else .x) %>%
      {do.call(rbind, .)}

    head(location_list)

    ##      start end
    ## [1,]    NA  NA
    ## [2,]    NA  NA
    ## [3,]    NA  NA
    ## [4,]    NA  NA
    ## [5,]     6  17
    ## [6,]     8  19
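The padding step is easier to follow on a small input. The sketch below uses three hypothetical strings with zero, one and two matches to show what `str_locate_all` returns and what happens once the matrices are stacked. The names `x`, `locs`, `padded` and `stacked` are illustrative.

```r
library(stringr)

# Hypothetical toy strings: no match, one match, two matches
x <- c("abc", "a1b", "a1b2")

locs <- str_locate_all(x, "[0-9]")

# Rows per list element = number of matches in each string
n_matches <- vapply(locs, nrow, integer(1))
n_matches

# Replace zero-row matrices with a single row of NA, then stack everything
padded <- lapply(locs, function(m) {
  if (nrow(m) == 0) matrix(NA_integer_, nrow = 1, ncol = 2) else m
})
stacked <- do.call(rbind, padded)
stacked

# The stacked matrix has one row per match (plus the NA rows),
# so it only lines up with `x` when no string has more than one match
nrow(stacked)
length(x)
```

In the post's data every string has at most one match, which is why the stacked matrix can be passed to `str_sub` alongside `test_string`.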

Now that we have everything ready, `str_sub` can give us the desired results using either numeric vectors or the entire matrix:

    # Using numeric vectors from str_locate
    str_sub(test_string, location_matrix[, 1], location_matrix[, 2])

    ##  [1] NA             NA             NA             NA             "special5word"
    ##  [6] "special4word" NA             NA             "special5word" NA
    ## [11] NA             NA             NA             NA             NA
    ## [16] NA             NA             NA             NA             NA
    ## [21] NA             NA             NA             "special5word" "special6word"
    ## [26] NA             NA             NA             NA             NA
    ## [31] "special4word" NA             NA             NA             NA
    ## [36] NA             NA             NA             "special7word" NA
    ## [41] NA             NA             NA             NA             NA
    ## [46] NA             NA             NA             NA             NA
    ## [51] NA             NA             NA             NA             NA
    ## [56] NA             NA             NA             NA             NA
    ## [61] NA             NA             "special4word" NA             NA
    ## [66] NA             NA             NA             NA             NA
    ## [71] NA             NA             NA             "special7word" "special9word"
    ## [76] NA             NA             NA             NA             NA
    ## [81] "special4word" NA             NA             "special5word" NA
    ## [86] NA             NA             NA             "special9word" "special9word"
    ## [91] NA             NA             NA             NA             NA
    ## [96] "special6word" NA             NA             "special3word" "special1word"

    # Using numeric vectors from str_locate_all
    str_sub(test_string, location_list[, 1], location_list[, 2])

    ##  [1] NA             NA             NA             NA             "special5word"
    ##  [6] "special4word" NA             NA             "special5word" NA
    ## [11] NA             NA             NA             NA             NA
    ## [16] NA             NA             NA             NA             NA
    ## [21] NA             NA             NA             "special5word" "special6word"
    ## [26] NA             NA             NA             NA             NA
    ## [31] "special4word" NA             NA             NA             NA
    ## [36] NA             NA             NA             "special7word" NA
    ## [41] NA             NA             NA             NA             NA
    ## [46] NA             NA             NA             NA             NA
    ## [51] NA             NA             NA             NA             NA
    ## [56] NA             NA             NA             NA             NA
    ## [61] NA             NA             "special4word" NA             NA
    ## [66] NA             NA             NA             NA             NA
    ## [71] NA             NA             NA             "special7word" "special9word"
    ## [76] NA             NA             NA             NA             NA
    ## [81] "special4word" NA             NA             "special5word" NA
    ## [86] NA             NA             NA             "special9word" "special9word"
    ## [91] NA             NA             NA             NA             NA
    ## [96] "special6word" NA             NA             "special3word" "special1word"

    # Using the entire matrix
    str_sub(test_string, location_matrix)

    ##  [1] NA             NA             NA             NA             "special5word"
    ##  [6] "special4word" NA             NA             "special5word" NA
    ## [11] NA             NA             NA             NA             NA
    ## [16] NA             NA             NA             NA             NA
    ## [21] NA             NA             NA             "special5word" "special6word"
    ## [26] NA             NA             NA             NA             NA
    ## [31] "special4word" NA             NA             NA             NA
    ## [36] NA             NA             NA             "special7word" NA
    ## [41] NA             NA             NA             NA             NA
    ## [46] NA             NA             NA             NA             NA
    ## [51] NA             NA             NA             NA             NA
    ## [56] NA             NA             NA             NA             NA
    ## [61] NA             NA             "special4word" NA             NA
    ## [66] NA             NA             NA             NA             NA
    ## [71] NA             NA             NA             "special7word" "special9word"
    ## [76] NA             NA             NA             NA             NA
    ## [81] "special4word" NA             NA             "special5word" NA
    ## [86] NA             NA             NA             "special9word" "special9word"
    ## [91] NA             NA             NA             NA             NA
    ## [96] "special6word" NA             NA             "special3word" "special1word"

A much easier approach to the above (which is cumbersome and verbose) is to use `str_extract`:

    str_extract(test_string, "special[0-9]word")
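As a quick sanity check that the two routes agree, the sketch below compares them on three strings copied from the `head(test_string)` output earlier in the post. The variable names are illustrative.

```r
library(stringr)

# Three strings copied from the head(test_string) output earlier in the post
x <- c(
  "pZTQHcDVObnaCFS",
  "dGgJOspecial5wordiFpbvXzUD",
  "WOfLjNospecial4wordEeGkyTA"
)
pattern <- "special[0-9]word"

# Two-step route: locate the first match, then cut it out
two_step <- str_sub(x, str_locate(x, pattern))

# One-step route
one_step <- str_extract(x, pattern)

two_step
one_step
```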

However, the whole objective behind this exercise was to map out clearly how to connect `str_locate` to `str_sub`, and it’s much clearer if you can pass the entire matrix. Converting the output of `str_locate_all`, however, is still a bit tricky.
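One way around the conversion, sketched here as an assumption rather than as the approach used in this post, is to skip the stacking entirely and call `str_sub` once per string, handing it that string's own matrix. The toy strings and names below are hypothetical.

```r
library(stringr)

# Hypothetical toy strings: no match, one match, two matches
x <- c("abc", "a1b", "a1b2")

locs <- str_locate_all(x, "[0-9]")

# Pair each string with its own location matrix and extract every match
pieces <- Map(function(s, m) str_sub(s, m), x, locs)
pieces <- unname(pieces)

pieces

# For comparison: the direct route
str_extract_all(x, "[0-9]")
```

This keeps one list element per string, so strings with several matches stay aligned with the input, at the cost of returning a list instead of a character vector.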
