[This article was first published on Steve's Data Tips and Tricks, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

# Introduction

Regular expressions, or regex, are incredibly powerful tools for pattern matching and extracting specific information from text data. Today, we’ll explore how to harness the might of regex in R with a practical example.

Let’s dive into a scenario where we have data that needs cleaning and extracting numerical values from strings. Our data, stored in a dataframe named `df`, consists of four columns (`x1`, `x2`, `x3`, `x4`) with strings containing numerical values along with percentage values enclosed in parentheses. Our goal is to extract these numerical values and compute a total for each row.

# Loading Libraries

Before we begin, we need to load the necessary libraries. We’ll be using the `tidyverse` package for data manipulation, along with `glue` and `unglue` for string manipulation.

```# Library Loading
pacman::p_load(tidyverse, glue, unglue)```

# Exploring the Data

Let’s take a sneak peek at our data using the `head()` function to understand its structure.

```df <- tibble(
x1 = rep("Unit A", 11),
x2 = c(glue("{11:20} ({1:10}%)"),  glue("{251} ({13}%)")),
x3 = c(glue("{21:30} ({11:20}%)"), glue("{252} ({14}%)")),
x4 = c(glue("{31:40} ({21:30}%)"), glue("{253} ({15}%)"))
)

head(df, 3)```
```# A tibble: 3 × 4
x1     x2      x3       x4
<chr>  <chr>   <chr>    <chr>
1 Unit A 11 (1%) 21 (11%) 31 (21%)
2 Unit A 12 (2%) 22 (12%) 32 (22%)
3 Unit A 13 (3%) 23 (13%) 33 (23%)```

This command displays the first three rows of our dataframe `df`, giving us an idea of how our data looks like.

# Creating a Regex Function

Now, we’ll define a custom function named `reg_val_fns` to extract numerical values from strings using regular expressions. This function takes two parameters: `.col_data` (column data) and `.pattern` (regex pattern). If no pattern is provided, it defaults to extracting any sequence of digits followed by non-word characters or the end of the string.

```# Make regex function
reg_val_fns <- function(.col_data, .pattern = NULL){
ptrn <- .pattern
if(is.null(ptrn)){
ptrn <- "\\d+(?=\\W|\$)"
}

reged_val <- .col_data |>
str_extract(ptrn) |>
as.numeric()

return(reged_val)
}```

# Applying the Regex Function

With our regex function defined, we apply it across desired columns using the `mutate(across())` function from the `dplyr` package. This extracts numerical values from strings in each column, converting them into numeric format. Additionally, we compute the total value for each row using `rowSums()`.

```# Apply the function across the desired columns
df |>
mutate(across(-x1, reg_val_fns)) |>
mutate(total_val = rowSums(across(-x1)))```
```# A tibble: 11 × 5
x1        x2    x3    x4 total_val
<chr>  <dbl> <dbl> <dbl>     <dbl>
1 Unit A    11    21    31        63
2 Unit A    12    22    32        66
3 Unit A    13    23    33        69
4 Unit A    14    24    34        72
5 Unit A    15    25    35        75
6 Unit A    16    26    36        78
7 Unit A    17    27    37        81
8 Unit A    18    28    38        84
9 Unit A    19    29    39        87
10 Unit A    20    30    40        90
11 Unit A   251   252   253       756```

# Alternative Approach: Using unglue

An alternative method to extract values from strings is using the `unglue` package. Here, we apply the `unglue_data()` function across columns (excluding `x1`) to extract values and percentages separately, then unnest the resulting dataframe and compute the total value for each row.

```# Use unglue
df |>
mutate(across(-x1, \(x) unglue_data(x, "{val} ({val_pct}%)"))) |>
unnest(cols = everything(), names_sep = "_") |>
mutate(across(.cols = contains("val"), \(x) as.numeric(x))) |>
mutate(total_val = rowSums(across(where(is.numeric))))```
```# A tibble: 11 × 8
x1     x2_val x2_val_pct x3_val x3_val_pct x4_val x4_val_pct total_val
<chr>   <dbl>      <dbl>  <dbl>      <dbl>  <dbl>      <dbl>     <dbl>
1 Unit A     11          1     21         11     31         21        96
2 Unit A     12          2     22         12     32         22       102
3 Unit A     13          3     23         13     33         23       108
4 Unit A     14          4     24         14     34         24       114
5 Unit A     15          5     25         15     35         25       120
6 Unit A     16          6     26         16     36         26       126
7 Unit A     17          7     27         17     37         27       132
8 Unit A     18          8     28         18     38         28       138
9 Unit A     19          9     29         19     39         29       144
10 Unit A     20         10     30         20     40         30       150
11 Unit A    251         13    252         14    253         15       798```

# Conclusion

In this tutorial, we’ve explored how to leverage the power of regular expressions in R to extract numerical values from strings within a dataframe. By defining custom regex functions and using packages like `dplyr` and `unglue`, we can efficiently clean and manipulate text data for further analysis.

I encourage you to try out these techniques on your own datasets and explore the endless possibilities of regex in R. Happy coding!

To leave a comment for the author, please follow the link and comment on their blog: Steve's Data Tips and Tricks.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

# Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts.(You will not see this message again.)

Click here to close (This popup will not appear again)