Taking the data out of the glue with regex in R

[This article was first published on Steve's Data Tips and Tricks, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Introduction

Regular expressions, or regex, are incredibly powerful tools for pattern matching and extracting specific information from text data. Today, we’ll explore how to harness the might of regex in R with a practical example.

Let’s dive into a scenario where we have data that needs cleaning and extracting numerical values from strings. Our data, stored in a dataframe named df, consists of four columns (x1, x2, x3, x4) with strings containing numerical values along with percentage values enclosed in parentheses. Our goal is to extract these numerical values and compute a total for each row.

Loading Libraries

Before we begin, we need to load the necessary libraries. We’ll be using the tidyverse package for data manipulation, along with glue and unglue for string manipulation.

# Library Loading
pacman::p_load(tidyverse, glue, unglue)

Exploring the Data

Let’s take a sneak peek at our data using the head() function to understand its structure.

df <- tibble(
  x1 = rep("Unit A", 11),
  x2 = c(glue("{11:20} ({1:10}%)"),  glue("{251} ({13}%)")),
  x3 = c(glue("{21:30} ({11:20}%)"), glue("{252} ({14}%)")),
  x4 = c(glue("{31:40} ({21:30}%)"), glue("{253} ({15}%)"))
)

head(df, 3)
# A tibble: 3 × 4
  x1     x2      x3       x4      
  <chr>  <chr>   <chr>    <chr>   
1 Unit A 11 (1%) 21 (11%) 31 (21%)
2 Unit A 12 (2%) 22 (12%) 32 (22%)
3 Unit A 13 (3%) 23 (13%) 33 (23%)

This command displays the first three rows of our dataframe df, giving us an idea of how our data looks like.

Creating a Regex Function

Now, we’ll define a custom function named reg_val_fns to extract numerical values from strings using regular expressions. This function takes two parameters: .col_data (column data) and .pattern (regex pattern). If no pattern is provided, it defaults to extracting any sequence of digits followed by non-word characters or the end of the string.

# Make regex function
reg_val_fns <- function(.col_data, .pattern = NULL){
  ptrn <- .pattern
  if(is.null(ptrn)){
    ptrn <- "\\d+(?=\\W|$)"
  }
  
  reged_val <- .col_data |>
    str_extract(ptrn) |>
    as.numeric()

  return(reged_val)
}

Applying the Regex Function

With our regex function defined, we apply it across desired columns using the mutate(across()) function from the dplyr package. This extracts numerical values from strings in each column, converting them into numeric format. Additionally, we compute the total value for each row using rowSums().

# Apply the function across the desired columns
df |>
  mutate(across(-x1, reg_val_fns)) |>
  mutate(total_val = rowSums(across(-x1)))
# A tibble: 11 × 5
   x1        x2    x3    x4 total_val
   <chr>  <dbl> <dbl> <dbl>     <dbl>
 1 Unit A    11    21    31        63
 2 Unit A    12    22    32        66
 3 Unit A    13    23    33        69
 4 Unit A    14    24    34        72
 5 Unit A    15    25    35        75
 6 Unit A    16    26    36        78
 7 Unit A    17    27    37        81
 8 Unit A    18    28    38        84
 9 Unit A    19    29    39        87
10 Unit A    20    30    40        90
11 Unit A   251   252   253       756

Alternative Approach: Using unglue

An alternative method to extract values from strings is using the unglue package. Here, we apply the unglue_data() function across columns (excluding x1) to extract values and percentages separately, then unnest the resulting dataframe and compute the total value for each row.

# Use unglue
df |>
  mutate(across(-x1, \(x) unglue_data(x, "{val} ({val_pct}%)"))) |> 
  unnest(cols = everything(), names_sep = "_") |>
  mutate(across(.cols = contains("val"), \(x) as.numeric(x))) |>
  mutate(total_val = rowSums(across(where(is.numeric))))
# A tibble: 11 × 8
   x1     x2_val x2_val_pct x3_val x3_val_pct x4_val x4_val_pct total_val
   <chr>   <dbl>      <dbl>  <dbl>      <dbl>  <dbl>      <dbl>     <dbl>
 1 Unit A     11          1     21         11     31         21        96
 2 Unit A     12          2     22         12     32         22       102
 3 Unit A     13          3     23         13     33         23       108
 4 Unit A     14          4     24         14     34         24       114
 5 Unit A     15          5     25         15     35         25       120
 6 Unit A     16          6     26         16     36         26       126
 7 Unit A     17          7     27         17     37         27       132
 8 Unit A     18          8     28         18     38         28       138
 9 Unit A     19          9     29         19     39         29       144
10 Unit A     20         10     30         20     40         30       150
11 Unit A    251         13    252         14    253         15       798

Conclusion

In this tutorial, we’ve explored how to leverage the power of regular expressions in R to extract numerical values from strings within a dataframe. By defining custom regex functions and using packages like dplyr and unglue, we can efficiently clean and manipulate text data for further analysis.

I encourage you to try out these techniques on your own datasets and explore the endless possibilities of regex in R. Happy coding!

To leave a comment for the author, please follow the link and comment on their blog: Steve's Data Tips and Tricks.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)