# Taking the data out of the glue with regex in R

**Steve's Data Tips and Tricks**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

# Introduction

Regular expressions, or regex, are incredibly powerful tools for pattern matching and extracting specific information from text data. Today, we’ll explore how to harness the might of regex in R with a practical example.

Let’s dive into a scenario where we have data that needs cleaning and extracting numerical values from strings. Our data, stored in a dataframe named `df`

, consists of four columns (`x1`

, `x2`

, `x3`

, `x4`

) with strings containing numerical values along with percentage values enclosed in parentheses. Our goal is to extract these numerical values and compute a total for each row.

# Loading Libraries

Before we begin, we need to load the necessary libraries. We’ll be using the `tidyverse`

package for data manipulation, along with `glue`

and `unglue`

for string manipulation.

# Library Loading pacman::p_load(tidyverse, glue, unglue)

# Exploring the Data

Let’s take a sneak peek at our data using the `head()`

function to understand its structure.

df <- tibble( x1 = rep("Unit A", 11), x2 = c(glue("{11:20} ({1:10}%)"), glue("{251} ({13}%)")), x3 = c(glue("{21:30} ({11:20}%)"), glue("{252} ({14}%)")), x4 = c(glue("{31:40} ({21:30}%)"), glue("{253} ({15}%)")) ) head(df, 3)

# A tibble: 3 × 4 x1 x2 x3 x4 <chr> <chr> <chr> <chr> 1 Unit A 11 (1%) 21 (11%) 31 (21%) 2 Unit A 12 (2%) 22 (12%) 32 (22%) 3 Unit A 13 (3%) 23 (13%) 33 (23%)

This command displays the first three rows of our dataframe `df`

, giving us an idea of how our data looks like.

# Creating a Regex Function

Now, we’ll define a custom function named `reg_val_fns`

to extract numerical values from strings using regular expressions. This function takes two parameters: `.col_data`

(column data) and `.pattern`

(regex pattern). If no pattern is provided, it defaults to extracting any sequence of digits followed by non-word characters or the end of the string.

# Make regex function reg_val_fns <- function(.col_data, .pattern = NULL){ ptrn <- .pattern if(is.null(ptrn)){ ptrn <- "\\d+(?=\\W|$)" } reged_val <- .col_data |> str_extract(ptrn) |> as.numeric() return(reged_val) }

# Applying the Regex Function

With our regex function defined, we apply it across desired columns using the `mutate(across())`

function from the `dplyr`

package. This extracts numerical values from strings in each column, converting them into numeric format. Additionally, we compute the total value for each row using `rowSums()`

.

# Apply the function across the desired columns df |> mutate(across(-x1, reg_val_fns)) |> mutate(total_val = rowSums(across(-x1)))

# A tibble: 11 × 5 x1 x2 x3 x4 total_val <chr> <dbl> <dbl> <dbl> <dbl> 1 Unit A 11 21 31 63 2 Unit A 12 22 32 66 3 Unit A 13 23 33 69 4 Unit A 14 24 34 72 5 Unit A 15 25 35 75 6 Unit A 16 26 36 78 7 Unit A 17 27 37 81 8 Unit A 18 28 38 84 9 Unit A 19 29 39 87 10 Unit A 20 30 40 90 11 Unit A 251 252 253 756

# Alternative Approach: Using unglue

An alternative method to extract values from strings is using the `unglue`

package. Here, we apply the `unglue_data()`

function across columns (excluding `x1`

) to extract values and percentages separately, then unnest the resulting dataframe and compute the total value for each row.

# Use unglue df |> mutate(across(-x1, \(x) unglue_data(x, "{val} ({val_pct}%)"))) |> unnest(cols = everything(), names_sep = "_") |> mutate(across(.cols = contains("val"), \(x) as.numeric(x))) |> mutate(total_val = rowSums(across(where(is.numeric))))

# A tibble: 11 × 8 x1 x2_val x2_val_pct x3_val x3_val_pct x4_val x4_val_pct total_val <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 Unit A 11 1 21 11 31 21 96 2 Unit A 12 2 22 12 32 22 102 3 Unit A 13 3 23 13 33 23 108 4 Unit A 14 4 24 14 34 24 114 5 Unit A 15 5 25 15 35 25 120 6 Unit A 16 6 26 16 36 26 126 7 Unit A 17 7 27 17 37 27 132 8 Unit A 18 8 28 18 38 28 138 9 Unit A 19 9 29 19 39 29 144 10 Unit A 20 10 30 20 40 30 150 11 Unit A 251 13 252 14 253 15 798

# Conclusion

In this tutorial, we’ve explored how to leverage the power of regular expressions in R to extract numerical values from strings within a dataframe. By defining custom regex functions and using packages like `dplyr`

and `unglue`

, we can efficiently clean and manipulate text data for further analysis.

I encourage you to try out these techniques on your own datasets and explore the endless possibilities of regex in R. Happy coding!

**leave a comment**for the author, please follow the link and comment on their blog:

**Steve's Data Tips and Tricks**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.