Taking the data out of the glue with regex in R
Introduction
Regular expressions, or regex, are incredibly powerful tools for pattern matching and extracting specific information from text data. Today, we’ll explore how to harness the might of regex in R with a practical example.
Let’s dive into a scenario where we need to clean a dataset and extract numerical values from strings. Our data, stored in a dataframe named df, has four columns: x1 holds a unit label, while x2, x3, and x4 hold strings containing a numerical value followed by a percentage in parentheses, such as 11 (1%). Our goal is to extract the numerical values and compute a total for each row.
Loading Libraries
Before we begin, we need to load the necessary libraries. We’ll be using the tidyverse package for data manipulation, along with glue and unglue for string manipulation.
# Library Loading
pacman::p_load(tidyverse, glue, unglue)
Exploring the Data
Let’s build the example data with tibble() and glue(), then take a sneak peek with the head() function to understand its structure.
df <- tibble(
  x1 = rep("Unit A", 11),
  x2 = c(glue("{11:20} ({1:10}%)"), glue("{251} ({13}%)")),
  x3 = c(glue("{21:30} ({11:20}%)"), glue("{252} ({14}%)")),
  x4 = c(glue("{31:40} ({21:30}%)"), glue("{253} ({15}%)"))
)

head(df, 3)

# A tibble: 3 × 4
  x1     x2      x3       x4
  <chr>  <chr>   <chr>    <chr>
1 Unit A 11 (1%) 21 (11%) 31 (21%)
2 Unit A 12 (2%) 22 (12%) 32 (22%)
3 Unit A 13 (3%) 23 (13%) 33 (23%)
The head(df, 3) call displays the first three rows of our dataframe df, giving us an idea of what our data looks like.
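The way glue() builds these strings may not be obvious at first glance, so here is a minimal sketch of the idea on a shorter range. The names labels and all_labels are illustrative and do not appear in the original post.

```r
library(glue)

# glue() evaluates each {expression} and pairs up vectors element-wise,
# so two length-3 vectors yield three strings.
labels <- glue("{11:13} ({1:3}%)")
print(labels)

# A single extra value can be appended with c(), as the post does for the
# last row. as.character() drops the glue class to give a plain character vector.
all_labels <- c(as.character(labels), as.character(glue("{251} ({13}%)")))
print(all_labels)
```

Each element has the form value (percent%), which is the shape the extraction steps later in the post rely on.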
Creating a Regex Function
Now, we’ll define a custom function named reg_val_fns to extract numerical values from strings using regular expressions. This function takes two parameters: .col_data (column data) and .pattern (regex pattern). If no pattern is provided, it defaults to a pattern that matches a run of digits followed by a non-word character or the end of the string. Because str_extract() returns only the first match, the function picks up the leading value and ignores the percentage in parentheses.
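# Make regex function
reg_val_fns <- function(.col_data, .pattern = NULL){
  ptrn <- .pattern
  # Default: a run of digits followed by a non-word character or end of string
  if(is.null(ptrn)){
    ptrn <- "\\d+(?=\\W|$)"
  }
  # Extract the first match from each string and convert it to numeric
  reged_val <- .col_data |>
    str_extract(ptrn) |>
    as.numeric()
  return(reged_val)
}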
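To see what the default pattern does on its own, the following sketch applies it to a few sample strings with stringr. The vector x and the variable names are made up for illustration; only the pattern itself comes from the function above.

```r
library(stringr)

# Default pattern from reg_val_fns: digits followed by a non-word
# character or the end of the string (the lookahead is not consumed).
default_pattern <- "\\d+(?=\\W|$)"

x <- c("11 (1%)", "251 (13%)", "Unit A")

# str_extract() keeps only the first match per string;
# strings without digits give NA.
first_match <- str_extract(x, default_pattern)
print(first_match)

# str_extract_all() shows that the percentage also matches the pattern;
# it is skipped above only because it comes second.
all_matches <- str_extract_all(x, default_pattern)
print(all_matches)
```

This is why the function works for this dataset: the value always comes before the percentage. If the order were reversed, a more specific pattern would be needed.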
Applying the Regex Function
With our regex function defined, we apply it across the desired columns using mutate(across()) from the dplyr package. This extracts the numerical value from the string in each column and converts it to numeric. Additionally, we compute the total for each row using rowSums().
# Apply the function across the desired columns
df |>
  mutate(across(-x1, reg_val_fns)) |>
  mutate(total_val = rowSums(across(-x1)))

# A tibble: 11 × 5
   x1        x2    x3    x4 total_val
   <chr>  <dbl> <dbl> <dbl>     <dbl>
 1 Unit A    11    21    31        63
 2 Unit A    12    22    32        66
 3 Unit A    13    23    33        69
 4 Unit A    14    24    34        72
 5 Unit A    15    25    35        75
 6 Unit A    16    26    36        78
 7 Unit A    17    27    37        81
 8 Unit A    18    28    38        84
 9 Unit A    19    29    39        87
10 Unit A    20    30    40        90
11 Unit A   251   252   253       756
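The .pattern argument is not exercised in the post, so here is a hedged sketch of how it could be used to pull the percentages instead of the values. The pattern "\\d+(?=%)" and the names small_df and pct_df are assumptions introduced for this example; the helper is repeated so the block runs on its own.

```r
library(dplyr)
library(stringr)

# Same helper as in the post, repeated so this sketch is self-contained
reg_val_fns <- function(.col_data, .pattern = NULL) {
  ptrn <- .pattern
  if (is.null(ptrn)) {
    ptrn <- "\\d+(?=\\W|$)"
  }
  .col_data |>
    str_extract(ptrn) |>
    as.numeric()
}

# Hypothetical two-row sample in the same shape as df
small_df <- tibble(
  x1 = c("Unit A", "Unit A"),
  x2 = c("11 (1%)", "251 (13%)"),
  x3 = c("21 (11%)", "252 (14%)")
)

# Illustrative custom pattern: digits immediately followed by a percent sign
pct_df <- small_df |>
  mutate(across(-x1, \(x) reg_val_fns(x, .pattern = "\\d+(?=%)")))

print(pct_df)
```

Wrapping the call in an anonymous function is one way to pass the extra argument through across(); the default pattern is left untouched for callers that omit it.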
Alternative Approach: Using unglue
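An alternative method to extract values from strings is the unglue package. Here, we apply the unglue_data() function across columns (excluding x1) to extract values and percentages separately, then unnest the resulting dataframe and compute a total for each row. Note that total_val here sums every numeric column, percentages included, so it differs from the total in the regex approach (96 versus 63 in the first row).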
# Use unglue
df |>
  mutate(across(-x1, \(x) unglue_data(x, "{val} ({val_pct}%)"))) |>
  unnest(cols = everything(), names_sep = "_") |>
  mutate(across(.cols = contains("val"), \(x) as.numeric(x))) |>
  mutate(total_val = rowSums(across(where(is.numeric))))

# A tibble: 11 × 8
   x1     x2_val x2_val_pct x3_val x3_val_pct x4_val x4_val_pct total_val
   <chr>   <dbl>      <dbl>  <dbl>      <dbl>  <dbl>      <dbl>     <dbl>
 1 Unit A     11          1     21         11     31         21        96
 2 Unit A     12          2     22         12     32         22       102
 3 Unit A     13          3     23         13     33         23       108
 4 Unit A     14          4     24         14     34         24       114
 5 Unit A     15          5     25         15     35         25       120
 6 Unit A     16          6     26         16     36         26       126
 7 Unit A     17          7     27         17     37         27       132
 8 Unit A     18          8     28         18     38         28       138
 9 Unit A     19          9     29         19     39         29       144
10 Unit A     20         10     30         20     40         30       150
11 Unit A    251         13    252         14    253         15       798
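If the goal is a total over the values only, matching the regex approach, one option is to restrict the row sum to the columns whose names end in _val. The sketch below is a variation, not the post's method: it starts from a hand-built tibble named unglued that mirrors the column layout above.

```r
library(dplyr)

# Hypothetical stand-in for the unnested result, using two rows of the output
unglued <- tibble(
  x1 = c("Unit A", "Unit A"),
  x2_val = c(11, 251), x2_val_pct = c(1, 13),
  x3_val = c(21, 252), x3_val_pct = c(11, 14),
  x4_val = c(31, 253), x4_val_pct = c(21, 15)
)

# ends_with("_val") selects x2_val, x3_val, x4_val and skips the *_val_pct columns
totals <- unglued |>
  mutate(total_val = rowSums(across(ends_with("_val"))))

print(totals$total_val)  # 63 756
```

The selection depends on the names produced by names_sep = "_", so a different separator or placeholder name in the unglue pattern would need a matching selector.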
Conclusion
In this tutorial, we’ve explored how to leverage the power of regular expressions in R to extract numerical values from strings within a dataframe. By defining custom regex functions and using packages like dplyr and unglue, we can efficiently clean and manipulate text data for further analysis.
I encourage you to try out these techniques on your own datasets and explore the endless possibilities of regex in R. Happy coding!