Programmatically generate REGEX Patterns in R without knowing Regex

[This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers].

Regular expressions (regex, often pronounced ri-je-x or reg-x) are extremely useful when you are doing text analytics or natural language processing. But as useful as regex is, it is also confusing and hard to read, and it always requires (at least for me) multiple DuckDuckGo searches and clicking back and forth through several Stack Overflow links.

What’s Regex

According to Wikipedia, a regular expression (regex or regexp) is a sequence of characters that defines a search pattern.

How does it look?

This is the REGEX pattern to test the validity of a URL:

^(http)(s)?(\:\/\/)(www\.)?([^\ ]*)$
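
To see the pattern in action, here is a minimal sketch that applies it with base R's grepl(). The sample URLs are made up for illustration, and perl = TRUE is an assumption chosen so that the escaped characters are interpreted the PCRE way.

```r
# The URL pattern from above; backslashes are doubled inside an R string.
url_pattern <- "^(http)(s)?(\\:\\/\\/)(www\\.)?([^\\ ]*)$"

# Hypothetical sample inputs, not taken from the original article.
urls <- c(
  "https://www.example.com",
  "http://example.org/path",
  "ftp://example.com",
  "https://exa mple.com"
)

# TRUE where the whole string matches the pattern.
is_url <- grepl(url_pattern, urls, perl = TRUE)
print(data.frame(url = urls, matches = is_url))
```

The first two inputs match. The third fails because it does not start with http, and the fourth fails because the pattern does not allow a space after the scheme.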

A typical regular expression contains characters (such as http) and metacharacters (such as []). The combination of the two forms a meaningful regular expression for a particular task.

So, What’s the problem?

Remembering how characters and metacharacters combine into a meaningful regex is a tedious task in itself, and it sometimes becomes a bigger task than the actual NLP problem, which is the larger goal.

Solution at Hand

Some good soul on this planet has created an open-source JavaScript library, JSVerbalExpressions, to make regex creation easy. Then another good soul (Tyler Littlefield) ported the JavaScript library to R as RVerbalExpressions. This is the beauty of the open source world.

Installation

RVerbalExpressions is available on GitHub, so you can use devtools or remotes to install it from there.

# install.packages("devtools")
devtools::install_github("VerbalExpressions/RVerbalExpressions")

Pseudo-Problem

Let’s create a pseudo-problem to solve with regex, so that we can see how this package programmatically creates regex.

A simple one, perhaps: we’ve got several strings like the ones below, and we’d like to extract the names from them. Here’s what our input and desired output look like:

strings = c('123Abdul233','233Raja434','223Ethan Hunt444')
Abdul, Raja, Ethan Hunt

Once we solve this, we’ll move forward with slightly more complicated problems.

Pseudo-Code

Before we code, it’s always good to write out some pseudo-code on a napkin or a piece of paper, if you’ve got one. That is, we want to extract the names (which are made up of letters) and leave out the numbers (which are made up of digits). We build a regex for one string and then apply it to all the elements of our vector.
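
As a rough baseline, the plan above can be sketched in base R with a hand-written character class. This is not the RVerbalExpressions approach, only a way to make the two steps concrete; extract_name is a hypothetical helper name introduced for this illustration.

```r
strings <- c("123Abdul233", "233Raja434", "223Ethan Hunt444")

# Step 1: handle a single string by deleting every digit from it.
extract_name <- function(x) {
  gsub("[[:digit:]]", "", x)
}

# Step 2: apply the same function to each element of the vector.
names_found <- vapply(strings, extract_name, character(1), USE.NAMES = FALSE)
print(names_found)
```

Removing the digits leaves "Abdul", "Raja" and "Ethan Hunt", which is the target output for the rest of the walkthrough.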

Loading

Like any other R package, RVerbalExpressions can be loaded with the library() function.

library(RVerbalExpressions)

Constructing the Expression

Extract Strings

Like many other modern R packages, RVerbalExpressions supports the %>% pipe operator for simpler and more readable code. But for this problem of extracting the letters that sit between the numbers, we can simply use one function, rx_alpha(), to say that we need alphabetic characters from the given string. On its own it matches one letter at a time, so each name comes back as individual characters.

expr <- rx_alpha()

stringr::str_extract_all(strings, expr)

[[1]]
[1] "A" "b" "d" "u" "l"

[[2]]
[1] "R" "a" "j" "a"

[[3]]
[1] "E" "t" "h" "a" "n" "H" "u" "n" "t"
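
A bare character class matches exactly one character per match, which is why the names arrive as single letters. As a rough base R illustration, the sketch below uses the POSIX class [[:alpha:]] as a stand-in (the exact pattern that rx_alpha() emits may differ between package versions) and shows how a + quantifier groups consecutive letters. The helper extract_all is a hypothetical name used only here.

```r
strings <- c("123Abdul233", "233Raja434", "223Ethan Hunt444")

# Hypothetical helper: return every match of `pattern` in each string.
extract_all <- function(pattern, x) regmatches(x, gregexpr(pattern, x))

single_letters <- extract_all("[[:alpha:]]", strings)   # one letter per match
letter_runs    <- extract_all("[[:alpha:]]+", strings)  # one run of letters per match

print(single_letters[[1]])
print(letter_runs)
```

With the quantifier, the first two strings yield "Abdul" and "Raja", while the third yields "Ethan" and "Hunt" as separate runs because of the space between them.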

Extract Numbers

As with the letters we extracted, extracting numbers reads very much like English: we use the function rx_digit() to say that we need digits from the given text.

expr <- rx_digit()

stringr::str_extract_all(strings, expr)

[[1]]
[1] "1" "2" "3" "2" "3" "3"

[[2]]
[1] "2" "3" "3" "4" "3" "4"

[[3]]
[1] "2" "2" "3" "4" "4" "4"


Another Constructor to extract the name as a word

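Here, we can use the function rx_word() to match the name as a word (rather than as individual letters). Because a space is not a word character, "Ethan Hunt" comes back as two separate matches.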

expr <- rx_alpha() %>% rx_word() %>% rx_alpha()

stringr::str_extract_all(strings, expr)

[[1]]
[1] "Abdul"

[[2]]
[1] "Raja"

[[3]]
[1] "Ethan" "Hunt"
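
To keep a multi-word name such as "Ethan Hunt" in one piece, one possible approach is to allow single spaces between runs of letters. The sketch below uses a hand-written pattern in base R rather than an RVerbalExpressions constructor, so treat it as an illustration under that assumption and not as the package's way of doing it.

```r
strings <- c("123Abdul233", "233Raja434", "223Ethan Hunt444")

# One or more letters, optionally followed by (space + letters) groups.
name_pattern <- "[[:alpha:]]+( [[:alpha:]]+)*"

# regexpr() finds the first match per string; regmatches() extracts it.
full_names <- regmatches(strings, regexpr(name_pattern, strings))
print(full_names)
```

This returns one name per input string, including the space in "Ethan Hunt".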

Expression

What if we want to use the expression somewhere else, or simply need the regex pattern itself? It’s simple: the expression is what we’ve constructed, so printing it reveals the corresponding regex pattern.

expr
"[A-z]\\w+[A-z]"

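Since the printed pattern is an ordinary string, it can be pasted into other regex functions. The sketch below reuses it with base R instead of stringr (perl = TRUE is an assumption, chosen so that \w and the A-z range follow PCRE rules) and also probes two caveats of this particular pattern. The input "12Al34" is a made-up example.

```r
strings <- c("123Abdul233", "233Raja434", "223Ethan Hunt444")

# The pattern printed above, pasted in as a plain R string.
pattern <- "[A-z]\\w+[A-z]"

matches <- regmatches(strings, gregexpr(pattern, strings, perl = TRUE))
print(matches)

# Caveat 1: the pattern needs at least three characters, so "Al" is missed.
short_name <- regmatches("12Al34", gregexpr(pattern, "12Al34", perl = TRUE))[[1]]
print(length(short_name))

# Caveat 2: the ASCII range A-z also covers [ \ ] ^ _ and the backtick.
print(grepl("[A-z]", "_", perl = TRUE))
```

The matches agree with the stringr output shown earlier. The two caveats suggest checking a generated pattern against edge cases before relying on it for real data.
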
Summary

Thus, we managed to build a regex pattern without knowing regex. Simply put, we programmatically generated a regex pattern using R (without requiring in-depth knowledge of regex syntax) and accomplished the tiny task that we took up to demonstrate the potential. For more on regex, check out this DataCamp course. The entire code is available here.
