Regular expressions (regex, often pronounced ri-je-x or reg-x) are extremely useful when you are about to do text analytics or natural language processing. But as useful as regex is, it is also extremely confusing and hard to understand, and it always requires (at least for me) multiple DuckDuckGo searches, clicking back and forth between several Stack Overflow links.
According to Wikipedia, a regular expression (regex or regexp) is a sequence of characters that defines a search pattern.
How does it look?
This is the REGEX pattern to test the validity of a URL:
A typical regular expression contains characters (such as http) and metacharacters (symbols with a special meaning, such as ^ or ?). The combination of the two forms a meaningful regular expression for a particular task.
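To make the split between characters and metacharacters concrete, here is a small base R sketch. The pattern and the sample strings are illustrative assumptions, not a complete URL validator: http is matched literally, while ^, ?, ( ), [^ ], + and $ are metacharacters.

```r
# Illustrative pattern (an assumption for demonstration purposes):
#   ^          start of the string
#   https?     the literal "http", optionally followed by "s"
#   ://        literal characters
#   (www\\.)?  an optional "www."
#   [^ ]+      one or more characters that are not a space
#   $          end of the string
url_pattern <- "^https?://(www\\.)?[^ ]+$"

# Hypothetical test strings
candidates <- c(
  "https://www.example.com",
  "http://example.com/page",
  "ftp://example.com",
  "http://bad url"
)

# grepl() returns TRUE for every string that matches the pattern
is_url <- grepl(url_pattern, candidates)
print(is_url)
```

Only the first two candidates pass: the third does not start with http, and the fourth contains a space.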
So, What’s the problem?
Remembering how characters and metacharacters combine into a meaningful regex is itself a tedious task, and it sometimes becomes a bigger task than the actual NLP problem, which is the larger goal.
Solution at Hand
RVerbalExpressions is available on GitHub, so you can use devtools (or remotes) to install it from there.
# install.packages("devtools")
devtools::install_github("VerbalExpressions/RVerbalExpressions")
Let’s create a pseudo-problem to solve with regex, so that we can understand how this package builds a regex programmatically.
A simple one, perhaps: we’ve got several strings in which a name is surrounded by digits, and we’d like to extract the names from them. Here’s what our input and output look like:
strings = c('123Abdul233','233Raja434','223Ethan Hunt444')
# desired output: Abdul, Raja, Ethan Hunt
Once we solve this, we’ll move on to slightly more complicated problems.
Before we code, it’s always good to write out pseudo-code on a napkin or a sheet of paper. That is, we want to extract the names (which are made of letters) and leave out the numbers (which are made of digits). We build a regex for one string and then apply it to all the elements in our vector.
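For comparison, the napkin plan can be sketched with a hand-written pattern in base R. This is only a sketch under assumptions: the pattern and the variable names are mine, and it treats a name as runs of ASCII letters separated by single spaces.

```r
# Input copied from the example above
strings <- c("123Abdul233", "233Raja434", "223Ethan Hunt444")

# Hand-written pattern (an assumption): one run of letters,
# optionally followed by more runs that are each preceded by a space
name_pattern <- "[A-Za-z]+( [A-Za-z]+)*"

# Step 1: solve it for a single string
first_name <- regmatches(strings[1], regexpr(name_pattern, strings[1]))

# Step 2: apply the same pattern to every element of the vector
# (regexpr() and regmatches() are vectorised; every element matches here)
all_names <- regmatches(strings, regexpr(name_pattern, strings))
print(all_names)
```

This is the kind of pattern that has to be remembered or looked up, which is the step the package aims to replace with readable function calls.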
Like any other R package, RVerbalExpressions can be loaded with the library() function, that is, library(RVerbalExpressions).
Constructing the Expression
Like many other modern-day R packages, RVerbalExpressions supports the %>% pipe operator for simpler, more readable code. But for this problem of extracting the strings that sit between the numbers, we can simply use one function, rx_alpha(), to say that we need letters from the given string.
expr = rx_alpha()
stringr::str_extract_all(strings, expr)
[[1]]
[1] "A" "b" "d" "u" "l"

[[2]]
[1] "R" "a" "j" "a"

[[3]]
[1] "E" "t" "h" "a" "n" "H" "u" "n" "t"
Extracting numbers reads just as much like plain English as extracting the text did: we use the function rx_digit() to say that we need digits from the given text.
expr = rx_digit()
stringr::str_extract_all(strings, expr)
[[1]]
[1] "1" "2" "3" "2" "3" "3"

[[2]]
[1] "2" "3" "3" "4" "3" "4"

[[3]]
[1] "2" "2" "3" "4" "4" "4"
Another Constructor to extract the name as a word
Here, we can use the function rx_word() to match the name as a word (rather than as individual letters).
expr = rx_alpha() %>% rx_word() %>% rx_alpha()
stringr::str_extract_all(strings, expr)
[[1]]
[1] "Abdul"

[[2]]
[1] "Raja"

[[3]]
[1] "Ethan" "Hunt"
What if we want to use the expression somewhere else, or simply need the regex itself? That’s simple: the expression is what we’ve constructed, so printing it reveals the underlying regex pattern.
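As a rough illustration of reusing the pattern elsewhere, the sketch below feeds a plain pattern string to base R instead of stringr. The string assigned to pattern is an assumption about what printing the expression built from rx_alpha(), rx_word() and rx_alpha() would show; check it against your own printed output before relying on it.

```r
# Input copied from the example above
strings <- c("123Abdul233", "233Raja434", "223Ethan Hunt444")

# Assumed printed form of the constructed expression (not taken from the
# package output): a letter, one or more word characters, then a letter
pattern <- "[[:alpha:]]\\w+[[:alpha:]]"

# The same pattern works with base R, without RVerbalExpressions or stringr
matches <- regmatches(strings, gregexpr(pattern, strings, perl = TRUE))
print(matches)
```

Because the pattern is an ordinary character string, it can be passed to any function that accepts a regex, such as grepl() or gsub().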
Thus, we managed to build a regex pattern without knowing regex. Simply put, we programmatically generated a regex pattern using R (without needing in-depth knowledge of regex syntax) and accomplished the tiny task we took up to demonstrate the package’s potential. For more on regex, check out this DataCamp course. The entire code is available here.