# Regular Expressions Exercises – Part 1

October 30, 2016

A common task performed during data preparation or data analysis is the manipulation of strings.

Regular expressions are meant to assist in these and similar tasks.

A regular expression is a pattern that describes a set of strings.

Regular expressions can range from simple patterns (such as finding a single number) to complex ones (such as identifying UK postcodes).

R implements a set of “regular expression rules” that are largely shared with other programming languages, and it also supports some variations, such as Perl-like regular expressions.

Also, whether a specific pattern is found can sometimes depend on the system locale.

These patterns can be applied through several base R functions, such as:

• `grep`
• `grepl`
• `regexpr`
• `gregexpr`
• `sub`
• `gsub`
• `strsplit`
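To give a feel for how these functions differ, here is a minimal sketch that applies each of them to the same pattern. The vector `fruits` and the pattern `"an"` are made up for illustration and are not part of the exercises.

```r
# Hypothetical sample data, not taken from the exercises
fruits  <- c("apple", "banana", "cherry")
pattern <- "an"

grep(pattern, fruits)        # indices of matching elements: 2
grepl(pattern, fruits)       # one logical per element: FALSE TRUE FALSE
regexpr(pattern, fruits)     # position of the first match per element (-1 if none)
gregexpr(pattern, fruits)    # positions of all matches per element, returned as a list
sub(pattern, "AN", fruits)   # replaces the first match only: "bANana"
gsub(pattern, "AN", fruits)  # replaces every match: "bANANa"
strsplit(fruits, pattern)    # splits each element at the matches
```

Roughly speaking, `grep` and `grepl` answer "does it match?", `regexpr` and `gregexpr` answer "where?", and `sub`, `gsub` and `strsplit` use the match to build new strings.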

Since this topic involves learning both a set of rules and several different R functions, I’ll split it into a series of three exercise sets.

Answers to the exercises are available here.

With regular expressions you can often get correct results in more than one way, so if you have different solutions, feel free to post them.

#### Character class

A character class is a list of characters enclosed between square brackets (i.e. `[` and `]`), which matches any *single* character in that list.
For example, `[0359abC]` means “find a pattern with one of the characters 0, 3, 5, 9, a, b or C”.
There are some “shortcuts” that let us match specific ranges of digits or letters:

• `[0-9]` means any digit
• `[A-Z]` means any uppercase letter
• `[a-z]` means any lowercase letter
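As a quick illustration of these classes and ranges, the sketch below tests a few made-up strings with `grepl`. The vector `codes` is hypothetical, and the results shown assume a typical English or C locale, since the interpretation of ranges can be locale-dependent.

```r
# Hypothetical strings, chosen only for illustration
codes <- c("A7", "b", "42", "xyz")

grepl("[0-9]", codes)      # contains a digit:             TRUE FALSE TRUE FALSE
grepl("[A-Z]", codes)      # contains an uppercase letter: TRUE FALSE FALSE FALSE
grepl("[a-z]", codes)      # contains a lowercase letter:  FALSE TRUE FALSE TRUE
grepl("[0359abC]", codes)  # contains one of 0,3,5,9,a,b,C: FALSE TRUE FALSE FALSE
```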

Let’s create a variable called `text1` and populate it with the value “The current year is 2016”.

Exercise 1
Create a variable called `my_pattern` and implement the required pattern for finding any digit in the variable `text1`.
Use the function `grepl` to verify whether there is a digit in the string variable.

Exercise 2
Use function `gregexpr` to find all the positions in `text1` where there is a digit.
Place the results in a variable called `string_position`.

#### Predefined classes of characters

In many cases, we will look for specific types of characters (for example, any digit, any letter, any whitespace, etc.).
For this purpose, there are several predefined classes of characters that save us a lot of typing.

Note: The interpretation of some predefined classes depends on the locale. The “standard” interpretation is that of the POSIX locale.

Below are some “popular” predefined classes and their meanings. Note that in R a class name is itself used inside a bracket expression, so “any digit” is written `[[:digit:]]`.

1. `[:alnum:]`
Alphanumeric characters: `[:alpha:]` and `[:digit:]`.

2. `[:alpha:]`
Alphabetic characters: `[:lower:]` and `[:upper:]`, which can also be used on their own.

3. `[:digit:]`
Digits: 0 1 2 3 4 5 6 7 8 9.

4. `[:blank:]`
Blank characters: space and tab, and possibly other locale-dependent characters such as non-breaking space.
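The following sketch shows how predefined classes might be used in practice. The vector `x` is hypothetical, and each class name is wrapped in an outer pair of square brackets, as R's regex syntax requires.

```r
# Hypothetical input, not part of the exercises
x <- c("abc", "123", "a b", "A1")

grepl("[[:digit:]]", x)             # any digit:            FALSE TRUE FALSE TRUE
grepl("[[:upper:]]", x)             # any uppercase letter: FALSE FALSE FALSE TRUE
grepl("[[:blank:]]", x)             # a space or a tab:     FALSE FALSE TRUE FALSE
grepl("[[:alpha:]][[:digit:]]", x)  # a letter followed by a digit: FALSE FALSE FALSE TRUE
```

Writing `"[:digit:]"` with a single pair of brackets would instead be read as an ordinary character class containing the characters `:`, `d`, `i`, `g` and `t`, which is a common source of confusion.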

Exercise 3
Create a variable called `my_pattern` and implement the required pattern for finding one digit and one uppercase alphanumeric character in the variable `text1`.
This time, combine predefined classes in the regex pattern.
Use the function `grepl` to verify whether the searched pattern exists in the string.

Exercise 4
Use the function `regexpr` to find the position of the first space in `text1`.
Place the results in a variable called `first_space`.

#### Special single character

The period (“.”) matches any single character, whether it is a letter, a digit, a space or a punctuation mark.

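Here is a small sketch of the period in action. The vector `words` and the string `"a cot"` are invented for illustration.

```r
# Hypothetical strings for illustration
words <- c("cat", "cut", "c t", "ct")

# "c.t" needs exactly one character (any character) between "c" and "t"
grepl("c.t", words)      # TRUE TRUE TRUE FALSE

# regexpr reports where the first match starts
regexpr("c.t", "a cot")  # match starts at position 3
```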
Exercise 5
Create a pattern that checks in `text1` if there is a lowercase character, followed by any character and then by a digit.

Exercise 6
Find the starting position of the string matched by the above pattern. Place the results in a variable called `string_pos2`.

#### Special symbols

There are several “special symbols” that assist in the definition of specific patterns.
Note that in an R string you must escape the backslash itself, so these symbols are written with a double backslash (for example, `\w` is typed as `"\\w"`):
The symbol `\w` matches a ‘word’ character and `\W` is its negation.
Symbols `\d`, `\s`, `\D` and `\S` denote the digit and space classes and their negations.
As you may have noticed, some special symbols have their parallel “predefined classes”.
(For example, `\d` is equivalent to `[0-9]` and to `[[:digit:]]`.)
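The sketch below shows these symbols with the doubled backslash that R string literals need. The string `s` is hypothetical and is not the `text1` used in the exercises.

```r
# Hypothetical string, not the exercises' text1
s <- "Room 12 is free"

grepl("\\d", s)       # TRUE: the string contains a digit
regexpr("\\s", s)     # the first whitespace character is at position 5
gsub("\\D", "", s)    # remove every non-digit: "12"
gsub("\\W", "_", s)   # replace every non-word character: "Room_12_is_free"

# The special symbol and the predefined class give the same answer here
identical(grepl("\\d", s), grepl("[[:digit:]]", s))  # TRUE
```

Typing `"\d"` with a single backslash fails before the regex engine is even involved, because R does not recognise `\d` as a valid escape in a string literal.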

Exercise 7
Find the following pattern: one space followed by two lowercase letters and one more space.
Use a function that returns the starting point of the found string and place its result in `string_pos3`.

#### Metacharacters

There are several metacharacters in the “regex syntax”. Here I’ll introduce two popular ones:
The caret `("^")` means: the pattern must match at the beginning of the string.
The dollar sign `("$")` means: the pattern must match at the end of the string.
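
As an illustration of both anchors, the sketch below checks where a digit or a word sits within a few made-up strings. The vector `v` is hypothetical.

```r
# Hypothetical strings for illustration
v <- c("2016 report", "report 2016", "report")

grepl("^[0-9]", v)    # starts with a digit: TRUE FALSE FALSE
grepl("[0-9]$", v)    # ends with a digit:   FALSE TRUE FALSE
grepl("^report$", v)  # the whole string is exactly "report": FALSE FALSE TRUE
```

Using both anchors together, as in the last line, forces the pattern to cover the entire string rather than just a part of it.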

Exercise 8
Using the `sub` function, replace the pattern found in the previous exercise with the string " is not ".
Place the resulting string in a variable called `text2`.

#### Repetition Characters

There are several ways of dealing with the repetition of characters in the “regex syntax”. Here I’ll introduce the “Curly brackets” syntax:
`{n}` The preceding item is matched exactly n times.

`{n,}` The preceding item is matched n or more times.

`{n,m}` The preceding item is matched at least n times, but not more than m times.

By default repetition is greedy, so the maximal possible number of repeats is used.
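
To see the curly-bracket syntax and greedy matching at work, the sketch below inspects the length of each match through the `match.length` attribute that `regexpr` attaches to its result. The string `s` is made up for illustration.

```r
# Hypothetical string for illustration
s <- "aaa bb c 12345"

attr(regexpr("a{2}", s), "match.length")        # exactly two "a"s: 2
attr(regexpr("a{2,}", s), "match.length")       # two or more, greedy: 3
attr(regexpr("[0-9]{2,4}", s), "match.length")  # two to four digits, greedy: 4
grepl("b{3}", s)                                # FALSE: only two "b"s in a row
```

In the third call the engine could have stopped after two digits, but because repetition is greedy it takes as many as the upper bound allows.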

Exercise 9
Find in `text2` the following pattern: four digits at the end of the string.
Use a function that returns the starting point of the found string and place its result in `string_pos4`.

Exercise 10
Using the `substr` function, and according to the position of the string found in the previous exercise, extract the first two digits found at the end of `text2`.
