(This article was first published on **R-exercises**, and kindly contributed to R-bloggers)

A common task performed during data preparation or data analysis is the manipulation of strings.

Regular expressions are meant to assist in such and similar tasks.

A *regular expression* is a **pattern** that describes a set of strings.

Regular expressions can range from simple patterns (such as finding a single number) to complex ones (such as identifying UK postcodes).

R implements a set of “regular expression rules” that are largely shared with other programming languages, and it also supports some variants, such as Perl-like regular expressions.

Also, whether a specific pattern is found can sometimes depend on the system locale.

Those **patterns** can be applied through several base R functions, such as:

`grep`

`grepl`

`regexpr`

`gregexpr`

`sub`

`gsub`

`strsplit`
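As a rough orientation, the sketch below calls each of these functions once with the same simple pattern. The vector `x` and the result names are hypothetical and are not part of the exercises.

```r
# Hypothetical example vector (not from the exercises)
x <- c("apple 12", "banana", "cherry 7")

idx       <- grep("[0-9]", x)       # indices of the elements that contain a digit
has_digit <- grepl("[0-9]", x)      # logical vector: does each element contain a digit?
first_pos <- regexpr("[0-9]", x)    # position of the first match in each element (-1 if none)
all_pos   <- gregexpr("[0-9]", x)   # list with the positions of all matches per element
one_repl  <- sub("[0-9]", "#", x)   # replace only the first match in each element
all_repl  <- gsub("[0-9]", "#", x)  # replace every match in each element
pieces    <- strsplit(x, " ")       # split each element at the spaces
```

The search functions (`grep`, `grepl`, `regexpr`, `gregexpr`) take the pattern first and the text second, while `strsplit` takes the text first.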

Since this topic includes both learning a set of rules and several different R functions, I’ll split this subject into a three-part series.

Answers to the exercises are available here.

With `regex` you can often get correct results in more than one way, so if you have different solutions, feel free to post them.


#### Character class

A character class is a list of characters enclosed between square brackets (i.e. `[` and `]`), which matches any *single* character in that list.

For example, `[0359abC]` means “match one of the characters 0, 3, 5, 9, a, b or C”.

There are some “shortcuts” that let us match specific ranges of digits or characters:

- [0-9] means any digit
- [A-Z] means any upper case character
- [a-z] means any lower case character
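A minimal sketch of how such classes behave with `grepl`, using a hypothetical vector `samples` that is unrelated to the exercises:

```r
# Hypothetical sample strings, chosen only for illustration
samples <- c("room 101", "Hello", "hello", "???")

has_digit  <- grepl("[0-9]", samples)      # TRUE where the string contains any digit
has_upper  <- grepl("[A-Z]", samples)      # TRUE where it contains an uppercase letter
has_listed <- grepl("[0359abC]", samples)  # TRUE where it contains one of the listed characters
```

Note that the interpretation of ranges such as `[A-Z]` can depend on the locale, so results for non-ASCII text may differ between systems.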

Let’s create a variable called `text1` and populate it with the value “The current year is 2016”.
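In R, this setup step can be written as a plain assignment:

```r
# The string used throughout the exercises
text1 <- "The current year is 2016"
```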

**Exercise 1**

Create a variable called `my_pattern` and implement the required pattern for finding **any** digit in the variable `text1`.

Use the function `grepl` to verify whether there is a digit in the string variable.

**Exercise 2**

Use the function `gregexpr` to find **all** the positions in `text1` where there is a digit.

Place the results in a variable called `string_position`.


#### Predefined classes of characters

In many cases, we will look for specific types of characters (for example, any digit, any letter, any whitespace, etc).

For this purpose, there are several predefined classes of characters that save us a lot of typing.

Note: The interpretation of some predefined classes depends on the locale. The “standard” interpretation is that of the POSIX locale.

Below are some “popular” predefined classes and their meaning. Note that a predefined class must itself be placed inside a bracket expression, e.g. `[[:digit:]]`.

1. `[:alnum:]` Alphanumeric characters: `[:alpha:]` and `[:digit:]`.
2. `[:alpha:]` Alphabetic characters: `[:lower:]` and `[:upper:]`.
3. `[:digit:]` Digits: 0 1 2 3 4 5 6 7 8 9.
4. `[:blank:]` Blank characters: space and tab, and possibly other locale-dependent characters such as non-breaking space.
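The sketch below illustrates the bracket-within-bracket form that predefined classes require, on a hypothetical string `s` that is not part of the exercises. Several classes can share one outer pair of brackets.

```r
# Hypothetical string, chosen only for illustration
s <- "Flat 4B, 2nd floor"

# A predefined class goes inside a bracket expression: [[:digit:]], not [:digit:]
any_digit <- grepl("[[:digit:]]", s)

# Replace every blank (space or tab) with an underscore
no_blanks <- gsub("[[:blank:]]", "_", s)

# Two classes in one bracket expression: drop lowercase letters and blanks
stripped <- gsub("[[:lower:][:blank:]]", "", s)
```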

**Exercise 3**

Create a variable called `my_pattern` and implement the required pattern for finding **one** digit and **one** uppercase letter in the variable `text1`.

This time, combine predefined classes in the regex pattern.

Use the function `grepl` to verify whether the searched pattern exists in the string.

**Exercise 4**

Use the function `regexpr` to find the position of the first space in `text1`.

Place the results in a variable called `first_space`.


#### Special single character

The period (“.”) matches any single character.
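A short sketch of the period in practice, using hypothetical inputs that are unrelated to the exercises:

```r
# Hypothetical codes, chosen only for illustration
codes <- c("a1", "a-1", "ab", "a--1")

# "a.1" needs an "a", then exactly one character of any kind, then a "1"
matches <- grepl("a.1", codes)

# "c.t" matches "cat", "cot" and "cut" alike
first_only <- sub("c.t", "dog", "cat cot cut")
every_one  <- gsub("c.t", "dog", "cat cot cut")
```

Only `"a-1"` matches here: `"a1"` has no character between the `a` and the `1`, and `"a--1"` has two.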

**Exercise 5**

Create a pattern that checks whether `text1` contains a lowercase character, followed by any character and then by a digit.

**Exercise 6**

Find the starting position of the string matched by the above pattern. Place the results in a variable called `string_pos2`.


#### Special symbols

There are several “special symbols” that assist in the definition of specific patterns.

Note that inside an R string literal the backslash itself must be escaped, so each of these symbols is written with a double backslash (for example `"\\d"`).

The symbol `\w` matches a ‘word’ character and `\W` is its negation.

Symbols `\d`, `\s`, `\D` and `\S` denote the digit and space classes and their negations.

As you may have noticed, some special symbols have parallel “predefined classes”. (For example, `\d` is equivalent to `[0-9]` and to `[[:digit:]]`.)
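The escaping rule is easiest to see in code. The following sketch uses a hypothetical string `s` that is not part of the exercises, and shows each symbol written with two backslashes in the R source.

```r
# Hypothetical string, chosen only for illustration
s <- "Order 66 shipped"

no_digits   <- gsub("\\d", "", s)   # remove every digit
underscored <- gsub("\\s", "_", s)  # replace every whitespace character
word_only   <- gsub("\\W", "", s)   # remove everything that is not a 'word' character

# \d and [[:digit:]] give the same result on this string
same_result <- identical(gsub("\\d", "", s), gsub("[[:digit:]]", "", s))
```

Writing `"\d"` with a single backslash is rejected by the R parser as an unrecognized escape, which is why the doubled form is needed.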

**Exercise 7**

Find the following pattern: one space followed by two lowercase letters and one more space.

Use a function that returns the starting point of the found string and place its result in `string_pos3`.


#### Metacharacters

There are several metacharacters in the “regex syntax”. Here I’ll introduce two popular ones:

The caret (`^`) anchors the pattern to the **beginning** of the string, so the match must start there.

The dollar sign (`$`) anchors the pattern to the **end** of the string, so the match must finish there.
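A small sketch of both anchors, using a hypothetical vector `labels` that is unrelated to the exercises:

```r
# Hypothetical labels, chosen only for illustration
labels <- c("42 apples", "apples 42", "4 apples 2")

starts_with_digit <- grepl("^[0-9]", labels)  # a digit must be the first character
ends_with_digit   <- grepl("[0-9]$", labels)  # a digit must be the last character
```

Without the anchors, `"[0-9]"` would match all three strings, because each contains a digit somewhere.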

**Exercise 8**

Using the `sub` function, replace the pattern found in the previous exercise with the string “ is not ”.

Place the resulting string in the variable `text2`.


#### Repetition Characters

There are several ways of dealing with the repetition of characters in the “regex syntax”. Here I’ll introduce the “Curly brackets” syntax:

- `{n}` The preceding item is matched exactly n times.
- `{n,}` The preceding item is matched n or more times.
- `{n,m}` The preceding item is matched at least n times, but not more than m times.

By default repetition is greedy, so the maximal possible number of repeats is used.
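The three forms and the greedy behaviour can be sketched as follows. The vector `ids` is hypothetical and not part of the exercises, and the patterns are anchored with `^` and `$` so that the whole string has to fit.

```r
# Hypothetical identifiers, chosen only for illustration
ids <- c("a1", "a12", "a123", "a1234")

exactly_two  <- grepl("^a[0-9]{2}$", ids)    # "a" followed by exactly two digits
two_or_more  <- grepl("^a[0-9]{2,}$", ids)   # "a" followed by two or more digits
two_to_three <- grepl("^a[0-9]{2,3}$", ids)  # "a" followed by two or three digits

# Greedy matching: {2,3} takes three digits when three are available
greedy <- regmatches("a1234", regexpr("[0-9]{2,3}", "a1234"))
```

`regmatches` is a base R helper that extracts the matched text from the result of `regexpr`.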

**Exercise 9**

Find in `text2` the following pattern: four digits at the end of the string.

Use a function that returns the starting point of the found string and place its result in `string_pos4`.

**Exercise 10**

Using the `substr` function, and according to the position of the string found in the previous exercise, extract the first two digits found at the end of `text2`.
