This note shows how to use the stringr package to clean a list of full names that need to be turned into unique identifiers, i.e. something that can be assigned as row names to a data frame.
Let’s start by getting a list of real names by scraping the 183 full names of the people currently sitting in the lower chamber of the Austrian parliament, using the rvest package to find the link to the full list, and then to scrape the name values from the list:
The names extracted through this method need a lot of fixing before we can use them as readable unique identifiers:
"Doris Bures" "Karlheinz Kopf" "Ing. Norbert Hofer" "\r\n\t\t\r\nAlm Nikolaus, Mag.\r\n" "\r\n\t\t\r\nAmon Werner, MBA\r\n" "\r\n\t\t\r\nAngerer Erwin\r\n"
There is, in fact, a method to get clean names, but it involves scraping one page per row in the data, which is not always desirable or feasible.
Before we start, let’s remark that text manipulation almost always calls for an idiosyncratic solution: depending on how messy the text is, the solution will rely on specific conditions being met (or, as importantly, being never met) in the data. Here, we assume that we are working with full names.
Working with names means that you are not working with full sentences, so you will want to eliminate some sentence-related characters, such as endmarks. However, with full names, you might need to preserve other punctuation marks, such as apostrophes, dashes or even periods, which are used in titles.
What you need in this scenario is a set of functions that deal with character duplication, character exclusion and spacing; that last one matters because of punctuation rules, and because a long list of human inputs is very likely to get spacing wrong at least once. Last, we will add subsetting to that list.
We will clean the data by writing short string functions in the style of the stringr package, which provides a front-end to stringi that uses fast C code to process strings. All stringr functions start with str_ and use sensible verbs. The functions below also start with str_ but are not necessarily verb-based.
stringr re-exports the %>% pipe from magrittr, so we will be able to use %>% pipes even without loading that last package. We will also load the dplyr package when we get to postprocessing and de-duplication.
Whitespace at the beginning or at the end of a word is a common feature of badly formatted text data, as are problematic whitespace characters such as carriage returns. The other error that frequently pops up is the presence of multiple spaces instead of a single one.
Extra spaces are easy to fix, and fixing them also offers the opportunity to treat the special characters for line returns or tabs, such as \n, as whitespace. A simple call to gsub, which we will embed into a stringr-like function, is sufficient here:
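As a minimal sketch of such a function, assuming base R's gsub and a helper called str_space (a hypothetical name, not a stringr function), runs of whitespace can be collapsed as follows. Trailing spaces are deliberately left in place in this sketch, on the assumption that they are handled at the end of the cleaning process:

```r
# Hypothetical helper in the stringr naming style; it is not part of stringr.
str_space <- function(x) {
  # Collapse any run of whitespace (spaces, tabs, \r, \n) into one space
  x <- gsub("[[:space:]]+", " ", x)
  # Remove the single space that may now sit at the start of the string
  sub("^ ", "", x)
}

str_space(c("Doris   Bures", "\r\n\t\t\r\nAlm Nikolaus, Mag.\r\n"))
```

Using the POSIX class [[:space:]] rather than a literal space is what lets the same call absorb carriage returns, line feeds and tabs.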
After running the function on the data, the names are now stripped of any issue caused by whitespace, as shown in the results below (original data in left column, processed data in right column):
| from | to |
|---|---|
| Doris Bures | Doris Bures |
| Karlheinz Kopf | Karlheinz Kopf |
| Ing. Norbert Hofer | Ing. Norbert Hofer |
| \r\n\t\t\r\nAlm Nikolaus, Mag.\r\n | Alm Nikolaus, Mag. |
| \r\n\t\t\r\nAmon Werner, MBA\r\n | Amon Werner, MBA |
| \r\n\t\t\r\nAngerer Erwin\r\n | Angerer Erwin |
Note that some names still end with a space, an issue that we will fix right at the end of the cleaning process.
Assuming that you are working with names and that you aim at matching some set of punctuation rules, you will want to treat some punctuation characters as whitespace and remove them, while preserving and adding space after some others.
Names are easy to process because only a few punctuation marks need to be preserved. Working with full sentences would require some coding effort to get endmarks right; instead, we are taking an important shortcut by removing those.
Since some punctuation marks are being preserved, such as dashes, we will also want to make sure that there is a single punctuation item between two words, so that this--example is treated as containing a duplicated item.
The following function applies all these rules sequentially:
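A minimal sketch of this kind of function is given below. It assumes a helper called str_punctuation (a hypothetical name) and covers only three of the rules discussed above: collapsing duplicated punctuation, dropping endmarks that close the string, and normalizing the spacing around commas. Which marks to preserve or drop is an assumption that depends on the data at hand:

```r
# Hypothetical helper; the exact set of rules is an assumption for illustration.
str_punctuation <- function(x) {
  # 1. Collapse repeated punctuation marks ("this--example" -> "this-example")
  x <- gsub("([[:punct:]])\\1+", "\\1", x, perl = TRUE)
  # 2. Drop endmarks that close the string, with or without trailing spaces
  x <- gsub("[.!?]+[[:space:]]*$", "", x, perl = TRUE)
  # 3. No space before a comma, exactly one space after it
  gsub("[[:space:]]*,[[:space:]]*", ", ", x, perl = TRUE)
}

str_punctuation(c("Ing. Norbert Hofer", "Alm Nikolaus, Mag. ", "this--example"))
```

The period in "Ing." survives because the second rule only looks at the end of the string.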
The top of the results shows no big difference, except for one endmark that was removed because it was located at the end of the string:
| from | to |
|---|---|
| Doris Bures | Doris Bures |
| Karlheinz Kopf | Karlheinz Kopf |
| Ing. Norbert Hofer | Ing. Norbert Hofer |
| Alm Nikolaus, Mag. | Alm Nikolaus, Mag |
| Amon Werner, MBA | Amon Werner, MBA |
| Angerer Erwin | Angerer Erwin |
The data contain both prefixes, like "Ing.", and suffixes, like MBA. Let’s write a short function to find either part of the names, in order to remove them. The function is not called str_subset because there is already such a function in the stringr package. The str_filter function takes three arguments:
sep is the separator that starts or ends the part of the string that we want to remove
side is the side of the string on which the part of the string is expected to be found
greedy asks if all prefixes or suffixes, if there are more than one, should be removed
The function can match either prefixes or suffixes, in any quantity:
Note that the function needs the user to escape any special character in the sep argument: using . as a separator will create a destructive regular expression that will eliminate the entire string. Also note that the function will strip any space located around the separator.
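The behaviour described above can be sketched as follows. This is a hypothetical implementation built on base R's sub; it assumes that sep is pasted into the regular expression as is, which is why special characters must be escaped by the caller:

```r
# Hypothetical implementation of str_filter, for illustration only.
str_filter <- function(x, sep, side = c("right", "left"), greedy = TRUE) {
  side <- match.arg(side)
  if (side == "right") {
    # greedy: cut at the first separator; otherwise cut at the last one
    x <- if (greedy) {
      sub(paste0(sep, ".*$"), "", x, perl = TRUE)
    } else {
      sub(paste0("^(.*)", sep, ".*$"), "\\1", x, perl = TRUE)
    }
  } else {
    # greedy: cut up to the last separator; otherwise up to the first one
    x <- if (greedy) {
      sub(paste0("^.*", sep), "", x, perl = TRUE)
    } else {
      sub(paste0("^.*?", sep), "", x, perl = TRUE)
    }
  }
  # Strip the spaces that surrounded the separator
  trimws(x)
}

str_filter("Alm Nikolaus, Mag", ",", "right")    # suffix removed
str_filter("Ing. Norbert Hofer", "\\.", "left")  # prefix removed
str_filter("Ing. Norbert Hofer", ".", "left")    # unescaped: string wiped out
```

The last call illustrates the warning above: an unescaped . matches any character, so the pattern swallows the whole string.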
We run the function twice on our list of names: first with side set to "right" and sep set to ",", in order to remove "Mag", "MBA" and similar suffixes, and then with side set to "left" and sep set to "\\." to remove "Ing." and similar prefixes. The results show neither of these:
| from | to |
|---|---|
| Doris Bures | Doris Bures |
| Karlheinz Kopf | Karlheinz Kopf |
| Ing. Norbert Hofer | Norbert Hofer |
| Alm Nikolaus, Mag | Alm Nikolaus |
| Amon Werner, MBA | Amon Werner |
| Angerer Erwin | Angerer Erwin |
It is important to clean the “right-hand side” of the names before cleaning the “left-hand side” to avoid any pattern where they get confused together—which would lose the name in the middle.
It is also important for our solution that the prefixes do not contain any commas, otherwise the prefix and suffix patterns would get mixed up and the results would fail to identify the name in the middle. If the prefixes contained commas, we would need to be more cautious and use str_locate_all to subset the names more carefully.
A related function can extract the prefix or suffix and the name to a list:
The function uses the str_filter function to find the part of the string that is not considered a prefix or suffix, and then uses the (vectorized) str_replace function from the stringr package to remove that part of the string from the original text. When there is no prefix or suffix, the result is a missing value:
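A minimal sketch of that logic follows. Both functions are hypothetical implementations: str_filter is repeated in compact form so that the block runs on its own, and this version of str_detach returns a character vector rather than a list, which is a simplification. It also assumes that the separator and surrounding spaces should be stripped from the detached part:

```r
library(stringr)

# Compact copy of a hypothetical str_filter, so that this block is runnable.
str_filter <- function(x, sep, side = c("right", "left"), greedy = TRUE) {
  side <- match.arg(side)
  pattern <- if (side == "right") {
    if (greedy) paste0(sep, ".*$") else paste0("^(.*)", sep, ".*$")
  } else {
    if (greedy) paste0("^.*", sep) else paste0("^.*?", sep)
  }
  keep_group <- side == "right" && !greedy
  trimws(sub(pattern, if (keep_group) "\\1" else "", x, perl = TRUE))
}

# Hypothetical str_detach: returns what str_filter would have removed.
str_detach <- function(x, sep, side = c("right", "left"), greedy = TRUE) {
  side <- match.arg(side)
  kept <- str_filter(x, sep, side, greedy)
  # Remove the kept part, matched literally, from the original text
  rest <- str_replace(x, fixed(kept), "")
  # Strip separators and spaces left at either edge
  edges <- paste0("^(\\s|", sep, ")+|(\\s|", sep, ")+$")
  rest <- gsub(edges, "", rest, perl = TRUE)
  # Nothing detached means a missing value
  rest[which(rest == "")] <- NA_character_
  rest
}

str_detach(c("Ing. Norbert Hofer", "Doris Bures"), "\\.", "left")
str_detach(c("Alm Nikolaus, Mag", "Doris Bures"), ",", "right")
```

Matching the kept part with fixed() avoids treating characters in the name itself as regular expression syntax.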
Applying the function to our data allows us to extract the prefix or suffix of the names. In the data extract below, the prefix column is where we used str_detach on the "left" side with separator "\\.", and suffix is the column where we targeted the "right" side with separator ",":
| from | prefix | suffix |
|---|---|---|
| Doris Bures | | |
| Karlheinz Kopf | | |
| Ing. Norbert Hofer | Ing | |
| Alm Nikolaus, Mag | | Mag |
| Amon Werner, MBA | | MBA |
| Angerer Erwin | | |
Let’s finally wrap all processing functions in one, which returns a data frame of cleaned names with their prefixes and (cleaned) suffixes:
The combined results of all previous functions are shown below, with the prefix and suffix columns using str_detach and some further text replacement to extract clean prefixes and suffixes:
| from | prefix | name | suffix |
|---|---|---|---|
| Doris Bures | | Doris Bures | |
| Karlheinz Kopf | | Karlheinz Kopf | |
| Ing. Norbert Hofer | Ing | Norbert Hofer | |
| \r\n\t\t\r\nAlm Nikolaus, Mag.\r\n | | Alm Nikolaus | Mag |
| \r\n\t\t\r\nAmon Werner, MBA\r\n | | Amon Werner | MBA |
| \r\n\t\t\r\nAngerer Erwin\r\n | | Angerer Erwin | |
After inspection of the data, the code gets only one of the 186 rows wrong, due to one person having his name written differently from all others (row #76). This problematic case will be fixed in one line of code.
It also appears that there is only one name with a prefix (row #3), and that name comes from the first three rows of the data, which designate people who re-appear in the later rows but with their names ordered differently (compare rows #1-3 to #4-6). The only step left is therefore to remove the first three rows of the results.
Both steps outlined above (dropping the extra rows and fixing the sole problematic case) can be performed together:
Note that the names are presented as family names, followed by a space, followed by first names, optionally followed by a space and one initial. Using str_count to count the number of spaces found in the names seems to confirm that this is how the data are structured.
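Such a check can be sketched with a handful of names taken from the cleaned data shown in this note; the object names x and spaces are only illustrative:

```r
library(stringr)

# A few cleaned names, in "family name first" order
x <- c("Angerer Erwin", "Riemer Josef A", "El Habbassi Asdin")

# Number of spaces per name: 1 means "family name + one first name"
spaces <- str_count(x, " ")
table(spaces)

# Names with more than one space deserve a closer look
x[spaces > 1]
```

A count above one does not tell whether the extra word belongs to the family name or to the first names, which is why those cases are worth reading by eye.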
If the pattern described above is fixed, a simple function can “invert” the names to have first names (and their optional initial) at the front of the family names:
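One possible sketch of such a function, assuming that the family name is always the first word and using a hypothetical name, str_invert:

```r
# Hypothetical helper: moves the first word (the family name) to the end.
str_invert <- function(x) {
  # \\1 is the first word, \\2 is everything that follows the first space
  sub("^(\\S+)\\s+(.+)$", "\\2 \\1", x, perl = TRUE)
}

str_invert(c("Angerer Erwin", "Riemer Josef A"))
```

Names made of a single word do not match the pattern and are returned unchanged.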
Inspecting the results will reveal one problematic case where the family name is made of two words ("El Habbassi Asdin"), so let’s fix that by “protecting” the “El” prefix before inverting. The code below does so and then shows all cases where the name inversion might have gone wrong:
At that stage, the results look fine even when the names are ambiguous:
| from | to |
|---|---|
| Aslan Aygül Berivan | Aygül Berivan Aslan |
| Bösch Reinhard Eugen | Reinhard Eugen Bösch |
| El Habbassi Asdin | Asdin El Habbassi |
| Eßl Franz Leonhard | Franz Leonhard Eßl |
| Feichtinger Klaus Uwe | Klaus Uwe Feichtinger |
| Fekter Maria Theresia | Maria Theresia Fekter |
| Gamon Claudia Angela | Claudia Angela Gamon |
| Karlsböck Andreas F | Andreas F Karlsböck |
| Krainer Kai Jan | Kai Jan Krainer |
| Riemer Josef A | Josef A Riemer |
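The “protection” step described above can be sketched as follows. The str_invert helper and the underscore placeholder are assumptions made for illustration; any character that never occurs in the names would do:

```r
# Hypothetical inversion helper: first word goes to the end of the string.
str_invert <- function(x) sub("^(\\S+)\\s+(.+)$", "\\2 \\1", x, perl = TRUE)

x <- c("El Habbassi Asdin", "Riemer Josef A", "Angerer Erwin")

# Glue "El" to the rest of the family name so that it counts as one word
protected <- sub("^El ", "El_", x)

# Invert, then restore the space inside the family name
inverted <- sub("El_", "El ", str_invert(protected), fixed = TRUE)

# Show the ambiguous cases, i.e. names made of more than two words
data.frame(from = x, to = inverted)[lengths(strsplit(x, " ")) > 2, ]
```

Here the ambiguous cases are found with base R's strsplit; counting spaces with str_count would work equally well.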
There are no nominal duplicates in the data, so there is no need to process the names further. However, if there were duplicates among processed names, the dplyr package would come in handy to do something like appending numbers to the duplicate names, so that they would read "Jon Example-1" and "Jon Example-2".
We therefore finalize the data by running the following code to drop the original names, invert the processed names as shown above, and then de-duplicate them if needed:
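The de-duplication part of that step can be sketched with dplyr as follows. The toy data frame d and the helper columns copies and copy are hypothetical; the example names are made up to show what happens when a duplicate exists:

```r
library(dplyr)

# Toy data with one duplicated name (illustrative values only)
d <- data.frame(name = c("Jon Example", "Ann Sample", "Jon Example"),
                stringsAsFactors = FALSE)

d <- d %>%
  group_by(name) %>%
  # copies: how many times the name occurs; copy: rank within the group
  mutate(copies = n(), copy = row_number()) %>%
  ungroup() %>%
  # Append "-1", "-2", ... only to names that occur more than once
  mutate(name = if_else(copies > 1, paste0(name, "-", copy), name)) %>%
  select(-copies, -copy)

d$name
```

Names that occur only once are left untouched, so searching the result for "-1" is enough to tell whether any duplicate was found.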
Inspecting the final data frame for any occurrence of "-1" confirms that the data did not contain duplicates, and we have reached our goal: all names in the data have been cleaned up and made unique.
The code featured in this note is available from this Gist, which contains a backup of the example data. As previously remarked, the code is problem-dependent: it fits the example data that we used in this note. However, there is a fair chance that the code might be reusable without too many changes in different contexts.