How to Apply Text Distances and Fuzzy Joins

[This article was first published on R – Predictive Hacks, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Edit Distance for measuring the Text Distance

Today we will talk about text similarities and how we can “calculate” a distance measure between two texts. For example, intuitively we know that the text “cats” is close to “rats” since they differ in one letter (Note: Forget about the meaning of the words). A popular approach of distance measure is the Edit Distance which is a way of quantifying how dissimilar two strings (e.g., words) are to one another by counting the minimum number of operations required to transform one string into the other. The main transformations are the following:

  • Insertion of a new character. Example: boat vs boats (insertion of an s)
  • Deletion of an existing character. Example: fair vs far (deletion of an i)
  • Substitution of an existing character. Example: lamp vs lamb (substitution of p with b)
  • Transposition of two existing consecutive characters. Example: sing vs sign (transposition of ng to gn)

We will work with the following two string metrics:

  • Levenshtein distance which is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.
  • Damerau-Levenshtein distance which differs from the classical Levenshtein distance by including transpositions among its allowable operations in addition to the three classical single-character edit operations (insertions, deletions and substitutions).

Let’s try to calculate the Levenshtein distance of the words “balloon” and “baboon”. The changes that we need to apply are one deletion (i.e. the “l”) and one insertion (i.e. the “o”). Thus, their distance is 2.

R gives us the opportunity to calculate easily the distances between two texts. For example:

library(stringdist)

# methods "lv" is for Levenshtein and "dl" is for Damerau-Levenshtein 
stringdist("antarctica", "africa", method = "lv")

5

Notice that the string distance can be applied to sentences and documents. For example:

stringdist("Georgios Pipis", "George Pipis", method = "lv")

3

Q-Grams

Another approach of measuring the distance between text is the q-gram distance which is the sum of absolute differences between N-gram vectors of both strings. Once we calculate the vector space of the Q-grams (which is like the N-grams of letters) we can calculate the distance by applying Cosine Distance or Jaccard Distance. Notice that both Cosine and Jaccard distance take value from 0 to 1, where the smaller the number the smaller the distance and the higher the similarity between two texts.

Let’s see the 1-Gram and 2-Grams of the input text “George Pipis

# 1 gram
qgrams("George Pipis", q = 1)

   G e o r s p g i P  
V1 1 2 1 1 1 1 1 2 1 1

# 2 gram
qgrams("George Pipis", q = 2)

   Ge eo or rg pi ge ip is e  Pi  P
V1  1  1  1  1  1  1  1  1  1  1  1

Let’s calculate the 2-grams of two texts such as “George Pipis” and “Georgios Pipis

qgrams("George Pipis", "Georgios Pipis", q = 2)

   Ge eo or rg s  pi os ge ip is gi io e  Pi  P
V1  1  1  1  1  0  1  0  1  1  1  0  0  1  1  1
V2  1  1  1  1  1  1  1  0  1  1  1  1  0  1  1

Now we can calculate the Cosine and Jaccard distance of the two texts above based on 2-grams.

# cosine distance
stringdist("George Pipis", "Georgios Pipis", method = "cosine", q = 2)

[1] 0.2473822

stringdist("George Pipis", "Georgios Pipis", method = "jaccard", q = 2)

[1] 0.4

For comparison let’s calculate the Jaccard distance of “George Pipis” vs “Rick Pitino

stringdist("George Pipis", "Rick Pitino", method = "jaccard", q = 2)

[1] 0.8947368

Fuzzy Joins based on Text Distance

As a data scientist, it is quite common to apply Data Linkage which is briefly a method of bringing information from different sources together about the same person or entity to create a new, richer dataset.

Let’s describe one case where we want to match the basketball players of two different data sources where the names have been written slightly different as well as their height (cm). For this task, we will apply fuzzy matches.

We assume that we are dealing with the following two datasets.

dfa<-data.frame(ID=c(1,2,3,4),
                Name=c("Dimitrios", "Nikolas", "Georgios", "Nick"),
                Surname=c("Diamantidis", "Pappas", "Papagiannis", "Calathis"),
                Height = c(196, 195, 218, 198)
                )  

dfb<-data.frame(DoB=c(1980,1990,1997,1989),
                Name=c("Dimitris", "Nikos", "George", "Nick"),
                Surname=c("Diamantides", "Pappas", "Papagiannis", "Calathes"),
                Height = c(196, 196, 220, 199)
)

And we want to join those two data frames based on the following conditions:

  1. The Levenshtein distance of the Names must be less than 4
  2. The cosine distance of (q=2) of the Surnames must be less than 0.5
  3. The difference of the Height must be less than 3 cm

We will work with the fuzzyjoin library:

library(fuzzyjoin)


# Check if the absolute value between left and right is smaller than three
is_closer_than_three_cm <- function(left, right) {
  abs(left - right) < 3
}

# Calculate the string distance - it should be smaller than 4
is_name_distance_below_four <- function(left, right) {
  stringdist(left, right) < 4
}

# Calculate the string distance - it should be smaller than 0.5
is_surname_distance_below_four <- function(left, right) {
  stringdist(left, right, method="cosine", q=2) < 0.5
}



# Join by "title" and "year" with our two helper functions
fuzzy_left_join(
  dfa,dfb,
  by = c("Name", "Surname", "Height"),
  match_fun = c("Name" = is_name_distance_below_four,
                "Surname" = is_surname_distance_below_four,
                "Height" = is_closer_than_three_cm)
)

  ID    Name.x   Surname.x Height.x  DoB   Name.y   Surname.y Height.y
1  1 Dimitrios Diamantidis      196 1980 Dimitris Diamantides      196
2  2   Nikolas      Pappas      195 1990    Nikos      Pappas      196
3  3  Georgios Papagiannis      218 1997   George Papagiannis      220
4  4      Nick    Calathis      198 1989     Nick    Calathes      199

Although there were differences in all three fields (Name, Surname, Heigh), we managed to join the two data sources taking into consideration the distance between the fields.

Generally, working with full names is a difficult task. You can have a look at the name parser which is a good idea to start with before you apply the fuzzy joins.

To leave a comment for the author, please follow the link and comment on their blog: R – Predictive Hacks.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)