Highly Similar Texts


A while ago I was building a database of newspaper articles retrieved from LexisNexis for a research project on which I was working as a research assistant. At some point we noticed that we seemed to have a lot of duplicates in our database. I had already removed exact duplicates with R, so we were really surprised that they were still in there. After some investigation, however, I found that there were indeed small differences between the articles we had manually identified as duplicates. This post describes a strategy we came up with to effectively find those highly similar articles. For convenience, I implemented a version of that solution in the function lnt_similarity of my LexisNexisTools package.

The problem is that R will not consider two texts duplicates if even a single character, such as a comma, whitespace or quotation mark, differs. That was therefore our first guess. But we found more than that: in some cases, subtle edits had been made. A word was swapped for one with the same meaning, the title or position of a person was added or removed, or a new sentence was appended to the end of an article. Strangely, many of those articles were published on the same date and in the same newspaper.
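To illustrate the core of the problem, here is a toy example of my own (not from our database): R's built-in duplicate detection gives up as soon as a single character differs.

# duplicated() only catches exact duplicates; a single added full stop
# is enough for the two texts to be treated as different
duplicated(c("Dog bites man.", "Dog bites man"))
## [1] FALSE FALSE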

For some instances, the differences could be explained by taking a look at the editions: many newspapers in the UK have localised editions, and the Scottish edition of the Daily Mail might have an extra sentence about how a story affected Scotland, or a comment from a Scottish politician added somewhere in the article.

However, this poses a problem for any valid media analysis: these articles would wrongly be counted several times in the sample. The resulting bias would inflate the number of articles from newspapers that feed several different editions into the LexisNexis database. An additional problem was that, while information on editions was available for some outlets, LexisNexis did not seem to provide it for others. So we needed a computational way to find the duplicates.

Available solutions: the accurate one

Comparing chunks of text is more complicated than we initially thought. The most accurate solution available would be the so-called Levenshtein or edit distance.

Consider this example that I pulled off Wikipedia:

  1. kitten → sitten (substitution of “s” for “k”)
  2. sitten → sittin (substitution of “i” for “e”)
  3. sittin → sitting (insertion of “g” at the end).

The distance between “kitten” and “sitting” is three, because three edits need to be made to turn the former into the latter. This is implemented in base R:

adist("kitten", "sitting")
##      [,1]
## [1,]    3

One problem with this is that, arguably, “kitten” and “sitting” are two very different character strings, whereas two newspaper articles of a few hundred words with an edit distance of just 3 are very similar. An easy solution is to calculate the relative distance instead: I chose to divide the distance by the length of the longer of the two texts, which has the advantage that the result will be between 0 (for identical texts) and 1 (for texts that are completely different).

compare <- c("kitten", "sitting")
adist(compare[1], compare[2]) / max(nchar(compare))
##           [,1]
## [1,] 0.4285714

The relative distance of 0.43 means that the texts are very different, which is intuitively right.

However, there is a second problem with using the edit distance: although Vladimir Levenshtein published his paper on the distance more than 50 years ago, calculating it is still extremely computationally demanding, even for modern machines.

The main reason for this is that the computational demand scales up extremely quickly when more texts are compared:

compare <- c("kitten", "sitting", "knitting", "omitting")
adist(compare) / max(nchar(compare))
##       [,1]  [,2]  [,3] [,4]
## [1,] 0.000 0.375 0.375 0.50
## [2,] 0.375 0.000 0.250 0.25
## [3,] 0.375 0.250 0.000 0.25
## [4,] 0.500 0.250 0.250 0.00

For the four words above, 12 calculations are necessary, since each text needs to be compared with every other text in the sample (i.e. n * (n - 1) calculations, with n being the number of texts to compare). And the longer the texts, the more computationally expensive each individual comparison becomes.
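To get a sense of how quickly this adds up, here is a back-of-the-envelope sketch of my own, using the n * (n - 1) formula from above:

# number of pairwise comparisons for corpora of different sizes,
# assuming every text is compared with every other text
n <- c(4, 100, 1000, 10000)
data.frame(texts = n, comparisons = n * (n - 1))
##   texts comparisons
## 1     4          12
## 2   100        9900
## 3  1000      999000
## 4 10000    99990000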

Consider the following: I measure the time and memory needed to calculate the distance between two words (“kitten” and “sitting”) and between two chunks of Lorem Ipsum, 651 and 650 characters long.

compare <- c("kitten", "sitting")
compare2 <- c(
  "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam lacinia elementum sapien, eget aliquet ex finibus ut. Suspendisse eget molestie ligula. Etiam quis purus sed urna lacinia lobortis. Suspendisse tempor purus at nibh ultricies pellentesque. Cras sed blandit risus. Duis sit amet felis magna. Quisque consequat libero id turpis dapibus, non venenatis elit sodales. Nam fermentum, sapien nec vehicula blandit, nulla lectus fringilla ligula, eget tempus eros justo vel ante. Ut quis urna id nunc maximus accumsan vitae non leo. Phasellus vestibulum felis id erat euismod consequat. Nam ac metus vitae nunc volutpat luctus sit amet quis sapien.",
  "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam lacinia elementum sapien, eget aliquet ex finibus ut. Suspendisse eget molestie ligula. Etiam quis purus sed urna lacinia lobortis. Suspendisse tempor purus at nibh ultricies pellentesque. Cras sed blandit risus. Duis sit amet felis magna. Quisque consequat libero id turpis dapibus, non venenatis elit sodales. Nam fermentum, sapien nec vehicula blandit, nulla lectus fringilla ligula, eget tempus eros justo vel ante. Ut quis urna id nunc maximus accumsan vitae non leo. Phasellus vestibulum felis id erat euismod consequat. Nam ac metus vitae nunc volutpat luctus sit amet quis sapien"
)
benchmark <- bench::mark(
  short <- adist(compare) / max(nchar(compare)),
  long <- adist(compare2) / max(nchar(compare2)),
  check = FALSE
)
dplyr::select(benchmark, mean, mem_alloc)
## # A tibble: 2 x 2
##       mean mem_alloc
##   <bch:tm> <bch:byt>
## 1   15.1µs        0B
## 2   12.3ms    10.4KB

To get a feel for this: the mean time for the second calculation is 12.3 ms, that is milliseconds. In the same unit as the first calculation, that would be 12,281 microseconds, roughly 800 times as long on average. The allocated memory can’t even be compared, since the first value is too small to measure, while the second calculation already uses quite a bit of memory. Some testing revealed that around 3,500 articles would already bring my 32 GB RAM machine to its knees. This makes it almost impossible to compare hundreds or thousands of newspaper articles this way.

Available solutions: the fast one

While the edit distance is the most accurate method for comparing texts, there are others which are much more efficient. Instead of comparing texts character by character, it is also possible to chop up each text into its elements (usually words) and then compare which elements are shared by two or more texts.

I use the excellent quanteda package to do this.

library(quanteda)
dfm <- dfm(c(compare, compare2))
as.matrix(dfm)
##       kitten sitting lorem ipsum dolor sit amet , consectetur adipiscing
## text1      1       0     0     0     0   0    0 0           0          0
## text2      0       1     0     0     0   0    0 0           0          0
## text3      0       0     1     1     1   3    3 6           1          1
## text4      0       0     1     1     1   3    3 6           1          1

I’m only displaying the first 10 columns of the document-feature matrix (dfm) here. Basically, every column is one word and every row is one text; the number in each cell tells us how often a word occurs in that text. Comparing these row vectors is a much easier task for a computer, and the implementation in quanteda is extremely efficient and fast (I use the cosine similarity, as this is, in my opinion, the easiest measure for humans to read and interpret).

textstat_simil(dfm, method = "cosine")
##           text1     text2     text3
## text2 0.0000000                    
## text3 0.0000000 0.0000000          
## text4 0.0000000 0.0000000 0.9990959

What we see here is that, of the four texts, only texts 3 and 4 are similar to one another (more than 99% similar); the other two (“kitten” and “sitting”) are considered completely different from all others.
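As a quick sanity check (not part of the original workflow), we can reproduce that value by hand: the cosine similarity is simply the dot product of two row vectors of the dfm, divided by the product of their norms.

# cosine similarity between text3 and text4, computed directly from the dfm
m <- as.matrix(dfm)
sum(m[3, ] * m[4, ]) / sqrt(sum(m[3, ]^2) * sum(m[4, ]^2))
## [1] 0.9990959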

However, consider another example:

compare3 <- c(
  "Dog bites man",
  "Man bites dog"
)

dfm2 <- dfm(compare3)
textstat_simil(dfm2, method = "cosine")
##       text1
## text2     1

The similarity between “Dog bites man” and “Man bites dog” is 100%. But the meaning of the two sentences is arguably very different! This mismatch between intuition and results comes from textstat_simil() ignoring word order and from dfm() discarding capital letters by default; this bag-of-words simplification is also the main reason behind the performance gain. It makes textstat_simil() a fast alternative, but a relatively inaccurate one when comparing texts with similar vocabulary but different meaning.
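A quick look at the document-feature matrix of the two sentences (again just an illustration, not from the original post) shows why:

# after lowercasing, both sentences are reduced to the same bag of words,
# so their row vectors are identical and the cosine similarity is 1
as.matrix(dfm2)
##       dog bites man
## text1   1     1   1
## text2   1     1   1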

Divide and conquer

However, the two methods can be combined to cancel out each other’s shortcomings. Let’s check this on all three examples I used so far, combined:

texts <- c(compare, compare2, compare3)
dfm3 <- dfm(texts)
simil <- textstat_simil(dfm3, method = "cosine")
simil
##           text1     text2     text3     text4     text5
## text2 0.0000000                                        
## text3 0.0000000 0.0000000                              
## text4 0.0000000 0.0000000 0.9990959                    
## text5 0.0000000 0.0000000 0.0000000 0.0000000          
## text6 0.0000000 0.0000000 0.0000000 0.0000000 1.0000000

As you can see, texts three and four as well as texts five and six are again deemed very similar by textstat_simil(). We can use this information to do some more in-depth testing, but only on the texts already flagged as highly similar. I do some data wrangling here to get the similarity matrix into a presentable form and to calculate the relative edit distance for the flagged pairs only:

library(dplyr)
simil_clean <- simil %>%
  as.matrix() %>%
  reshape2::melt() %>%                 # long format: one row per pair of texts
  mutate_if(is.factor, as.character) %>%
  filter(value > 0.95) %>%             # keep only highly similar pairs
  filter(Var1 != Var2) %>%             # drop comparisons of a text with itself
  mutate(comb = purrr::map_chr(seq_along(Var1), function(i) {
    paste(sort(c(Var1[i], Var2[i])), collapse = " ")
  })) %>%
  distinct(comb, .keep_all = TRUE) %>% # remove mirrored pairs (A-B and B-A)
  mutate(Orig_ID = as.integer(gsub("text", "", Var1)),
         Dup_ID = as.integer(gsub("text", "", Var2))) %>%
  mutate(Orig_text = texts[Orig_ID],
         Dup_text = texts[Dup_ID]) %>%
  select(-Var1, -Var2, -comb) %>%
  mutate(rel_edit = purrr::map_dbl(seq_along(Orig_ID), function(i) {
    # relative edit distance, calculated only for the flagged pairs
    compare <- c(.data$Orig_text[i], .data$Dup_text[i])
    as.numeric(adist(compare)[2] / max(nchar(compare)))
  }))
simil_clean %>% 
  select(-Orig_text, -Dup_text)
##       value Orig_ID Dup_ID  rel_edit
## 1 0.9990959       4      3 0.0015361
## 2 1.0000000       6      5 0.4615385

Now this is a little confusing, so let me explain: IDs 3 and 4 are the Lorem Ipsum texts, 5 and 6 are “Dog bites man” and “Man bites dog”. These texts end up in pairs because their similarity value is above the threshold I used (0.95). rel_edit is the relative edit distance. The more similar two texts are, the lower this second value is, which is the exact opposite of the similarity measure. So the very low distance between the two Lorem Ipsum texts and the high distance between “Dog bites man” and “Man bites dog” indicate that the first pair is extremely similar, while the second pair consists of pretty different texts. In a newspaper media analysis, it would probably make a lot of sense to remove either text 4 or text 3. Texts 5 and 6, however, should probably stay in the database as they are, in fact, quite different.
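In practice, one could turn this into an automatic de-duplication step. Here is a minimal sketch of my own, assuming an arbitrary cut-off of 0.2 for the relative edit distance (the threshold is purely for illustration, not a recommendation):

# IDs of flagged texts that are within the chosen edit distance of their partner
true_dups <- simil_clean %>%
  filter(rel_edit < 0.2) %>%
  pull(Dup_ID)
# drop the flagged duplicates (here text 3, the first Lorem Ipsum version);
# note that texts[-true_dups] is only safe when true_dups is non-empty
texts_deduplicated <- texts[-true_dups]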

Implementation in LexisNexisTools

Luckily for you, all of this is already conveniently implemented in LexisNexisTools. Since the package is made for newspaper data, the function to find similar texts assumes that articles need to be published on the same day to be considered potential duplicates. To compare texts that do not come with dates, simply provide a date vector of the same length as the text object:

LexisNexisTools::lnt_similarity(texts, rep("2018-05-19", length(texts)))
##       Date ID_original                   text_original ID_duplicate                  text_duplicate Similarity  rel_dist
## 2018-05-19           3 Lorem ipsum dolor sit amet, ...            4 Lorem ipsum dolor sit amet, ...          1 0.0015361
## 2018-05-19           5                   Dog bites man            6                   Man bites dog          1 0.4615385

As you can see, the result is the same, but the code is a lot simpler.
