The stringdist package

[This article was first published on Mark van der Loo, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

String metrics have important applications in web search, spelling correction and computational biology amongst others. Many different metrics exist, but the most well-known are based on counting the number of basic edit operations it takes to turn one string into another.

String distance functions seem to have been partly missing and partly scattered around R and CRAN. For example, the generalized Levenshtein distance (aka restricted Damerau-Levenshtein distance) is implemented in R’s native adist function as well as in the RecordLinkage package. The latter also implements the Jaro-Winkler distance.

I’ve just published a package that (re-)implements four different string metrics and offers them through a uniform interface:

  • Hamming distance: for strings of equal size only; counts the number of different characters.
  • Levenshtein distance: counts the weighted number of deletions, insertions and substitutions.
  • Restricted Damerau-Levenstein: counts the weighted number of deletions, insertions, substitutions and transpositions (character swaps); each character may be transposed only once.
  • True Damerau-Levenshtein distance counts the weighted number of deletions, insertions, substitutions and transpositions.

As far as I know, no weighted Damerau-Levenshtein distance existed in R before (but note that the restricted Damerau-Levenshtein distance is sometimes mistaken for the true DL-distance on the web – including in our own deducorrect package). The metrics mentioned above have been reimplemented in C. In one case I borrowed some C-code from the web and altered it to my liking (check the repo) for the reference.

The package offers two basic interfaces:

  • stringdist computes pairwise distance between character vectors,where the shorter one is recycled.
  • stringdistmatrix: computes the full distance matrix, optionally using multiple cores.

See the built-in manual for more details.

I’m planning to add more distance metrics in the future and I’m happy to receive suggestions, comments, bugreports etc.

The github repo is here and the CRAN page is here.

To leave a comment for the author, please follow the link and comment on their blog: Mark van der Loo.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)