The stringdist package

February 26, 2013

(This article was first published on Mark van der Loo, and kindly contributed to R-bloggers)

String metrics have important applications in web search, spelling correction and computational biology, among others. Many different metrics exist, but the best known are based on counting the number of basic edit operations it takes to turn one string into another.

String distance functions seem to have been partly missing and partly scattered around R and CRAN. For example, the generalized Levenshtein distance (aka restricted Damerau-Levenshtein distance) is implemented in R's native adist function as well as in the RecordLinkage package. The latter also implements the Jaro-Winkler distance.

I've just published a package that (re-)implements four different string metrics and offers them through a uniform interface:

  • Hamming distance: for strings of equal size only; counts the number of positions at which the characters differ.
  • Levenshtein distance: counts the weighted number of deletions, insertions and substitutions.
  • Restricted Damerau-Levenshtein distance: counts the weighted number of deletions, insertions, substitutions and transpositions (swaps of adjacent characters); each character may be transposed only once.
  • True Damerau-Levenshtein distance: counts the weighted number of deletions, insertions, substitutions and transpositions.
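To make the "transposed only once" restriction concrete, here is a minimal base-R sketch of the restricted Damerau-Levenshtein distance with unit weights. The function name osa_distance is a hypothetical helper introduced for illustration only; it is not the package's C implementation and it ignores weights.

```r
# Restricted Damerau-Levenshtein distance, unit weights.
# Hypothetical illustration; not the code used inside the package.
osa_distance <- function(a, b) {
  x <- strsplit(a, "", fixed = TRUE)[[1]]
  y <- strsplit(b, "", fixed = TRUE)[[1]]
  n <- length(x)
  m <- length(y)
  # d[i + 1, j + 1]: distance between the first i characters of a
  # and the first j characters of b.
  d <- matrix(0, nrow = n + 1, ncol = m + 1)
  d[, 1] <- 0:n
  d[1, ] <- 0:m
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- if (x[i] == y[j]) 0 else 1
      best <- min(d[i, j + 1] + 1,  # deletion
                  d[i + 1, j] + 1,  # insertion
                  d[i, j] + cost)   # substitution or match
      # Swap of two adjacent characters; the swapped pair is not edited again.
      if (i > 1 && j > 1 && x[i] == y[j - 1] && x[i - 1] == y[j]) {
        best <- min(best, d[i - 1, j - 1] + 1)
      }
      d[i + 1, j + 1] <- best
    }
  }
  d[n + 1, m + 1]
}

osa_distance("ab", "ba")   # 1: a single transposition
adist("ab", "ba")[1, 1]    # 2: base R's adist has no transposition operation
osa_distance("ca", "abc")  # 3
```

The last call shows where the restricted and true variants part ways: the true Damerau-Levenshtein distance between "ca" and "abc" is 2 (swap to "ac", then insert "b" between the swapped characters), but the restricted version may not edit a transposed pair again and so needs 3 operations.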

As far as I know, no weighted Damerau-Levenshtein distance existed in R before (but note that the restricted Damerau-Levenshtein distance is sometimes mistaken for the true DL-distance on the web, including in our own deducorrect package). The metrics mentioned above have been reimplemented in C. In one case I borrowed some C code from the web and altered it to my liking (check the repo for the reference).

The package offers two basic interfaces:

  • stringdist: computes pairwise distances between character vectors, where the shorter one is recycled.
  • stringdistmatrix: computes the full distance matrix, optionally using multiple cores.
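A short usage sketch of the two interfaces follows. It assumes the package is installed and that the method codes "lv" and "osa" and the weight argument (deletion, insertion, substitution, transposition, in that order) behave as in recent CRAN releases; argument names may differ in the version described here, so check the built-in manual. The multi-core option is left out because its argument name depends on the package version.

```r
library(stringdist)

a <- c("foo", "bar", "baz", "qux")
b <- c("fo", "bra")

# Pairwise distances; the shorter vector b is recycled to the length of a.
stringdist(a, b, method = "lv")

# Full distance matrix: one row per element of a, one column per element of b.
m <- stringdistmatrix(a, b, method = "osa")
dim(m)

# Weighted variant (assumed argument order: d, i, s, t):
# a transposition costs half as much as the other operations.
stringdist("ab", "ba", method = "osa", weight = c(d = 1, i = 1, s = 1, t = 0.5))
```

With recycling, the first call compares "foo" with "fo", "bar" with "bra", "baz" with "fo" and "qux" with "bra", whereas the matrix call compares every element of a with every element of b.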

See the built-in manual for more details.

I'm planning to add more distance metrics in the future and I'm happy to receive suggestions, comments, bug reports, etc.

The GitHub repo is here and the CRAN page is here.
