stringdist 0.8: now with soundex

August 22, 2014

(This article was first published on Mark van der Loo, and kindly contributed to R-bloggers)

An update to the stringdist package was released earlier this month. Thanks to a contribution from Jan van der Laan, the package now includes a method to compute soundex codes as defined here. Briefly, soundex encoding aims to map words that sound similar (when pronounced in English) to the same code.

Soundex codes can be computed with the new phonetic function, for example:

> phonetic(c('Euler','Gauss','Hilbert','Knuth','Lloyd','Lukasiewicz','Wachs'))
[1] "E460" "G200" "H416" "K530" "L300" "L222" "W200"
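
For readers who want to see where these codes come from, the rules can be sketched in base R. This is only an illustration of the commonly described American soundex rules, not the code used inside stringdist; the name soundex_sketch is hypothetical, and the sketch simply drops anything that is not an ASCII letter.

```r
# Hypothetical helper for illustration; NOT the stringdist implementation.
soundex_sketch <- function(x) {
  # Digit assigned to each consonant; vowels, y, h and w carry no digit.
  digit <- c(b = "1", f = "1", p = "1", v = "1",
             c = "2", g = "2", j = "2", k = "2",
             q = "2", s = "2", x = "2", z = "2",
             d = "3", t = "3", l = "4",
             m = "5", n = "5", r = "6")

  encode_one <- function(word) {
    chars <- strsplit(tolower(word), "")[[1]]
    chars <- chars[chars %in% letters]   # assumption: silently drop non-letters
    if (length(chars) == 0) return(NA_character_)

    # h and w after the first letter are transparent: they do not
    # separate two consonants that share a digit.
    rest <- chars[-1]
    chars <- c(chars[1], rest[!(rest %in% c("h", "w"))])

    d <- unname(digit[chars])
    d[is.na(d)] <- ""                    # vowels and y act as separators

    # Collapse runs of equal digits, then drop the run that contains the
    # first letter and the empty separator entries.
    runs <- rle(d)$values[-1]
    runs <- runs[runs != ""]

    # First letter plus up to three digits, padded with zeros.
    paste(c(toupper(chars[1]), runs, "0", "0", "0")[1:4], collapse = "")
  }

  vapply(x, encode_one, character(1), USE.NAMES = FALSE)
}

soundex_sketch(c("Euler", "Gauss", "Hilbert", "Knuth", "Lloyd", "Lukasiewicz", "Wachs"))
# "E460" "G200" "H416" "K530" "L300" "L222" "W200"
```

Note how 'Wachs' yields W200 rather than W220: the h between c and s does not break the run of equal digits, whereas the vowels in 'Gauss' do separate G from the following s.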

Since two strings are considered equal when they have the same soundex code, the resulting distance function is two-valued: 0 when the codes match and 1 when they differ.

> stringdist('Claire','Clare',method='soundex')
[1] 0
[1] 1
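
The two values can also be reproduced by comparing the codes directly. The following is a minimal sketch, assuming stringdist 0.8 or later is installed; the input strings 'Clare' and 'Euler' in b are my own illustration, chosen so that one pair matches and one does not.

```r
library(stringdist)

a <- "Claire"
b <- c("Clare", "Euler")   # illustrative inputs: one match, one mismatch

# Compare the soundex codes by hand: 0 when equal, 1 otherwise.
d_manual <- as.numeric(phonetic(a) != phonetic(b))

# The packaged soundex distance is expected to agree with the comparison above.
d_pkg <- stringdist(a, b, method = "soundex")

print(d_manual)
print(d_pkg)
```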

Since soundex is really only defined on the printable ASCII character set, a warning is given when non-ASCII or non-printable ASCII characters are encountered.

> phonetic("Jörgen")
[1] "J?62"
Warning message:
In phonetic("Jörgen") :
  soundex encountered 1 non-printable ASCII or non-ASCII
  characters. Results may be unreliable, see ?printable_ascii

The function printable_ascii, also new in this release, can help you detect such characters.

> printable_ascii(c("jörgen","jurgen"))
[1] FALSE  TRUE
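
To make the notion of "printable ASCII" concrete, here is a rough base R equivalent. It is a sketch under the assumption that printable ASCII means code points 32 (space) through 126 (tilde); is_printable_ascii_sketch is a hypothetical name and is not how stringdist implements the check.

```r
# Hypothetical helper, not the stringdist implementation.
# Assumption: "printable ASCII" = code points 32 (space) to 126 (tilde).
is_printable_ascii_sketch <- function(x) {
  vapply(enc2utf8(x), function(s) {
    # TRUE only if every character falls inside the printable ASCII range
    all(utf8ToInt(s) %in% 32:126)
  }, logical(1), USE.NAMES = FALSE)
}

# "\u00f6" is the escaped form of o-umlaut, so the script is encoding-safe.
is_printable_ascii_sketch(c("j\u00f6rgen", "jurgen"))
# FALSE  TRUE
```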

To get rid of such characters in a sensible way, there are a few options. First of all, you may want to try R’s built-in iconv interface to translate accented characters to ASCII.

> iconv("jörgen",to="ASCII//TRANSLIT")
[1] "jorgen"

However, the behaviour of iconv may be system-dependent; see the iconv documentation for a thorough discussion. Another option is to install the stringi package and use its stri_trans_general function.

> library(stringi)
> stri_trans_general("jörgen","Latin-ASCII")
[1] "jorgen"

This package should yield the same result, regardless of the OS you’re working on.
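
Putting the pieces together, a cleaning step can be placed in front of phonetic. This is a minimal sketch, assuming both stringdist (0.8 or later) and stringi are installed; the input vector names_raw is made up for illustration.

```r
library(stringdist)
library(stringi)

# Illustrative input; "\u00f6" is the escaped form of o-umlaut.
names_raw <- c("J\u00f6rgen", "Jorgen")

# Transliterate accented characters to plain ASCII before encoding.
names_ascii <- stri_trans_general(names_raw, "Latin-ASCII")

# Guard: stop early if anything outside printable ASCII is left over.
stopifnot(all(printable_ascii(names_ascii)))

# Both spellings should now receive the same soundex code, without a warning.
codes <- phonetic(names_ascii)
print(codes)
```

Transliterating first is a design choice: it keeps the vowel in place as a separator instead of letting an unknown character end up in the code, as happened with "J?62" above.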
