Approximate string matching in R

August 9, 2013
(This article was first published on Mark van der Loo, and kindly contributed to R-bloggers)

I have released a new version of the stringdist package.

Besides some new string distance algorithms, it now contains two convenient matching functions:

  • amatch: Equivalent to R's match function but allowing for approximate matching.
  • ain: Similar to R's %in% operator, but allowing for approximate matching.
# here's an example of amatch
> library(stringdist)
> x <- c('foo', 'bar')
> amatch('fu',x,maxDist=2)
[1] 1
 
# if we decrease the maximum allowed distance, we get
> amatch('fu',x,maxDist=1)
[1] NA
 
# just like with 'match' you can control the output of no-matches:
> amatch('fu',x,maxDist=1,nomatch=0)
[1] 0
 
# to see if 'fu' matches approximately with any element of x:
> ain('fu',x)
[1] FALSE
 
# however, if we allow for larger distances
> ain('fu',x,maxDist=2)
[1] TRUE

Check the help file for other options, such as how to choose the string distance algorithm.
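
As a rough sketch of what choosing an algorithm might look like: the calls below assume the method argument documented in the package help, and the method names 'osa', 'lv' and 'lcs' come from that documentation rather than from this post, so check ?stringdist for the list your installed version supports.

```r
library(stringdist)

x <- c('foo', 'bar')

# Distance from 'fu' to 'foo' under three different algorithms.
d_osa <- stringdist('fu', 'foo', method = 'osa')  # optimal string alignment
d_lv  <- stringdist('fu', 'foo', method = 'lv')   # Levenshtein
d_lcs <- stringdist('fu', 'foo', method = 'lcs')  # lcs distance: insertions and deletions only

# The same maxDist can yield a match under one algorithm and none under
# another, because each algorithm counts edits differently.
m_osa <- amatch('fu', x, method = 'osa', maxDist = 2)
m_lcs <- amatch('fu', x, method = 'lcs', maxDist = 2)

print(c(osa = d_osa, lv = d_lv, lcs = d_lcs))
print(c(osa = m_osa, lcs = m_lcs))
```

Because 'lcs' does not allow substitutions, it places 'fu' further from 'foo' than 'osa' does, so a maxDist that is large enough for one algorithm may be too strict for another.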

Note that stringdist and stringdistmatrix previously returned -1 if a distance was undefined or exceeded a predefined maximum. These functions now return Inf in such cases, which makes comparisons easier. This may break your code if it explicitly tests the output for -1.
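
To illustrate why Inf is easier to compare against, here is a small sketch. It assumes the 'hamming' method, which is undefined for strings of unequal length; the variable names are made up for the example.

```r
library(stringdist)

# Hamming distance is only defined for strings of equal length,
# so the second comparison ('foo' vs 'fu') has no defined distance.
d <- stringdist('foo', c('fou', 'fu'), method = 'hamming')

# With Inf, a plain threshold test does the right thing: an undefined
# distance is never "close enough". A -1 would have passed this test.
close_enough <- d <= 1

# Code that used to check for -1 can check for non-finite values instead.
undefined <- is.infinite(d)

print(d)
print(close_enough)
print(undefined)
```

If existing code contains a test such as d == -1, replacing it with is.infinite(d) should restore the old behaviour.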

With the latest release also arrive the latest bugs, so please drop me a line if you happen to stumble upon one.

The next release will probably not include any user-facing changes, but I'm planning to improve performance by smarter memory allocation and better maxDist handling for some of the string distance algorithms.

