**R – Open Source Automation**, and kindly contributed to R-bloggers)

One of the best things about R is its ability to vectorize code. This allows you to run code much faster than you would if you were using a *for* or *while* loop. In this post, we’re going to show you how to use vectorization to speed up fuzzy matching. First, a little bit of background will be covered. If you’re familiar with vectorization and / or fuzzy matching, feel free to skip further down the post.

**What is vectorization?**

Vectorization works by performing operations on entire vectors, or by extension, matrices, rather than iterating through each element in a collection of objects one at a time. A basic example is adding two vectors together. This can be done like this:

a <- c(3, 4, 5) b <- c(6, 7, 8) sums <- a + b

In this example, *sums* now equals c(9, 11, 13). This is because the addition operator, “+”, was applied pairwise to each vector i.e. the first element of a was added to the first element of b, the second element of a was added to the second element of b, and so on. Vectorization is faster than traditional for loops because it uses parallel operations under the hood. For loops in R, on the other hand, are notoriously slow.

This example may seem simple, but vectorization can be used much more powerfully to speed up a process like fuzzy matching, the topic of this article.

**What is fuzzy matching?**

Fuzzy matching is the process of finding strings that follow similar patterns. For example, suppose you’re trying to join two data sets together on a city field. Data for these cities may be entered into a system manually, allowing for spelling or formatting differences i.e. “Mt. Hood” might be coded as “Mount Hood” etc. For a human, we can see those are clearly the same, but we need an algorithmic way to determine that so we don’t have to manually go through large numbers of possibilities.

**The stringdist package**

To perform fuzzy matching, we’re going to use a package called stringdist. This contains a function we need called *stringsim* which gives a measure of similarity between a pair of strings. This function allows several different algorithms to compare the similarity between two strings, and returns a value between 0 (very dissimilar) and 1 (very similar or equal). For instance:

# load stringdist package require(stringdist) # compare two sample strings stringsim("this is a test", "this is the test", method = "jw")

**Simplistic Approach**

Alright, so to illustrate how much time vectorization saves, let’s do fuzzy matching in a less clever way. We’re going to use the *zipcode* package to get a sample list of cities across the US. These get stored in the vector, *cities*. Note, since this data set is actually at a zip code level, we’re going to have some duplicates in our vector, but that’s not a concern for this example because we’re trying to demonstrate performance differences between vectorization versus non-vectorization. Thus, our *cities* vector contains 44,336 elements.

Also, we create a vector below, called *input*, containing the misspelled or adjusted spellings of several cities in Florida (“Mt. Dora” vs. “Mount Dora”, “Sun City, FL” vs. “Sun City” etc.).

# load packages require(stringdist) require(zipcode) data("zipcode") cities <- zipcode$city # misspell / change spellings from a few Florida cities input <- c("Centry", "Boca R.", "Mt. Dora", "winterhaven", "Sun City, FL")

Now, one way of doing fuzzy matching here would be to loop through each city in our *input* vector, and then loop through each city in the *cities* vector and check the string similarity between each possible pair of strings i.e. check the similarity between the first element in *input* against every single element in *cities*. Since the length of *input* is 5 and the length of *cities* is 44,336, this requires 5 * 44,336 = 221,680 calls to *stringsim*.

# Naive approach start <- proc.time() results <- c() for(city in input) { best_dist <- -Inf for(check in cities) { dist <- stringsim(city, check, method = "jw") if(dist > best_dist) { closest_match <- check best_dist <- dist } } results <- append(results, closest_match) } end <- proc.time()

If we run end – start, we find that this process takes just under 26 seconds. However, let’s see what happens when we ditch the for loops, and use vectorization instead.

**Vectorizing Fuzzy Matching**

# define function to search for best string match in cities get_best_match <- function(city, cities = cities) { max_index <- which.max(stringsim(city, cities, method = "jw")) return(cities[max_index]) } vector_start <- proc.time() vector_results <- sapply(input, function(city) get_best_match(city, cities)) vector_end <- proc.time()

If we run vector_end – vector_start, we see this only takes 0.03 seconds! The main reason for the speed up can be seen in this line:

max_index <- which.max(stringsim(city, cities, method = "jw"))

The inner function, *stringsim*, gets the string similarity between the input city (e.g. “Mt. Dora”) versus every element in the cities vector. This is done in one line, rather than using a for or while loop construct. The output of this function call is a vector that shows the similarity measures (between 0 and 1) of the input city versus everything in *cities*.

We then apply the *which.max* function to get the index of the element in cities that is the closest match to the input city. Using this index, we can get the actual name of the city with the closest match, like below.

return(cities[max_index])

The other reason for the performance increase is using *sapply* to actually loop over the elements in our input vector, passing element in turn to the *get_best_match* function. Since *sapply* is written in C under the hood, it usually runs much faster than a traditional for loop in R.

That’s it for this post! Check out other posts of mine here: http://theautomatic.net/blog/.

The post Vectorize Fuzzy Matching appeared first on Open Source Automation.

**leave a comment**for the author, please follow the link and comment on their blog:

**R – Open Source Automation**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...