[This article was first published on Mark van der Loo » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The latest release of the stringdist package for approximate text matching has two performance-enhancing novelties. First of all, encoding conversion got a lot faster since this is now done from C rather than from R.

Secondly, stringdist now employs multithreading based on the openmp protocol. This means that calculations are now parallelized on multicore machines running OS’s that support openmp.

The stringdist package offers two main functions, both of which are now parallelized with openmp:

• stringdist can compute a number of different string metrics between vectors of strings (see here)
• amatch is an approximate text matching version of R’s native match function.

By default, the package now uses the following number of cores: if your machine has one or two cores, all of them are used. If your machine has 3 or more cores, $n-1$ cores are used and the number of cores is determined by a call to parallel::detectCores(). This way, you can still use your computer for other things while stringdist is doing its job. I set this default since I noticed in some benchmarks that using all cores in a computation is sometimes slower than using $n-1$ cores. This is probably because one of the cores is occupied with (for example) competing OS tasks, but I haven’t thourougly investigated that. You may still de- or increase the maximum amount of resources consumed since both amatch and stringdist now have a nthread argument. You may also alter the global option

  options("sd_num_thread")


or change the environmental variable OMP_THREAD_LIMIT prior to loading stringdist, but I’m digressing in details now.

A simple benchmark on my quadcore Linux machine (code at the end of the post) shows a near linear speedup as a function of the number of cores. The (default) distance computed here is the optimal string alignment distance. For this benchmark I sampled 10k strings of lengths between 5 and 31 characters. The first benchmark (left panel) shows the time it takes to compute 10k pairwise distances as a function of the number of cores used (nthread=1,2,3,4). The right panel shows how much time it takes to fuzzy-match 15 strings against a table of 10k strings as a function of the number of threads. The areas around the lines show the 1st and 3rd quartile interval of timings (thanks to the awesome microbenchmark package of Olaf Mersmann).

According to the Writing R extensions manual, certain commercially available operating systems have extra (fixed?) overhead when running openmp-based multithreading. However, for larger computations this shouldn’t really matter.

library(microbenchmark)
library(stringdist)
set.seed(2015)
# number of strings
N