stringdist 0.9: exercise all your cores

January 26, 2015
By

(This article was first published on Mark van der Loo » R, and kindly contributed to R-bloggers)

The latest release of the stringdist package for approximate text matching has two performance-enhancing novelties. First of all, encoding conversion got a lot faster since this is now done from C rather than from R.

Secondly, stringdist now employs multithreading based on the openmp protocol. This means that calculations are now parallelized on multicore machines running OS’s that support openmp.

The stringdist package offers two main functions, both of which are now parallelized with openmp:

  • stringdist can compute a number of different string metrics between vectors of strings (see here)
  • amatch is an approximate text matching version of R’s native match function.

By default, the package now uses the following number of cores: if your machine has one or two cores, all of them are used. If your machine has 3 or more cores, n-1 cores are used and the number of cores is determined by a call to parallel::detectCores(). This way, you can still use your computer for other things while stringdist is doing its job. I set this default since I noticed in some benchmarks that using all cores in a computation is sometimes slower than using n-1 cores. This is probably because one of the cores is occupied with (for example) competing OS tasks, but I haven’t thourougly investigated that. You may still de- or increase the maximum amount of resources consumed since both amatch and stringdist now have a nthread argument. You may also alter the global option

1
  options("sd_num_thread")

or change the environmental variable OMP_THREAD_LIMIT prior to loading stringdist, but I’m digressing in details now.

A simple benchmark on my quadcore Linux machine (code at the end of the post) shows a near linear speedup as a function of the number of cores. The (default) distance computed here is the optimal string alignment distance. For this benchmark I sampled 10k strings of lengths between 5 and 31 characters. The first benchmark (left panel) shows the time it takes to compute 10k pairwise distances as a function of the number of cores used (nthread=1,2,3,4). The right panel shows how much time it takes to fuzzy-match 15 strings against a table of 10k strings as a function of the number of threads. The areas around the lines show the 1st and 3rd quartile interval of timings (thanks to the awesome microbenchmark package of Olaf Mersmann).

benchmark

According to the Writing R extensions manual, certain commercially available operating systems have extra (fixed?) overhead when running openmp-based multithreading. However, for larger computations this shouldn’t really matter.

1
2
3
4
5
library(microbenchmark)
library(stringdist)
set.seed(2015)
# number of strings
N

To leave a comment for the author, please follow the link and comment on their blog: Mark van der Loo » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)