Make pleasingly parallel R code with rxExecBy

April 28, 2017
(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Some things are easy to convert from a long-running sequential process to a system where each part runs at the same time, thus reducing the required time overall. We often call these "embarrassingly parallel" problems, but given how easy it is to reduce the time it takes to execute them by converting them into a parallel process, "pleasingly parallel" may well be a more appropriate name.

Using the foreach package (available on CRAN) is one simple way of speeding up pleasingly parallel problems using R. A foreach loop is much like a regular for loop in R, and by default will run each iteration in sequence (again, just like a for loop). But by registering a parallel "backend" for foreach, you can run many (or maybe even all) iterations at the same time, using multiple processors on the same machine, or even multiple machines in the cloud.
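As a minimal sketch of that switch, the example below runs the same loop sequentially and then in parallel. The doParallel backend, the two-worker cluster, and the slow_square function are illustrative choices, not something the original post prescribes.

```r
library(foreach)
library(doParallel)  # one of several parallel backends available for foreach

# Hypothetical stand-in for an expensive task that is independent per iteration.
slow_square <- function(i) {
  Sys.sleep(0.1)
  i^2
}

# Sequential: %do% runs the iterations one after another, like a for loop.
seq_result <- foreach(i = 1:4, .combine = c) %do% slow_square(i)

# Parallel: register a backend, then use %dopar% instead of %do%.
cl <- makeCluster(2)  # two local worker processes (an arbitrary choice)
registerDoParallel(cl)
par_result <- foreach(i = 1:4, .combine = c) %dopar% slow_square(i)
stopCluster(cl)

# Both loops return the same values in the same order.
print(identical(seq_result, par_result))
```

Only the operator and the backend registration change between the two loops; the loop body stays the same.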

For many applications, though, you need to provide a different chunk of data to each iteration to process. (For example, you may need to fit a statistical model within each country — each iteration will then only need the subset for one country.) You could just pass the entire data set into each iteration and subset it there, but that's inefficient and may even be impractical when dealing with very large datasets sitting in a remote repository. A better idea would be to leave the data where it is, and run R within the data repository, in parallel.
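The per-country pattern can be sketched in plain R as follows. The data set, column names, and model formula are made up for illustration, and everything here runs in local memory; it shows only the partition-then-apply idea, not the in-repository execution described above.

```r
# Hypothetical data: advertising spend and revenue for three countries.
set.seed(42)
sales <- data.frame(
  country = rep(c("FR", "JP", "US"), each = 20),
  spend   = runif(60, 1, 10)
)
sales$revenue <- 3 * sales$spend + rnorm(60)

# Partition once by the key, so each task receives only its own rows.
by_country <- split(sales, sales$country)

# Fit one model per partition; each fit is independent of the others,
# which is what makes the problem pleasingly parallel.
fits <- lapply(by_country, function(chunk) lm(revenue ~ spend, data = chunk))

# Collect the estimated slope for each country.
slopes <- vapply(fits, function(fit) coef(fit)[["spend"]], numeric(1))
print(slopes)
```

Because the partitions do not depend on one another, the lapply call could be swapped for a parallel loop without changing the function applied to each chunk.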

Microsoft R 9.1 introduces a new function, rxExecBy, for exactly this purpose. When your data is sitting in SQL Server or Spark, you can specify a set of keys to partition the data by, and an R function (any R function, built-in or user-defined) to apply to the partitions. The data doesn't actually move: R runs directly on the data platform. You can also run it on local data in various formats.
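A rough sketch of a local call is shown here, assuming the RevoScaleR package that ships with Microsoft R Client or Microsoft R Server is installed. The count_rows function and the use of the built-in iris data are illustrative, and the argument names (inData, keys, func) and the (keys, data) signature of the user function follow my reading of the RevoScaleR documentation, so check them against your installed version.

```r
# Hypothetical user function: rxExecBy is assumed to call it once per
# partition, passing the key values and a data source for that partition.
count_rows <- function(keys, data) {
  df <- rxImport(inData = data)  # materialize the partition as a data frame
  nrow(df)
}

# RevoScaleR is not on CRAN, so the call is skipped when it is unavailable.
if (requireNamespace("RevoScaleR", quietly = TRUE)) {
  library(RevoScaleR)

  # Local compute context (the default); a data frame is used as input here.
  # With SQL Server or Spark, inData would instead be a data source on that
  # platform and the compute context would be set to match.
  results <- rxExecBy(inData = iris, keys = c("Species"), func = count_rows)

  # Inspect what came back for each partition.
  str(results, max.level = 2)
} else {
  message("RevoScaleR not installed; skipping the rxExecBy example.")
}
```

The function applied to each partition is ordinary R code; only the data source and compute context decide where it runs.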

The rxExecBy function is included in Microsoft R Client (available free) and Microsoft R Server. For some examples of using rxExecBy, take a look at the Microsoft R Blog post linked below.

Microsoft R Blog: Running Pleasingly Parallel workloads using rxExecBy on Spark, SQL, Local and Localpar compute contexts
