Some things are easy to convert from a long-running sequential process to a system where each part runs at the same time, thus reducing the required time overall. We often call these “embarrassingly parallel” problems, but given how easy it is to reduce the time it takes to execute them by converting them into a parallel process, “pleasingly parallel” may well be a more appropriate name.
Using the foreach package (available on CRAN) is one simple way of speeding up pleasingly parallel problems using R. A foreach loop is much like a regular for loop in R, and by default will run each iteration in sequence (again, just like a for loop). But by registering a parallel “backend” for foreach, you can run many (or maybe even all) iterations at the same time, using multiple processors on the same machine, or even multiple machines in the cloud.
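As a rough illustration of registering a backend, here is a minimal sketch that assumes the doParallel package (also on CRAN) is installed and that two local cores are available. The choice of doParallel and the toy computation are assumptions for the example; any foreach backend could be registered in its place.

```r
library(foreach)
library(doParallel)

# Register a parallel backend with two local workers (assumed available).
registerDoParallel(cores = 2)

# %dopar% hands each iteration to the registered backend;
# swapping in %do% would run the same loop sequentially.
squares <- foreach(i = 1:4, .combine = c) %dopar% {
  i^2
}

# Release any implicitly created worker processes.
stopImplicitCluster()

print(squares)  # 1 4 9 16
```

Because foreach combines results in iteration order by default, the output is the same whether the loop runs sequentially or in parallel.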
For many applications, though, you need to provide a different chunk of data to each iteration to process. (For example, you may need to fit a statistical model within each country — each iteration will then only need the subset for one country.) You could just pass the entire data set into each iteration and subset it there, but that's inefficient and may even be impractical when dealing with very large datasets sitting in a remote repository. A better idea would be to leave the data where it is, and run R within the data repository, in parallel.
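To make the per-country pattern concrete, the sketch below splits a small in-memory data frame by country and fits one linear model per chunk, so each iteration sees only its own subset. The data, the column names (country, x, y) and the model are hypothetical and exist only for illustration. This shows the local foreach approach, not the in-database approach described next.

```r
library(foreach)

# Hypothetical toy data: two countries with different linear trends.
sales <- data.frame(
  country = rep(c("A", "B"), each = 5),
  x = rep(1:5, times = 2)
)
sales$y <- ifelse(sales$country == "A", 2 * sales$x + 1, -1 * sales$x + 3)

# split() yields one data frame per country; foreach iterates over that list,
# so each iteration receives only the rows for a single country.
chunks <- split(sales, sales$country)

slopes <- foreach(d = chunks, .combine = rbind) %do% {
  fit <- lm(y ~ x, data = d)
  data.frame(country = d$country[1], slope = coef(fit)[["x"]])
}

print(slopes)
```

With a parallel backend registered, replacing %do% with %dopar% would fit the per-country models concurrently. The data still has to be in local memory and shipped to the workers, which is the limitation the paragraph above points out for very large or remote datasets.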
Microsoft R 9.1 introduces a new function, rxExecBy, for exactly this purpose. When your data is sitting in SQL Server or Spark, you can specify a set of keys to partition the data by, and an R function (any R function, built-in or user-defined) to apply to the partitions. The data doesn't actually move: R runs directly on the data platform. You can also run it on local data in various formats