Parallel Computing Exercises: Snow and Rmpi (Part-3)

[This article was first published on R-exercises, and kindly contributed to R-bloggers.]

The foreach statement, which was introduced in the previous set of exercises of this series, can work with various parallel backends. This set provides practice in working with the backends provided by the snow and Rmpi packages (on a single machine with multiple CPUs). The name of the former package stands for “Simple Network of Workstations”. It can employ various parallelization techniques; socket clustering is used here. The latter is an R wrapper for MPI (the Message-Passing Interface), which is another parallelization technique.
The set also demonstrates that inter-process communication overhead has to be taken into account when preparing to use parallelization. If short tasks are run in parallel, the overhead can offset the performance gains from using multiple CPUs, and in some cases execution can even get slower. For parallelization to be useful, the tasks that are run in parallel have to be long enough.
The exercises are based on an example of using bootstrapping to estimate the sampling distribution of linear regression coefficients. The regression is run multiple times on different sets of data derived from an original sample. The size of each derived data set is equal to the size of the original sample, which is possible because the sets are produced by random sampling with replacement. The original sample is taken from the InstEval data set, which comes with the lme4 package, and represents lecture/instructor evaluations by students at the ETH. The estimated distribution is not analyzed in the exercises.
The exercises require the packages foreach, snow, doSNOW, Rmpi, and doMPI to be installed.
IMPORTANT NOTE: the Rmpi package depends on MPI software, which has to be installed on the machine separately. Depending on the operating system, the software can be one of the following:

  • Windows: either Microsoft MPI or the Open MPI library (the former can be installed as an ordinary application).
  • OS X/macOS: the Open MPI library (available through Homebrew).
  • Linux: the Open MPI library (look for packages named libopenmpi (or openmpi, lib64openmpi, or similar), as well as libopenmpi-dev (or libopenmpi-devel, or similar) in your distribution’s repository).
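Before starting, it may help to check which of the required R packages are already installed. The snippet below is a minimal sketch of one way to do that. It only looks at the installed R packages and does not detect whether the system-level MPI software is present.

```r
# Packages named in the text above
required <- c("foreach", "snow", "doSNOW", "Rmpi", "doMPI")

# TRUE for each package found in the current library paths
# (the packages are not loaded, so MPI is not initialized by this check)
installed <- required %in% rownames(installed.packages())
names(installed) <- required
print(installed)

# Names of the packages that still need to be installed
missing_pkgs <- required[!installed]
print(missing_pkgs)
```

Any packages listed in missing_pkgs could then be installed with install.packages(missing_pkgs), once the MPI software is in place for Rmpi.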

The zipped data set can be downloaded here. For other parts of the series follow the tag parallel computing.
Answers to the exercises are available here.

Exercise 1
Load the data set, and assign it to the data_large variable.

Exercise 2
Create a smaller data set that will be used to compare how parallel computing performance depends on the size of the task. Use the sample function to obtain a random subset from the loaded data. Its size has to be 10% of the size of the original dataset (in terms of rows). Assign the subset to the data_small variable.
For reproducibility, set the seed to 1234.
Print the number of rows in the data_large and data_small data sets.
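One possible way to draw the subset is sketched below. To keep the snippet runnable on its own, data_large is replaced here by a synthetic data frame with 10,000 rows; that stand-in and the helper names n_small and rows are assumptions for illustration, not part of the original data.

```r
# Hypothetical stand-in for the data set loaded in Exercise 1
data_large <- data.frame(y = rnorm(10000), x1 = rnorm(10000), x2 = rnorm(10000))

# Fix the seed so that the same subset is drawn on every run
set.seed(1234)

# 10% of the rows, sampled without replacement (the default of sample)
n_small <- round(0.1 * nrow(data_large))
rows <- sample(nrow(data_large), n_small)
data_small <- data_large[rows, ]

print(nrow(data_large))
print(nrow(data_small))
```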

Exercise 3
Write a function that will be used as a task in parallel computing. The function has to take a data set as an input, and do the following:

  1. Resample the data, i.e. obtain a new sample of data based on the input data set. The number of rows in the new sample has to be equal to the one in the input data set (use the sample function as in the previous exercise, but change parameters to allow for resampling with replacement).
  2. Run a linear regression on the resampled data. Use y as the dependent variable, and the others as independent variables (this can be done by using the formula y ~ . as an argument to the lm function).
  3. Return a vector of coefficients of the linear regression.
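The three steps above can be sketched as a single function. The name bootstrap_coefs and the toy data frame used to try it out are assumptions made for illustration; the real exercise uses the downloaded data set instead.

```r
# One bootstrap replication: resample rows, fit the regression, return coefficients
bootstrap_coefs <- function(data) {
  n <- nrow(data)
  # Row indices drawn with replacement, same size as the input
  idx <- sample(n, n, replace = TRUE)
  # Regress y on all other columns of the resampled data
  fit <- lm(y ~ ., data = data[idx, ])
  coef(fit)
}

# Hypothetical toy data, only to show the function running
set.seed(1234)
toy <- data.frame(y = rnorm(200), x1 = rnorm(200), x2 = rnorm(200))
coefs <- bootstrap_coefs(toy)
print(coefs)
```

Each call draws a different resample, so repeated calls return slightly different coefficient vectors; collecting many of them approximates the sampling distribution.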

Exercise 4
Let’s test how much time it takes to run the task multiple times sequentially (not in parallel). Use the foreach statement with the %do% operator (as discussed in the previous set of exercises of this series) to run it:

  • 10 times with the data_large data set, and
  • 100 times with the data_small data set.

Use the rbind function as an argument to foreach to combine the results.
In both cases, measure how much time is spent on execution of the task (with the system.time function). Theoretically, the length of time spent should be roughly the same because the total number of rows processed is equal (it is 100,000 rows: 10,000 rows 10 times in the first case, and 1,000 rows 100 times in the second case), and the row length is the same. But is this the case in practice?
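A sequential timing run might look like the sketch below. It assumes the foreach package is installed; the function bootstrap_coefs and the synthetic data frames are hypothetical stand-ins so that the snippet runs on its own, and the timings they produce will differ from those obtained with the real data.

```r
library(foreach)

# Hypothetical task function (see Exercise 3)
bootstrap_coefs <- function(data) {
  idx <- sample(nrow(data), nrow(data), replace = TRUE)
  coef(lm(y ~ ., data = data[idx, ]))
}

# Hypothetical stand-ins for the data sets from Exercises 1 and 2
set.seed(1234)
data_large <- data.frame(y = rnorm(10000), x1 = rnorm(10000), x2 = rnorm(10000))
data_small <- data_large[sample(nrow(data_large), 1000), ]

# 10 sequential runs on the large data set; rows of the result are coefficient vectors
time_large <- system.time(
  res_large <- foreach(i = 1:10, .combine = rbind) %do% bootstrap_coefs(data_large)
)

# 100 sequential runs on the small data set
time_small <- system.time(
  res_small <- foreach(i = 1:100, .combine = rbind) %do% bootstrap_coefs(data_small)
)

print(time_large["elapsed"])
print(time_small["elapsed"])
```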

Exercise 5
Now we’ll prepare to run the task in parallel using 2 CPU cores. First, we’ll use a parallel computing backend for the foreach statement from the snow package. This requires two steps:

  1. Make a cluster for parallel execution using the makeCluster function from the snow package. Pass two arguments to this function: the size of the cluster (i.e. the number of CPU cores that will be used in computations), and the type of the cluster ("SOCK" in this case).
  2. Register the cluster with the registerDoSNOW function from the doSNOW package (which provides a foreach parallel adapter for the 'snow' package).
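The two steps above can be sketched as follows, assuming the snow and doSNOW packages are installed. The variable name cl is an arbitrary choice, and makeCluster is called with the snow:: prefix so that a function of the same name from another attached package is not picked up by accident.

```r
library(doSNOW)  # also attaches foreach and snow

# Step 1: a socket cluster with 2 worker processes
cl <- snow::makeCluster(2, type = "SOCK")

# Step 2: make the cluster the backend used by %dopar%
registerDoSNOW(cl)

# Check which backend is registered and how many workers it has
print(getDoParName())
print(getDoParWorkers())

# The worker processes keep running until the cluster is stopped
# with stopCluster(cl), as described in Exercise 6.
```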

Exercise 6
Run the task 10 times with the large data set in parallel using the foreach statement with the %dopar% operator (as discussed in the previous set of exercises of this series). Measure the time spent on execution with the system.time function.
When done, use the stopCluster function from the snow package to stop the cluster.
Is the execution time shorter than the one measured in Exercise 4?
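A parallel run with the snow backend could be sketched as below. As before, bootstrap_coefs and the synthetic data_large are hypothetical stand-ins that make the snippet self-contained. The sketch relies on foreach exporting the variables referenced in the loop body to the workers.

```r
library(doSNOW)  # also attaches foreach and snow

# Hypothetical task function (see Exercise 3)
bootstrap_coefs <- function(data) {
  idx <- sample(nrow(data), nrow(data), replace = TRUE)
  coef(lm(y ~ ., data = data[idx, ]))
}

# Hypothetical stand-in for the large data set
data_large <- data.frame(y = rnorm(10000), x1 = rnorm(10000), x2 = rnorm(10000))

# Create and register a 2-worker socket cluster
cl <- snow::makeCluster(2, type = "SOCK")
registerDoSNOW(cl)

# 10 runs distributed over the workers; results are stacked row by row
time_par <- system.time(
  res_par <- foreach(i = 1:10, .combine = rbind) %dopar% bootstrap_coefs(data_large)
)

# Shut the worker processes down
snow::stopCluster(cl)

print(time_par["elapsed"])
print(dim(res_par))
```

With a stand-in data set this small, each task is short, so the parallel run may well be no faster than the sequential one; the comparison is only meaningful on the real data.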

Exercise 7
Repeat the steps listed in Exercise 5 and Exercise 6 to run the task 100 times using the small data set.
What is the change in the execution time?

Exercise 8
Next, we’ll use another parallel backend for the foreach statement: the one that is provided by the Rmpi package (R’s wrapper for the Message-Passing Interface), and accessible through an adapter from the doMPI package. From the user's perspective, it differs from the snow-based backend in the following ways:

  • as mentioned above, additional software has to be installed for this backend to work (either (a) the Open MPI library, available for Windows, macOS, and Linux, or (b) the Microsoft MPI library, which is available for Windows),
  • when an mpi cluster is created, it immediately starts using CPUs as much as it can,
  • when the work is complete, the mpi execution environment has to be terminated; if terminated, it can’t be relaunched without restarting the R session (if you try to create an mpi cluster after the environment was terminated, the session will be aborted, which may result in a loss of data; see Exercise 10 for more details).

In this exercise, we’ll create an mpi execution environment to run the task using 2 CPU cores. This requires actions similar to the ones performed in Exercise 5:

  1. Make a cluster for parallel execution using the startMPIcluster function from the doMPI package. This function can take just one argument, which is the number of CPU cores to be used in computations.
  2. Register the cluster with the registerDoMPI function from the doMPI package.

After creating a cluster, you may check whether the CPU usage on your machine increased using Resource Monitor (Windows), Activity Monitor (macOS), top or htop commands (Linux), or other tools.
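The two steps for the MPI backend can be sketched as follows. The snippet assumes that the MPI software, Rmpi, and doMPI are all installed and working on the machine; the variable name cl is an arbitrary choice.

```r
library(doMPI)  # also attaches foreach and Rmpi

# Step 1: spawn 2 MPI worker processes
cl <- startMPIcluster(count = 2)

# Step 2: make the MPI cluster the backend used by %dopar%
registerDoMPI(cl)

# Check how many workers the registered backend reports
print(getDoParWorkers())

# The workers stay active until closeCluster(cl) is called (see Exercise 9)
```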

Exercise 9
Stop the cluster created in the previous exercise with the closeCluster command from the doMPI package. The CPU usage should fall immediately.

Exercise 10
Create an mpi cluster again, and use it as a backend for the foreach statement to run the task defined above:

  • 10 times with the data_large data set, and
  • 100 times with the data_small data set.

In both cases, start a cluster before running the task, and stop it afterwards. Measure how much time is spent on execution of the task. How does the time compare to the execution time with the snow cluster (found in Exercises 6 and 7)?
When done working with the clusters, terminate the mpi execution environment with the mpi.finalize function. Note that this function always returns 1.
Important! As mentioned above, if you intend to create an mpi cluster again after the environment was terminated, you have to restart the R session; otherwise the current session will be aborted, which may result in a loss of data. In RStudio, an R session can be relaunched from the Session menu (relaunching the session this way does not affect the data; you’ll only need to reload libraries). In other cases, you may have to quit and restart R.
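The whole MPI workflow of this exercise might be sketched as below, assuming a working MPI installation together with the Rmpi and doMPI packages. The function bootstrap_coefs and the synthetic data frames are hypothetical stand-ins that make the snippet self-contained. Because it ends with mpi.finalize, it should be run only once per R session.

```r
library(doMPI)  # also attaches foreach and Rmpi

# Hypothetical task function (see Exercise 3)
bootstrap_coefs <- function(data) {
  idx <- sample(nrow(data), nrow(data), replace = TRUE)
  coef(lm(y ~ ., data = data[idx, ]))
}

# Hypothetical stand-ins for the data sets from Exercises 1 and 2
data_large <- data.frame(y = rnorm(10000), x1 = rnorm(10000), x2 = rnorm(10000))
data_small <- data_large[sample(nrow(data_large), 1000), ]

# First case: 10 runs on the large data set
cl <- startMPIcluster(count = 2)
registerDoMPI(cl)
time_mpi_large <- system.time(
  res_mpi_large <- foreach(i = 1:10, .combine = rbind) %dopar% bootstrap_coefs(data_large)
)
closeCluster(cl)

# Second case: 100 runs on the small data set, with a fresh cluster
cl <- startMPIcluster(count = 2)
registerDoMPI(cl)
time_mpi_small <- system.time(
  res_mpi_small <- foreach(i = 1:100, .combine = rbind) %dopar% bootstrap_coefs(data_small)
)
closeCluster(cl)

print(time_mpi_large["elapsed"])
print(time_mpi_small["elapsed"])

# Terminate the MPI execution environment; no new MPI cluster can be
# created in this R session after this call
Rmpi::mpi.finalize()
```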
