# Parallel Computing Exercises: Foreach and DoParallel (Part-2)

July 13, 2017

In general, `foreach` is a statement for iterating over items in a collection without using any explicit counter. In R, it is also a way to run code in parallel, which may be more convenient and readable than the `sfLapply` function (considered in the previous set of exercises of this series) or other `apply`-like functions.
Apart from being able to run code in parallel, R’s `foreach` differs from the standard `for` loop in several other ways. Specifically, the `foreach` statement:

• allows iterating over several variables simultaneously,
• returns a value (a list, a vector, a matrix, or another object),
• is able to skip some iterations based on a condition (the last two properties make it similar to list comprehensions, which are present in Python and some other languages),
• has a special syntax that includes the operators `%do%` (see an example in Exercise 1), `%dopar%`, and `%:%`.

The first six exercises in this set provide practice with basic operations of the `foreach` statement, and the last four show how to run it in parallel using multiple CPU cores on one machine. The task is to parallelize identical operations on a set of files (the zipped data files can be downloaded here). It is assumed that your computer has two or more CPU cores.
The exercises require the packages `foreach`, `doParallel`, and `parallel`. The first two packages have to be installed, and the last one comes with the standard R distribution. The packages `doParallel` and `parallel` are necessary to run `foreach` in parallel.
For other parts of the series follow the tag parallel computing.
Answers to the exercises are available here.

Exercise 1
The `foreach` function (from the package of the same name) is typically used as a part of a special statement. In its simple form, the statement looks like this:

`result <- foreach(i = 1:3) %do% sqrt(i)`

The statement above consists of three parts:

• `foreach(i = 1:3)` – a call to the `foreach` function, with an argument that includes an iteration variable (`i`) and a sequence to be iterated over (`1:3`),
• `%do%` – a special operator,
• `sqrt(i)` – an R expression, which represents an operation to be performed over the iteration variable (this part of the statement is equivalent to the body of the loop).

The code iterates over the sequence, applies an operation defined in the expression to each element of the sequence, and stores the output in the `result` variable.
Note that if the expression extends over several lines, it has to be enclosed in curly braces. The use of the iteration variable is not mandatory: if you just want to repeat the expression `n` times without passing anything to it, you can supply only a sequence of length `n` as input to `foreach`.
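The two remarks above can be illustrated with a minimal sketch. The variable names (`squares_plus_one`, `greetings`) and the expressions are made up for illustration and are not part of the exercise:

```r
library(foreach)

# Multi-line expression: the loop body is wrapped in curly braces,
# and the value of its last line is what each iteration returns.
squares_plus_one <- foreach(i = 1:3) %do% {
  squared <- i^2
  squared + 1
}

# No iteration variable: the unnamed sequence only sets the number
# of repetitions, and the expression does not refer to it.
greetings <- foreach(1:3) %do% "hello"

print(squares_plus_one)
print(greetings)
```

In the second statement the sequence `1:3` is passed without a name, so nothing is bound inside the expression; it simply runs three times.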
In this exercise:

1. Run the code above, print the `result` object, and find to which class it belongs.
2. Use the `foreach` function to invert the operation, i.e. write a line of code that takes the `result` object as input and outputs the original sequence. Print the sequence.

Exercise 2
The `foreach` function allows for the use of several iteration variables simultaneously. They are passed to the function as arguments, and are separated by commas.
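As a hedged illustration of the syntax (not a solution to the exercise), the sketch below iterates over two equal-length vectors at once. The variable names `a`, `b`, and `products` are arbitrary choices:

```r
library(foreach)

# Two iteration variables, separated by a comma; on each iteration
# `a` and `b` take the next element of their respective vectors.
# The parentheses are needed because %do% binds more tightly than `*`.
products <- foreach(a = 1:3, b = c(10, 20, 30)) %do% (a * b)

print(products)
```

Without the parentheses, R would evaluate `foreach(...) %do% a` first and then try to multiply the returned list by `b`.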
Run the `foreach` function with two iteration variables to get a sequence of their sums. The variables have to iterate over a vector of integers from 1 to 3, and a vector of 5 integers of value 10. Print the result.
(Tip: if you want to use an arithmetic operator to calculate the sum then the expression must be placed in parentheses or curly braces).
What is the length of the resulting object? How does the function deal with vectors of different lengths?

Exercise 3
The package `iterators` provides several functions that can be used to create sequences for the `foreach` function. For example, the `irnorm` function creates an object that iterates over vectors of normally distributed random numbers. It is useful when you need to use random variables drawn from one distribution in an expression that is run in parallel.
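A minimal sketch of how `irnorm` can feed `foreach` is shown below. It assumes that the first argument of `irnorm` is forwarded to `rnorm` as the vector length and that `count` sets the number of vectors; the sizes used here (4 values, 2 vectors) and the seed are arbitrary and differ from the exercise:

```r
library(foreach)
library(iterators)

set.seed(42)  # arbitrary seed, only to make the draw repeatable

# Each iteration receives one vector of 4 standard normal values;
# the iterator is exhausted after 2 vectors.
vector_means <- foreach(x = irnorm(4, count = 2)) %do% mean(x)

print(vector_means)
```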
In this exercise, use the `foreach` and `irnorm` functions to iterate over 3 vectors, each containing 5 random variables. Find the largest value in each vector, and print those largest values.
Before running the `foreach` function set the seed to 1234.

Exercise 4
By default, the `foreach` function returns a list, but it can also return objects of other types. This requires changing the value of the `.combine` parameter of the function. This exercise provides practice in using this parameter.
As in the previous exercise, use the `foreach` and `irnorm` functions to iterate over 3 vectors, each containing 5 random variables. But now use an expression that returns all variables generated by `irnorm`. Pass the `.combine` parameter to the `foreach` function with value `'c'`. Print the result, and find its class and length.
Then run the code again with the `'cbind'` value assigned to the `.combine` parameter. Print the result, find its class and size.
Note that `'c'` and `'cbind'` are R functions from the `base` package. Other functions (including user-written ones) can be used as well to combine the outputs of the expression.
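To illustrate the last point, here is a small sketch with a user-written combining function. The helper `paste_dash` is hypothetical; it was invented for this example and is not part of any package:

```r
library(foreach)

# Hypothetical user-written combiner: joins two strings with a dash.
# foreach applies it pairwise to the accumulated result and the next output.
paste_dash <- function(a, b) paste(a, b, sep = "-")

joined <- foreach(i = 1:4, .combine = paste_dash) %do% as.character(i)
print(joined)

# A base function works the same way: rbind stacks each output as a row.
stacked <- foreach(i = 1:3, .combine = rbind) %do% c(i, i^2)
print(dim(stacked))
```

With `%do%` the outputs are combined in iteration order, so `joined` is the single string `"1-2-3-4"`, and `stacked` has one row per iteration.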

Exercise 5
The results of the expression placed after the `%do%` operator can be combined in different ways. Look at the documentation for the `foreach` function to find what value has to be assigned to the `.combine` parameter to sum the values produced by the expression in each iteration.
Run the code used in the previous exercise with that value assigned to the `.combine` parameter, and print the result.
Before running the code set the seed to 1234.

Exercise 6
The sequence passed to the `foreach` function can be filtered so that the expression after `%do%` is applied only to a part of the sequence. This is done using a syntax like this:

`result <- foreach(i = some_sequence) %:% when(i > 0) %do% sqrt(i)`

Note that the `%:%` operator and the `when` function, which contains a Boolean expression involving the iteration variable, are added to a standard `foreach` statement.
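The filtering statement can be made runnable by supplying a concrete input. The values in `some_sequence` below are an assumption chosen only to show that non-positive elements are skipped:

```r
library(foreach)

# Assumed input: a mix of negative and positive numbers.
some_sequence <- c(-4, 9, -1, 16)

# when(i > 0) drops the negative elements before sqrt() is applied,
# so only 9 and 16 reach the expression.
result <- foreach(i = some_sequence) %:% when(i > 0) %do% sqrt(i)

print(result)
```

The returned list has two elements rather than four, because skipped iterations contribute nothing to the result.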
Modify the example above to get a vector of logs of all even integers in the range from 1 to 10. Print the result.

Exercise 7
Now let’s parallelize the execution of the `foreach` function. We’ll use it to read similarly named files, and perform identical calculations on data from each file.
As a first step, write a function to be run in parallel. The function takes an integer as input, and performs the following actions:

1. Create a string (character vector) with a file name by concatenating constant parts of the name (`test_data_`, `.csv`) with the integer (example of possible result when 1 is used as integer: `test_data_1.csv`).
2. Read the file with the obtained name from the current working directory into a data frame.
3. Calculate mean values for each column in the data frame.
4. Return a vector of those values.

Exercise 8
The second step is to create a backend for parallel execution:

1. Make a cluster for parallel execution using the `makeCluster` function from the `parallel` package; pass the size of the cluster (i.e. the number of CPU cores that you want to be used in computations) as an argument to this function.
2. Register the cluster with the `registerDoParallel` function from the `doParallel` package.

Note that by default the `makeCluster` function creates a `PSOCK` cluster, which is an enhanced version of the `SOCK` cluster implemented in the `snow` package. Accordingly, the `PSOCK` cluster is a pool of worker processes that exchange data with the master process via sockets. The `makeCluster` function can also create other types of clusters.
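The life cycle of such a backend can be sketched as follows. The cluster size of 2 is an assumption (use whatever suits your machine), and `type = "PSOCK"` is spelled out only to make the default explicit:

```r
library(parallel)
library(foreach)
library(doParallel)

# Start 2 worker processes that talk to the master via sockets.
# The size of 2 is an assumption for this sketch.
cl <- makeCluster(2, type = "PSOCK")

# Tell foreach to send %dopar% work to this cluster.
registerDoParallel(cl)

# Ask foreach how many workers the registered backend provides.
workers <- getDoParWorkers()
print(workers)

# Shut the workers down and fall back to sequential execution.
stopCluster(cl)
registerDoSEQ()
```

Calling `registerDoSEQ()` after `stopCluster` is optional; it only ensures that a later `%dopar%` does not try to use the stopped cluster.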

Exercise 9
The last step is to run the `foreach` function to read and analyze 10 test files (contained in this archive) using the function created in Exercise 7. Combine the outputs of that function using `rbind`. Run the statement twice:

1. with the `%do%` operator, which evaluates the expression sequentially, and
2. with the `%dopar%` operator, which evaluates the expression in parallel.

In both cases, measure the execution time using the `system.time` function. Print the result of the last run.
IMPORTANT: after completing parallel computations stop the cluster (created in Exercise 8) using the `stopCluster` function from the `parallel` package.
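The timing pattern can be tried without the data files. In the sketch below, `slow_task` is a hypothetical stand-in for the file-processing function (it just sleeps briefly), and the cluster size of 2 is an assumption:

```r
library(foreach)
library(doParallel)

# Hypothetical stand-in for real work: wait 0.2 s, then return the input.
slow_task <- function(i) {
  Sys.sleep(0.2)
  i
}

cl <- parallel::makeCluster(2)  # assumed cluster size
registerDoParallel(cl)

# Same statement, sequential and then parallel; system.time wraps each run.
time_seq <- system.time(
  res_seq <- foreach(i = 1:4, .combine = rbind) %do% slow_task(i)
)
time_par <- system.time(
  res_par <- foreach(i = 1:4, .combine = rbind) %dopar% slow_task(i)
)

parallel::stopCluster(cl)  # always release the workers when done

print(time_seq["elapsed"])
print(time_par["elapsed"])
```

Both runs should produce the same values; how much faster the parallel run is depends on the number of cores and on the overhead of communicating with the workers, so no particular speed-up is guaranteed.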

Exercise 10
Modify the code written in Exercise 7 and Exercise 9 to calculate the mean and the variance of the values contained in the first column of each file. The resulting object must be a two-column matrix with the first column holding the means and the second column holding the variances (the number of rows must be equal to the number of files).
Repeat the actions listed in Exercise 8 to prepare a cluster for parallel execution, then run the modified code in parallel.
Print the result.
Stop the cluster.
