R has a lot of tools to speed up computations by making use of multiple CPU cores, either on one computer or on multiple machines. This series of exercises introduces the basic techniques for implementing parallel computations using multiple CPU cores on one machine.
The initial step in preparation for parallelizing computations is to decide whether the task can and should be run in parallel. Some tasks involve sequential computation, where operations in one round depend on the results of the previous round. Such computations cannot be parallelized. The next question is whether parallel computation is worthwhile. On the one hand, running tasks in parallel may reduce the time spent on calculations. On the other hand, it takes additional time to write code that can be run in parallel, and to check that it yields correct results.
The code that implements parallel computations essentially does three things:

• splits the task into pieces,
• runs them in parallel, and
• combines the results.

This set of exercises provides practice in using the snowfall package to perform parallel computations. The set is based on the example of parallelizing the k-means algorithm, which partitions data points into groups (clusters) based on their similarity. The standard k-means algorithm is sensitive to the choice of initial points, so it is advisable to run the algorithm multiple times with different initial points to get the best result. It is assumed that your computer has two or more CPU cores.
For other parts of the series follow the tag parallel computing.
Answers to the exercises are available here.

Exercise 1
Use the detectCores function from the parallel package to find the number of physical CPU cores on your computer. Then change the arguments of the function to find the number of logical CPU cores.
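A minimal sketch of both calls, assuming the parallel package that ships with base R (note that on some systems the physical-core count cannot be determined and `NA` is returned):

```r
library(parallel)

# Number of physical CPU cores (may be NA on some platforms)
detectCores(logical = FALSE)

# Number of logical CPU cores (the default)
detectCores(logical = TRUE)
```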

Exercise 2
Load the data set, and assign it to the df variable.

Exercise 3
Use the system.time function to measure the time spent on execution of the command fit_30 <- kmeans(df, centers = 3, nstart = 30), which finds three clusters in the data.
Note that this command runs the kmeans function 30 times sequentially with different (randomly chosen) initial points, and then selects the ‘best’ way of clustering (the one that minimizes the sum of squared distances between each data point and its cluster center).
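The timing can be sketched as follows, assuming df is the data set loaded in Exercise 2:

```r
# Time 30 sequential runs of k-means with k = 3;
# kmeans keeps the run with the lowest tot.withinss
system.time(
  fit_30 <- kmeans(df, centers = 3, nstart = 30)
)
```

The `elapsed` component of the output is the wall-clock time, which is the figure to compare against the parallel version later on.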


Exercise 4
Now we’ll try to parallelize the runs of kmeans. The first step is to write the code that performs a single run of the kmeans function. The code has to do the following:

1. Randomly choose three rows in the data set (this can be done using the sample function).
2. Subset the data set keeping only the chosen rows (they will be used as initial points in the k-means algorithm).
3. Transform the obtained subset into a matrix.
4. Run the kmeans function using the original data set, the obtained matrix (as the centers argument), and without the nstart argument.
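The four steps above can be sketched as follows, again assuming df is the data set from Exercise 2:

```r
# 1. Randomly choose three row indices
rows <- sample(nrow(df), 3)

# 2.-3. Subset the data set and convert the subset to a matrix
init_centers <- as.matrix(df[rows, ])

# 4. Run kmeans with these initial centers and no nstart argument
fit <- kmeans(df, centers = init_centers)
```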

Exercise 5
The second step is to wrap the code written in the previous exercise into a function. It should take one argument, which is not used (see explanation on the solutions page), and should return the output of the kmeans function.
Such functions are often called wrappers, but they may have any name.
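A minimal sketch of such a wrapper, built from the code in Exercise 4 (the argument i is ignored; it is only there so the function can be applied to each element of a vector):

```r
wrapper <- function(i) {
  # A single run of k-means with randomly chosen initial centers;
  # df is assumed to exist in the workers' environment
  rows <- sample(nrow(df), 3)
  init_centers <- as.matrix(df[rows, ])
  kmeans(df, centers = init_centers)
}
```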

Exercise 6
Let’s prepare for parallel execution of the function:

1. Initialize a cluster for parallel computations using the sfInit function from the snowfall package. Set the parallel argument equal to TRUE. If your machine has two logical CPUs, set the cpus argument to two; if the number of logical CPUs exceeds two, set this argument to the number of logical CPUs on your machine minus one.
2. Make the data set available for parallel processes with the sfExport function.
3. Prepare the random number generation for parallel execution using the sfClusterSetupRNG function. Set the seed argument equal to 1234.

(Note that kmeans is a function from the base R packages. If you want to run a function from a downloaded package in parallel, you also have to make it available for parallel execution with the sfLibrary function.)
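The setup can be sketched as follows; the rule for choosing the number of CPUs follows the exercise (two on a two-CPU machine, cores minus one otherwise), and df is assumed to be the data set from Exercise 2:

```r
library(snowfall)

# Two CPUs on a two-CPU machine, otherwise logical cores minus one
n_cpus <- max(2, parallel::detectCores() - 1)

sfInit(parallel = TRUE, cpus = n_cpus)  # start the cluster
sfExport("df")                          # make df visible to the workers
sfClusterSetupRNG(seed = 1234)          # reproducible parallel RNG streams
```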

Exercise 7
Use the sfLapply function from the snowfall package to run the wrapper function (written in Exercise 5) 30 times in parallel, and store the output of sfLapply in the result variable. Also apply the system.time function to measure the time spent on execution of sfLapply.
Note that sfLapply is a parallel version of the lapply function. It takes two main arguments: (1) a vector or a list (in this case it should be a numeric vector of length 30), and (2) the function to be applied to each element of the vector or list.
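Assuming the cluster from Exercise 6 is running and wrapper is the function from Exercise 5, the parallel runs can be timed like this:

```r
# Run wrapper 30 times, distributing the runs across the workers
system.time(
  result <- sfLapply(1:30, wrapper)
)
```

Comparing the `elapsed` time with the one from Exercise 3 shows the speed-up (it is typically less than the number of CPUs because of communication overhead).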

Exercise 8
Stop the cluster for parallel execution with the sfStop function from the snowfall package.
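This is a single call:

```r
# Shut down the workers and release their resources
sfStop()
```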

Exercise 9
Explore the output of sfLapply (the result object):

1. Find out to what class it belongs.
2. Print its length.
3. Print the structure of its first element.
4. Find the value of the tot.withinss sub-element in the first element (it represents the total sum of squared distances between data points and their cluster centers in a given solution to the clustering problem). Print that value.
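The four steps can be sketched as follows, assuming result is the list returned by sfLapply in Exercise 7:

```r
class(result)            # 1. the class of the output (a list)
length(result)           # 2. its length (30, one element per run)
str(result[[1]])         # 3. the structure of the first kmeans fit
result[[1]]$tot.withinss # 4. total within-cluster sum of squares
```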

Exercise 10
Find an element of the result object with the lowest tot.withinss value (there may be multiple such elements), and assign it to the best_result variable.
Compare the tot.withinss value of that variable with the corresponding value of the fit_30 variable, which was obtained in Exercise 3.
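One way to do this, assuming result from Exercise 7 and fit_30 from Exercise 3: extract all tot.withinss values, pick an element with the smallest one, and compare it with the sequential fit.

```r
# Vector of tot.withinss values, one per parallel run
tws <- sapply(result, function(fit) fit$tot.withinss)

# which.min returns the index of the first minimum
best_result <- result[[which.min(tws)]]

# The two values should be close, often identical
best_result$tot.withinss
fit_30$tot.withinss
```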