# A very short and unoriginal introduction to snow

April 2, 2011

(This article was first published on Left Censored » R, and kindly contributed to R-bloggers)

As Jian-Feng rightly pointed out in a comment on my guide to setting up snow on the OSC cluster, it was probably somewhat cavalier of me to say:

Getting `snow` to run properly on single machines, or even with a cluster of machines via `ssh` connections, is fairly trivial.

In an effort to redeem myself, I provide this very short and unoriginal introduction to using `snow`. But first a caveat: to make the most of parallel processing in R, or any other environment, the problem you are trying to solve must be amenable to being broken up into smaller, (mostly) independent pieces. In other words, the results from one piece should not depend on the results from another. In statistics, this may or may not hold, depending on the problem at hand. Bootstrapping, a simple example of which I provide below, is one place where parallel processing can provide excellent returns. On the other hand, a typical maximum likelihood estimate using, for instance, a BFGS optimization routine would gain little from parallel processing, since step \(n+1\) depends on the results of step \(n\). (Unsurprisingly, things are a bit more complicated than this, and if you are really interested in learning about parallel processing, you may want to start by reading the Wikipedia entry.)
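To make the caveat concrete, here is a minimal sketch of the two situations in plain R, with no cluster involved. The toy computations and the helper `newton.sqrt` are hypothetical illustrations, not part of `snow`: the first piece consists of independent calls, the second is a chain of dependent steps.

```r
# Independent pieces: each call uses only its own input, so the five
# calls could run in any order, or at the same time on different workers.
squares <- sapply(1:5, function(i) i^2)

# Dependent pieces: a Newton iteration for sqrt(a). Step n + 1 needs the
# value produced by step n, so the steps cannot be farmed out in parallel.
# (newton.sqrt is a hypothetical helper, used here only for illustration.)
newton.sqrt <- function(a, steps = 20) {
  est <- a
  for (n in seq_len(steps)) {
    est <- 0.5 * (est + a / est)
  }
  est
}

root <- newton.sqrt(2)
```

The first pattern is the kind of work `snow` can spread across instances; the second is the kind that gains little from it.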

The worked example in the rest of this post demonstrates how to calculate bootstrapped sample means of a given vector in parallel across a cluster. First, load the `snow` and `rlecuyer` packages. Of course, `snow` is what provides the parallel processing, but `rlecuyer` is equally important as it guarantees the random numbers generated in each process are independent (`snow` also supports the `rsprng` package).

```
> library(snow)
> library(rlecuyer)
```

Now set up some sample data. Here I take 100 random draws, with replacement, from the integers in \([0,5]\).

```
> x <- sample(0:5, 100, replace = TRUE)
> mean(x)
[1] 2.64
```

Define a simple function to calculate a single bootstrapped mean from a given vector:

```
> bs.mean <- function(v) {
+   s <- sample(v, length(v), replace = TRUE)
+   mean(s)
+ }
```
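Before bringing in the cluster, it can help to see what the serial version looks like. The following is a minimal sketch using base R's `replicate`; the seed `42` and the choice of 1000 replicates are arbitrary values picked for illustration.

```r
# Same function as above: one bootstrapped mean from a vector.
bs.mean <- function(v) {
  s <- sample(v, length(v), replace = TRUE)
  mean(s)
}

set.seed(42)  # arbitrary seed, only so the run can be repeated
x <- sample(0:5, 100, replace = TRUE)

# Serial baseline: call bs.mean 1000 times, one after another.
boot <- replicate(1000, bs.mean(x))

# The spread of the bootstrapped means estimates the standard error
# of the sample mean.
boot.se <- sd(boot)
```

Each of the 1000 calls is independent of the others, which is exactly what makes this a good candidate for a cluster.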

Now it’s time to set up the cluster. Here I set up a SOCK-type connection, which can be used to set up multiple R instances on the local machine and/or to set up R instances on remote machines through `ssh` connections. `snow` offers other connection options that may be more convenient or necessary depending on your environment (for instance, MPI was needed on the OSC cluster).

`> cl <- makeCluster(c("localhost", "localhost"), type = "SOCK")`

Here, `c("localhost", "localhost")` tells snow where to set up the R instances, while `type = "SOCK"` is obviously the connection type. If I also wanted to run a single instance on a remote machine named `chuck`, I could specify `c("localhost", "localhost", "chuck")`. In this case, I would be prompted for my `ssh` password for `chuck`, though `snow` would take care of the rest once the connection was authenticated.
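As a small sketch of the same idea, the host vector can be built programmatically and the running instances can be asked where they live. The variable names `n.local` and `hosts` are my own; a remote name such as `chuck` could be appended to `hosts`, but it is left out here so the sketch runs on a single machine.

```r
library(snow)

# Build the host list: one entry per R instance to start.
n.local <- 2
hosts <- rep("localhost", n.local)
# hosts <- c(hosts, "chuck")  # hypothetical remote machine, reached via ssh

cl <- makeCluster(hosts, type = "SOCK")

# One node per entry in the host vector.
n.workers <- length(cl)

# Ask every instance which machine it is running on.
where <- clusterEvalQ(cl, Sys.info()[["nodename"]])

stopCluster(cl)
```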

Once the connections are set up, you will want to give each instance its own independent random number stream. Otherwise, the instances may end up drawing the same "random" values.

```
> clusterSetupRNG(cl)
[1] "RNGstream"
```

The return value, `RNGstream`, just tells you what type of RNG was set up. Finally, it’s time to do some work.
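One aside before the work starts: if you want the parallel draws to be repeatable from run to run, `clusterSetupRNG` also accepts a `seed` argument for the `RNGstream` type (a vector of six integers). The sketch below assumes a local two-instance cluster; the seed value `rep(2011, 6)` is arbitrary.

```r
library(snow)
library(rlecuyer)

cl <- makeCluster(c("localhost", "localhost"), type = "SOCK")

# Seed the streams explicitly, then draw three uniforms on each instance.
clusterSetupRNG(cl, type = "RNGstream", seed = rep(2011, 6))
first <- clusterCall(cl, runif, 3)

# Re-seeding with the same values is expected to restart the same streams,
# so this second set of draws should match the first.
clusterSetupRNG(cl, type = "RNGstream", seed = rep(2011, 6))
second <- clusterCall(cl, runif, 3)

stopCluster(cl)

# Compare the two runs, and compare the two instances with each other.
same.across.runs <- identical(first, second)
differ.across.nodes <- !identical(first[[1]], first[[2]])
```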

```
> clusterCall(cl, bs.mean, x)
[[1]]
[1] 2.81

[[2]]
[1] 2.61
```

`clusterCall` instructs all instances in `cl` to execute the function `bs.mean` on the vector `x`, both of which we defined above. The results are returned in a list with a length equal to the number of instances; e.g., had we included `chuck` in our call to `makeCluster`, `clusterCall` would have returned a list of three bootstrapped means. Because `bs.mean` doesn’t depend on anything calculated by the other processes, these bootstrapped means are calculated in parallel.
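One bootstrapped mean per instance is not much of a bootstrap. A minimal sketch of how the same pieces might be scaled up follows, using `snow`'s `clusterExport` and `parSapply` to spread 1000 replicates over the instances. The replicate count and the 95% interval are my own choices for illustration, and the code assumes it is run at the top level of a session, since `clusterExport` looks for the named objects in the global environment.

```r
library(snow)
library(rlecuyer)

bs.mean <- function(v) {
  s <- sample(v, length(v), replace = TRUE)
  mean(s)
}
x <- sample(0:5, 100, replace = TRUE)

cl <- makeCluster(c("localhost", "localhost"), type = "SOCK")
clusterSetupRNG(cl)

# Copy the function and the data to every instance once,
# rather than shipping them along with each task.
clusterExport(cl, c("bs.mean", "x"))

# Split the 1000 replicates across the instances; the index i is
# ignored and only serves to count the replicates.
boot <- parSapply(cl, 1:1000, function(i) bs.mean(x))

stopCluster(cl)

# Percentile interval for the mean from the pooled replicates.
ci <- quantile(boot, c(0.025, 0.975))
```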

When you are done with the cluster, you should always stop it. Otherwise, you may have to kill R instances by hand.

`> stopCluster(cl)`
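Since it is easy to forget this step when something fails halfway through, one option is to wrap the cluster's lifetime in a function and register `stopCluster` with base R's `on.exit`. This is only a sketch: `run.on.cluster` is a hypothetical helper name, not something `snow` provides.

```r
library(snow)

# Hypothetical helper: start a cluster, run one function on every
# instance, and shut the cluster down no matter how the call ends.
run.on.cluster <- function(hosts, fun, ...) {
  cl <- makeCluster(hosts, type = "SOCK")
  on.exit(stopCluster(cl))  # runs on normal return and on error
  clusterCall(cl, fun, ...)
}

# Each of the two instances evaluates 1 + 2 and returns the result.
res <- run.on.cluster(c("localhost", "localhost"),
                      function(a, b) a + b, 1, 2)
```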

As I said at the outset, this was just a very short and unoriginal introduction to parallel processing with `snow`. There are many other examples available online.
