snow and ssh — secure inter-machine parallelism with R

[This article was first published on Life in Code, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I just threw a post up on Revolutions, which got a lot longer than I planned. And got me thinking. And reading (see refs in previous post). And trying. Turns out that it was way easier than I thought!

The problem:

From the blog post:

OpenSSH is now available on all platforms. A sensible solution is to have *one* cluster type — ssh. Let ssh handle inter-computer connection and proper forwarding of ports, and for each machine to connect to a local port that’s been forwarded.

… It seems like most of the heavy lifting is done by foreach, and that engineering a simple makeSSHcluster and doSSH to forward ports, start slaves, and open socket connections should be tractable.

… the SSHcluster method would require minimal invasion on those machines — just ability to execute ssh and Rscript on the remote machines — not even login privileges are required!


Requirements:

R must be installed on both machines, as well as the snow package. On the machine running the primary R session (henceforth referred to as thishost, the foreach and doSNOW packages must be installed, as well as an ssh client. On remotehost, Rscript must be installed, an ssh server must be running, and the user must have permissions to run Rscript and read the R libraries. Note that Rscript is far from a secure program to allow untrusted users to run. For the paranoid, place it in a chroot jail…


Step-by-step:

The solution here piggy-backs on existing snow functionality. All commands below are to be run from thishost, either from commandline (shown by $) or the R session in which work is to be done (shown by \>). You will need to replace remotehost with the hostname or ip address of remotehost, but localhost should be entered verbatim.

Note that the ssh connections are much easier using key-based authentication (which you were using already, right?), and a key agent so you don’t have to keep typing your key password again and again. Even better, set up an .ssh/config file for host-specific aliases and configuration.

Now, onto the code!

$ ssh -f -R 10187:localhost:10187 remotehost sleep 600  ## 10 minutes to set up connection

> require(snow); require(foreach); require(doSNOW);
> cl = makeCluster(rep('localhost',3), type='SOCK', manual=T)  ## wait for 3 slaves

## Use ssh to start the slaves on remotehost and connect to the (local) forwarded port

$ for i in {1..3}; 
$  do ssh -f remotehost "/usr/lib64/R/bin/Rscript ~/R/x86_64-pc-linux-gnu-library/2.12/snow/RSOCKnode.R 
$       MASTER=localhost PORT=10187 OUT=/dev/null SNOWLIB=~/R/x86_64-pc-linux-gnu-library/2.12"; 
$ done

## R should have picked up the slaves and returned.  If not, something is wrong.
## Back up and test ssh connectivity, etc.

> registerDoSNOW(cl)

>  a <- matrix(1:4e6, nrow=4)
>  b <- t(a)
>  bb <-   foreach(b=iter(b, by='col'), .combine=cbind) %dopar%
>    (a %*% b)

Conclusions?

This is a funny example of %dopar%; it’s minimal and computationally useless, which is nice for testing. Still, it allows you to directly test the communication costs of your setup by varying the size of a. In explicit parallelism, gain is proportional to (computational cost)/(communication cost), so the above example is exactly what you don’t want to do.

Next up:
  • Write an R function that uses calls to system to set the tunnel up and spawn the slave processes, with string substitution for remotehost, etc.
  • Set up a second port and socketConnection to read each slave’s stdout to pick up errors. Perhaps call warning() on the output? Trouble-shooting is so much easier when you can see what’s happening!
  • Clean up the path encodings for SNOWLIB and RSOCKnode.R. R knows where it’s packages live, right?
  • As written, a little tricky to roll multiple hosts into the cluster (including thishost). It seems possible to chose port numbers sequentially within a range, and connect each new remotehost through a new port, with slaves on thishost connecting to a separate, non-forwarded port.
Those are low-hanging fruit. It would be nice to cut out the dependency on snow and doSNOW. The doSNOW package contains a single 1-line function. On the other hand, the snow package hasn’t been updated since July 2008, and a google search show up many confused users and few success stories (with the exception of this very helpful thread, which led me down this path in the first place).

+1 if you’d like to see a doSSH package.
+10 if you’d like to help!

To leave a comment for the author, please follow the link and comment on their blog: Life in Code.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)