StarCluster and R

August 31, 2013

StarCluster is a utility for creating and managing
distributed computing clusters hosted on Amazon's Elastic Compute
Cloud (EC2). It uses the EC2 web service to create and destroy
clusters of Linux virtual machines on demand.

StarCluster provides a convenient way to quickly set up a cluster of machines for running data-parallel jobs with a distributed-memory framework.

Install StarCluster using

$ sudo easy_install StarCluster

and then create a configuration file using

$ starcluster help

Add your AWS credentials to the config file and follow the instructions in the StarCluster Quick-Start guide.
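
For orientation, the credentials live in the config file at ~/.starcluster/config; the relevant sections look roughly like the sketch below. The section and key names follow the template StarCluster generates, but the values, the key name mykey, and the cluster settings are placeholders to replace with your own (check the generated template for the full set of options).

[global]
DEFAULT_TEMPLATE = smallcluster

[aws info]
AWS_ACCESS_KEY_ID = <your access key>
AWS_SECRET_ACCESS_KEY = <your secret key>
AWS_USER_ID = <your AWS user id>

[key mykey]
KEY_LOCATION = ~/.ssh/mykey.rsa

[cluster smallcluster]
KEYNAME = mykey
CLUSTER_SIZE = 10
NODE_INSTANCE_TYPE = m1.small

With the config in place, a cluster is launched (using the default template) and later shut down with:

$ starcluster start mycluster
$ starcluster terminate mycluster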

Once you have StarCluster up and running, you need to install R, plus any packages you require, on all the cluster nodes. I wrote a shell script to automate the process:

#!/bin/zsh

# Takes the name of the cluster as its only argument.
# Copies the two helper files onto the cluster's shared /home and then
# runs the setup script on the master and on every compute node.

starcluster put $1 starcluster.setup.zsh /home/starcluster.setup.zsh
starcluster put $1 Rpkgs.R /home/Rpkgs.R

# Total node count reported by StarCluster; the master counts as one node.
numNodes=`starcluster listclusters | grep "Total nodes" | cut -d' ' -f3`
nodes=($(seq -f node%03g 1 $(($numNodes - 1))))

# Install on the master ...
starcluster sshmaster $1 "source /home/starcluster.setup.zsh >& /home/install.log.master" &

# ... and on each compute node, hopping through the master. The trailing &
# lets all installations run in parallel.
for node in $nodes; do
    cmd="source /home/starcluster.setup.zsh >& /home/install.log.$node"
    starcluster sshmaster $1 "ssh $node $cmd" &
done
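
For example, if the driver script above is saved as install_R.zsh (the filename is my choice, nothing in StarCluster requires it) and the cluster is called mycluster, you would run:

$ chmod +x install_R.zsh
$ ./install_R.zsh mycluster

Progress can then be checked in the /home/install.log.* files that the setup writes on the cluster.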

The script takes the name of your cluster as a parameter and pushes the two helper files to the cluster. It then runs the installation on the master and every node. It assumes you are running an Ubuntu Server based StarCluster AMI, which is the default. The first helper script, starcluster.setup.zsh, installs the basic software required:

#!/bin/zsh

# Add the CRAN repository for Ubuntu 12.04 (precise) and its signing key.
echo "deb http://stat.ethz.ch/CRAN/bin/linux/ubuntu precise/" >> /etc/apt/sources.list
gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9
gpg -a --export E084DAB9 | sudo apt-key add -

# Install R from the CRAN repository.
apt-get update
apt-get install -y r-base r-base-dev
echo "DONE with Ubuntu package installation on $(hostname -s)."

# Install the R packages listed in Rpkgs.R.
R CMD BATCH --no-save /home/Rpkgs.R /home/install.Rpkgs.log
echo "DONE with R package installation on $(hostname -s)."

The second script, Rpkgs.R, is just an R script containing the packages you want installed:

install.packages(c("randomForest", "caret", "mboost", "plyr", "glmnet"),
                 repos = "http://cran.cnr.berkeley.edu")
print(paste0("DONE with R package installation on ",
             system("hostname -s", intern = TRUE), "."))

Once you have everything installed, you can ssh into your master node and start up R as usual:

$ starcluster sshmaster mycluster
$ R

Since StarCluster has already configured passwordless SSH between all the machines, you can use parLapply from the parallel package to run a task on your cluster without further configuration. Running a data-parallel task on a cluster with 10 compute nodes is now as easy as this (parLapply works just like lapply, except that it distributes the tasks over the cluster):

library("parallel")
cluster_names <- paste("node00", 1:9, sep="")
cluster_names <- c(cluster_names, "node010")
cluster <- makePSOCKcluster(names = cluster_names)
output <- parLapply(cluster, some_input, some_function)
stopCluster(cluster)
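
As a slightly more concrete sketch (the data, the chunking, and the model are made up here for illustration and are not part of the original workflow), you could fit an independent randomForest on each chunk of a data set, first loading the package on every worker with clusterEvalQ:

library("parallel")

# Split a data set into 10 chunks, one per compute node (illustrative only).
chunks <- split(iris, rep(1:10, length.out = nrow(iris)))

cluster <- makePSOCKcluster(sprintf("node%03d", 1:10))
clusterEvalQ(cluster, library("randomForest"))  # load the package on every node
fits <- parLapply(cluster, chunks,
                  function(d) randomForest(Species ~ ., data = d))
stopCluster(cluster)

clusterEvalQ evaluates an expression on every worker, which is how the packages installed by Rpkgs.R get loaded on the nodes before parLapply distributes the work.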

Now you can watch 10 machines working for you. Like!
