StarCluster and R

[This article was first published on Category: R | Everything Counts, and kindly contributed to R-bloggers.]

StarCluster is a utility for creating and managing distributed computing clusters hosted on Amazon's Elastic Compute Cloud (EC2). It uses the EC2 web service to create and destroy clusters of Linux virtual machines on demand.

StarCluster provides a convenient way to quickly set up a cluster of machines and run data-parallel jobs on it using a distributed-memory framework.

Install StarCluster using

$ sudo easy_install StarCluster

and then create a configuration file using

$ starcluster help

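When prompted, let StarCluster write the template config file (by default ~/.starcluster/config). Add your AWS credentials to its [aws info] section and follow the instructions in the StarCluster Quick-Start guide.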

Once you have StarCluster up and running, you need to install R on all the cluster nodes and any packages you require. I wrote a shell script to automate the process:


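# Usage: pass the cluster name as the first argument.
starcluster put $1 starcluster.setup.zsh /home/starcluster.setup.zsh
starcluster put $1 Rpkgs.R /home/Rpkgs.R

# The total node count includes the master, so the workers are node001 .. node00(N-1).
numNodes=`starcluster listclusters | grep "Total nodes" | cut -d' ' -f3`
nodes=(`eval echo $(seq -f node%03g 1 $(($numNodes-1)))`)
# Assumption: the master is reachable under the host name "master", so it gets R too.
nodes=(master $nodes)

# Start the setup on every machine in the background, then wait for all of them.
for node in $nodes; do
    cmd="source /home/starcluster.setup.zsh >& /home/install.log.$node"
    starcluster sshmaster $1 "ssh $node $cmd" &
done
wait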
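
The worker names in the script above come from seq with a printf-style format. Here is a minimal sketch of that step in isolation, assuming (as the script does) that the total reported by StarCluster counts the master. The function name worker_names is hypothetical:

```shell
# Print StarCluster-style worker host names (node001, node002, ...).
# $1 is the total node count, assumed to include the master.
worker_names() {
    total=$1
    # With only a master there are no workers; this also keeps seq from counting down.
    if [ "$total" -le 1 ]; then
        return 0
    fi
    seq -f 'node%03g' 1 $((total - 1))
}
```

For example, `worker_names 4` prints node001, node002 and node003, one per line.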

The script takes the name of your cluster as its only parameter and pushes the two helper files to the cluster. It then runs the installation on the master and every node in parallel. It assumes you are running an Ubuntu Server-based StarCluster AMI, which is the default. The first helper script, starcluster.setup.zsh, installs the basic software required:


echo "deb precise/" >> /etc/apt/sources.list
gpg --keyserver --recv-key E298A3A825C0D65DFD57CBB651716619E084DAB9
gpg -a --export E298A3A825C0D65DFD57CBB651716619E084DAB9 | sudo apt-key add -
apt-get update
apt-get install -y r-base r-base-dev
echo "DONE with Ubuntu package installation on $(hostname -s)."
R CMD BATCH --no-save /home/Rpkgs.R /home/install.Rpkgs.log
echo "DONE with R package installation on $(hostname -s)."

The second script, Rpkgs.R, is just an R script that installs the packages you want:

install.packages(c("randomForest", "caret", "mboost", "plyr", "glmnet"),
 repos = "")
print(paste0("DONE with R package installation on ", system("hostname -s", intern = TRUE), "."))
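
Because the driver script starts every install in the background, it helps to confirm that each one finished before using the cluster. Here is a minimal sketch, assuming the log files are named install.log.<node> as in the driver script and end with the DONE line echoed by starcluster.setup.zsh. The function name unfinished_nodes is hypothetical:

```shell
# List the nodes whose install log lacks the final DONE message.
# $1 is the directory holding the install.log.<node> files.
unfinished_nodes() {
    log_dir=$1
    for log in "$log_dir"/install.log.*; do
        # Skip the unexpanded pattern when no log files exist yet.
        [ -e "$log" ] || continue
        if ! grep -q "DONE with R package installation" "$log"; then
            # Strip the path and the install.log. prefix, leaving the node name.
            echo "${log##*/install.log.}"
        fi
    done
}
```

Run on the master, `unfinished_nodes /home` prints the nodes whose log does not yet contain the final message. If log files exist and nothing is printed, every one of them contains the final message.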

Once you have everything installed, you can ssh into your master node and start up R as usual:

$ starcluster sshmaster mycluster
$ R

Since StarCluster has set up all the networking nicely, you can use parLapply from the parallel package to run a task on your cluster without further configuration. parLapply works just like lapply, except that it distributes the tasks over the cluster. Running a data-parallel task on a cluster with 10 nodes is now as easy as this:

library(parallel)

# Worker host names as assigned by StarCluster: node001 ... node010
cluster_names <- sprintf("node%03d", 1:10)
cluster <- makePSOCKcluster(names = cluster_names)
output <- parLapply(cluster, some_input, some_function)
stopCluster(cluster)  # shut the workers down when finished

Now you can watch 10 machines working for you. Like!
