StarCluster and R

August 31, 2013

(This article was first published on Category: R | Everything Counts, and kindly contributed to R-bloggers)

StarCluster is a utility for creating and managing
distributed computing clusters hosted on Amazon’s Elastic Compute
Cloud (EC2). StarCluster utilizes Amazon´s EC2 web service to create
and destroy clusters of Linux virtual machines on demand.

StarCluster provides a convenient way to quickly set up a cluster of
machines to run some data parallel jobs using a distributed
memory framework.

Install StarCluster using

$ sudo easy_install StarCluster

and then create a configuration file using

$ starcluster help

Add your AWS credentials to the config file and follow the
instructions at the StarCluster quick-Start guide.

Once you have StarCluster up and running, you need to install R on all the cluster nodes
and any packages you require. I wrote a shell script to automate the process:


starcluster put $1 starcluster.setup.zsh /home/starcluster.setup.zsh
starcluster put $1 Rpkgs.R /home/Rpkgs.R

numNodes=`starcluster listclusters | grep "Total nodes" | cut -d' ' -f3`
nodes=(`eval echo $(seq -f node%03g 1 $(($numNodes-1)))`)

for node in $nodes; do
    cmd="source /home/starcluster.setup.zsh >& /home/install.log.$node"
    starcluster sshmaster $1 "ssh $node $cmd" &

The script takes the name of your cluster as a parameter and pushes
the two helper files to the cluster. It then runs the installation on
the master and every node. It assumes you are running an Ubuntu Server
based StarCluster AMI, which is the default. The first helper
script, starcluster.setup.zsh, installs the basic software


echo "deb precise/" >> /etc/apt/sources.list
gpg --keyserver --recv-key E084DAB9
gpg -a --export E084DAB9 | sudo apt-key add -
apt-get update
apt-get install -y r-base r-base-dev
echo “DONE with Ubuntu package installation on $(hostname -s).”
R CMD BATCH --no-save /home/Rpkgs.R /home/install.Rpkgs.log
echo “DONE with R package installation on $(hostname -s).

The second script, Rpkgs.R, is just a R script containing
the packages you want installed:

install.packages(c("randomForest", "caret", "mboost", "plyr", "glmnet"),
 repos = "")
print(paste("DONE with R package installation on ", system("hostname -s", intern = TRUE), "."))

Once you have everything installed, you can ssh into your master node and start up R as usual:

$ starcluster sshmaster mycluster
$ R

Since StarCluster has set up all the networking nicely, you can use
parLapply from the parallel package to run a task on your cluster
without further configuration. Running a data parallel task on a
cluster with 10 nodes is now as easy as this (parLapply is just like
lapply, except it distributes the tasks over the cluster):

cluster_names <- paste("node00", 1:9, sep="")
cluster_names <- c(cluster_names, "node010")
cluster <- makePSOCKcluster(names = cluster_names)
output <- parLapply(cluster, some_input, some_function)

Now you can watch 10 machines working for you. Like!

To leave a comment for the author, please follow the link and comment on their blog: Category: R | Everything Counts. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.


Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)