Run R in parallel on a Hadoop cluster with AWS in 15 minutes

January 10, 2011

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

If you're looking to apply massively parallel resources to an R problem, the most time-consuming part might not be the computations themselves, but setting up the cluster in the first place. You can use Amazon Web Services to set up the cluster in the cloud, but even that takes some time, especially if you haven't done it before.

Jeffrey Breen created his first AWS cluster this weekend, and in just 15 minutes had demonstrated how to use 5 nodes to generate and analyze a billion simulations in R. It was a toy example, sure (estimating pi), but it shows how quickly you can set up a parallel computing environment using R. Jeffrey used JD Long's segue package, which works with the Hadoop Streaming service on AWS. The segue package is still experimental, but even so, this is a great demonstration of applying cloud-based hardware to parallel problems in R.
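To make the toy problem concrete, here is a minimal local sketch of a Monte Carlo estimate of pi in base R. It is not Jeffrey's code: the function name estimate_pi, the seeds, and the sample sizes are all assumptions chosen for illustration, and it runs on one machine with lapply rather than on a cluster. The point is the shape of the work: many independent tasks whose results are averaged at the end, which is the pattern an lapply-style cluster function (segue's emrlapply, as I understand it) is meant to spread across nodes.

```r
# Monte Carlo estimate of pi: draw points uniformly in the unit square and
# count the fraction that fall inside the quarter circle of radius 1.
# That fraction approximates pi/4.
# NOTE: estimate_pi and its arguments are hypothetical names for this sketch.
estimate_pi <- function(seed, n = 100000) {
  set.seed(seed)              # a distinct seed per task keeps the draws independent
  x <- runif(n)
  y <- runif(n)
  4 * mean(x^2 + y^2 <= 1)    # scale the hit fraction up to an estimate of pi
}

# Ten independent tasks, run serially here with lapply.
# On a cluster, this lapply call is the piece that would be distributed.
seeds <- 1:10
estimates <- lapply(seeds, estimate_pi)

# Combine the per-task estimates into one overall estimate.
pi_hat <- mean(unlist(estimates))
print(pi_hat)
```

Because each task depends only on its own seed, the tasks can run in any order on any node, which is what makes this kind of simulation such an easy fit for a Hadoop-style cluster.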

Jeffrey Breen: Abusing Amazon’s Elastic MapReduce Hadoop service… easily, from R
