Run R in parallel on a Hadoop cluster with AWS in 15 minutes


If you're looking to apply massively parallel resources to an R problem, the most time-consuming part might not be the computations themselves, but setting up the cluster in the first place. You can use Amazon Web Services to set up the cluster in the cloud, but even that takes some time, especially if you haven't done it before.

Jeffrey Breen created his first AWS cluster this weekend, and in just 15 minutes demonstrated how to use 5 nodes to generate and analyze a billion simulations in R. It was a toy example, sure — estimating pi — but it shows how quickly you can set up a parallel computing environment with R. Jeffrey used JD Long's segue package, which works with the Hadoop Streaming service on AWS. The segue package is still experimental, but this is a great demonstration of applying cloud-based hardware to parallel problems in R.
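To give a flavor of the workflow, here is a minimal sketch using segue's createCluster() / emrlapply() / stopCluster() interface. The credentials, instance count, and the Monte Carlo function below are illustrative placeholders, not Jeffrey's exact code:

library(segue)

# AWS credentials are placeholders -- substitute your own.
setCredentials("YOUR_AWS_ACCESS_KEY", "YOUR_AWS_SECRET_KEY")

# Spin up a 5-node Elastic MapReduce cluster (this step takes a few minutes).
myCluster <- createCluster(numInstances = 5)

# Monte Carlo estimate of pi: each task draws points in the unit square
# and counts how many fall inside the quarter circle.
estimatePi <- function(seed) {
  set.seed(seed)
  n <- 1e6   # draws per task; 1,000 tasks x 1e6 draws = a billion simulations
  x <- runif(n)
  y <- runif(n)
  sum(x^2 + y^2 <= 1)
}

# emrlapply() works like lapply(), but farms each list element out to the
# cluster via Hadoop Streaming.
hits <- emrlapply(myCluster, as.list(1:1000), estimatePi)

# Combine the per-task counts into the final estimate.
piEstimate <- 4 * sum(unlist(hits)) / (1000 * 1e6)
print(piEstimate)

# Shut the cluster down so you stop paying for it.
stopCluster(myCluster)

The appeal of this pattern is that emrlapply() is a drop-in replacement for lapply(), so an embarrassingly parallel job needs almost no restructuring to run on the cluster.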

Jeffrey Breen: Abusing Amazon’s Elastic MapReduce Hadoop service… easily, from R
