Getting Started with R and Hadoop

Posted on August 20, 2012 by David Smith in R bloggers

[This article was first published on Revolutions, and kindly contributed to R-bloggers].


Last week's meeting of the Chicago area Hadoop User Group (a joint meeting with the Chicago R User Group, sponsored by Revolution Analytics) focused on crunching Big Data with R and Hadoop. Jeffrey Breen, president of Atmosphere Research Group, frequently deals with large data sets in his airline consulting work, and R is his “go-to tool for anything data-related”. His presentation, “Getting Started with R and Hadoop”, focuses on the RHadoop suite of packages, and especially on the rmr package, which interfaces R with Hadoop. He lists four advantages of using rmr for big-data analytics with R and Hadoop:

Well-designed API: code only needs to deal with basic R objects
Very flexible I/O subsystem: handles common formats like CSV, and also allows complex line-by-line parsing
Map-Reduce jobs can easily be daisy-chained to build complex workflows
Concise code compared to other ways of interfacing R and Hadoop (the chart below compares the number of lines of code required to implement a map-reduce analysis using different systems)
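
To make the daisy-chaining point concrete, here is a minimal in-memory sketch in plain Python. It is not the rmr API: the `run_mapreduce` helper, the mapper and reducer names, and the toy flight records are all hypothetical, chosen only to show how the output of one map-reduce job can be fed straight into the next.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Run one in-memory map-reduce pass over (key, value) records.
    Hypothetical helper for illustration; not part of rmr or Hadoop."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)  # shuffle: group values by key
    results = []
    for out_key in sorted(groups):
        results.extend(reducer(out_key, groups[out_key]))
    return results

# Job 1: count flights per (origin, dest) route.
def route_mapper(_, flight):
    yield (flight["origin"], flight["dest"]), 1

def sum_reducer(route, counts):
    yield route, sum(counts)

# Job 2: from the route counts, pick the busiest destination per origin.
def busiest_mapper(route, count):
    origin, dest = route
    yield origin, (count, dest)

def busiest_reducer(origin, candidates):
    count, dest = max(candidates)
    yield origin, (dest, count)

# Made-up toy records, for illustration only.
flights = [
    {"origin": "ORD", "dest": "LGA"},
    {"origin": "ORD", "dest": "LGA"},
    {"origin": "ORD", "dest": "SFO"},
    {"origin": "MDW", "dest": "DEN"},
]

# Chain the jobs: the output of job 1 is the input of job 2.
route_counts = run_mapreduce([(None, f) for f in flights], route_mapper, sum_reducer)
busiest = run_mapreduce(route_counts, busiest_mapper, busiest_reducer)
print(busiest)
```

Because each job consumes and produces plain key-value pairs, longer workflows are built by passing one job's result to the next in the same way.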

For newcomers to map-reduce programming with R and Hadoop, Jeffrey's presentation includes a step-by-step example of computing flight times from air traffic data. The last few slides cover some advanced features: how to work directly with files in HDFS from R with the rhdfs package, and how to simulate a Hadoop cluster on the local machine (useful for development, testing and learning RHadoop). Jeffrey also mentions that the RHadoop tutorial is a good resource for new users.
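
As a rough illustration of what a flight-time computation looks like in map-reduce terms, the sketch below parses CSV rows one at a time, maps each to a (carrier, minutes) pair, and reduces each carrier's values to a mean. It runs entirely in one local process, in the spirit of testing on a single machine before touching a cluster. It is plain Python rather than rmr code, and it does not reproduce Jeffrey's example: the column names, the `map_flight` and `reduce_mean` functions, and the sample rows are assumptions made up for this sketch.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows for illustration only; the column names are hypothetical.
SAMPLE_CSV = """carrier,origin,dest,air_time_minutes
AA,ORD,LGA,110
AA,ORD,LGA,130
UA,ORD,SFO,250
UA,ORD,SFO,NA
"""

def map_flight(row):
    """Map step: emit (carrier, air time) for rows with a usable value."""
    try:
        minutes = float(row["air_time_minutes"])
    except ValueError:
        return  # skip rows whose air time is missing, such as "NA"
    yield row["carrier"], minutes

def reduce_mean(carrier, minutes):
    """Reduce step: collapse one carrier's air times to their mean."""
    return carrier, sum(minutes) / len(minutes)

def mean_air_time(csv_text):
    """Parse the CSV row by row, group mapped values by key, then reduce."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        for carrier, minutes in map_flight(row):
            grouped[carrier].append(minutes)
    return dict(reduce_mean(c, m) for c, m in grouped.items())

result = mean_air_time(SAMPLE_CSV)
print(result)
```

Keeping the map and reduce steps as small, separate functions means each can be checked on a handful of rows before the same logic is run against a full data set.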

You can find Jeffrey's slides embedded below, and a video of the presentation is also available. You might also want to check out Jeffrey's older presentation Big Data Step-by-Step for tips on setting up a compute environment with Hadoop and R.

Running R on Hadoop – CHUG – 20120815 from Chicago Hadoop Users Group
