Using R for Map-Reduce applications in Hadoop

[This article was first published on Revolutions, and kindly contributed to R-bloggers.]

Data Scientist Antonio Piccolboni recently published this comparison of the various languages and interfaces available for programming Big Data analysis tasks in the map-reduce framework. The interfaces he reviewed included:

  • Java Hadoop (mature and efficient, but verbose and difficult to program)
  • Cascading (brings an SQL-like flavor to Java programming with Hadoop)
  • Pipes/C++ (a C++ interface to programming on Hadoop)
  • Hive (a high-level SQL-like language for Hadoop, concise and expressive but limited in flexibility)
  • Pig (a new high-level language for Hadoop)
  • Rhipe (an R package for map-reduce programming with Hadoop)
  • Dumbo (a Hadoop library for Python)
  • Cascalog (a powerful but obtuse Lisp-based interface to Hadoop)

In the conclusion of the review, Antonio zeroes in on Rhipe's R-based interface as “closest to what he was looking for”:

… For a general purpose, moderately elegant, not necessarily most efficient, not necessarily mature language for exploration purposes, Rhipe seems to fit the bill pretty nicely. First, it is just a library, which means that one can continue to use the tools he’s familiar with. I found it particularly useful to run map-reduce jobs in the interpreter, inspecting the inputs and outputs of each, an invaluable debugging help — but no, you can not step into a mapper or reducer, I use counters instead to trace what’s going on in there. I also like that one can read and write sequence files with one call, to examine the output of previous jobs and decide what to do next. Additionally since R is a statistical language and Hadoop is the tool of choice for big data analytics, this seems like a natural fit.

Antonio has also written several in-depth blog posts about Rhipe, including examples of doing relational joins within the Hadoop framework, and on graph analysis in Hadoop (useful for social-network applications).

Dataspora Blog: Pigs, Bees, and Elephants: A Comparison of Eight MapReduce Languages

