Using R for Map-Reduce applications in Hadoop

May 4, 2011
By

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Data Scientist Antonio Piccolboni recently published this comparison of the various language and interfaces available for programming Big Data analysis tasks in the map-reduce framework. The interfaces he reviewed included:

  • Java Hadoop (mature and efficient, but verbose and difficult to program)
  • Cascading (brings an SQL-like flavor to Java programming with Hadoop)
  • Pipes/C++ (a C++ interface to programming on Hadoop)
  • Hive (a high-level SQL-like language for Hadoop, concise and expressive but limited in flexibility)
  • Pig (a new high-level langauge for Hadoop)
  • Rhipe (an R package for map-reduce programming with Hadoop)
  • Dumbo (a Hadoop library for python)
  • Cascalog (a powerful but obtuse lisp-based interface to Hadoop)

In the conclusion of the review, Antonio zeroes in on the Rhipe's R-based interface as "closest to what he was looking for":

... For a general purpose, moderately elegant, not necessarily most efficient, not necessarily mature language for exploration purposes, Rhipe seems to fit the bill pretty nicely. First, it is just a library, which means that one can continue to use the tools he’s familiar with. I found it particularly useful to run map-reduce jobs in the interpreter, inspecting the inputs and outputs of each, an invaluable debugging help — but no, you can not step into a mapper or reducer, I use counters instead to trace what’s going on in there. I also like that one can read and write sequence files with one call, to examine the output of previous jobs and decide what to do next. Additionally since R is a statistical language and Hadoop is the tool of choice for big data analytics, this seems like a natural fit.

Antonio has also written several in-depth blog posts about Rhipe, including examples of doing relational joins within the Hadoop framework, and on graph analysis in Hadoop (useful for social-network applications).

Dataspora Blog: Pigs, Bees, and Elephants: A Comparison of Eight MapReduce Languages

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Tags: , ,

Comments are closed.