How can R and Hadoop be used together?

November 26, 2013

(This article was first published on Pingax » R, and kindly contributed to R-bloggers)

Inspired by this Quora question, I started looking into how R and Hadoop can be integrated and used together. After a thorough verification process, I finally found the possible ways to use R and Hadoop together for Big Data analytics.


This blog post is written to help data scientists, data engineers and data analysts who want a solution for running machine learning applications on larger datasets. So, I would like to suggest some refined ways to make that possible. I assume here that you are interested in running machine learning algorithms (Coursera – join the well-known online course by Professor Andrew Ng) over a large dataset that runs into memory limits on a single machine.

With these integrations, R users are not required to learn a new language, e.g., Java, or a new environment, e.g., cluster software and hardware, to work with Hadoop. Moreover, functionality from open source R packages can be used when writing mapper and reducer functions.

Since the popularity of the combined R and Hadoop platform keeps increasing, I think Big Data analytics on it can become an emerging trend. With the help of this parallel data analytics platform, large organizations can more easily derive insights from their data and gain a growing advantage from Big Data analytics.

Let’s look at an outline of the ways R and Hadoop can be integrated to scale data analytics to Big Data analytics. They are given below:
1. RHadoop
2. RHIPE
3. ORCH
4. HadoopStreaming (R package)
5. Hadoop Streaming (Hadoop Streaming utility)

Now let’s discuss real-world test cases with popular Hadoop tools. To explain how this is possible, I am going to use various R and Hadoop tools. Let’s first go through a list of useful software that can be used.

We need the following data-driven tools, grouped by technology:

  1. Linux-based operating system
    1. Ubuntu – Fast, secure and stylishly simple, the Ubuntu operating system is used by 20 million people worldwide every day.
    2. CentOS – An enterprise-class Linux distribution derived from sources freely provided to the public by a prominent North American enterprise Linux vendor.
    3. Redhat
  2. R
    1. R – The R programming language, for working with machine learning concepts
    2. RStudio – A well-known IDE for R
  3. Hadoop
    1. Hadoop – An open source Big Data solution. Since it is a little hard to install Hadoop with all its components, I would suggest trying a classic Hadoop distribution provided by Hortonworks, Cloudera, MapR or Amazon EMR.

There are five possible ways to use R and Hadoop together. Let’s look at each R and Hadoop integration:

  1. RHadoop – RHadoop is a great open source solution for R and Hadoop provided by Revolution Analytics. RHadoop is bundled with four main R packages to manage and analyze data with the Hadoop framework.
  2. RHIPE – RHIPE is the R and Hadoop Integrated Programming Environment, designed around Divide and Recombine (D&R) techniques for analyzing large datasets.
  3. ORCH – ORCH is the Oracle R Connector for Hadoop. ORCH can be used on the Oracle Big Data Appliance or on non-Oracle Hadoop clusters.
  4. HadoopStreaming – HadoopStreaming is an R package, available on CRAN, that provides utilities for writing Hadoop Streaming jobs as R scripts. It was developed by David S. Rosenberg with the aim of making Hadoop Streaming as easy as possible for R users.
  5. Hadoop Streaming – Hadoop Streaming is a Hadoop utility that allows users to develop and run MapReduce programs in languages other than Java.
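
To make the last item more concrete: Hadoop Streaming passes records to the mapper and reducer as lines of text on standard input and expects tab-separated key/value lines on standard output, so a script in any language can take part. The following is only a minimal sketch of that contract, written in Python and run in a single process rather than on a cluster. The function names (`mapper`, `reducer`, `run_locally`) and the word-count task are illustrative assumptions, not part of Hadoop or of any package listed above; an R script would follow the same shape.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.lower().split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts for each key; input must already be sorted by key,
    which is what Hadoop guarantees for the reducer's input."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_locally(lines):
    """Hypothetical helper: simulate map -> sort (shuffle) -> reduce locally."""
    return list(reducer(sorted(mapper(lines))))


# Small in-memory example instead of a real HDFS input
for out in run_locally(["big data big", "Data"]):
    print(out)
```

On a real cluster the mapper and the reducer would typically be two separate scripts that read from standard input and print to standard output, handed to the streaming utility through its `-mapper` and `-reducer` options. The `sorted` call here stands in for the shuffle and sort that Hadoop performs between the two phases.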

In my next blog posts, I will write about how machine learning can be performed on the Big Data platform of R and Hadoop. If you want me to write about particular tools and technologies that can be used for this, let me know.


The post How can R and Hadoop be used together? appeared first on Pingax.
