Step-by-Step Guide to Setting Up an R-Hadoop System

May 30, 2014

(This article was kindly contributed to R-bloggers)

by Yanchang Zhao

Following my first R-Hadoop system setup guide, written in September 2013, I have further tested setting up a Hadoop system for running R code, as well as using HBase, both on a single computer and on a cluster of computers. The process is described in a newer version of the guide to setting up an R-Hadoop system, updated on 30 May 2014. The guide also provides links to MapReduce and Hadoop documentation and to examples of R-Hadoop code.

A summary of the detailed guide follows.

A list of software used for this setup:
– OS and other tools:
Mac OS X 10.6.8, Java 1.6.0_65, Homebrew, thrift 0.9.0
– Hadoop and HBase:
Hadoop 1.1.2, HBase 0.94.17
– R and RHadoop packages:
R 3.1.0, rhdfs 1.0.8, rmr2 3.1.0, plyrmr 0.2.0, rhbase 1.2.0


1. Set up single-node Hadoop
1.1 Download Hadoop
1.2 Set up Hadoop in standalone mode
1.2.1 Set JAVA_HOME
1.2.2 Set up remote desktop and enable self-login
1.2.3 Run Hadoop
1.3 Test Hadoop
1.3.1 Example 1 – calculate pi
1.3.2 Example 2 – word count

2. Set up Hadoop in cluster mode
2.1 Switching between different modes
2.2 Set up name node (master machine)
2.3 Set JAVA_HOME, set up remote desktop and enable self-login on all nodes
2.4 Copy public key
2.5 Firewall
2.6 Set up data nodes (slave machines)
2.7 Format name node
2.8 Run Hadoop
2.9 Test Hadoop

3. Set up HBase
3.1 Set up HBase
3.2 Switching between different modes

4. Install R

5. Install GCC, Homebrew, git, pkg-config and thrift
5.1 Download and install GCC
5.2 Install Homebrew
5.3 Install git and pkg-config
5.4 Install thrift 0.9.0

6. Environment settings: HADOOP_PREFIX and HADOOP_CMD
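
As a rough sketch of this step, the two variables might be exported from a shell profile as shown below. The install directory `$HOME/hadoop-1.1.2` is a hypothetical location chosen for illustration, and the `bin/hadoop` path assumes the usual layout of a Hadoop 1.x tarball; check both against your own installation.

```shell
# Hypothetical install directory (assumption): adjust to where Hadoop 1.1.2 was unpacked.
export HADOOP_PREFIX="${HADOOP_PREFIX:-$HOME/hadoop-1.1.2}"
# The hadoop launcher script is assumed to sit under bin/ of the install directory.
export HADOOP_CMD="$HADOOP_PREFIX/bin/hadoop"

# Warn early if the launcher is missing or not executable.
if [ ! -x "$HADOOP_CMD" ]; then
  echo "warning: $HADOOP_CMD not found; check HADOOP_PREFIX" >&2
fi
```

The `${HADOOP_PREFIX:-...}` form keeps any value that is already set, so the snippet can be sourced repeatedly without overwriting an existing configuration.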

7. Install RHadoop: rhdfs, rhbase, rmr2 and plyrmr
7.1 Install relevant R packages
7.2 Set environment variables HADOOP_CMD and HADOOP_STREAMING
7.3 Install RHadoop packages
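
For step 7.2, one possible way to set `HADOOP_STREAMING` is sketched below. The helper `find_streaming_jar` is a hypothetical name introduced here, the install directory is a placeholder, and the `contrib/streaming` location is an assumption about the Hadoop 1.x layout rather than something taken from the guide.

```shell
# Hypothetical install directory (assumption): adjust to your own Hadoop 1.1.2 location.
HADOOP_PREFIX="${HADOOP_PREFIX:-$HOME/hadoop-1.1.2}"
export HADOOP_CMD="$HADOOP_PREFIX/bin/hadoop"

# Hypothetical helper: print the first streaming jar found under the given
# install directory, assuming it lives in contrib/streaming as in Hadoop 1.x.
find_streaming_jar() {
  for jar in "$1"/contrib/streaming/hadoop-streaming-*.jar; do
    if [ -f "$jar" ]; then
      printf '%s\n' "$jar"
      return 0
    fi
  done
  return 1
}

# Export the variable only when a jar was actually found.
if HADOOP_STREAMING="$(find_streaming_jar "$HADOOP_PREFIX")"; then
  export HADOOP_STREAMING
else
  echo "warning: no streaming jar under $HADOOP_PREFIX/contrib/streaming" >&2
fi
```

Variables exported this way are visible to an R session started from the same shell; they can also be set from within R using `Sys.setenv()` before loading the RHadoop packages.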

8. Run an R job on Hadoop for word counting
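
The R job itself is covered in the detailed guide. As a rough local analogue, not the guide's code and not Hadoop, the shell function below mimics the map, shuffle and reduce phases of a word count, which can help when checking the output of the real job on a small sample. The function name `wordcount` is hypothetical.

```shell
# Local stand-in for the three MapReduce phases of a word count (not Hadoop itself).
wordcount() {
  # map:     split the input into one word per line, dropping empty lines
  # shuffle: sort so that identical words become adjacent
  # reduce:  count each run of identical words, print "word<TAB>count"
  tr -s '[:space:]' '\n' \
    | sed '/^$/d' \
    | LC_ALL=C sort \
    | uniq -c \
    | awk '{ print $2 "\t" $1 }'
}

# Small sample input; a real job would read its input from HDFS instead.
printf 'hello world\nhello hadoop\n' | wordcount
```

The sort step plays the role of Hadoop's shuffle: the counting stage only works because all occurrences of a word arrive next to each other.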

If you have successfully set up your R-Hadoop system, please share your success with R users at this thread. Please also do not forget to forward this tutorial to friends and colleagues who are interested in running R on Hadoop.

If you have any comments or suggestions, or find errors in the above process, please feel free to post them to the above thread or to the RDataMining group.

