Step-by-Step Guide to Setting Up an R-Hadoop System

[This article was first published on blog.RDataMining.com, and kindly contributed to R-bloggers.]

by Yanchang Zhao
RDataMining.com

Following my first R-Hadoop system setup guide, written in Sept 2013, I have further tested setting up a Hadoop system for running R code, as well as using HBase. I have tested it both on a single computer and on a cluster of computers. The process is described in a newer version of the guide to setting up an R-Hadoop system, which was updated on 30 May 2014. The guide also provides links to MapReduce and Hadoop documents and to examples of R-Hadoop code.

See the detailed guide at http://www.rdatamining.com/tutorials/r-hadoop-setup-guide. Below is a summary of it.

A list of software used for this setup:
– OS and other tools:
Mac OS X 10.6.8, Java 1.6.0_65, Homebrew, thrift 0.9.0
– Hadoop and HBase:
Hadoop 1.1.2, HBase 0.94.17
– R and RHadoop packages:
R 3.1.0, rhdfs 1.0.8, rmr2 3.1.0, plyrmr 0.2.0, rhbase 1.2.0

Steps

1. Set up single-node Hadoop
1.1 Download Hadoop
1.2 Set up Hadoop in standalone mode
1.2.1 Set JAVA_HOME
1.2.2 Set up remote desktop and enable self-login
1.2.3 Run Hadoop
1.3 Test Hadoop
1.3.1 Example 1 – calculate pi
1.3.2 Example 2 – word count
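
The two tests in step 1.3 can be sketched as the shell commands below. This is only a minimal sketch, not the exact commands from the detailed guide: it assumes Hadoop 1.1.2 is unpacked under `$HOME/hadoop-1.1.2` (a hypothetical location), that the examples jar shipped with that release is `hadoop-examples-1.1.2.jar`, and that `input` and `output` are placeholder directory names.

```shell
# Assumption: HADOOP_PREFIX points at the unpacked Hadoop 1.1.2 directory.
HADOOP_PREFIX="${HADOOP_PREFIX:-$HOME/hadoop-1.1.2}"
EXAMPLES_JAR="$HADOOP_PREFIX/hadoop-examples-1.1.2.jar"

if [ -x "$HADOOP_PREFIX/bin/hadoop" ] && [ -f "$EXAMPLES_JAR" ]; then
  # Example 1: estimate pi with 10 map tasks and 100 samples per map.
  "$HADOOP_PREFIX/bin/hadoop" jar "$EXAMPLES_JAR" pi 10 100

  # Example 2: count the words in the files under input/ and write to output/.
  # The output directory must not exist before the job starts.
  "$HADOOP_PREFIX/bin/hadoop" jar "$EXAMPLES_JAR" wordcount input output
  HADOOP_TEST_STATUS="ran"
else
  echo "Hadoop not found under $HADOOP_PREFIX; skipping the test jobs." >&2
  HADOOP_TEST_STATUS="skipped"
fi
```

If both jobs finish without errors, the standalone installation is most likely working.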

2. Set up Hadoop in cluster mode
2.1 Switching between different modes
2.2 Set up name node (master machine)
2.3 Set JAVA_HOME, set up remote desktop and enable self-login on all nodes
2.4 Copy public key
2.5 Firewall
2.6 Set up data nodes (slave machines)
2.7 Format name node
2.8 Run Hadoop
2.9 Test Hadoop
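
The key exchange, name node formatting and startup in steps 2.3 to 2.8 might look roughly like the sketch below. It is written as a dry run: the `run` helper (a name made up for this example) only prints each command unless `RUN=1` is set. The install path, the `hadoop@datanode1` login and the `master_key.pub` file name are all hypothetical, and the sketch assumes `~/.ssh` already exists on the data node.

```shell
# Dry-run helper: prints each command; set RUN=1 to execute it for real.
run() {
  echo "+ $*"
  if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

HADOOP_PREFIX="${HADOOP_PREFIX:-$HOME/hadoop-1.1.2}"   # assumed install location
DATA_NODE="${DATA_NODE:-hadoop@datanode1}"              # hypothetical user@host

# 2.3 Self-login: create a passphrase-less key (skip if you already have one)
# and authorize it on the local machine.
run ssh-keygen -t rsa -P "" -f "$HOME/.ssh/id_rsa"
run sh -c 'cat "$HOME/.ssh/id_rsa.pub" >> "$HOME/.ssh/authorized_keys"'

# 2.4 Copy the master's public key to a data node and authorize it there.
run scp "$HOME/.ssh/id_rsa.pub" "$DATA_NODE:master_key.pub"
run ssh "$DATA_NODE" 'cat master_key.pub >> ~/.ssh/authorized_keys'

# 2.7 Format the name node (this erases any existing HDFS metadata).
run "$HADOOP_PREFIX/bin/hadoop" namenode -format

# 2.8 Start the HDFS and MapReduce daemons, then list running Java processes.
run "$HADOOP_PREFIX/bin/start-all.sh"
run jps
```

Reviewing the printed commands before rerunning with `RUN=1` is a cheap safeguard, since formatting the name node is destructive.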

3. Set up HBase
3.1 Set up HBase
3.2 Switching between different modes

4. Install R

5. Install GCC, Homebrew, git, pkg-config and thrift
5.1 Download and install GCC
5.2 Install Homebrew
5.3 Install git and pkg-config
5.4 Install thrift 0.9.0

6. Environment settings: HADOOP_PREFIX and HADOOP_CMD
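
As a minimal sketch of step 6, the two variables could be exported from a shell startup file such as `~/.bash_profile`. The install location `$HOME/hadoop-1.1.2` is an assumption; replace it with wherever Hadoop was unpacked on your machine.

```shell
# Assumption: Hadoop 1.1.2 lives in $HOME/hadoop-1.1.2; change this to your own path.
export HADOOP_PREFIX="${HADOOP_PREFIX:-$HOME/hadoop-1.1.2}"

# HADOOP_CMD is the full path of the hadoop launcher script.
export HADOOP_CMD="$HADOOP_PREFIX/bin/hadoop"

# Optional: put the Hadoop scripts on the PATH for interactive use.
export PATH="$HADOOP_PREFIX/bin:$PATH"
```

Open a new terminal (or `source` the file) so that R sessions started afterwards inherit the variables.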

7. Install RHadoop: rhdfs, rhbase, rmr2 and plyrmr
7.1 Install relevant R packages
7.2 Set environment variables HADOOP_CMD and HADOOP_STREAMING
7.3 Install RHadoop packages
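
Steps 7.2 and 7.3 can be sketched as follows. The streaming jar path reflects the Hadoop 1.x layout (`contrib/streaming`); the install location and the tarball file names are assumptions derived from the versions listed above, and the tarballs must be downloaded into the current directory first. The R packages from step 7.1 are assumed to be installed already.

```shell
HADOOP_PREFIX="${HADOOP_PREFIX:-$HOME/hadoop-1.1.2}"   # assumed install location
export HADOOP_CMD="$HADOOP_PREFIX/bin/hadoop"
# In Hadoop 1.x the streaming jar sits under contrib/streaming.
export HADOOP_STREAMING="$HADOOP_PREFIX/contrib/streaming/hadoop-streaming-1.1.2.jar"

# Assumed tarball names, matching the package versions used in this setup.
for pkg in rhdfs_1.0.8 rmr2_3.1.0 plyrmr_0.2.0 rhbase_1.2.0; do
  if [ -f "$pkg.tar.gz" ]; then
    R CMD INSTALL "$pkg.tar.gz"
  else
    echo "Missing $pkg.tar.gz; download it before installing." >&2
  fi
done
```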

8. Run an R job on Hadoop for word counting
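
A word count job along the lines of step 8 could be launched as in the sketch below, which writes a small rmr2 script and runs it with `Rscript` when the RHadoop environment is available. This is not the code from the detailed guide: the file name `wordcount.R` and the HDFS paths are hypothetical, and the script simply splits lines on single spaces.

```shell
# Write a small rmr2 word count script; the file name is arbitrary.
cat > wordcount.R <<'EOF'
library(rmr2)

# Input and output are HDFS paths taken from the command line.
args <- commandArgs(trailingOnly = TRUE)

mapreduce(
  input = args[1],
  output = args[2],
  input.format = "text",
  # Map: split each line on spaces and emit (word, 1) pairs.
  map = function(k, lines) keyval(unlist(strsplit(lines, split = " ")), 1),
  # Reduce: sum the counts collected for each word.
  reduce = function(word, counts) keyval(word, sum(counts)),
  combine = TRUE
)
EOF

# Run only when Rscript is installed and the RHadoop variables are set.
if command -v Rscript >/dev/null 2>&1 \
   && [ -n "${HADOOP_CMD:-}" ] && [ -n "${HADOOP_STREAMING:-}" ]; then
  # Hypothetical HDFS paths; the output directory must not already exist.
  Rscript wordcount.R /user/hadoop/wordcount/input /user/hadoop/wordcount/output
else
  echo "Rscript or HADOOP_CMD/HADOOP_STREAMING not available; wrote wordcount.R only." >&2
fi
```

Setting `combine = TRUE` reuses the reduce function as a combiner, which is safe here because summing counts is associative.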

If you have successfully built your R-Hadoop system, could you please share your success with R users at this thread? Please also do not forget to forward this tutorial to friends and colleagues who are interested in running R on Hadoop.

If you have any comments or suggestions, or find errors in the above process, please feel free to post them to the above thread or to the RDataMining group at http://group.rdatamining.com.

Thanks.

