Resampling data in Hadoop with RHadoop

February 27, 2013

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

On the blog of Cloudera, a Revolution Analytics partner, Uri Laserson has posted an excellent guide to resampling from a large data set in Hadoop. Resampling is an important step in fitting ensemble models (including random forests and other bagging techniques), and Uri walks through implementing resampling methods with RHadoop step by step. He provides the complete map-reduce code in the R language, as well as a useful script for installing RHadoop on a Cloudera instance.
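For readers who want a feel for what "resampling" means here, the following is a minimal single-machine sketch of bootstrap resampling (drawing with replacement), the kind of resampling that bagging relies on. It is not Uri's code and does not use RHadoop or map-reduce; it is written in plain Python so it runs without a cluster, and the function names `bootstrap_resample` and `bootstrap_means` and the sample values are made up for illustration.

```python
import random
from statistics import mean


def bootstrap_resample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    return [rng.choice(data) for _ in range(len(data))]


def bootstrap_means(data, n_resamples, seed=0):
    """Return the mean of each of n_resamples bootstrap resamples.

    In bagging, each resample would instead be used to fit one model
    of the ensemble; the mean stands in for that step here.
    """
    rng = random.Random(seed)  # seeded so runs are repeatable
    return [mean(bootstrap_resample(data, rng)) for _ in range(n_resamples)]


if __name__ == "__main__":
    values = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]  # hypothetical data
    means = bootstrap_means(values, n_resamples=1000)
    # The spread of the resampled means indicates the estimate's variability.
    print(min(means), mean(means), max(means))
```

On a data set too large for one machine, the draw-with-replacement step is the part that has to be distributed, which is the problem the Cloudera guide addresses.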

By the way, if you're new to RHadoop, here's RHadoop creator and project leader Antonio Piccolboni introducing RHadoop at last year's Strata CA conference.


Cloudera blog: How-to: Resample from a Large Data Set in Parallel (with R on Hadoop)
