Resampling data in Hadoop with RHadoop

February 27, 2013

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

On Revolution Analytics partner Cloudera's blog, Uri Laserson has posted an excellent guide to resampling from a large data set in Hadoop. Resampling is an important step in fitting ensemble models (including random forests and other bagging techniques), and Uri provides a step-by-step guide to implementing resampling methods using RHadoop. He provides the complete map-reduce code in the R language, as well as a useful script for installing RHadoop on a Cloudera instance.  
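Uri's post has the real R map-reduce code. For readers who just want a feel for how resampling with replacement can be spread across mappers, here is a rough, library-free sketch in Python. It assumes the Poisson approximation to bootstrap sampling, in which each record appears in each resample a Poisson-distributed number of times. That is one common way to do it and not necessarily the approach taken in Uri's guide. The names `map_record`, `reduce_model` and `run` are made up for illustration, the "shuffle" is simulated in memory, and the "model" is just the mean of each resample.

```python
import math
import random
from collections import defaultdict


def poisson(lam, rng):
    """Draw from Poisson(lam) with Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def map_record(record, n_models, frac, rng):
    """Mapper (hypothetical): emit (model_id, record) once per copy in each resample."""
    for model_id in range(n_models):
        for _ in range(poisson(frac, rng)):
            yield model_id, record


def reduce_model(model_id, records):
    """Reducer (hypothetical): stand-in for model fitting, here the resample mean."""
    return model_id, len(records), sum(records) / len(records)


def run(data, n_models=5, frac=1.0, seed=0):
    """Simulate map, shuffle and reduce in memory for a small data set."""
    rng = random.Random(seed)
    groups = defaultdict(list)  # plays the role of the shuffle phase
    for record in data:
        for key, value in map_record(record, n_models, frac, rng):
            groups[key].append(value)
    return [reduce_model(key, values) for key, values in sorted(groups.items())]


if __name__ == "__main__":
    for model_id, size, mean in run(list(range(1000))):
        print(model_id, size, round(mean, 1))
```

Because each mapper decides on its own how many copies of a record go to each model, no mapper needs to see the whole data set. Each resample's size is only approximately `frac` times the input size, which is the price of the approximation.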

By the way, if you're new to RHadoop, here's RHadoop creator and project leader Antonio Piccolboni introducing RHadoop at last year's Strata CA conference.

  

Cloudera blog: How-to: Resample from a Large Data Set in Parallel (with R on Hadoop)

