Extending RevoScaleR for Mining Big Data – Discretization

April 12, 2013
(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Derek McCrae Norton, Senior Sales Engineer

In this second installment of Extending RevoScaleR for Mining Big Data, we look at how to use the building blocks provided by RevoScaleR to transform continuous variables into discrete ones.

Motivation: Discretize continuous variables on big data.

Discretization is a technique for converting continuous variables into discrete variables, and it is sometimes useful in data mining models such as Naïve Bayes. There are two basic methods, Equal Width and Equal Frequency, as well as many advanced methods such as Chi2, ChiMerge, and tree-based methods.

Discretize

If we consider the two basic methods, they are quite easy to implement in RevoScaleR.  

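Equal Width - Simply divide the range of the variable into k buckets of equal width. The range (minimum and maximum) of each variable is precalculated in XDF files, which means most of the work is already done!

Equal Frequency - rxQuantile is a function that efficiently calculates quantiles, which serve as the bucket boundaries so that each of the k buckets holds roughly the same number of observations.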
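To make the two definitions concrete, here is a minimal sketch in base R on a small in-memory vector. It does not use RevoScaleR: min, max and quantile stand in for the precalculated XDF range and for rxQuantile, and the helper names equal_width_breaks and equal_freq_breaks are hypothetical, chosen only for this illustration.

```r
# Base R sketch: compute the break points for k buckets.
# equal_width_breaks() and equal_freq_breaks() are hypothetical helper names.

equal_width_breaks <- function(low, high, k) {
  # k buckets need k + 1 evenly spaced boundaries between the min and the max
  seq(low, high, length.out = k + 1)
}

equal_freq_breaks <- function(x, k) {
  # Boundaries at the 0, 1/k, ..., 1 quantiles; unique() guards against
  # duplicated boundaries when the data contain many ties
  unique(quantile(x, probs = seq(0, 1, length.out = k + 1), names = FALSE))
}

set.seed(42)
x <- rexp(1000)  # skewed toy data, so the two methods differ visibly

ew <- equal_width_breaks(min(x), max(x), k = 4)
ef <- equal_freq_breaks(x, k = 4)

# include.lowest = TRUE keeps the minimum value inside the first bucket
ew_counts <- table(cut(x, breaks = ew, include.lowest = TRUE))
ef_counts <- table(cut(x, breaks = ef, include.lowest = TRUE))

print(ew_counts)
print(ef_counts)
```

On skewed data like this, the equal-width buckets hold very different numbers of observations, while the equal-frequency buckets are balanced by construction.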

Bring it all together and use cut inside an rxDataStep transform to create the new discretized variables.
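The reason this pattern suits big data is that the breaks are computed once from global summaries, after which cut only ever needs the current chunk of rows. The sketch below imitates that chunk-wise processing in base R, assuming a data frame small enough to hold in memory; it is not the rxDiscretize implementation and makes no RevoScaleR calls. The function name discretize_chunk and the chunk size of 250 rows are assumptions made for illustration.

```r
# Base R sketch of the chunk-wise pattern; no RevoScaleR calls are used here.
set.seed(1)
df <- data.frame(x = runif(1000, min = 0, max = 10))

# Step 1: compute the breaks once from global summaries
# (here: equal width with k = 5, so 6 boundaries).
breaks <- seq(min(df$x), max(df$x), length.out = 6)

# Step 2: a transform that needs only the current chunk plus the fixed breaks.
# discretize_chunk() is a hypothetical name.
discretize_chunk <- function(chunk, breaks) {
  chunk$x_disc <- cut(chunk$x, breaks = breaks, include.lowest = TRUE)
  chunk
}

# Step 3: apply the transform to blocks of 250 rows, as a chunked reader would.
chunk_id <- ceiling(seq_len(nrow(df)) / 250)
chunks <- split(df, chunk_id)
out <- do.call(rbind, lapply(chunks, discretize_chunk, breaks = breaks))

# Because the breaks are global, every chunk shares the same factor levels.
print(levels(out$x_disc))
```

Had the breaks been recomputed inside each chunk, the bucket boundaries would differ from chunk to chunk and the resulting variable would be inconsistent, which is why they are calculated up front.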

You can test this out yourself with the function rxDiscretize, available on GitHub.

Look for upcoming posts on other ways to extend RevoScaleR for Mining Big Data.
