by Derek McCrae Norton, Senior Sales Engineer
In this second installment of Extending RevoScaleR for Mining Big Data, we look at how to use the building blocks provided by RevoScaleR to transform continuous variables into discrete ones.
Motivation: Discretize continuous variables on big data.
Discretization is a technique to convert continuous variables into discrete variables, and it is sometimes useful in data mining models such as Naïve Bayes. There are two basic methods, Equal Width and Equal Frequency, as well as many advanced methods such as Chi2, ChiMerge, and Tree Based methods.
If we consider the two basic methods, they are quite easy to implement in RevoScaleR.
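Equal Width - Simply divide the range of the variable into k buckets of equal width. The range (low and high values) of each variable is precalculated in XDF files, which means most of the work is already done!
Equal Frequency - Choose the bucket boundaries so that each bucket holds roughly the same number of observations. rxQuantile is a function that efficiently calculates the quantiles needed for those boundaries.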
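To make the two methods concrete, here is a minimal sketch of how the bucket boundaries could be computed. It is plain Python on a small in-memory list, not RevoScaleR code: the function names equal_width_breaks and equal_frequency_breaks are hypothetical, min/max stand in for the range stored in the XDF file, and statistics.quantiles stands in for rxQuantile.

```python
import statistics

def equal_width_breaks(low, high, k):
    """Return k + 1 edges that split [low, high] into k equal-width buckets."""
    width = (high - low) / k
    # Use `high` itself as the last edge to avoid floating-point drift.
    return [low + i * width for i in range(k)] + [high]

def equal_frequency_breaks(values, k):
    """Return k + 1 edges so each bucket holds roughly the same number of values."""
    # k buckets need k - 1 interior cut points; the min and max close the ends.
    inner = statistics.quantiles(values, n=k, method="inclusive")
    return [min(values)] + inner + [max(values)]

# Toy data with one large outlier (illustrative values only).
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

width_breaks = equal_width_breaks(min(values), max(values), 3)
freq_breaks = equal_frequency_breaks(values, 3)

print(width_breaks)
print(freq_breaks)
```

On skewed data like this, the equal-width edges are stretched by the outlier so that most values land in the first bucket, while the equal-frequency edges follow the bulk of the data.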
Bring it all together and use cut inside of an rxDataStep transform to create new discretized variables.
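The idea of applying the boundaries chunk by chunk can be sketched as follows. Again this is plain Python rather than RevoScaleR: the hypothetical cut function only loosely mimics R's cut with right-closed intervals and the lowest edge included, and the hypothetical discretize_in_chunks generator stands in for a transform that sees one block of rows at a time. The break values are assumed to have been computed beforehand.

```python
from bisect import bisect_left

def cut(value, breaks):
    """Return the 1-based bucket index of `value`.

    Buckets are right-closed, (lo, hi], except the first, which also
    includes the lowest edge. Values outside the range give None.
    """
    if value < breaks[0] or value > breaks[-1]:
        return None
    # Searching only the interior edges counts how many of them are
    # strictly below `value`; with lo=1 the result is the bucket number.
    return bisect_left(breaks, value, 1, len(breaks) - 1)

def discretize_in_chunks(chunks, breaks):
    """Yield one list of bucket indices per chunk, never holding all data at once."""
    for chunk in chunks:
        yield [cut(v, breaks) for v in chunk]

# Assumed inputs: three chunks of a larger column and precomputed edges.
chunks = [[1, 2, 3, 4], [5, 6, 7], [8, 9, 100]]
breaks = [1, 4.0, 7.0, 100]

result = list(discretize_in_chunks(chunks, breaks))
print(result)
```

Because the boundaries are fixed before the pass starts, each chunk can be processed independently, which is what makes the approach workable on data that does not fit in memory.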
You can test this out yourself with the function rxDiscretize on GitHub.
Look for upcoming posts on other ways to extend RevoScaleR for Mining Big Data.