by Derek McCrae Norton, Senior Sales Engineer
Motivation: Fit a Naive Bayes model to big data.
Naive Bayes is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions: given the class, every predictor is treated as independent of the others. It is often a good benchmark for other, more complicated data mining models.
Fitting the model really comes down to two calculations: proportions for categorical variables (with an optional Laplace correction), and probabilities based on a normal distribution for numeric variables. The proportions are easily calculated using rxCrossTabs, and the normal probabilities are easily calculated from a mean and standard deviation, both of which we can get from rxSummary.
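To make those two calculations concrete, here is a minimal sketch in plain Python. It is an illustration only, not the rxNaiveBayes or e1071 code: the function names `conditional_proportions` and `normal_density` and the toy data are made up for this example, and the Laplace correction is assumed to add the constant to every cell and the constant times the number of levels to each class total.

```python
import math
from collections import Counter

def conditional_proportions(values, labels, laplace=0.0):
    """Estimate P(value | class) from cross-tabulated counts.

    Assumed correction: (count + laplace) / (class total + laplace * n_levels).
    """
    levels = sorted(set(values))
    classes = sorted(set(labels))
    cells = Counter(zip(labels, values))   # the cross-tab cells
    class_totals = Counter(labels)         # the row totals
    table = {}
    for c in classes:
        denom = class_totals[c] + laplace * len(levels)
        table[c] = {v: (cells[(c, v)] + laplace) / denom for v in levels}
    return table

def normal_density(x, mean, sd):
    """Normal density at x; only a mean and a standard deviation are needed."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

# Hypothetical toy data: one categorical predictor and a class label.
outlook = ["sunny", "sunny", "rainy", "rainy", "sunny"]
play = ["no", "no", "yes", "yes", "yes"]

table = conditional_proportions(outlook, play, laplace=1.0)
density = normal_density(0.0, mean=0.0, sd=1.0)
```

Note that without the correction, "rainy" would get probability zero for the "no" class because that combination never occurs in the toy data; the correction keeps every proportion strictly positive.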
We can use the existing e1071 code and replace the calculation of proportions and probabilities with big data versions. Not only is the resulting model small (it is not big data), but existing methods work on the resulting object!
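The reason the result is small can be sketched as follows: the data are read one chunk at a time, and only a count, a sum, and a sum of squares per class are kept. This is a plain Python illustration under stated assumptions, not how RevoScaleR is implemented; `accumulate`, `finalize`, and the chunks are hypothetical, and the sum-of-squares formula is chosen for brevity even though it is less numerically stable than an online update.

```python
import math
from collections import defaultdict

def accumulate(chunks):
    """Fold chunks of (label, x) pairs into per-class [n, sum, sum of squares]."""
    stats = defaultdict(lambda: [0, 0.0, 0.0])
    for chunk in chunks:              # only one chunk is in memory at a time
        for label, x in chunk:
            s = stats[label]
            s[0] += 1
            s[1] += x
            s[2] += x * x
    return stats

def finalize(stats):
    """Turn the accumulators into (mean, sample sd) per class.

    Assumes at least two observations per class.
    """
    out = {}
    for label, (n, total, total_sq) in stats.items():
        mean = total / n
        var = (total_sq - n * mean * mean) / (n - 1)
        out[label] = (mean, math.sqrt(max(var, 0.0)))
    return out

# Hypothetical data split across three chunks.
chunks = [
    [("a", 1.0), ("b", 10.0)],
    [("a", 2.0), ("a", 3.0)],
    [("b", 14.0)],
]
summary = finalize(accumulate(chunks))
```

However many rows pass through, `summary` holds just one mean and one standard deviation per class, which is all the normal-distribution part of the model needs.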
You can test this out yourself with the function rxNaiveBayes on GitHub.
It is pretty easy to extend RevoScaleR to do many tasks. These are only three examples, but there are more on the GitHub page. I also have a few more complicated examples that should be up eventually.
If you are interested in helping to extend the functionality of RevoScaleR, or just want to test some of the things I have created, please have a look at RevoEnhancements on GitHub.