Extending RevoScaleR for Mining Big Data – Naive Bayes

May 3, 2013
(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Derek McCrae Norton, Senior Sales Engineer

In this third installment (following part 1 and part 2) of Extending RevoScaleR for Mining Big Data we look at how to use the building blocks provided by RevoScaleR to create a Naive Bayes model.

Motivation: Fit a Naive Bayes model to big data.

[Figure: simple Bayes net]

Naive Bayes is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions. It is often a good benchmark against which to compare more complicated data mining models.
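Written out, the independence assumption means the posterior for a class factors into one term per predictor. The notation here is introduced only for illustration: C is the class and x_1, ..., x_n are the predictor values.

```latex
P(C \mid x_1, \dots, x_n) \;\propto\; P(C) \prod_{i=1}^{n} P(x_i \mid C),
\qquad
\hat{C} = \arg\max_{C} \; P(C) \prod_{i=1}^{n} P(x_i \mid C)
```

Each factor P(x_i | C) depends on one predictor at a time, which is why the model can be built from simple per-variable summaries.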

Really you are just calculating two things within each class: proportions for categorical variables (with a possible Laplace correction), and probabilities based on a normal distribution for numeric variables. The proportions are easily calculated using rxCrossTabs, and the normal probabilities are easily calculated given a mean and standard deviation, which we can get from rxSummary.
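The reason these two summaries suit big data is that both can be accumulated one block of rows at a time. The following is a minimal Python sketch of that idea, not RevoScaleR code: the function names, the toy chunks, and the column names y, color and x are all hypothetical, and plain dictionaries stand in for what rxCrossTabs and rxSummary would return.

```python
from collections import defaultdict
import math

def update_counts(counts, chunk, class_key, feature_key):
    """Add one chunk's (class, level) counts to a running cross-tabulation."""
    for row in chunk:
        counts[row[class_key]][row[feature_key]] += 1

def laplace_proportions(counts, levels, laplace=1.0):
    """Turn counts into P(level | class), adding `laplace` to every cell."""
    table = {}
    for cls, level_counts in counts.items():
        total = sum(level_counts.get(lv, 0) for lv in levels) + laplace * len(levels)
        table[cls] = {lv: (level_counts.get(lv, 0) + laplace) / total for lv in levels}
    return table

def update_moments(moments, chunk, class_key, feature_key):
    """Accumulate n, sum and sum of squares per class for a numeric feature."""
    for row in chunk:
        m = moments[row[class_key]]
        x = row[feature_key]
        m[0] += 1
        m[1] += x
        m[2] += x * x

def mean_sd(moments):
    """Per-class mean and sample standard deviation from accumulated moments.

    Assumes every class has at least two rows.
    """
    out = {}
    for cls, (n, s, ss) in moments.items():
        mean = s / n
        var = (ss - n * mean * mean) / (n - 1)
        out[cls] = (mean, math.sqrt(max(var, 0.0)))
    return out

# Toy data split into two chunks, standing in for blocks read from disk.
chunks = [
    [{"y": "yes", "color": "red", "x": 1.0},
     {"y": "yes", "color": "blue", "x": 3.0}],
    [{"y": "no", "color": "red", "x": 10.0},
     {"y": "no", "color": "red", "x": 14.0}],
]

counts = defaultdict(lambda: defaultdict(int))
moments = defaultdict(lambda: [0, 0.0, 0.0])
for chunk in chunks:
    update_counts(counts, chunk, "y", "color")
    update_moments(moments, chunk, "y", "x")

color_table = laplace_proportions(counts, levels=["red", "blue"])
x_stats = mean_sd(moments)
```

Only the running counts and moments are kept in memory, never the rows themselves. The sum-of-squares formula is used here for brevity; it can lose precision on badly scaled data, so a more careful version would use a numerically stable update.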

We can use the existing e1071 code and replace its calculation of proportions and probabilities with big data versions. The result is an object that is not itself big data, and the existing e1071 methods work on it!
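To see why the fitted object stays small, note that scoring a new row needs only the class priors, the per-class proportion tables, and the per-class means and standard deviations. The Python sketch below illustrates this under stated assumptions: the layout of the model dictionary, its numbers, and the function names are hypothetical and are not the structure e1071 or rxNaiveBayes actually uses.

```python
import math

# A hypothetical fitted model: nothing in it grows with the number of rows.
model = {
    "prior": {"yes": 0.5, "no": 0.5},
    "categorical": {"color": {"yes": {"red": 0.5, "blue": 0.5},
                              "no": {"red": 0.75, "blue": 0.25}}},
    "numeric": {"x": {"yes": (2.0, 1.5), "no": (12.0, 3.0)}},  # (mean, sd)
}

def normal_log_density(x, mean, sd):
    """Log of the normal density with the given mean and standard deviation."""
    return -0.5 * math.log(2.0 * math.pi * sd * sd) - (x - mean) ** 2 / (2.0 * sd * sd)

def posterior(model, row):
    """Posterior class probabilities for one row, computed in log space."""
    log_scores = {}
    for cls, prior in model["prior"].items():
        score = math.log(prior)
        for name, table in model["categorical"].items():
            score += math.log(table[cls][row[name]])
        for name, stats in model["numeric"].items():
            mean, sd = stats[cls]
            score += normal_log_density(row[name], mean, sd)
        log_scores[cls] = score
    # Subtract the largest score before exponentiating to avoid underflow.
    top = max(log_scores.values())
    weights = {cls: math.exp(s - top) for cls, s in log_scores.items()}
    total = sum(weights.values())
    return {cls: w / total for cls, w in weights.items()}

probs = posterior(model, {"color": "blue", "x": 3.0})
predicted = max(probs, key=probs.get)
```

Working with log scores rather than raw products is a common design choice, since multiplying many small probabilities can underflow when there are many predictors.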

You can test this out yourself with the function rxNaiveBayes at github.

Conclusions

It is pretty easy to extend RevoScaleR to do many tasks. These are only three examples, but there are more on the github page. I also have a few more complicated examples that should be up eventually.

If you have an interest in helping to extend the functionality of RevoScaleR or just want to test some of the things I have created, please have a look at RevoEnhancements on github.
