Extending RevoScaleR for Mining Big Data – Naive Bayes

[This article was first published on Revolutions, and kindly contributed to R-bloggers].

by Derek McCrae Norton, Senior Sales Engineer

In this third installment (following part 1 and part 2) of Extending RevoScaleR for Mining Big Data, we look at how to use the building blocks provided by RevoScaleR to create a Naive Bayes model.

Motivation: Fit a Naive Bayes model to big data.

[Image: SimpleBayesNet]

Naive Bayes is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions. It is often a good benchmark for more complicated data mining models.

Fitting the model really amounts to calculating proportions for categorical variables (with a possible Laplace correction) and, for numeric variables, probabilities based on a normal distribution. The proportions are easily calculated using rxCrossTabs, and the normal probabilities are easily calculated given a mean and standard deviation within each class, which we can get from rxSummary.
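To make the calculation concrete, here is a minimal sketch in base R on a small, made-up data set. It is not the rxNaiveBayes code: table() and tapply() are stand-ins (my assumption) for the counts and per-class moments that rxCrossTabs and rxSummary would supply on big data, and the variable names and values are hypothetical.

```r
# Toy data standing in for a data set too large for memory (made-up values).
train <- data.frame(
  outlook = factor(c("sunny", "sunny", "rain", "rain", "overcast", "overcast")),
  temp    = c(30, 28, 18, 20, 24, 22),
  play    = factor(c("no", "no", "yes", "no", "yes", "yes"))
)

laplace <- 1                      # Laplace correction added to every cell count
classes <- levels(train$play)

# Class priors from the class counts.
class_counts <- as.numeric(table(train$play))
prior <- setNames(class_counts / sum(class_counts), classes)

# Categorical predictor: conditional proportions P(outlook | play).
# Stand-in for counts that rxCrossTabs would provide on big data.
counts <- unclass(table(train$play, train$outlook))
cond_outlook <- (counts + laplace) / (rowSums(counts) + laplace * ncol(counts))

# Numeric predictor: mean and standard deviation within each class.
# Stand-in for the summary statistics that rxSummary would provide.
temp_mean <- as.numeric(tapply(train$temp, train$play, mean))
temp_sd   <- as.numeric(tapply(train$temp, train$play, sd))

# Score one new observation:
# prior * P(outlook | class) * normal density of temp given class.
new_obs <- list(outlook = "rain", temp = 19)
score <- prior *
  cond_outlook[classes, new_obs$outlook] *
  dnorm(new_obs$temp, mean = temp_mean, sd = temp_sd)

posterior <- score / sum(score)   # normalise so the class probabilities sum to 1
predicted <- names(posterior)[which.max(posterior)]
```

Note that the fit only needs cell counts and per-class means and standard deviations, never the raw rows all at once, which is why chunk-wise summaries are enough to build the model.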

We can use existing e1071 code and replace the calculation of proportions and probabilities with big data versions. Not only is the resulting model not big data itself, but existing methods also work on the resulting object!

You can test this out yourself with the function rxNaiveBayes on GitHub.

Conclusions

It is pretty easy to extend RevoScaleR to do many tasks. These are only three examples, but there are more on the GitHub page. I also have a few more complicated examples that should be up eventually.

If you have an interest in helping to extend the functionality of RevoScaleR, or just want to test some of the things I have created, please have a look at RevoEnhancements on GitHub.
