
Bayesian Naive Bayes for Classification with the Dirichlet Distribution

[This article was first published on Shifting sands, and kindly contributed to R-bloggers].
I have a classification task and have been reading up on various approaches. In the specific case where all inputs are categorical, one can use “Bayesian Naïve Bayes” with the Dirichlet distribution.

Poking through the freely available text by Barber, I found a rather detailed discussion in chapters 9 and 10, as well as example MATLAB code for the book, so I took it upon myself to port it to R as a learning exercise.

I was not immediately familiar with the Dirichlet distribution, but in this case it boils down to the intuitive counting approach to discrete event probabilities.

In a nutshell, we use the training data to learn the posterior distribution, whose parameters turn out to be counts of how often a given event occurs (plus the prior's pseudo-counts), grouped by class, feature and feature state.
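As a rough illustration of this counting step, here is a minimal R sketch. It is not the code from the linked files: the toy data frame `train`, the helper `dirichlet_counts` and the uniform prior pseudo-count `alpha = 1` are all assumptions made up for the example.

```r
# Toy training data: every column is a factor (values are invented).
train <- data.frame(
  outlook = factor(c("sunny", "sunny", "rain", "rain", "overcast", "rain")),
  windy   = factor(c("no", "yes", "no", "yes", "no", "no")),
  play    = factor(c("no", "no", "yes", "no", "yes", "yes"))
)

# Posterior Dirichlet parameters for one feature: observed counts plus a
# prior pseudo-count, one row per class and one column per feature state.
# (Hypothetical helper; alpha = 1 is an assumed uniform prior.)
dirichlet_counts <- function(feature, class, alpha = 1) {
  unclass(table(class, feature)) + alpha
}

features  <- c("outlook", "windy")
posterior <- lapply(train[features], dirichlet_counts, class = train$play)

# One count matrix per feature, so features may have different numbers of states.
print(posterior$outlook)
print(posterior$windy)
```

Keeping one matrix per feature in a list is a deliberate choice here: it sidesteps the restriction that every feature must have the same number of states.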

Prediction is a case of counting events in the test vector. The more these counts differ from the per-class trained counts, the lower the probability that the current candidate class is a match.
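To make the prediction step concrete, here is a hedged R sketch for a single test vector. For one observation, the Dirichlet posterior predictive probability of a feature state is simply its posterior count divided by the row total for that class, so the class score is a product of such ratios. The count tables, `class_counts` and the function `predict_class` are hypothetical and only illustrate the idea; they are not taken from the linked files.

```r
# Hypothetical posterior count tables (a prior pseudo-count of 1 is already
# included). Rows are classes, columns are feature states.
posterior <- list(
  outlook = matrix(c(1, 2, 2, 3, 3, 1), nrow = 2,
                   dimnames = list(c("no", "yes"),
                                   c("overcast", "rain", "sunny"))),
  windy   = matrix(c(2, 4, 3, 1), nrow = 2,
                   dimnames = list(c("no", "yes"), c("no", "yes")))
)

# Hypothetical class counts, again with a pseudo-count of 1 added.
class_counts <- c(no = 4, yes = 4)

# Class probabilities for one test vector `x` (a named list of state labels).
predict_class <- function(x, posterior, class_counts) {
  log_post <- log(class_counts / sum(class_counts))
  for (f in names(posterior)) {
    u <- posterior[[f]]
    # Posterior predictive probability of the observed state, per class.
    log_post <- log_post + log(u[, x[[f]]] / rowSums(u))
  }
  # Normalise in log space for numerical stability.
  probs <- exp(log_post - max(log_post))
  probs / sum(probs)
}

x <- list(outlook = "sunny", windy = "yes")
p <- predict_class(x, posterior, class_counts)
print(p)
```

Working with log probabilities avoids underflow when there are many features, which is why the sketch sums logs rather than multiplying ratios directly.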

Anyway, there are three files. The first is a straightforward port of Barber’s code, but this wasn’t very R-like, and in particular only seemed to handle input features with the same number of states.

I developed my own version that expects everything to be represented as factors. It is all a bit rough and ready but appears to work, and there is a test/example script up here. As a bigger test I ran it on a sample car evaluation data set from here. The confusion matrix is as follows:

testY   acc good unacc vgood
  acc    83    3    29     0
  good   16    5     0     0
  unacc  17    0   346     0
  vgood  13    0     0     6
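
For a quick summary of the table above, the following R sketch re-enters the matrix by hand and computes overall accuracy and per-class recall. It assumes that rows are the true test labels (`testY`) and columns are the predicted classes; the original output does not label the columns, so the name `predicted` is an assumption.

```r
# Confusion matrix copied from the table above.
# Assumption: rows are true labels, columns are predictions.
classes <- c("acc", "good", "unacc", "vgood")
cm <- matrix(
  c(83, 3,  29, 0,
    16, 5,   0, 0,
    17, 0, 346, 0,
    13, 0,   0, 6),
  nrow = 4, byrow = TRUE,
  dimnames = list(testY = classes, predicted = classes)
)

# Overall accuracy: correctly classified cases over all test cases.
accuracy <- sum(diag(cm)) / sum(cm)

# Per-class recall: correct predictions over the true count of each class.
recall <- diag(cm) / rowSums(cm)

print(round(accuracy, 3))
print(round(recall, 3))
```

Under that assumption, 440 of the 518 test cases sit on the diagonal, and most of the errors come from the smaller `good` and `vgood` classes being predicted as `acc`.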

That’s it for now. Comments/feedback appreciated. You can find me on Twitter here.

Links to files:

Barber Port
R amenable implementation
Example Usage
Everything in one directory (with data) here
