Naive Bayes Classification in R (Part 2)


Following on from Part 1 of this two-part post, I would now like to explain how the Naive Bayes classifier works before applying it to a classification problem involving breast cancer data. The dataset comes from Matjaz Zwitter and Milan Soklic of the Institute of Oncology, University Medical Center in Ljubljana, Slovenia (formerly Yugoslavia), and the attributes are as follows:

age: a series of ranges from 20-29 to 70-79

menopause: whether a patient was pre- or post-menopausal upon diagnosis

tumor.size: the largest diameter (mm) of the excised tumor

inv.nodes: the number of axillary lymph nodes which contained metastatic breast cancer

node.caps: whether metastatic cancer was contained by the lymph node capsule

deg.malign: the histological grade of the tumor (1-3 with 3 = highly abnormal cells)

breast: which breast the cancer occurred in

breast.quad: the region of the breast the cancer occurred in (four quadrants, with the nipple as the central point)

irradiat: whether the patient underwent radiation therapy

Some preprocessing of these data was required as there were some NAs (9 in total); I imputed predicted values for these using separate Naive Bayes classifiers, as sketched below. The objective is to predict, using these attributes and with relatively high accuracy, whether a recurrence of breast cancer is likely in patients who were previously diagnosed with and treated for the disease. We can pursue this objective using the Naive Bayes classification method.
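As a rough illustration of that imputation step, here is a minimal sketch. It assumes the data sit in a data frame named breast_cancer and that node.caps is one of the attributes containing NAs (both names are assumptions for illustration):

library(e1071)

# Rows where the attribute to be imputed is missing (node.caps assumed here)
missing <- is.na(breast_cancer$node.caps)

# Train a Naive Bayes classifier for that attribute on the complete rows
impute_model <- naiveBayes(node.caps ~ ., data = breast_cancer[!missing, ])

# Fill the NAs with the model's predictions
breast_cancer$node.caps[missing] <- predict(impute_model, newdata = breast_cancer[missing, ])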

Naive Bayes Classification

Below is Bayes’ Theorem, on which the Naive Bayes classifier is built:

P(A | B) = P(A) * P(B | A) / P(B)

This can be derived from the general multiplication rule for AND events by writing the joint probability in both orders:

P(A and B) = P(A) * P(B | A)

P(A and B) = P(B) * P(A | B)

Setting the two right-hand sides equal and dividing through by P(B) recovers the theorem above.

If I replace the letters with the meaningful words I have been using throughout, the Naive Bayes formula becomes:

P(outcome | evidence) = P(outcome) * P(evidence | outcome) / P(evidence)

It is with this formula that the Naive Bayes classifier calculates conditional probabilities for a class outcome given prior information, or evidence (our attributes in this case). It is termed “naive” because we assume independence between attributes when in reality they may be dependent in some way. In the breast cancer dataset we will be working with, some attributes are clearly dependent, such as age and menopause status, while others may or may not be, such as histological grade and tumor size.

This assumption allows us to calculate the probability of the evidence by multiplying the individual probabilities of each piece of evidence using the simple multiplication rule for independent AND events. Note that this naivety yields probabilities that are not entirely correct mathematically, but they are a good approximation and adequate for the purposes of classification. Indeed, the Naive Bayes classifier has proven highly effective and is commonly deployed in email spam filters.
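Written out, the independence assumption means the likelihood of the evidence factorizes into a product over the individual attributes e1, ..., en:

P(evidence | outcome) = P(e1 | outcome) * P(e2 | outcome) * ... * P(en | outcome)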

Calculating Conditional Probabilities

Conditional probabilities are fundamental to the working of the Naive Bayes formula. Tables of conditional probabilities must be created in order to obtain values to use in the Naive Bayes algorithm. The R package e1071 contains a very nice function for creating a Naive Bayes model:

library(e1071)

# Fit the model: class is predicted from all other attributes
model <- naiveBayes(class ~ ., data = breast_cancer)

class(model)    # "naiveBayes"
summary(model)  # structure of the fitted object
print(model)    # a-priori probabilities and conditional probability tables

The model has class “naiveBayes” and the summary tells us that it provides a-priori probabilities of the no-recurrence and recurrence events as well as conditional probability tables across all attributes. To examine the conditional probability tables, just print the model.
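For example, the a-priori counts and a single attribute’s table can be pulled straight from the fitted object (the tables component is a list named after each attribute):

model$apriori      # class counts underlying the a-priori probabilities
model$tables$age   # conditional probabilities of each age range given class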

One of our tasks for this assignment was to create code which would give us the same conditional probabilities as those output by the naiveBayes() function. I went about this in the following way:

# Tabulate each attribute (columns 1-9) against the class attribute (column 10)
tbl_list <- sapply(breast_cancer[-10], table, breast_cancer[, 10])

# Transpose each table so that the rows represent the class attribute
tbl_list <- lapply(tbl_list, t)

# Divide each element by its row sum to obtain conditional probabilities
cond_probs <- sapply(tbl_list, function(x) {
  apply(x, 1, function(x) {
    x / sum(x) }) })

# Transpose back to match the structure of the naiveBayes output
cond_probs <- lapply(cond_probs, t)

print(cond_probs)

The first line of code uses the sapply function to loop over all attribute variables in the dataset and create tables against the class attribute. I then used the lapply function to transpose all tables in the list so the rows represented the class attribute.

To calculate conditional probabilities for each element in the tables, I used sapply, lapply and anonymous functions. I had to transpose the output in order to get the same structure as the naiveBayes model output. Finally, I printed out my calculated conditional probabilities and compared them with the naiveBayes output to validate the calculations.
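A quick way to spot-check the result against the model output for a single attribute, age for example:

cond_probs$age     # manually calculated conditional probabilities
model$tables$age   # table produced by naiveBayes()

# The values should agree, although attributes such as dimnames may differ
all.equal(cond_probs$age, model$tables$age, check.attributes = FALSE)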

Applying the Naive Bayes Classifier

So I’ve explained (hopefully reasonably well) how the Naive Bayes classifier works based on the fundamental rules of probability. Now it’s time to apply the model to the data. This is easily done in R by using the predict() function.

preds <- predict(model, newdata = breast_cancer)
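By default, predict() returns the most likely class for each observation; passing type = "raw" returns the posterior probabilities themselves:

# Posterior probability of each class for every observation
probs <- predict(model, newdata = breast_cancer, type = "raw")
head(probs)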

You will see that I have trained the model on the entire dataset and then made predictions on that same dataset. In our assignment we were asked to train the model and test it on the full dataset, treating it as an unlabeled test set. This is unconventional, as the training and test sets are then identical, but I believe the assignment was simply intended to test our understanding of how the method works. In practice, one would use a training set for the model to learn from and a separate test set to assess model accuracy, along the lines of the sketch below.
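For completeness, here is a minimal sketch of such a split (the 70/30 proportion and the seed are arbitrary choices, not part of the assignment):

# Hold out a random 30% of rows as a test set
set.seed(42)
train_idx <- sample(nrow(breast_cancer), floor(0.7 * nrow(breast_cancer)))
train <- breast_cancer[train_idx, ]
test <- breast_cancer[-train_idx, ]

split_model <- naiveBayes(class ~ ., data = train)
split_preds <- predict(split_model, newdata = test)
table(split_preds, test$class)  # confusion matrix on held-out data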

If one outcome class is more abundant in the dataset, as is the case with the breast cancer data (no-recurrence: 201, recurrence: 85), the data are unbalanced. This is fine for a generative model such as Naive Bayes, as we want the a-priori probabilities to reflect the real-world class frequencies; manipulating the data to reduce the skew would distort those priors.

Tabulating the model’s predictions against the true labels gives a confusion matrix from which a model accuracy of 75% can be calculated:

conf_matrix <- table(preds, breast_cancer$class)
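Printing the matrix and dividing the correctly classified counts (the diagonal) by the total yields the accuracy figure:

conf_matrix

# Overall accuracy: correct predictions / total predictions
sum(diag(conf_matrix)) / sum(conf_matrix)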

This post has only scratched the surface of classification methods in machine learning, but it has been useful revision for me and may perhaps help others new to the Naive Bayes classifier. Please feel free to comment and correct any errors that may be present.

 

Featured image By Dennis Hill from The OC, So. Cal. – misc 24, CC BY 2.0, https://commons.wikimedia.org/w/index.php?curid=54704175

 

