Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A while back, I adapted the Naive Bayes model used in this excellent video to classify text documents in Python. Classifying text documents is a fairly straightforward task which is fun to do, and so I wanted to test it out by classifying articles from the Guardian belonging to four categories. In the rest of this post, I’m going to outline some theory on probabilities and Bayes’ Law. Then, we’ll pre-process the Guardian news articles and attempt to classify as many of them to the correct section.

## Probabilities and Bayes’ law

In order to properly understand Bayes’ theorem, we need to wrap our heads around conditional probabilities. I will first go over an example where the probabilities are independent.

Suppose that we throw a die. The probability of getting any number is $\frac{1}{6}$ as there are 6 numbers on the die which all have an equal probability of showing up. Now suppose we throw the same die twice. The probability of, say, getting a value of 6 on both throws is $P(x,y) = \frac{1}{6} \frac{1}{6} = \frac{1}{36}$. The reason we are allowed to calculate the joint probability of x and y is because these events are independent. That is, the result of the first throw will not affect the result of the second throw, and we can calculate the joint probability by using $P(x, y) = P(x) P(y)$.

Conditional probabilities concern events that are not independent of one another. Simply put, the conditional probability of an event x occurring is the probability that x will occur given that we already know some other event, y, has occurred. For events that are not independent, we can calculate the probability that both events happened by using:

$P(x, y) = P(x) P(x|y)$.

We can then get to conditional probabilities by dividing this formula by $P(x)$ such that we get

$P(x|y) = \frac{P(x, y)}{P(x)}$.

How does all of this relate to Bayes’ Law? Well, the cool thing about Bayes’ law is that it offers us a way to estimate $P(x|y)$ given that we have empirical data on $P(y|x)$. Consider the convenience of this approach through the following (archetypical) example. Suppose that we want to build a spam filter for an e-mail client. In this case, we have two classes: spam and not spam (or ‘ham’, as it is often referred to in text books). We want to classify each incoming email as either one of these classes by looking closely at each word used in the email. The probability that we are interested in is $P(word|spam)$, or “what is the probability of this word occurring in an email given that this email is spam”.

The problem with this approach is that, in order for this to work properly, we need data on all spam and non-spam emails. Clearly, this is infeasible. Bayes’ law offers us a way to turn the logic around, such that it allows us to predict $P(y|x)$, or “what is the probability of this email being spam, given that I know that it contains this word”. The goal is to take these probabilities and extrapolate them to new data.

If we return to the section above on conditional probabilities for a moment, we can deduce that there is a relationship between the joint probabilities of events x and y $P(x, y)$ and the conditional probabilities $P(x|y)$ and $P(y|x)$, such that:

$P(x|y) P(y) = P(x, y) = P(y|x) P(x)$.

Assuming that $P(x) \neq 0$, we solve for $P(y|x)$ to get to Bayes’ Law

$P(y|x) = \frac{P(x|y) P(y)}{P(x)}$

In the formula above, the left-hand side is referred to as the posterior probability. This is the probability that an observation belongs to class $C_i$ given a number of classes. On the right side of the equation, $P(x|y)$ is referred to as the likelihood of an observation belonging to a class $C_i$. The likelihood is usually estimated as some density function that conveys how “likely” it is for a given observation to belong to a class.

$P(y)$ and $P(x)$ are referred to as ‘priors’. Essentially, priors can be treated as prior beliefs about the data, and ‘weight’ the data according to some belief we hold about their distribution. In the context of the Naive Bayes approach (see below), $P(y)$ is the probability of seeing a randomly chosen observation belonging to the $C_{th}$ class. So if we have a dataset with 50.000 observations of which 10.000 belong to class $C_1$ and 40.000 observations belong to class $C_2$, the priors are $(0.20, 0.80)$. $P(x)$ is regarded as a ‘scaling factor’.

## From Bayes’ law to the Naive Bayes Algorithm

Given a modeling problem, there are many ways to estimate the different elements of Bayes’ law. Naive Bayes is often used for high-dimensional datasets. This makes Naive Bayes applicable for a wide variety of problems, such as spam filters.

Naive Bayes is called ‘naive’ because it treats each of its inputs as independent. Especially in the case of text data, this is an erroneous assumption, as textual features are often tied to one another in various ways. Nonetheless, the Naive Bayes classifier usually achieves very good results on text data (as we will see).

In the context of a multinomial Naive Bayes model using a bag-of-words approach,1 the priors are defined as the number of observations in each class ($P(y)$ in the formula above) and the number of observations for each word over all classes ($P(x)$ in the formula above). The likelihood is defined as the number of times a word occurs in each class.

As an example, consider the following five documents:

Document w1 w2 w3 section
doc1 trump running president national
doc2 abe army japan world
doc3 trump japan army national
doc4 china trump president national
test trump china army ???

We want to calculate the section of the fifth document (the ‘test’ observation) by using documents one to four. The prior probabilities are $P(World) = 0.25$ and $P(National) = 0.75$. The likelihood for each word given its class is:

Word_section Likelihood
trump_national 0.3333333
running_national 0.1111111
president_national 0.2222222
abe_national 0.0000010
japan_national 0.1111111
army_national 0.1111111
china_national 0.1111111
china_world 0.0000010
trump_world 0.0000010
running_world 0.0000010
president_world 0.0000010
abe_world 0.3333333
army_world 0.3333333
japan_world 0.3333333

We can now calculate the posterior probability for both sections by using:

$p(class|document) = \frac{p(class) p(trump|c) p(china|c) p(army|c)}{p(words)}$

Given the words we see in document five, we calculate the probability that the documents belongs to each class, which amounts to $P(World) = 2.78E-14$ and $P(National) = 0.00103$. We then determine for which class the posterior probability is maximized, and that is the class to which we assign the document. In this case, we would assign document five to the class ‘National’.

As the denominator is the same for all classes in the formula above, it is common to leave it out and to calculate the log-odds of the data by taking the log of both sides. This also means that instead of multiplying the probabilities, we will be simply calculating the sum of the log-odds. To ensure we never take the log of zero, we add a ‘smoother’ (usually a small number).

## The data

As an example of how Naive Bayes works, we will now classify 5.966 articles belonging to four sections of the Guardian. The data we will be using for classification is scraped using the Guardian API. You can find the script to download articles on GitHub. The goal is to correctly classify as many articles to their category by using the article title and body.

Section Number.obs Proportion
Politics 1493 0.250
Sport 1504 0.252
Travel 1472 0.247
World news 1497 0.251

The data is split into a training set (90%), and a test set (10%) to evaluate the performance of the classifier.

## Pre-processing: cleaning text data

There is an entire field dedicated to the use of text as data. In this context, our pre-processing steps are relatively easy:

1. Clean data – remove whitespaces, punctuation, numbers etc.
2. Remove stopwords – remove words like for, if, and etc.
3. Stem data – try to find as many overlap between different variations of words.
4. Feature selection – avoid overfitting by selecting most important features.

### Cleaning data, removing stopwords and stemming

With text analysis, we try to remove as much ‘noise’ as we can from the data. This means that words like play. and Played should (in the general case) be counted as the same word. However, play. contains a period and Played contains a capital letter. As such, they will be treated as separate words. By cleaning text, we can remove punctuation, capitals, numbers and whitespaces such that we get play and played.

Although cleaning text will improve our features a lot, the words play and played will still be categorized as different words. In order to get around this problem, we stem data. Stemming attempts to capture the ‘root’ of each word, such that e.g. ‘love’ and ‘loved’ both become ‘lov’. An alternative approach to stemming is to ‘lemmatize’ the data. Lemmatization tries to determine the ‘position-of-speech’ of each word in a sentence. Using the information on the word position, lemmatization can then determine the proper base of the word.

Let’s consider an example where we clean a sentence, remove stopwords and stem the text:

“The president said voters should keep pressure on members of Congress to reach across party lines and ensure initiatives in his budget are enacted”

First, we clean the text of capital letters, punctuation and remove stopwords. This gives us:

“president said voters keep pressure members congress reach across party lines ensure initiatives budget enacted”

Next, we stem the text:

“presid said voter keep pressur member congress reach across parti line ensur initi budget enact”

Note that if we use POS tags, we don’t need to remove stopwords. Nor do we need to convert capital letters. Doing so would remove a lot of valuable information (e.g. the difference between “Bill” and “bill”). We can always filter a sentence for POS tags that aren’t nouns or verbs, and apply stopword removal after lemmatizing the text to filter out common words like ‘be’:

“The-DT president-NN said-VBD voters-NNS should-MD keep-VB pressure-NN on-IN members-NNS of-IN Congress-NNP to-TO reach-VB across-RP party-NN lines-NNS and-CC ensure-NN initiatives-NNS in-IN his-PRP\$ budget-NN are-VBP enacted-VBN”

If we then lemmatize this sentence, we end up with:

“The president say voter should keep pressure on member of Congress to reach across party line and ensure initiative in his budget be enact”

Or, if we choose to keep only nouns, verbs and remove stopwords, we get:

“president say voter keep pressure member Congress reach party line ensure initiative budget enact”

The choice between stemming and lemmatizing can be computational. Stemming is resource-friendly, while lemmatizing can be computationally intensive. This choice is further informed by the modeling technique. Recall that Naive Bayes treats all features as independent. As such, taking things like POS into account may not help much in providing higher accuracy, since the model doesn’t take this information into account. Models like Support Vector Machines, on the other hand, are better suited to process this information.

### Feature selection

When working with text data (like any other high-dimensional data), it is a good idea to remove sparse features (or variables). These are features which don’t add a lot of new information to the model, and may lead to overfitting. As an example, consider taking 20.000 news articles from the New York Times API and 4.000 articles from the Guardian API. Your goal is to train a model on the NYT articles and predict one of four categories of the Guardian articles. If you use all features that are present in the NYT articles, you end up with a lot of noise that may not be useful – or could be counter-informative – when predicting the section of the Guardian articles.

There are several ways to perform feature selection. A common method is to look at term frequency, establish a threshold (common values are 5 or 10), and remove all features that fall below that threshold. An extension of this approach is term-frequency inverse document frequency.

One method that is very applicable to the ‘bag-of-words’ approach is to perform a chi-squared test. In essence, the chi-square test is a non-parametric test of independence for categorical data. In the context of text data, the chi-squared test is used to determine whether the occurrence of a word in a category differs from the total number of occurrences of that word. We then assign a score to each word, rank them in order of this score, and establish a cut-off point for the number of words we want to include in determining our classification.

When we look at the ten most informative words in the Guardian dataset, we get the following:

Word Score
cameron 443533.02
minist 259265.63
world 242961.40
win 50576.13
england 49697.83
game 39441.80
tori 36113.20
polit 27203.11
support 25723.21

These make sense. The words leader and world are likely connected to the section world news. The words cameron, minist, tori and polit likely occur in the section politics and so on.

## Classifying the Guardian Articles using Naive Bayes

Now we’re ready for the fun part! We’ll be classifying the Guardian articles using this script that I adapted from this video.2 Of course, you can also use an out-of-the-box model from e.g. the sklearn python module, but using a custom script helps to better understand what’s going on under the hood.

Firstly, it is important to consider what we need to do ‘better’ than. That is, if we picked a random label for each observation, what would our accuracy be? Secondly, we need to consider what our expected accuracy will be if we simply assigned the label of the largest group to each observation. The accuracy rates for these models would 25% and 33% respectively, so these are the models we aim to outperform.

The data for our model will be pre-processed by removing numbers, punctuation and stopwords from the text, after which the porter stemmer is applied to stem the words. We will evaluate model performance using k-fold cross-validation with $k=5$ and testing the model on the test set.

The results from the k-fold validation results in an average accuracy of 95.5%. Likewise, the model results in an overall accuracy of <95.3% on the test set. The high concordance between these two figures indicates that, despite the simplicity of the model, it correctly classifies roughly 95% of the unseen articles in the test set. As such, we are vastly outperforming the base models of 25% and 33%.

Although accuracy gives us a good overall metric of how well the model is doing, it is important to look at the performance per class. This is shown in the confusion matrix below.

Politics Sport Travel World news
Politics 140 0 1 8
Sport 2 145 1 2
Travel 1 0 141 5
World news 1 0 7 142

Here, it becomes clear that the Travel and World news classes contain higher false positives than the other classes. Overall, these results make sense. While a section like Sport is separated well from the other sections, the world_news section is less clearly bounded, and often contains news on international politics.

# Conclusion

In this post, we looked at an implementation of the Naive Bayes algorithm to classify news articles from the Guardian. Even though Naive Bayes is a simple approach to such a problem, it performs very well on the Guardian dataset. We could improve on this result by looking at e.g. different word combinations using n-grams, by using a different stemming technique, by fine-tuning parameters like the number of features we include in the model or the smoother with use in the Naive Bayes model, or by choosing a more sophisticated model. Further information on Bayes’ law and Naive Bayes can be found in the references list below.

• Schutt, Rachel, and Cathy O’Neil. Doing data science: Straight talk from the frontline. ” O’Reilly Media, Inc.”, 2013. pp. 98-104
• James, Gareth, et al. An introduction to statistical learning. New York: springer, 2013. pp. 138-142
• Zumel, Nina, John Mount, and Jim Porzak. Practical data science with R. Manning, 2014. pp. 134-138
• Machine Learning (Part 3 of 5): Naive Bayes Classifier. URL: https://www.youtube.com/watch?v=j3IGd5CjsVA
• Multinomial Naive Bayes: A worked example. URL: https://class.coursera.org/nlp/lecture/28

# Notes

1. Multinomial naive bayes classifiers differ from multivariate naive bayes classifiers in the sense that they take a different input. That is, multivariate models take binary features, whereas multinomial models take term frequencies as features. For more information, see this helpful overview

2. I highly recommend that you watch this video. It gives one of the best explanations of Naive Bayes that I have come across