A while back, I adapted the Naive Bayes model used in this excellent video to classify text documents in Python. Classifying text documents is a fairly straightforward task which is fun to do, and so I wanted to test it out by classifying articles from the Guardian belonging to four categories. In the rest of this post, I’m going to outline some theory on probabilities and Bayes’ Law. Then, we’ll preprocess the Guardian news articles and attempt to classify as many of them to the correct section.
Probabilities and Bayes’ law
In order to properly understand Bayes’ theorem, we need to wrap our heads around conditional probabilities. I will first go over an example where the probabilities are independent.
Suppose that we throw a die. The probability of getting any number is as there are 6 numbers on the die which all have an equal probability of showing up. Now suppose we throw the same die twice. The probability of, say, getting a value of 6 on both throws is . The reason we are allowed to calculate the joint probability of x and y is because these events are independent. That is, the result of the first throw will not affect the result of the second throw, and we can calculate the joint probability by using .
Conditional probabilities concern events that are not independent of one another. Simply put, the conditional probability of an event x occurring is the probability that x will occur given that we already know some other event, y, has occurred. For events that are not independent, we can calculate the probability that both events happened by using:
.
We can then get to conditional probabilities by dividing this formula by such that we get
.
How does all of this relate to Bayes’ Law? Well, the cool thing about Bayes’ law is that it offers us a way to estimate given that we have empirical data on . Consider the convenience of this approach through the following (archetypical) example. Suppose that we want to build a spam filter for an email client. In this case, we have two classes: spam and not spam (or ‘ham’, as it is often referred to in text books). We want to classify each incoming email as either one of these classes by looking closely at each word used in the email. The probability that we are interested in is , or “what is the probability of this word occurring in an email given that this email is spam”.
The problem with this approach is that, in order for this to work properly, we need data on all spam and nonspam emails. Clearly, this is infeasible. Bayes’ law offers us a way to turn the logic around, such that it allows us to predict , or “what is the probability of this email being spam, given that I know that it contains this word”. The goal is to take these probabilities and extrapolate them to new data.
If we return to the section above on conditional probabilities for a moment, we can deduce that there is a relationship between the joint probabilities of events x and y and the conditional probabilities and , such that:
.
Assuming that , we solve for to get to Bayes’ Law
In the formula above, the lefthand side is referred to as the posterior probability. This is the probability that an observation belongs to class given a number of classes. On the right side of the equation, is referred to as the likelihood of an observation belonging to a class . The likelihood is usually estimated as some density function that conveys how “likely” it is for a given observation to belong to a class.
and are referred to as ‘priors’. Essentially, priors can be treated as prior beliefs about the data, and ‘weight’ the data according to some belief we hold about their distribution. In the context of the Naive Bayes approach (see below), is the probability of seeing a randomly chosen observation belonging to the class. So if we have a dataset with 50.000 observations of which 10.000 belong to class and 40.000 observations belong to class , the priors are . is regarded as a ‘scaling factor’.
From Bayes’ law to the Naive Bayes Algorithm
Given a modeling problem, there are many ways to estimate the different elements of Bayes’ law. Naive Bayes is often used for highdimensional datasets. This makes Naive Bayes applicable for a wide variety of problems, such as spam filters.
Naive Bayes is called ‘naive’ because it treats each of its inputs as independent. Especially in the case of text data, this is an erroneous assumption, as textual features are often tied to one another in various ways. Nonetheless, the Naive Bayes classifier usually achieves very good results on text data (as we will see).
In the context of a multinomial Naive Bayes model using a bagofwords approach,^{1} the priors are defined as the number of observations in each class ( in the formula above) and the number of observations for each word over all classes ( in the formula above). The likelihood is defined as the number of times a word occurs in each class.
As an example, consider the following five documents:
Document  w1  w2  w3  section 

doc1  trump  running  president  national 
doc2  abe  army  japan  world 
doc3  trump  japan  army  national 
doc4  china  trump  president  national 
test  trump  china  army  ??? 
We want to calculate the section of the fifth document (the ‘test’ observation) by using documents one to four. The prior probabilities are and . The likelihood for each word given its class is:
Word_section  Likelihood 

trump_national  0.3333333 
running_national  0.1111111 
president_national  0.2222222 
abe_national  0.0000010 
japan_national  0.1111111 
army_national  0.1111111 
china_national  0.1111111 
china_world  0.0000010 
trump_world  0.0000010 
running_world  0.0000010 
president_world  0.0000010 
abe_world  0.3333333 
army_world  0.3333333 
japan_world  0.3333333 
We can now calculate the posterior probability for both sections by using:
Given the words we see in document five, we calculate the probability that the documents belongs to each class, which amounts to and . We then determine for which class the posterior probability is maximized, and that is the class to which we assign the document. In this case, we would assign document five to the class ‘National’.
As the denominator is the same for all classes in the formula above, it is common to leave it out and to calculate the logodds of the data by taking the log of both sides. This also means that instead of multiplying the probabilities, we will be simply calculating the sum of the logodds. To ensure we never take the log of zero, we add a ‘smoother’ (usually a small number).
The data
As an example of how Naive Bayes works, we will now classify 5.966 articles belonging to four sections of the Guardian. The data we will be using for classification is scraped using the Guardian API. You can find the script to download articles on GitHub. The goal is to correctly classify as many articles to their category by using the article title and body.
Section  Number.obs  Proportion 

Politics  1493  0.250 
Sport  1504  0.252 
Travel  1472  0.247 
World news  1497  0.251 
The data is split into a training set (90%), and a test set (10%) to evaluate the performance of the classifier.
Preprocessing: cleaning text data
There is an entire field dedicated to the use of text as data. In this context, our preprocessing steps are relatively easy:
 Clean data – remove whitespaces, punctuation, numbers etc.
 Remove stopwords – remove words like for, if, and etc.
 Stem data – try to find as many overlap between different variations of words.
 Feature selection – avoid overfitting by selecting most important features.
Cleaning data, removing stopwords and stemming
With text analysis, we try to remove as much ‘noise’ as we can from the data. This means that words like play.
and Played
should (in the general case) be counted as the same word. However, play.
contains a period and Played
contains a capital letter. As such, they will be treated as separate words. By cleaning text, we can remove punctuation, capitals, numbers and whitespaces such that we get play
and played
.
Although cleaning text will improve our features a lot, the words play
and played
will still be categorized as different words. In order to get around this problem, we stem data. Stemming attempts to capture the ‘root’ of each word, such that e.g. ‘love’ and ‘loved’ both become ‘lov’. An alternative approach to stemming is to ‘lemmatize’ the data. Lemmatization tries to determine the ‘positionofspeech’ of each word in a sentence. Using the information on the word position, lemmatization can then determine the proper base of the word.
Let’s consider an example where we clean a sentence, remove stopwords and stem the text:
“The president said voters should keep pressure on members of Congress to reach across party lines and ensure initiatives in his budget are enacted”
First, we clean the text of capital letters, punctuation and remove stopwords. This gives us:
“president said voters keep pressure members congress reach across party lines ensure initiatives budget enacted”
Next, we stem the text:
“presid said voter keep pressur member congress reach across parti line ensur initi budget enact”
Note that if we use POS tags, we don’t need to remove stopwords. Nor do we need to convert capital letters. Doing so would remove a lot of valuable information (e.g. the difference between “Bill” and “bill”). We can always filter a sentence for POS tags that aren’t nouns or verbs, and apply stopword removal after lemmatizing the text to filter out common words like ‘be’:
“TheDT presidentNN saidVBD votersNNS shouldMD keepVB pressureNN onIN membersNNS ofIN CongressNNP toTO reachVB acrossRP partyNN linesNNS andCC ensureNN initiativesNNS inIN hisPRP$ budgetNN areVBP enactedVBN”
If we then lemmatize this sentence, we end up with:
“The president say voter should keep pressure on member of Congress to reach across party line and ensure initiative in his budget be enact”
Or, if we choose to keep only nouns, verbs and remove stopwords, we get:
“president say voter keep pressure member Congress reach party line ensure initiative budget enact”
The choice between stemming and lemmatizing can be computational. Stemming is resourcefriendly, while lemmatizing can be computationally intensive. This choice is further informed by the modeling technique. Recall that Naive Bayes treats all features as independent. As such, taking things like POS into account may not help much in providing higher accuracy, since the model doesn’t take this information into account. Models like Support Vector Machines, on the other hand, are better suited to process this information.
Feature selection
When working with text data (like any other highdimensional data), it is a good idea to remove sparse features (or variables). These are features which don’t add a lot of new information to the model, and may lead to overfitting. As an example, consider taking 20.000 news articles from the New York Times API and 4.000 articles from the Guardian API. Your goal is to train a model on the NYT articles and predict one of four categories of the Guardian articles. If you use all features that are present in the NYT articles, you end up with a lot of noise that may not be useful – or could be counterinformative – when predicting the section of the Guardian articles.
There are several ways to perform feature selection. A common method is to look at term frequency, establish a threshold (common values are 5 or 10), and remove all features that fall below that threshold. An extension of this approach is termfrequency inverse document frequency.
One method that is very applicable to the ‘bagofwords’ approach is to perform a chisquared test. In essence, the chisquare test is a nonparametric test of independence for categorical data. In the context of text data, the chisquared test is used to determine whether the occurrence of a word in a category differs from the total number of occurrences of that word. We then assign a score to each word, rank them in order of this score, and establish a cutoff point for the number of words we want to include in determining our classification.
When we look at the ten most informative words in the Guardian dataset, we get the following:
Word  Score 

leader  3203102.22 
cameron  443533.02 
minist  259265.63 
world  242961.40 
win  50576.13 
england  49697.83 
game  39441.80 
tori  36113.20 
polit  27203.11 
support  25723.21 
These make sense. The words leader
and world
are likely connected to the section world news
. The words cameron
, minist
, tori
and polit
likely occur in the section politics
and so on.
Classifying the Guardian Articles using Naive Bayes
Now we’re ready for the fun part! We’ll be classifying the Guardian articles using this script that I adapted from this video.^{2} Of course, you can also use an outofthebox model from e.g. the sklearn python module, but using a custom script helps to better understand what’s going on under the hood.
Firstly, it is important to consider what we need to do ‘better’ than. That is, if we picked a random label for each observation, what would our accuracy be? Secondly, we need to consider what our expected accuracy will be if we simply assigned the label of the largest group to each observation. The accuracy rates for these models would 25% and 33% respectively, so these are the models we aim to outperform.
The data for our model will be preprocessed by removing numbers, punctuation and stopwords from the text, after which the porter stemmer is applied to stem the words. We will evaluate model performance using kfold crossvalidation with and testing the model on the test set.
The results from the kfold validation results in an average accuracy of 95.5%. Likewise, the model results in an overall accuracy of <95.3% on the test set. The high concordance between these two figures indicates that, despite the simplicity of the model, it correctly classifies roughly 95% of the unseen articles in the test set. As such, we are vastly outperforming the base models of 25% and 33%.
Although accuracy gives us a good overall metric of how well the model is doing, it is important to look at the performance per class. This is shown in the confusion matrix below.
Politics  Sport  Travel  World news  

Politics  140  0  1  8 
Sport  2  145  1  2 
Travel  1  0  141  5 
World news  1  0  7  142 
Here, it becomes clear that the Travel
and World news
classes contain higher false positives than the other classes. Overall, these results make sense. While a section like Sport
is separated well from the other sections, the world_news
section is less clearly bounded, and often contains news on international politics.
Conclusion
In this post, we looked at an implementation of the Naive Bayes algorithm to classify news articles from the Guardian. Even though Naive Bayes is a simple approach to such a problem, it performs very well on the Guardian dataset. We could improve on this result by looking at e.g. different word combinations using ngrams, by using a different stemming technique, by finetuning parameters like the number of features we include in the model or the smoother with use in the Naive Bayes model, or by choosing a more sophisticated model. Further information on Bayes’ law and Naive Bayes can be found in the references list below.
Further reading
 Schutt, Rachel, and Cathy O’Neil. Doing data science: Straight talk from the frontline. ” O’Reilly Media, Inc.”, 2013. pp. 98104
 James, Gareth, et al. An introduction to statistical learning. New York: springer, 2013. pp. 138142
 Zumel, Nina, John Mount, and Jim Porzak. Practical data science with R. Manning, 2014. pp. 134138
 Machine Learning (Part 3 of 5): Naive Bayes Classifier. URL: https://www.youtube.com/watch?v=j3IGd5CjsVA
 Multinomial Naive Bayes: A worked example. URL: https://class.coursera.org/nlp/lecture/28
Notes

Multinomial naive bayes classifiers differ from multivariate naive bayes classifiers in the sense that they take a different input. That is, multivariate models take binary features, whereas multinomial models take term frequencies as features. For more information, see this helpful overview ↩

I highly recommend that you watch this video. It gives one of the best explanations of Naive Bayes that I have come across ↩
Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...