genre, text classification & naive bayes
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In this short post, we outline a Naive Bayes (NB) approach to genre-based text classification. First, we introduce & describe a corpus derived from Google News’ RSS feed, which includes source and genre information. We then train, test & evaluate the efficacy of an NB classifier applied to online news genres, with some fairly nice results. Here, we focus on the nuts/bolts of an R-based workflow, and leave discussion of theory & Bayesian assumptions for another day.
library(e1071)
library(caret)
library(corpuslingr)
library(tidyverse)
library(knitr)
#library(quicknews)#devtools::install_github("jaytimm/quicknews")
Building a historical, genre-based corpus
For demonstration purposes, I have built a fairly small corpus comprised of national news articles from Google News’ RSS feed. The corpus was built using my quicknews package, and assembled over the course of roughly two weeks (10/29/18 ~ 11/15/18).
Metadata were collected and articles scraped/annotated/aggregated to BOWs three times a day using the Windows task scheduler app. The R script used for corpus assembly (which should scale quite nicely to different/novel search types) is available here.
Metadata include the genre (as defined by Google News) & domain name of article source; they also include the full/raw text of each article. The script above updates corpus (as a collection of BOWs) and metadata (as TIF) each time it is called by the task scheduler.
setwd("C:\\Users\\jason\\Google Drive\\GitHub\\packages\\quicknews\\data-raw")
qnews_tif <- readRDS('qnews_eg_tif.rds')
qnews_bow <- readRDS('qnews_eg_corpus.rds')%>%
  mutate(doc_id = as.integer(doc_id)) %>%
  left_join(qnews_tif %>% select(doc_id, search))
The table below summarizes the composition of our corpus in terms of genre. So, in a little over two weeks, we have assembled a ~2.8K text corpus comprised of ~1.8 million words. And fairly balanced as well from a genre perspective.
qnews_bow %>%
  filter(pos != 'PUNCT') %>%
  group_by(search) %>%
  summarize(tokens = sum(txtf),
            texts = length(unique(doc_id))) %>%
  bind_rows(totals) %>%
  DT::datatable(options = list(pageLength = 8,dom = 't', scrollX = TRUE),
              rownames = FALSE, width="450", escape=FALSE)
The plot below illustrates the growth of our corpus (by genre) over time.
library(ggthemes) qnews_bow %>% filter(pos != 'PUNCT') %>% left_join(qnews_tif) %>% group_by(date, search) %>% summarize(tokens = sum(txtf)) %>% group_by(search) %>% mutate(cum_tok = cumsum(tokens))%>% filter(tokens > 350) %>% ggplot(aes(x=date, y=cum_tok, fill = search)) + geom_area(alpha = 0.75, color = 'gray') + ggthemes::scale_fill_economist()+ theme(legend.position = "bottom")+ scale_y_continuous(labels = function(x) paste0(format(x/1000000), ' mil')) + labs(title = "Composition of corpus (in tokens) over time")

Lastly, and largely for good measure, we take a quick look at corpus composition in terms of article sources. The plot below summarizes the top content generators within each genre as measured by article counts.
qnews_tif %>% 
  group_by(search, source) %>% 
  summarize(count = n()) %>%
  arrange(search,(count))%>%
  top_n(n=7,wt=jitter(count))%>%
  ungroup()%>%
#Hack1 to sort order within facet
  mutate(order = row_number(), 
         source=factor(paste(order,source,sep="_"), 
                      levels = paste(order, source, sep = "_")))%>%
  ggplot(aes(x=source, 
             y=count, 
             fill=search)) + 
  geom_col(show.legend = FALSE) +  
  facet_wrap(~search, scales = "free_y", ncol = 2) +
#Hack2 to modify labels
  scale_x_discrete(labels = function(x) gsub("^.*_", "", x))+
  ggthemes::theme_fivethirtyeight()+ 
  ggthemes::scale_fill_economist() +
  theme(plot.title = element_text(size=12))+ 
  coord_flip()+
  labs(title="Most frequent domains by Google News search topic")

Building a Naive Bayes classifier
The table below illustrates the structure of our genre-based corpus (post some removal of stop words & other less informative lemma types). The corpus, then, represents each constituent article as a bag-of-words (BOW).
corpus_as_bow <- qnews_bow %>%  
  filter(pos %in% c('VERB', 'ADV','NOUN', 'ADJ', 'PROPN') & 
           !lemma %in% c(toupper(corpuslingr::clr_ref_stops),'') & 
           grepl('^[A-Z]', lemma)) 
corpus_as_bow %>%
  head() %>%
  DT::datatable(options = list(pageLength = 8,dom = 't', scrollX = TRUE),
              rownames = FALSE, width="450", escape=FALSE)
To build a Naive Bayes classifier, we first need to transform our BOW corpus into a document-term matrix. The two structures are functionally equivalent; the latter is simply a “wide” rendition of the former.
In addition to this basic transformation, the pipe below (1) excludes superfrequent/ infrequent terms from the matrix based on document frequencies, and (2) normalizes each document vector to account for variation in document length (— smarter approaches exist).
set.seed(99) corpus_as_dtm <- corpus_as_bow %>% group_by(doc_id, search, lemma) %>% summarize(n=n()) %>% group_by(lemma) %>% mutate(docf = length(unique(doc_id))) %>% ungroup() %>% mutate(docf = docf/length(unique(doc_id))) %>% filter(docf > .02 & docf < .35) %>% #Filter infrequent/super-frequent. select(-docf) %>% group_by(doc_id) %>% mutate(n = n/sqrt(sum(n^2))) %>% #Doc freqs as unit vector spread(lemma,n) %>% replace(., is.na(.), 0)%>% ungroup() %>% mutate(search = as.factor(search))
Using the caret package, then, we divide the above matrix into a training set (as 70% of full data set) and a test set (as 30%). The createDataPartition function conveniently creates two equally proportioned samples.
set.seed(99) trainIndex <- caret::createDataPartition(corpus_as_dtm$search, p=0.7)$Resample1 train_data <- corpus_as_dtm[trainIndex, ] test_data <- corpus_as_dtm[-trainIndex, ] #Demo distributions.
With the naiveBayes function from the e1071 package, we build our Naive Bayes classifier based on the training portion of the document-term matrix.
classifier <- e1071::naiveBayes(
  as.matrix(train_data[,3:ncol(train_data)]),
                    train_data$search,
                    laplace = 0.5) 
Then we implement the classifier on the test portion of the document-term matrix.
test_predicted <- 
  predict(classifier,
          as.matrix(test_data[,3:ncol(test_data)]))
Output contains a vector of genre predictions for each text in the test data set. eg:
head(test_predicted) ## [1] topic_nation topic_nation topic_business topic_health ## [5] topic_business topic_health ## 7 Levels: topic_business topic_entertainment ... topic_world
Model assessment & confusion matrix
So, to get a sense of classifier efficacy in identifying the genre of a given article posted on Google News, we calculate a cross-tab of observed & predicted genres via the confusionMatrix function from caret.
cfm <- caret::confusionMatrix(data = test_predicted,
                              test_data$search)
Overall fitness statistics of our model can be accessed via the overall element from the list of outputs generated by the confusionMatrix function. So, classifier accuracy is quite good at ~ 80%. I am sure this is not at gold standard levels; however, seemingly alright for a simple lemma-based approach to text classification.
cfm$overall ## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull ## 0.81084337 0.77858092 0.78251138 0.83694717 0.16144578 ## AccuracyPValue McnemarPValue ## 0.00000000 0.02121729
Lastly, we visualize the confusion matrix based on the table element of output as a tile plot below. Tiles in the most prominent shade of blue reflect correct classifications.
ggplot(data = as.data.frame(cfm$table) ,
           aes(x = Reference, y = Prediction)) +
    geom_tile(aes(fill = log(Freq)), colour = "white") +
    scale_fill_gradient(low = "white", high = "steelblue") +
    geom_text(aes(x = Reference, y = Prediction, label = Freq)) +
    theme(legend.position = "none",
          axis.text.x=element_text(angle=45,hjust=1)) + 
    labs(title="Confusion Matrix")

Output from the confusionMatrix function also includes a host of model diagnostics. The table below summarizes some of the more common diagnostics by genre, including sensitivity (1 - proportion of false negatives), specificity (1 - proportion of false positives), and the average of the two, balanced accuracy.
cfm$byClass %>% data.frame() %>%
  select (Sensitivity, Specificity, Balanced.Accuracy) %>%
  rownames_to_column(var = 'topic') %>%
  mutate(topic = gsub('Class: ','', topic)) %>%
  mutate_if(is.numeric, round, 2) %>%
  DT::datatable(options = list(pageLength = 7,dom = 't', scrollX = TRUE),
              rownames = FALSE, width="450", escape=FALSE) 
Summary
So, a super-quick code-through for building a fairly simple Naive Bayes classifier for genre-based text classification. Largely an excuse on my end to collate some thoughts & resources, and to have a resource to point to (which seems to be lacking re NB text classification in R). Cheers.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
