
When the Predictions are more accurate than the Response

[This article was first published on sweissblaug, and kindly contributed to R-bloggers].

The purpose of this post is to discuss a method for classifying the individual paragraphs of documents when each document has been assigned only one topic.

In my usage, a document is an article that is assigned a Section (Business, Economics / Finance, Science, Europe, Middle East, etc.). Of course, each article can discuss multiple topics. For example, an article on Ukraine today might discuss the economics of its exchange rate along with the European Union and the Russian military. I’m assuming each paragraph discusses one particular topic, and I’m interested in which paragraphs discuss economics.


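The methodology I propose is to fit a model that uses each document's original classification as the response for all of its paragraphs, and then to classify the paragraphs with that model's predictions. The simulation below suggests that these predictions match the true paragraph topics more often than the document labels themselves do.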
##Assume there are two topics (classes). Each paragraph has one feature - in this case a random normal draw - and the higher its value, the higher the probability that the paragraph belongs to a particular topic. The feature could be thought of as a word frequency.

##create documents - each document consists of 100 paragraphs 
documents=lapply(1:100, function(x) rnorm(100))

##create topics for each paragraph - 100 documents with 100 paragraphs each
paragraph_classes=lapply(documents, function(x) rbinom(100,size=1,prob=1/(1+exp(-x))))

## a document consists of several paragraphs, but the document is assigned the most common topic among them
document_classes=(sapply(paragraph_classes, function(x) sum(x>0) )>50)

unlisted_documents=unlist(documents)
unlisted_class=rep(document_classes, each=100)

##the original document labels match the true paragraph topic only about 54% of the time
table(unlist(paragraph_classes),unlisted_class)
##    unlisted_class
##     FALSE TRUE
##   0  2643 2353
##   1  2257 2747
sum(diag(table(unlist(paragraph_classes),unlisted_class)))/length(unlisted_class)
## [1] 0.539
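##build a model. each paragraph's response variable is the original class assigned to its document
##(the fit is stored as `fit` rather than `glm` so it does not share a name with the glm() function)
fit=glm(unlisted_class~unlisted_documents, family="binomial")


##classify each paragraph with the model: centre and scale the linear predictor, then use 0 as the cut-off
##the correct classification rate is about 13 percentage points higher than with the original classifications
predicted_values=scale(predict(fit))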
table(predicted_values>0,unlist(paragraph_classes)>.5)
##        
##         FALSE TRUE
##   FALSE  3386 1647
##   TRUE   1610 3357
sum(diag(table(predicted_values>0,unlist(paragraph_classes)>.5)))/length(unlisted_class)
## [1] 0.6743
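
The run above is unseeded, so the exact counts change from run to run. As a rough check that the gain is not an accident of one draw, the simulation can be wrapped in a function and repeated under several seeds. This is a minimal sketch, not the original code: the helper name `simulate_once`, its parameters `n_docs` and `n_pars`, and the choice of seeds are all assumptions made for illustration. The last lines compute, by numerical integration, the accuracy of the rule "predict topic 1 when the feature is positive", which is the best any classifier can do under this data-generating process and gives a ceiling to compare against.

```r
# Hypothetical helper (not in the original post): one seeded run of the simulation.
# n_docs and n_pars are illustrative parameters; the post uses 100 and 100.
simulate_once <- function(seed, n_docs = 100, n_pars = 100) {
  set.seed(seed)

  # One feature per paragraph, as in the post
  x <- rnorm(n_docs * n_pars)

  # True paragraph topic: Bernoulli draw with probability plogis(x) = 1/(1+exp(-x))
  paragraph_class <- rbinom(length(x), size = 1, prob = plogis(x))

  # Document label: TRUE when more than half of its paragraphs are topic 1.
  # Each column of the matrix holds the paragraphs of one document.
  doc_class <- colSums(matrix(paragraph_class, nrow = n_pars)) > n_pars / 2
  label <- rep(doc_class, each = n_pars)

  # Fit the paragraph feature against the (noisy) document label
  fit <- glm(label ~ x, family = binomial)

  # Same thresholding as the post: centred linear predictor above zero
  pred <- as.vector(scale(predict(fit))) > 0

  truth <- paragraph_class == 1
  c(label_accuracy = mean(label == truth),
    model_accuracy = mean(pred == truth))
}

# Repeat under a few arbitrary seeds and average the two accuracies
results <- sapply(1:5, simulate_once)
print(rowMeans(results))

# Ceiling for this setup: accuracy of the rule "topic 1 if x > 0",
# i.e. the expected value of max(p, 1 - p) with p = plogis(x) and x ~ N(0, 1)
bayes_accuracy <- integrate(
  function(x) pmax(plogis(x), 1 - plogis(x)) * dnorm(x),
  lower = -Inf, upper = Inf
)$value
print(bayes_accuracy)
```

One way to read the comparison: once the linear predictor is centred, thresholding it at zero is the same as asking whether a paragraph's feature is above the sample mean (provided the fitted slope is positive). The document labels are only needed to learn the direction of the relationship, which is why the predictions can end up closer to the true paragraph topics than the labels they were trained on. The printed ceiling can be compared with the 0.6743 reported above.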
