When the Predictions are more accurate than the Response

May 15, 2016
By

(This article was first published on sweissblaug, and kindly contributed to R-bloggers)

The methodology I propose is to use the original classification to build a model and use this model to fit the paragraphs. The code below suggests this method increases classification performance.
##Assume there are two classes in each document. Each paragraph has data - in this case a random normal - and the higher the value the higher probability it is associated with a particular topic. This could be thought of as word frequency 

##create documents - each document consists of 100 paragraphs
documents=lapply(1:100, function(x) rnorm(100))

##create topics for each paragraph - 100 documents with a 100 paragraphs each
paragraph_classes=lapply(documents, function(x) rbinom(100,size=1,prob=1/(1+exp(-x))))

## a document consists of several paragraphs but the document assigned the most common topic
document_classes=(sapply(paragraph_classes, function(x) sum(x>0) )>50)

unlisted_documents=unlist(documents)
unlisted_class=rep(document_classes, each=100)

##the original correct classficiation rate is around 54%
table(unlist(paragraph_classes),unlisted_class)
##    unlisted_class
## FALSE TRUE
## 0 2643 2353
## 1 2257 2747
sum(diag(table(unlist(paragraph_classes),unlisted_class)))/length(unlisted_class)
## [1] 0.539
##build a model. each paragraph response variable is the original class assigned
glm=glm(unlisted_class~unlisted_documents, family="binomial")


#the predicted classification using model on data - can see classification results are over 14% greater than original classficiations
predicted_values=scale(predict(glm))
table(predicted_values>0,unlist(paragraph_classes)>.5)
##        
## FALSE TRUE
## FALSE 3386 1647
## TRUE 1610 3357
sum(diag(table(predicted_values>0,unlist(paragraph_classes)>.5)))/length(unlisted_class)
## [1] 0.6743

To leave a comment for the author, please follow the link and comment on their blog: sweissblaug.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)