# When the Predictions are more accurate than the Response

May 15, 2016
(This article was first published on sweissblaug, and kindly contributed to R-bloggers)

The method I propose is to fit a model that uses the original document-level classification as the response, and then use that model's predictions to classify the individual paragraphs. The code below suggests that this method improves paragraph-level classification accuracy.
```r
## Assume there are two classes (topics) in each document. Each paragraph has
## data (in this case a random normal draw), and the higher the value, the
## higher the probability that the paragraph is associated with a particular
## topic. The value could be thought of as a word frequency.

## Create documents: 100 documents, each consisting of 100 paragraphs
documents <- lapply(1:100, function(x) rnorm(100))

## Create a topic for each paragraph
paragraph_classes <- lapply(documents, function(x) rbinom(100, size = 1, prob = 1 / (1 + exp(-x))))

## A document consists of several paragraphs, but the document is assigned its most common topic
document_classes <- sapply(paragraph_classes, function(x) sum(x > 0)) > 50

unlisted_documents <- unlist(documents)
unlisted_class <- rep(document_classes, each = 100)

## The original correct classification rate is around 54%
table(unlist(paragraph_classes), unlisted_class)
```

```
##    unlisted_class
##     FALSE TRUE
##   0  2643 2353
##   1  2257 2747
```

```r
sum(diag(table(unlist(paragraph_classes), unlisted_class))) / length(unlisted_class)
```

```
## [1] 0.539
```
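The baseline rate of roughly 54% can also be checked without simulating. The sketch below assumes the setup used in this post: 100 paragraphs per document, each paragraph's topic marginally a fair coin flip (the normal draws are symmetric around zero) and independent of the other paragraphs. The names `n_pars`, `p_match_if_1`, `p_match_if_0` and `expected_accuracy` are illustrative only.

```r
# Sketch under the simulation's assumptions: 100 paragraphs per document,
# each topic marginally Bernoulli(0.5) and independent across paragraphs.
n_pars <- 100

# A paragraph with topic 1 matches its document label (TRUE) when at least
# 50 of the other 99 paragraphs also have topic 1.
p_match_if_1 <- pbinom(n_pars / 2 - 1, size = n_pars - 1, prob = 0.5, lower.tail = FALSE)

# A paragraph with topic 0 matches its document label (FALSE) when at most
# 50 of the other 99 paragraphs have topic 1 (ties go to FALSE).
p_match_if_0 <- pbinom(n_pars / 2, size = n_pars - 1, prob = 0.5)

# Each topic occurs with probability 0.5, so average the two cases.
expected_accuracy <- 0.5 * p_match_if_1 + 0.5 * p_match_if_0
round(expected_accuracy, 3)
```

Under these assumptions the result is about 0.54, in line with the simulated 0.539. The document label is therefore only slightly better than a coin flip as a label for any single paragraph.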
```r
## Build a model. The response variable for each paragraph is the class
## originally assigned to its document.
fit <- glm(unlisted_class ~ unlisted_documents, family = "binomial")

## Classify the paragraphs using the model's predictions on the same data.
## The classification rate is about 13 percentage points higher than with
## the original classifications.
predicted_values <- scale(predict(fit))
table(predicted_values > 0, unlist(paragraph_classes) > .5)
```

```
##
##         FALSE TRUE
##   FALSE  3386 1647
##   TRUE   1610 3357
```

```r
sum(diag(table(predicted_values > 0, unlist(paragraph_classes) > .5))) / length(unlisted_class)
```

```
## [1] 0.6743
```
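The figures above come from a single random draw. One way to check that the gain is not a fluke is to repeat the whole simulation several times and average both accuracy rates. The following is a minimal sketch, not part of the original analysis: `simulate_once` is a hypothetical helper, and the seed and the number of repetitions are arbitrary choices.

```r
# Hypothetical helper: runs the simulation once and returns both accuracy rates.
simulate_once <- function(n_docs = 100, n_pars = 100) {
  # One column per document, one row per paragraph
  x <- matrix(rnorm(n_docs * n_pars), nrow = n_pars)
  y <- matrix(rbinom(length(x), size = 1, prob = plogis(x)), nrow = n_pars)

  # Each document gets its most common topic (ties go to FALSE, as above)
  doc_label <- colSums(y) > n_pars / 2
  label <- rep(doc_label, each = n_pars)  # document label copied to its paragraphs

  x_vec <- as.vector(x)                   # column-major, so it lines up with `label`
  truth <- as.vector(y) == 1

  fit <- glm(label ~ x_vec, family = "binomial")
  lp <- predict(fit)                      # linear predictor
  pred <- lp > mean(lp)                   # same rule as scale(predict(fit)) > 0

  c(document_label = mean(label == truth),
    model          = mean(pred == truth))
}

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
results <- replicate(20, simulate_once())
rowMeans(results)
```

Because the linear predictor is centered before thresholding, the rule amounts to checking whether a paragraph's value lies above or below the mean value, with the direction set by the sign of the fitted slope. In this setup the document labels mainly contribute that sign.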
