Introduction to R for Data Science :: Session 8 [Intro to Text Mining in R, ML Estimation + Binomial Logistic Regression]

June 21, 2016
By

(This article was first published on The Exactness of Mind, and kindly contributed to R-bloggers)

Welcome to Introduction to R for Data Science, Session 8: Intro to Text Mining in R, ML Estimation + Binomial Logistic Regression [Web-scraping with tm.plugin.webmining. The tm package corpora structures: assessing document metadata and content. Typical corpus transformations and Term-Document Matrix production. A simple binomial regression model with tf-idf scores as features and its shortcommings due to sparse data. Reminder: Maximum Likelihood Estimation with Nelder-Mead from optim().]

The course is co-organized by Data Science Serbia and Startit. You will find all course material (R scripts, data sets, SlideShare presentations, readings) on these pages.

Check out the Course Overview to acess the learning material presented thus far.

Data Science Serbia Course Pages [in Serbian]

Startit Course Pages [in Serbian]

image

Lecturers

Summary of Session 8, 17. June 2016 :: Intro to Text Mining in R + Binomial Logistic Regression.

Intro to Text Mining in R + Binomial Logistic Regression. Intro to Text Mining in R + Binomiral Logistic Regression: Web-scraping with tm.plugin.webmining. The tm package corpora structures: assessing document metadata and content. Typical corpus transformations and Term-Document Matrix production. A simple binomial regression model with tf-idf scores as features and its shortcommings due to sparse data. Reminder: Maximum Likelihood Estimation with Nelder-Mead from optim().

Intro to R for Data Science SlideShare :: Session 8

Introduction to R for Data Science :: Session 8 [Intro to Text Mining in R, ML Estimation + Binomial Logistic Regression] from Goran S. Milovanovic

R script :: Session 8

########################################################
# Introduction to R for Data Science
# SESSION 8 :: 16 June, 2016
# Binomiral Logistic Regression + Intro to Text Mining in R
# Data Science Community Serbia + Startit
# :: Goran S. Milovanović and Branko Kovač ::
########################################################
 
# clear
rm(list=ls())
 
# libraries
library(tm)
library(tm.plugin.webmining)
 
#### NOTE
#### Suggestion: Skip WebCorpus() calls from the tm.plugin.webmining
#### (skip everything before: # START HERE load)
#### Data set is available from GitHub :: https://github.com/GoranMilovanovic/IntroRDataScience
#### File: Session8.RData
#### Download link: https://goo.gl/cgJ3J3
 
#### Part I Information Retrieval: ICT market
# 2 categories: dotcom vs hardware companies
# dotCom category:
# NASDAQ:GOOGL is Alphabet Inc, NASDAQ:AMZN is Amazon, NASDAQ:JD is JD.com, NASDAQ:FB is Facebook,
# NYSE:BABA is Alibaba
# hardware category:
# NYSE:HPQ is HP, NASDAQ:AAPL is Apple, KRX:005930 is Samsung Electronics, TPE:2354 is Foxconn,
# NYSE:IBM is IBM
 
# source: Google Finance
# search queries: .com vs. hardware companies
searchQueries <- list(c('NASDAQ:GOOGL','NASDAQ:AMZN','NASDAQ:JD','NASDAQ:FB','NYSE:BABA'),
                      c('NYSE:HPQ','NASDAQ:AAPL','KRX:005930','TPE:2354','NYSE:IBM'));
 
# retrive w. tm.plugin.webmining 
# retrieve news for dotCom
dotCom <- lapply(searchQueries[[1]], function(x) {
                 WebCorpus(GoogleFinanceSource(x))
});
# now retrieve news for hardware companies
hardware <- lapply(searchQueries[[2]], function(x) {
  WebCorpus(GoogleFinanceSource(x))
});
 
# source: Google News
searchQueriesNews <- list(c("Google", "Amazon", "JD.com", "Facebook", "Alibaba"), 
                       c("Hewlett Packard", "Apple", "Samsung", "Foxconn", "IBM"))
 
# retrieve news for dotCom companies 
dotComNews <- lapply(searchQueriesNews[[1]], function(x) {
  googleNewsSRC <- GoogleNewsSource(x,
                                    params = list(hl = "en",
                                                  q = x,
                                                  ie = "utf-8",
                                                  num = 30,
                                                  output = "rss"))
 
  WebCorpus(googleNewsSRC)
});
 
# retrieve news for hardware companies
hardwareNews <- lapply(searchQueriesNews[[2]], function(x) {
  googleNewsSRC <- GoogleNewsSource(x,
                                    params = list(hl = "en",
                                                  q = x,
                                                  ie = "utf-8",
                                                  num = 30,
                                                  output = "rss"))
 
  WebCorpus(googleNewsSRC)
});
 
# save workspace; NOTE: tm has functions to save corpora
sessionDir <- 
  "YOUR WORKING DIRECTORY HERE"
setwd(sessionDir)
save.image(file="Session8.RData")
 
# START HERE load
sessionDir <- 
  "YOUR WORKING DIRECTORY HERE"
setwd(sessionDir)
load(file="Session8.RData")
 
# tm corpora are lists
 
# a tm corpus (WebCorpus, more specifically)
dotCom[[1]]
# a PlaintTextDocument in the first dotCom corpus
class(dotCom[[1]][[1]])
dotCom[[1]][[1]]
# document metadata
dotCom[[1]][[1]]$meta
# document content
dotCom[[1]][[1]]$content
 
# let's add another tag the document metadata structure: dotCom
dotCom <- lapply(dotCom, function(x) {
  x <- tm_map(x, function(doc) { # tm_map works over tm corpora, similarly to lapply
    meta(doc, "category") <- "dotCom"
    return(doc)
    })
})
dotCom[[1]][[1]]$meta
 
# add a category tag to the document metadata structure: dotComNews
dotComNews <- lapply(dotComNews, function(x) {
  x <- tm_map(x, function(doc) {
    meta(doc, "category") <- "dotCom"
    return(doc)
  })
})
dotComNews[[1]][[1]]$meta
 
# add a category tag to the document metadata structure: hardware
hardware <- lapply(hardware, function(x) {
  x <- tm_map(x, function(doc) {
    meta(doc, "category") <- "hardware"
    return(doc)
  })
})
hardware[[1]][[1]]$meta
 
# add a category tag to the document metadata structure: hardwareNews
hardwareNews <- lapply(hardwareNews, function(x) {
  x <- tm_map(x, function(doc) {
    meta(doc, "category") <- "hardware"
    return(doc)
  })
})
hardwareNews[[1]][[1]]$meta
 
# combine corpora w. do.call() and c()
dotCom <- do.call(c, dotCom) # do.call comes handy; similar to lapply
# Learn more about do.call: http://www.stat.berkeley.edu/~s133/Docall.html
hardware <- do.call(c, hardware)
dotComNews <- do.call(c, dotComNews)
hardwareNews <- do.call(c, hardwareNews)
workCorpus <- c(dotCom, dotComNews, hardware, hardwareNews)
 
# are there any duplicates?
urls <- as.character(meta(workCorpus,tag="origin"))
uurls <- unique(urls)
w <- unlist(lapply(uurls, function(x) {which(urls %in% x)[1]}))
workCorpus <- workCorpus[w]
 
# empty docs?
docLengths <- as.numeric(lapply(workCorpus, function(x) {nchar(x$content)}))
w <- which(docLengths<10)
workCorpus <- workCorpus[-w]
 
#### Part II Text Preprocessing
 
### text pre-processing
workCorpus[[1]]$content # we need to clean-up the docs
# reminder: https://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html
# Regex in R
removeSpecial <- function(x) {
  # replacing w. space might turn out to be handy
  x$content <- gsub("\\t|\\r|\\n"," ",x$content)
  return(x)
}
# example: 
cleanDoc <- removeSpecial(workCorpus[[1]])
cleanDoc$content
# removeSpecial with tm_map {tm}
workCorpus <- tm_map(workCorpus, removeSpecial)
workCorpus[[1]]$content
 
#### {tm} by the book text pre-processing
 
# remove punctuation
workCorpus <- tm_map(workCorpus, removePunctuation) # built in
workCorpus[[1]]$content
 
# remove numbers
workCorpus <- tm_map(workCorpus, removeNumbers) # built in
workCorpus[[1]]$content
 
# all characters tolower
toLower <- function(x) {
  x$content <- tolower(x$content)
  return(x)
}
workCorpus <- tm_map(workCorpus, toLower)
workCorpus[[1]]$content
 
# remove stop words
workCorpus <- tm_map(workCorpus, removeWords, stopwords("english")) # built in
workCorpus[[1]]$content
 
# stemming
# see: http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
library(SnowballC) # nice stemming algorithm (Porter, 1980)
# see: https://cran.r-project.org/web/packages/SnowballC/index.html
workCorpus <- tm_map(workCorpus, stemDocument)
workCorpus[[1]]$content
 
# remove potentially dangerous predictors: company names
# Why? Why? (In a nutshell: we are trying to discover something here,
# not to 'confirm' our potentially redundant knowledge)
predWords <- tolower(c("Alphabet", "Google", "Amazon", "JD.com", "Facebook", "Alibaba", 
                       "HP", "Hewlett Packard", "Apple", "Samsung", "Foxconn", "IBM"))
predWordsRegex <- paste(stemDocument(predWords),collapse="|");
replacePreds <- function(x) {
  x$content <- gsub(predWordsRegex,
                    " ", 
                    x$content)
  return(x)
}
workCorpus <- tm_map(workCorpus, replacePreds)
workCorpus[[1]]$content
 
# strip whitespace
workCorpus <- tm_map(workCorpus, stripWhitespace)
workCorpus[[1]]$content
 
#### Part III Feature Selection: Term-Document Frequency Matrix (TDM)
 
# TDM
# rows = terms; columns = docs
# we will use the td-idf weighting
# see: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
dT <- TermDocumentMatrix(workCorpus,
                              control = list(tolower = FALSE,
                                             wordLengths <- c(3, Inf),
                                             weighting = weightTfIdf)
)
 
#  dT is a *sparse matrix*; maybe you want to learn more about the R {slam} package
# https://cran.r-project.org/web/packages/slam/index.html
class(dT)
dT$i # rows w. non-zero entries
dT$j # columns w. non-zero entries
dT$v # non-zero entry in [i,j]
dT$nrow # "true" row dimension
dT$ncol # "true" column dimension
dT$dimnames$Terms # self-explanatory
dT$dimnames$Docs # self-explanatory
 
# remove sparse terms
docTerm <- removeSparseTerms(dT, sparse = .75) # sparse is in (0,1]
dim(docTerm)
docTerm$dimnames$Terms
# see: [1] http://www.inside-r.org/packages/cran/tm/docs/removeSparseTerms
# A term-document matrix where those terms from x are removed which have 
# at least a sparse percentage of empty (i.e., terms occurring 0 times in a document) 
# elements. [1]
# NOTE: **very important**; in a real-world application
# you would probably need to run many models with features obtained from various levels of sparcity
# and perform model selection.
 
# as.matrix
docTerm <- as.matrix(docTerm)
colnames(docTerm) <- paste0("d",seq(1:dim(docTerm)[2]))
# dimensions
dim(docTerm)
# inspect candidate features
tfIdf <- rowSums(docTerm)
tfIdf <- sort(tfIdf, decreasing = T)
head(tfIdf, 10)
tail(tfIdf, 10)
# list candidate features
length(tfIdf)
names(tfIdf)
# pick some words w
w <- sample(1:length(tfIdf),4,replace = F)
par(mfcol = c(2,2))
for (i in 1:length(w)) {
  hist(docTerm[w[i],],
       main = paste0("Distribution of '",names(tfIdf)[w[i]],"'"),
       cex.main = .85,
       xlab = "Tf-Idf",
       ylab = "Count"
       )
} # quite interesting, isn't it? - How do you perform regression with these?

image
#### Part IV Binomial Logistic Regression
 
# How well can the selected words predict the document category?
# How to relate continuous, non-normally distributed predictors, to a categorical outcome?
# Idea: Logistic function
par(mfcol=c(1,1))
logistic <- function(t) {exp(t)/(1+exp(t))}
curve(logistic, from = -10, to = 10, n = 1000,
      main="Logistic Function",
      cex.main = .85,
      xlab = "t",
      ylab = "Logistic(t)",
      cex.lab = .85)
# and then let t be b0 + b1*x1 + b2*x2 + ... + bn*xn

image
# prepare data set
dataSet <- data.frame(t(docTerm))
dataSet$Category <- as.character(meta(workCorpus, tag="category")) 
table(dataSet$Category)
 
# does every word play at least some role in each category?
w1 <- which(colSums(dataSet[which(dataSet$Category == "dotCom"), 1:(dim(dataSet)[2]-1)])==0)
colnames(dataSet)[w1]
w2 <- which(colSums(dataSet[which(dataSet$Category == "hardware"), 1:(dim(dataSet)[2]-1)])==0)
colnames(dataSet)[w2]
 
# recode category
library(plyr)
dataSet$Category <- as.numeric(revalue(dataSet$Category, c("dotCom"=1, "hardware"=0)))
 
#### Logistic Regression
# Binomial Logistic Regression: use glm w. logit link (link='logit' is default)
bLRmodel <- glm(Category ~.,
             family=binomial(link='logit'),
             control = list(maxit = 500),
             data=dataSet)
 
sumLR <- summary(bLRmodel)
sumLR
 
# Coefficients
# NOTE: coefficients here relate a unit change in the predictor to the logit[P(Outcome)] 
# logit(p) = log(p/(1-p)) - also known as log-odds
sumLR$coefficients
class(sumLR$coefficients)
coefLR <- as.data.frame(sumLR$coefficients)
# Wald statistics significant? (this Wald z is normally distributed)
coefLR <- coefLR[order(-coefLR$Estimate), ]
w <- which((coefLR$`Pr(>|z|)` < .05)&(!(rownames(coefLR) == "(Intercept)")))
# which predictors worked?
rownames(coefLR)[w]
# NOTE: Wald statistic (z) is dangerous:
# as the coefficient gets higher, its standard error inflates thus underestimating z
# Beware of z...
 
# plot coefficients {ggplot2}
library(ggplot2)
plotFrame <- coefLR[w,]
plotFrame$Estimate <- round(plotFrame$Estimate,2)
plotFrame$Features <- rownames(plotFrame)
plotFrame <- plotFrame[order(-plotFrame$Estimate), ]
plotFrame$Features <- factor(plotFrame$Features, levels = plotFrame$Features, ordered=T)
ggplot(data = plotFrame, aes(x = plotFrame$Features, y = plotFrame$Estimate)) +
  geom_line(group=1) + geom_point(color="red", size=2.5) + geom_point(color="white", size=2) +
  xlab("Features") + ylab("Regression Coefficients") +
  ggtitle("Logistic Regression: Coeficients (sig. Wald test)") +
  theme(axis.text.x = element_text(angle=90))

image
# fitted probabilities
fitted(bLRmodel)
hist(fitted(bLRmodel),50)
plot(density(fitted(bLRmodel)),
     main = "Predicted Probabilities: Density")
polygon(density(fitted(bLRmodel)), 
                col="red", 
                border="black")

image
# coefficients related to odds (and not log-odds): simply
exp(bLRmodel$coefficients)[w]
# check max
max(exp(bLRmodel$coefficients)[w]) # huge? why? - think!
 
#### Reminder: Maximum Likelihood Estimation
normData <- rnorm(10000, mean = 5.75, sd = 1.25)
normLogLike <- function(params,x) {
  mean <- params[1]
  sd <- abs(params[2]) # dnorm generates NaNs if sd < 0
  dens <- dnorm(x,mean,sd)
  w <- which(dens==0)
  dens[w] <- .Machine$double.xmin	
  return(-(sum(log(dens)))) # negative logLike, for minimization w. optim()
}
# test
normLogLike(normData,c(2,.77))
# ML estimation
# random initial values
startMean <- runif(1,-100,100)
startSd <- runif(1,-100,100)
mlFit <- optim(c(startMean, startSd),
               fn = normLogLike,
               x=normData,
               control = list(maxit = 50000))
# ML estimates
mlFit$par
# cmp. true paramaters: mean = 5.75, sd = 1.25
 
# model Chi-Square
CSq <- bLRmodel$null.deviance - bLRmodel$deviance # this difference ~ Chi-Square Distribution
CSq
dfCSq <- bLRmodel$df.null - bLRmodel$df.residual # null - residual (model) degrees of freedom
dfCSq
# Chi-Square significance test in R:
pCSq = 1-pchisq(CSq, dfCSq) # 1 - c.d.f. = P(Chi-Square larger than this obtained by chance)
pCSq
 
# AIC = Akaike information criterion  (-2*LogLikelihood+2*k, k = num.parameters)
bLRmodel$aic

Session 8 Appendix

Readings :: Session 9: Binomial and Multinomial Logistic Regression [23. June, 2016, @Startit.rs, 19h CET]

To leave a comment for the author, please follow the link and comment on their blog: The Exactness of Mind.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)