Introduction to R for Data Science :: Session 8 [Intro to Text Mining in R, ML Estimation + Binomial Logistic Regression]
Welcome to Introduction to R for Data Science, Session 8: Intro to Text Mining in R, ML Estimation + Binomial Logistic Regression [Web-scraping with tm.plugin.webmining. The tm package corpora structures: accessing document metadata and content. Typical corpus transformations and Term-Document Matrix production. A simple binomial regression model with tf-idf scores as features and its shortcomings due to sparse data. Reminder: Maximum Likelihood Estimation with Nelder-Mead from optim().]
The course is co-organized by Data Science Serbia and Startit. You will find all course material (R scripts, data sets, SlideShare presentations, readings) on these pages.
Check out the Course Overview to access the learning material presented thus far.
Data Science Serbia Course Pages [in Serbian]
Startit Course Pages [in Serbian]
Lecturers
- dipl. ing Branko Kovač, Data Analyst at CUBE, Data Science Mentor at Springboard, Data Science Serbia
- Goran S. Milovanović, PhD, Data Scientist @DiploFoundation, Data Science Mentor at Springboard, Data Science Serbia
Summary of Session 8, 17. June 2016 :: Intro to Text Mining in R + Binomial Logistic Regression.
Intro to Text Mining in R + Binomial Logistic Regression: Web-scraping with tm.plugin.webmining. The tm package corpora structures: accessing document metadata and content. Typical corpus transformations and Term-Document Matrix production. A simple binomial regression model with tf-idf scores as features and its shortcomings due to sparse data. Reminder: Maximum Likelihood Estimation with Nelder-Mead from optim().
Intro to R for Data Science SlideShare :: Session 8
R script :: Session 8
########################################################
# Introduction to R for Data Science
# SESSION 8 :: 16 June, 2016
# Binomial Logistic Regression + Intro to Text Mining in R
# Data Science Community Serbia + Startit
# :: Goran S. Milovanović and Branko Kovač ::
########################################################

# clear
rm(list = ls())

# libraries
library(tm)
library(tm.plugin.webmining)

#### NOTE ####
#### Suggestion: skip the WebCorpus() calls from tm.plugin.webmining
#### (i.e. skip everything before: # START HERE load)
#### The data set is available from GitHub :: https://github.com/GoranMilovanovic/IntroRDataScience
#### File: Session8.RData
#### Download link: https://goo.gl/cgJ3J3

#### Part I Information Retrieval: ICT market
# 2 categories: dotcom vs hardware companies
# dotCom category:
# NASDAQ:GOOGL is Alphabet Inc, NASDAQ:AMZN is Amazon, NASDAQ:JD is JD.com, NASDAQ:FB is Facebook,
# NYSE:BABA is Alibaba
# hardware category:
# NYSE:HPQ is HP, NASDAQ:AAPL is Apple, KRX:005930 is Samsung Electronics, TPE:2354 is Foxconn,
# NYSE:IBM is IBM
# source: Google Finance
# search queries: .com vs. hardware companies
searchQueries <- list(c('NASDAQ:GOOGL','NASDAQ:AMZN','NASDAQ:JD','NASDAQ:FB','NYSE:BABA'),
                      c('NYSE:HPQ','NASDAQ:AAPL','KRX:005930','TPE:2354','NYSE:IBM'))
# retrieve w. tm.plugin.webmining
# retrieve news for dotCom companies
dotCom <- lapply(searchQueries[[1]], function(x) {
  WebCorpus(GoogleFinanceSource(x))
})
# now retrieve news for hardware companies
hardware <- lapply(searchQueries[[2]], function(x) {
  WebCorpus(GoogleFinanceSource(x))
})
# source: Google News
searchQueriesNews <- list(c("Google", "Amazon", "JD.com", "Facebook", "Alibaba"),
                          c("Hewlett Packard", "Apple", "Samsung", "Foxconn", "IBM"))
# retrieve news for dotCom companies
dotComNews <- lapply(searchQueriesNews[[1]], function(x) {
  googleNewsSRC <- GoogleNewsSource(x, params = list(hl = "en", q = x, ie = "utf-8",
                                                     num = 30, output = "rss"))
  WebCorpus(googleNewsSRC)
})
# retrieve news for hardware companies
hardwareNews <- lapply(searchQueriesNews[[2]], function(x) {
  googleNewsSRC <- GoogleNewsSource(x, params = list(hl = "en", q = x, ie = "utf-8",
                                                     num = 30, output = "rss"))
  WebCorpus(googleNewsSRC)
})

# save workspace; NOTE: tm has functions to save corpora
sessionDir <- "YOUR WORKING DIRECTORY HERE"
setwd(sessionDir)
save.image(file = "Session8.RData")

# START HERE load
sessionDir <- "YOUR WORKING DIRECTORY HERE"
setwd(sessionDir)
load(file = "Session8.RData")

# tm corpora are lists
# a tm corpus (WebCorpus, more specifically)
dotCom[[1]]
# a PlainTextDocument in the first dotCom corpus
class(dotCom[[1]][[1]])
dotCom[[1]][[1]]
# document metadata
dotCom[[1]][[1]]$meta
# document content
dotCom[[1]][[1]]$content

# let's add a category tag to the document metadata structure: dotCom
dotCom <- lapply(dotCom, function(x) {
  x <- tm_map(x, function(doc) { # tm_map works over tm corpora, similarly to lapply
    meta(doc, "category") <- "dotCom"
    return(doc)
  })
})
dotCom[[1]][[1]]$meta
# add a category tag to the document metadata structure: dotComNews
dotComNews <- lapply(dotComNews, function(x) {
  x <- tm_map(x, function(doc) {
    meta(doc, "category") <- "dotCom"
    return(doc)
  })
})
dotComNews[[1]][[1]]$meta
# add a category tag to the document metadata structure: hardware
hardware <- lapply(hardware, function(x) {
  x <- tm_map(x, function(doc) {
    meta(doc, "category") <- "hardware"
    return(doc)
  })
})
hardware[[1]][[1]]$meta
# add a category tag to the document metadata structure: hardwareNews
hardwareNews <- lapply(hardwareNews, function(x) {
  x <- tm_map(x, function(doc) {
    meta(doc, "category") <- "hardware"
    return(doc)
  })
})
hardwareNews[[1]][[1]]$meta

# combine corpora w. do.call() and c()
dotCom <- do.call(c, dotCom) # do.call comes in handy; similar to lapply
# Learn more about do.call: http://www.stat.berkeley.edu/~s133/Docall.html
hardware <- do.call(c, hardware)
dotComNews <- do.call(c, dotComNews)
hardwareNews <- do.call(c, hardwareNews)
workCorpus <- c(dotCom, dotComNews, hardware, hardwareNews)

# are there any duplicates?
urls <- as.character(meta(workCorpus, tag = "origin"))
uurls <- unique(urls)
w <- unlist(lapply(uurls, function(x) {which(urls %in% x)[1]}))
workCorpus <- workCorpus[w]
# empty docs?
docLengths <- as.numeric(lapply(workCorpus, function(x) {nchar(x$content)}))
w <- which(docLengths < 10)
workCorpus <- workCorpus[-w]

#### Part II Text Preprocessing
### text pre-processing
workCorpus[[1]]$content # we need to clean up the docs
# reminder: https://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html # Regex in R
removeSpecial <- function(x) {
  # replacing w. space might turn out to be handy
  x$content <- gsub("\\t|\\r|\\n", " ", x$content)
  return(x)
}
# example:
cleanDoc <- removeSpecial(workCorpus[[1]])
cleanDoc$content
# removeSpecial with tm_map {tm}
workCorpus <- tm_map(workCorpus, removeSpecial)
workCorpus[[1]]$content

#### {tm} by the book text pre-processing
# remove punctuation
workCorpus <- tm_map(workCorpus, removePunctuation) # built in
workCorpus[[1]]$content
# remove numbers
workCorpus <- tm_map(workCorpus, removeNumbers) # built in
workCorpus[[1]]$content
# all characters to lower case
toLower <- function(x) {
  x$content <- tolower(x$content)
  return(x)
}
workCorpus <- tm_map(workCorpus, toLower)
workCorpus[[1]]$content
# remove stop words
workCorpus <- tm_map(workCorpus, removeWords, stopwords("english")) # built in
workCorpus[[1]]$content
# stemming
# see: http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
library(SnowballC) # nice stemming algorithm (Porter, 1980)
# see: https://cran.r-project.org/web/packages/SnowballC/index.html
workCorpus <- tm_map(workCorpus, stemDocument)
workCorpus[[1]]$content
# remove potentially dangerous predictors: company names
# Why? (In a nutshell: we are trying to discover something here,
# not to 'confirm' our potentially redundant knowledge)
predWords <- tolower(c("Alphabet", "Google", "Amazon", "JD.com", "Facebook", "Alibaba",
                       "HP", "Hewlett Packard", "Apple", "Samsung", "Foxconn", "IBM"))
predWordsRegex <- paste(stemDocument(predWords), collapse = "|")
replacePreds <- function(x) {
  x$content <- gsub(predWordsRegex, " ", x$content)
  return(x)
}
workCorpus <- tm_map(workCorpus, replacePreds)
workCorpus[[1]]$content
# strip whitespace
workCorpus <- tm_map(workCorpus, stripWhitespace)
workCorpus[[1]]$content

#### Part III Feature Selection: Term-Document Frequency Matrix (TDM)
# TDM
# rows = terms; columns = docs
# we will use the tf-idf weighting
# see: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
dT <- TermDocumentMatrix(workCorpus,
                         control = list(tolower = FALSE,
                                        wordLengths = c(3, Inf),
                                        weighting = weightTfIdf))
# dT is a *sparse matrix*; maybe you want to learn more about the R {slam} package
# https://cran.r-project.org/web/packages/slam/index.html
class(dT)
dT$i # rows w. non-zero entries
dT$j # columns w. non-zero entries
dT$v # non-zero entry in [i,j]
dT$nrow # "true" row dimension
dT$ncol # "true" column dimension
dT$dimnames$Terms # self-explanatory
dT$dimnames$Docs # self-explanatory

# remove sparse terms
docTerm <- removeSparseTerms(dT, sparse = .75) # sparse is in (0,1]
dim(docTerm)
docTerm$dimnames$Terms
# see: [1] http://www.inside-r.org/packages/cran/tm/docs/removeSparseTerms
# A term-document matrix where those terms from x are removed which have
# at least a sparse percentage of empty (i.e., terms occurring 0 times in a document)
# elements. [1]
# NOTE: **very important**; in a real-world application
# you would probably need to run many models with features obtained from various levels of sparsity
# and perform model selection.

# as.matrix
docTerm <- as.matrix(docTerm)
colnames(docTerm) <- paste0("d", seq(1:dim(docTerm)[2]))
# dimensions
dim(docTerm)
# inspect candidate features
tfIdf <- rowSums(docTerm)
tfIdf <- sort(tfIdf, decreasing = T)
head(tfIdf, 10)
tail(tfIdf, 10)
# list candidate features
length(tfIdf)
names(tfIdf)
# pick some words w
w <- sample(1:length(tfIdf), 4, replace = F)
par(mfcol = c(2, 2))
for (i in 1:length(w)) {
  hist(docTerm[names(tfIdf)[w[i]], ], # index by term name: tfIdf is sorted, docTerm is not
       main = paste0("Distribution of '", names(tfIdf)[w[i]], "'"),
       cex.main = .85,
       xlab = "Tf-Idf",
       ylab = "Count")
}
# quite interesting, isn't it? - How do you perform regression with these?
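#### Aside: a minimal sketch of what the weightTfIdf weighting does, on a tiny
#### hand-made corpus. The toy documents below are hypothetical and serve only
#### to illustrate the weighting; they are not part of the session data.
toyDocs <- c("amazon cloud revenue grows",
             "cloud hardware revenue falls",
             "hardware sales grow")
toyCorpus <- VCorpus(VectorSource(toyDocs))
toyTdm <- TermDocumentMatrix(toyCorpus,
                             control = list(weighting = weightTfIdf))
as.matrix(toyTdm)
# terms spread across (nearly) all documents get an idf close to zero and thus a
# low tf-idf score, while terms concentrated in few documents get boosted - the
# property that makes tf-idf scores interesting candidate features here.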
#### Part IV Binomial Logistic Regression
# How well can the selected words predict the document category?
# How to relate continuous, non-normally distributed predictors to a categorical outcome?
# Idea: the logistic function
par(mfcol = c(1, 1))
logistic <- function(t) {exp(t)/(1 + exp(t))}
curve(logistic, from = -10, to = 10, n = 1000,
      main = "Logistic Function", cex.main = .85,
      xlab = "t", ylab = "Logistic(t)", cex.lab = .85)
# and then let t be b0 + b1*x1 + b2*x2 + ... + bn*xn
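#### Aside: a quick sanity check (illustration only) that the logistic function
#### above is the inverse of the logit link glm() uses below; base R ships it as plogis().
t <- seq(-10, 10, by = .1)
all.equal(logistic(t), plogis(t)) # should be TRUE
# so, for a fitted model, P(Category = 1) = plogis(b0 + b1*x1 + ... + bn*xn),
# which is exactly what fitted() returns for a binomial glm.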
# prepare data set
dataSet <- data.frame(t(docTerm))
dataSet$Category <- as.character(meta(workCorpus, tag = "category"))
table(dataSet$Category)
# does every word play at least some role in each category?
w1 <- which(colSums(dataSet[which(dataSet$Category == "dotCom"), 1:(dim(dataSet)[2]-1)]) == 0)
colnames(dataSet)[w1]
w2 <- which(colSums(dataSet[which(dataSet$Category == "hardware"), 1:(dim(dataSet)[2]-1)]) == 0)
colnames(dataSet)[w2]
# recode category
library(plyr)
dataSet$Category <- as.numeric(revalue(dataSet$Category, c("dotCom" = 1, "hardware" = 0)))

#### Logistic Regression
# Binomial Logistic Regression: use glm w. logit link (link='logit' is the default)
bLRmodel <- glm(Category ~ .,
                family = binomial(link = 'logit'),
                control = list(maxit = 500),
                data = dataSet)
sumLR <- summary(bLRmodel)
sumLR
# Coefficients
# NOTE: coefficients here relate a unit change in the predictor to the logit[P(Outcome)]
# logit(p) = log(p/(1-p)) - also known as log-odds
sumLR$coefficients
class(sumLR$coefficients)
coefLR <- as.data.frame(sumLR$coefficients)
# Wald statistics significant? (this Wald z is normally distributed)
coefLR <- coefLR[order(-coefLR$Estimate), ]
w <- which((coefLR$`Pr(>|z|)` < .05) & (!(rownames(coefLR) == "(Intercept)")))
# which predictors worked?
rownames(coefLR)[w]
# NOTE: the Wald statistic (z) is dangerous:
# as the coefficient gets higher, its standard error inflates, thus underestimating z
# Beware of z...
# plot coefficients {ggplot2}
library(ggplot2)
plotFrame <- coefLR[w, ]
plotFrame$Estimate <- round(plotFrame$Estimate, 2)
plotFrame$Features <- rownames(plotFrame)
plotFrame <- plotFrame[order(-plotFrame$Estimate), ]
plotFrame$Features <- factor(plotFrame$Features,
                             levels = plotFrame$Features,
                             ordered = T)
ggplot(data = plotFrame, aes(x = Features, y = Estimate)) +
  geom_line(group = 1) +
  geom_point(color = "red", size = 2.5) +
  geom_point(color = "white", size = 2) +
  xlab("Features") + ylab("Regression Coefficients") +
  ggtitle("Logistic Regression: Coefficients (sig. Wald test)") +
  theme(axis.text.x = element_text(angle = 90))
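#### Aside: a short sketch of where the Wald z and its p-value in summary() come
#### from: z is the estimate divided by its standard error, and the p-value is
#### the two-tailed normal probability of |z|.
zCheck <- coefLR$Estimate / coefLR$`Std. Error`
pCheck <- 2 * pnorm(-abs(zCheck))
head(cbind(z = zCheck, p = pCheck, coefLR[, c("z value", "Pr(>|z|)")]))
# this is why a badly inflated standard error (sparse features, quasi-separation)
# shrinks z and can mask a genuinely large coefficient.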
# fitted probabilities
fitted(bLRmodel)
hist(fitted(bLRmodel), 50)
plot(density(fitted(bLRmodel)),
     main = "Predicted Probabilities: Density")
polygon(density(fitted(bLRmodel)), col = "red", border = "black")
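#### Aside: a minimal sketch of turning the fitted probabilities into class
#### predictions with a 0.5 cut-off and checking them against the observed
#### categories. NB. this is in-sample accuracy, so it is optimistic; with this
#### few documents and this many features, a proper check would need cross-validation.
predCategory <- ifelse(fitted(bLRmodel) >= .5, 1, 0)
table(Predicted = predCategory, Observed = dataSet$Category)
mean(predCategory == dataSet$Category) # proportion of correctly classified documents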
# coefficients related to odds (and not log-odds): simply exponentiate
exp(bLRmodel$coefficients)[w]
# check max
max(exp(bLRmodel$coefficients)[w]) # huge? why? - think!

#### Reminder: Maximum Likelihood Estimation
normData <- rnorm(10000, mean = 5.75, sd = 1.25)
normLogLike <- function(params, x) {
  mean <- params[1]
  sd <- abs(params[2]) # dnorm generates NaNs if sd < 0
  dens <- dnorm(x, mean, sd)
  w <- which(dens == 0)
  dens[w] <- .Machine$double.xmin
  return(-(sum(log(dens)))) # negative logLike, for minimization w. optim()
}
# test (params first, data second)
normLogLike(c(2, .77), normData)
# ML estimation
# random initial values
startMean <- runif(1, -100, 100)
startSd <- runif(1, -100, 100)
mlFit <- optim(c(startMean, startSd),
               fn = normLogLike,
               x = normData,
               control = list(maxit = 50000))
# ML estimates
mlFit$par # cmp. true parameters: mean = 5.75, sd = 1.25

# model Chi-Square
CSq <- bLRmodel$null.deviance - bLRmodel$deviance # this difference ~ Chi-Square distribution
CSq
dfCSq <- bLRmodel$df.null - bLRmodel$df.residual # null - residual (model) degrees of freedom
dfCSq
# Chi-Square significance test in R:
pCSq <- 1 - pchisq(CSq, dfCSq) # 1 - c.d.f. = P(Chi-Square larger than this obtained by chance)
pCSq
# AIC = Akaike Information Criterion (-2*LogLikelihood + 2*k, k = num. parameters)
bLRmodel$aic
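#### Aside: the same ML + optim() idea applied to a binomial logistic regression,
#### to connect the Nelder-Mead reminder above with the glm() fit. The data below
#### are simulated for illustration (one predictor, true b0 = -0.5, b1 = 1.5);
#### minimizing the negative binomial log-likelihood should approximately recover
#### the glm() coefficients.
set.seed(123)
x <- rnorm(1000)
y <- rbinom(1000, size = 1, prob = plogis(-0.5 + 1.5 * x))
binLogLike <- function(params, x, y) {
  p <- plogis(params[1] + params[2] * x) # P(y = 1) under the logistic model
  -sum(y * log(p) + (1 - y) * log(1 - p)) # negative log-likelihood
}
mlLogit <- optim(c(0, 0), fn = binLogLike, x = x, y = y) # Nelder-Mead is the default method
mlLogit$par
coef(glm(y ~ x, family = binomial(link = 'logit'))) # should be very close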
Session 8 Appendix
Readings :: Session 9: Binomial and Multinomial Logistic Regression [23. June, 2016, @Startit.rs, 19h CET]
- Yves Croissant, Estimation of multinomial logit models in R: The mlogit Packages