**Appendix to Session 8: Intro to Text Mining in R, ML Estimation + Binomial Logistic Regression**

Welcome to Introduction to R for Data Science, Session 8: Intro to Text Mining in R, ML Estimation + Binomial Logistic Regression [*Web-scraping with tm.plugin.webmining. The tm package corpora structures: assessing document metadata and content. Typical corpus transformations and Term-Document Matrix production. A simple binomial regression model with tf-idf scores as features and its shortcomings due to sparse data. Reminder: Maximum Likelihood Estimation with Nelder-Mead from optim().*]

The course is co-organized by Data Science Serbia and Startit. You will find all course material (R scripts, data sets, SlideShare presentations, readings) on these pages.

Check out the Course Overview to access the learning material presented thus far.

Data Science Serbia Course Pages [in Serbian]

Startit Course Pages [in Serbian]

**Lecturers**

- dipl. ing Branko Kovač, Data Analyst at CUBE, Data Science Mentor at Springboard, Data Science Serbia
- Goran S. Milovanović, PhD, Data Science Mentor at Springboard, Data Science Serbia

**Summary of Session 8, 17. June 2016 :: Intro to Text Mining in R + Binomial Logistic Regression.**

**R script :: Session 8**

**Split data into training and test**

    #### w. training vs. test data set
    library(ggplot2)  # needed for the coefficient plot below

    # split into test and training
    dim(dataSet)
    choice <- sample(1:475, 250, replace = F)
    test <- which(!(c(1:475) %in% choice))
    trainData <- dataSet[choice, ]
    newData <- dataSet[test, ]

    # check!
    sum(dataSet$Category[choice])/length(choice) # proportion of dotCom in training
    sum(dataSet$Category[test])/length(test)     # proportion of dotCom in test

    # Binomial Logistic Regression: use glm w. logit link
    bLRmodel <- glm(Category ~ .,
                    family = binomial(link = 'logit'),
                    control = list(maxit = 500),
                    data = trainData)
    sumLR <- summary(bLRmodel)
    sumLR

    # Coefficients
    sumLR$coefficients
    class(sumLR$coefficients)
    coefLR <- as.data.frame(sumLR$coefficients)

    # Wald statistics significant? (the Wald z is asymptotically normal)
    coefLR <- coefLR[order(-coefLR$Estimate), ]
    w <- which((coefLR$`Pr(>|z|)` < .05) & (!(rownames(coefLR) == "(Intercept)")))

    # which predictors worked?
    rownames(coefLR)[w]

    # plot coefficients {ggplot2}
    plotFrame <- coefLR[w, ]
    plotFrame$Estimate <- round(plotFrame$Estimate, 2)
    plotFrame$Features <- rownames(plotFrame)
    plotFrame <- plotFrame[order(-plotFrame$Estimate), ]
    plotFrame$Features <- factor(plotFrame$Features,
                                 levels = plotFrame$Features,
                                 ordered = T)
    ggplot(data = plotFrame,
           aes(x = Features, y = Estimate)) +
      geom_line(group = 1) +
      geom_point(color = "red", size = 2.5) +
      geom_point(color = "white", size = 2) +
      xlab("Features") + ylab("Regression Coefficients") +
      ggtitle("Logistic Regression: Coefficients (sig. Wald test)") +
      theme(axis.text.x = element_text(angle = 90))

    # fitted probabilities
    fitted(bLRmodel)
    hist(fitted(bLRmodel), 50)
    plot(density(fitted(bLRmodel)),
         main = "Predicted Probabilities: Density")
    polygon(density(fitted(bLRmodel)),
            col = "red",
            border = "black")

    # Prediction from the model
    predictions <- predict(bLRmodel,
                           newdata = newData,
                           type = 'response')
    predictions <- ifelse(predictions >= 0.5, 1, 0)
    trueCategory <- newData$Category
    meanClasError <- mean(predictions != trueCategory)
    accuracy <- 1 - meanClasError
    accuracy

    # probably rather poor..? - Why? - Think!
    # Try to train a binomial regression model many times by randomly assigning
    # documents to the training and test data set
    # What happens? Why?
    # *Look* at your data set and *think* about it before actually modeling it.
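The exercise at the end of the script (refit the model on many random training/test splits) can be sketched roughly as follows. This is only an illustration, not the course's own solution: `simData` is a simulated stand-in for the session's `dataSet` (a binary `Category` plus sparse, tf-idf-like features), and `splitAccuracy` is a hypothetical helper name introduced here.

```r
# Simulated stand-in for dataSet (an assumption, not the course data):
# 475 "documents", 20 sparse features where roughly 90% of entries are zero.
set.seed(8)
nDocs <- 475
nFeat <- 20
X <- matrix(rexp(nDocs * nFeat) * rbinom(nDocs * nFeat, 1, 0.1),
            nrow = nDocs)
colnames(X) <- paste0("term", seq_len(nFeat))
# Only the first two features actually drive the simulated category
Category <- rbinom(nDocs, 1, plogis(-0.5 + 2 * X[, 1] - 2 * X[, 2]))
simData <- data.frame(Category = Category, X)

# Hypothetical helper: one random split -> fit on training -> test accuracy
splitAccuracy <- function(data, nTrain = 250) {
  choice <- sample(seq_len(nrow(data)), nTrain, replace = FALSE)
  trainData <- data[choice, ]
  newData <- data[-choice, ]
  # Sparse features can trigger convergence/separation warnings; they are
  # suppressed here only to keep the repeated runs quiet.
  model <- suppressWarnings(
    glm(Category ~ ., family = binomial(link = "logit"),
        control = list(maxit = 500), data = trainData)
  )
  predicted <- suppressWarnings(
    predict(model, newdata = newData, type = "response")
  )
  mean(ifelse(predicted >= 0.5, 1, 0) == newData$Category)
}

# Repeat the split 50 times and inspect how much the accuracy moves around
accuracies <- replicate(50, splitAccuracy(simData))
summary(accuracies)
sd(accuracies)
```

With the real `dataSet` in place of `simData`, the spread of `accuracies` across splits gives a feel for how much a single accuracy figure depends on which documents happened to land in the training set.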

**Readings :: Session 9: Binomial and Multinomial Logistic Regression [23. June, 2016, @Startit.rs, 19h CET]**

- Yves Croissant, Estimation of multinomial logit models in R: The mlogit Packages

