# Introduction to R for Data Science :: Session 8 [Appendix]

June 20, 2016

(This article was first published on The Exactness of Mind, and kindly contributed to R-bloggers)

## Appendix to Session 8: Intro to Text Mining in R, ML Estimation + Binomial Logistic Regression

Welcome to Introduction to R for Data Science, Session 8: Intro to Text Mining in R, ML Estimation + Binomial Logistic Regression [Web-scraping with tm.plugin.webmining. The tm package corpora structures: accessing document metadata and content. Typical corpus transformations and Term-Document Matrix production. A simple binomial regression model with tf-idf scores as features and its shortcomings due to sparse data. Reminder: Maximum Likelihood Estimation with Nelder-Mead from optim().]

The course is co-organized by Data Science Serbia and Startit. You will find all course material (R scripts, data sets, SlideShare presentations, readings) on these pages.

Check out the Course Overview to access the learning material presented thus far.

Data Science Serbia Course Pages [in Serbian]

Startit Course Pages [in Serbian]

## Summary of Session 8, 17. June 2016 :: Intro to Text Mining in R + Binomial Logistic Regression.

Intro to Text Mining in R + Binomial Logistic Regression: Web-scraping with tm.plugin.webmining. The tm package corpora structures: accessing document metadata and content. Typical corpus transformations and Term-Document Matrix production. A simple binomial regression model with tf-idf scores as features and its shortcomings due to sparse data. Reminder: Maximum Likelihood Estimation with Nelder-Mead from optim().
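
As a companion to the Maximum Likelihood reminder, here is a minimal sketch of how a binomial logistic regression could be estimated with Nelder-Mead from optim() and checked against glm(). The data are simulated, and the names `trueBeta`, `negLogLik`, `mle`, and `glmFit` are illustrative assumptions, not part of the session script.

```r
# Minimal sketch: ML estimation of a logistic regression with optim().
# All data are simulated; nothing here comes from the session's data set.
set.seed(8)
n <- 500
x <- rnorm(n)
trueBeta <- c(-0.5, 1.2)   # hypothetical intercept and slope
y <- rbinom(n, size = 1, prob = plogis(trueBeta[1] + trueBeta[2] * x))

# Negative log-likelihood of the binomial logistic model:
# -sum( y * eta - log(1 + exp(eta)) ), with eta the linear predictor
negLogLik <- function(beta, x, y) {
  eta <- beta[1] + beta[2] * x
  -sum(y * eta - log1p(exp(eta)))
}

# Nelder-Mead is a derivative-free simplex search; it only needs the function
mle <- optim(par = c(0, 0), fn = negLogLik, x = x, y = y,
             method = "Nelder-Mead")

# Reference fit: glm() with a logit link (iteratively reweighted least squares)
glmFit <- glm(y ~ x, family = binomial(link = "logit"))

# The two sets of estimates should agree closely
rbind(optim = mle$par, glm = unname(coef(glmFit)))
```

Nelder-Mead works reasonably well for a handful of parameters like this; with hundreds of tf-idf features it would likely struggle, which is one reason to rely on glm() for the actual model.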

Session 8 R Script

## R script :: Session 8

Split the data into training and test sets, then fit and evaluate the model:

```r
#### w. training vs. test data set
# split into test and training
dim(dataSet)
choice <- sample(1:475, 250, replace = FALSE)
test <- which(!(c(1:475) %in% choice))
trainData <- dataSet[choice, ]
newData <- dataSet[test, ]
# check!
sum(dataSet$Category[choice])/length(choice) # proportion of dotCom in training
sum(dataSet$Category[test])/length(test) # proportion of dotCom in test

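# Binomial Logistic Regression: use glm w. logit link
# (without family = binomial, glm() would fit a Gaussian linear model)
bLRmodel <- glm(Category ~ .,
                family = binomial(link = "logit"),
                control = list(maxit = 500),
                data = trainData)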

sumLR <- summary(bLRmodel)
sumLR

# Coefficients
sumLR$coefficients
class(sumLR$coefficients)
coefLR <- as.data.frame(sumLR$coefficients)
# Wald statistics significant? (this Wald z is normally distributed)
coefLR <- coefLR[order(-coefLR$Estimate), ]
w <- which((coefLR$`Pr(>|z|)` < .05) & (rownames(coefLR) != "(Intercept)"))
# which predictors worked?
rownames(coefLR)[w]

# plot coefficients {ggplot2}
library(ggplot2)
plotFrame <- coefLR[w, ]
plotFrame$Estimate <- round(plotFrame$Estimate, 2)
plotFrame$Features <- rownames(plotFrame)
plotFrame <- plotFrame[order(-plotFrame$Estimate), ]
plotFrame$Features <- factor(plotFrame$Features, levels = plotFrame$Features, ordered = TRUE)
# refer to columns by bare name inside aes(); data = plotFrame supplies them
ggplot(data = plotFrame, aes(x = Features, y = Estimate)) +
  geom_line(group = 1) + geom_point(color = "red", size = 2.5) + geom_point(color = "white", size = 2) +
  xlab("Features") + ylab("Regression Coefficients") +
  ggtitle("Logistic Regression: Coefficients (sig. Wald test)") +
  theme(axis.text.x = element_text(angle = 90))

# fitted probabilities
fitted(bLRmodel)
hist(fitted(bLRmodel), 50)
plot(density(fitted(bLRmodel)),
     main = "Predicted Probabilities: Density")
polygon(density(fitted(bLRmodel)),
        col = "red",
        border = "black")

# Prediction from the model
predictions <- predict(bLRmodel,
                       newdata = newData,
                       type = "response")

predictions <- ifelse(predictions >= 0.5, 1, 0)
trueCategory <- newData$Category

meanClasError <- mean(predictions != trueCategory)
accuracy <- 1-meanClasError
accuracy # probably rather poor..? - Why? - Think!

# Try to train a binomial regression model many times by randomly assigning
# documents to the training and test data set
# What happens? Why?

# *Look* at your data set and *think* about it before actually modeling it.
```
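
The closing exercise in the script (refit the model over many random training/test assignments) could be sketched as follows. Since the session's `dataSet` is not reproduced here, the example uses a simulated, sparse, tf-idf-like feature matrix; `simData`, `splitAccuracy`, the number of terms, and the sparsity level are all hypothetical choices made for illustration only.

```r
# Minimal sketch: repeat the random training/test split many times and
# look at how much the test accuracy varies. Simulated data only.
set.seed(2016)
nDocs <- 475
nTerms <- 60

# Sparse, tf-idf-like features: roughly 95% of the entries are exactly zero
X <- matrix(rexp(nDocs * nTerms) * rbinom(nDocs * nTerms, 1, 0.05),
            nrow = nDocs, ncol = nTerms,
            dimnames = list(NULL, paste0("term", seq_len(nTerms))))

# Hypothetical binary category, related only to the first five terms
Category <- rbinom(nDocs, 1, plogis(-0.3 + rowSums(X[, 1:5])))
simData <- data.frame(Category = Category, X)

# One random split: fit on nTrain documents, score on the remaining ones
splitAccuracy <- function(data, nTrain = 250) {
  choice <- sample(seq_len(nrow(data)), nTrain, replace = FALSE)
  # warnings about separation or rank deficiency are expected with sparse data
  fit <- suppressWarnings(
    glm(Category ~ ., family = binomial(link = "logit"),
        data = data[choice, ], control = list(maxit = 500)))
  p <- suppressWarnings(
    predict(fit, newdata = data[-choice, ], type = "response"))
  mean(ifelse(p >= 0.5, 1, 0) == data$Category[-choice])
}

accuracies <- replicate(20, splitAccuracy(simData))
summary(accuracies)
```

The warnings are suppressed here only to keep the output readable; when working with the real data they are worth reading, because they point to the same sparse-data problem the exercise asks you to think about.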

## Readings :: Session 9: Binomial and Multinomial Logistic Regression [23. June, 2016, @Startit.rs, 19h CET]
