Introduction to R for Data Science :: Session 8 [Appendix]

June 20, 2016

(This article was first published on The Exactness of Mind, and kindly contributed to R-bloggers)

Appendix to Session 8: Intro to Text Mining in R, ML Estimation + Binomial Logistic Regression

Welcome to Introduction to R for Data Science, Session 8: Intro to Text Mining in R, ML Estimation + Binomial Logistic Regression [Web-scraping with tm.plugin.webmining. The tm package corpora structures: assessing document metadata and content. Typical corpus transformations and Term-Document Matrix production. A simple binomial regression model with tf-idf scores as features and its shortcomings due to sparse data. Reminder: Maximum Likelihood Estimation with Nelder-Mead from optim().]
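
For readers who want a concrete picture of the tf-idf scores mentioned above, here is a minimal base R sketch. It uses one common variant (term frequency normalized by document length, times log2 of the inverse document frequency); the toy matrix and all object names are made up for illustration and are not taken from the session script.

```r
# Toy term-document counts: rows are terms, columns are documents (made-up data).
tdm <- matrix(c(2, 0, 1,
                0, 3, 0,
                1, 1, 1),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("data", "startup", "the"),
                              c("doc1", "doc2", "doc3")))

# Term frequency: counts normalized by document length (column sums).
tf <- sweep(tdm, 2, colSums(tdm), "/")

# Inverse document frequency: log2(N / number of documents containing the term).
idf <- log2(ncol(tdm) / rowSums(tdm > 0))

# tf-idf: the idf vector recycles down the rows, so each term's row is scaled by its idf.
tfidf <- tf * idf
round(tfidf, 3)
```

A term that occurs in every document (here "the") gets an idf of zero and so carries no weight, while a term confined to a single document is weighted up. With a real vocabulary most cells of such a matrix are zero, which is the sparsity problem the session refers to.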

The course is co-organized by Data Science Serbia and Startit. You will find all course material (R scripts, data sets, SlideShare presentations, readings) on these pages.

Check out the Course Overview to access the learning material presented thus far.

Data Science Serbia Course Pages [in Serbian]

Startit Course Pages [in Serbian]

Lecturers

Summary of Session 8, 17. June 2016 :: Intro to Text Mining in R + Binomial Logistic Regression.

Intro to Text Mining in R + Binomial Logistic Regression: Web-scraping with tm.plugin.webmining. The tm package corpora structures: assessing document metadata and content. Typical corpus transformations and Term-Document Matrix production. A simple binomial regression model with tf-idf scores as features and its shortcomings due to sparse data. Reminder: Maximum Likelihood Estimation with Nelder-Mead from optim().
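
As a reminder of what Maximum Likelihood Estimation with Nelder-Mead from optim() looks like for a binomial logistic model, here is a small self-contained sketch. The data are simulated and the names (negLogLik, fit, ref) and the true parameter values are assumptions made for illustration; this is not the code used in the session.

```r
set.seed(8)  # arbitrary seed, only to make the sketch reproducible

# Simulated data (hypothetical): one predictor, true intercept -0.5, true slope 1.2.
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x))

# Negative log-likelihood of the binomial logistic model.
negLogLik <- function(beta, x, y) {
  eta <- beta[1] + beta[2] * x
  # log(p) and log(1 - p) via plogis(..., log.p = TRUE) for numerical stability
  -sum(y * plogis(eta, log.p = TRUE) + (1 - y) * plogis(-eta, log.p = TRUE))
}

# Minimize the negative log-likelihood with Nelder-Mead, starting from zeros.
fit <- optim(par = c(0, 0), fn = negLogLik, x = x, y = y,
             method = "Nelder-Mead")

# Reference fit from glm() for comparison.
ref <- glm(y ~ x, family = binomial(link = "logit"))

rbind(optim = fit$par, glm = unname(coef(ref)))
```

The two rows should agree to a few decimal places. Nelder-Mead needs no gradient, which keeps the example short, but it scales poorly with the number of parameters, so glm() remains the practical choice for models with many tf-idf features.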

Session 8 R Script

Further Readings

R script :: Session 8

Split data into training and test

#### w. training vs. test data set
# split into test and training
dim(dataSet)
choice <- sample(1:475, 250, replace = FALSE)
test <- setdiff(1:475, choice)
trainData <- dataSet[choice,]
newData <- dataSet[test,]
# check!
sum(dataSet$Category[choice])/length(choice) # proportion of dotCom in training
sum(dataSet$Category[test])/length(test) # proportion of dotCom in test
 
# Binomial Logistic Regression: use glm w. logit link
bLRmodel <- glm(Category ~.,
                family=binomial(link='logit'),
                control = list(maxit = 500),
                data=trainData)
 
sumLR <- summary(bLRmodel)
sumLR
 
# Coefficients
sumLR$coefficients
class(sumLR$coefficients)
coefLR <- as.data.frame(sumLR$coefficients)
# Wald statistics significant? (under H0 the Wald z is asymptotically standard normal)
coefLR <- coefLR[order(-coefLR$Estimate), ]
w <- which((coefLR$`Pr(>|z|)` < .05) & (rownames(coefLR) != "(Intercept)"))
# which predictors worked?
rownames(coefLR)[w]
 
# plot coefficients {ggplot2}
library(ggplot2)
plotFrame <- coefLR[w,]
plotFrame$Estimate <- round(plotFrame$Estimate,2)
plotFrame$Features <- rownames(plotFrame)
plotFrame <- plotFrame[order(-plotFrame$Estimate), ]
plotFrame$Features <- factor(plotFrame$Features, levels = plotFrame$Features, ordered = TRUE)
# refer to columns by name inside aes(); ggplot looks them up in plotFrame
ggplot(data = plotFrame, aes(x = Features, y = Estimate)) +
  geom_line(group=1) + geom_point(color="red", size=2.5) + geom_point(color="white", size=2) +
  xlab("Features") + ylab("Regression Coefficients") +
  ggtitle("Logistic Regression: Coefficients (sig. Wald test)") +
  theme(axis.text.x = element_text(angle=90))
 
# fitted probabilities
fitted(bLRmodel)
hist(fitted(bLRmodel),50)
plot(density(fitted(bLRmodel)),
     main = "Predicted Probabilities: Density")
polygon(density(fitted(bLRmodel)), 
        col="red", 
        border="black")
 
# Prediction from the model
predictions <- predict(bLRmodel,
                       newdata=newData,
                       type='response')
 
predictions <- ifelse(predictions >= 0.5,1,0)
trueCategory <- newData$Category
 
meanClasError <- mean(predictions != trueCategory)
accuracy <- 1-meanClasError
accuracy # probably rather poor..? - Why? - Think!
 
# Try to train a binomial regression model many times by randomly assigning 
# documents to the training and test data set
# What happens? Why?
 
# *Look* at your data set and *think* about it before actually modeling it.
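
The exercise suggested in the comments above (refit the model on many random training/test splits) can be sketched as follows. Since the course's dataSet is not included here, the block simulates a hypothetical stand-in with a binary Category and sparse, tf-idf-like features; simData, oneSplit, and every numeric setting are assumptions made for illustration.

```r
set.seed(2016)  # arbitrary seed, only to make the sketch reproducible

# Hypothetical stand-in for dataSet: binary Category plus sparse numeric features.
n <- 475; p <- 20
X <- matrix(rbinom(n * p, 1, 0.05) * runif(n * p), n, p)  # mostly zeros, like tf-idf
colnames(X) <- paste0("term", seq_len(p))
Category <- rbinom(n, 1, plogis(-0.2 + 3 * X[, 1] - 3 * X[, 2]))
simData <- data.frame(Category = Category, X)

# One random split: fit on the training rows, return accuracy on the held-out rows.
oneSplit <- function(data, nTrain = 250) {
  idx <- sample(nrow(data), nTrain)
  fit <- suppressWarnings(  # sparse features can trigger separation warnings
    glm(Category ~ ., family = binomial(link = "logit"),
        control = list(maxit = 500), data = data[idx, ])
  )
  pred <- suppressWarnings(
    predict(fit, newdata = data[-idx, ], type = "response")
  )
  mean((pred >= 0.5) == data$Category[-idx])
}

# Repeat the split 100 times and look at the spread of test accuracy.
acc <- replicate(100, oneSplit(simData))
summary(acc)
```

Looking at summary(acc) or hist(acc) shows how much test accuracy moves from one split to the next. With sparse features, a term may be nonzero in only a handful of training documents, so its coefficient is estimated from very little information and can change sharply between splits.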

Readings :: Session 9: Binomial and Multinomial Logistic Regression [23. June, 2016, @Startit.rs, 19h CET]
