[This article was first published on R Language in Datazar Blog on Medium, and kindly contributed to R-bloggers].

This post analyzes public sentiment towards the 2016 presidential candidates after the first presidential debate, using tweets. We focus on tweets containing the names of Hillary Clinton and Donald Trump.

library(twitteR)
library(ROAuth)
library(RCurl)
library(stringr)
library(tm)
library(plyr)
library(wordcloud)

consumer_key = "consumer_key"
consumer_secret = "consumer_secret"
token_secret = "token_secret"
access_token = "access_token"
authenticate <- OAuthFactory$new(consumerKey = consumer_key,
                                 consumerSecret = consumer_secret,
                                 requestURL = "https://api.twitter.com/oauth/request_token",
                                 accessURL = "https://api.twitter.com/oauth/access_token",
                                 authURL = "https://api.twitter.com/oauth/authorize")
setup_twitter_oauth(consumer_key, consumer_secret, access_token, token_secret)

### Web Scraping

Let’s search for what has been tweeted about the two candidates from the Democratic and Republican parties. We will then run a sentiment analysis of the tweets.

Hillary <- searchTwitter("hillary + clinton", n=10000, lang='en', since=format(Sys.Date()-1))
Donald <- searchTwitter("donald + trump", n=10000, lang='en', since=format(Sys.Date()-1))

#Converting it into Text
hillary_txt <- sapply(Hillary, function(x) x$getText())
donald_txt <- sapply(Donald, function(x) x$getText())
#Getting the Number of tweets
NumTweets <- c(length(hillary_txt), length(donald_txt))
#Combining the tweets
tweets <- c(hillary_txt, donald_txt)
head(tweets)

### Sentiment Analysis

Now, we will apply the lexicon-based sentiment analysis approach proposed by Hu and Liu (KDD-2004). The dictionaries of positive and negative words were created and read in. They can be found here. The R code for the function score.sentiment can be found here.

#apply function score.sentiment
scores <- score.sentiment(tweets, pos, neg, .progress='text')
|===================================|100%
#add variables to a data frame
#NumTweets has two entries, so only the two candidates searched above are labelled
scores$Candidate = factor(rep(c("Hillary", "Donald"), NumTweets))
scores$very.pos = as.numeric(scores$score >= 2)
scores$very.neg = as.numeric(scores$score <= -2)
#how many very positives and very negatives
numpos <- sum(scores$very.pos)
numneg <- sum(scores$very.neg)
#Calculating the global score
global_score = paste0(round(100 * numpos / (numpos + numneg)),"%")
global_score
[1] "75%"
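
The score.sentiment function used above is not reproduced in this post. As a rough illustration of the lexicon-based idea only (not the author's function), the sketch below scores a tweet as the number of positive words minus the number of negative words. The word lists `pos_words` and `neg_words`, the helper `score_tweet`, and the example tweets are all hypothetical; the post itself relies on the full Hu and Liu dictionaries.

```r
# Hypothetical miniature lexicons; the post uses the full Hu and Liu word lists.
pos_words <- c("good", "great", "win", "strong")
neg_words <- c("bad", "crazy", "hate", "failed")

# Score = (number of positive words) - (number of negative words).
score_tweet <- function(text, pos, neg) {
  # Replace everything except letters and whitespace with a space, then lowercase
  clean <- tolower(gsub("[^[:alpha:][:space:]]", " ", text))
  words <- unlist(strsplit(clean, "\\s+"))
  sum(words %in% pos) - sum(words %in% neg)
}

# Made-up tweets for illustration
example_tweets <- c("Great debate, strong answers!",
                    "What a crazy, bad plan. Hate it.",
                    "Watching the debate tonight")

example_scores <- vapply(example_tweets, score_tweet, numeric(1),
                         pos = pos_words, neg = neg_words, USE.NAMES = FALSE)
example_scores
```

With scores like these, a tweet counts as "very positive" when its score is at least 2 and "very negative" when it is at most -2, which is how the global score above is built.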

Now, let’s compare which candidate generated more positive sentiment on social media after the first presidential debate of 2016.

boxplot(score~Candidate, data=scores, col='blue')

It can be seen that Trump received more positive sentiment than Clinton after the first debate.

Despite his recent controversial remarks, presidential candidate Donald Trump attracted more positive comments, while the sentiment towards Hillary Clinton is less positive. The following histogram also conveys the same message.

#### Conclusion

Generally speaking, the tweets conveyed positive sentiment: among the strongly polarized tweets (score of at least 2 or at most -2), 75% were very positive.

Looking at the individual presidential candidates, Donald Trump generated more positive sentiment after the first debate.

### Predicting the Sentiment of Tweets

Now that we have the tweets, let’s predict their sentiments. The objective is to classify the tweets as Positive, Neutral, or Negative.

# Load the required packages
install.packages("SnowballC",repos='http://cran.us.r-project.org')
install.packages("rpart.plot", repos='http://cran.us.r-project.org')
install.packages("ROCR", repos='http://cran.us.r-project.org')
install.packages('randomForest', repos='http://cran.us.r-project.org')
library(SnowballC)
library(rpart.plot)
library(ROCR)
library(randomForest)
tweetCorpus <- Corpus(VectorSource(tweets))

#remove punctuation marks
tweetsCorpus <- tm_map(tweetCorpus, removePunctuation)
#remove stopwords
tweetsCorpus <- tm_map(tweetsCorpus, removeWords, c("bernie", "sanders", "trump", "donald", "cruz", "ted", "hillary", "clinton", stopwords("english")))

#remove white spaces
tweetsCorpus <- tm_map(tweetsCorpus, stripWhitespace)

#transform to text which wordcloud can use
tweet_dtm <- tm_map(tweetsCorpus, PlainTextDocument)
terms <- DocumentTermMatrix(tweet_dtm)
terms

Sparsity is the share of zero entries in the document-term matrix. Higher sparsity means the tweets have few words in common, so the overlap among them is low (there are many zeros in the document-term matrix).
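
To make the notion of sparsity concrete, here is a small base R sketch on a made-up matrix. The matrix `toy_dtm`, its terms, and its counts are hypothetical and only stand in for the much larger matrix built from the tweets.

```r
# Hypothetical toy document-term matrix: 3 tweets (rows) x 4 terms (columns)
toy_dtm <- matrix(c(1, 0, 0, 2,
                    0, 1, 0, 0,
                    0, 0, 0, 1),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(paste0("tweet", 1:3),
                                  c("debate", "plan", "tax", "vote")))

# Sparsity = share of cells that are zero
sparsity <- mean(toy_dtm == 0)
round(sparsity, 2)
```

Here 8 of the 12 cells are zero, so the toy matrix is about 67% sparse.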

The number of terms is the number of columns in the document-term matrix. Let’s look at the most common words and remove the less frequent ones.

length(findFreqTerms(terms, lowfreq=30)) # this finds the words that appear at least 30 times
#let's remove the sparse terms
sparseTerms <- removeSparseTerms(terms, 0.995)
sparseTerms
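
Roughly speaking, removeSparseTerms(terms, 0.995) keeps only the terms whose own sparsity is below 0.995, that is, terms that appear in more than about 0.5% of the tweets. The base R sketch below mimics that idea on a made-up matrix. It is an approximation for illustration, not the tm implementation, and `toy_dtm` and the 0.6 threshold are hypothetical.

```r
# Hypothetical toy document-term matrix: 4 tweets (rows) x 3 terms (columns)
toy_dtm <- matrix(c(1, 0, 1,
                    2, 0, 0,
                    1, 1, 1,
                    0, 0, 0),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(paste0("tweet", 1:4),
                                  c("debate", "tax", "vote")))

# Share of tweets in which each term does NOT appear
zero_share <- colMeans(toy_dtm == 0)

# Keep the terms whose zero share is below the threshold
threshold <- 0.6
kept <- toy_dtm[, zero_share < threshold, drop = FALSE]
colnames(kept)
```

In this toy case "tax" is absent from 75% of the tweets and is dropped, while "debate" and "vote" are kept.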

Now, let’s convert the sparse matrix into a data frame.

dataframe <- as.data.frame(as.matrix(sparseTerms))

Since some of the words in the tweets may start with a number, let's make sure the column names are in the proper format.

colnames(dataframe) <- make.names(colnames(dataframe))
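
As a quick illustration of what make.names does, the sketch below applies it to a few made-up terms (the vector `raw_terms` is hypothetical). Names that start with a digit get an "X" prefix and characters that are not allowed in R names are replaced with a dot, so the columns can be used safely in a model formula.

```r
# Hypothetical terms as they might appear as column names
raw_terms <- c("2016", "debate", "1st", "first-debate")

# Convert them into syntactically valid R names
clean_terms <- make.names(raw_terms)
clean_terms
```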

Now, let's get the sentiment of each tweet from the score.sentiment function.

dataframe$Negative <- as.factor(scores$score <=-1)
dataframe$score <- NULL
dataframe$Score <- NULL

#### The Predictive Models

We will use CART and random forest models to predict negative sentiment. Let us split the data into training and testing datasets.

set.seed(1000)
library(caTools)
split <- sample.split(dataframe$Negative, SplitRatio=0.7)
trainData <- subset(dataframe, split==TRUE)
testData <- subset(dataframe, split==FALSE)
modelCART <- rpart(Negative ~., data=trainData, method="class")
prp(modelCART)

From the model, tweets containing the words crazy, attack, much, giuliani, hate, blige, failed, women, plan, and case convey negative sentiment. Now, let’s make predictions on the test dataset.

#make prediction
predictCART <- predict(modelCART, newdata = testData, type="class")
table(testData$Negative, predictCART)
#Accuracy
Accuracy <- (491+50)/sum(table(testData$Negative, predictCART))
round(Accuracy,3)

Let’s plot the ROC curve. An ROC curve demonstrates several things:

1. It shows the tradeoff between sensitivity and specificity (any increase in sensitivity will be accompanied by a decrease in specificity).
2. The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test.
3. The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test.
4. The slope of the tangent line at a cutpoint gives the likelihood ratio (LR) for that value of the test.
5. The area under the curve is a measure of test accuracy.

Prediction_ROC <- predict(modelCART, newdata = testData)
pred <- prediction(Prediction_ROC[,2], testData$Negative)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize = TRUE)

The area under the curve can be calculated as follows:

performance(pred, "auc")@y.values
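
For intuition about what the AUC measures, it can also be read as the probability that a randomly chosen negative tweet receives a higher predicted probability than a randomly chosen non-negative one. The base R sketch below computes it from ranks on made-up data. It is a hand-rolled illustration, not what ROCR does internally, and the vectors `labels` and `probs` and the helper `auc_by_rank` are hypothetical.

```r
# Hypothetical labels (TRUE = negative tweet) and predicted probabilities
labels <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
probs  <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.2)

# AUC via the rank-sum (Mann-Whitney) identity
auc_by_rank <- function(labels, probs) {
  r <- rank(probs)          # ties receive average ranks
  n_pos <- sum(labels)
  n_neg <- sum(!labels)
  (sum(r[labels]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

toy_auc <- auc_by_rank(labels, probs)
round(toy_auc, 3)
```

In this toy data, 8 of the 9 (negative, non-negative) pairs are ordered correctly, so the AUC is 8/9.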

Now, let us compare the CART model with a random forest classification model.

#Random forest model
modelForest <- randomForest(Negative ~ ., data = trainData, nodesize = 25, ntree = 200)
predictForest <- predict(modelForest, newdata = testData)
table(testData$Negative, predictForest)
#Calculating Accuracy
Accuracy <- (495+41)/sum(table(testData$Negative, predictForest))
round(Accuracy,3)
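
The accuracy figures above are computed from cell counts typed in by hand, which will differ whenever the tweets are collected again. A sketch of an alternative is to read the correct predictions off the diagonal of the confusion matrix. The vectors `actual` and `predicted` below are made-up stand-ins for testData$Negative and the model predictions.

```r
# Hypothetical true labels and predictions (TRUE = negative tweet)
lv <- c(FALSE, TRUE)
actual    <- factor(c(FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE), levels = lv)
predicted <- factor(c(FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE), levels = lv)

# Confusion matrix: rows are actual classes, columns are predicted classes
confusion <- table(actual, predicted)

# Accuracy = correct predictions (the diagonal) divided by all predictions
toy_accuracy <- sum(diag(confusion)) / sum(confusion)
toy_accuracy
```

Fixing the factor levels up front keeps the table square even if a model never predicts one of the classes.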

#### Conclusion

Presidential candidate Donald Trump generated more positive sentiment than Secretary Clinton. The CART and random forest classification models perform almost identically, and both are reasonably good at predicting negative tweets.

The entire project and the analysis can be found here.

Sentimental Analysis of the First Presidential Debate of 2016 Using Machine Learning was originally published in Datazar Blog on Medium.