Sentiment Analysis of the First Presidential Debate of 2016 Using Machine Learning


This post analyzes the sentiment of Twitter users towards the 2016 presidential candidates after the First Presidential Debate. We focus on tweets containing the names of Hillary Clinton and Donald Trump.

Let’s load the required packages

library(twitteR)
library(ROAuth)
require(RCurl)
library(stringr)
library(tm)
library(plyr)
library(wordcloud)

Setting up Twitter Authentication

consumer_key = "consumer_key"
consumer_secret = "consumer_secret"
token_secret = "token_secret"
access_token = "access_token"
authenticate <- OAuthFactory$new(consumerKey = consumer_key,
                                 consumerSecret = consumer_secret,
                                 requestURL = "https://api.twitter.com/oauth/request_token",
                                 accessURL = "https://api.twitter.com/oauth/access_token",
                                 authURL = "https://api.twitter.com/oauth/authorize")
setup_twitter_oauth(consumer_key, consumer_secret, access_token, token_secret)

Web Scraping

Let’s search for what has been tweeted about the two candidates from the Democratic and Republican parties. We will then run a sentiment analysis on those tweets.

#search recent tweets about each candidate
Hillary <- searchTwitter("hillary + clinton", n=10000, lang='en', since=format(Sys.Date()-1))
Donald <- searchTwitter("donald + trump", n=10000, lang='en', since=format(Sys.Date()-1))
#Converting the statuses into text
hillary_txt <- sapply(Hillary, function(x) x$getText())
donald_txt <- sapply(Donald, function(x) x$getText())
#Getting the number of tweets
NumTweets <- c(length(hillary_txt), length(donald_txt))
#Combining the tweets
tweets <- c(hillary_txt, donald_txt)
head(tweets)

Sentiment Analysis

Now, we will apply the lexicon-based sentiment analysis approach proposed by Hu and Liu (KDD 2004).

The dictionaries of positive and negative words were read in. They can be found here.
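
For reference, one common way to read the Hu and Liu opinion lexicons into the pos and neg vectors used below is shown here; the file names and paths are assumptions, so adjust them to wherever the lists were saved.

#read the opinion lexicons into character vectors (file paths are assumed)
pos <- scan("positive-words.txt", what = "character", comment.char = ";")
neg <- scan("negative-words.txt", what = "character", comment.char = ";")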

The R code for function score.sentiment can be found here.
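
For readers who prefer not to follow the link, here is a minimal sketch of a Breen-style lexicon scorer with the same signature as the score.sentiment call used below; the linked version may differ in detail, and pos and neg are assumed to be character vectors of lexicon words.

#minimal sketch of a lexicon-based scorer (the linked function may differ)
score.sentiment <- function(sentences, pos, neg, .progress = 'none') {
  require(plyr)
  require(stringr)
  scores <- laply(sentences, function(sentence, pos, neg) {
    #strip punctuation, control characters and digits, then lower-case
    sentence <- gsub('[[:punct:]]', '', sentence)
    sentence <- gsub('[[:cntrl:]]', '', sentence)
    sentence <- gsub('\\d+', '', sentence)
    sentence <- tolower(sentence)
    words <- unlist(str_split(sentence, '\\s+'))
    #score = matches against the positive lexicon minus matches against the negative one
    sum(!is.na(match(words, pos))) - sum(!is.na(match(words, neg)))
  }, pos, neg, .progress = .progress)
  data.frame(score = scores, text = sentences)
}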

#apply function score.sentiment
scores <- score.sentiment(tweets, pos, neg, .progress='text')

|===================================|100%

#add variables to a data frame
scores$Candidate = factor(rep(c("Hillary", "Donald"), NumTweets))
scores$very.pos = as.numeric(scores$score >= 2)
scores$very.neg = as.numeric(scores$score <= -2)
#how many very positives and very negatives
numpos <- sum(scores$very.pos)
numneg <- sum(scores$very.neg)
#Calculating the global score
global_score = paste0(round(100 * numpos / (numpos + numneg)),"%")
global_score
[1] "75%"
The overall positive-sentiment score across both candidates

Now, let’s compare which candidate has been able to generate more positive social sentiment post the First Presidential Debate 2016.

boxplot(score~Candidate, data=scores, col='blue')
Box plot displaying the sentiment scores of the candidates

It can be seen that the tweets about Trump carry more positive sentiment than those about Clinton after the first debate.

Despite his recent controversial remarks, presidential candidate Donald Trump attracts more positive comments, while the sentiment towards Hillary Clinton is less positive. The following histograms convey the same message.
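
A possible way to draw those per-candidate histograms is sketched below, using the scores data frame built above; the lattice package is one convenient option and is an assumption, not the plotting method used in the original post.

#sentiment score histograms, one panel per candidate
library(lattice)
histogram(~ score | Candidate, data = scores,
          xlab = "Sentiment score",
          main = "Tweet sentiment scores after the first debate")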

Conclusion

Generally speaking, the tweets conveyed positive sentiment, with an overall score of 75%.

Looking at the individual presidential candidates, Donald Trump has been able to generate more positive sentiment post the first debate.

Predicting the Sentiment of Tweets

Now that we have the tweets, let’s predict their sentiments. The objective is to classify the tweets as Positive, Neutral, or Negative.

# Load the required packages
install.packages("SnowballC",repos='http://cran.us.r-project.org')
install.packages("rpart.plot", repos='http://cran.us.r-project.org')
install.packages("ROCR", repos='http://cran.us.r-project.org')
install.packages('randomForest', repos='http://cran.us.r-project.org')
library(SnowballC)
library(rpart.plot)
library(ROCR)
library(randomForest)
tweetCorpus <- Corpus(VectorSource(tweets))

#convert to lower case so the candidate-name stopwords below will match
tweetsCorpus <- tm_map(tweetCorpus, content_transformer(tolower))
#remove punctuation marks
tweetsCorpus <- tm_map(tweetsCorpus, removePunctuation)
#remove stopwords
tweetsCorpus <- tm_map(tweetsCorpus, removeWords, c("bernie", "sanders", "trump", "donald", "cruz", "ted", "hillary", "clinton", stopwords("english")))
    
#remove white spaces
tweetsCorpus <- tm_map(tweetsCorpus, stripWhitespace)

#convert back to plain text documents so a document-term matrix can be built
tweet_dtm <- tm_map(tweetsCorpus, PlainTextDocument)
terms <- DocumentTermMatrix(tweet_dtm)
terms
The printed DocumentTermMatrix summary indicates very high sparsity.

Sparsity indicates how few terms the tweets have in common. Higher sparsity means the overlap among the tweets is low (there are many zeros in the document-term matrix).

The number of terms determines the number of columns in our document-term matrix, so let’s look at the most common words and remove the less frequent ones.

length(findFreqTerms(terms, lowfreq=30)) # this finds the words that appear at least 30 times
#let's remove the sparse terms
sparseTerms <- removeSparseTerms(terms, 0.995)
sparseTerms
This greatly reduces the number of terms, leaving just 337.
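
As an aside, the wordcloud package loaded at the beginning can be used to visualise the most frequent remaining terms. This is a sketch, not part of the original analysis; slam is installed alongside tm and is used here only to sum the term counts.

#word cloud of the most frequent terms left after removing sparse ones
library(wordcloud)
library(RColorBrewer)
freqs <- sort(slam::col_sums(sparseTerms), decreasing = TRUE)
wordcloud(names(freqs), freqs, max.words = 100, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))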

Now, let’s convert the sparse matrix into a data frame. Since some of the words in the tweets may start with a number, we also run make.names() on the column names to put them into a valid format.

dataframe <- as.data.frame(as.matrix(sparseTerms))
colnames(dataframe) <- make.names(colnames(dataframe))

Now, let's label each tweet with its sentiment using the scores from the score.sentiment function.

#a tweet is labelled negative when its lexicon score is -1 or lower
dataframe$Negative <- as.factor(scores$score <= -1)
dataframe$score <- NULL
dataframe$Score <- NULL
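
Before modelling, it is worth a quick look at how the two classes are balanced, since a very skewed split would make raw accuracy look better than it is; this check is an addition, not part of the original post.

#quick check of the class balance
table(dataframe$Negative)
prop.table(table(dataframe$Negative))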

The Predictive Models

We will use a CART model, and later compare it with a random forest, to predict negative sentiment. Let us split the data into training and testing datasets.

set.seed(1000)
library(caTools)
split <- sample.split(dataframe$Negative, SplitRatio=0.7)
trainData <- subset(dataframe, split==TRUE)
testData <- subset(dataframe, split==FALSE)
modelCART <- rpart(Negative ~., data=trainData, method="class")
prp(modelCART)
The CART model

From the model, tweets containing the words crazy, attack, much, giuliani, hate, blige, failed, women, plan, and case convey negative sentiment.

Now, let’s make prediction on the test dataset.

#make prediction
predictCART <- predict(modelCART, newdata = testData, type="class")
table(testData$Negative, predictCART)
#Accuracy
Accuracy <- (491+50)/sum(table(testData$Negative, predictCART))
round(Accuracy,3)
Accuracy

Let’s plot the ROC curve.

An ROC curve demonstrates several things:

  1. It shows the tradeoff between sensitivity and specificity (any increase in sensitivity will be accompanied by a decrease in specificity).
  2. The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test.
  3. The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test.
  4. The slope of the tangent line at a cutpoint gives the likelihood ratio (LR) for that value of the test.
  5. The area under the curve is a measure of test accuracy.
Prediction_ROC <- predict(modelCART, newdata = testData)
pred <- prediction(Prediction_ROC[,2], testData$Negative)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize = TRUE)

The area under the curve can be calculated as follows:

performance(pred, "auc")@y.values
Area under the curve
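
If a single number is preferred for reporting, the value can be pulled out of the returned performance object; this is just a small convenience step added here.

#extract the AUC as a plain numeric value
auc <- as.numeric(performance(pred, "auc")@y.values)
round(auc, 3)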

Now, let us compare the CART model with a random forest classification model.

#Random forest model
modelForest <- randomForest(Negative ~ ., data = trainData, nodesize = 25, ntree = 200)
predictForest <- predict(modelForest, newdata = testData)
table(testData$Negative, predictForest)

Calculating Accuracy

Accuracy <- (495+41)/sum(table(testData$Negative, predictForest))
round(Accuracy,3)
Accuracy
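
For a like-for-like comparison with the CART ROC curve above, the same ROC/AUC calculation can be run on the random forest's class probabilities. This is a sketch using the objects created above, not a step from the original post.

#ROC curve and AUC for the random forest, using its class probabilities
forestProbs <- predict(modelForest, newdata = testData, type = "prob")
predForest <- prediction(forestProbs[, 2], testData$Negative)
perfForest <- performance(predForest, "tpr", "fpr")
plot(perfForest, colorize = TRUE)
performance(predForest, "auc")@y.values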

Conclusion

Presidential candidate Donald Trump has been able to generate more positive sentiment than Secretary Clinton. The performances of the CART and random forest classification models are very similar, and both are reasonably good at predicting negative tweets.

The entire project and the analysis can be found here.

