This post uses tweets to analyze public sentiment toward the 2016 presidential candidates after the first presidential debate. We focus on tweets containing the names of Hillary Clinton and Donald Trump.
Let’s load the required packages
library(twitteR)
library(ROAuth)
library(RCurl)
library(stringr)
library(tm)
library(plyr)
library(wordcloud)
Setting up Twitter Authentication
# Replace these placeholders with your own Twitter API credentials
consumer_key <- "consumer_key"
consumer_secret <- "consumer_secret"
token_secret <- "token_secret"
access_token <- "access_token"
authenticate <- OAuthFactory$new(consumerKey = consumer_key,
                                 consumerSecret = consumer_secret,
                                 requestURL = "https://api.twitter.com/oauth/request_token",
                                 accessURL = "https://api.twitter.com/oauth/access_token",
                                 authURL = "https://api.twitter.com/oauth/authorize")
setup_twitter_oauth(consumer_key, consumer_secret, access_token, token_secret)
Let’s search for what has been tweeted about the two candidates from the Democratic and Republican parties. We will then run a sentiment analysis of the tweets.
# Search for English-language tweets about each candidate since yesterday
Hillary <- searchTwitter("hillary + clinton", n = 10000, lang = "en", since = format(Sys.Date() - 1))
Donald <- searchTwitter("donald + trump", n = 10000, lang = "en", since = format(Sys.Date() - 1))
# Converting it into text
hillary_txt <- sapply(Hillary, function(x) x$getText())
donald_txt <- sapply(Donald, function(x) x$getText())
# Getting the number of tweets
NumTweets <- c(length(hillary_txt), length(donald_txt))
# Combining the tweets
tweets <- c(hillary_txt, donald_txt)
head(tweets)
Now we will apply the lexicon-based sentiment analysis approach proposed by Hu and Liu (KDD-2004).
The dictionaries of positive and negative words were created and read in. They can be found here
The R code for function score.sentiment can be found here.
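For readers who do not follow the link, the idea behind lexicon-based scoring can be sketched in a few lines of base R. This is a minimal illustration, not the linked `score.sentiment` function: the name `score_sentiment_sketch`, the cleaning steps, and the toy word lists are all assumptions made for the example. Each tweet's score is the number of words found in the positive list minus the number found in the negative list.

```r
# Hypothetical sketch of lexicon-based scoring (not the author's score.sentiment).
score_sentiment_sketch <- function(sentences, pos_words, neg_words) {
  scores <- vapply(sentences, function(sentence) {
    # Strip punctuation, control characters and digits, then lower-case
    cleaned <- gsub("[[:punct:]]", "", sentence)
    cleaned <- gsub("[[:cntrl:]]", "", cleaned)
    cleaned <- gsub("[[:digit:]]+", "", cleaned)
    cleaned <- tolower(cleaned)
    # Split into words on whitespace
    words <- unlist(strsplit(cleaned, "\\s+"))
    # Score = positive matches minus negative matches
    sum(words %in% pos_words) - sum(words %in% neg_words)
  }, integer(1), USE.NAMES = FALSE)
  data.frame(score = scores, text = sentences, stringsAsFactors = FALSE)
}

# Toy word lists and tweets, made up for illustration
toy_pos <- c("love", "great")
toy_neg <- c("bad", "hate")
toy_result <- score_sentiment_sketch(
  c("I love this great day!", "What a bad, bad idea", "Nothing here"),
  toy_pos, toy_neg
)
toy_result
```

A tweet with no matching words scores 0, which is why many tweets end up neutral under this approach.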
# apply function score.sentiment
scores <- score.sentiment(tweets, pos, neg, .progress = 'text')
# add variables to a data frame; only two candidates were searched, so only two labels are used
scores$Candidate <- factor(rep(c("Hillary", "Donald"), NumTweets))
scores$very.pos <- as.numeric(scores$score >= 2)
scores$very.neg <- as.numeric(scores$score <= -2)
# how many very positives and very negatives
numpos <- sum(scores$very.pos)
numneg <- sum(scores$very.neg)
# Calculating the global score
global_score <- paste0(round(100 * numpos / (numpos + numneg)), "%")
global_score
# "75%"
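To make the global score concrete, here is the same arithmetic on a handful of made-up scores. The values in `toy_scores` are hypothetical and chosen only for illustration. Note that tweets scoring between -1 and 1 count toward neither total, so the percentage describes only the strongly scored tweets.

```r
# Hypothetical sentiment scores for seven tweets (not real data)
toy_scores <- c(3, 2, -2, 0, 1, -3, 5)

# Flag strongly positive (>= 2) and strongly negative (<= -2) tweets
very_pos <- as.numeric(toy_scores >= 2)
very_neg <- as.numeric(toy_scores <= -2)

numpos <- sum(very_pos)  # 3 tweets: scores 3, 2 and 5
numneg <- sum(very_neg)  # 2 tweets: scores -2 and -3

# Share of strongly scored tweets that are positive
toy_global_score <- paste0(round(100 * numpos / (numpos + numneg)), "%")
toy_global_score
```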
Now let’s compare which candidate generated more positive sentiment on Twitter after the first presidential debate of 2016.
boxplot(score~Candidate, data=scores, col='blue')
The boxplot shows that sentiment toward Trump was more positive than sentiment toward Clinton after the first debate.
Despite his recent controversial remarks, presidential candidate Donald Trump drew more positive comments, while the sentiment toward Hillary Clinton was less positive. The following histogram conveys the same message.
Overall, the tweets leaned positive: 75% of the strongly scored tweets (a score of 2 or more, or -2 or less) were very positive.
Looking at the individual candidates, Donald Trump generated more positive sentiment after the first debate.
Predicting the Sentiment of Tweets
Now that we have the tweets, let’s predict their sentiments. The objective is to classify tweets as Positive, Neutral, or Negative.
# Install and load the required packages
install.packages("SnowballC", repos = 'http://cran.us.r-project.org')
install.packages("rpart.plot", repos = 'http://cran.us.r-project.org')
install.packages("ROCR", repos = 'http://cran.us.r-project.org')
install.packages('randomForest', repos = 'http://cran.us.r-project.org')
library(SnowballC)
library(rpart.plot)
library(ROCR)
library(randomForest)
# build a corpus from the tweets
tweetCorpus <- Corpus(VectorSource(tweets))
# remove punctuation marks
tweetsCorpus <- tm_map(tweetCorpus, removePunctuation)
# remove stopwords
tweetsCorpus <- tm_map(tweetsCorpus, removeWords, c("bernie", "sanders", "trump", "donald", "cruz", "ted", "hillary", "clinton", stopwords("english")))
# remove white spaces
tweetsCorpus <- tm_map(tweetsCorpus, stripWhitespace)
# transform to plain text documents and build the document-term matrix
tweet_dtm <- tm_map(tweetsCorpus, PlainTextDocument)
terms <- DocumentTermMatrix(tweet_dtm)
terms
Sparsity is the proportion of zero entries in the document-term matrix. Higher sparsity means the tweets share fewer words, so the correlation among them is low.
Each term becomes a column of the document-term matrix, so let’s look at the most common words and remove the infrequent ones.
# this finds the words that appear at least 30 times
length(findFreqTerms(terms, lowfreq = 30))
# let's remove the sparse terms
sparseTerms <- removeSparseTerms(terms, 0.995)
sparseTerms
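The effect of sparsity and of dropping sparse terms can be illustrated on a small matrix in base R. This is a sketch under stated assumptions: the toy counts, the term names, and the 0.7 cutoff are made up, and the filtering rule (keep a term only if its share of zero entries is below the threshold) is my reading of what `removeSparseTerms` does, not a copy of its implementation.

```r
# Toy document-term matrix: 5 tweets (rows) by 4 terms (columns); hypothetical counts
dtm <- matrix(c(1, 0, 0, 0,
                1, 1, 0, 0,
                2, 0, 0, 0,
                1, 0, 1, 0,
                1, 1, 0, 0),
              nrow = 5, byrow = TRUE,
              dimnames = list(NULL, c("debate", "vote", "email", "wall")))

# Overall sparsity: share of cells that are zero
overall_sparsity <- mean(dtm == 0)

# Per-term sparsity: share of tweets in which the term does not appear
term_sparsity <- colMeans(dtm == 0)

# Keep only terms whose sparsity is below the (assumed) threshold of 0.7
kept <- dtm[, term_sparsity < 0.7, drop = FALSE]
colnames(kept)
```

Under the same reading, a threshold of 0.995 keeps terms that appear in more than 0.5% of the tweets.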
Now, let’s convert the sparse matrix into a data frame.
dataframe <- as.data.frame(as.matrix(sparseTerms))
Since some of the words in the tweets may start with a number, let's make sure the column names are in the proper format.
colnames(dataframe) <- make.names(colnames(dataframe))
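As a quick illustration of what `make.names` does to awkward column names, consider a few hypothetical terms. Names that start with a digit get an `X` prefix, and characters that are not valid in R names are replaced with a dot.

```r
# Hypothetical terms that might appear as column names
raw_names <- c("2016", "trump", "e-mail")

# Convert them into syntactically valid R names
clean_names <- make.names(raw_names)
clean_names
```

Valid names matter here because the modeling functions used later build formulas from the column names.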
Now, let's get the sentiment of each tweet from the scores returned by the score.sentiment function.
# A tweet is labeled negative when its score is -1 or lower
dataframe$Negative <- as.factor(scores$score <= -1)
dataframe$score <- NULL
dataframe$Score <- NULL
The Predictive Models
We will use CART, and later a random forest, to predict negative sentiment. Let us split the data into training and testing datasets.
set.seed(1000)
library(caTools)
# 70% of the tweets go to training, 30% to testing
split <- sample.split(dataframe$Negative, SplitRatio = 0.7)
trainData <- subset(dataframe, split == TRUE)
testData <- subset(dataframe, split == FALSE)
# fit and plot the classification tree
modelCART <- rpart(Negative ~ ., data = trainData, method = "class")
prp(modelCART)
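The split above is stratified: `sample.split` is documented as preserving the relative ratios of the labels in both subsets. A rough base R sketch of that idea follows. The function `stratified_split` and the toy labels are hypothetical, and this is not the caTools implementation.

```r
set.seed(1000)

# Hypothetical labels: 80 non-negative tweets and 20 negative tweets
labels <- factor(rep(c(FALSE, TRUE), times = c(80, 20)))

# Sample the requested share of indices separately within each class
stratified_split <- function(y, ratio) {
  in_train <- logical(length(y))
  for (level in levels(y)) {
    idx <- which(y == level)
    n_train <- round(ratio * length(idx))
    in_train[idx[sample.int(length(idx), n_train)]] <- TRUE
  }
  in_train
}

in_train <- stratified_split(labels, 0.7)

# Both classes keep the same 70/30 proportion
table(labels, in_train)
```

Stratifying matters when one class is rare, as negative tweets are here, because a plain random split could leave the test set with too few of them.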
According to the model, tweets containing the words crazy, attack, much, giuliani, hate, blige, failed, women, plan, and case convey negative sentiment.
Now, let’s make prediction on the test dataset.
# make prediction
predictCART <- predict(modelCART, newdata = testData, type = "class")
table(testData$Negative, predictCART)
# Accuracy: correctly classified tweets (the diagonal of the table above) over all test tweets
Accuracy <- (491 + 50) / sum(table(testData$Negative, predictCART))
round(Accuracy, 3)
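Rather than typing the diagonal counts by hand, accuracy can be read straight off the confusion matrix. The sketch below uses made-up labels, and the baseline comparison is an addition for context, not part of the original analysis.

```r
# Hypothetical true and predicted labels for eight tweets
actual    <- factor(c(FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE))
predicted <- factor(c(FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE))

# Rows are the true labels, columns the predictions
confusion <- table(actual, predicted)

# Accuracy = correctly classified (diagonal) / all cases.
# This assumes both factors share the same levels in the same order.
accuracy <- sum(diag(confusion)) / sum(confusion)

# Baseline: always predict the most frequent class
baseline <- max(table(actual)) / length(actual)

c(accuracy = accuracy, baseline = baseline)
```

Comparing against the baseline is useful when most tweets are not negative, since a model can score high accuracy simply by predicting the majority class.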
Let’s plot the ROC curve.
An ROC curve demonstrates several things:
- It shows the tradeoff between sensitivity and specificity (any increase in sensitivity will be accompanied by a decrease in specificity).
- The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test.
- The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test.
- The slope of the tangent line at a cutpoint gives the likelihood ratio (LR) for that value of the test.
- The area under the curve is a measure of test accuracy.
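The tradeoff described in the list above can be seen by computing sensitivity and specificity at a few cutoffs. The probabilities, labels, and thresholds below are hypothetical. Each row of the result is one point on an ROC curve, since the false positive rate is 1 minus specificity.

```r
# Hypothetical predicted probabilities of "negative" and the true labels
prob  <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
truth <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE)

# Sensitivity and specificity when predicting TRUE for prob >= threshold
roc_point <- function(threshold) {
  predicted <- prob >= threshold
  c(threshold   = threshold,
    sensitivity = sum(predicted & truth) / sum(truth),
    specificity = sum(!predicted & !truth) / sum(!truth))
}

# Lowering the threshold raises sensitivity and lowers specificity
roc_points <- t(sapply(c(0.75, 0.5, 0.25), roc_point))
roc_points
```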
# Predicted class probabilities; the second column is the probability that Negative is TRUE
Prediction_ROC <- predict(modelCART, newdata = testData)
pred <- prediction(Prediction_ROC[, 2], testData$Negative)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize = TRUE)
The area under the curve (AUC) can also be calculated from the same predictions.
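The original code for this step is not shown, so what follows is only an illustrative sketch. With ROCR, the value is typically read from `performance(pred, "auc")`. The base R version below computes the same quantity on hypothetical data using the rank-sum formula: the AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one.

```r
# Hypothetical predicted probabilities and true labels
prob  <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
truth <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE)

# AUC via the rank-sum (Mann-Whitney) formula
auc_from_ranks <- function(scores, labels) {
  n_pos <- sum(labels)
  n_neg <- sum(!labels)
  ranks <- rank(scores)  # average ranks handle ties
  (sum(ranks[labels]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc <- auc_from_ranks(prob, truth)
auc
```

An AUC of 0.5 corresponds to the 45-degree diagonal (no better than chance), and 1 corresponds to a perfect ranking.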
Now, let us compare the CART model with a random forest classification model.
# Random forest model (the argument for the number of trees is ntree)
modelForest <- randomForest(Negative ~ ., data = trainData, nodesize = 25, ntree = 200)
predictForest <- predict(modelForest, newdata = testData)
table(testData$Negative, predictForest)
# Accuracy: correctly classified tweets (the diagonal of the table above) over all test tweets
Accuracy <- (495 + 41) / sum(table(testData$Negative, predictForest))
round(Accuracy, 3)
Presidential candidate Donald Trump generated more positive sentiment than Secretary Clinton. The CART and random forest classification models perform almost identically, and both are reasonably good at predicting negative tweets.
The entire project and the analysis can be found here.
Sentimental Analysis of the First Presidential Debate of 2016 Using Machine Learning was originally published in Datazar Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.