Twitter sentiment analysis with R


Recently I wrote a relatively simple R script to analyze the content of Twitter posts and classify them as positive, negative, or neutral. The approach to processing tweets is based on a presentation: http://www.slideshare.net/ajayohri/twitter-analysis-by-kaify-rais. The algorithm scores each tweet by counting the positive and negative words it contains. The words are matched against sentiment dictionaries that you can find on the internet, or you can create and edit such a list yourself. Great work, but I discovered some issues.
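
The core of the scoring is just word matching. Here is a minimal sketch of the idea with toy two-word dictionaries (the real dictionaries are loaded from files later in the post):

#minimal sketch of the word-count scoring idea; the two-word
#dictionaries below are toy examples, not the real lists used later
toy.pos <- c('good', 'great')
toy.neg <- c('bad', 'awful')
tweet <- 'great app but awful battery life'
words <- unlist(strsplit(tolower(tweet), '\\s+'))
sum(words %in% toy.pos) - sum(words %in% toy.neg) #score = 1 - 1 = 0, i.e. neutral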

There are some limitations in the Twitter API. Depending on the total volume of tweets you access, you can usually retrieve tweets from the last 7-8 days only (sometimes as little as 1-2 days). This short window makes it hard to understand what activities or events influenced the tweets, or to analyze historical trends.
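
You can see this window for yourself: twitteR's searchTwitter() accepts since and until arguments, and a date-bounded request older than the window comes back (almost) empty. A small check, assuming the OAuth connection described below is already set up (the dates are placeholders):

#probing the access window: a date-bounded search older than ~7-8 days
#returns few or no tweets; the dates below are placeholders
old <- searchTwitter('rstats', n=100, since='2014-01-01', until='2014-01-07')
length(old) #typically 0 once the window has passed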

I created a cumulative file to bypass this limit and accumulate historical data. If you access tweets regularly, you can analyze the dynamics of the interactions via a chart like this one:

[Plot: dynamics of positive / negative / neutral tweets over time]

Furthermore, the algorithm is wrapped in a function, and all you need to do is enter the keyword you are interested in. The process can be repeated several times a day, and the data for each keyword is saved in a separate file. This is useful for analyzing several keywords simultaneously (e.g. several brand names, or the names of competitors).
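
For example, once the search() function defined below is in place, tracking several brands is just a loop (the keywords here are placeholders):

#hypothetical usage: one call per keyword; each keyword gets its own output files
brands <- c('brand1', 'brand2', 'competitor1') #placeholder keywords
for (b in brands) search(b)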

Let’s get started. First we need to create a Twitter Application (https://apps.twitter.com/) to connect to Twitter’s API. This gives us a Consumer Key and a Consumer Secret.

#connect all libraries
library(twitteR)
library(ROAuth)
library(plyr)
library(dplyr)
library(stringr)
library(ggplot2)
#connect to API
download.file(url='http://curl.haxx.se/ca/cacert.pem', destfile='cacert.pem')
reqURL <- 'https://api.twitter.com/oauth/request_token'
accessURL <- 'https://api.twitter.com/oauth/access_token'
authURL <- 'https://api.twitter.com/oauth/authorize'
consumerKey <- '____________' #put the Consumer Key from your Twitter Application here
consumerSecret <- '______________' #put the Consumer Secret from your Twitter Application here
Cred <- OAuthFactory$new(consumerKey=consumerKey,
                         consumerSecret=consumerSecret,
                         requestURL=reqURL,
                         accessURL=accessURL,
                         authURL=authURL)
Cred$handshake(cainfo = system.file('CurlSSL', 'cacert.pem', package = 'RCurl')) #a URL appears in the console: open it in a browser, copy the PIN code, and enter it in the console
save(Cred, file='twitter authentication.Rdata')
load('twitter authentication.Rdata') #once you have run the code above, you can start from this line in future sessions (the libraries still need to be loaded)
registerTwitterOAuth(Cred)
#the function that accesses and analyzes tweets
search <- function(searchterm)
{
#access tweets and create a cumulative file
tweets <- searchTwitter(searchterm, cainfo='cacert.pem', n=1500)
df <- twListToDF(tweets)
df <- df[, order(names(df))]
df$created <- strftime(df$created, '%Y-%m-%d')
if (!file.exists(paste(searchterm, '_stack.csv'))) write.csv(df, file=paste(searchterm, '_stack.csv'), row.names=F)
#merge the latest batch with the cumulative file and remove duplicates
stack <- read.csv(file=paste(searchterm, '_stack.csv'))
stack <- rbind(stack, df)
stack <- subset(stack, !duplicated(stack$text))
write.csv(stack, file=paste(searchterm, '_stack.csv'), row.names=F)
#tweet scoring function
score.sentiment <- function(sentences, pos.words, neg.words, .progress='none')
{
require(plyr)
require(stringr)
scores <- laply(sentences, function(sentence, pos.words, neg.words){
sentence <- gsub('[[:punct:]]', "", sentence) #remove punctuation
sentence <- gsub('[[:cntrl:]]', "", sentence) #remove control characters
sentence <- gsub('\\d+', "", sentence) #remove digits
sentence <- tolower(sentence)
word.list <- str_split(sentence, '\\s+') #split the tweet into words
words <- unlist(word.list)
pos.matches <- match(words, pos.words)
neg.matches <- match(words, neg.words)
pos.matches <- !is.na(pos.matches)
neg.matches <- !is.na(neg.matches)
score <- sum(pos.matches) - sum(neg.matches) #score = positive words minus negative words
return(score)
}, pos.words, neg.words, .progress=.progress)
scores.df <- data.frame(score=scores, text=sentences)
return(scores.df)
}
pos <- scan('C:/___________/positive-words.txt', what='character', comment.char=';') #path to the positive-words dictionary
neg <- scan('C:/___________/negative-words.txt', what='character', comment.char=';') #path to the negative-words dictionary
pos.words <- c(pos, 'upgrade') #extend the dictionaries with your own words
neg.words <- c(neg, 'wtf', 'wait', 'waiting', 'epicfail')
Dataset <- stack
Dataset$text <- as.factor(Dataset$text)
scores <- score.sentiment(Dataset$text, pos.words, neg.words, .progress='text')
write.csv(scores, file=paste(searchterm, '_scores.csv'), row.names=TRUE) #save the evaluation results to a file
#total evaluation: positive / negative / neutral
stat <- scores
stat$created <- stack$created
stat$created <- as.Date(stat$created)
stat <- mutate(stat, tweet=ifelse(score > 0, 'positive', ifelse(score < 0, 'negative', 'neutral')))
by.tweet <- group_by(stat, tweet, created)
by.tweet <- summarise(by.tweet, number=n())
write.csv(by.tweet, file=paste(searchterm, '_opin.csv'), row.names=TRUE)
#create chart (assign and print the plot so ggsave picks it up inside the function)
chart <- ggplot(by.tweet, aes(created, number)) + geom_line(aes(group=tweet, color=tweet), size=2) +
geom_point(aes(group=tweet, color=tweet), size=4) +
theme(text = element_text(size=18), axis.text.x = element_text(angle=90, vjust=1)) +
#stat_summary(fun.y = 'sum', fun.ymin='sum', fun.ymax='sum', colour = 'yellow', size=2, geom = 'line') +
ggtitle(searchterm)
print(chart)
ggsave(file=paste(searchterm, '_plot.jpeg'), plot=chart)
}
search("______") #enter keyword


Finally, we get four files:

  • a cumulative file with all the raw data,
  • a file with tweet scores (each tweet's score is the number of positive words minus the number of negative words),
  • a file with the number of tweets of each type (positive / negative / neutral) per date,
  • and a chart that looks like this:


[Plot: number of positive / negative / neutral tweets by date for the chosen keyword]
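
If you want to work with the results outside the function, the saved CSVs can be read back in for further analysis. Note that paste() with its default separator puts a space between the keyword and the suffix, so the file names look like 'keyword _opin.csv':

#hypothetical follow-up: read the saved daily summary back in;
#'keyword' is a placeholder for the search term you used
by.tweet <- read.csv(paste('keyword', '_opin.csv'))
head(by.tweet)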
