Sentiment Analysis using R

[This article was first published on Data Perspective, and kindly contributed to R-bloggers.]

September 23, 2013

Today I will explain how to create a basic movie review engine in R, based on the tweets people post about a movie.
The review engine is implemented in the following steps:
  • Get tweets from Twitter
  • Clean the data
  • Create a word cloud
  • Create a data dictionary
  • Score each tweet
Get Tweets from Twitter:
The first step is to fetch the data from Twitter. In R, we can call the Twitter API through the twitteR package. Each tweet returned contains, among other fields:
  • Text
  • Is re-tweeted
  • Re-tweet count
  • Tweeted user name
  • Latitude/Longitude
  • Replied to, etc.
In our case we consider only the Text field of each tweet, since we are interested in reviews of the movie. The other fields, such as latitude/longitude or replied-to, can be used for other analyses of the tweet data.

          tweets = searchTwitter("#ChennaiExpress", n=500, lang="en")

Clean the data:
In the next step, we clean the data so that we can use it for our analysis. Cleaning is a very important step in data analysis. It includes:

Extracting only the text from the tweets:
tweets_txt = sapply(tweets, function(x) x$getText())

Removing URLs, retweet and reply-to markers, punctuation, non-alphanumeric symbols, digits, extra spaces, etc.:
             tweets_cl = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", tweets_txt)  # retweet/via headers
             tweets_cl = gsub("http[^[:blank:]]+", "", tweets_cl)             # URLs
             tweets_cl = gsub("@\\w+", "", tweets_cl)                         # remaining @mentions
             tweets_cl = gsub("[ \t]{2,}", " ", tweets_cl)                    # collapse runs of spaces/tabs
             tweets_cl = gsub("^\\s+|\\s+$", "", tweets_cl)                   # trim both ends
             tweets_cl = gsub("[[:punct:]]", " ", tweets_cl)                  # punctuation
             tweets_cl = gsub("[^[:alnum:]]", " ", tweets_cl)                 # any other symbols
             tweets_cl <- gsub("\\d+", "", tweets_cl)                         # digits
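As a rough illustration of what the cleaning step does, here is a minimal sketch in base R. The two sample tweets and the helper name clean_sample are made up for this example, and the order of the substitutions is slightly rearranged so that whitespace is collapsed last; it is not the exact code used on the real data.

```r
# Hypothetical sample tweets, invented for illustration only
sample_tweets <- c(
  "RT @fan123: Loved #ChennaiExpress!!! http://t.co/abc123",
  "@critic   Watched it 2 times, boring..."
)

# clean_sample is a hypothetical helper name, not part of any package
clean_sample <- function(x) {
  # perl = TRUE so that the (?: ) group is handled by the PCRE engine
  x <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", x, perl = TRUE)
  x <- gsub("http[^[:blank:]]+", "", x)  # URLs
  x <- gsub("@\\w+", "", x)              # remaining @mentions
  x <- gsub("[^[:alnum:]]", " ", x)      # punctuation and symbols
  x <- gsub("\\d+", "", x)               # digits
  x <- gsub("\\s+", " ", x)              # collapse whitespace
  gsub("^\\s+|\\s+$", "", x)             # trim both ends
}

cleaned <- clean_sample(sample_tweets)
print(cleaned)
# [1] "Loved ChennaiExpress"    "Watched it times boring"
```

Only plain words are left, which is what the word matching in the scoring step needs.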
Create a Word Cloud:
At this point, let us view a word cloud of the most frequently tweeted words in the data, to get a visual understanding of it.
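A word cloud is driven by word frequencies, so one possible sketch is to count the words first. The sample text below is invented, and the commented-out call assumes the wordcloud package is installed; the post does not say which package was actually used.

```r
# Hypothetical cleaned tweets, invented for illustration only
cleaned <- c("loved the movie", "boring movie", "loved the songs")

# Split on whitespace and count how often each word occurs
words <- unlist(strsplit(tolower(cleaned), "\\s+"))
word_freq <- sort(table(words), decreasing = TRUE)
print(word_freq)

# Assumption: with the wordcloud package installed, the counts could be drawn as
# wordcloud::wordcloud(words = names(word_freq), freq = as.integer(word_freq), min.freq = 1)
```

Words that appear in many tweets get the largest counts and would be drawn largest in the cloud.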
Create a data dictionary:
In this step, we use a dictionary of positive and negative words, taken from the AFINN-111 word list that the full code at the end of this post loads. These two types of words are used as keywords for classifying each tweet into one of four categories: Very Positive, Positive, Negative and Very Negative.
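The grouping into four categories can be sketched on a tiny word list. The data frame below is a hypothetical stand-in in the (word, score) layout with scores chosen for illustration; the thresholds mirror the ones used in the full code (scores of -5 and -4, -3 to -1, 1 to 3, and 4 and 5).

```r
# Hypothetical miniature word list in (word, score) layout, for illustration only
lexicon <- data.frame(
  word  = c("catastrophic", "boring", "good", "superb", "outstanding"),
  score = c(-4, -3, 3, 5, 5),
  stringsAsFactors = FALSE
)

# Bin the words by score into the four categories
vNegTerms <- lexicon$word[lexicon$score <= -4]
negTerms  <- lexicon$word[lexicon$score >= -3 & lexicon$score <= -1]
posTerms  <- lexicon$word[lexicon$score >= 1 & lexicon$score <= 3]
vPosTerms <- lexicon$word[lexicon$score >= 4]

print(list(vNeg = vNegTerms, neg = negTerms, pos = posTerms, vPos = vPosTerms))
```

Domain-specific words (for movies, say "riveting" or "predictable") can then be appended to whichever vector fits.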
Score each tweet:
In this step, we write a function that calculates the rating of the movie. The function, sentimentScore, appears in the full code listing at the end of this post. After calculating the scores, we plot a bar graph showing the rating as "WORST", "BAD", "GOOD" and "VERY GOOD".
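The idea behind the scoring can be shown with a minimal base-R sketch: count how many words of each tweet fall into each category, then add the counts up over all tweets. The term lists, sample tweets and the helper name score_tweet are all made up for this example.

```r
# Hypothetical term lists, invented for illustration only
vNegTerms <- c("catastrophic")
negTerms  <- c("boring", "predictable")
posTerms  <- c("good", "charming")
vPosTerms <- c("superb", "riveting")

# score_tweet is a hypothetical helper: it counts the words of one tweet per category
score_tweet <- function(tweet) {
  words <- unlist(strsplit(tolower(tweet), "\\s+"))
  c("WORST"     = sum(!is.na(match(words, vNegTerms))),
    "BAD"       = sum(!is.na(match(words, negTerms))),
    "GOOD"      = sum(!is.na(match(words, posTerms))),
    "VERY GOOD" = sum(!is.na(match(words, vPosTerms))))
}

tweets <- c("superb songs but boring predictable story", "good charming film")

# One row per tweet, one column per category
scores <- t(sapply(tweets, score_tweet, USE.NAMES = FALSE))
totals <- colSums(scores)
print(totals)

# The totals could then be plotted with, for example, barplot(totals)
```

match() returns NA for words that are not in a list, so counting the non-NA entries gives the number of hits per category.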

Future steps in this project will be:
  • To create a UI, preferably using .NET, as I'm a dot-net developer 😉
  • To build a movie review model that can classify a new tweet as and when it is provided.

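#include required libraries
library(twitteR)  # searchTwitter()
library(plyr)     # laply()
library(stringr)  # str_split()

#get the tweets (assumes Twitter API authentication has already been set up)
tweets = searchTwitter("#ChennaiExpress", n=500, lang="en")
tweets_txt = sapply(tweets[1:50], function(x) x$getText())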

#function to clean data
cleanTweets = function(tweets) {
  tweets_cl = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", tweets)  # retweet/via headers
  tweets_cl = gsub("http[^[:blank:]]+", "", tweets_cl)         # URLs
  tweets_cl = gsub("@\\w+", "", tweets_cl)                     # remaining @mentions
  tweets_cl = gsub("[ \t]{2,}", " ", tweets_cl)                # collapse runs of spaces/tabs
  tweets_cl = gsub("^\\s+|\\s+$", "", tweets_cl)               # trim both ends
  tweets_cl = gsub("[[:punct:]]", " ", tweets_cl)              # punctuation
  tweets_cl = gsub("[^[:alnum:]]", " ", tweets_cl)             # any other symbols
  tweets_cl <- gsub("\\d+", "", tweets_cl)                     # digits
  return(tweets_cl)
}

#function to calculate number of words in each category within a sentence
sentimentScore <- function(sentences, vNegTerms, negTerms, posTerms, vPosTerms){
  final_scores <- matrix('', 0, 5)
  scores <- laply(sentences, function(sentence, vNegTerms, negTerms, posTerms, vPosTerms){
    initial_sentence <- sentence
    #remove unnecessary characters and split up by word
    sentence <- cleanTweets(sentence)
    sentence <- tolower(sentence)
    wordList <- str_split(sentence, '\\s+')
    words <- unlist(wordList)
    #build vector with matches between sentence and each category
    vPosMatches <- match(words, vPosTerms)
    posMatches <- match(words, posTerms)
    vNegMatches <- match(words, vNegTerms)
    negMatches <- match(words, negTerms)
    #sum up number of words in each category (match() gives NA for words not found)
    vPosMatches <- sum(!is.na(vPosMatches))
    posMatches <- sum(!is.na(posMatches))
    vNegMatches <- sum(!is.na(vNegMatches))
    negMatches <- sum(!is.na(negMatches))
    score <- c(vNegMatches, negMatches, posMatches, vPosMatches)
    #add row to scores table
    newrow <- c(initial_sentence, score)
    final_scores <- rbind(final_scores, newrow)
    return(final_scores)
  }, vNegTerms, negTerms, posTerms, vPosTerms)
  return(scores)
}

#load pos,neg statements
afinn_list <- read.delim(file='~/AFINN-111.txt', header=FALSE, stringsAsFactors=FALSE)
names(afinn_list) <- c('word', 'score')
afinn_list$word <- tolower(afinn_list$word)

#categorize words as very negative to very positive and add some movie-specific words
vNegTerms <- afinn_list$word[afinn_list$score==-5 | afinn_list$score==-4]
negTerms <- c(afinn_list$word[afinn_list$score==-3 | afinn_list$score==-2 | afinn_list$score==-1], "second-rate", "moronic", "third-rate", "flawed", "juvenile", "boring", "distasteful", "ordinary", "disgusting", "senseless", "static", "brutal", "confused", "disappointing", "bloody", "silly", "tired", "predictable", "stupid", "uninteresting", "trite", "uneven", "outdated", "dreadful", "bland")
posTerms <- c(afinn_list$word[afinn_list$score==3 | afinn_list$score==2 | afinn_list$score==1], "first-rate", "insightful", "clever", "charming", "comical", "charismatic", "enjoyable", "absorbing", "sensitive", "intriguing", "powerful", "pleasant", "surprising", "thought-provoking", "imaginative", "unpretentious")
vPosTerms <- c(afinn_list$word[afinn_list$score==5 | afinn_list$score==4], "uproarious", "riveting", "fascinating", "dazzling", "legendary")   

#Calculate score on each tweet
#stringsAsFactors=FALSE keeps the counts as character strings so as.numeric() converts them correctly
tweetResult <- as.data.frame(sentimentScore(tweets_txt, vNegTerms, negTerms, posTerms, vPosTerms), stringsAsFactors=FALSE)
tweetResult$'2' = as.numeric(tweetResult$'2')
tweetResult$'3' = as.numeric(tweetResult$'3')
tweetResult$'4' = as.numeric(tweetResult$'4')
tweetResult$'5' = as.numeric(tweetResult$'5')
counts = c(sum(tweetResult$'2'), sum(tweetResult$'3'), sum(tweetResult$'4'), sum(tweetResult$'5'))
names = c("WORST", "BAD", "GOOD", "VERY GOOD")
mr = list(counts, names)
colors = c("red", "yellow", "green", "violet")
barplot(mr[[1]], main="Movie Review", xlab="Number of votes", legend=mr[[2]], col=colors)
