# Sentiment Analysis using R

August 20, 2013

(This article was first published on Data Perspective, and kindly contributed to R-bloggers)

# September 23, 2013

Today I will explain how to build a basic movie review engine in R, based on what people tweet about a movie.
The review engine is implemented in the following steps:
• Get tweets from Twitter
• Clean the data
• Create a word cloud
• Create a data dictionary
• Score each tweet

Get Tweets from Twitter:

The first step is to fetch the data from Twitter. In R, we can call the Twitter API using the twitteR package; the code for fetching the tweets follows this list. Each tweet contains:
• Text
• Is re-tweeted
• Re-tweet count
• Tweeted User name
• Latitude/Longitude
• Replied to, etc.

In our case we consider only the Text feature of each tweet, as we are interested in the review of the movie. The other features, such as Latitude/Longitude and Replied to, can be used for other kinds of analysis on the tweet data.

library(twitteR)  # provides searchTwitter(); Twitter API authentication must already be set up
library(tm)
tweets = searchTwitter("#ChennaiExpress", n=500, lang="en")

Clean the data:
In the next step, we need to clean the data so that we can use it for our analysis. Cleaning of data is a very important step in Data Analysis. This step includes:

Extracting only text from Tweets:
tweets_txt = sapply(tweets, function(x) x$getText())

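Removing URL links, reply-to mentions, punctuation, non-alphanumeric symbols, digits, extra spaces, etc.:
tweets_cl = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", tweets_txt)  # retweet / via headers
tweets_cl = gsub("http[^[:blank:]]+", "", tweets_cl)  # URLs
tweets_cl = gsub("@\\w+", "", tweets_cl)  # remaining @mentions
tweets_cl = gsub("[[:punct:]]", " ", tweets_cl)  # punctuation
tweets_cl = gsub("[^[:alnum:]]", " ", tweets_cl)  # any other non-alphanumeric symbol
tweets_cl = gsub("\\d+", "", tweets_cl)  # digits
tweets_cl = gsub("[ \t]{2,}", " ", tweets_cl)  # collapse repeated spaces so words stay separated
tweets_cl = gsub("^\\s+|\\s+$", "", tweets_cl)  # trim leading and trailing whitespace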
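To see what these substitutions do, here is a small sketch that can be run without Twitter access. The two sample tweets and the helper name clean_sample are invented for illustration. The sketch leaves the whitespace collapsing and trimming until the end and replaces repeated spaces with a single space, which is my assumption about the intended result.

```r
# Hypothetical sample tweets, invented for illustration only
sample_tweets <- c(
  "RT @fan123: Loved #ChennaiExpress!!! http://t.co/abc123",
  "@critic   worst movie of 2013 :("
)

# Hypothetical helper that applies the cleaning substitutions in sequence
clean_sample <- function(x) {
  x <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", x)  # retweet / via headers
  x <- gsub("http[^[:blank:]]+", "", x)            # URLs
  x <- gsub("@\\w+", "", x)                        # remaining @mentions
  x <- gsub("[^[:alnum:]]", " ", x)                # punctuation and symbols
  x <- gsub("\\d+", "", x)                         # digits
  x <- gsub("[ \t]{2,}", " ", x)                   # collapse repeated spaces
  gsub("^\\s+|\\s+$", "", x)                       # trim both ends
}

cleaned_sample <- clean_sample(sample_tweets)
print(cleaned_sample)
```

Only the plain words of each tweet should survive; the user names, hashtag symbol, link, year and emoticon are dropped.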
Create a Word Cloud:
At this point, let us view a word cloud of the most frequently tweeted words, to get a visual feel for the data.
library(wordcloud)
wordcloud(tweets_cl)
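A word cloud is essentially a picture of word frequencies, so it can help to look at the counts directly. The sketch below uses only base R; the three cleaned tweets are invented for illustration.

```r
# Hypothetical cleaned tweets, invented for illustration only
cleaned <- c("loved the movie", "the songs were good", "good movie")

# Split on whitespace, flatten into one vector, and count each word
words <- unlist(strsplit(tolower(cleaned), "\\s+"))
word_freq <- sort(table(words), decreasing = TRUE)
print(word_freq)
```

With the wordcloud package loaded, a call along the lines of wordcloud(names(word_freq), as.vector(word_freq), min.freq = 1) should draw the same counts as a cloud.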

Create a data dictionary:
In this step, we use a dictionary of positive and negative words that was downloaded beforehand (the code below refers to it as afinn_list, with one word and one score per row). The words are grouped by score into 4 categories: Very Positive, Positive, Negative and Very Negative. These groups serve as the keywords for rating each tweet.
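As a rough sketch of the grouping, assume a dictionary with integer scores from -5 to 5, where -5 and -4 count as very negative, -3 to -1 as negative, 1 to 3 as positive, and 4 and 5 as very positive. The miniature dictionary below is invented for illustration and is not taken from the downloaded file; a score of exactly 0, if present, would land in the negative group with these break points.

```r
# Hypothetical miniature dictionary (word, integer score), made up for illustration
mini_dict <- data.frame(
  word  = c("superb", "good", "bad", "catastrophic", "nice", "awful"),
  score = c(5, 3, -3, -4, 1, -2),
  stringsAsFactors = FALSE
)

# Cut the score range into the four categories (intervals are right-closed)
mini_dict$category <- cut(
  mini_dict$score,
  breaks = c(-5.5, -3.5, 0, 3.5, 5.5),
  labels = c("very negative", "negative", "positive", "very positive")
)

# One character vector of words per category
terms_by_category <- split(mini_dict$word, mini_dict$category)
print(terms_by_category)
```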
Score each tweet:
In this step, we write a function that counts, for each tweet, how many of its words fall into each category; these counts give the rating of the movie. The function is given below. After calculating the scores, we plot a bar graph showing the rating as "WORST", "BAD", "GOOD" and "VERY GOOD".
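Before the full function, here is a stripped-down sketch of the counting idea in base R. The term lists, the sample tweets and the helper name count_matches are all invented for illustration; a real dictionary is far longer.

```r
# Hypothetical term lists, far shorter than a real dictionary
v_neg <- c("catastrophic")
neg   <- c("bad", "boring")
pos   <- c("good", "nice")
v_pos <- c("superb")

# Count how many words of one sentence fall into each category
count_matches <- function(sentence) {
  words <- unlist(strsplit(tolower(sentence), "\\s+"))
  c(very_negative = sum(words %in% v_neg),
    negative      = sum(words %in% neg),
    positive      = sum(words %in% pos),
    very_positive = sum(words %in% v_pos))
}

# Hypothetical, already cleaned tweets
sample_tweets <- c("superb songs and good story", "boring and bad")

# One row per tweet, one column per category
score_table <- t(sapply(sample_tweets, count_matches))
totals <- colSums(score_table)
print(totals)
```

The column totals are what ends up in the bar plot: each matched word counts as one vote for its category.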

Future steps in this project will be:
• To create a UI, preferably using .NET, as I'm a dot-net developer 😉
• To build a movie review model that can classify a new tweet as and when it is provided.

Code:

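#include required libraries
library(twitteR)  # provides searchTwitter(); Twitter API authentication must already be set up
library(plyr)
library(stringr)
#get the tweets
tweets = searchTwitter("#ChennaiExpress", n=500, lang="en")
#keep the text of the first 50 tweets
tweets_txt = sapply(tweets[1:50], function(x) x$getText())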
#function to clean data
cleanTweets = function(tweets)
{
  tweets_cl = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", tweets)  # retweet / via headers
  tweets_cl = gsub("http[^[:blank:]]+", "", tweets_cl)  # URLs
  tweets_cl = gsub("@\\w+", "", tweets_cl)  # remaining @mentions
  tweets_cl = gsub("[[:punct:]]", " ", tweets_cl)  # punctuation
  tweets_cl = gsub("[^[:alnum:]]", " ", tweets_cl)  # any other non-alphanumeric symbol
  tweets_cl = gsub("\\d+", "", tweets_cl)  # digits
  tweets_cl = gsub("[ \t]{2,}", " ", tweets_cl)  # collapse repeated spaces so words stay separated
  tweets_cl = gsub("^\\s+|\\s+$", "", tweets_cl)  # trim leading and trailing whitespace
  return(tweets_cl)
}
#function to calculate number of words in each category within a sentence
sentimentScore <- function(sentences, vNegTerms, negTerms, posTerms, vPosTerms){
  final_scores <- matrix('', 0, 5)
  scores <- laply(sentences, function(sentence, vNegTerms, negTerms, posTerms, vPosTerms){
    initial_sentence <- sentence
    #remove unnecessary characters and split up by word
    sentence = cleanTweets(sentence)
    sentence <- tolower(sentence)
    wordList <- str_split(sentence, '\\s+')
    words <- unlist(wordList)
    #build vector with matches between sentence and each category
    vPosMatches <- match(words, vPosTerms)
    posMatches <- match(words, posTerms)
    vNegMatches <- match(words, vNegTerms)
    negMatches <- match(words, negTerms)
    #sum up number of words in each category
    vPosMatches <- sum(!is.na(vPosMatches))
    posMatches <- sum(!is.na(posMatches))
    vNegMatches <- sum(!is.na(vNegMatches))
    negMatches <- sum(!is.na(negMatches))
    score <- c(vNegMatches, negMatches, posMatches, vPosMatches)
    #add row to scores table
    newrow <- c(initial_sentence, score)
    final_scores <- rbind(final_scores, newrow)
    return(final_scores)
  }, vNegTerms, negTerms, posTerms, vPosTerms)
  return(scores)
}
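#load the word list (one word and its score per line, tab-separated)
#the file name below is an assumption; point it at wherever the dictionary was downloaded
afinn_list <- read.delim("AFINN-111.txt", header=FALSE, stringsAsFactors=FALSE)
names(afinn_list) <- c('word', 'score')
afinn_list$word <- tolower(afinn_list$word)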
#categorize words as very negative to very positive and add some movie-specific words
vNegTerms <- afinn_list$word[afinn_list$score==-5 | afinn_list$score==-4]
negTerms <- c(afinn_list$word[afinn_list$score==-3 | afinn_list$score==-2 | afinn_list$score==-1],
              "second-rate", "moronic", "third-rate", "flawed", "juvenile", "boring", "distasteful",
              "ordinary", "disgusting", "senseless", "static", "brutal", "confused", "disappointing",
              "bloody", "silly", "tired", "predictable", "stupid", "uninteresting", "trite", "uneven",
              "outdated", "dreadful", "bland")
posTerms <- c(afinn_list$word[afinn_list$score==3 | afinn_list$score==2 | afinn_list$score==1],
              "first-rate", "insightful", "clever", "charming", "comical", "charismatic", "enjoyable",
              "absorbing", "sensitive", "intriguing", "powerful", "pleasant", "surprising",
              "thought-provoking", "imaginative", "unpretentious")
vPosTerms <- c(afinn_list$word[afinn_list$score==5 | afinn_list$score==4],
               "uproarious", "riveting", "fascinating", "dazzling", "legendary")
#Calculate score on each tweet
#column 1 holds the tweet text; columns 2 to 5 hold the counts from very negative to very positive
#stringsAsFactors=FALSE keeps the counts as text so that as.numeric() recovers the actual numbers
tweetResult <- as.data.frame(sentimentScore(tweets_txt, vNegTerms, negTerms, posTerms, vPosTerms), stringsAsFactors=FALSE)
tweetResult$`2` = as.numeric(tweetResult$`2`)
tweetResult$`3` = as.numeric(tweetResult$`3`)
tweetResult$`4` = as.numeric(tweetResult$`4`)
tweetResult$`5` = as.numeric(tweetResult$`5`)
#total number of matched words per category across all tweets
counts = c(sum(tweetResult$`2`), sum(tweetResult$`3`), sum(tweetResult$`4`), sum(tweetResult$`5`))
ratings = c("WORST", "BAD", "GOOD", "VERY GOOD")
mr = list(counts, ratings)
colors = c("red", "yellow", "green", "violet")
barplot(mr[[1]], main="Movie Review", xlab="Number of votes", legend.text=mr[[2]], col=colors)
