Do we appreciate sunbathing in Spring ?

[This article was first published on Freakonometrics - Tag - R-english, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

We are currently experiencing an extremely hot month in Montréal (and more generally in North America). Looking at people having a beer, and starting the first barbecue of the year, I was wondering: if we asked people if global warming was a good or a bad thing, what do you think they will answer ? Wearing a T-shirt in Montréal in March is nice, trust me ! So how can we study, from a quantitative point of view, depending on the time of the year, what people think of global warming ?

A few month ago, I went quickly through

score.sentiment = function(sentences, pos.words,
neg.words, .progress='none')
{
require(plyr)
require(stringr)
scores = laply(sentences, 
function(sentence, pos.words, neg.words) {
sentence = gsub('[[:punct:]]', '', sentence)
sentence = gsub('[[:cntrl:]]', '', sentence)
sentence = gsub('\\d+', '', sentence)
sentence = tolower(sentence)
word.list = strsplit(sentence, '\\s+')
words = unlist(word.list)
pos.matches = match(words, pos.words)
neg.matches = match(words, neg.words)
pos.matches = !is.na(pos.matches)
neg.matches = !is.na(neg.matches)
score = sum(pos.matches) - sum(neg.matches)
return(score)
}, pos.words, neg.words, .progress=.progress )
scores.df = data.frame(score=scores, text=sentences)
return(scores.df)
}

hu.liu.pos = scan("positive-words.txt", what="character",
comment.char=';')
hu.liu.neg = scan('negative-words.txt', what='character',
comment.char=';')

> score.sentiment("It's awesome I am so happy,
thank you all",
+ hu.liu.pos,hu.liu.neg)$score
[1] 3

> score.sentiment("I'm desperate, life is a nightmare,
I want to die",
+ hu.liu.pos,hu.liu.neg)$score
[1] -3

But one can easy see a big problem with this methodology. What if the sentence included negations ? E.g.

> score.sentiment("I'm no longer desperate, life is
not a nightmare anymore I don't want to die",
+ hu.liu.pos,hu.liu.neg)$score
[1] -3

Here the sentence is negative, extremely negative, if we look only at the score. But it should be the opposite. I simple idea is to change (slightly) the function, so that once a negation is found in the sentence, we take the opposite of the score. Hence, we just add at the end of the function

if("not"%in%words){score=-score}

Here we obtain

> score.sentiment.neg("I'm no longer desperate,
life is not a nightmare anymore I don't want to die",
+ hu.liu.pos,hu.liu.neg)$score
[1] 3

But does it really work ? Let us focus on Tweets,

library(twitteR)

Consider the following tweet-extractions, based on two words, a negative word, and the negation of a positive word,

> tweets=searchTwitter('not happy',n=1000)
> NH.text= lapply(tweets, function(t) t$getText() )
> NH.scores = score.sentiment(NH.text,
+ hu.liu.pos,hu.liu.neg)
 
> tweets=searchTwitter('unhappy',n=1000)
> UH.text= lapply(tweets, function(t) t$getText() )
> UH.scores = score.sentiment(UH.text,
+ hu.liu.pos,hu.liu.neg)

> plot(density(NH.scores$score,bw=.8),col="red")
> lines(density(UH.scores$score,bw=.8),col="blue")

> UH.scores = score.sentiment.neg(UH.text,
+ hu.liu.pos,hu.liu.neg)
> NH.scores = score.sentiment.neg(NH.text,
+ hu.liu.pos,hu.liu.neg)
> plot(density(NH.scores$score,bw=.8),col="red")
> lines(density(UH.scores$score,bw=.8),col="blue")

> w.tweets=searchTwitter("snow",since= LISTEDATE[k],
+ until= LISTEDATE[k+1],geocode="40,-100,2000mi")
> W.text= lapply(w.tweets, function(t) t$getText() )
> W.scores = score.sentiment.neg(W.text,
+ hu.liu.pos,hu.liu.neg, .progress='text')
> M[k]=mean(W.scores$score)
We obtain here the following score function, over three years, on Twitter,

Well, we have to admit that the pattern is not that obvious. There might me small (local) bump in the winter, but it is not obvious…

Let us get back to the point used to introduce this post. If we study what people “feel” when they mention global warming, let us run the same code, again in North America
> w.tweets=searchTwitter("global warming",since= LISTEDATE[k],
+ until= LISTEDATE[k+1],geocode="40,-100,2000mi")
Actually, I was expecting a nice cycle, with positive scores in Spring, and perhaps negative scores during heat waves, in the middle of the Summer…

What we simply observe is that global warming was related to “negative words” on Twitter a few years ago, but we have reached a neutral position nowadays.

And to be honest, I do not really know how to interpret that: is there a problem with the technique I use (obviously, I use here a very simple scoring function, even after integrating a minor correction to take into consideration negations) or is there something going one that can be interpreted ?

To leave a comment for the author, please follow the link and comment on their blog: Freakonometrics - Tag - R-english.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)