To start off, I need a collection of emoticons associated with some text. And where else would I find this, but that gigantic compendium of everyday emotions, the definitive corpus of our age – Twitter.
The methodology is this: I collect lots of tweets containing emoticons, assign each tweet a ‘sentiment’ score, and then order the emoticons by the average sentiment score of the tweets containing each one.
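To make the ordering step concrete, here is a minimal Python sketch. It assumes each tweet has already been reduced to a pair of (emoticons found, sentiment score). The function name `rank_emoticons` and the sample data are made up for illustration and are not taken from the actual analysis, which was done in R.

```python
from collections import defaultdict


def rank_emoticons(scored_tweets):
    """Order emoticons by the mean score of the tweets that contain them.

    scored_tweets: iterable of (list_of_emoticons, score) pairs (assumed shape).
    Returns a list of (emoticon, mean_score), highest mean first.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for emoticons, score in scored_tweets:
        # Count each emoticon once per tweet, even if it is repeated.
        for emoticon in set(emoticons):
            totals[emoticon] += score
            counts[emoticon] += 1
    means = {e: totals[e] / counts[e] for e in totals}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


# Toy data, invented purely for illustration.
sample = [([":)"], 2.0), ([":("], -3.0), ([":)", ":*"], 4.0), ([":("], -1.0)]
ranking = rank_emoticons(sample)
```

An emoticon that appears in only a handful of tweets gets an average that swings a lot, so it can be worth keeping `counts` around when reading the ranking.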
The tweet-gathering process is fairly straightforward. I take tweets from the streaming API, keep those which contain any of a set of predefined emoticons, and write them out to a file. If you want to, have a look at the Python code here. For the purpose of the R analysis, the tweet texts are already in a file. Each line is then (a) parsed for the emoticons it contains, and (b) assigned a sentiment score.
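Steps (a) and (b) might look something like the following Python sketch. Everything in it is an assumption made for illustration: the `EMOTICONS` list, the tiny `WORD_SCORES` table, and the choice to sum word scores are stand-ins, not the actual emoticon set or lookup file used here.

```python
import re

# Hypothetical emoticon set; the real predefined list may differ.
EMOTICONS = [":)", ":(", ":*", "o.O", "8D", "xD"]

# Toy word-to-score lookup, invented for this example.
# Positive means happy, negative means unhappy.
WORD_SCORES = {"good": 3, "fun": 4, "happy": 3, "sad": -2, "hate": -3}


def find_emoticons(text):
    """(a) Return the known emoticons that occur anywhere in the text."""
    return [e for e in EMOTICONS if e in text]


def score_text(text):
    """(b) Sum the lookup scores of the words in the text (unknown words score 0)."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(WORD_SCORES.get(word, 0) for word in words)


line = "Happy and fun, not sad :)"
found = find_emoticons(line)
score = score_text(line)
```

Note that a plain word lookup like this ignores negation and word order, which is why a tweet that turns sour halfway through can still come out positive.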
Finally, we plot each tweet on an emoticon-score plot. Like so:
- o.O and :* score higher than 🙂
I think the ubiquity of 🙂 is its burden. People feel 🙂 for all sorts of reasons. Also, the score for o.O is computed over a much smaller number of tweets, and is possibly unstable.
- I can understand people using 🙂 for sad stuff, but what kind of a person uses 🙁 for happy tweets? (There aren’t many of these, but a couple of them are quite far to the right.) Let’s look at one of those tweets:
Wow I was sleeping sooooo good which doesn’t happen very often & They called from work & woke me up .. Now I can’t go back to sleep 🙁
That makes sense. It’s a tweet that turned sour halfway through but, overall, had a pretty high density of positive words, so it’s no surprise that our scorer tagged it with a positive score.
- Here’s a tweet with an 8D in it:
Got to take a pic with heage ! Who has by far been the most fun, funny and candid lecturer(in my… http://t.co/WaOTW8D2YO
Notice anything funny? It’s a happy tweet, but the emoticon we were looking for is conspicuously absent! Actually, the 8D does occur in the tweet, albeit in a URL: http://t.co/WaOTW8D2YO
Thanks to Twitter’s automatic URL shortening using t.co, it’s entirely possible to see an arbitrary collection of alphanumeric characters in a tweet without any semantic information. So be wary of the scores for alphanumeric emoticons like 8D and xD.
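One way to guard against this kind of false match is to strip URLs from the tweet text before looking for emoticons. The Python sketch below illustrates the idea; it was not part of this analysis, and the names `URL_PATTERN`, `strip_urls` and `contains_emoticon` are made up for the example.

```python
import re

# Matches http:// or https:// followed by any run of non-whitespace characters.
URL_PATTERN = re.compile(r"https?://\S+")


def strip_urls(text):
    """Replace every URL in the text with a single space."""
    return URL_PATTERN.sub(" ", text)


def contains_emoticon(text, emoticon):
    """Check for the emoticon only outside of URLs."""
    return emoticon in strip_urls(text)


tweet = "Got to take a pic with heage ! http://t.co/WaOTW8D2YO"
naive_match = "8D" in tweet                      # substring search hits the URL
filtered_match = contains_emoticon(tweet, "8D")  # URL removed before searching
```

This only removes matches inside links. Alphanumeric emoticons can still turn up inside ordinary words or usernames, so some caution about 8D and xD remains in order.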
 A linear scale where positive is happy and negative is unhappy.
 Twitter’s Search API handles punctuation poorly, so that’s not an option.
 This score is assigned via a relatively simple lookup mechanism. This file provides a good evaluation.