The Happiest Emoticons
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
To start off, I need a collection of emoticons associated with some text. And where else would I find this, but that gigantic compendium of everyday emotions, the definitive corpus of our age – Twitter.
The methodology is this: I collect lots of tweets containing emoticons, assign each tweet a ‘sentiment’ score[1], and then order the emoticons by the average sentiment score of the tweets containing them.
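The ordering step can be sketched roughly as follows. This is an illustration in Python, not the code used for this post: the function name `rank_emoticons`, the input format (one `(emoticons, score)` pair per tweet), and the toy data are all assumptions.

```python
from collections import defaultdict

def rank_emoticons(scored_tweets):
    """Order emoticons by the mean sentiment score of the tweets containing them.

    scored_tweets: iterable of (emoticons, score) pairs, one pair per tweet
    (a hypothetical input format chosen for this sketch).
    Returns a list of (emoticon, mean_score, tweet_count), happiest first.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for emoticons, score in scored_tweets:
        # set() so a tweet that repeats an emoticon is counted only once
        for emo in set(emoticons):
            totals[emo] += score
            counts[emo] += 1
    rows = [(emo, totals[emo] / counts[emo], counts[emo]) for emo in totals]
    return sorted(rows, key=lambda row: row[1], reverse=True)

# Toy data, invented for illustration only
sample = [([":)"], 2.0), ([":)"], 0.0), ([":("], -1.0), (["o.O"], 3.0)]
ranking = rank_emoticons(sample)
```

Returning the tweet count next to each mean makes it easy to spot averages that rest on only a handful of tweets.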
The tweet gathering process is fairly direct. I parse tweets obtained from the streaming API[2] which contain any of a set of predefined emoticons and write them out to a file. If you want to, have a look at the Python code here. For the purpose of the R analysis, the tweet texts are already in a file. Each line is then (a) parsed for the emoticons it contains, and (b) assigned a sentiment score[3].
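Steps (a) and (b) might look something like the sketch below. It is written in Python for illustration and is not the R code behind this post. The emoticon list, the tiny word lexicon and its weights, and the function names are all assumptions; the only thing taken from the post is that scoring is a simple lookup.

```python
import re

# Assumed emoticon set; the post's actual predefined list is not shown.
EMOTICONS = [":)", ":(", ":*", "o.O", "8D", "xD"]

# Try longer emoticons first so a longer one is never shadowed by a shorter prefix.
EMOTICON_RE = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOTICONS, key=len, reverse=True))
)

# Toy word-to-score lexicon, invented for this sketch.
WORD_SCORES = {"good": 3, "fun": 4, "funny": 4, "happy": 3, "sad": -2, "bad": -3}

def find_emoticons(text):
    """(a) Return every known emoticon found in the text, in order."""
    return EMOTICON_RE.findall(text)

def score_text(text):
    """(b) Sum the lexicon scores of the words; unknown words score 0."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(WORD_SCORES.get(word, 0) for word in words)

def process_line(line):
    """Handle one line of the tweet file: emoticons plus sentiment score."""
    return find_emoticons(line), score_text(line)

emoticons, score = process_line("What a fun, happy day :)")
```

A word-sum scorer like this only counts words, so a tweet with many positive words scores high even if its overall mood is negative.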
Finally, we plot each tweet on an emoticon-score plot. Like so:
- o.O and :* score higher than :)
  I think the ubiquity of :) is its burden. People feel :) for all sorts of reasons. Also, the score for o.O is computed over a much smaller number of tweets, and is possibly unstable.
- I can understand people using :) at sad stuff, but what kind of a person uses :( for happy tweets? (There aren’t many of these, but a couple of them are too far right.) Let’s look at one of those tweets:
Wow I was sleeping sooooo good which doesn’t happen very often & They called from work & woke me up .. Now I can’t go back to sleep :(
That makes sense. It’s a tweet that turned sour halfway through but, overall, had a pretty high density of positive words, so it’s no surprise that our scorer tagged it with a positive score.
- Here’s a tweet with an 8D in it:
Got to take a pic with heage ! Who has by far been the most fun, funny and candid lecturer(in my… http://t.co/WaOTW8D2YO
Notice anything funny? It’s a happy tweet, but the emoticon we were looking for is conspicuously absent! Actually, the 8D does occur in the tweet – albeit in a URL: http://t.co/WaOTW8D2YO
Thanks to Twitter’s automatic URL shortening using t.co, it’s entirely possible for a tweet to contain an arbitrary string of alphanumeric characters that carries no semantic information. So be wary of the scores for stuff like 8D and xD.
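One possible way to guard against this, sketched in Python, is to drop URLs before looking for emoticons and to accept an emoticon only when it stands alone as a whitespace-delimited token. This is not what the analysis above does. The helper names are hypothetical and the first example is a shortened version of the tweet quoted earlier.

```python
import re

# Anything starting with http:// or https:// up to the next whitespace
URL_RE = re.compile(r"https?://\S+")

def strip_urls(text):
    """Replace URLs with a space so their characters cannot match emoticons."""
    return URL_RE.sub(" ", text)

def contains_emoticon(text, emoticon):
    """True if the emoticon appears as its own token outside any URL."""
    # (?<!\S) and (?!\S): no non-space character directly before or after
    pattern = r"(?<!\S)" + re.escape(emoticon) + r"(?!\S)"
    return re.search(pattern, strip_urls(text)) is not None

in_url = contains_emoticon("Got to take a pic with heage ! http://t.co/WaOTW8D2YO", "8D")
standalone = contains_emoticon("that lecture was great 8D", "8D")
```

The token rule is deliberately strict: it would also skip an emoticon glued to a word, as in "great:)", so it trades some recall for fewer false matches.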
Notes
[1] A linear scale where positive is happy, negative is unhappy
[2] Twitter’s Search API handles punctuation poorly, so that’s not an option
[3] Assignment of this score is done via a relatively simple lookup mechanism. This file provides a good evaluation