The Happiest Emoticons

September 10, 2013

(This article was first published on Hot Damn, Data!, and kindly contributed to R-bloggers)

Clearly, a :) is happier than a :( but what about a :-* and a :-D ? Or a :-| and a :-o ? In this post I attempt to rank emoticons by how happy someone has to be to use each one. (And I punctuate horribly to avoid running punctuation into the emoticons.)

To start off, I need a collection of emoticons associated with some text. And where else would I find this but in that gigantic compendium of everyday emotions, the definitive corpus of our age: Twitter.
The methodology is this: I collect lots of tweets containing emoticons, assign each tweet a 'sentiment' score[1], and then order the emoticons by the average sentiment score of the tweets containing each one.
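
The ordering step can be sketched in a few lines of Python. This is only an illustration of "average score per emoticon", not the code behind the post: the function name `rank_emoticons`, the input layout, and the example scores are all made up. The tweet count is returned alongside each mean because an average over very few tweets is not very trustworthy.

```python
from collections import defaultdict
from statistics import mean

def rank_emoticons(scored_tweets):
    """Rank emoticons by the mean score of the tweets that contain them.

    scored_tweets: iterable of (emoticons_in_tweet, sentiment_score) pairs.
    Returns a list of (emoticon, mean_score, tweet_count), happiest first.
    """
    scores_by_emoticon = defaultdict(list)
    for emoticons, score in scored_tweets:
        # Count a tweet once per distinct emoticon it contains.
        for emoticon in set(emoticons):
            scores_by_emoticon[emoticon].append(score)
    ranking = [(e, mean(s), len(s)) for e, s in scores_by_emoticon.items()]
    return sorted(ranking, key=lambda row: row[1], reverse=True)

# Made-up scores, for illustration only.
example = [([":)"], 2), ([":)", ":-D"], 4), ([":("], -3), ([":)"], 0)]
for emoticon, avg, n in rank_emoticons(example):
    print(f"{emoticon}\t{avg:+.2f}\t(n={n})")
```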

The tweet gathering process is fairly direct. I parse tweets obtained from the streaming API[2], keep those which contain any of a set of predefined emoticons, and write them out to a file. If you want to, have a look at the Python code here. For the purpose of the R analysis, the tweet texts are already in a file. Each line is then (a) parsed for the emoticons it contains, and (b) assigned a sentiment score[3].
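
Steps (a) and (b) might look roughly like the following in Python. Treat it as a sketch under assumptions: the emoticon set `EMOTICONS` and the tiny word list `WORD_SCORES` are hypothetical stand-ins, since the post shows neither its emoticon list nor its lookup table, and the actual analysis was done in R.

```python
import re

# Hypothetical emoticon set and word scores; the post does not list its own.
EMOTICONS = [":-)", ":)", ":-(", ":(", ":-D", ":-*", ":*", ":-|", ":-o", "o.O", "8D", "xD"]
WORD_SCORES = {"good": 1, "fun": 1, "funny": 1, "wow": 1, "sad": -1, "bad": -1}

# Try longer emoticons first, in case one emoticon is a prefix of another.
EMOTICON_RE = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOTICONS, key=len, reverse=True))
)
WORD_RE = re.compile(r"[a-z']+")

def find_emoticons(text):
    """(a) Return every known emoticon that appears in the text."""
    return EMOTICON_RE.findall(text)

def score_text(text):
    """(b) Sum the lookup score of each word; unknown words count as 0."""
    return sum(WORD_SCORES.get(word, 0) for word in WORD_RE.findall(text.lower()))

line = "Wow I was sleeping sooooo good ... Now I can't go back to sleep :-("
print(find_emoticons(line), score_text(line))
```

A lookup scorer like this only counts words, so a tweet that starts happy and ends badly can still come out positive.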

Finally, we plot each tweet on an emoticon-score plot. Like so:

The tiny vertical black lines mark the mean score for each emoticon.
There is no ordering to the colour scale. The colours just help differentiate each row.

Okay, so here's a list of observations and (partial) explanations for some surprises:
  1. o.O and :* score higher than :-)
    I think the ubiquity of :-) is its burden. People feel :-) for all sorts of reasons. Also, the score for o.O is computed over a much smaller number of tweets, and is possibly unstable.
  2. I can understand people using :-) with sad stuff, but what kind of a person uses :-( for happy tweets? (There aren't many of these, but a couple of them are surprisingly far to the right.) Let's look at one of those tweets:

    Wow I was sleeping sooooo good which doesn't happen very often & They called from work & woke me up .. Now I can't go back to sleep :-(  

    That makes sense. It's a tweet that turned sour halfway through but, overall, had a pretty high density of positive words, so it's no surprise that our scorer tagged it with a positive score.
  3. Here's a tweet with an 8D in it:

    Got to take a pic with heage ! Who has by far been the most fun, funny and candid lecturer(in my... http://t.co/WaOTW8D2YO

    Notice anything funny? It's a happy tweet, but the emoticon we were looking for is conspicuously absent! Actually, the 8D does occur in the tweet, albeit in a URL: http://t.co/WaOTW8D2YO
    Thanks to Twitter's automatic URL shortening using t.co, it's entirely possible to see an arbitrary string of alphanumeric characters in a tweet that carries no semantic information. So be wary of the scores for stuff like 8D and xD.
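
One possible guard against this kind of false match, which the analysis in this post did not apply, is to strip URLs from the text before looking for emoticons. A minimal Python sketch follows; the pattern `URL_RE`, the two-emoticon `EMOTICON_RE`, and the function name are illustrative assumptions.

```python
import re

# Rough URL pattern: "http://" or "https://" up to the next whitespace.
URL_RE = re.compile(r"https?://\S+")
# Hypothetical subset of emoticons, just the two that look like letters.
EMOTICON_RE = re.compile(r"8D|xD")

def find_emoticons_outside_urls(text):
    """Remove URLs first, then look for emoticons in what is left."""
    return EMOTICON_RE.findall(URL_RE.sub(" ", text))

tweet = ("Got to take a pic with heage ! Who has by far been the most fun, "
         "funny and candid lecturer(in my... http://t.co/WaOTW8D2YO")
print(EMOTICON_RE.findall(tweet))          # naive match: ['8D']
print(find_emoticons_outside_urls(tweet))  # URL removed first: []
```

This only handles emoticons hidden in links. An 8D or xD buried inside an ordinary word would still match unless the pattern also required whitespace around the emoticon.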

So the next time you can't tell what someone is trying to convey with an emoticon, this chart might come in handy as a reference. In the meantime, if you're happy and you know it, contort your pupils o.O


Notes
[1] A linear scale where positive is happy and negative is unhappy.
[2] Twitter's Search API handles punctuation poorly, so that's not an option.
[3] Assignment of this score is done via a relatively simple lookup mechanism. This file provides a good evaluation.

