Emojis Analysis in R

[This article was first published on Opiate for the masses, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A while ago I developed and shared an emoji decoder because I was facing problems when retrieving data from Twitter and Instragram. In a nutshell, the issue is that R encodes emojis in a way that makes it a hassle identifying them. This is where the decoder/dictionary comes into play.

After I put together my decoder, new emojis have been released. For example, certain emojis came with different skin colours, which adds a little bit of complexity to the analysis. In fact, emojis with skin tone information consist of two unicode codepoints: the codepoint for the emoji (e.g. “U+1F466” for boy) and the codepoint for the respective skin tone (e.g. “U+1F3FB” for light skin tone). The emoji “boy: light kin tone” thus appears as “U+1F466 U+1F3FB”. The descriptions between the plain emoji and the emoji with skin tone information differ slightly (e.g. “princess” vs “princess: light skin tone”), in which case simple string matching would fail to identify these as being the same emoji. It’s gonna require some special attention when cleaning the data. More on that in the code. This list is a more complete version than the one I used for my decoder. Felipe released a new decoder based on the new list which I will use in the post.

Alright, so with the decoder at hand, we’re able to identify the emojis in, say, a tweet retrieved with the twitteR package. What now? Quite some people contacted me since I released the article asking for advice concerning emojis analysis and I’d like to cover some questions in this post. The whole code I used for the analysis in this article is available here.

Most used emoji

One such question was how to determine a users most used emoji.

We’ll start collecting some sample data to perform our analysis on. In case you didn’t already know, Paris Hilton is my favorite victim when it comes to emojis analysis or social media analysis, for the simple reason that she uses as lot ot them and shares a lot of content. The emojis_matching function is the heart of all the analysis performed in this post.

# get some sample data
usermedia <- userTimeline(user = "parishilton", n = 3200) %>%
  twListToDF
# convert to a format we can work with (very important!)
usermedia$text <- iconv(usermedia$text, from = "latin1", to = "ascii", sub = "byte")

# rank emojis by occurence in data, super basic
rank <- emojis_matching(usermedia$text, matchto, description) %>% 
  group_by(description) %>% 
  summarise(n = sum(count)) %>%
  arrange(-n)

head(rank, 10)
# A tibble: 10 × 2
                       description     n
                             <chr> <int>
1                         sparkles   386
2                         princess   101
3                     party popper    69
4                             fire    53
5  people with bunny ears partying    51
6                        red heart    46
7                    musical notes    38
8                        palm tree    38
9                  sparkling heart    37
10                         balloon    34

Paris Hilton’s favorite emoji is: SPARKLES! Who would have thought 😉 Her alltime favorite MUSICAL NOTES landed on place 7 (status quo 2017-03-24).

Tweet with most emojis

Another possible use case is to determine which tweet contains the most emojis. We can use the emojis_matchingfunction for this question again and than arrange by descending count. Done. Easy peasy lemon squeezy. This is how the output looks like:

From here, it’s easy to calculate the average number of emojis per tweet:

mean(tweets$n, na.rm = TRUE)
[1] 3.646341

Sentiment analysis with emojis

Doing some research for this article, I came accross this paper, which extensively analyzes the valence (positive, negative, neutral) of emojis in a scientific manner. The authors made a csv file available containing all the emojis with their respective valences. I will base the sentiment analysis on this file.

For a reason I ignore, the csv file available doesn’t contain the sentiment score. The article gives guidance as how to compute the senteiment score of each emoji based on the data in the csv file, but at this point I prefer to scrape the list with the sentiment scores. It’s available here. After merging it with our firt emoji list to get the R encoding info, it looks like this:

str(emojis_merged)
'data.frame':	615 obs. of  12 variables:
 $ unicode        :Class 'u_char'  int [1:615] 169 174 8618 8986 8987 9193 9200 9203 9410 9642 ...
 $ char           : chr  "©" "®" "↪" "⌚" ...
 $ occurrences    : int  416 137 8 17 7 6 13 7 5 159 ...
 $ position       : num  0.74 0.353 0.727 0.649 0.797 0.732 0.646 0.829 0.765 0.145 ...
 $ negative       : num  0.131 0.071 0.091 0.15 0.3 0.111 0.188 0.4 0.25 0.099 ...
 $ neutral        : num  0.621 0.579 0.727 0.5 0.3 0.667 0.188 0.2 0.25 0.605 ...
 $ positive       : num  0.248 0.35 0.182 0.35 0.4 0.222 0.625 0.4 0.5 0.296 ...
 $ sentiment_score: num  0.117 0.279 0.091 0.2 0.1 0.111 0.438 0 0.25 0.198 ...
 $ description.x  : chr  "copyright sign" "registered sign" "rightwards arrow with hook" "watch" ...
 $ block          : chr  "Latin-1 Supplement" "Latin-1 Supplement" "Arrows" "Miscellaneous Technical" ...
 $ description.y  : chr  "copyright" "registered" "left arrow curving right" "watch" ...
 $ r.encoding     : chr  "<c2><a9>" "<c2><ae>" "<e2><86><aa>" "<e2><8c><9a>" ...

Ok, so we have a list of emojis with their respective unicode codepoints and sentiment scores. The next step consists of matching sentiments to the tweets.

sentiments <- emojis_matching(usermedia$text, emojis_merged$r.encoding, 
  emojis_merged$description.x, emojis_merged$sentiment_score) %>%
  mutate(sentiment = count*as.numeric(sentiment)) %>%
  group_by(text) %>% 
  summarise(sentiment_score = sum(sentiment))

What we get is a list of the tweets with there respective, aggregated sentiment scores:

Note that the score od a single tweet is the sum of the sentiment score of all emojis in the tweet. The higher the score, the more positive the tweet. One could put this number in relation to the number of emojis in the tweet or choose a more binary format like “positve” “negative”. Most of Paris Hilton’s tweets are positive, which was to be expected, with only two tweets identified as having a negative valence. Some tweets don’t have any sentiment score, this is due to the fact that they didn’t contain any (identifiable) emoji.

Emojis associated with words in tweets

One question that came to my mind was: what words appear in the same tweets as emojis? What are the top n words associated with each emoji? Before we can perform this kind of text analysis, we need to do the usual house keeping: clean the texts from links, strange characters, punctuation etc. I recommand having a look at the code to see what I exactly did, especially at the cleaning pipe. Furthermore, I wrote the function wordFreqEmojis that outputs a data frame of emojis with the top 5 words (default value, can be changed) they are used most frequently with.

# get emojis for each tweet and clean tweets
raw_texts <- emojis_matching(usermedia$text, matchto, description) %>% 
  select(-sentiment, -count) %>%
  mutate(text = cleanPosts(text)) %>%
  filter(text != "")
  
# get data frame with emojis and top words
words_emojis <- wordFreqEmojis(raw_texts, raw_texts$text, raw_texts$description) %>% 
  filter(!is.na(words))

Browsing through words_emojis, it’s obvious that the data makes a lot of sense. The emojis associated with the word “adios” are for instance “sun”, “water wave”, “bikini” and “airplane”. These words all indicate the twitter user is travelling.

In natural language programming, there are literally endless possibilities. One could fine tune the results by using other stopwords, working with word stems or considering ngrams instead of single words just to name a few. Besides words, one can also find cooccuring emojis.

List of further emojis analysis ideas

Summing up, here are some ideas for further analysis I didn’t implement in this article but can be done with emojis:

  • combine traditional text based sentiment analysis with emojis basd sentiment analysis
  • analyse coocccurence of emojis
  • identify topics based on emojis
  • track trends in emojis use (weekdays, time of the day, “happy” emojis vs “sad” emojis, etc.)

Final thoughts

Emoji analysis is unlikely to make a good job at replacing natural language processing in a sentiment analysis context. Emojis can help easily identify positive content, but they’re not so good at identifying negative or serious, business related content as far as I can tell. It makes sense since most of the emojis have a positive meaning. Also, not everyone makes the same use of emoji and not every positive tweet contains an emoji, so again, I don’t think it’s a good idea to base your sentiment analysis exclusively on emojis. I’d rather suggest to perform traditional sentiment analysis and enrich it with emoji data. Also, emoji analysis can become quite difficult due to the constantly growing number of emojis. Some of them have more than one unicode codepoint, this can be a challenge in the analysis.

In this article, I only showed how to perform simple positive/negative sentiment analysis and very basic association with words, but one could come up with much more detailed approaches. One could harvest much more information from emojis than just their level of positiveness. I’m thinking activities, different sentiments, patriotism, fondness for children, frequency of travelling, etc.

Emojis Analysis in R was originally published by Kirill Pomogajko at Opiate for the masses on March 24, 2017.

To leave a comment for the author, please follow the link and comment on their blog: Opiate for the masses.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)