A quick #WorldEmojiDay exploration

[This article was first published on Colin Fay, and kindly contributed to R-bloggers].

Let’s celebrate #WorldEmojiDay with a quick exploration of my own Twitter account.

The ?

We’ll need:

From GitHub

  • {emo}

remotes::install_github("hadley/emo")

From CRAN

  • {dplyr}
  • {tidyr}
  • {rtweet}
  • {tidytext}

Note: This page has been created at:

Sys.time()
## [1] "2018-07-17 17:22:29 CEST"

The ?

Let’s get my last 3200 tweets:

library(emo)
library(rtweet)
library(dplyr)
## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

res <- get_timeline(
  "_ColinFay",
  n = 3200
)
names(res)
##  [1] "user_id"                 "status_id"              
##  [3] "created_at"              "screen_name"            
##  [5] "text"                    "source"                 
##  [7] "display_text_width"      "reply_to_status_id"     
##  [9] "reply_to_user_id"        "reply_to_screen_name"   
## [11] "is_quote"                "is_retweet"             
## [13] "favorite_count"          "retweet_count"          
## [15] "hashtags"                "symbols"                
## [17] "urls_url"                "urls_t.co"              
## [19] "urls_expanded_url"       "media_url"              
## [21] "media_t.co"              "media_expanded_url"     
## [23] "media_type"              "ext_media_url"          
## [25] "ext_media_t.co"          "ext_media_expanded_url" 
## [27] "ext_media_type"          "mentions_user_id"       
## [29] "mentions_screen_name"    "lang"                   
## [31] "quoted_status_id"        "quoted_text"            
## [33] "quoted_created_at"       "quoted_source"          
## [35] "quoted_favorite_count"   "quoted_retweet_count"   
## [37] "quoted_user_id"          "quoted_screen_name"     
## [39] "quoted_name"             "quoted_followers_count" 
## [41] "quoted_friends_count"    "quoted_statuses_count"  
## [43] "quoted_location"         "quoted_description"     
## [45] "quoted_verified"         "retweet_status_id"      
## [47] "retweet_text"            "retweet_created_at"     
## [49] "retweet_source"          "retweet_favorite_count" 
## [51] "retweet_retweet_count"   "retweet_user_id"        
## [53] "retweet_screen_name"     "retweet_name"           
## [55] "retweet_followers_count" "retweet_friends_count"  
## [57] "retweet_statuses_count"  "retweet_location"       
## [59] "retweet_description"     "retweet_verified"       
## [61] "place_url"               "place_name"             
## [63] "place_full_name"         "place_type"             
## [65] "country"                 "country_code"           
## [67] "geo_coords"              "coords_coords"          
## [69] "bbox_coords"             "status_url"             
## [71] "name"                    "location"               
## [73] "description"             "url"                    
## [75] "protected"               "followers_count"        
## [77] "friends_count"           "listed_count"           
## [79] "statuses_count"          "favourites_count"       
## [81] "account_created_at"      "verified"               
## [83] "profile_url"             "profile_expanded_url"   
## [85] "account_lang"            "profile_banner_url"     
## [87] "profile_background_url"  "profile_image_url"

Here is what the text column looks like:

res %>%
  pull(text) %>%
  .[1:5]
## [1] "@GoldbergData It adds a little label at the top left with the text you provide. \nCan be useful if you want to add some legends in a markdown / shiny app, for example"
## [2] "#RStats \nCool new feature in ggplot2 v3 — tagging plots : https://t.co/jFUqX2Tj5T"                                                                                    
## [3] "#RStats — A perfect introduction to \U0001f5fa with the {sf} \U0001f4e6 &amp; Co by @statnmap : \nhttps://t.co/IrmcSBDMDy https://t.co/m3TyUjrxYF"                     
## [4] "@vsbuffalo Amen to that"                                                                                                                                               
## [5] "#RStats — \U0001f680 Setting up RStudio Server, Shiny Server and PostgreSQL :\nhttps://t.co/J1Y7edNAj0"

As you can see, the emojis are not printed in the console but shown as
escape sequences such as \U0001f4e6. These are Unicode code points: the
numeric identifiers your machine uses to represent each emoji. I won’t go
deeper into this; here are two resources you can read if you want to know
more about encoding:
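To make those escape sequences a bit more concrete, here is a minimal sketch in base R (nothing from {emo} or {rtweet} is needed). It takes the package emoji seen in the tweets above and converts it to its code point and back; the variable names are mine, just for illustration.

```r
# Base R only; the package emoji from the tweets above is the example.
pkg <- "\U0001f4e6"

# utf8ToInt() returns the Unicode code point as an integer.
code_point <- utf8ToInt(pkg)
code_point

# Write it the way code points are usually written: "U+" followed by hex digits.
label <- sprintf("U+%X", code_point)
label

# intToUtf8() goes the other way, from the code point back to the character.
round_trip <- intToUtf8(code_point)
round_trip == pkg
```

Whether `pkg` itself prints as a picture or as an escape sequence depends on your console and locale, which is why the same value can look different from one machine to another.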

The ?

Let’s use the {emo} package to extract the emojis from the text.
Inspired by {stringr}, this package has a ji_extract_all() function
designed to extract all the emojis from a character vector.
We’ll use it on our text column, then keep the created_at and emo columns.
We then pass the result to tidyr::unnest() to get one row per emoji, which
also drops the empty emo rows (i.e., the tweets without an emoji).
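If the unnest() step feels a bit magic, here is a small sketch on made-up data. It assumes ji_extract_all() gives back a zero-length character vector for a tweet without emojis (the way stringr::str_extract_all() does); the `toy` tibble and its plain letters are placeholders, not real tweets.

```r
library(tidyr)

# Toy data (made up for illustration): three "tweets", the second one
# has no match, so its list element is an empty character vector.
toy <- tibble::tibble(
  id  = 1:3,
  emo = list(c("a", "b"), character(0), "c")
)

# unnest() emits one row per list element; the empty element for id 2
# contributes no rows, so that "tweet" disappears from the result.
flat <- unnest(toy, emo)
flat
```

In recent versions of {tidyr}, passing `keep_empty = TRUE` would keep such rows with an NA instead, which can be handy if you also want to count the tweets without emojis.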

library(tidyr)
emos <- res %>%
  mutate(
    emo = ji_extract_all(text)
  ) %>%
  select(created_at, emo) %>%
  unnest(emo)

emos
## # A tibble: 887 x 2
##    created_at          emo  
##    <dttm>              <chr>
##  1 2018-07-17 10:00:47 📦   
##  2 2018-07-17 08:35:05 🚀   
##  3 2018-07-16 18:47:25 😮   
##  4 2018-07-16 14:51:30 😁   
##  5 2018-07-16 14:51:16 😱   
##  6 2018-07-16 13:28:08 🍒   
##  7 2018-07-16 13:27:00 ?   
##  8 2018-07-16 13:27:00 ?   
##  9 2018-07-16 13:27:00 ?   
## 10 2018-07-16 13:25:01 ?   
## # ... with 877 more rows

emos %>%
  count(emo, sort = TRUE)
## # A tibble: 187 x 2
##    emo       n
##    <chr> <int>
##  1 🤔       84
##  2 📦       56
##  3 😬       51
##  4 🎉       50
##  5 😱       50
##  6 😇       42
##  7 ?       36
##  8 ?       35
##  9 ?       33
## 10 ?       28
## # ... with 177 more rows

So apparently, I use a lot of 🤔. But I also talk about 📦, which sounds
more appropriate 🙂

As you can see, {tibble} displays the elements as emojis when printing.
With a data.frame, you get the plain Unicode escape sequences instead:

emos %>%
  as.data.frame() %>%
  head()
##            created_at        emo
## 1 2018-07-17 10:00:47 \U0001f4e6
## 2 2018-07-17 08:35:05 \U0001f680
## 3 2018-07-16 18:47:25 \U0001f62e
## 4 2018-07-16 14:51:30 \U0001f601
## 5 2018-07-16 14:51:16 \U0001f631
## 6 2018-07-16 13:28:08 \U0001f352

The ?

Let’s flag all the emojis with their names:

emos %>%
  left_join(
    data.frame(
      emo = ji_name,
      name = names(ji_name)
    )
  ) %>%
  count(emo, name, sort = TRUE)
## Joining, by = "emo"

## Warning: Column `emo` joining character vector and factor, coercing into
## character vector

## # A tibble: 295 x 3
##    emo   name                       n
##    <chr> <fct>                  <int>
##  1 🤔    thinking                  84
##  2 🤔    thinking_face             84
##  3 📦    package                   56
##  4 😬    grimacing                 51
##  5 😬    grimacing_face            51
##  6 🎉    party_popper              50
##  7 🎉    tada                      50
##  8 😱    face_screaming_in_fear    50
##  9 😱    scream                    50
## 10 😇    innocent                  42
## # ... with 285 more rows
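Note that the table above has more rows than the plain count (295 vs 187), because one emoji can have several names. Here is a rough sketch of what happens, and one possible way to keep a single name per emoji. `toy_names` is a hypothetical stand-in for ji_name (a named character vector), so this runs without {emo}; keeping the first name is an arbitrary choice of mine.

```r
library(dplyr)

# Hypothetical stand-in for emo::ji_name: a named character vector in which
# several names can point at the same emoji.
toy_names <- c(
  thinking      = "\U0001f914",
  thinking_face = "\U0001f914",
  package       = "\U0001f4e6"
)

# Character columns on both sides, so the join has nothing to coerce.
lookup <- tibble::tibble(emo = unname(toy_names), name = names(toy_names))

# Made-up emoji occurrences, counted first so each emoji appears once.
toy_emos <- tibble::tibble(emo = c("\U0001f914", "\U0001f914", "\U0001f4e6"))
counts <- count(toy_emos, emo, sort = TRUE)

# Joining on the full lookup repeats an emoji once per name...
with_all_names <- left_join(counts, lookup, by = "emo")

# ...while keeping only the first name per emoji gives one row per emoji.
first_names <- distinct(lookup, emo, .keep_all = TRUE)
with_one_name <- left_join(counts, first_names, by = "emo")

nrow(with_all_names)
nrow(with_one_name)
```

Building the lookup as a tibble rather than with data.frame() also keeps both columns as character, which should avoid the factor coercion warning shown above.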

The ?

And finally, let’s see which words are most associated with the
emojis we just saw:

library(tidytext)
emos_with_id <- res %>%
  mutate(
    emo = ji_extract_all(text)
  ) %>%
  select(status_id, text, emo) %>%
  tidyr::unnest(emo)

emos_with_id %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  anti_join(proustr::stop_words) %>%
  anti_join(
    data.frame(
      word = c("https", "t.co", "https", "gt")
    )
  ) %>%
  count(emo, word, sort = TRUE)
## Joining, by = "word"
## Joining, by = "word"
## Joining, by = "word"

## Warning: Column `word` joining character vector and factor, coercing into
## character vector

## # A tibble: 5,660 x 3
##    emo   word          n
##    <chr> <chr>     <int>
##  1 ?    rstats       37
##  2 ?    rstats       27
##  3 ?    macbook      26
##  4 ?    package      20
##  5 ?    trans        18
##  6 ☕    pm           15
##  7 ?    pro          15
##  8 ?    marche       10
##  9 ?    ma_salmon    10
## 10 ?    ma_salmon    10
## # ... with 5,650 more rows

And what are the most used emojis with “rstats”?

emos_with_id %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  anti_join(proustr::stop_words) %>%
  anti_join(
    data.frame(
      word = c("https", "t.co", "https", "gt")
    )
  ) %>%
  count(emo, word, sort = TRUE) %>%
  filter(
    word == "rstats"
  )
## Joining, by = "word"
## Joining, by = "word"
## Joining, by = "word"

## Warning: Column `word` joining character vector and factor, coercing into
## character vector

## # A tibble: 81 x 3
##    emo   word       n
##    <chr> <chr>  <int>
##  1 ?    rstats    37
##  2 ?    rstats    27
##  3 ?    rstats     5
##  4 ?    rstats     4
##  5 ?    rstats     4
##  6 ?    rstats     4
##  7 ✍️    rstats     3
##  8 ?    rstats     3
##  9 ?    rstats     3
## 10 ⚡    rstats     2
## # ... with 71 more rows
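A side note on the coercion warning in the outputs above: it most likely comes from the data.frame() call, since before R 4.0.0 data.frame() turned strings into factors by default. Here is a small sketch of the same filtering step with a tibble instead. The `tokens` data is made up to stand in for the output of unnest_tokens(), and `url_debris` is just a name I picked for the custom stop words.

```r
library(dplyr)

# Made-up tokens standing in for the output of unnest_tokens().
tokens <- tibble::tibble(
  emo  = c("\U0001f4e6", "\U0001f4e6", "\U0001f680", "\U0001f680"),
  word = c("rstats", "https", "rstats", "t.co")
)

# A tibble keeps `word` as character, so the join has nothing to coerce.
url_debris <- tibble::tibble(word = c("https", "t.co", "gt"))

# Drop the URL leftovers, then count the remaining emoji/word pairs.
cleaned <- tokens %>%
  anti_join(url_debris, by = "word") %>%
  count(emo, word, sort = TRUE)

cleaned
```

Passing `by = "word"` explicitly also silences the "Joining, by" messages.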

Other cool functions

I recently discovered the ji_glue() function, which lets you easily
insert an emoji into a character vector:

ji_glue("I love to code :package:")
## I love to code 📦

ji_glue("Sometimes they make me :scream:")
## Sometimes they make me 😱

ji_glue("Sometimes they make me :cry:")
## Sometimes they make me 😢

ji_glue("Sometimes they make me :fear:")
## Sometimes they make me ?

ji_glue("But in the end I'm always :tada:")
## But in the end I'm always 🎉

The ji() function can also be used inside your markdown, so you can
write:

“I hate backtick r emo::ji("bug") backtick”, and it will come out as: “I
hate 🐛”.

(Of course, replace backtick with actual backticks 🙂.)

That’s all folks ?

That’s all for today! Now have a nice emoji day ?

