Who is talking about the French Open?

[This article was first published on Maëlle, and kindly contributed to R-bloggers.]

I don’t think rOpenSci’s Jeroen Ooms can ever top the coolness of his magick package, but I have to admit the other things he’s developed are not bad at all. He’s recently been working on interfaces to Google’s Compact Language Detectors 2 and 3 (the latter being more experimental). I saw this cool use case and started thinking about other possible applications of the packages.

I was very sad when I realized it was too late to try and download tweets about the Eurovision Song Contest, but then I remembered there’s this famous tennis tournament going on right now, about which people probably tweet in various languages. I don’t follow the French Open myself, but it seemed interesting to find out which languages were the most prevalent, whether the results from the cld2 and cld3 packages agree with each other, and whether they agree with the language detection results from Twitter itself.

Getting the tweets

I’m using my usual rtweet recipe. I no longer need to open my eyes when downloading tweets.

# Search recent tweets with the tournament hashtag, excluding retweets
rg_tweets <- rtweet::search_tweets(q = "#RolandGarros2017",
                                   include_rts = FALSE,
                                   n = 18000)
save(rg_tweets, file = "data/2017-06-07-rolandgarros.RData")

I got 18000 tweets.

Using the language detectors

I decided to first clean the tweets a bit, removing hashtags, mentions, and at least part of the links.

rg_tweets <- dplyr::mutate(rg_tweets,
                           text = stringr::str_replace_all(text, "#.*$", ""),
                           text = stringr::str_replace_all(text, "https.*$", ""),
                           text = stringr::str_replace_all(text, "@.*$", ""),
                           text = stringr::str_replace_all(text, "#.* ", ""),
                           text = stringr::str_replace_all(text, "https.* ", ""),
                           text = stringr::str_replace_all(text, "@.* ", ""))
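One thing to keep in mind with the patterns above: on a single-line tweet, `"#.*$"` matches from the first hashtag to the end of the text, so any words after that hashtag are dropped too. If one wanted to keep them, a token-by-token cleanup is an option. The following is only a sketch in base R; `clean_tweet` is a hypothetical helper name, not something from rtweet or stringr, and it is not what was used for the results in this post.

```r
# Hypothetical helper (not part of the original analysis): remove links,
# hashtags and mentions one token at a time, keeping the surrounding words.
clean_tweet <- function(text) {
  text <- gsub("https?://\\S+", "", text, perl = TRUE)  # links
  text <- gsub("[#@]\\S+", "", text, perl = TRUE)       # hashtags and mentions
  text <- gsub("\\s+", " ", text, perl = TRUE)          # collapse leftover spaces
  trimws(text)
}

clean_tweet("Allez @rafaelnadal ! #RolandGarros2017 quel match https://t.co/abc123")
# "Allez ! quel match"
```

Keeping more words per tweet would probably change how often the detectors return a language, so the percentages reported later would not be directly comparable.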

Today I’m a happy naive user of language detectors, but more technical details can be found in their respective READMEs. It’d be difficult to have an easier interface than that of the cld2 and cld3 packages. They’re also fast, although I haven’t timed the following chunk, so you’ll have to believe me or test the packages yourselves.

Note that cld2 and cld3 both have functions for outputting several languages instead of one, with the associated reliability, but I won’t use them since I want a direct comparison with the Twitter output of one language per tweet.

rg_tweets <- dplyr::mutate(rg_tweets,
                           cld2_language = cld2::detect_language(text,
                                                                 lang_code = TRUE),
                           cld3_language = cld3::detect_language(text))
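For reference, here is roughly what calling the multi-language variants could look like. This is a sketch that assumes both packages are installed and that `detect_language_mixed()` behaves as described in their documentation at the time of writing. The example sentence is made up, and I haven't used these functions for the analysis in this post.

```r
# Made-up example text mixing two languages
text <- "Merci beaucoup! Thank you very much for this wonderful match."

if (requireNamespace("cld2", quietly = TRUE)) {
  # Assumption: the result is a list with a `classification` element
  # describing the top candidate languages
  mixed2 <- cld2::detect_language_mixed(text)
  print(mixed2$classification)
}

if (requireNamespace("cld3", quietly = TRUE)) {
  # Assumption: `size` caps the number of candidate languages returned
  mixed3 <- cld3::detect_language_mixed(text, size = 3)
  print(mixed3)
}
```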

Before analysing the results, I’ll transform the Twitter-detected language a bit: for an undetermined language it’s not NA, it’s “und”.

# Recode Twitter's "und" (undetermined) as a proper missing value
rg_tweets <- dplyr::mutate(rg_tweets,
                           lang = ifelse(lang == "und", NA, lang))

Looking at detected languages

Twitter output a language for 96% of the tweets, cld2 for 59% and cld3 for 84% of them.
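Percentages like these are simply the share of non-missing values in each language column. A minimal sketch on made-up data (the `toy` data frame below stands in for `rg_tweets` and its values are invented):

```r
# Toy data standing in for rg_tweets; values are invented for illustration
toy <- data.frame(
  lang          = c("en", "fr", NA, "es"),
  cld2_language = c("en", NA, NA, "es"),
  cld3_language = c("en", "fr", NA, NA),
  stringsAsFactors = FALSE
)

# Share of tweets for which each detector returned a language
coverage <- vapply(toy, function(x) mean(!is.na(x)), numeric(1))
round(100 * coverage)
```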

Let’s see a few cases in which Twitter outputs a language whereas the others don’t.

As a side note, I learnt how to insert a DT::datatable from Daniela Vázquez after admiring one she had put in this cool blog post. This was an adventure in htmlwidgets hell. I started looking at Daniela’s GitHub blog repo, then talked with her on the R-Ladies Slack, and I told her I’d do more the next day. I woke up to a PR solving all problems! Thanks a lot Daniela, and also thanks to your husband! Update: Daniela’s husband Gervasio came up with a fix not involving custom Jekyll plugins after I realized my site couldn’t be built on GitHub Pages… I am so thankful for their help!

library("magrittr")
rg_tweets %>%
  # keep tweets for which Twitter returned a language but cld2 and cld3 did not
  dplyr::filter(!is.na(lang),
                is.na(cld2_language), is.na(cld3_language)) %>%
  dplyr::select(text, lang, cld2_language,
                cld3_language) %>%
  head(n = 50) %>%
  DT::datatable()


It seems that the tweets whose language cld2 and cld3 couldn’t determine are quite short. The Twitter language detector might be more geared towards short sentences, which, given the length of tweets, wouldn’t be surprising. Moreover, maybe it often outputs a language even when uncertain. Take the word “Merci”: it’s French but also used in Catalan, at least in Barcelona, so to me that seems uncertain. Some other tweets to which Twitter, but not the other language detectors, assigned a language are a mix of languages.
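The impression that these texts are short could be checked by comparing text lengths between the two groups. I only eyeballed the table, so the following is a sketch on invented data rather than a result:

```r
# Invented example texts; in the real data this would be rg_tweets
toy <- data.frame(
  text = c("Merci", "Vamos!",
           "What a great match by the defending champion today",
           "Quel match incroyable sur le court central"),
  cld2_language = c(NA, NA, "en", "fr"),
  stringsAsFactors = FALSE
)

# Median number of characters, by whether cld2 returned a language
toy$undetermined <- is.na(toy$cld2_language)
med <- tapply(nchar(toy$text), toy$undetermined, median)
med
```

Here `med[["TRUE"]]` is the median length of the texts without a detected language and `med[["FALSE"]]` that of the others.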

Let’s have a look at lines with disagreements when no language information is missing.

rg_tweets %>%
  dplyr::filter(!is.na(lang),
                !is.na(cld2_language),
                !is.na(cld3_language)) %>%
  dplyr::group_by(text) %>%
  # keep texts for which the three detectors do not all agree
  dplyr::filter(length(unique(c(lang, cld2_language,
                                cld3_language))) != 1) %>%
  head(n = 50) %>%
  dplyr::select(text, lang, cld2_language,
                cld3_language) %>%
  DT::datatable()


Similarly, I think these tweets are quite short. Moreover, the languages often seem to be different but not that different, e.g. “es” (Spanish) and “ca” (Catalan), or “es” and “gl” (Galician). I sometimes make similar mistakes, such as when I say I’ve heard a “Nordic language” because I couldn’t identify it further than “not Swedish” (which I should be able to recognize).

I can also look at dissimilarities, computed on tweets with determined language.

dplyr::select(rg_tweets, lang, cld2_language, cld3_language) %>%
  dplyr::mutate_all(as.factor) %>%
  # transpose so that each detector is a row and each tweet a column
  t() %>%
  # daisy() treats factor columns as nominal variables
  as.data.frame(stringsAsFactors = TRUE) %>%
  cluster::daisy()
## Dissimilarities :
##                     lang cld2_language
## cld2_language 0.09870999              
## cld3_language 0.41158430    0.16547982
## 
## Metric :  mixed ;  Types
## Number of objects : 3

Unsurprisingly, the cld3 results agree less well with the other two than cld2 and Twitter agree with each other. I say I’m not surprised because cld3 is still an experimental language detector.

Representing languages

I’m going to assume that if Twitter and cld2 agree on the language assigned to a tweet, then it’s quite reliable.

agreed <- dplyr::filter(rg_tweets,
                        lang == cld2_language)

I’m therefore only considering 9432 tweets out of the original 18000 tweets.

# Count tweets per language and sort from most to least frequent.
# n() and desc() are namespaced because dplyr itself is not attached.
agreed <- dplyr::group_by(agreed, lang)
agreed <- dplyr::summarize(agreed, tweets_count = dplyr::n())
agreed <- dplyr::ungroup(agreed)
agreed <- dplyr::arrange(agreed, dplyr::desc(tweets_count))
# Fix the factor levels in that order so the plot keeps it
agreed <- dplyr::mutate(agreed, lang = factor(lang, ordered = TRUE, levels = unique(lang)))
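The last step matters for the plot: the factor levels are set in order of decreasing count, so the bars appear from the most to the least frequent language instead of alphabetically. The same idea in base R, as a sketch on invented labels (`agreed_toy` is a made-up name):

```r
lang <- c("en", "fr", "en", "es", "en", "fr")  # invented labels

# Count per language, most frequent first
counts <- sort(table(lang), decreasing = TRUE)

# Levels follow the sorted order rather than the alphabetical one
agreed_toy <- data.frame(
  lang = factor(names(counts), levels = names(counts), ordered = TRUE),
  tweets_count = as.integer(counts)
)
levels(agreed_toy$lang)
# "en" "fr" "es"
```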

Let’s plot the results.

library("ggplot2")
library("hrbrthemes")
ggplot(agreed) +
  geom_col(aes(lang, tweets_count)) +
  theme_ipsum(base_size = 20,
              axis_title_size = 20) +
  xlab("Detected language") +
  ylab("No. of tweets in the sample")

The most represented languages, English, French and Spanish, are surely a result of who’s on Twitter, which languages are the most spoken on Earth, who’s interested in tennis (or in the tennis players) and who’s awake when the tournament happens. One way to control for the timezone of the tournament would be to stream tweets during each of the Grand Slam tournaments. Another extension of this small blog post would be to look for players’ names in tweets and then see if one can find an association between the most mentioned players in a language and the nationality of these players. This could even be coupled with a sentiment analysis (you could support one player and criticize the others). Then again, that’s something that’d be even more interesting in my opinion if applied to Eurovision contestants instead! Next year maybe…
