Does "the twitter ratio" apply to the #rstats community?

Posted on January 28, 2018 by Daniela Vazquez in R bloggers | 0 Comments

[This article was first published on d4tagirl, and kindly contributed to R-bloggers].


Not long ago I came across a FiveThirtyEight post called “The Worst Tweeter In Politics Isn’t Trump”. Well, it was actually a long time ago, but this project had been lying around on my computer for a while ¯\_(ツ)_/¯. They gathered several media pieces arguing that tweets that get more replies than likes and retweets are the ones that make the community angry. This phenomenon is known as “The Ratio”, as Luke O’Neil wrote recently in Esquire.

FiveThirtyEight used a ternary plot to illustrate the proportion of replies, retweets and likes of every Trump tweet. In this post I’m going to plot tweets with the rstats hashtag, suspecting that the ones that have a higher ratio of replies might be an exception to this rule since conversations tend to be pretty friendly in this community. But let’s find out!
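To make the three proportions concrete, here is a minimal sketch of how they could be computed for each tweet. The counts are made up for illustration (they are not real tweets), and the names `toy_tweets`, `toy_shares` and `reply_count` are assumptions of this example, not something rtweet returns here.

```r
library(dplyr)

# Made-up engagement counts for three hypothetical tweets (not real data)
toy_tweets <- data.frame(
  status_id      = c("a", "b", "c"),
  reply_count    = c(2, 30, 0),
  retweet_count  = c(10, 5, 4),
  favorite_count = c(28, 15, 16),
  stringsAsFactors = FALSE
)

# Each tweet's share of replies, retweets and likes; the three shares sum to 1,
# which is what lets a tweet be placed as a single point on a ternary plot
toy_shares <- toy_tweets %>%
  mutate(total         = reply_count + retweet_count + favorite_count,
         reply_share   = reply_count / total,
         retweet_share = retweet_count / total,
         like_share    = favorite_count / total)

toy_shares
```

In this toy data, tweet "b" is the only one where replies outweigh likes and retweets combined, so it is the kind of tweet "The Ratio" is about.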

Disclaimer

It wasn’t until I had this post ready for publishing that I realized the replies the media were discussing were only the direct ones, not counting the replies to the replies, so I just invented a new ratio. I put in a great deal of extra work to consider all the replies (direct and indirect), but I liked the way it turned out and the way I had to solve some problems, so I’ll stick with my personal definition of the ratio, knowing it’s not what it’s supposed to be.

Retrieving the data

I’m becoming more and more of a fan of the rtweet package, built and maintained by Michael W. Kearney. It’s the way to go when you want to interact with Twitter’s API using R. I fetch some tweets with #rstats to analyze!

library(rtweet)
library(dplyr)
tweets_rstats <- search_tweets(q = "#rstats",
                                include_rts = FALSE,
                                n = 300)

tweets_rstats <- tweets_rstats %>%
  distinct()

I keep only the original tweets with at least two likes because I want to keep relevant tweets. This is probably too arbitrary and surely can be improved, but here I go.

orig_tweets <- tweets_rstats %>% 
  filter(is.na(reply_to_status_id),
         favorite_count > 1) %>%    
  select(status_id, screen_name, text, favorite_count, retweet_count) %>%
  distinct()

I already have the number of retweets and the number of likes of each original tweet, but not the number of replies. To build the ternary plot (or pyramid, as I prefer to call it) I need the number of replies as well. The API doesn’t have a direct method for this, so I have to do it by hand.

Here comes the purrr part. The purrr package is receiving a lot of love this year: there is a group sharing the #purrrResolution, courtesy of Isabella Ghement, that you can join:

Just sent out the first group e-mail concerning the #purrrResolution #rstats #purrr – if you haven't received it, it means you are not yet on the list. To join the list, you can e-mail me ([email protected]). Keep on purrring!
— Isabella R. Ghement (@IsabellaGhement) January 5, 2018

And Colin Fay created the Twitter collection #RStats — Your daily dose of #purrr with great tips!

I collect all the mentions of every screen_name in the orig_tweets dataframe. I use distinct(screen_name) because I don’t want to call the API more than once for each screen_name.

library(purrr)
library(tidyr)

orig_tweets_mentions <- orig_tweets %>%
  distinct(screen_name) %>%           
  mutate(query = paste0("@", screen_name, " OR ", "to:", screen_name, " OR ", screen_name)) %>%
  mutate(tweets = pmap(list(q = .$query,
                            n = 1000,
                            retryonratelimit = TRUE),
                       rtweet::search_tweets)) %>%
  select(tweets) %>%
  unnest(tweets)   # name the list-column explicitly; a bare unnest() is deprecated in recent tidyr

Here I’m joining the conversation by using the pmap function to fetch all the mentions of every screen_name in the orig_tweets dataframe. The search API only returns tweets from roughly the last week (the rtweet documentation says 6 to 9 days), but it should suffice. As Lucy pointed out in her post about Twitter Trees, querying the API using only to:screen_name misses some tweets, so I took her recommendation of also including @screen_name and the bare screen_name, joined with OR. You will notice that I took a lot of ideas from her blog post, which I highly recommend if you like to work with Twitter conversations.

I need to apply the rtweet::search_tweets function to each screen_name in the orig_tweets dataframe, passing more than one argument to the function: pmap is the answer! You give pmap a list of arguments and it passes them on to search_tweets, one call per row. In this case I pass q: the query, n: the number of tweets I want, and retryonratelimit: set to TRUE so that it waits and retries when rate limited.
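If you want to see what pmap does here without touching the API, the following sketch swaps search_tweets for a stand-in. The function `fake_search` and the screen names `user_one` and `user_two` are hypothetical and exist only for this example; the stand-in just echoes the arguments it receives, so you can check that each query gets its own call while the length-one arguments n and retryonratelimit are recycled.

```r
library(purrr)
library(dplyr)
library(tidyr)

# Hypothetical stand-in for rtweet::search_tweets: no API call,
# it only returns the arguments it was called with
fake_search <- function(q, n, retryonratelimit) {
  data.frame(query = q, n_requested = n, retry = retryonratelimit,
             stringsAsFactors = FALSE)
}

# Made-up screen names, one row per account
toy_users <- data.frame(screen_name = c("user_one", "user_two"),
                        stringsAsFactors = FALSE)

toy_results <- toy_users %>%
  mutate(query = paste0("@", screen_name, " OR to:", screen_name, " OR ", screen_name)) %>%
  # pmap calls fake_search once per query; n and retryonratelimit are recycled
  mutate(tweets = pmap(list(q = query, n = 1000, retryonratelimit = TRUE),
                       fake_search)) %>%
  select(tweets) %>%
  unnest(tweets)

toy_results
```

The result has one row per screen name, each with its own query string and the same n and retry values, which mirrors how the real call produces one batch of tweets per account.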

This is what I get:
