Statistics Sunday: Getting Started with the Russian Tweet Dataset


IRA Tweet Data

You may have heard that two researchers at Clemson University analyzed almost 3 million tweets from the Internet Research Agency (IRA) – a “Russian troll factory”. In partnership with FiveThirtyEight, they made all of their data available on GitHub. So of course, I had to read the files into R, which I was able to do with this code:

files <- c("IRAhandle_tweets_1.csv",
           "IRAhandle_tweets_2.csv",
           "IRAhandle_tweets_3.csv",
           "IRAhandle_tweets_4.csv",
           "IRAhandle_tweets_5.csv",
           "IRAhandle_tweets_6.csv",
           "IRAhandle_tweets_7.csv",
           "IRAhandle_tweets_8.csv",
           "IRAhandle_tweets_9.csv")
my_files <- paste0("~/Downloads/russian-troll-tweets-master/",files)

library(tidyverse)

# read a single CSV file into a tibble
each_file <- function(file) {
  read_csv(file)
}

tweet_data <- NULL
for (file in my_files) {
  temp <- each_file(file)
  temp$id <- sub(".csv", "", file)   # record which file each tweet came from
  tweet_data <- rbind(tweet_data, temp)
}
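(As an aside, the same import can be written more compactly with purrr, which is loaded with the tidyverse. This is just a sketch using the my_files vector from above.)

tweet_data <- my_files %>%
  set_names() %>%
  map_dfr(read_csv, .id = "id") %>%   # row-bind the files, tracking the source file in "id"
  mutate(id = sub(".csv", "", id))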

Note that this is a large dataset, with 2,973,371 observations of 16 variables. Let's do some cleaning of this dataset first. The researchers, Darren Linvill and Patrick Warren, identified 5 major types of trolls:
  • Right Troll: These Trump-supporting trolls voiced right-leaning, populist messages, but “rarely broadcast traditionally important Republican themes, such as taxes, abortion, and regulation, but often sent divisive messages about mainstream and moderate Republicans…They routinely denigrated the Democratic Party, e.g. @LeroyLovesUSA, January 20, 2017, ‘#ThanksObama We're FINALLY evicting Obama. Now Donald Trump will bring back jobs for the lazy ass Obamacare recipients,’” the authors wrote.
  • Left Troll: These trolls mainly supported Bernie Sanders, derided mainstream Democrats, and focused heavily on racial identity, in addition to sexual and religious identity. The tweets were “clearly trying to divide the Democratic Party and lower voter turnout,” the authors told FiveThirtyEight.
  • News Feed: A bit more mysterious, news feed trolls mostly posed as local news aggregators who linked to legitimate news sources. Some, however, “tweeted about global issues, often with a pro-Russia perspective.”
  • Hashtag Gamer: Gamer trolls used hashtag games—a popular call/response form of tweeting—to drum up interaction from other users. Some tweets were benign, but many “were overtly political, e.g. @LoraGreeen, July 11, 2015, ‘#WasteAMillionIn3Words Donate to #Hillary.’”
  • Fearmonger: These trolls, who were least prevalent in the dataset, spread completely fake news stories, for instance “that salmonella-contaminated turkeys were produced by Koch Foods, a U.S. poultry producer, near the 2015 Thanksgiving holiday.”
But a quick table of the variable account_category shows 8 categories in the dataset.

table(tweet_data$account_category)

## 
##   Commercial   Fearmonger HashtagGamer    LeftTroll     NewsFeed 
##       122582        11140       241827       427811       599294 
##   NonEnglish   RightTroll      Unknown 
##       837725       719087        13905

The additional three are Commercial, Non-English, and Unknown. At the very least, we should drop the Non-English tweets, since those use Russian characters and any analysis I do will assume data are in English. I'm also going to keep only a few key variables. Then I'm going to clean up this dataset to remove links, because I don't need those for my analysis - I certainly wouldn't want to follow them to their destination. If I want to free up some memory, I can then remove the large dataset.

# keep key variables and drop the non-English (mostly Russian-language) tweets
reduced <- tweet_data %>%
  select(author, content, publish_date, account_category) %>%
  filter(account_category != "NonEnglish")

library(qdapRegex)

## 
## Attaching package: 'qdapRegex'

# strip URLs from the tweet text
reduced$content <- rm_url(reduced$content)

# free up memory by dropping the full dataset
rm(tweet_data)
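As a quick sanity check (just a sketch), we can confirm what's left after cleaning:

# how many tweets and variables remain, and which categories survived
dim(reduced)
table(reduced$account_category)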

Now we have a dataset of 2,135,646 observations of 4 variables. I'm planning to do some analysis of my own on this dataset - and will of course share what I find - but for now, I thought I'd repeat a technique I've covered on this blog and demonstrate a new one.

library(tidytext)

# tokenize the tweets into individual words and drop common stop words
tweetwords <- reduced %>%
  unnest_tokens(word, content) %>%
  anti_join(stop_words)

## Joining, by = "word"

wordcounts <- tweetwords %>%
  count(account_category, word, sort = TRUE) %>%
  ungroup()

head(wordcounts)

## # A tibble: 6 x 3
##   account_category word          n
##   <chr>            <chr>     <int>
## 1 NewsFeed         news     124586
## 2 RightTroll       trump     95794
## 3 RightTroll       rt        86970
## 4 NewsFeed         sports    47793
## 5 Commercial       workout   42395
## 6 NewsFeed         politics  38204

First, I'll conduct a TF-IDF analysis of the dataset. TF-IDF weights each word by how often it appears within an account category (term frequency) and how rare it is across categories (inverse document frequency), so it highlights the words most distinctive to each category. This code is a repeat from a previous post.

tweet_tfidf <- wordcounts %>%
  bind_tf_idf(word, account_category, n) %>%
  arrange(desc(tf_idf))

tweet_tfidf %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%
  group_by(account_category) %>%
  top_n(15) %>%
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = account_category)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~account_category, ncol = 2, scales = "free") +
  coord_flip()

## Selecting by tf_idf
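For anyone curious what bind_tf_idf is doing under the hood, here is a rough manual equivalent (just a sketch; as far as I know, tidytext uses the natural log for the idf term):

# a manual version of tf-idf for comparison (sketch only)
total_docs <- n_distinct(wordcounts$account_category)

manual_tfidf <- wordcounts %>%
  group_by(account_category) %>%
  mutate(tf = n / sum(n)) %>%                # term frequency within a category
  group_by(word) %>%
  mutate(idf = log(total_docs / n())) %>%    # how rare the word is across categories
  ungroup() %>%
  mutate(tf_idf = tf * idf)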


But another method of examining terms and topics in a set of documents is Latent Dirichlet Allocation (LDA), which can be conducted using the R package topicmodels. The only issue is that LDA requires a document-term matrix. But we can easily convert our wordcounts dataset into a DTM with the cast_dtm function from tidytext. Then we run our LDA with topicmodels. Note that LDA is a random technique, so we set a random number seed, and we specify how many topics we want the LDA to extract (k). Since there are 6 account types (plus 1 unknown), I'm going to try having it extract 6 topics. We can see how well they line up with the account types.

# cast the word counts into a document-term matrix, one document per account category
tweets_dtm <- wordcounts %>%
  cast_dtm(account_category, word, n)

library(topicmodels)
# fit the LDA with 6 topics and a fixed seed for reproducibility
tweets_lda <- LDA(tweets_dtm, k = 6, control = list(seed = 42))
# extract the per-topic word probabilities (beta)
tweet_topics <- tidy(tweets_lda, matrix = "beta")

Now we can pull out the top terms from this analysis and plot them to see how they line up.

top_terms <- tweet_topics %>%
  group_by(topic) %>%
  top_n(15, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~topic, scales = "free") +
  coord_flip()


Based on these plots, I'd say the topics line up very well with the account categories, showing, in order: news feed, left troll, fearmonger, right troll, hashtag gamer, and commercial. One interesting observation, though, is that Trump is a top term in 5 of the 6 topics.
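To check this alignment a little more directly, we could also look at the per-document topic probabilities (the gamma matrix), which show how much of each account category's word use is assigned to each topic. A quick sketch:

# per-document (here, per account category) topic proportions
tweet_gamma <- tidy(tweets_lda, matrix = "gamma")

# the dominant topic for each account category
tweet_gamma %>%
  group_by(document) %>%
  top_n(1, gamma) %>%
  ungroup()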
