Statistics Sunday: Getting Started with the Russian Tweet Dataset

August 12, 2018

(This article was first published on Deeply Trivial, and kindly contributed to R-bloggers)

IRA Tweet Data

You may have heard that two researchers at Clemson University analyzed almost 3 million tweets from the Internet Research Agency (IRA) – a “Russian troll factory”. In partnership with FiveThirtyEight, they made all of their data available on GitHub. So of course, I had to read the files into R, which I was able to do with this code:

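library(tidyverse)  # read_csv(), dplyr verbs, and ggplot2

# List each downloaded CSV file here; only the first file name is shown
files <- c("IRAhandle_tweets_1.csv")
my_files <- paste0("~/Downloads/russian-troll-tweets-master/", files)

# Read a single CSV file into a tibble
each_file <- function(file) {
  read_csv(file)
}

# Read every file, tag its rows with the source file, and stack the results
tweet_data <- NULL
for (file in my_files) {
  temp <- each_file(file)
  temp$id <- sub("\\.csv$", "", file)
  tweet_data <- rbind(tweet_data, temp)
}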
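
Growing a data frame with rbind() inside a loop copies the accumulated rows on every pass, which can get slow with millions of tweets. As a minimal sketch of an alternative, assuming base R only, you can read each file into a list and bind the pieces once at the end. The two tiny CSV files, their columns, and every name starting with demo_ are made up for illustration and are not the IRA data.

```r
# Hypothetical stand-in data: two small CSV files written to a temporary folder
demo_dir <- file.path(tempdir(), "csv_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(author = c("a", "b"), content = c("x", "y")),
          file.path(demo_dir, "demo_tweets_1.csv"), row.names = FALSE)
write.csv(data.frame(author = "c", content = "z"),
          file.path(demo_dir, "demo_tweets_2.csv"), row.names = FALSE)

# Find every matching file instead of typing the names by hand
demo_files <- list.files(demo_dir, pattern = "^demo_tweets_[0-9]+\\.csv$",
                         full.names = TRUE)

# Read each file into a list element, tagging rows with the source file name
pieces <- lapply(demo_files, function(f) {
  piece <- read.csv(f, stringsAsFactors = FALSE)
  piece$id <- sub("\\.csv$", "", basename(f))
  piece
})

# Bind all pieces in a single step
demo_data <- do.call(rbind, pieces)
```

With the tidyverse loaded, dplyr::bind_rows(pieces) should do the same job as do.call(rbind, pieces).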

Note that this is a large dataset, with 2,973,371 observations of 16 variables. Let’s do some cleaning of this dataset first. The researchers, Darren Linvill and Patrick Warren, identified 5 major types of trolls:

  • Right Troll: These Trump-supporting trolls voiced right-leaning, populist messages, but “rarely broadcast traditionally important Republican themes, such as taxes, abortion, and regulation, but often sent divisive messages about mainstream and moderate Republicans … They routinely denigrated the Democratic Party, e.g. @LeroyLovesUSA, January 20, 2017, ‘#ThanksObama We’re FINALLY evicting Obama. Now Donald Trump will bring back jobs for the lazy ass Obamacare recipients,’” the authors wrote.
  • Left Troll: These trolls mainly supported Bernie Sanders, derided mainstream Democrats, and focused heavily on racial identity, in addition to sexual and religious identity. The tweets were “clearly trying to divide the Democratic Party and lower voter turnout,” the authors told FiveThirtyEight.
  • News Feed: A bit more mysterious, news feed trolls mostly posed as local news aggregators who linked to legitimate news sources. Some, however, “tweeted about global issues, often with a pro-Russia perspective.”
  • Hashtag Gamer: Gamer trolls used hashtag games (a popular call/response form of tweeting) to drum up interaction from other users. Some tweets were benign, but many “were overtly political, e.g. @LoraGreeen, July 11, 2015, ‘#WasteAMillionIn3Words Donate to #Hillary.’”
  • Fearmonger: These trolls, who were least prevalent in the dataset, spread completely fake news stories, for instance “that salmonella-contaminated turkeys were produced by Koch Foods, a U.S. poultry producer, near the 2015 Thanksgiving holiday.”

But a quick table of the account_category variable shows 8 categories in the dataset.

table(tweet_data$account_category)
##   Commercial   Fearmonger HashtagGamer    LeftTroll     NewsFeed
##       122582        11140       241827       427811       599294
##   NonEnglish   RightTroll      Unknown
##       837725       719087        13905

The additional three are Commercial, Non-English, and Unknown. At the very least, we should drop the Non-English tweets, since those use Russian characters and any analysis I do will assume data are in English. I’m also going to keep only a few key variables. Then I’m going to clean up this dataset to remove links, because I don’t need those for my analysis – I certainly wouldn’t want to follow them to their destination. If I want to free up some memory, I can then remove the large dataset.

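# Keep only a few key variables and drop the non-English tweets
reduced <- tweet_data %>%
  select(author, content, publish_date, account_category) %>%
  filter(account_category != "NonEnglish")

library(qdapRegex)  # provides rm_url()
## Attaching package: 'qdapRegex'
# Strip links from the tweet text
reduced$content <- rm_url(reduced$content)

# Free up memory by removing the full dataset
rm(tweet_data)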
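
For readers without qdapRegex installed, here is a rough base R sketch of the same idea. The pattern is a simplified assumption (anything starting with http://, https://, or www. up to the next space) and is not the expression rm_url uses internally. The sample tweets and the strip_urls helper are invented for illustration.

```r
# Invented example tweets
sample_tweets <- c("Breaking news https://t.co/abc123 read now",
                   "No links here",
                   "Visit www.example.com today")

# Simplified URL pattern: a scheme or "www." followed by non-space characters
url_pattern <- "(https?://|www\\.)[^[:space:]]+"

# Hypothetical helper: remove URLs, collapse repeated spaces, trim the ends
strip_urls <- function(x) {
  x <- gsub(url_pattern, "", x)
  x <- gsub("[[:space:]]+", " ", x)
  trimws(x)
}

cleaned <- strip_urls(sample_tweets)
```

A pattern this simple will miss some links (for example, bare domains without a scheme), so treat it as a starting point rather than a replacement for a tested library.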


Now we have a dataset of 2,135,646 observations of 4 variables. I’m planning on doing some analysis on my own of this dataset – and will of course share what I find – but for now, I thought I’d repeat a technique I’ve covered on this blog and demonstrate a new one.


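library(tidytext)  # unnest_tokens(), stop_words, bind_tf_idf(), cast_dtm()

# Split tweets into one word per row and drop common stop words
tweetwords <- reduced %>%
  unnest_tokens(word, content) %>%
  anti_join(stop_words)
## Joining, by = "word"

# Count how often each word appears within each account category
wordcounts <- tweetwords %>%
  count(account_category, word, sort = TRUE)

head(wordcounts)
## # A tibble: 6 x 3
##   account_category word          n
## 1 NewsFeed         news     124586
## 2 RightTroll       trump     95794
## 3 RightTroll       rt        86970
## 4 NewsFeed         sports    47793
## 5 Commercial       workout   42395
## 6 NewsFeed         politics  38204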

First, I’ll conduct a TF-IDF analysis of the dataset. This code is a repeat from a previous post.

# Compute tf-idf, treating each account category as one document
tweet_tfidf <- wordcounts %>%
  bind_tf_idf(word, account_category, n) %>%
  arrange(desc(tf_idf))

# Plot the 15 highest tf-idf words for each account category
tweet_tfidf %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%
  group_by(account_category) %>%
  top_n(15) %>%
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = account_category)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~account_category, ncol = 2, scales = "free") +
  coord_flip()
## Selecting by tf_idf
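
To make the tf-idf numbers less of a black box, here is a small base R sketch of the calculation on an invented word-count table. It assumes the usual definitions, which are also the ones tidytext documents for bind_tf_idf: term frequency is a word's share of its document's total count, and inverse document frequency is the natural log of the number of documents divided by the number of documents containing the word. The toy counts are illustrative only.

```r
# Invented counts: each account category is treated as one document,
# with one row per (document, word) pair
toy_counts <- data.frame(
  account_category = c("NewsFeed", "NewsFeed", "RightTroll", "RightTroll"),
  word             = c("news", "trump", "trump", "maga"),
  n                = c(3, 1, 2, 2),
  stringsAsFactors = FALSE
)

# Term frequency: count divided by the total count in that document
doc_totals <- tapply(toy_counts$n, toy_counts$account_category, sum)
toy_counts$tf <- toy_counts$n / as.numeric(doc_totals[toy_counts$account_category])

# Inverse document frequency: ln(number of documents / documents containing the word)
n_docs <- length(unique(toy_counts$account_category))
docs_with_word <- table(toy_counts$word)
toy_counts$idf <- log(n_docs / as.numeric(docs_with_word[toy_counts$word]))

# tf-idf is the product of the two
toy_counts$tf_idf <- toy_counts$tf * toy_counts$idf
```

Because "trump" appears in both toy documents, its idf is log(2/2) = 0, so its tf-idf is 0 no matter how often it is used. Words shared by every account category drop out of a tf-idf ranking for the same reason.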

But another method of examining terms and topics in a set of documents is Latent Dirichlet Allocation (LDA), which can be conducted using the R package topicmodels. The only issue is that LDA requires a document-term matrix (DTM). But we can easily convert our wordcounts dataset into a DTM with the cast_dtm function from tidytext. Then we run our LDA with topicmodels. Note that fitting an LDA involves random initialization, so we set a random number seed to make the results reproducible, and we specify how many topics we want the LDA to extract (k). Since there are 6 account types (plus 1 unknown), I’m going to try having it extract 6 topics. We can see how well they line up with the account types.

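library(topicmodels)  # provides LDA()

# Cast the word counts into a DTM with one document per account category
tweets_dtm <- wordcounts %>%
  cast_dtm(account_category, word, n)

# Fit a 6-topic LDA, then extract the per-topic word probabilities (beta)
tweets_lda <- LDA(tweets_dtm, k = 6, control = list(seed = 42))
tweet_topics <- tidy(tweets_lda, matrix = "beta")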
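
As a rough picture of what a document-term matrix holds, the sketch below builds a small dense one in base R with xtabs. This is only an analogy: cast_dtm returns a sparse DocumentTermMatrix object rather than a plain table, and the counts and the dtm_counts and toy_dtm names here are invented.

```r
# Invented word counts in the same long format as wordcounts
dtm_counts <- data.frame(
  account_category = c("NewsFeed", "NewsFeed", "RightTroll", "RightTroll"),
  word             = c("news", "trump", "trump", "maga"),
  n                = c(3, 1, 2, 2),
  stringsAsFactors = FALSE
)

# One row per document, one column per term; cells hold the counts,
# and word/document pairs that never occur are filled with 0
toy_dtm <- xtabs(n ~ account_category + word, data = dtm_counts)

print(toy_dtm)
```

LDA works from this kind of layout: each row is a document (here, an account category) described only by how often each term occurs in it.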

Now we can pull out the top terms from this analysis, and plot them to see how they lined up.

# Keep the 15 most probable terms within each topic
top_terms <- tweet_topics %>%
  group_by(topic) %>%
  top_n(15, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

# Plot the top terms, one panel per topic
top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~topic, scales = "free") +
  coord_flip()

Based on these plots, I’d say the topics line up very well with the account categories, showing, in order: news feed, left troll, fearmonger, right troll, hashtag gamer, and commercial. One interesting observation, though, is that Trump is a top term in 5 of the 6 topics.
