Wrangling Wikileaks DMs

August 5, 2018

(This article was first published on Colin Fay, and kindly contributed to R-bloggers)

Using R to turn raw data into browsable and reusable content.


On the 29th of July 2018, Emma Best published on her website a copy of
11k+ WikiLeaks Twitter DMs:
https://emma.best/2018/07/29/11000-messages-from-private-wikileaks-chat-released/

To be honest, I’m not really interested in the content of this dataset.
What really interests me is that it’s raw data (copied and pasted
text) waiting to be parsed, and that I could use R to turn these
elements into reusable and browsable content.

The results

Here are the links to the pages I’ve created with R from this dataset:

  • Home has the full
    dataset, to search and download.
  • Timeline has a
    series of time-related content: notably DMs by years, and daily
    count of DMs.
  • Users holds the
    dataset for each user.
  • mentions_urls
    holds the extracted mentions and URLs.
  • methodo contains the
    methodology used for the data wrangling.


Extracting the content

As I wanted to use the data offline (and not re-download it each time I
compile the outputs), I’ve first extracted and saved the dataset as a
.txt. You can now see it at https://colinfay.me/wikileaksdm/raw.txt.

Here is the code:

library(tidyverse)
library(rvest)

## ── Attaching packages ────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.0.0     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.5
## ✔ tidyr   0.8.1     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0

## ── Conflicts ───────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## Loading required package: xml2

## Attaching package: 'rvest'

## The following object is masked from 'package:purrr':
##     pluck

## The following object is masked from 'package:readr':
##     guess_encoding
# Reading the page
doc <- read_html("https://emma.best/2018/07/29/11000-messages-from-private-wikileaks-chat-released/")
# Extracting the paragraphs
doc <- doc %>% 
  # Getting the p
  html_nodes("p") %>%
  # Getting the text
  html_text()

# Removing the empty lines
doc <- doc[! nchar(doc)  == 0]
# Lines 1 to 9 are the content of the blog post, not the content of the conversation.
doc[1:9]
## [1] "“Objectivity is short-hand for not having a significant pre-conceived agenda, eliding facts the audience would be interested in, or engaging in obvious falsehoods.” ~ WikiLeaks"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
## [2] "Presented below are over 11,000 messages from the WikiLeaks + 10 chat, from which only excerpts have previously been published."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
## [3] "The chat is presented nearly in its entirety, with only a handful redactions made to protect the privacy and personal information of innocent, third parties as well as the already public name of an individual who has sent hate mail, made legal threats and who the source for the DMs considers a threat. It is at the request of the source that Mark’s full name is redacted, leaving only his first name and his initials (which he specified is alright). Though MGT’s full name is already public and easily discoverable, the source’s wishes are being respected. Beyond this individual, the redactions don’t include any information that’s relevant to understanding WikiLeaks or their activities."
## [4] "The chat log shows WikiLeaks’ private attitudes, their use of FOIA laws, as well as discussions about WikiLeaks’ lobbying and attempts to “humiliate” politicians, PR and propaganda efforts (such as establishing a “medium term truth” for “phase 2”), troll operations, attempts to engineer situations where WikiLeaks would be able to sue their critics, and in some instances where WikiLeaks helped direct lawsuits filed by third parties or encouraged criminal investigations against their opponents. In some instances, the chats are revealing. In others, they show a mundane consistency with WikiLeaks’ public stances. A few are provocative and confounding."                                   
## [5] "The extract below was created using DMArchiver, and is presented as pure text to make it easier to search and to provide as much metadata as possible (i.e. times as well as dates). The formatting is presented as-is, and shows users’ display names rather than their twitter handles. (Note: Emmy B is @GreekEmmy, not the author.)"                                                                                                                                                                                                                                                                                                                                                                           
## [6] "CW: At various points in the chat, there are examples of homophobia, transphobia, ableism, sexism, racism, antisemitism and other objectionable content and language. Some of these are couched as jokes, but are still likely to (and should) offend, as a racist or sexist jokes doesn’t cease to be racist or sexist because of an expected or desired laugh. Attempts to dismiss of these comments as “ironic” or “just trolling” merely invites comparisons to 4chan and ironic nazis. These comments, though offensive, are included in order to present as full and complete a record as possible and to let readers judge the context, purpose and merit of these comments for themselves."                
## [7] " "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
## [8] "If any current or former staffers, volunteers or hackers wants to add to my growing collection of leaks from within #WikiLeaks, please reach out. DMs are open and I’m EmmaBest on Wire."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
## [9] "— Emma Best (\u1d1c//ғ\u1d0f\u1d1c\u1d0f) \U0001f3f3️‍\U0001f308 (@NatSecGeek) June 28, 2018"
# Line 10 is the first message of the conversation
doc[10]
## [1] "[2015-05-01 13:52:11]  group Dm on Wls related trolls activity, incoming events & general topics."
# Removing these lines:
doc <- doc[10:length(doc)]

And then simply save it as a txt:

write(doc, "raw.txt")

I can now easily access it again.

res_raw <- read.delim("https://colinfay.me/wikileaksdm/raw.txt", 
                      sep = "\n", header = FALSE) %>% 
  # Turning the data frame into a tibble
  as_tibble() %>% 
  # Renaming the V1 column
  rename(value = V1)
## # A tibble: 11,377 x 1
##    value                                                                   
##  1 [2015-05-01 13:52:11]  group Dm on Wls related trolls activity, i…
##  2 [2015-05-01 13:53:39]  There’s a race on now to ‘spoil’ our …
##  3 [2015-05-01 13:55:02]  Greenberg, who palled up with the ope…
##  4 [2015-05-01 13:55:53]  ..stripped the hostility. We’re putti…
##  5 [2015-05-01 13:56:03]  suppressed.                           
##  6 [2015-05-01 14:01:26]  yes, both Wired & Verge contained DDB’s li…
##  7 [2015-05-01 14:02:37]  – cyber attacks. must be one of, if not th…
##  8 [2015-05-01 14:03:55]  Greenberg is a partisan. Fraudulent t…
##  9 [2015-05-01 14:29:48]  [Tweet] https://twitter.com/m_cetera/statu…
## 10 [2015-05-01 14:32:09]  Hi all. More comfortable discuss…
## # ... with 11,367 more rows

Cleaning the data

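DMs have a specific structure: [date hour] <author> text, except for
one “author”, DMConversationEntry, which is the meta-information
about the conversation (renaming of the channel, users joining and
leaving, etc). In order to tidy the format, let’s add <DMConversationEntry>
as an author.

res <- res_raw %>% 
  # Assumed arguments (elided in the source): "[DMConversationEntry]" becomes an <author>
  mutate(value = str_replace_all(value, 
    fixed("[DMConversationEntry]"), "<DMConversationEntry>"))
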
I’ll also remove the last entry of the corpus, which doesn’t fit the
conversation format:

## # A tibble: 1 x 1
##   value                             
## 1 [LatestTweetID] 931704226425856001
res <- filter(res, ! str_detect(value, "931704226425856001"))

Some messages are split across several lines. The continuation lines don’t
start with a date (they are the middle of a DM), so I’ll paste their content
at the end of the line before.

Here is an example with lines 93 & 94:

“[2015-05-02 14:12:27] OK, thanks H. Security issues were about who was on the list then?”
“Never quite know who you’re dealing with online I guess. I don’t, anyway!”

Here, line 94 will be pasted at the end of line 93 and then removed.

Let’s loop this:

# Walk the rows bottom-up so chained continuation lines are merged in order
for (i in nrow(res):1){
  if (!grepl(pattern = "\\[.{4}-.{2}-.{2} .{2}:.{2}:.{2}\\]|DMConversationEntry", res[i,])){
    res[i-1,] <- paste(res[i-1,], res[i,])
  }
}
# Remove lines with no date or no DMConversationEntry
res <- res %>% 
  mutate(has_date = str_detect(value, pattern = "\\[.{4}-.{2}-.{2} .{2}:.{2}:.{2}\\]|DMConversationEntry")) %>%
  filter(has_date) %>%
  # Assumed final step (truncated in the source): drop the helper column
  select(-has_date)
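
The same merge can also be written without an explicit loop. Here is a minimal sketch on toy data, not the code used to build the pages: the `toy` tibble, the user names and the helper columns `starts_entry` and `entry_id` are all hypothetical.

```r
library(dplyr)
library(stringr)

# Toy data (hypothetical): the second line continues the first DM
toy <- tibble::tibble(value = c(
  "[2015-05-02 14:12:27] <A> OK, thanks H.",
  "Never quite know who you're dealing with online I guess.",
  "[2015-05-02 14:13:00] <B> Indeed."
))

# A line opens a new entry if it starts with a date or is a meta entry
entry_start <- "^\\[.{4}-.{2}-.{2} .{2}:.{2}:.{2}\\]|DMConversationEntry"

merged <- toy %>%
  # TRUE when a line opens a new entry, FALSE for a continuation line
  mutate(starts_entry = str_detect(value, entry_start)) %>%
  # Continuation lines share the id of the entry they belong to
  mutate(entry_id = cumsum(starts_entry)) %>%
  group_by(entry_id) %>%
  # Paste every line of an entry back together, in order
  summarise(value = paste(value, collapse = " ")) %>%
  select(value)
```

The cumulative sum only increases on lines that open an entry, so each continuation line inherits the id of the closest entry above it.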

Extract key elements

We’ll now need to split the content in three: user, date, and text.

My first try was with a single call to extract():

res <- res %>%
  extract(
    value, 
    c("date", "user", "text"), 
    regex = "\\[(.{4}-.{2}-.{2} .{2}:.{2}:.{2})\\] <([a-zA-Z0-9 ]*)>  (.*)"
  )

But that didn’t fit well: the DMConversationEntry has no date (I will
fill them later), so I need an NA here, hence the three-step process:

res <- res %>%
  extract(value,"user", regex = "<([a-zA-Z0-9 ]*)>", remove = FALSE) %>%
  extract(value,"date", regex = "\\[(.{4}-.{2}-.{2} .{2}:.{2}:.{2})\\] .*", remove = FALSE) %>%
  extract(value, "text", regex = "<[a-zA-Z0-9 ]*> (.*)", remove = FALSE)
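
To see why the three-step version leaves an NA where the single regex would not fit, here is a small sketch on two made-up lines. The `toy` tibble and its content are hypothetical; the three regexes are the ones from the post.

```r
library(tidyr)
library(tibble)

# Toy lines (hypothetical): one regular DM, one meta entry without a date
toy <- tibble(value = c(
  "[2015-05-01 13:52:11] <WikiLeaks> hello there",
  "<DMConversationEntry> someone joined the conversation"
))

parsed <- toy %>%
  # Each call only looks for its own piece, so a missing date
  # does not prevent the user and the text from being captured
  extract(value, "user", regex = "<([a-zA-Z0-9 ]*)>", remove = FALSE) %>%
  extract(value, "date", regex = "\\[(.{4}-.{2}-.{2} .{2}:.{2}:.{2})\\] .*", remove = FALSE) %>%
  extract(value, "text", regex = "<[a-zA-Z0-9 ]*> (.*)", remove = FALSE)
```

In `parsed`, the meta entry keeps its user and text while its date is NA, which is the gap that gets filled later.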

When date is missing, it’s because it’s a DMConversationEntry. Let’s
verify that:

res %>% 
  filter(user == "DMConversationEntry") %>%
  summarize(nas = sum(is.na(date)), 
            nrow = n())
## # A tibble: 1 x 2
##     nas  nrow
## 1    20    20

In order to have a date here, we will fill this with the directly
preceding date:

res <- fill(res, date)
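
As a quick illustration of what fill() does here, a sketch on three made-up rows (the `toy` tibble is hypothetical):

```r
library(tidyr)
library(tibble)

# Toy data (hypothetical): the meta entry in the middle has no date
toy <- tibble(
  user = c("A", "DMConversationEntry", "B"),
  date = c("2015-05-01 13:52:11", NA, "2015-05-01 13:55:02")
)

# By default fill() goes downwards: each NA takes the last non-missing value above it
filled <- fill(toy, date)
```

This relies on the rows still being in conversation order, which is the case as long as the data has not been sorted in between.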

Saving data


write_csv(res, "wikileaks_dm.csv")


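Find the min and max dates, to get the range of years:

range(res$date)
## [1] "2015-05-01 13:52:11" "2017-11-10 04:30:46"

Filter and save a csv for each year:

# Assumed wrapper and file names (the call is truncated in the source)
walk(2015:2017, 
    ~ filter(res, lubridate::year(date) == .x) %>%
      write_csv(paste0(.x, ".csv")))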


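Filter and save a csv for each user:

# Assumed wrapper and file names (the call is truncated in the source)
walk(unique(res$user), 
    ~ filter(res, user == .x) %>%
      write_csv(paste0("user_", .x, ".csv")))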

Counting users participation

res %>%
  count(user, sort = TRUE)

Counting activity by days

res %>%
  mutate(date = lubridate::ymd_hms(date), 
         date = lubridate::date(date)) %>% 
  count(date)
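
To make the shape of the result concrete, here is the same idea on three made-up timestamps. The `toy` tibble is hypothetical, and as_date() is used here as a stand-in for the date() call above.

```r
library(dplyr)
library(lubridate)

# Toy data (hypothetical): two messages on one day, one on the next
toy <- tibble::tibble(date = c(
  "2015-05-01 13:52:11",
  "2015-05-01 14:01:26",
  "2015-05-02 09:00:00"
))

daily <- toy %>%
  # Parse the timestamp, then keep only the calendar day
  mutate(date = as_date(ymd_hms(date))) %>%
  # One row per day, with the number of messages in n
  count(date)
```

A table like `daily`, with one row per day and a count column, is the kind of input a time series plot can be built from.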

Adding extra info

Extracting all the mentions (@something):

mentions <- res %>% 
  mutate(mention = str_extract_all(text, "@[a-zA-Z0-9_]+")) %>%
  unnest(mention) %>% 
  select(mention, everything())
write_csv(mentions, "mentions.csv")

# Count them

mentions %>%
  count(mention, sort = TRUE)

Extracting all the urls (http(s)something):

urls <- res %>% 
  mutate(url = str_extract_all(text, "http.+")) %>%
  unnest(url) %>% 
  select(url, everything())
write_csv(urls, "urls.csv")
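
One thing to keep in mind with "http.+" is that it keeps everything up to the end of the message, not just the URL. The sketch below compares it with a tighter pattern on a made-up message. The `msg` string and the "https?://\\S+" alternative are suggestions of mine, not what the pages were built with, and I haven't checked how the alternative behaves on the full corpus.

```r
library(stringr)

# Made-up message containing two URLs and some trailing text
msg <- "see https://example.org/a and http://example.org/b for details"

# Pattern from the post: one match, running to the end of the string
greedy <- str_extract_all(msg, "http.+")[[1]]

# Possible alternative: stop each match at the first whitespace
tight <- str_extract_all(msg, "https?://\\S+")[[1]]
```

With the tighter pattern, punctuation glued to the end of a URL (a closing parenthesis, a full stop) would still be captured, so some extra cleaning may be needed.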

Adding JSON format

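I’ve also chosen to export a JSON version of each csv.

list.files(pattern = "csv") %>%
  walk(function(x) {
    o <- read_csv(x)
    # Assumed writer and wrappers (the calls are truncated in the source)
    jsonlite::write_json(o, 
      path = glue::glue("{tools::file_path_sans_ext(x)}.json"))
  })
list.files(pattern = "json") %>%
  walk(function(x) {
    file.copy(x, glue::glue("json/{x}"))
  })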

Building a website with Markdown and GitHub

Here’s a list of random elements from the process of building these
pages with R.


My website is hosted on
GitHub, with the home
url (colinfay.me) pointing to the root of this repo. If I create a new
folder pouet, and put inside this folder a file called index.html, I
can then go to colinfay.me/pouet, and get a new website from there. As
the wikileaks extraction already had its own repo, I’ve chosen to list
this repo https://github.com/ColinFay/wikileaksdm as a submodule of my
website’s repo.

More about submodules:

Inside this wikileaksdm project, I gathered all the data, an index.Rmd
which is used as the homepage, and other Rmd files for the other pages.
Each one is compiled to HTML.

Styling the pages

Markdown default style is nice, but I wanted something different. This
is why I used {markdowntemplates}, with the skeleton template. The
yaml looks like:

---
title: "Wikileaks Twitter DM - Home"
author: '@_colinfay'
date: "2018-08-06"
fig_width: 10
fig_height: 4 
navlink: "[Wikileaks Twitter DM](https://colinfay.me/wikileaksdm)"
og:
  type: "article"
  title: "Wikileaks Twitter DM"
footer:
  - content: 'colinfay.me@_colinfay'
output: markdowntemplates::skeleton
---

Here, you can see some new things: footer content, og for open graph
data, and navlink for the content of the header.

Include the same markdown content several times

All the pages have the same intro content, so I can use
shiny::includeMarkdown to include it on each page (this way, I’ll only
need to update the content once if needed). Put it between backticks
with an r, and the markdown is integrated at compilation time as html.
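
A minimal sketch of what that call does, run outside of a document. The temporary file and its content are made up for the example; on the real pages the path would point to the shared intro file, whose name I’m not assuming here.

```r
library(shiny)

# Made-up markdown file standing in for the shared intro
intro_path <- tempfile(fileext = ".md")
writeLines("This is the **shared** intro.", intro_path)

# includeMarkdown() reads the file and returns its content rendered as HTML
intro_html <- includeMarkdown(intro_path)
```

Inside an Rmd, the same call written as inline R code is replaced by that HTML when the page is knitted, so editing the markdown file once updates every page that includes it.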

See here, line 21:

Include font awesome icons

Before every link, there is an icon.

This could have been done with CSS, but I’ve used the {fontawesome}
package, also between backticks and with an r, to include them.

See here, line 33:

Page content

All the pages include interactive elements, and a static plot.
Interactive tables have been rendered with the {DT} package, and the
timeline with {dygraphs}. Under each dygraph, there is a static plot
made with {ggplot2}. In order to organise these two plots (interactive
and static), the second plot is put inside a <details> HTML tag. This
makes it possible to create foldable content inside the page.
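
Foldable content of this kind is what the HTML <details> element provides, with a <summary> child as the clickable label. The pages may simply use raw HTML in the Rmd for this; the sketch below is only one way to build such a block from R with {htmltools}, and the label and paragraph text are placeholders.

```r
library(htmltools)

# Build <details><summary>...</summary> ... </details> with the generic tag() helper
foldable <- tag("details", list(
  # The summary is the part that stays visible and toggles the rest
  tag("summary", list("Static version of the plot")),
  # Placeholder for the folded content
  tags$p("The ggplot2 output would go here.")
))

# Render the tag object to an HTML string
html <- as.character(foldable)
```

Browsers show only the summary until the reader clicks on it, so the static plot does not take up space next to the interactive one.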

See: https://twitter.com/_ColinFay/status/1022836135452663809

Prefilling functions

I use dygraph and datatable several times, with the same default
arguments (e.g. extensions = "Buttons", options = list(scrollX = TRUE,
dom = "Bfrtip", buttons = c("copy", "csv"))). As I didn’t want to retype
these elements each time, I’ve called purrr::partial on it:

dt <- partial(
  # Assumed target (the first argument is missing in the source)
  DT::datatable,
  extensions = "Buttons",
  options = list(
    scrollX = TRUE, 
    dom = "Bfrtip", 
    buttons = c("copy", "csv")
  )
)

This new dt function is then used as the default datatable rendering.
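
A short usage sketch, assuming dt wraps DT::datatable. The iris dataset and the rownames argument are only there for the example.

```r
library(purrr)
library(DT)

# Pre-filled datatable, with the defaults listed in the post
dt <- partial(
  datatable,
  extensions = "Buttons",
  options = list(scrollX = TRUE, dom = "Bfrtip", buttons = c("copy", "csv"))
)

# The remaining arguments are passed as usual; iris is only a stand-in dataset
widget <- dt(head(iris), rownames = FALSE)
```

The pre-filled arguments are applied on every call, so each table gets the copy and csv buttons without repeating the options.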

Read more

If you want to read the code and discover the content, feel free to
browse the website (https://colinfay.me/wikileaksdm) and the GitHub repo
(https://github.com/ColinFay/wikileaksdm).
