Wrangling Wikileaks DMs

[This article was first published on Colin Fay, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Using R to turn raw data into browsable and reusable content.

About

On the 29th of July 2018, Emma Best published on her website the copy of 11k+ wikileaks Twitter DM : https://emma.best/2018/07/29/11000-messages-from-private-wikileaks-chat-released/

To be honnest, I’m not really interested in the content of this dataset. What really interested me is that it’s raw data (copied and pasted text) waiting to be parsed, and that I could use R to turn these elements into a reusable and browsable content.

The results

Here are the links to the pages I’ve created with R from this dataset:

  • Home has the full dataset, to search and download.
  • Timeline has a series of time-related content: notably DMs by years, and daily count of DMs.
  • Users holds the dataset for each users.
  • mentions_urls holds the extracted mentions and urls
  • methodo contains the methodology used for the data wrangling

Methodology

Extracting the content

As I wanted to use the data offline (and not re-download it each time I compile the outputs), I’ve first extracted and saved the dataset as a .txt. You can now see it at https://colinfay.me/wikileaksdm/raw.txt.

Here is the code used:

library(tidyverse)

## ── Attaching packages ────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.0.0     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.5
## ✔ tidyr   0.8.1     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0

## ── Conflicts ───────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(rvest)

## Loading required package: xml2

## 
## Attaching package: 'rvest'

## The following object is masked from 'package:purrr':
## 
##     pluck

## The following object is masked from 'package:readr':
## 
##     guess_encoding

# Reading the page
doc <- read_html("https://emma.best/2018/07/29/11000-messages-from-private-wikileaks-chat-released/")
# Extracting the paragraphs
doc <- doc %>% 
  # Getting the p
  html_nodes("p") %>%
  # Getting the text
  html_text()

# Removing the empty lines
doc <- doc[! nchar(doc)  == 0]
# Lines 1 to 9 are the content of the blogpost, not the content of the conversation. 
doc[1:9]

## [1] "“Objectivity is short-hand for not having a significant pre-conceived agenda, eliding facts the audience would be interested in, or engaging in obvious falsehoods.” ~ WikiLeaks"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
## [2] "Presented below are over 11,000 messages from the WikiLeaks + 10 chat, from which only excerpts have previously been published."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
## [3] "The chat is presented nearly in its entirety, with only a handful redactions made to protect the privacy and personal information of innocent, third parties as well as the already public name of an individual who has sent hate mail, made legal threats and who the source for the DMs considers a threat. It is at the request of the source that Mark’s full name is redacted, leaving only his first name and his initials (which he specified is alright). Though MGT’s full name is already public and easily discoverable, the source’s wishes are being respected. Beyond this individual, the redactions don’t include any information that’s relevant to understanding WikiLeaks or their activities."
## [4] "The chat log shows WikiLeaks’ private attitudes, their use of FOIA laws, as well as discussions about WikiLeaks’ lobbying and attempts to “humiliate” politicians, PR and propaganda efforts (such as establishing a “medium term truth” for “phase 2”), troll operations, attempts to engineer situations where WikiLeaks would be able to sue their critics, and in some instances where WikiLeaks helped direct lawsuits filed by third parties or encouraged criminal investigations against their opponents. In some instances, the chats are revealing. In others, they show a mundane consistency with WikiLeaks’ public stances. A few are provocative and confounding."                                   
## [5] "The extract below was created using DMArchiver, and is presented as pure text to make it easier to search and to provide as much metadata as possible (i.e. times as well as dates). The formatting is presented as-is, and shows users’ display names rather than their twitter handles. (Note: Emmy B is @GreekEmmy, not the author.)"                                                                                                                                                                                                                                                                                                                                                                           
## [6] "CW: At various points in the chat, there are examples of homophobia, transphobia, ableism, sexism, racism, antisemitism and other objectionable content and language. Some of these are couched as jokes, but are still likely to (and should) offend, as a racist or sexist jokes doesn’t cease to be racist or sexist because of an expected or desired laugh. Attempts to dismiss of these comments as “ironic” or “just trolling” merely invites comparisons to 4chan and ironic nazis. These comments, though offensive, are included in order to present as full and complete a record as possible and to let readers judge the context, purpose and merit of these comments for themselves."                
## [7] " "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
## [8] "If any current or former staffers, volunteers or hackers wants to add to my growing collection of leaks from within #WikiLeaks, please reach out. DMs are open and I’m EmmaBest on Wire."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
## [9] "— Emma Best (\u1d1c//ғ\u1d0f\u1d1c\u1d0f) \U0001f3f3️‍\U0001f308 (@NatSecGeek) June 28, 2018"

doc[10]

## [1] "[2015-05-01 13:52:11] <noll> group Dm on Wls related trolls activity, incoming events & general topics."

# Removing these lines:
doc <- doc[10:length(doc)]

And then simply save it as a txt:

write(doc, "raw.txt")

I can now easily reaccess it.

res_raw <- read.delim("https://colinfay.me/wikileaksdm/raw.txt", 
                      sep = "\n", header = FALSE) %>% 
  # Turning the character vector into a tibbe
  as.tibble() %>% 
  # Renaming the V1 columné
  rename(value = V1)
res_raw

## # A tibble: 11,377 x 1
##    value                                                                   
##    <fct>                                                                   
##  1 [2015-05-01 13:52:11] <noll> group Dm on Wls related trolls activity, i…
##  2 [2015-05-01 13:53:39] <WikiLeaks> There’s a race on now to ‘spoil’ our …
##  3 [2015-05-01 13:55:02] <WikiLeaks> Greenberg, who palled up with the ope…
##  4 [2015-05-01 13:55:53] <WikiLeaks> ..stripped the hostility. We’re putti…
##  5 [2015-05-01 13:56:03] <WikiLeaks> suppressed.                           
##  6 [2015-05-01 14:01:26] <noll> yes, both Wired & Verge contained DDB’s li…
##  7 [2015-05-01 14:02:37] <noll> – cyber attacks. must be one of, if not th…
##  8 [2015-05-01 14:03:55] <WikiLeaks> Greenberg is a partisan. Fraudulent t…
##  9 [2015-05-01 14:29:48] <noll> [Tweet] https://twitter.com/m_cetera/statu…
## 10 [2015-05-01 14:32:09] <WISE Up Action> Hi all. More comfortable discuss…
## # ... with 11,367 more rows

Cleaning the data

DMs have a specific structure: [date hour] <author> text, except for one “author”, <DMConversationEntry>, which is the meta-information about the conversation (renaming of the channel, user joining and leaving, etc). In order to tidy the format, let’s add <DMConversationEntry> as an author.

res <- res_raw %>% 
  mutate(value = str_replace_all(value, 
                                 "\\[DMConversationEntry\\]",
                                 "<DMConversationEntry>")) 

Also I’ll remove, the last entry of the corpus, which doesn’t fit the conversation format:

res[nrow(res),]

## # A tibble: 1 x 1
##   value                             
##   <chr>                             
## 1 [LatestTweetID] 931704226425856001

res <- filter(res, ! str_detect(value, "931704226425856001"))

Some messages are splitted between lines. These lines don’t start with a date (they are the middle of a DM). I’ll then paste the content of these lines at the end of the line before.

Here is an example with lines 93 & 94:

value
“[2015-05-02 14:12:27] OK, thanks H. Security issues were about who was on the list then?”
“Never quite know who you’re dealing with online I guess. I don’t, anyway!”

Here, 94 will be pasted at the end of 93 and removed.

Let’s loop this:

for (i in nrow(res):1){
  if (!grepl(pattern = "\\[.{4}-.{2}-.{2} .{2}:.{2}:.{2}\\]|DMConversationEntry", res[i,])){
    res[i-1,] <- paste(res[i-1,], res[i,])
  }
}
# Remove lines with no date or no DMConversationEntry
res <- res %>% 
  mutate(has_date = str_detect(value, pattern = "\\[.{4}-.{2}-.{2} .{2}:.{2}:.{2}\\]|DMConversationEntry")) %>%
  filter(has_date) %>%
  select(-has_date)

Extract key elements

We’ll now need to split the content in three: user, date, and text.

My first try was with :

res <- res %>%
  extract(
    value,
    c("date", "user", "text"), 
    regex = "\\[(.{4}-.{2}-.{2} .{2}:.{2}:.{2})\\] <([a-zA-Z0-9 ]*)>  (.*)"
    )

But that didn’t fit well: the DMConversationEntry has no date (I will fill them later), so I need a NA here, hence the three steps process:

res <- res %>%
  extract(value,"user", regex = "<([a-zA-Z0-9 ]*)>", remove = FALSE) %>%
  extract(value,"date", regex = "\\[(.{4}-.{2}-.{2} .{2}:.{2}:.{2})\\] .*", remove = FALSE) %>%
  extract(value, "text", regex = "<[a-zA-Z0-9 ]*> (.*)", remove = FALSE) %>%
  select(-value)

When date is missing, it’s because it’s a DMConversationEntry. Let’s verify that:

res %>% 
  filter(user == "DMConversationEntry") %>%
  summarize(nas = sum(is.na(date)), 
            nrow = n())

## # A tibble: 1 x 2
##     nas  nrow
##   <int> <int>
## 1    20    20

In order to have a date here, we will fill this with the directly preceeding date:

res <- fill(res, date)

Saving data

Global

write_csv(res, "wikileaks_dm.csv")

Year

Find the min and max years:

range(res$date)

## [1] "2015-05-01 13:52:11" "2017-11-10 04:30:46"

Filter and save a csv for each year:

walk(2015:2017, 
    ~ filter(res, lubridate::year(date) == .x) %>%
    write_csv(glue::glue("{.x}.csv"))
    )

User

Filter and save a csv for each user:

walk(unique(res$user), 
    ~ filter(res, user == .x) %>%
    write_csv(glue::glue("user_{make.names(.x)}.csv"))
    )

Counting users participation

res %>%
  count(user, sort = TRUE) %>%
  write_csv("user_count.csv")

Counting activity by days

res %>%
  mutate(date = lubridate::ymd_hms(date), 
         date = lubridate::date(date)) %>% 
  count(date) %>%
  write_csv("daily.csv")

Adding extra info

Extracting all the mentions (@something):

mentions <- res %>% 
  mutate(mention = str_extract_all(text, "@[a-zA-Z0-9_]+")) %>%
  unnest(mention) %>% 
  select(mention, everything())
write_csv(mentions, "mentions.csv")

# Count them

mentions %>%
  count(mention, sort = TRUE) %>%
  write_csv("mentions_count.csv")

Extracting all the urls (http(s)something):

urls <- res %>% 
  mutate(url = str_extract_all(text, "http.+")) %>%
  unnest() %>% 
  select(url, everything())
write_csv(urls, "urls.csv")

Adding JSON format

I’ve also chosen to export JSON format of the csv.

list.files(pattern = "csv") %>%
  walk(function(x) {
    o <- read_csv(x)
    jsonlite::write_json(
      o,
      path = glue::glue("{tools::file_path_sans_ext(x)}.json")
    )
  })
dir.create("json")
list.files(pattern = "json") %>%
  walk(function(x){
    file.copy(x, glue::glue("json/{x}"))
    unlink(x)
  })

Building a website with Markdown and GitHub

Here’s a list of random elements from the process of building these pages with R.

Hosting

My website in hosted on GitHub, with the home url (colinfay.me) pointing to the root of this repo. If I create a new folder pouet, and put inside this folder a file called index.html, I can then go to colinfay.me/pouet, and get a new website from there. As the wikileaks extraction already had its own repo, I’ve chosen to list this repo https://github.com/ColinFay/wikileaksdm as a submodule of my website’s repo.

More about submodules: https://git-scm.com/book/en/v2/Git-Tools-Submodules

Inside this wikileaksdm project, I gathered all the data, an index.Rmd which will be used as a homepage, and other Rmd for other pages. Each are compiled as html.

Styling the pages

Markdown default style is nice, but I wanted something different. This is why I used {markdowntemplates}, with the skeleton template. The yaml looks like:

---
title: "Wikileaks Twitter DM - Home"
author: '@_colinfay'
date: "2018-08-06"
fig_width: 10
fig_height: 4 
navlink: "[Wikileaks Twitter DM](https://colinfay.me/wikileaksdm)"
og:
  type: "article"
  title: "Wikileaks Twitter DM"
footer:
  - content: '<a href="https://colinfay.me">colinfay.me</a> • <a href="https://twitter.com/_ColinFay">@_colinfay</a><br/>'
output: markdowntemplates::skeleton
---

Here, you can see some new things: footer content, og for open graph data, and navlink for the content of the header.

Include the same markdown content several time

All the pages have the same intro content, so I can use shiny::includeMarkdown to include it on each page (this way, I’ll only need to update the content once if needed). Put it between backticks with an r, and the markdown is integrated at compilation time as html.

See here, line 21: https://raw.githubusercontent.com/ColinFay/wikileaksdm/master/index.Rmd

Include font awesome icons

Before every link, there is a:

This could have been done with CSS, but I’ve used the {fontawesome} package, also between backticks and with an r, to include them.

See here, line 33: https://raw.githubusercontent.com/ColinFay/wikileaksdm/master/index.Rmd

Page content

All the pages include interactive elements, and a static plot. Interactive tables have been rendered with the {DT} package, and the timeline with {dygraphs}. Under each dygraph, there is a static plot made with {ggplot2}. In order to organise this two plots (interactive and none), the second plot is put inside a <details> HTML tag. This allows to create a foldable content inside the page.

See: https://twitter.com/_ColinFay/status/1022836135452663809

Prefilling functions

I use dygraph and datatable several times, with the same defaut arguments (e.g extensions = "Buttons",options = list(scrollX = TRUE, dom = "Bfrtip", buttons = c("copy", "csv")). As I didn’t want to retype these elements each time, I’ve called purrr::partial on it:

dt <- partial(
  datatable,
  extensions = "Buttons",
  options = list(
    scrollX = TRUE, 
    dom = "Bfrtip", 
    buttons = c("copy", "csv")
    )
)

This new dt function is then used as the defaut datatable rendering.

Read more

If you want to read the code and discover the content, feel free to browse the website and the github repo:

To leave a comment for the author, please follow the link and comment on their blog: Colin Fay.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)