Using the iGraph package to Analyse the Enron Corpus

July 12, 2017
By

(This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)

This blog post analyzes an extensive collection of e-mails from former Enron employees. The Enron corpus is analysed using network analysis tools provided by the iGraph package. Network analysis is a versatile technique that can be used to add value to a lot of different data sets, including the complex corporate relationships of Donald Trump.

The Enron scandal is one of the most famous corporate governance failures in the history of capitalism. The case itself is interesting for its sake, but in this post, I am more interested in one of the data sets that the subsequent investigation has provided.

The Enron Corpus

As part of their inquiries, The Federal Energy Regulatory Commission used an extensive collection of emails from Enron employees. The Enron corpus is one of the only publicly available collections of emails available for research. This dataset also provides a fascinating playground for citizen data scientists.

The set has privacy issues as it contains messages from living people. When analysing this data set, we need to keep in mind that the majority of former Enron employees were innocent people who lost their jobs due to the greed of their overlords. The code in this post only analyses the e-mail headers, ignoring the content.

Laid-off Enron employees outside Enron headquarters as the company collapsed in 2001 - Enron corpus analysis

Laid-off Enron employees outside Enron headquarters as the company collapsed in 2001.

The Enron Corpus is a large database of half a million of emails generated by more than 100 Enron employees. You can download the corpus from the Carnegie Mellon School of Computer Science. The first code snippet downloads the 7 May 2015 version of the dataset (about 423Mb, tarred and gzipped) and untars it to your working directory.

# Enron Email Dataset: https://www.cs.cmu.edu/~./enron/
download.file("https://www.cs.cmu.edu/~./enron/enron_mail_20150507.tgz", destfile = "enron_mail_20150507.tgz")
untar("enron_mail_20150507.tgz")

Preparing the Data

The main folder is maildir, which holds all the personal accounts. Our first task is to load the required libraries and create a list of available emails. This code results in 517,401 e-mail files with 44,859 emails in the inboxes of users.

# E-mail corpus consists of nested folders per user with e-mails as text files
# Create list of all available e-mails
emails <- list.files("maildir/", full.names = T, recursive = T)
length(emails)
# Filter by inbox only
emails <- emails[grep("/inbox", emails)]
length(emails)

The bulk of the code creates a list of emails between Enron employees. The code performs a lot of string manipulations to extract the information from the text files. The content of the e-mails is ignored, the code only retrieves the sender and the receiver. The analysis is limited to e-mails between employees in the corpus. Only those receivers whose inbox forms part of the analysis are included. The result of this code is a data frame with the usernames of the sender and receiver for each email. The data frame contains 2779 emails that meet the criteria.

# Create list of sender and receiver (inbox owner)
inboxes <- data.frame(
    from = apply(as.data.frame(emails), 1, function(x){readLines(x, warn = F)[3]}),
    to = emails,
    stringsAsFactors = F
)

# Keep only enron.com and strip all but username
library(stringr) # String manipulation
inboxes <- inboxes[grepl("@enron.com", inboxes$from),]
inboxes$from <- str_sub(inboxes$from, 7, nchar(inboxes$from) - 10)
to <- str_split(inboxes$to, "/")
inboxes$to <- sapply(to, "[", 3)

# Create list of usernames
users <- data.frame(user = paste0("maildir/", unique(inboxes$to)))

# Remove those without sent mails
sent <- apply(users, 1, function(x){sum(grepl("sent", dir(x)))})
users <- subset(users, sent != 0)

# Replace username with e-mail name
users$mailname <- NA
for (i in 1:nrow(users)){
sentmail <- dir(paste0(users$user[i], "/sent_items/"))
name <- readLines(paste0(users$user[i], "/sent_items/", sentmail[1]), warn = F)[3]
name <- str_sub(name, 7, nchar(name)-10)
users$mailname[i] <- name
}
users$user <- str_sub(users$user, 9)
inboxes <- merge(inboxes, by.x="to", users, by.y="user")
inboxes <- data.frame(from = inboxes$from, to = inboxes$mailname)

inboxes$from <- as.character(inboxes$from)
inboxes$to <- as.character(inboxes$to)

# Only e-mails between inbox users
inboxes <- inboxes[inboxes$from %in% inboxes$to,]

# Remove no.address
inboxes <- subset(inboxes, from != "no.address" & to != "no.address")

# Remove emails to self
inboxes <- subset(inboxes, inboxes$from != inboxes$to)

Network Analysis

The last code snippet defines a graph from the table of e-mails. Each employee is a node in the network, and each e-mail is an edge (line). The iGraph package is a powerful tool to analyse networks. The graph_from_edgelist function creates a network object that can be analysed using the iGraph package. The graph is directed because the information flows from the sender to the receiver.

In the next step, the Spingglass algorithm finds community structure within the data. A community is a group of nodes that are more connected with each other than with any other node.

The last step visualises the network. The diagram is drawn using the Fruchterman-Reingold algorithm, which places the most connected nodes at the centre of the picture. The background colours in the diagram indicate the eight communities.

The graph tells us a lot about the group of employees in the Enron corpus and how they relate to each other. Each of the communities represents a tightly connected group of employees that mainly e-mail each other. Any connections between communities are shown in red. When the vertex.label = NA line is removed, the usernames are displayed at each node.

We can see groups that never mail each other, lonely hangers-on and tightly knit cliques within Enron. In the centre fo the graph we see w few individuals who are connectors between groups because they send a lot of emails to people outside their community. On the fringes of the graph are the hangers-on who only once or twice emailed somebody in the corpus but were still included in the investigation.

library(igraph)
g <- graph_from_edgelist(as.matrix(inboxes), directed = T)
coms <- spinglass.community(g)

# Plot network
par(mar = c(0,0,2,0))
plot(coms, g, 
     vertex.label=NA, 
     layout = layout.fruchterman.reingold,
     vertex.size = 3
)

This analysis provides only a quick glimpse into the knowledge contained in the Enron corpus. An extensive range of tools is available to analyse such networks. An interesting exercise would be to overlap this network with the organisation chart to see the relationships between teams. Have fun playing with this fantastic data set!

Enron corpus network with communities.

 

The post Using the iGraph package to Analyse the Enron Corpus appeared first on The Devil is in the Data.

To leave a comment for the author, please follow the link and comment on their blog: The Devil is in the Data.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)