# Enron Email Corpus Topic Model Analysis Part 2 – This Time with Better regex

November 4, 2013

(This article was first published on Data and Analysis with R, for Fun (and Maybe Work!), and kindly contributed to R-bloggers)

After posting my analysis of the Enron email corpus, I realized that the regex patterns I set up to capture and filter out the cautionary/privacy messages at the bottoms of people’s emails were not working.  Let’s have a look at my revised Python code for processing the corpus:
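To illustrate the kind of filtering described above, here is a minimal sketch of a disclaimer filter. The patterns and the function name `strip_disclaimer` are hypothetical examples, not the patterns actually used for this analysis; it assumes each email body is already available as a plain string.

```python
import re

# Hypothetical patterns: phrases that commonly open a confidentiality or
# privacy footer. A real corpus would need this list tuned by inspection.
DISCLAIMER_START = re.compile(
    r"^.*(?:this (?:e-?mail|message|communication)\b.*\b(?:confidential|privileged)"
    r"|intended (?:solely|only) for)\b.*$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_disclaimer(body: str) -> str:
    """Drop everything from the first disclaimer-like line to the end."""
    match = DISCLAIMER_START.search(body)
    if match is None:
        return body  # nothing that looks like a footer, leave the text alone
    return body[: match.start()].rstrip()


if __name__ == "__main__":
    msg = (
        "Thanks,\nBob\n\n"
        "This e-mail is confidential and intended only for the addressee.\n"
        "Please delete it."
    )
    print(strip_disclaimer(msg))
```

Cutting from the first matching line to the end of the text is a deliberate simplification: it also discards any quoted reply that happens to sit below a footer, which may or may not be what you want for topic modelling.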

As I did not change the R code since the last post, let’s have a look at the results:

terms(lda.model,20)
Topic 1   Topic 2   Topic 3     Topic 4
[1,] "enron"   "time"    "pleas"     "deal"
[2,] "busi"    "thank"   "thank"     "gas"
[3,] "manag"   "day"     "attach"    "price"
[4,] "meet"    "dont"    "email"     "contract"
[5,] "market"  "call"    "enron"     "power"
[6,] "compani" "week"    "agreement" "market"
[7,] "vinc"    "look"    "fax"       "chang"
[8,] "report"  "talk"    "call"      "rate"
[10,] "energi"  "ill"     "file"      "day"
[11,] "inform"  "tri"     "messag"    "month"
[12,] "pleas"   "bit"     "inform"    "compani"
[14,] "risk"    "night"   "send"      "transact"
[15,] "discuss" "friday"  "corp"      "product"
[16,] "regard"  "weekend" "kay"       "term"
[17,] "team"    "love"    "review"    "custom"
[18,] "plan"    "item"    "receiv"    "cost"
[19,] "servic"  "email"   "question"  "thank"
[20,] "offic"   "peopl"   "draft"     "purchas"

One at a time, I will try to interpret what each topic is trying to describe:

1. This one appears to be a business process topic, containing a lot of general business terms, with a few even relating to meetings.
2. Similar to the last model that I derived, this topic has a lot of time-related words in it, such as: time, day, week, night, friday, weekend.  I’ll be interested to see if this is another business meeting/interview/social meeting topic, or whether it describes something more social.
3. Hrm, this topic seems to contain a lot of general terms used when we talk about communication: email, agreement, fax, call, message, inform, phone, send, review, question.  It even has please and thank you!  I suppose it’s very formal and you could perhaps interpret this as professional-sounding administrative emails.
4. This topic seems to be another case of emails containing a lot of ‘shop talk’: deal, gas, price, contract, power, market, rate.

Okay, let’s see if we can find some examples for each topic:

sample(which(df.emails.topics$"1" > .95),3)
[1] 27771 45197 27597

enron[[27771]]

Christi's call.

Christi has asked me to schedule the above meeting/conference call.  September 11th (p.m.) seems to be the best date.  Question: Does this meeting need to be a 1/2 day meeting?  Christi and I were wondering.

Give us your thoughts.

Yup, business process, meeting. This email fits the bill! Next!

enron[[45197]]

Bob,

I didn't check voice mail until this morning (I don't have a blinking light.  The assistants pick up our lines and amtel us when voice mails have been left.)  Anyway, with the uncertainty of the future business under the Texas Desk, the following are my goals for the next six months:

1) Ensure a smooth transition of HPL to AEP, with minimal upsets to Texas business.
2) Develop operations processes and controls for the new Texas Desk.
3) Develop a replacement
a. Strong push to improve Liz (if she remains with Enron and )
b. Hire new person, internally or externally
4) Assist in develop a strong logisitcs team.  With the new business, we will need strong performers who know and accept their responsibilites.

1 and 2 are open-ended.  How I accomplish these goals and what they entail will depend how the Texas Desk (if we have one) is set up and what type of activity the desk will be invovled in, which is unknown to me at this time.  I'm sure as we get further into the finalization of the sale, additional and possibly more urgent goals will develop.  So, in short, who knows what I need to do.

D

This one also seems to fit the bill. “D” here is writing about his/her goals for the next six months and considers briefly how to accomplish them. Not heavy into the content of the business, so I’m happy here. On to topic 2:

sample(which(df.emails.topics$"2" > .95),3)
[1] 50356 22651 19259

enron[[50356]]

I agree it is Matt, and  I believe he has reviewed this tax stuff (or at
least other turbine K's) before.  His concern will be us getting some amount
of advance notice before title transfer (ie, delivery).  Obviously, he might
have some other comments as well.  I'm happy to send him the latest, or maybe
he can access the site?

Kay

Given that the present form of GE world hunger seems to be more domestic than
international it would appear that Matt Gockerman would be a good choice for
the Enron- GE tax discussion.  Do you want to contact him or do you want me
to.   I would be interested in listening in on the conversation for
continuity. 

Here, the conversants seem to be talking about having a phone conversation with “Matt” to get his ideas on a tax discussion. This fits in with the meeting theme. Next!

enron[[22651]]

LOVE
HONEY PIE


Well, that was pretty social, wasn’t it? Okay, one more from the same topic:

enron[[19259]]

Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
X- X- X- X-b X-Folder: \ExMerge - Giron, Darron C.\Sent Items
X-Origin: GIRON-D
X-FileName: darron giron 6-26-02.PST

Sorry.  I've got a UBS meeting all day.  Catch you later.  I was looking forward to the conversation.

DG

It seems everyone agreed to Ninfa's.  Let's meet at 11:45; let me know if a
different time is better.  Ninfa's is located in the tunnel under the JP
Morgan Chase Tower at 600 Travis.  See you there.

Schroeder

Whoops, header info that I didn’t manage to filter out. Anyway, DG writes about an impending conversation, and Schroeder writes about a specific time for their meeting. This fits! Next topic!
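Leftover header lines like the ones in that last email could be caught with one more pattern. The following is a minimal sketch, not part of the processing code used for this post: the function name `strip_leftover_headers` and the list of header prefixes are assumptions based only on the lines visible in the example above.

```python
import re

# Assumed header prefixes, taken from the leaked lines in the example email:
# "Mime-Version:", "Content-...:" and anything starting with "X-".
# Caveat: a genuine body line that starts with "X-" would also be removed.
HEADER_LINE = re.compile(
    r"^(?:Mime-Version:|Content-[A-Za-z-]+:|X-).*\n?",
    re.MULTILINE,
)


def strip_leftover_headers(body: str) -> str:
    """Remove header-looking lines and trim the surrounding blank space."""
    return HEADER_LINE.sub("", body).strip()


if __name__ == "__main__":
    raw = (
        "Mime-Version: 1.0\n"
        "Content-Transfer-Encoding: 7bit\n"
        "X-Origin: GIRON-D\n"
        "X-FileName: darron giron 6-26-02.PST\n"
        "\n"
        "Sorry.  I've got a UBS meeting all day.  Catch you later."
    )
    print(strip_leftover_headers(raw))
```

A prefix list is crude but easy to audit. Python's standard `email` package parses headers properly and would be a sturdier choice wherever the raw messages are still intact.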

sample(which(df.emails.topics$"3" > .95),3)
[1] 24147 51673 29717

enron[[24147]]

Kaye:  Can you please email the prior report to me?  Thanks.

Sara Shackleton
Enron North America Corp.
1400 Smith Street, EB 3801a
Houston, Texas  77002
713-853-5620 (phone)
713-646-3490 (fax)

04/10/2001 05:56 PM

At Alan's request, please provide to me by e-mail (with a Thursday of this week your suggested changes to the March 2001 Monthly Report, so that we can issue the April 2001 Monthly Report by the end of this week.  Thanks for your attention to this matter.

Nita

This one definitely fits in with the professional sounding administrative emails interpretation. Emailing reports and such. Next!

I believe this was intended for Susan Scott with ETS...I'm with Nat Gas trading.

Thanks

FYI...another executed capacity transaction on EOL for Transwestern.

This message is to confirm your EOL transaction with Transwestern Pipeline Company.  You have successfully acquired the package(s) listed below.  If you have questions or concerns regarding the transaction(s), please call Michelle Lokay at (713) 345-7932 prior to placing your nominations for these volumes.

Product No.: 39096
Time Stamp: 3/27/01 09:03:47 am
Product Name: US PLCapTW Frm CenPool-OasisBlock16
Shipper Name: E Prime, Inc.
Volume: 10,000 Dth/d
Rate: $0.0500 /dth 1-part rate (combined  Res + Com) 100% Load Factor
+ applicable fuel and unaccounted for

TW K#: 27548

Effective
Points:	RP- (POI# 58649)  Central Pool      10,000 Dth/d
DP- (POI# 8516)   Oasis Block 16  10,000 Dth/d

Alternate Point(s):  NONE

Note:     	In order to place a nomination with this agreement, you must log
off the TW system and then log back on.  This action will update
the agreement's information on your PC and allow you to place
nominations under the agreement number shown above.

Contact Info:		Michelle Lokay
Phone (713) 345-7932
Fax       (713) 646-8000

Rather long, but even the short part at the beginning falls under the right category for this topic! Okay, let’s look at the final topic:

sample(which(df.emails.topics$"4" > .95),3)
[1] 39100  31681  6427

enron[[39100]]

Randy, your proposal is fine by me.  Jim



Hrm, this is supposed to be a ‘business content’ topic, so I suppose I can see why this email was classified as such. It doesn’t take long to go from ‘proposal’ to ‘contract’ if you free associate, right? Next!

enron[[31681]]

review.  I reviewed the letter with Jim Osborne and Ken Krisa yesterday and
should get their comments today.  My plan is to Fedex to Midland for Ken's
signature tomorrow morning and from there it will got to Wildhorse.



This one makes me feel a little better, referencing a specific business letter that the sender probably wants the recipient to see. Let’s find one more for good luck:

enron[[6427]]

At a ratio of 10:1, you should have your 4th one signed and have the fifth
one on the way...

09/19/2000 05:40 PM

ONLY 450!  Why, I thought you guys hit 450 a long time ago.

Marie Heard
Senior Legal Specialist
Phone:  (713) 853-3907
Fax:  (713) 646-8537

09/19/00 05:34 PM

Well, I do believe this makes 450!  A nice round number if I do say so myself!

Susan Bailey
09/19/2000 05:30 PM

We have received an executed Master Agreement:

Type of Contract:  ISDA Master Agreement (Multicurrency-Cross Border)

Effective
Enron Entity:   Enron North America Corp.

Counterparty:   Arizona Public Service Company

Weather
Foreign Exchange
Pulp & Paper

Special Note:  The Counterparty has three (3) Local Business Days after the
receipt of a Confirmation from ENA to accept or dispute the Confirmation.
Also, ENA is the Calculation Agent unless it should become a Defaulting
Party, in which case the Counterparty shall be the Calculation Agent.

Susan S. Bailey
Enron North America Corp.
1400 Smith Street, Suite 3806A
Houston, Texas 77002
Phone: (713) 853-4737
Fax: (713) 646-3490

That one was very long, but there’s definitely some good business content in it (along with some happy banter about the contract that I guess was acquired).

All in all, I’d say that fixing those regex patterns that were supposed to filter out the caution/privacy messages at the ends of people’s emails was a big boon to the LDA analysis here.

Let that be a lesson: half the battle in LDA is in filtering out the noise!