Mining Sent Email for Self-Knowledge

[This article was first published on Dan Garmat's Blog -- R, and kindly contributed to R-bloggers].

How can we use data analytics to increase our self-knowledge? Along with biofeedback from digital devices like FitBit, less structured sources such as sent emails can provide insights.

For example, here it seems my communication took a sudden turn for the positive in 2013. Let’s see what else shakes out of my sent email corpus.

[Figure: monthly_sentiment]

In “Snakes in a Package: combining Python and R with reticulate”, Adnan Fiaz uses a download of personal Gmail from Google Takeout to extract R-bloggers post counts from subject lines. To handle Gmail’s choice of the mbox file format, rather than write a new R package to parse mbox files, he uses reticulate to import a Python package, mailbox. His approach seems a great use case for reticulate: when you want to take advantage of a highly developed Python package in R.

Loading Email Corpus into R

I wanted to mine my own emails for sentiment and see if I could learn anything about myself. Has my sent mail shown signs of mood trends over time? I started by following his example:

library(tidyverse)
library(stringr)
library(tidytext)
library(lubridate)
library(reticulate)
mailbox <- import("mailbox")

sent <- mailbox$mbox("Sent-001.mbox")
message <- sent$get_message(11L)
message$get("Date")
# [1] "Mon, 23 Jul 2018 20:01:33 -0700"
message$get("Subject")
# [1] "Re: Ptfc schedules"

Loading email #11, I can see it’s about Portland Football Club’s schedule. I wanted to see the body of the email, but found that R’s normal built-in documentation doesn’t exist for Python modules:

?get_message
# No documentation for ‘get_message’ in specified packages and libraries:
# you could try ‘??get_message’
?mailbox
# No documentation for ‘mailbox’ in specified packages and libraries:
# you could try ‘??mailbox’

Returning message prints the whole thing, but with a lot of unneeded formatting. So I worked around it with nested sub() and gsub() calls, tuned on specific example emails, to get down to only the text I wrote and sent.

It starts with this already difficult-to-understand call:

sub(".*Content-Transfer-Encoding: quoted-printable", "",
  gsub("=E2=80=99", "'",
  gsub(">", "",
  sub("On [A-Z][a-z]{2}.*", "",
  gsub("\n|\t", " ",
  message)))))
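Read from the inside out, the nested call applies five substitutions in order. A minimal sketch of the same steps, unrolled one per line, may make that order easier to see. The `raw` string is a made-up message invented for illustration (not taken from the mailbox), and the `step*` names are hypothetical:

```r
# Hypothetical raw message text, invented for illustration only.
raw <- paste(
  "Content-Type: text/plain; charset=UTF-8",
  "Content-Transfer-Encoding: quoted-printable",
  "",
  "Sounds good, let=E2=80=99s meet Friday.",
  "",
  "On Mon, Jul 23, 2018 at 8:01 PM Someone wrote:",
  "> Are you free this week?",
  sep = "\n"
)

step1 <- gsub("\n|\t", " ", raw)               # flatten newlines and tabs to spaces
step2 <- sub("On [A-Z][a-z]{2}.*", "", step1)  # cut everything from the quoted reply on
step3 <- gsub(">", "", step2)                  # drop any leftover quote markers
step4 <- gsub("=E2=80=99", "'", step3)         # encoded right single quote -> apostrophe
step5 <- sub(".*Content-Transfer-Encoding: quoted-printable", "", step4)  # drop the headers

body_text <- trimws(step5)
body_text
```

One caveat of the `"On [A-Z][a-z]{2}.*"` pattern is that it also truncates any message whose own text contains something like "On Tuesday", so some of the written text can be lost.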

And, after much guess-try-see-what’s-left-and-add-another-sub(), I ended up with this ugly function that does semi-reasonably for my goal of sentiment analysis:

parse_sent_message <- function(email){
  substr(
    gsub("-top:|-bottom:|break-word","",
    sub("Content-Type: application/pdf|Mime-Version: 1.0.*","",
    sub(".*charset ISO|charset  UTF-8|charset us-ascii","",
    sub(".*Content-Transfer-Encoding: 7bit", "",
    sub("orwarded message.*", "",
    gsub("=|\"", " ",
    gsub("  ", " ",
    gsub("= ", "",
    sub(".*Content-Transfer-Encoding: quoted-printable", "",
    sub(".*charset=UTF-8", "",
    gsub("=E2=80=99|'", "'",
    gsub(">|<", "",
    sub("On [A-Z][a-z]{2}.*", "",
    gsub("\n|\t|<div|</div>|<br>", " ",
    email)))))))))))))),
  1, 10000)
}

parse_sent_message(message)
# [1] " Hey aren't you planning to go to Seattle the 16th? Trying to figure out my days off schedule    "
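One way to make a long chain like this easier to read and extend is to keep the substitutions in a list and apply them in order. The sketch below uses only a handful of the patterns from the function above, so it is not a drop-in replacement. The names `rules`, `apply_rules()` and `example_raw` are hypothetical, and the example message is made up:

```r
# Each rule: a pattern, its replacement, and whether to replace all matches
# (gsub) or only the first (sub). Rules run top to bottom.
# Illustrative subset only, not the full list used in parse_sent_message().
rules <- list(
  list(pattern = "\n|\t",              replacement = " ", all = TRUE),
  list(pattern = "On [A-Z][a-z]{2}.*", replacement = "",  all = FALSE),
  list(pattern = "=E2=80=99",          replacement = "'", all = TRUE),
  list(pattern = ".*charset=UTF-8",    replacement = "",  all = FALSE),
  list(pattern = " +",                 replacement = " ", all = TRUE)
)

apply_rules <- function(text, rules) {
  Reduce(function(acc, rule) {
    fun <- if (rule$all) gsub else sub
    fun(rule$pattern, rule$replacement, acc)
  }, rules, init = text)
}

# Made-up message for illustration.
example_raw <- paste0(
  "Content-Type: text/plain; charset=UTF-8\n\n",
  "It=E2=80=99s fine.\tThanks\n\n",
  "On Tue, Jan 2, 2018 Someone wrote:\n> ok"
)

cleaned <- trimws(apply_rules(example_raw, rules))
cleaned
```

Adding a newly discovered bit of leftover markup then means adding one line to `rules` rather than another level of nesting.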

Good to go. I tried using the R mailman wrapper, but ran into issues, so went back to the imported mailbox module. Importing and parsing took a few minutes:

message$get("From") # check this email index 11 if from my email address
myemail <- message$get("From") # since it is, save as myemail to check the rest

keys <- sent$keys()
# keys <- keys[1:3000] # uncomment if want to run the below on a subset to see if it works
number_of_messages <- length(keys)

pb <- utils::txtProgressBar(max=number_of_messages)
sent_messages <- data_frame(sent_date = as.character(NA), text = rep(as.character(NA), number_of_messages))

for(i in seq_along(keys)){
  message <- sent$get_message(keys[i])
  if(is.character(message$get("From"))){
    if (message$get("From") %in% myemail){
      sent_messages[i, 1] <- message$get("Date")
      sent_messages[i, 2] <- parse_sent_message(message)
    }
  }
  utils::setTxtProgressBar(pb, i)
}

If the message is not from me, it is saved as NA. What percent of mail flagged “sent” was not from myemail?

sum(is.na(sent_messages$text)) / number_of_messages
# [1] 0.6664132

67%.
After removing them and doing some additional processing, I can see these 11,093 remaining sent emails range from November of 2004 to September of 2018, with a median date of October of 2013.

sent_messages <-
  sent_messages %>%
  filter(!is.na(text))

sent_messages <-
  sent_messages %>%
  mutate(sent_date = dmy_hms(sent_date))

# remove duplicates per month
sent_messages <-
  sent_messages %>%
  mutate(year_sent = year(sent_date),
         month_sent = month(sent_date)) %>%
  group_by(year_sent, month_sent, text) %>%
  top_n(1, wt = sent_date) %>%
  ungroup()

sent_messages %>%
  summary(sent_date)
#   sent_date                       text             year_sent      month_sent
# Min.   :2004-11-10 01:42:04   Length:11093       Min.   :2004   Min.   : 1.000
# 1st Qu.:2010-07-17 20:39:10   Class :character   1st Qu.:2010   1st Qu.: 3.000
# Median :2013-10-01 22:12:08   Mode  :character   Median :2013   Median : 6.000
# Mean   :2013-03-24 10:55:30                      Mean   :2013   Mean   : 6.416
# 3rd Qu.:2015-09-18 19:45:21                      3rd Qu.:2015   3rd Qu.: 9.000
# Max.   :2018-09-30 01:35:02                      Max.   :2018   Max.   :12.000
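The `Date` header is a string such as "Mon, 23 Jul 2018 20:01:33 -0700" (day, month name, year, time, UTC offset). As a rough sketch of what the date-parsing step has to handle, the same string can be parsed in base R with `strptime()`. This is an illustration only, not what `dmy_hms()` does internally, and it assumes an English locale for the month abbreviation; the variable names are hypothetical:

```r
date_header <- "Mon, 23 Jul 2018 20:01:33 -0700"

# Drop the leading weekday, then parse day, month name, year, time and offset.
# %b is locale-dependent; an English locale is assumed here.
no_weekday <- sub("^[A-Za-z]{3}, ", "", date_header)
parsed <- as.POSIXct(strptime(no_weekday, "%d %b %Y %H:%M:%S %z", tz = "UTC"))

# The -0700 offset is applied, so the result is expressed in UTC.
parsed_utc <- format(parsed, "%Y-%m-%d %H:%M:%S", tz = "UTC")
parsed_utc

# Year and month, as used for the per-month grouping.
year_sent  <- as.integer(format(parsed, "%Y", tz = "UTC"))
month_sent <- as.integer(format(parsed, "%m", tz = "UTC"))
```

Because the offset moves this message past midnight UTC, messages sent late in the evening local time can land on the next day, and occasionally in the next month, once converted.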

The median date falls later than the chronological midpoint, which seems to imply more emails in later years; judging from the chart above, though, it’s probably due more to missing years of data.
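To put a number on that gap, the chronological midpoint can be worked out from the minimum and maximum dates in the summary and compared with the median. This is a small arithmetic sketch using only the dates printed above (times of day are ignored, and the variable names are hypothetical):

```r
first_sent  <- as.Date("2004-11-10")  # Min. in the summary
last_sent   <- as.Date("2018-09-30")  # Max. in the summary
median_sent <- as.Date("2013-10-01")  # Median in the summary

# Halfway point between first and last message dates.
midpoint <- first_sent + as.numeric(last_sent - first_sent) / 2
midpoint

# Days by which the median trails the midpoint.
gap_days <- as.numeric(median_sent - midpoint)
gap_days
```

The midpoint works out to late October 2011, so the median sits roughly two years after it.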

Sentiment Analysis

Julia Silge and David Robinson have put together an excellent online reference on text mining, Text Mining with R, so with a little work I can follow their analyses with email data. Using their tidytext package, I quickly see that a lot of HTML formatting tags still made it past my gsub() gauntlet.

tidy_emails <-
  sent_messages %>%
  unnest_tokens(word, text)

tidy_emails
# # A tibble: 886,870 x 4
#    sent_date           year_sent month_sent word
#    <dttm>                  <dbl>      <dbl> <chr>
#  1 2018-09-27 16:30:19      2018          9 htmlbodyp
#  2 2018-09-27 16:30:19      2018          9 style
#  3 2018-09-27 16:30:19      2018          9 margin
#  4 2018-09-27 16:30:19      2018          9 0px
#  5 2018-09-27 16:30:19      2018          9 font
#  6 2018-09-27 16:30:19      2018          9 stretch
#  7 2018-09-27 16:30:19      2018          9 normal
#  8 2018-09-27 16:30:19      2018          9 font
#  9 2018-09-27 16:30:19      2018          9 size
# 10 2018-09-27 16:30:19      2018          9 12px
# # ... with 886,860 more rows

In fact, after common stop words are removed, I can see a need to add a few more:

data(stop_words)

tidy_emails <-
  tidy_emails %>%
  anti_join(stop_words)

tidy_emails %>%
  count(word, sort = TRUE)
# # A tibble: 129,528 x 2
#    word                                                                             n
#    <chr>                                                                        <int>
#  1 3d                                                                            8433
#  2 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa  7620
#  3 content                                                                       4086
#  4 dan                                                                           3487
#  5 1                                                                             3451
#  6 font                                                                          2735
#  7 type                                                                          2695
#  8 style                                                                         2535
#  9 nbsp                                                                          2495
# 10 class                                                                         2451
# # ... with 129,518 more rows

The second most common token is a long run of a’s:

nchar("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa")
# [1] 76

Maybe these 76 a’s in a row come from something the gsub()s consolidated. Another possible source is base64-encoded attachment data: MIME wraps base64 at 76 characters per line, a run of zero bytes encodes as repeated “A”, and unnest_tokens() lowercases tokens by default.
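The top token, `3d`, most likely has a similar origin in the transfer encoding. In quoted-printable text a literal `=` is written as `=3D`, so HTML such as `style=3D"..."` leaves a stray `3D` once the `=` signs are replaced by spaces. The `=E2=80=99` handled earlier is the same encoding applied to the three UTF-8 bytes of a right single quotation mark. A minimal base R sketch of a quoted-printable decoder shows the mechanism; `decode_quoted_printable()` is a hypothetical helper written for this illustration, not part of any package used above, and it assumes the decoded bytes are UTF-8:

```r
decode_quoted_printable <- function(x) {
  # A trailing "=" marks a soft line break: the line simply continues.
  x <- gsub("=\r?\n", "", x)
  if (!nzchar(x)) return("")

  # Split into "=XX" escapes, runs of ordinary text, and any stray "=".
  pieces <- regmatches(
    x, gregexpr("=[0-9A-Fa-f]{2}|[^=]+|=", x, perl = TRUE)
  )[[1]]

  # Turn each escape into the byte it stands for; keep other text as is.
  bytes <- unlist(lapply(pieces, function(p) {
    if (grepl("^=[0-9A-Fa-f]{2}$", p)) {
      as.raw(strtoi(substr(p, 2, 3), base = 16L))
    } else {
      charToRaw(p)
    }
  }))

  out <- rawToChar(bytes)
  Encoding(out) <- "UTF-8"  # assumption: the payload is UTF-8
  out
}

decode_quoted_printable("aren=E2=80=99t")
decode_quoted_printable("<p style=3D\"margin: 0px\">")
```

Decoding before stripping markup would keep tokens like `3d` out of the counts in the first place. On the Python side, `get_payload(decode=True)` on a non-multipart message object decodes according to its Content-Transfer-Encoding, which may be a simpler route than doing it in R.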

Adding these less useful terms to create an email stop words dictionary:

email_stop_words <-
  stop_words %>%
  rbind(
    data_frame("word" = c(seq(0,9), "3d", "8a", "mail.gmail.com", "wa", "aa", "content", "dir",
                          "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
                          "ad", "af", "font", "type", "auto", "zz", "ae", "zx", "id", "ai",
                          "style", "nbsp", "class", "span", "http", "text", "gmail.com",
                          "plain", "0px", "size", "color", "quot", "8859", "href", "margin", "ltr",
                          "left", "disposition", "attachment", "padding", "rgba", "webkit", "https"),
               "lexicon" = "sent_email")
  )

# just remove all words less than 3 letters
tidy_emails <-
  tidy_emails %>%
  anti_join(email_stop_words) %>%
  filter(nchar(word) >= 3)

tidy_emails %>%
  count(word, sort = TRUE) %>%
  top_n(n = 10, wt = n) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

top_words

Some unsurprising name-related terms show up, as well as “lol” and “hey”. More surprisingly, “time”, “meeting”, “week”, and “people” also appear a lot. I wonder if those are unusual. (I would need another sent-mail corpus to compare.)

What are my top joy words in email?

<span class="n">nrc_joy</span><span class="w"> </span><span class="o"><-</span><span class="w"> 
  </span><span class="n">get_sentiments</span><span class="p">(</span><span class="s2">"nrc"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">filter</span><span class="p">(</span><span class="n">sentiment</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"joy"</span><span class="p">)</span><span class="w">

</span><span class="n">tidy_emails</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">inner_join</span><span class="p">(</span><span class="n">nrc_joy</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">count</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="w"> </span><span class="n">sort</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="c1"># # A tibble: 373 x 2</span><span class="w">
</span><span class="c1">#    word        n</span><span class="w">
</span><span class="c1">#    <chr>   <int></span><span class="w">
</span><span class="c1">#  1 art       531</span><span class="w">
</span><span class="c1">#  2 feeling   389</span><span class="w">
</span><span class="c1">#  3 hope      387</span><span class="w">
</span><span class="c1">#  4 found     318</span><span class="w">
</span><span class="c1">#  5 pretty    286</span><span class="w">
</span><span class="c1">#  6 true      267</span><span class="w">
</span><span class="c1">#  7 pay       229</span><span class="w">
</span><span class="c1">#  8 money     218</span><span class="w">
</span><span class="c1">#  9 friend    209</span><span class="w">
</span><span class="c1"># 10 love      203</span><span class="w">
</span><span class="c1"># # ... with 363 more rows</span><span class="w">
</span>

Hm, I only partially agree with this list. “Art” is a friend I email frequently. “Feeling” is slightly positive, but more neutral than a joy word per se. “Hope” is the most common word I’d agree is a joy word, at least in my sent mail between 2004 and 2018.
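Lexicon matches like “art” (a name in this corpus) can be filtered out before counting. The following is a minimal sketch on toy data, not the code used for the results above: `tidy_emails_toy`, `nrc_joy_toy`, and the hand-picked `not_joy_here` list are hypothetical stand-ins for the real objects.

```r
library(dplyr)

# Toy stand-ins (assumed shape: one row per word occurrence / lexicon entry)
tidy_emails_toy <- tibble(word = c("art", "hope", "art", "love", "feeling", "hope"))
nrc_joy_toy     <- tibble(word = c("art", "hope", "love", "feeling"))

# Hypothetical, hand-picked words that are not joy words in this corpus
not_joy_here <- tibble(word = c("art"))

joy_counts <- tidy_emails_toy %>%
  inner_join(nrc_joy_toy, by = "word") %>%   # keep only lexicon words
  anti_join(not_joy_here, by = "word") %>%   # drop corpus-specific false positives
  count(word, sort = TRUE)

joy_counts
```

Passing `by = "word"` explicitly also silences the join message that the unqualified `inner_join()` calls print.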

How does sentiment look over time? Grouping by month:

<span class="n">email_sentiment</span><span class="w"> </span><span class="o"><-</span><span class="w"> 
  </span><span class="n">tidy_emails</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">year_sent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">year</span><span class="p">(</span><span class="n">sent_date</span><span class="p">),</span><span class="w">
         </span><span class="n">month_sent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">month</span><span class="p">(</span><span class="n">sent_date</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">inner_join</span><span class="p">(</span><span class="n">get_sentiments</span><span class="p">(</span><span class="s2">"bing"</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">count</span><span class="p">(</span><span class="n">year_sent</span><span class="p">,</span><span class="w"> </span><span class="n">month_sent</span><span class="p">,</span><span class="w"> </span><span class="n">sentiment</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">spread</span><span class="p">(</span><span class="n">sentiment</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">sentiment</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">positive</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">negative</span><span class="p">)</span><span class="w">

</span><span class="n">ggplot</span><span class="p">(</span><span class="n">email_sentiment</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">month_sent</span><span class="p">,</span><span class="w"> </span><span class="n">sentiment</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.factor</span><span class="p">(</span><span class="n">year_sent</span><span class="p">)))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_col</span><span class="p">(</span><span class="n">show.legend</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">facet_wrap</span><span class="p">(</span><span class="o">~</span><span class="n">year_sent</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w"> 
</span>

sentiment_by_time

2005, 2013, 2015 and 2016 look like the more positive years for sent mail. 2009 and 2011 look more negative overall. Much of 2006, 2007 and 2008 is missing, weirdly.
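One caveat: `positive - negative` is a raw difference, so a month with many emails can look more extreme than a quiet month with the same tone. A possible complement, sketched here on made-up numbers (`monthly_toy` is a hypothetical stand-in shaped like `email_sentiment`), is the share of matched words that are positive:

```r
library(dplyr)

# Toy monthly counts in the same column layout as email_sentiment after spread()
monthly_toy <- tibble(
  year_sent  = c(2009, 2009, 2013),
  month_sent = c(7, 8, 1),
  negative   = c(10, 90, 40),
  positive   = c(20, 60, 120)
)

monthly_toy <- monthly_toy %>%
  mutate(
    net_sentiment  = positive - negative,              # raw difference, grows with volume
    positive_share = positive / (positive + negative)  # between 0 and 1, volume-independent
  )

monthly_toy
```

Plotting `positive_share` instead of the raw difference would be one way to check whether the year-to-year pattern survives after accounting for how much mail was sent.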

There is also an apparently highly negative month in August 2009.

<span class="c1"># whoa, what happened in August of 2009?</span><span class="w">
</span><span class="n">sent_messages</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">sent_date</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="s2">"2009-08-01"</span><span class="p">,</span><span class="w"> </span><span class="n">sent_date</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="s2">"2009-08-31"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">write.csv</span><span class="p">(</span><span class="s2">"temp.csv"</span><span class="p">)</span><span class="w">

</span><span class="n">tidy_emails</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">year_sent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">year</span><span class="p">(</span><span class="n">sent_date</span><span class="p">),</span><span class="w">
         </span><span class="n">month_sent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">month</span><span class="p">(</span><span class="n">sent_date</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">inner_join</span><span class="p">(</span><span class="n">get_sentiments</span><span class="p">(</span><span class="s2">"bing"</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">filter</span><span class="p">(</span><span class="n">year_sent</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">2009</span><span class="p">,</span><span class="w"> </span><span class="n">month_sent</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">08</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">count</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="w"> </span><span class="n">sentiment</span><span class="p">,</span><span class="w"> </span><span class="n">sort</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> 
</span><span class="c1"># # A tibble: 237 x 3</span><span class="w">
</span><span class="c1">#    word       sentiment     n</span><span class="w">
</span><span class="c1">#    <chr>      <chr>     <int></span><span class="w">
</span><span class="c1">#  1 pain       negative     35</span><span class="w">
</span><span class="c1">#  2 happiness  positive     21</span><span class="w">
</span><span class="c1">#  3 sting      negative     21</span><span class="w">
</span><span class="c1">#  4 happy      positive     12</span><span class="w">
</span><span class="c1">#  5 stinging   negative     12</span><span class="w">
</span><span class="c1">#  6 depression negative     11</span><span class="w">
</span><span class="c1">#  7 free       positive     11</span><span class="w">
</span><span class="c1">#  8 bad        negative      9</span><span class="w">
</span><span class="c1">#  9 damage     negative      9</span><span class="w">
</span><span class="c1"># 10 venom      negative      9</span><span class="w">
</span><span class="c1"># # ... with 227 more rows</span><span class="w">
</span>

Was it a bad breakup? Digging into my emails, I found a New York Times Magazine article copied and pasted and sent to several people. The article, “Oh, Sting, Where Is Thy Death?” by Richard Conniff, covers the pain of stinging insects and its relevance to happiness research. Note that most of those counts are divisible by 3, which is consistent with the same text going out three times.
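Pasted text sent to several recipients inflates word counts. One possible guard is to drop exact duplicate message bodies before tokenizing. This is only a sketch: `sent_toy` is made-up data, and the column name `body` is an assumption about how the message text is stored.

```r
library(dplyr)

# Toy stand-in for sent_messages; `body` is an assumed column name
sent_toy <- tibble(
  sent_date = as.Date(c("2009-08-03", "2009-08-03", "2009-08-04", "2009-08-10")),
  body      = c("pain sting venom", "pain sting venom", "pain sting venom", "happy day")
)

# Keep the first message for each distinct body, with its other columns
deduped <- sent_toy %>%
  distinct(body, .keep_all = TRUE)

deduped
```

This only catches exact copies. A forwarded article with a different greeting on top would still be counted once per message.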

Most Common Charged Words

Taking all the emotionally charged words and seeing which come up most often, both surprises and expected results show up:

bing_word_counts <- 
  tidy_emails %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

bing_word_counts
# # A tibble: 2,143 x 3
#    word    sentiment     n
#    <chr>   <chr>     <int>
#  1 cool    positive    481
#  2 nice    positive    456
#  3 free    positive    445
#  4 bad     negative    308
#  5 pretty  positive    286
#  6 retreat negative    239
#  7 solid   positive    230
#  8 fine    positive    222
#  9 hard    negative    219
# 10 worth   positive    207
# # ... with 2,133 more rows

bing_word_counts %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip()

top_sentiment_words

Surprised to see how many more positive words show up than negative ones. The Bing lexicon actually lists more negative words than positive ones, so lexicon size doesn't explain it. “Bad” as top negative word seems like a bad top word. “Issue” is definitely a word I have an issue with using a bad amount of the time. But it’s cool to see how much I use “cool” (or is it bad? this is causing anxiety). Anyway, I think this is a solid view worth the time to get a nice feeling for top words I love to use in email.
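To put a number on the positive/negative balance, the per-word counts can be totalled by sentiment. A minimal sketch, using four rows copied from the table above as a toy subset (`bing_counts_toy` is a hypothetical name; the real input would be `bing_word_counts`):

```r
library(dplyr)

# Toy subset of bing_word_counts: four rows taken from the printed output above
bing_counts_toy <- tibble(
  word      = c("cool", "nice", "bad", "hard"),
  sentiment = c("positive", "positive", "negative", "negative"),
  n         = c(481, 456, 308, 219)
)

# Sum word uses within each sentiment (wt = n weights each row by its count)
sentiment_totals <- bing_counts_toy %>%
  count(sentiment, wt = n, name = "total_uses")

sentiment_totals

# The lexicon's own balance can be checked the same way, if tidytext is installed:
# tidytext::get_sentiments("bing") %>% count(sentiment)
```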

Obligatory Wordcloud

Is it easier to read than the above? Nah, but it must be included in any text mining blog post, so…

<span class="n">library</span><span class="p">(</span><span class="n">wordcloud</span><span class="p">)</span><span class="w">

</span><span class="n">tidy_emails</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">anti_join</span><span class="p">(</span><span class="n">email_stop_words</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">filter</span><span class="p">(</span><span class="n">nchar</span><span class="p">(</span><span class="n">word</span><span class="p">)</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="m">3</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">count</span><span class="p">(</span><span class="n">word</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">with</span><span class="p">(</span><span class="n">wordcloud</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">max.words</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">))</span><span class="w">

</span><span class="n">library</span><span class="p">(</span><span class="n">reshape2</span><span class="p">)</span><span class="w">

</span><span class="n">tidy_emails</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">inner_join</span><span class="p">(</span><span class="n">get_sentiments</span><span class="p">(</span><span class="s2">"bing"</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">count</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="w"> </span><span class="n">sentiment</span><span class="p">,</span><span class="w"> </span><span class="n">sort</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">acast</span><span class="p">(</span><span class="n">word</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">sentiment</span><span class="p">,</span><span class="w"> </span><span class="n">value.var</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">comparison.cloud</span><span class="p">(</span><span class="n">colors</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"gray20"</span><span class="p">,</span><span class="w"> </span><span class="s2">"gray80"</span><span class="p">),</span><span class="w">
                   </span><span class="n">max.words</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">)</span><span class="w">
</span>

top_sentiment_wordcloud

Hope that was cool 🙂

