Mining Sent Email for Self-Knowledge
How can we use data analytics to increase our self-knowledge? Along with biofeedback from digital devices like FitBit, less structured sources such as sent emails can provide insights.
For example, here it seems my communication took a sudden turn for the more positive in 2013. Let’s see what else shakes out of my sent email corpus.
In “Snakes in a Package: combining Python and R with reticulate”, Adnan Fiaz uses a download of personal Gmail from Google Takeout to extract R-bloggers post counts from subject lines. Gmail exports in the mbox file format, and rather than write a new R package to parse mbox files, he uses reticulate to import a Python package, mailbox. His approach seems a great use case for reticulate: taking advantage of a highly developed Python package from within R.
Loading Email Corpus into R
I wanted to mine my own emails for sentiment and see if I can learn anything about myself. Has my sent mail shown signs of mood trends over time? I started by following his example:
<span class="n">library</span><span class="p">(</span><span class="n">tidyverse</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">stringr</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">tidytext</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">lubridate</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">reticulate</span><span class="p">)</span><span class="w">
</span><span class="n">mailbox</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">import</span><span class="p">(</span><span class="s2">"mailbox"</span><span class="p">)</span><span class="w">
</span><span class="n">sent</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mailbox</span><span class="o">$</span><span class="n">mbox</span><span class="p">(</span><span class="s2">"Sent-001.mbox"</span><span class="p">)</span><span class="w">
</span><span class="n">message</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sent</span><span class="o">$</span><span class="n">get_message</span><span class="p">(</span><span class="m">11L</span><span class="p">)</span><span class="w">
</span><span class="n">message</span><span class="o">$</span><span class="n">get</span><span class="p">(</span><span class="s2">"Date"</span><span class="p">)</span><span class="w">
</span><span class="c1"># [1] "Mon, 23 Jul 2018 20:01:33 -0700"</span><span class="w">
</span><span class="n">message</span><span class="o">$</span><span class="n">get</span><span class="p">(</span><span class="s2">"Subject"</span><span class="p">)</span><span class="w">
</span><span class="c1"># [1] "Re: Ptfc schedules"</span><span class="w">
</span>
Loading email #11, I can see it’s about Portland Football Club’s schedule. I wanted to see the body of the email, but found that R’s normal built-in documentation doesn’t exist for imported Python modules:
<span class="o">?</span><span class="n">get_message</span><span class="w">
</span><span class="c1"># No documentation for ‘get_message’ in specified packages and libraries:</span><span class="w">
</span><span class="c1"># you could try ‘??get_message’</span><span class="w">
</span><span class="o">?</span><span class="n">mailbox</span><span class="w">
</span><span class="c1"># No documentation for ‘mailbox’ in specified packages and libraries:</span><span class="w">
</span><span class="c1"># you could try ‘??mailbox’</span><span class="w">
</span>
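R's `?` only searches R help pages, but the Python docstrings still exist on the Python side (reticulate's `py_help()` is one way to reach them from R). As a rough sketch, run in plain Python rather than through reticulate, the standard-library `inspect` module can pull them up:

```python
import inspect
import mailbox

# Docstring for the method used above to fetch a single message
doc = inspect.getdoc(mailbox.mbox.get_message)
print(doc)

# Public attributes of mbox, to see what else is on offer
public_methods = [name for name in dir(mailbox.mbox) if not name.startswith("_")]
print(public_methods)
```

The method list is a quick way to spot `keys()` and `get_message()`, both of which are used later in this post.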
Evaluating message at the console prints the whole thing, but with a lot of extra formatting I don’t need. So I worked around it with nested sub() and gsub() calls on specific example emails, to get down to only the text I wrote and sent. It starts with this already difficult to understand call:
<span class="n">sub</span><span class="p">(</span><span class="s2">".*Content-Transfer-Encoding: quoted-printable"</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
</span><span class="n">gsub</span><span class="p">(</span><span class="s2">"=E2=80=99"</span><span class="p">,</span><span class="w"> </span><span class="s2">"'"</span><span class="p">,</span><span class="w">
</span><span class="n">gsub</span><span class="p">(</span><span class="s2">">"</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
</span><span class="n">sub</span><span class="p">(</span><span class="s2">"On [A-Z][a-z]{2}.*"</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
</span><span class="n">gsub</span><span class="p">(</span><span class="s2">"\n|\t"</span><span class="p">,</span><span class="w"> </span><span class="s2">" "</span><span class="p">,</span><span class="w">
</span><span class="n">message</span><span class="p">)))))</span><span class="w">
</span>
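The `=E2=80=99` sequence replaced above is quoted-printable encoding: each `=XX` is one byte in hexadecimal, and E2 80 99 is the UTF-8 encoding of a right single quotation mark. Below is a minimal Python sketch of decoding such sequences generically with the standard-library `quopri` module, rather than one substitution per sequence. The sample string is made up for illustration.

```python
import quopri

# Hypothetical quoted-printable fragment, similar to what shows up in raw message bodies.
# "=3D" is an encoded "=" sign and "=\n" is a soft line break.
raw = b"aren=E2=80=99t you planning to go to Seattle? x=3D1 sched=\nule"

decoded = quopri.decodestring(raw).decode("utf-8")
print(decoded)
```

This may also explain stray `3d` tokens in tokenized email text: `=3D` is the encoding of `=`, so replacing `=` with a space leaves `3D` behind as a word of its own.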
And, after much guess-try-see-what’s-left-and-add-another-sub(), I ended up with this ugly function that does semi-reasonably for my goal of sentiment analysis:
<span class="n">parse_sent_message</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">email</span><span class="p">){</span><span class="w">
</span><span class="n">substr</span><span class="p">(</span><span class="w">
</span><span class="n">gsub</span><span class="p">(</span><span class="s2">"-top:|-bottom:|break-word"</span><span class="p">,</span><span class="s2">""</span><span class="p">,</span><span class="w">
</span><span class="n">sub</span><span class="p">(</span><span class="s2">"Content-Type: application/pdf|Mime-Version: 1.0.*"</span><span class="p">,</span><span class="s2">""</span><span class="p">,</span><span class="w">
</span><span class="n">sub</span><span class="p">(</span><span class="s2">".*charset ISO|charset UTF-8|charset us-ascii"</span><span class="p">,</span><span class="s2">""</span><span class="p">,</span><span class="w">
</span><span class="n">sub</span><span class="p">(</span><span class="s2">".*Content-Transfer-Encoding: 7bit"</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
</span><span class="n">sub</span><span class="p">(</span><span class="s2">"orwarded message.*"</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
</span><span class="n">gsub</span><span class="p">(</span><span class="s2">"=|\""</span><span class="p">,</span><span class="w"> </span><span class="s2">" "</span><span class="p">,</span><span class="w">
</span><span class="n">gsub</span><span class="p">(</span><span class="s2">" "</span><span class="p">,</span><span class="w"> </span><span class="s2">" "</span><span class="p">,</span><span class="w">
</span><span class="n">gsub</span><span class="p">(</span><span class="s2">"= "</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
</span><span class="n">sub</span><span class="p">(</span><span class="s2">".*Content-Transfer-Encoding: quoted-printable"</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
</span><span class="n">sub</span><span class="p">(</span><span class="s2">".*charset=UTF-8"</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
</span><span class="n">gsub</span><span class="p">(</span><span class="s2">"=E2=80=99|'"</span><span class="p">,</span><span class="w"> </span><span class="s2">"'"</span><span class="p">,</span><span class="w">
</span><span class="n">gsub</span><span class="p">(</span><span class="s2">">|<"</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
</span><span class="n">sub</span><span class="p">(</span><span class="s2">"On [A-Z][a-z]{2}.*"</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
</span><span class="n">gsub</span><span class="p">(</span><span class="s2">"\n|\t|<div|</div>|<br>"</span><span class="p">,</span><span class="w"> </span><span class="s2">" "</span><span class="p">,</span><span class="w">
</span><span class="n">email</span><span class="p">))))))))))))))),</span><span class="w">
</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">10000</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">parse_sent_message</span><span class="p">(</span><span class="n">message</span><span class="p">)</span><span class="w">
</span><span class="c1"># [1] " Hey aren't you planning to go to Seattle the 16th? Trying to figure out my days off schedule "</span><span class="w">
</span>
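An alternative to stripping headers with regular expressions is to let Python's `email` machinery walk the MIME parts and hand back only the decoded `text/plain` body. The following is a sketch under assumptions, not what the function above does: `plain_text_body` is a hypothetical helper, the sample message is invented, and quoted replies would still need trimming afterwards.

```python
import email
from email.message import Message


def plain_text_body(msg: Message) -> str:
    """Return the decoded text/plain parts of a message, joined together."""
    chunks = []
    for part in msg.walk():
        # Skip multipart containers, HTML parts, and anything else that is not plain text
        if part.get_content_type() != "text/plain":
            continue
        # Skip plain-text files sent as attachments
        if part.get_content_disposition() == "attachment":
            continue
        payload = part.get_payload(decode=True)  # undoes quoted-printable or base64
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        chunks.append(payload.decode(charset, errors="replace"))
    return "\n".join(chunks).strip()


# Invented two-part message: a plain-text version and an HTML version of the same note
raw = (
    "From: me@example.com\n"
    "Subject: Re: schedules\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="b1"\n'
    "\n"
    "--b1\n"
    'Content-Type: text/plain; charset="UTF-8"\n'
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "Hey aren=E2=80=99t you going?\n"
    "--b1\n"
    'Content-Type: text/html; charset="UTF-8"\n'
    "\n"
    "<div>Hey aren&#39;t you going?</div>\n"
    "--b1--\n"
)

body = plain_text_body(email.message_from_string(raw))
print(body)
```

Because the HTML part is skipped entirely, markup such as `div` and `style` never reaches the tokenizer. Messages that only have an HTML part would come back empty and need separate handling.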
Good enough to go on. I tried using the R mailman wrapper, but ran into issues, so I went back to the imported mailbox module. Importing and parsing took a few minutes:
<span class="n">message</span><span class="o">$</span><span class="n">get</span><span class="p">(</span><span class="s2">"From"</span><span class="p">)</span><span class="w"> </span><span class="c1"># check this email index 11 if from my email address</span><span class="w">
</span><span class="n">myemail</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">message</span><span class="o">$</span><span class="n">get</span><span class="p">(</span><span class="s2">"From"</span><span class="p">)</span><span class="w"> </span><span class="c1"># since it is, save as myemail to check the rest</span><span class="w">
</span><span class="n">keys</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sent</span><span class="o">$</span><span class="n">keys</span><span class="p">()</span><span class="w">
</span><span class="c1"># keys <- keys[1:3000] # uncomment if want to run the below on a subset to see if it works</span><span class="w">
</span><span class="n">number_of_messages</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">keys</span><span class="p">)</span><span class="w">
</span><span class="n">pb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">utils</span><span class="o">::</span><span class="n">txtProgressBar</span><span class="p">(</span><span class="n">max</span><span class="o">=</span><span class="n">number_of_messages</span><span class="p">)</span><span class="w">
</span><span class="n">sent_messages</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data_frame</span><span class="p">(</span><span class="n">sent_date</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">as.character</span><span class="p">(</span><span class="kc">NA</span><span class="p">),</span><span class="w"> </span><span class="n">text</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="nf">as.character</span><span class="p">(</span><span class="kc">NA</span><span class="p">),</span><span class="w"> </span><span class="n">number_of_messages</span><span class="p">))</span><span class="w">
</span><span class="k">for</span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="nf">seq_along</span><span class="p">(</span><span class="n">keys</span><span class="p">)){</span><span class="w">
</span><span class="n">message</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sent</span><span class="o">$</span><span class="n">get_message</span><span class="p">(</span><span class="n">keys</span><span class="p">[</span><span class="n">i</span><span class="p">])</span><span class="w">
</span><span class="k">if</span><span class="p">(</span><span class="nf">is.character</span><span class="p">(</span><span class="n">message</span><span class="o">$</span><span class="n">get</span><span class="p">(</span><span class="s2">"From"</span><span class="p">))){</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">message</span><span class="o">$</span><span class="n">get</span><span class="p">(</span><span class="s2">"From"</span><span class="p">)</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="n">myemail</span><span class="p">){</span><span class="w">
</span><span class="n">sent_messages</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">message</span><span class="o">$</span><span class="n">get</span><span class="p">(</span><span class="s2">"Date"</span><span class="p">)</span><span class="w">
</span><span class="n">sent_messages</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">parse_sent_message</span><span class="p">(</span><span class="n">message</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">utils</span><span class="o">::</span><span class="n">setTxtProgressBar</span><span class="p">(</span><span class="n">pb</span><span class="p">,</span><span class="w"> </span><span class="n">i</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span>
If the message is not from me, it is saved as NA. What percent of mail flagged “sent” was not from myemail?
<span class="nf">sum</span><span class="p">(</span><span class="nf">is.na</span><span class="p">(</span><span class="n">sent_messages</span><span class="o">$</span><span class="n">text</span><span class="p">))</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">number_of_messages</span><span class="w">
</span><span class="c1"># [1] 0.6664132</span><span class="w">
</span>
67%.
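For comparison, the same loop can be written on the Python side with `mailbox` directly. This is a minimal sketch, not the code used for the results here: it builds a tiny throwaway mbox in a temporary directory, and the addresses and the `collect_sent` and `make_message` helpers are made up. It also matches the address as a substring of the From header, which is looser than the exact `%in%` comparison above.

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage


def collect_sent(path, my_address):
    """Return (date, subject) pairs for messages whose From header contains my_address."""
    box = mailbox.mbox(path)
    rows = []
    try:
        for key in box.keys():
            msg = box.get_message(key)
            sender = msg.get("From")
            # Some messages have no usable From header; skip those
            if isinstance(sender, str) and my_address in sender:
                rows.append((msg.get("Date"), msg.get("Subject")))
    finally:
        box.close()
    return rows


def make_message(sender, subject):
    """Build a small test message (illustration only)."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["Date"] = "Mon, 23 Jul 2018 20:01:33 -0700"
    msg["Subject"] = subject
    msg.set_content("hello")
    return msg


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "Sent-demo.mbox")
    demo = mailbox.mbox(path)
    demo.add(make_message("me@example.com", "Re: schedules"))
    demo.add(make_message("someone.else@example.com", "Newsletter"))
    demo.flush()
    demo.close()
    rows = collect_sent(path, "me@example.com")

print(rows)
```

Collecting rows into a list and building the table once at the end avoids assigning into a data frame row by row, which is part of why the R loop takes minutes.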
Removing them and doing some additional processing, I can see these 11,093 remaining sent emails range from November of 2004 to September of 2018, with a median date of October of 2013.
<span class="n">sent_messages</span><span class="w"> </span><span class="o"><-</span><span class="w">
</span><span class="n">sent_messages</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">filter</span><span class="p">(</span><span class="o">!</span><span class="nf">is.na</span><span class="p">(</span><span class="n">text</span><span class="p">))</span><span class="w">
</span><span class="n">sent_messages</span><span class="w"> </span><span class="o"><-</span><span class="w">
</span><span class="n">sent_messages</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">sent_date</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dmy_hms</span><span class="p">(</span><span class="n">sent_date</span><span class="p">))</span><span class="w">
</span><span class="c1"># remove duplicates per month</span><span class="w">
</span><span class="n">sent_messages</span><span class="w"> </span><span class="o"><-</span><span class="w">
</span><span class="n">sent_messages</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">year_sent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">year</span><span class="p">(</span><span class="n">sent_date</span><span class="p">),</span><span class="w">
</span><span class="n">month_sent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">month</span><span class="p">(</span><span class="n">sent_date</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">group_by</span><span class="p">(</span><span class="n">year_sent</span><span class="p">,</span><span class="w"> </span><span class="n">month_sent</span><span class="p">,</span><span class="w"> </span><span class="n">text</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">top_n</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">wt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sent_date</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">ungroup</span><span class="p">()</span><span class="w">
</span><span class="n">sent_messages</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">sent_date</span><span class="p">)</span><span class="w">
</span><span class="c1"># sent_date text year_sent month_sent </span><span class="w">
</span><span class="c1"># Min. :2004-11-10 01:42:04 Length:11093 Min. :2004 Min. : 1.000 </span><span class="w">
</span><span class="c1"># 1st Qu.:2010-07-17 20:39:10 Class :character 1st Qu.:2010 1st Qu.: 3.000 </span><span class="w">
</span><span class="c1"># Median :2013-10-01 22:12:08 Mode :character Median :2013 Median : 6.000 </span><span class="w">
</span><span class="c1"># Mean :2013-03-24 10:55:30 Mean :2013 Mean : 6.416 </span><span class="w">
</span><span class="c1"># 3rd Qu.:2015-09-18 19:45:21 3rd Qu.:2015 3rd Qu.: 9.000 </span><span class="w">
</span><span class="c1"># Max. :2018-09-30 01:35:02 Max. :2018 Max. :12.000 </span><span class="w">
</span>
The median date falls a bit later than the chronological midpoint, which would seem to imply slightly more email in later years. Judging from the chart above, though, it is probably due more to missing years of data.
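The Date headers parsed earlier, such as "Mon, 23 Jul 2018 20:01:33 -0700", carry a UTC offset. As a small sketch of handling that explicitly, Python's `email.utils.parsedate_to_datetime` returns a timezone-aware datetime that can be normalized to UTC before grouping by year and month:

```python
from datetime import timezone
from email.utils import parsedate_to_datetime

header = "Mon, 23 Jul 2018 20:01:33 -0700"  # the Date header shown for email #11

local = parsedate_to_datetime(header)    # aware datetime with a -07:00 offset
as_utc = local.astimezone(timezone.utc)  # same instant expressed in UTC

print(local.isoformat())
print(as_utc.isoformat())
print((as_utc.year, as_utc.month))
```

Emails sent late in the evening on the last day of a month can land in a different month depending on which of the two clocks is used, so it is worth choosing one deliberately.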
Sentiment Analysis
Julia Silge and David Robinson have put together an excellent online reference on text mining, Text Mining with R, so with some slight work I can follow their analyses with email data. Using their tidytext package, I quickly see that a lot of HTML formatting tags still made it past my gsub() gauntlet.
<span class="n">tidy_emails</span><span class="w"> </span><span class="o"><-</span><span class="w">
</span><span class="n">sent_messages</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">unnest_tokens</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="w"> </span><span class="n">text</span><span class="p">)</span><span class="w">
</span><span class="n">tidy_emails</span><span class="w">
</span><span class="c1"># # A tibble: 886,870 x 4</span><span class="w">
</span><span class="c1"># sent_date year_sent month_sent word </span><span class="w">
</span><span class="c1"># <dttm> <dbl> <dbl> <chr> </span><span class="w">
</span><span class="c1"># 1 2018-09-27 16:30:19 2018 9 htmlbodyp</span><span class="w">
</span><span class="c1"># 2 2018-09-27 16:30:19 2018 9 style </span><span class="w">
</span><span class="c1"># 3 2018-09-27 16:30:19 2018 9 margin </span><span class="w">
</span><span class="c1"># 4 2018-09-27 16:30:19 2018 9 0px </span><span class="w">
</span><span class="c1"># 5 2018-09-27 16:30:19 2018 9 font </span><span class="w">
</span><span class="c1"># 6 2018-09-27 16:30:19 2018 9 stretch </span><span class="w">
</span><span class="c1"># 7 2018-09-27 16:30:19 2018 9 normal </span><span class="w">
</span><span class="c1"># 8 2018-09-27 16:30:19 2018 9 font </span><span class="w">
</span><span class="c1"># 9 2018-09-27 16:30:19 2018 9 size </span><span class="w">
</span><span class="c1"># 10 2018-09-27 16:30:19 2018 9 12px </span><span class="w">
</span><span class="c1"># # ... with 886,860 more rows</span><span class="w">
</span>
In fact, after common stop words are removed, I can see a need to add a few more:
<span class="n">data</span><span class="p">(</span><span class="n">stop_words</span><span class="p">)</span><span class="w">
</span><span class="n">tidy_emails</span><span class="w"> </span><span class="o"><-</span><span class="w">
</span><span class="n">tidy_emails</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">anti_join</span><span class="p">(</span><span class="n">stop_words</span><span class="p">)</span><span class="w">
</span><span class="n">tidy_emails</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">count</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="w"> </span><span class="n">sort</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="c1"># # A tibble: 129,528 x 2</span><span class="w">
</span><span class="c1"># word n</span><span class="w">
</span><span class="c1"># <chr> <int></span><span class="w">
</span><span class="c1"># 1 3d 8433</span><span class="w">
</span><span class="c1"># 2 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 7620</span><span class="w">
</span><span class="c1"># 3 content 4086</span><span class="w">
</span><span class="c1"># 4 dan 3487</span><span class="w">
</span><span class="c1"># 5 1 3451</span><span class="w">
</span><span class="c1"># 6 font 2735</span><span class="w">
</span><span class="c1"># 7 type 2695</span><span class="w">
</span><span class="c1"># 8 style 2535</span><span class="w">
</span><span class="c1"># 9 nbsp 2495</span><span class="w">
</span><span class="c1"># 10 class 2451</span><span class="w">
</span><span class="c1"># # ... with 129,518 more rows </span><span class="w">
</span>
Maybe the
<span class="n">nchar</span><span class="p">(</span><span class="s2">"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"</span><span class="p">)</span><span class="w">
</span><span class="c1"># [1] 76</span><span class="w">
</span>
76 a’s in a row come from something being consolidated in the gsub()s.
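One other possibility, offered as a guess rather than something verified on this corpus: MIME wraps base64-encoded attachments at 76 characters per line, and a line of 76 `A` characters is what 57 zero bytes look like in base64. Lowercased by `unnest_tokens()`, such lines would show up as exactly this token. A quick Python check of the arithmetic:

```python
import base64

line = "A" * 76                   # one full-width base64 line
decoded = base64.b64decode(line)  # 76 chars * 6 bits = 456 bits = 57 bytes

print(len(decoded))
print(decoded == b"\x00" * 57)

# Going the other way: MIME-style encoding of 57 zero bytes gives the same line
roundtrip = base64.encodebytes(b"\x00" * 57)
print(roundtrip == (line + "\n").encode("ascii"))
```

If that is the cause, dropping attachment parts before tokenizing would remove these tokens at the source instead of through the stop-word list.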
Adding these less useful terms to create an email stop words dictionary:
<span class="n">email_stop_words</span><span class="w"> </span><span class="o"><-</span><span class="w">
</span><span class="n">stop_words</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">rbind</span><span class="p">(</span><span class="w">
</span><span class="n">data_frame</span><span class="p">(</span><span class="s2">"word"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="m">9</span><span class="p">),</span><span class="w"> </span><span class="s2">"3d"</span><span class="p">,</span><span class="w"> </span><span class="s2">"8a"</span><span class="p">,</span><span class="w"> </span><span class="s2">"mail.gmail.com"</span><span class="p">,</span><span class="w"> </span><span class="s2">"wa"</span><span class="p">,</span><span class="w"> </span><span class="s2">"aa"</span><span class="p">,</span><span class="w"> </span><span class="s2">"content"</span><span class="p">,</span><span class="w"> </span><span class="s2">"dir"</span><span class="p">,</span><span class="w">
</span><span class="s2">"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"</span><span class="p">,</span><span class="w">
</span><span class="s2">"ad"</span><span class="p">,</span><span class="w"> </span><span class="s2">"af"</span><span class="p">,</span><span class="w"> </span><span class="s2">"font"</span><span class="p">,</span><span class="w"> </span><span class="s2">"type"</span><span class="p">,</span><span class="w"> </span><span class="s2">"auto"</span><span class="p">,</span><span class="w"> </span><span class="s2">"zz"</span><span class="p">,</span><span class="w"> </span><span class="s2">"ae"</span><span class="p">,</span><span class="w"> </span><span class="s2">"zx"</span><span class="p">,</span><span class="w"> </span><span class="s2">"id"</span><span class="p">,</span><span class="w"> </span><span class="s2">"ai"</span><span class="p">,</span><span class="w">
</span><span class="s2">"style"</span><span class="p">,</span><span class="w"> </span><span class="s2">"nbsp"</span><span class="p">,</span><span class="w"> </span><span class="s2">"class"</span><span class="p">,</span><span class="w"> </span><span class="s2">"span"</span><span class="p">,</span><span class="w"> </span><span class="s2">"http"</span><span class="p">,</span><span class="w"> </span><span class="s2">"text"</span><span class="p">,</span><span class="w"> </span><span class="s2">"gmail.com"</span><span class="p">,</span><span class="w">
</span><span class="s2">"plain"</span><span class="p">,</span><span class="w"> </span><span class="s2">"0px"</span><span class="p">,</span><span class="w"> </span><span class="s2">"size"</span><span class="p">,</span><span class="w"> </span><span class="s2">"color"</span><span class="p">,</span><span class="w"> </span><span class="s2">"quot"</span><span class="p">,</span><span class="w"> </span><span class="s2">"8859"</span><span class="p">,</span><span class="w"> </span><span class="s2">"href"</span><span class="p">,</span><span class="w"> </span><span class="s2">"margin"</span><span class="p">,</span><span class="w"> </span><span class="s2">"ltr"</span><span class="p">,</span><span class="w">
</span><span class="s2">"left"</span><span class="p">,</span><span class="w"> </span><span class="s2">"disposition"</span><span class="p">,</span><span class="w"> </span><span class="s2">"attachment"</span><span class="p">,</span><span class="w"> </span><span class="s2">"padding"</span><span class="p">,</span><span class="w"> </span><span class="s2">"rgba"</span><span class="p">,</span><span class="w"> </span><span class="s2">"webkit"</span><span class="p">,</span><span class="w"> </span><span class="s2">"https"</span><span class="p">),</span><span class="w">
</span><span class="s2">"lexicon"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"sent_email"</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="c1"># just remove all words less than 3 letters</span><span class="w">
</span><span class="n">tidy_emails</span><span class="w"> </span><span class="o"><-</span><span class="w">
</span><span class="n">tidy_emails</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">anti_join</span><span class="p">(</span><span class="n">email_stop_words</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">filter</span><span class="p">(</span><span class="n">nchar</span><span class="p">(</span><span class="n">word</span><span class="p">)</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="m">3</span><span class="p">)</span><span class="w">
</span><span class="n">tidy_emails</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">count</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="w"> </span><span class="n">sort</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">top_n</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">wt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">word</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">reorder</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_col</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">xlab</span><span class="p">(</span><span class="kc">NULL</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">coord_flip</span><span class="p">()</span><span class="w">
</span>
I can see some unsurprising name-related common terms as well as “lol” and “hey”. More surprisingly, “time”, “meeting”, “week”, and “people” also show up a lot. I wonder if those are unusual. (I would need another sent mail corpus to compare.)
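The counting pipeline above (tokenize, drop stop words, drop words shorter than three letters, count) is simple enough to sketch outside tidytext. The Python version below is illustrative only: the regex tokenizer is cruder than `unnest_tokens()`, and the tiny stop-word set and sample emails are invented.

```python
import re
from collections import Counter

# Hypothetical miniature stand-ins for the stop-word lexicon and the email text
STOP_WORDS = {"the", "to", "you", "are", "for", "and", "this", "font", "style", "nbsp"}
emails = [
    "Hey are you free for the meeting this week?",
    "lol the meeting moved, font style nbsp",
    "Hey, no time this week",
]


def tokenize(text):
    """Lowercase the text and split it into runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


counts = Counter(
    word
    for text in emails
    for word in tokenize(text)
    if word not in STOP_WORDS and len(word) >= 3
)

for word, n in counts.most_common():
    print(word, n)
```

Comparing against a second corpus, as suggested above, would amount to building the same `Counter` for the other corpus and comparing relative frequencies.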
What are my top joy words in email?
<span class="n">nrc_joy</span><span class="w"> </span><span class="o"><-</span><span class="w">
</span><span class="n">get_sentiments</span><span class="p">(</span><span class="s2">"nrc"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">filter</span><span class="p">(</span><span class="n">sentiment</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"joy"</span><span class="p">)</span><span class="w">
</span><span class="n">tidy_emails</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">inner_join</span><span class="p">(</span><span class="n">nrc_joy</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">count</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="w"> </span><span class="n">sort</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="c1"># # A tibble: 373 x 2</span><span class="w">
</span><span class="c1"># word n</span><span class="w">
</span><span class="c1"># <chr> <int></span><span class="w">
</span><span class="c1"># 1 art 531</span><span class="w">
</span><span class="c1"># 2 feeling 389</span><span class="w">
</span><span class="c1"># 3 hope 387</span><span class="w">
</span><span class="c1"># 4 found 318</span><span class="w">
</span><span class="c1"># 5 pretty 286</span><span class="w">
</span><span class="c1"># 6 true 267</span><span class="w">
</span><span class="c1"># 7 pay 229</span><span class="w">
</span><span class="c1"># 8 money 218</span><span class="w">
</span><span class="c1"># 9 friend 209</span><span class="w">
</span><span class="c1"># 10 love 203</span><span class="w">
</span><span class="c1"># # ... with 363 more rows</span><span class="w">
</span>
Hm, I only partially agree with this list. “Art” is a friend I email frequently. “Feeling” is a slight positive, but more neutral than a joy word per se. “Hope” seems to be the most common joy word I’d actually agree with across my mail from 2004 to 2018.
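Since “Art” is a name here rather than a joy word, one option is to drop such words from the lexicon before joining. This is only a minimal sketch on toy data: `toy_joy`, `toy_emails` and `custom_exclusions` are made-up stand-ins, and the real lexicon and tokenized corpus would take their place.

```r
library(dplyr)
library(tibble)

# Toy stand-in for the NRC joy lexicon (hypothetical subset)
toy_joy <- tibble(word = c("art", "hope", "love", "friend"))

# Toy stand-in for the tokenized emails, one word per row
toy_emails <- tibble(word = c("art", "art", "hope", "meeting", "love", "hope"))

# Words that are in the lexicon but mean something else in this corpus
custom_exclusions <- tibble(word = c("art"))

# Remove the exclusions from the lexicon, then join and count as before
joy_counts <- toy_emails %>%
  inner_join(anti_join(toy_joy, custom_exclusions, by = "word"), by = "word") %>%
  count(word, sort = TRUE)

joy_counts
```

Keeping the exclusions in their own small table makes it easy to add more names or corpus-specific words later without touching the rest of the pipeline.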
How does sentiment look over time? Grouping by month:
<span class="n">email_sentiment</span><span class="w"> </span><span class="o"><-</span><span class="w">
</span><span class="n">tidy_emails</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">year_sent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">year</span><span class="p">(</span><span class="n">sent_date</span><span class="p">),</span><span class="w">
</span><span class="n">month_sent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">month</span><span class="p">(</span><span class="n">sent_date</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">inner_join</span><span class="p">(</span><span class="n">get_sentiments</span><span class="p">(</span><span class="s2">"bing"</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">count</span><span class="p">(</span><span class="n">year_sent</span><span class="p">,</span><span class="w"> </span><span class="n">month_sent</span><span class="p">,</span><span class="w"> </span><span class="n">sentiment</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">spread</span><span class="p">(</span><span class="n">sentiment</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">sentiment</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">positive</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">negative</span><span class="p">)</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">email_sentiment</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">month_sent</span><span class="p">,</span><span class="w"> </span><span class="n">sentiment</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.factor</span><span class="p">(</span><span class="n">year_sent</span><span class="p">)))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_col</span><span class="p">(</span><span class="n">show.legend</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">facet_wrap</span><span class="p">(</span><span class="o">~</span><span class="n">year_sent</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span>
2005, 2013, 2015 and 2016 look like the more positive years for sent mail. 2009 and 2011 look more negative overall. Weirdly, much of 2006, 2007 and 2008 is missing.
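Rather than eyeballing the facets, the gaps can be listed explicitly by comparing the monthly summary against a full year/month grid. The sketch below uses a toy table, `toy_sentiment`, with made-up values in place of `email_sentiment`. Note that a month can be absent either because no mail was sent or because none of its words matched the lexicon.

```r
library(dplyr)
library(tibble)

# Toy stand-in for the monthly summary (hypothetical values)
toy_sentiment <- tibble(
  year_sent  = c(2006, 2006, 2007),
  month_sent = c(1, 3, 12),
  sentiment  = c(5, -2, 7)
)

# Every year/month combination that should exist for the covered years
all_months <- expand.grid(
  year_sent  = unique(toy_sentiment$year_sent),
  month_sent = as.numeric(1:12)
)

# Year/month pairs with no row in the summary at all
missing_months <- anti_join(all_months, toy_sentiment,
                            by = c("year_sent", "month_sent")) %>%
  arrange(year_sent, month_sent)

missing_months
```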
There is also an apparently highly negative month in August of 2009.
<span class="c1"># whoa, what happened in August of 2009?</span><span class="w">
</span><span class="n">sent_messages</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">sent_date</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="s2">"2009-08-01"</span><span class="p">,</span><span class="w"> </span><span class="n">sent_date</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="s2">"2009-08-31"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">write.csv</span><span class="p">(</span><span class="s2">"temp.csv"</span><span class="p">)</span><span class="w">
</span><span class="n">tidy_emails</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">year_sent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">year</span><span class="p">(</span><span class="n">sent_date</span><span class="p">),</span><span class="w">
</span><span class="n">month_sent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">month</span><span class="p">(</span><span class="n">sent_date</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">inner_join</span><span class="p">(</span><span class="n">get_sentiments</span><span class="p">(</span><span class="s2">"bing"</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">filter</span><span class="p">(</span><span class="n">year_sent</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">2009</span><span class="p">,</span><span class="w"> </span><span class="n">month_sent</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">08</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">count</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="w"> </span><span class="n">sentiment</span><span class="p">,</span><span class="w"> </span><span class="n">sort</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="c1"># # A tibble: 237 x 3</span><span class="w">
</span><span class="c1"># word sentiment n</span><span class="w">
</span><span class="c1"># <chr> <chr> <int></span><span class="w">
</span><span class="c1"># 1 pain negative 35</span><span class="w">
</span><span class="c1"># 2 happiness positive 21</span><span class="w">
</span><span class="c1"># 3 sting negative 21</span><span class="w">
</span><span class="c1"># 4 happy positive 12</span><span class="w">
</span><span class="c1"># 5 stinging negative 12</span><span class="w">
</span><span class="c1"># 6 depression negative 11</span><span class="w">
</span><span class="c1"># 7 free positive 11</span><span class="w">
</span><span class="c1"># 8 bad negative 9</span><span class="w">
</span><span class="c1"># 9 damage negative 9</span><span class="w">
</span><span class="c1"># 10 venom negative 9</span><span class="w">
</span><span class="c1"># # ... with 227 more rows</span><span class="w">
</span>
Was it a bad breakup? Digging into my emails, I found a New York Times Magazine article copy-and-pasted and sent to several people. The article, “Oh, Sting, Where Is Thy Death?” by Richard Conniff, mentions the pain of stinging insects and its relevance to happiness research. Note that most of those n values are divisible by 3, which fits the same article text being counted once per copy sent.
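If pasted articles sent to several people skew a month like this, one way to dampen the effect is to keep a single copy of each identical message body before tokenizing. The following is a rough sketch on toy data, not the approach used in this post. The column name `body` is an assumption and the real message table may name it differently.

```r
library(dplyr)
library(tibble)

# Toy stand-in for the message table; `body` is an assumed column name
toy_messages <- tibble(
  sent_date = as.Date(c("2009-08-03", "2009-08-03", "2009-08-04", "2009-08-10")),
  body = c("pain sting venom", "pain sting venom", "pain sting venom", "happy to help")
)

# Keep only the earliest copy of each identical message body
deduped <- toy_messages %>%
  arrange(sent_date) %>%
  distinct(body, .keep_all = TRUE)

deduped
```

This only catches exact duplicates. A pasted article with a different greeting on top would still be counted once per recipient.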
Most Common Charged Words
Taking all the emotionally charged words and seeing which come up most often, both surprises and expected outcomes show up:
bing_word_counts <-
tidy_emails %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
bing_word_counts
# # A tibble: 2,143 x 3
# word sentiment n
# <chr> <chr> <int>
# 1 cool positive 481
# 2 nice positive 456
# 3 free positive 445
# 4 bad negative 308
# 5 pretty positive 286
# 6 retreat negative 239
# 7 solid positive 230
# 8 fine positive 222
# 9 hard negative 219
# 10 worth positive 207
# # ... with 2,133 more rows
bing_word_counts %>%
group_by(sentiment) %>%
top_n(10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to sentiment",
x = NULL) +
coord_flip()
I’m surprised to see how many more positive words show up than negative words. The Bing lexicon actually lists more negative words than positive ones, so the lexicon’s makeup doesn’t explain it. “Bad” as top negative word seems like a bad top word. “Issue” is definitely a word I have an issue with using a bad amount of the time. But it’s cool to see how much I use “cool” (or is it bad? this is causing anxiety). Anyway, I think this is a solid view worth the time to get a nice feeling for top words I love to use in email.
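To put a number on the positive versus negative balance, the word counts can be totaled per sentiment. This is a small sketch on toy data: `toy_counts` is a made-up stand-in shaped like `bing_word_counts`, and the values are invented.

```r
library(dplyr)
library(tibble)

# Toy stand-in for bing_word_counts (hypothetical values)
toy_counts <- tibble(
  word      = c("cool", "nice", "bad", "hard"),
  sentiment = c("positive", "positive", "negative", "negative"),
  n         = c(6, 4, 3, 2)
)

# Distinct words and total uses per sentiment, plus each side's share of uses
sentiment_totals <- toy_counts %>%
  group_by(sentiment) %>%
  summarise(distinct_words = n_distinct(word), total_uses = sum(n)) %>%
  mutate(share_of_uses = total_uses / sum(total_uses))

sentiment_totals
```

Running the same summary on the lexicon itself (grouping its rows by sentiment) would show how balanced the word list is before any email is counted.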
Obligatory Wordcloud
Is it easier to read than the above? Nah, but it must be included in any text mining blog post, so…
<span class="n">library</span><span class="p">(</span><span class="n">wordcloud</span><span class="p">)</span><span class="w">
</span><span class="n">tidy_emails</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">anti_join</span><span class="p">(</span><span class="n">email_stop_words</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">filter</span><span class="p">(</span><span class="n">nchar</span><span class="p">(</span><span class="n">word</span><span class="p">)</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="m">3</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">count</span><span class="p">(</span><span class="n">word</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">with</span><span class="p">(</span><span class="n">wordcloud</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">max.words</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">))</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">reshape2</span><span class="p">)</span><span class="w">
</span><span class="n">tidy_emails</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">inner_join</span><span class="p">(</span><span class="n">get_sentiments</span><span class="p">(</span><span class="s2">"bing"</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">count</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="w"> </span><span class="n">sentiment</span><span class="p">,</span><span class="w"> </span><span class="n">sort</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">acast</span><span class="p">(</span><span class="n">word</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">sentiment</span><span class="p">,</span><span class="w"> </span><span class="n">value.var</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">comparison.cloud</span><span class="p">(</span><span class="n">colors</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"gray20"</span><span class="p">,</span><span class="w"> </span><span class="s2">"gray80"</span><span class="p">),</span><span class="w">
</span><span class="n">max.words</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">)</span><span class="w">
</span>
Hope that was cool 🙂