Digging into mbox details: A tale of tm & reticulate

hrbrmstr

4 years ago

[This article was first published on R – rud.is, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I had to processes a bunch of emails for a $DAYJOB task this week and my “default setting” is to use R for pretty much everything (this should come as no surprise). Treating mail as data is not an uncommon task and many R packages exist that can reach out and grab mail from servers or work directly with local mail archives.

Mbox’in off the rails on a crazy tm¹

This particular mail corpus is in mbox format since it was saved via Apple Mail. It’s one big text file with each message appearing one after the other. The format has been around for decades, and R’s tm package — via the tm.plugin.mail plugin package — can process these mbox files.

To demonstrate, we’ll use an Apple Mail archive excerpt from a set of R mailing list messages as they are not private/sensitive:

library(tm)
library(tm.plugin.mail)

# point the tm corpus machinery to the mbox file and let it know the timestamp format since it varies
VCorpus(
  MBoxSource("~/Data/test.mbox/mbox"),
  readerControl = list(
    reader = readMail(DateFormat = "%a, %e %b %Y %H:%M:%S %z")
  )
) -> mbox

str(unclass(mbox), 1)
## List of 3
##  $ content:List of 198
##  $   : list()
##   ..- attr(*, "class")= chr "CorpusMeta"
##  $ d :'data.frame': 198 obs. of  0 variables

str(unclass(mbox[[1]]), 1)
## List of 2
##  $ content: chr [1:476] "Try this:" "" "> library(lubridate)" "> library(tidyverse)" ...
##  $   :List of 9
##   ..- attr(*, "class")= chr "TextDocumentMeta"

str(unclass(mbox[[1]]$meta), 1)
## List of 9
##  $ author       : chr "jim holtman "
##  $ datetimestamp: POSIXlt[1:1], format: "2018-08-01 15:01:17"
##  $ description  : chr(0) 
##  $ heading      : chr "Re: [R] read txt file - date - no space"
##  $ id           : chr ""
##  $ language     : chr "en"
##  $ origin       : chr(0) 
##  $ header       : chr [1:145] "Delivered-To: bob@rud.is" "Received: by 2002:ac0:e681:0:0:0:0:0 with SMTP id b1-v6csp950182imq;" "        Wed, 1 Aug 2018 08:02:23 -0700 (PDT)" "X-Google-Smtp-Source: AAOMgpcdgBD4sDApBiF2DpKRfFZ9zi/4Ao32Igz9n8vT7EgE6InRoa7VZelMIik7OVmrFCRPDBde" ...
##  $              : NULL

We’re using unclass() since the str() output gets a bit crowded with all of the tm class attributes stuck in the output display.

The tm suite is designed for text mining. My task had nothing to do with text mining and I really just needed some header fields and body content in a data frame. If you’ve been working with R for a while, some things in the str() output will no doubt cause a bit of angst. For instance:

datetimestamp: POSIXlt[1:1], : POSIXlt and data frames really don’t mix well
description : chr(0) / origin : chr(0): zero-length character vectors
$ : NULL : Blank element name with a NULL value…I Don’t Even ²

The tm suite is also super opinionated and “helpfully” left out a ton of headers (though it did keep the source for the complete headers around). Still, we can roll up our sleeves and turn that into a data frame:

# helper function for cleaner/shorter code
`%|0|%` <- function(x, y) { if (length(x) == 0) y else x }

# might as well stay old-school since we're using tm
do.call(
  rbind.data.frame,
  lapply(mbox, function(.x) {

    # we have a few choices, but this one is pretty explicit abt what it does
    # so we'll likely be able to decipher it quickly in 2 years when/if we come
    # back to it

    data.frame(
      author = .x$meta$author %|0|% NA_character_,
      datetimestamp = as.POSIXct(.x$meta$datetimestamp %|0|% NA),
      description = .x$meta$description %|0|% NA_character_,
      heading = .x$meta$heading %|0|% NA_character_,
      id = .x$meta$id %|0|% NA_character_,
      language = .x$meta$language %|0|% NA_character_,
      origin = .x$meta$origin %|0|% NA_character_,
      header = I(list(.x$meta$header %|0|% NA_character_)),
      body = I(list(.x$content %|0|% NA_character_)),
      stringsAsFactors = FALSE
    )

  })
) %>%
  glimpse()
## Observations: 198
## Variables: 9
## $ author        < chr> "jim holtman ", "PIKAL Petr ...
## $ datetimestamp  2018-08-01 15:01:17, 2018-08-01 13:09:18, 2018-...
## $ description   < chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ heading       < chr> "Re: [R] read txt file - date - no space", "Re: ...
## $ id            < chr> " "en", "en", "en", "en", "en", "en", "en", "en", ...
## $ origin        < chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ header         Delivere...., Delivere...., Delivere...., De...
## $ body           Try this...., SGkNCg0K...., Dear Pik...., De...

That wasn’t a huge effort, but we would now have to re-process the headers and/or write a custom version of tm.plugin.mail::readMail() (the function source is very readable and extendable) to get any extra data out. Here’s what that might look like:

# Custom msg reader
read_mail <- function(elem, language, id) {

  # extract header val
  hdr_val <- function(src, pat) {
    gsub(
      sprintf("%s: ", pat), "",
      grep(sprintf("^%s:", pat), src, "", value = TRUE, useBytes = TRUE)
    ) %|0|% NA
  }

  mail <- elem$content

  index <- which(mail == "")[1]
  header <- mail[1:index]
  mid <- hdr_val(header, "Message-ID")

  PlainTextDocument(
    x = mail[(index + 1):length(mail)],
    author = hdr_val(header, "From"),

    spam_score = hdr_val(header, "X-Spam-Score"), ###

If we wanted all the headers, there are even more succinct ways to solve for that use case.

Packaging up emails with a reticulated message.mbox

Since the default functionality of tm.plugin.mail::readMail() forced us to work a bit to get what we needed there’s some justification in seeking out an alternative path. I’ve written about reticulate before and am including it in this post as the Python standard library module mailbox can also make quick work of mbox files.

Two pieces of advice I generally reiterate when I talk about reticulate is that I highly recommend using Python 3 (remember, it’s a fragmented ecosystem) and that I prefer specifying the specific target Python to use via the RETICULATE_PYTHON environment variable that I have in ~/.Renviron as RETICULATE_PYTHON=/usr/local/bin/python3.

Let’s bring the mailbox module into R:

library(reticulate)
library(tidyverse)

mailbox < - import("mailbox")

If you're unfamiliar with a Python module or object, you can get help right in R via reticulate::py_help(). Et sequitur³: py_help(mailbox) will bring up the text help for that module and py_help(mailbox$mbox) (remember, we swap out dots for dollars when referencing Python object components in R) will do the same for the mailbox.mbox class.

Text help is great and all, but we can also render it to HTML with this helper function:

py_doc <- function(x) {
  require("htmltools")
  require("reticulate")
  pydoc <- reticulate::import("pydoc")
  htmltools::html_print(
    htmltools::HTML(
      pydoc$render_doc(x, renderer=pydoc$HTMLDoc())
    )
  )
}

Here's what the text and HTML help for mailbox.mbox look side-by-side:

We can also use a helper function to view the online documentation:

readthedocs <- function(obj, py_ver=3, check_keywords = "yes") {
  require("glue")
  query <- obj$`__name__`
  browseURL(
    glue::glue(
      "https://docs.python.org/{py_ver}/search.html?q={query}&check_keywords={check_keywords}"
    )
  )
}

Et sequitur: readthedocs(mailbox$mbox) will take us to this results page

Going back to the task at hand, we need to cycle through the messages and make a data frame for the bits we (well, I, care about). The reticulate package does an amazing job making Python objects first-class citizens in R, but Python objects may feel "opaque" to R users since we have to use the $ syntax to get to methods and values and — very often — familiar helpers such as str() are less than helpful on these objects. Let's try to look at the first message (remember, Python is 0-indexed):

msg1 <- mbox$get(0)

str(msg1)

msg1

The output for those last two calls is not shown because they both are just a large text dump of the message source. #unhelpful

We can get more details, and we'll wrap some punctuation-filled calls in two, small helper functions that have names that will sound familiar:

pstr <- function(obj, ...) { str(obj$`__dict__`, ...) } # like 'str()`

pnames < - function(obj) { import_builtins()$dir(obj) } # like 'names()' but more complete

Lets see them in action:

pstr(msg1, 1) # we can pass any params str() will take
## List of 10
##  $ _from        : chr "jholtman@gmail.com Wed Aug 01 15:02:23 2018"
##  $ policy       :Compat32()
##  $ _headers     :List of 56
##  $ _unixfrom    : NULL
##  $ _payload     : chr "Try this:\n\n> library(lubridate)\n> library(tidyverse)\n> input <- read.csv(text =3D \"date,str1,str2,str3\n+ "| __truncated__
##  $ _charset     : NULL
##  $ preamble     : NULL
##  $ epilogue     : NULL
##  $ defects      : list()
##  $ _default_type: chr "text/plain"

pnames(msg1)
##  [1] "__bytes__"                 "__class__"                
##  [3] "__contains__"              "__delattr__"              
##  [5] "__delitem__"               "__dict__"                 
##  [7] "__dir__"                   "__doc__"                  
##  [9] "__eq__"                    "__format__"               
## [11] "__ge__"                    "__getattribute__"         
## [13] "__getitem__"               "__gt__"                   
## [15] "__hash__"                  "__init__"                 
## [17] "__init_subclass__"         "__iter__"                 
## [19] "__le__"                    "__len__"                  
## [21] "__lt__"                    "__module__"               
## [23] "__ne__"                    "__new__"                  
## [25] "__reduce__"                "__reduce_ex__"            
## [27] "__repr__"                  "__setattr__"              
## [29] "__setitem__"               "__sizeof__"               
## [31] "__str__"                   "__subclasshook__"         
## [33] "__weakref__"               "_become_message"          
## [35] "_charset"                  "_default_type"            
## [37] "_explain_to"               "_from"                    
## [39] "_get_params_preserve"      "_headers"                 
## [41] "_payload"                  "_type_specific_attributes"
## [43] "_unixfrom"                 "add_flag"                 
## [45] "add_header"                "as_bytes"                 
## [47] "as_string"                 "attach"                   
## [49] "defects"                   "del_param"                
## [51] "epilogue"                  "get"                      
## [53] "get_all"                   "get_boundary"             
## [55] "get_charset"               "get_charsets"             
## [57] "get_content_charset"       "get_content_disposition"  
## [59] "get_content_maintype"      "get_content_subtype"      
## [61] "get_content_type"          "get_default_type"         
## [63] "get_filename"              "get_flags"                
## [65] "get_from"                  "get_param"                
## [67] "get_params"                "get_payload"              
## [69] "get_unixfrom"              "is_multipart"             
## [71] "items"                     "keys"                     
## [73] "policy"                    "preamble"                 
## [75] "raw_items"                 "remove_flag"              
## [77] "replace_header"            "set_boundary"             
## [79] "set_charset"               "set_default_type"         
## [81] "set_flags"                 "set_from"                 
## [83] "set_param"                 "set_payload"              
## [85] "set_raw"                   "set_type"                 
## [87] "set_unixfrom"              "values"                   
## [89] "walk"

names(msg1)
##  [1] "add_flag"                "add_header"             
##  [3] "as_bytes"                "as_string"              
##  [5] "attach"                  "defects"                
##  [7] "del_param"               "epilogue"               
##  [9] "get"                     "get_all"                
## [11] "get_boundary"            "get_charset"            
## [13] "get_charsets"            "get_content_charset"    
## [15] "get_content_disposition" "get_content_maintype"   
## [17] "get_content_subtype"     "get_content_type"       
## [19] "get_default_type"        "get_filename"           
## [21] "get_flags"               "get_from"               
## [23] "get_param"               "get_params"             
## [25] "get_payload"             "get_unixfrom"           
## [27] "is_multipart"            "items"                  
## [29] "keys"                    "policy"                 
## [31] "preamble"                "raw_items"              
## [33] "remove_flag"             "replace_header"         
## [35] "set_boundary"            "set_charset"            
## [37] "set_default_type"        "set_flags"              
## [39] "set_from"                "set_param"              
## [41] "set_payload"             "set_raw"                
## [43] "set_type"                "set_unixfrom"           
## [45] "values"                  "walk"

# See the difference between pnames() and names()

setdiff(pnames(msg1), names(msg1))
##  [1] "__bytes__"                 "__class__"                
##  [3] "__contains__"              "__delattr__"              
##  [5] "__delitem__"               "__dict__"                 
##  [7] "__dir__"                   "__doc__"                  
##  [9] "__eq__"                    "__format__"               
## [11] "__ge__"                    "__getattribute__"         
## [13] "__getitem__"               "__gt__"                   
## [15] "__hash__"                  "__init__"                 
## [17] "__init_subclass__"         "__iter__"                 
## [19] "__le__"                    "__len__"                  
## [21] "__lt__"                    "__module__"               
## [23] "__ne__"                    "__new__"                  
## [25] "__reduce__"                "__reduce_ex__"            
## [27] "__repr__"                  "__setattr__"              
## [29] "__setitem__"               "__sizeof__"               
## [31] "__str__"                   "__subclasshook__"         
## [33] "__weakref__"               "_become_message"          
## [35] "_charset"                  "_default_type"            
## [37] "_explain_to"               "_from"                    
## [39] "_get_params_preserve"      "_headers"                 
## [41] "_payload"                  "_type_specific_attributes"
## [43] "_unixfrom"

Using just names() excludes the "hidden" builtins for Python objects, but knowing they are there and what they are can be helpful, depending on the program context.

Let's continue on the path to our messaging goal and see what headers are available. We'll use some domain knowledge about the _headers component, though we won't end up going that route to build a data frame:

map_chr(msg1$`_headers`, ~.x[[1]])
##  [1] "Delivered-To"               "Received"                  
##  [3] "X-Google-Smtp-Source"       "X-Received"                
##  [5] "ARC-Seal"                   "ARC-Message-Signature"     
##  [7] "ARC-Authentication-Results" "Return-Path"               
##  [9] "Received"                   "Received-SPF"              
## [11] "Authentication-Results"     "Received"                  
## [13] "X-Virus-Scanned"            "Received"                  
## [15] "Received"                   "Received"                  
## [17] "X-Virus-Scanned"            "X-Spam-Flag"               
## [19] "X-Spam-Score"               "X-Spam-Level"              
## [21] "X-Spam-Status"              "Received"                  
## [23] "Received"                   "Received"                  
## [25] "Received"                   "DKIM-Signature"            
## [27] "X-Google-DKIM-Signature"    "X-Gm-Message-State"        
## [29] "X-Received"                 "MIME-Version"              
## [31] "References"                 "In-Reply-To"               
## [33] "From"                       "Date"                      
## [35] "Message-ID"                 "To"                        
## [37] "X-Tag-Only"                 "X-Filter-Node"             
## [39] "X-Spam-Level"               "X-Spam-Status"             
## [41] "X-Spam-Flag"                "Content-Disposition"       
## [43] "Subject"                    "X-BeenThere"               
## [45] "X-Mailman-Version"          "Precedence"                
## [47] "List-Id"                    "List-Unsubscribe"          
## [49] "List-Archive"               "List-Post"                 
## [51] "List-Help"                  "List-Subscribe"            
## [53] "Content-Type"               "Content-Transfer-Encoding" 
## [55] "Errors-To"                  "Sender"

The mbox object does provide a get() method to retrieve header values so we'll go that route to build our data frame but we'll make yet-another helper since doing something like msg1$get("this header does not exist") will return NULL just like list(a=1)$b would. We'll actually make two new helpers since we want to be able to safely work with the payload content and that means ensuring it's in UTF-8 encoding (mail systems are horribly diverse beasts and the R community is international and, remember, we're using R mailing list messages):

# execute an object's get() method and return a character string or NA if no value was present for the key
get_chr <- function(.x, .y) { as.character(.x[["get"]](.y)) %|0|% NA_character_ }

# get the object's value as a valid UTF-8 string
utf8_decode < - function(.x) { .x[["decode"]]("utf-8", "ignore") %|0|% NA_character_ }

We're also doing this because I get really tired of using the $ syntax.

We also want the message content or payload. Modern mail messages can be really complex structures with many multiple part entities. To put it a different way, there may be HTML, RTF and plaintext versions of a message all in the same envelope. We want the plaintext ones so we'll have to iterate through any multipart messages to (hopefully) get to a plaintext version. Since this post is already pretty long and we ignored errors in the tm portion, I'll refrain from including any error handling code here as well.

map_df(1:py_len(mbox), ~{

  m <- mbox$get(.x-1) # python uses 0-index lists

  list(
    date = as.POSIXct(get_chr(m, "date"), format = "%a, %e %b %Y %H:%M:%S %z"),
    from = get_chr(m, "from"),
    to = get_chr(m, "to"),
    subj = get_chr(m, "subject"),
    spam_score = get_chr(m, "X-Spam-Score")
  ) -> mdf

  content_type <-  m$get_content_maintype() %|0|% NA_character_

  if (content_type[1] == "text") { # we don't want images
    while (m$is_multipart()) m <- m$get_payload()[[1]] # cycle through until we get to something we can use
    mtmp <- m$get_payload(decode = TRUE) # get the message text
    mdf$body <- utf8_decode(mtmp) # make it safe to use
  }

  mdf

}) -> mbox_df

glimpse(mbox_df)
## Observations: 198
## Variables: 7
## $ date          2018-08-01 11:01:17, 2018-08-01 09:09:18, 20...
## $ from         < chr> "jim holtman ", "PIKAL Pe...
## $ to           < chr> "diego.avesani@gmail.com, R mailing list  "Re: [R] read txt file - date - no space", "R...
## $ spam_score   < chr> "-3.631", "-3.533", "-3.631", "-3.631", "-3.5...
## $ content_type < chr> "text", "text", "text", "text", "text", "text...
## $ body         < chr> "Try this:\n\n library(lubridate)\n library...

FIN

By now, you've likely figured out this post really had nothing to do with reading mbox files. I mean, it did — and this was a task I had to do this week — but the real goal was to use a fairly basic task to help R folks edge a bit closer to becoming more friendly with Python in R. There hundreds of thousands of Python packages out there and, while I'm one to wax poetic about having R or C[++]-backed R-native packages — and am wont to point out Python's egregiously prolific flaws — sometimes you just need to get something done quickly and wish to avoid reinventing the wheel. The reticulate package makes that eminently possible.

I'll be wrapping up some of the reticulate helper functions into a small package soon, so keep your eyes on RSS.

^{: You might want to read this even if you're not interested in mbox files. FIN (right above this note) might have some clues as to why.}
^{1: yes, the section title was a stretch}
^{2: am I doing this right, Mara? 😉}
^{3: Make Latin Great Again}

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Mbox’in off the rails on a crazy tm1

Packaging up emails with a reticulated message.mbox

FIN

Mbox’in off the rails on a crazy tm¹