{emayili} Message Integrity

[This article was first published on R - datawookie, and kindly contributed to R-bloggers.]

How can you be sure that the contents of an email haven’t been tampered with? The best approach would probably be to have a digital signature on each component of the message. Perhaps I’ll look at integrating that into {emayili} some time in the future. However, today I’m writing about the first step in that direction: MD5 checksums.

The Content-MD5 Header Field

RFC 1864 describes the Content-MD5 header field. The MD5 algorithm is used to generate a hash for each component of the message and that hash is included in the Content-MD5 header field.

Sounds pretty simple, right? Well, the devil’s in the details (although, in this case, it’s not particularly devilish). The MD5 algorithm produces a 128-bit digest. Those 128 bits make up 16 bytes (or octets), and it is those bytes that are Base64 encoded. The final result is a 24-character string (including padding).
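
The arithmetic behind that 24-character figure can be checked with a few lines of base R. This is only a sketch, and the variable names are illustrative rather than taken from any package.

```r
# MD5 always yields a 128-bit digest.
digest_bits <- 128
digest_bytes <- digest_bits / 8  # 16 octets

# Base64 maps each group of 3 input bytes onto 4 output characters and
# pads the final group with "=" when it is incomplete.
base64_chars <- 4 * ceiling(digest_bytes / 3)
padding_chars <- (3 - digest_bytes %% 3) %% 3

digest_bytes   # 16
base64_chars   # 24
padding_chars  # 2
```

Since 16 is not a multiple of 3, the final Base64 group is incomplete, which is why an encoded MD5 digest always ends with two = padding characters.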

First let’s explore this in the shell.

MD5 & Base64 in the Shell

The md5sum command-line tool can be used to generate an MD5 hash. What’s the MD5 hash for a simple “Hello, World!” message?

echo "Hello, World!" | md5sum
bea8252ff4e80f41719ea13cdf007273  -

Superficially this looks good, but there’s a subtlety that could trip us up: echo will implicitly append a line feed character to the end of the message. And that also gets factored into the hash. We don’t want that. The -n flag will suppress the trailing line feed.

echo -n "Hello, World!" | md5sum
65a8e27d8879283831b664bd8b7f0ad4  -

Now we can Base64 encode the result.

echo -n "Hello, World!" | md5sum | base64
NjVhOGUyN2Q4ODc5MjgzODMxYjY2NGJkOGI3ZjBhZDQgIC0K

Okay, hold on! The result was supposed to be only 24 characters long. Something’s not right!

The problem is that we are Base64 encoding the characters in the hexadecimal representation of the bytes (along with the trailing filename marker and line feed that md5sum prints) rather than the bytes themselves. We’re going to need different tools.

One thing I found confusing in RFC 1864 was the example. Contrary to what I expected, the example string, “Check Integrity!”, was supposed to be the result of the MD5 hash, which was then Base64 encoded.

echo -n "Check Integrity!" | base64
Q2hlY2sgSW50ZWdyaXR5IQ==

I’m still not quite sure what the point of that was.
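
One possible reading, offered as a guess rather than anything the RFC text quoted here confirms, is that the example string was chosen because it is exactly 16 bytes long, the same size as an MD5 digest. A small R sketch (assuming the {base64enc} package) decodes the value to check:

```r
library(base64enc)

# The Base64 value produced from the example string above.
example_value <- "Q2hlY2sgSW50ZWdyaXR5IQ=="

# Decode the Base64 text back into raw bytes.
decoded <- base64decode(example_value)

length(decoded)     # number of bytes recovered: 16
rawToChar(decoded)  # those bytes read as ASCII text: "Check Integrity!"
```

So the string works as a readable stand-in for 16 digest bytes, even though it is not the MD5 hash of any particular message shown here.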

The openssl tool can also be used to generate an MD5 hash. And it can do the Base64 encoding too. Let’s start with the MD5 hash.

echo -n "Hello, World!" | openssl dgst -md5
(stdin)= 65a8e27d8879283831b664bd8b7f0ad4

That’s consistent with the earlier result. Now, we’ll use the -binary flag to get binary output (a series of bytes rather than the hexadecimal representation of those bytes). We’ll pipe that back into openssl again and then do the Base64 encoding.

echo -n "Hello, World!" | openssl dgst -md5 -binary | openssl enc -base64
ZajifYh5KDgxtmS9i38K1A==

Count the characters: there are 24, just as required.

Now Repeat in R

How about repeating the process now in R? The {digest} package has a function for producing an MD5 hash (along with a bunch of other digest types), and the {base64enc} package handles the Base64 encoding.

library(digest)
library(base64enc)
library(magrittr)  # provides the %>% pipe used below

Let’s generate the MD5 hash for the same message.

digest("Hello, World!", algo = "md5", serialize = FALSE)
[1] "65a8e27d8879283831b664bd8b7f0ad4"
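
The trailing line feed that could trip us up in the shell matters just as much in R. As a small sketch using the same digest() call, appending \n to the message reproduces the hash from the plain echo command, while the bare string reproduces the echo -n result.

```r
library(digest)

# Hash the message with and without a trailing line feed.
with_newline <- digest("Hello, World!\n", algo = "md5", serialize = FALSE)
without_newline <- digest("Hello, World!", algo = "md5", serialize = FALSE)

with_newline     # "bea8252ff4e80f41719ea13cdf007273", as from plain echo
without_newline  # "65a8e27d8879283831b664bd8b7f0ad4", as from echo -n

identical(with_newline, without_newline)  # FALSE
```

A single extra byte changes the digest completely, so whatever computes the hash and whatever verifies it must agree on exactly which bytes make up the message.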

Let’s take the long way around getting the required hash. First, we’ll break the hash down into a series of two-digit hexadecimal numbers.

hash <- digest("Hello, World!", algo = "md5", serialize = FALSE) %>%
  substring(
    first = seq(1, nchar(.), 2),
    last = seq(2, nchar(.), 2)
  )
 [1] "65" "a8" "e2" "7d" "88" "79" "28" "38" "31" "b6" "64" "bd" "8b" "7f" "0a" "d4"

Now convert each of those to an integer. The 16 is for base 16 (hexadecimal).

hash <- strtoi(hash, 16)
 [1] 101 168 226 125 136 121  40  56  49 182 100 189 139 127  10 212

Finally, Base64 encode those bytes!

base64encode(hash)
[1] "ZajifYh5KDgxtmS9i38K1A=="

As edifying as that was, it was a most circuitous route. Fortunately, we can get there more directly.

digest("Hello, World!", algo = "md5", serialize = FALSE, raw = TRUE) %>%
  base64encode()
[1] "ZajifYh5KDgxtmS9i38K1A=="
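
To see why the raw = TRUE argument matters, the sketch below compares the two routes: encoding the raw digest bytes versus encoding the hexadecimal text, which is the mistake from the first shell attempt. The content_md5() helper is a hypothetical convenience wrapper written for illustration, not a function from {digest}, {base64enc} or {emayili}.

```r
library(digest)
library(base64enc)

# Hypothetical helper: Base64-encoded MD5 digest of a string.
content_md5 <- function(text) {
  raw_digest <- digest(text, algo = "md5", serialize = FALSE, raw = TRUE)
  base64encode(raw_digest)
}

# The wrong route: Base64 encode the 32 hexadecimal characters instead.
hex_digest <- digest("Hello, World!", algo = "md5", serialize = FALSE)
wrong <- base64encode(charToRaw(hex_digest))

right <- content_md5("Hello, World!")

nchar(right)  # 24
nchar(wrong)  # 44
```

The shell attempt came out at 48 characters rather than 44 because md5sum also printed the trailing filename marker and a line feed, and those four extra bytes were encoded as well.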

MD5 in {emayili}

This functionality has now been baked into {emayili}. We’ll try it out on a simple message object using the same simple message.

library(emayili)

options(envelope.details = TRUE)
options(envelope.invisible = FALSE)

envelope() %>%
  text("Hello, World!")
Date:                      Tue, 05 Oct 2021 04:55:29 GMT
X-Mailer:                  {emayili}-0.6.0
MIME-Version:              1.0
Content-Type:              text/plain; charset=utf-8
Content-Disposition:       inline
Content-Transfer-Encoding: 7bit

Hello, World!

The Content-MD5 header field contains the Base64-encoded MD5 hash of the message body. With that, a recipient can recompute the hash and verify that the message content has not been modified (although you’ll probably leave this to your mail client).
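
If you did want to check a message by hand, the verification might look something like the sketch below. The verify_content_md5() function is hypothetical (it is not part of {emayili}), and it assumes you already have the message body as the exact string that was hashed, plus the value taken from the Content-MD5 header.

```r
library(digest)
library(base64enc)

# Hypothetical helper: recompute the digest of a message body and compare
# it with the value taken from a Content-MD5 header.
verify_content_md5 <- function(body, header_value) {
  computed <- base64encode(
    digest(body, algo = "md5", serialize = FALSE, raw = TRUE)
  )
  # Header values may carry surrounding whitespace, so trim before comparing.
  identical(computed, trimws(header_value))
}

intact <- verify_content_md5("Hello, World!", "ZajifYh5KDgxtmS9i38K1A==")
altered <- verify_content_md5("Hello, World?", "ZajifYh5KDgxtmS9i38K1A==")

intact   # TRUE
altered  # FALSE
```

Changing a single character in the body is enough to make the check fail.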


The {emayili} package is developed & supported by Fathom Data.
