{emayili}: Rudimentary Email Address Validation

[This article was first published on R | datawookie, and kindly contributed to R-bloggers.]

A recent issue on the {emayili} GitHub repository prompted me to think a bit more about email address validation. When I started looking into this I was somewhat surprised to learn that it’s such a complicated problem. Who would have thought that something as apparently simple as an email address could be linked with such complexity?

There are two separate problems:

  1. Is the email address valid or authentic? Does it work?
  2. Is the email address syntactically correct? Does it obey the rules?

We’ll be focusing on the second problem. But before we delve into that, let’s load {emayili}.

library(emayili)

Does It Work?

“If you want to validate an email address, send it an email. Problem solved.” (from “There’s only ONE way to validate an email address”)

The most effective means to answer this question is to send a test message to the address and potentially request a response. There are a few possible outcomes, some of which are:

  • The test message bounces. It’s rejected by the server, so the mailbox doesn’t exist.
  • The test message appears to be delivered but there’s no response. Perhaps the mailbox exists, but it’s unused or a throwaway.
  • The test message is delivered and there’s a response. It exists!

Is it Syntactically Correct?

Not surprisingly, there are rules which dictate what constitutes a syntactically correct email address. Before we think about the rules, let’s talk about the parts of an email address.

An email address consists of two parts: the local part and the domain, separated by an @. For example, in alice@example.com the local part is alice and the domain is example.com. An optional display name may precede the email address, in which case the email address is enclosed in angle brackets. For example, Alice Jones <alice@example.com>.
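To make the anatomy concrete, here is a minimal base R sketch that pulls an address apart. The helper split_address() is hypothetical (it is not part of {emayili}) and assumes every input contains an @ and, at most, one pair of angle brackets.

```r
# Hypothetical helper, not part of {emayili}: splits an address of the form
# "local@domain" or "Display Name <local@domain>" into its parts.
split_address <- function(x) {
  has_display <- grepl("<.*>", x)
  # Text before the angle brackets, if any, is the display name.
  display <- ifelse(has_display, trimws(sub("<.*$", "", x)), NA_character_)
  # The bare address sits inside the angle brackets when they are present.
  email <- trimws(ifelse(has_display, sub("^.*<(.*)>.*$", "\\1", x), x))
  data.frame(
    display = display,
    local = sub("@[^@]*$", "", email),   # everything before the last "@"
    domain = sub("^.*@", "", email),     # everything after the last "@"
    stringsAsFactors = FALSE
  )
}

split_address(c("alice@example.com", "Alice Jones <alice@example.com>"))
```

Splitting on the last @ is a deliberate simplification: a quoted local part may itself contain an @, whereas a domain may not.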

There are various rules which apply to the local part and the domain, the most important (IMHO) of which are summarised below. Note: This is a high level summary and neglects some of the nuances.

Rules: Local Part

Rules for the local part of an email address:

  • May be up to 64 characters long.
  • May be quoted (like "alice") or unquoted (just alice).
  • If unquoted, may consist of the following ASCII characters:
      • lower and uppercase Latin letters (a to z and A to Z)
      • digits (0 to 9)
      • various printable punctuation marks and
      • dots (.), but not at the beginning or end, and not more than one in succession.
  • If quoted then the rules are a lot more liberal.
  • May contain comments (which are enclosed in parentheses).
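The unquoted rules above can be sketched as a small base R check. This is only an illustration: valid_local() is a hypothetical helper (not part of {emayili}), the punctuation set is my assumption of what "various printable punctuation marks" covers, and quoted local parts and comments are ignored entirely.

```r
# Hypothetical helper, not part of {emayili}: checks only the unquoted
# local-part rules summarised above (no quoted strings, no comments).
valid_local <- function(local) {
  # Letters, digits and the printable punctuation assumed to be allowed.
  allowed <- "^[A-Za-z0-9.!#$%&'*+/=?^_`{|}~-]+$"
  !is.na(local) &
    nchar(local) >= 1 & nchar(local) <= 64 &   # at most 64 characters
    grepl(allowed, local) &                    # permitted characters only
    !grepl("^\\.|\\.$", local) &               # no dot at start or end
    !grepl("..", local, fixed = TRUE)          # no consecutive dots
}

valid_local(c("alice", "alice.jones", ".alice", "alice..jones", "alice jones"))
```

The function is vectorised, so it returns one logical value per local part.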

Rules: Domain

Rules for the domain of an email address:

  • May be up to 255 characters long.
  • Must satisfy the requirements of a hostname or IP address (in which case it must be enclosed in square brackets).
  • May contain comments (which are enclosed in parentheses).
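A similar sketch for the domain rules, again in base R. valid_domain() is a hypothetical helper, not part of {emayili}. It assumes the usual hostname conventions (dot-separated labels of letters, digits and hyphens, each at most 63 characters, with a final label that is not purely numeric), handles only IPv4 literals and ignores comments.

```r
# Hypothetical helper, not part of {emayili}: checks a single domain against
# the rules summarised above. Comments and IPv6 literals are ignored.
valid_domain <- function(domain) {
  if (is.na(domain) || nchar(domain) < 1 || nchar(domain) > 255) return(FALSE)

  # IP address literal: must be enclosed in square brackets, e.g. [192.168.0.1].
  if (grepl("^\\[.*\\]$", domain)) {
    octets <- strsplit(gsub("^\\[|\\]$", "", domain), ".", fixed = TRUE)[[1]]
    return(
      length(octets) == 4 &&
        all(grepl("^[0-9]{1,3}$", octets)) &&
        all(as.integer(octets) <= 255)
    )
  }

  # Hostname: dot-separated labels of letters, digits and hyphens, each 1 to
  # 63 characters long and not starting or ending with a hyphen.
  labels <- strsplit(domain, ".", fixed = TRUE)[[1]]
  label <- "^[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?$"
  length(labels) >= 1 &&
    all(grepl(label, labels)) &&
    # Assumed rule: the last label is not all digits, so a bare (unbracketed)
    # IP address is rejected.
    !grepl("^[0-9]+$", labels[length(labels)])
}

valid_domain("yahoo.co.uk")
valid_domain("hot_mail.com")
```

Unlike the local-part sketch, this one takes a single domain at a time; wrap it in vapply() to check a vector.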

The address Class

An address class has been added to emayili.

alice <- address("alice@example.com")
(bob <- address(email = "bob@yahoo.co.uk", display = "Robert Brown"))
[1] "Robert Brown <bob@yahoo.co.uk>"

You can also construct address objects from a local part and a domain.

address(local = "alice", domain = "example.com")
[1] "alice@example.com"

It’s vectorised and does recycling, so you can also do this sort of thing:

address(local = c("bob", "erin"), domain = "yahoo.co.uk")
[1] "bob@yahoo.co.uk"  "erin@yahoo.co.uk"

This also works well in a pipeline. First let’s set up a tibble with the details of a few email accounts.

library(tibble)

recipients <- tibble(
  email = c(NA, NA, NA, "[email protected]"),
  local = c("alice", "erin", "bob", NA),
  domain = c("example.com", "yahoo.co.uk", "yahoo.co.uk", NA),
  display = c(NA, NA, "Robert Brown", NA)
)
recipients
# A tibble: 4 × 4
  email             local domain      display     
  <chr>             <chr> <chr>       <chr>       
1 <NA>              alice example.com <NA>        
2 <NA>              erin  yahoo.co.uk <NA>        
3 <NA>              bob   yahoo.co.uk Robert Brown
4 [email protected] <NA>  <NA>        <NA>        

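Now use invoke() to call address() with the columns of the tibble as its arguments. Because address() is vectorised, this yields one address per record.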

library(purrr)

recipients <- recipients %>%
  invoke(address, .)
[1] "alice@example.com"              "erin@yahoo.co.uk"
[3] "Robert Brown <bob@yahoo.co.uk>" "[email protected]"

This could equally be done with do.call(), which is worth knowing because invoke() has been deprecated since purrr 1.0.0.
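The do.call() route can be sketched in base R as follows. To keep the snippet self-contained it uses a hypothetical stand-in, make_address(), in place of address(); the stand-in only pastes the local part and domain together and does none of the real work.

```r
# Hypothetical stand-in for address(), used only to keep the sketch
# self-contained: it pastes local part and domain together.
make_address <- function(local, domain) paste0(local, "@", domain)

recipients <- data.frame(
  local = c("alice", "erin", "bob"),
  domain = c("example.com", "yahoo.co.uk", "yahoo.co.uk"),
  stringsAsFactors = FALSE
)

# A data frame is a list of columns, so do.call() passes each column as a
# named argument: make_address(local = ..., domain = ...).
do.call(make_address, recipients)
```

The column names must match the argument names of the function being called, which is why the tibble above uses email, local, domain and display.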

Email addresses can also be coerced into address objects using as.address().

as.address("Robert <bob@yahoo.co.uk>")
[1] "Robert <bob@yahoo.co.uk>"

Methods

There are methods for extracting the email address and display name.

raw(bob)
[1] "bob@yahoo.co.uk"
display(bob)
[1] "Robert Brown"

You can also get the local part and the domain.

local(bob)
[1] "bob"
domain(bob)
[1] "yahoo.co.uk"

Parties

There’s also a new function, parties(), for extracting the addresses associated with an email.

email <- envelope() %>%
  from("alice@example.com") %>%
  to("erin@yahoo.co.uk", "Robert <bob@yahoo.co.uk>") %>%
  cc("oscar@windmill.nl") %>%
  bcc("olivia@hotmail.com")

parties(email)
# A tibble: 5 × 6
  type                   address display raw                local  domain     
  <chr>               <vctrs_dd> <chr>   <chr>              <chr>  <chr>      
1 From         alice@example.com <NA>    alice@example.com  alice  example.com
2 To            erin@yahoo.co.uk <NA>    erin@yahoo.co.uk   erin   yahoo.co.uk
3 To    Robert <bob@yahoo.co.uk> Robert  bob@yahoo.co.uk    bob    yahoo.co.uk
4 Cc           oscar@windmill.nl <NA>    oscar@windmill.nl  oscar  windmill.nl
5 Bcc         olivia@hotmail.com <NA>    olivia@hotmail.com olivia hotmail.com

This gives the details of all of the addresses on the email, broken down in a nice tidy format.

Normalisation

Sometimes email address data can be dirty, so the address class tries to sanitise its contents.

as.address("       Robert       <   bob    @    yahoo.co.uk   >")
[1] "Robert <bob@yahoo.co.uk>"

This is very simple at the moment, but I’m planning on adding more functionality.
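As a rough idea of what this kind of clean-up involves, here is a base R approximation. normalise_address() is a hypothetical helper and is not how {emayili} implements it; it only deals with stray whitespace.

```r
# Hypothetical helper, not part of {emayili}: a rough approximation of the
# whitespace clean-up shown above.
normalise_address <- function(x) {
  x <- trimws(x)
  x <- gsub("[[:space:]]+", " ", x)               # collapse runs of whitespace
  x <- gsub("[[:space:]]*@[[:space:]]*", "@", x)  # no spaces around "@"
  x <- gsub("<[[:space:]]+", "<", x)              # no spaces after "<"
  x <- gsub("[[:space:]]+>", ">", x)              # no spaces before ">"
  x
}

normalise_address("       Robert       <   bob    @    yahoo.co.uk   >")
```

The single space between the display name and the opening angle bracket survives because only whitespace inside the brackets and around the @ is removed.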

Compliance

Finally, a test of whether an email address complies with the syntax rules.

First some good addresses.

compliant(c(
  "alice@example.com",
  "Robert <bob@yahoo.co.uk>",
  "olivia@hotmail.com"
))
[1] TRUE TRUE TRUE

Now some evil addresses.

compliant(c(
  "alice?example.com",
  "Robert bob@yahoo.co.uk",
  "olivia@hot_mail.com"
))
[1] FALSE FALSE FALSE

The implementation of compliant() uses regular expressions. Take a look at this StackOverflow thread for a lengthy discussion on the use of regular expressions for checking email addresses. The implementation is not perfect, but it’s functional. If you discover cases where it fails, please let me know and I’ll improve the logic.
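To give a flavour of the regular expression approach, here is a deliberately simplified base R sketch. The pattern and the looks_compliant() helper are my own illustration, not the expression used inside compliant(): the pattern covers only an unquoted local part followed by a hostname, and it does not handle display names, quoted local parts, comments or IP literals.

```r
# Simplified pattern, an assumption for illustration only: an unquoted local
# part, "@", then dot-separated labels of letters, digits and hyphens.
pattern <- paste0(
  "^[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+(\\.[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+)*",
  "@",
  "[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?(\\.[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?)+$"
)

# Hypothetical stand-in for a compliance check; display names are not handled.
looks_compliant <- function(x) grepl(pattern, x)

looks_compliant(c(
  "alice@example.com",
  "alice?example.com",
  "Robert bob@yahoo.co.uk",
  "olivia@hot_mail.com"
))
```

Only the first address passes: the second has no @, the third contains a space, and the fourth has an underscore in the domain.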

Conclusion

For the most part you won’t need to worry about these checks. They will all happen in the background when you put together an email. If, however, one of your email addresses is problematic, then you should at least know about it before you send the email.

These updates are available in {emayili}-0.4.16.
