Examining Email Addresses in R

August 22, 2015

(This article was first published on Mathew Analytics » R, and kindly contributed to R-bloggers)

I don’t normally work with personally identifiable information such as email addresses. However, the recent Ashley Madison data dump got me thinking about how I’d examine a data set composed of email addresses. Which characteristics of an address would I want to extract, and how would I do that in R? Here’s some quick R code that pulls out the local part (the text before the @), the host, and the top-level domain from a set of email strings, and flags addresses that contain a digit or an underscore. From there, we can summarize the data by any of those characteristics.

library(stringr)  # provides str_c() and str_detect()

df = data.frame(email = c("one@gkn.com","two132@wern.com","three@fu.com","four@huo.com","five@hoi.net",
                          "ten@hoinse.com","four99@huo.com","two@wern.gov","f_ive@hoi.com","six@ihoio.gov"),
                stringsAsFactors = FALSE)
# local part: everything before the @
df$one <- sub("@.*$", "", df$email )
# host: everything after the @
df$two <- sub('.*@', '', df$email )
# top-level domain: everything after the last dot (the dot must be escaped as \\.)
df$three <- sub('.*\\.', '', df$email )
# flag addresses that contain a digit
num <- c(0:9); num
num_match <- str_c(num, collapse = "|"); num_match
df$num_yn <- as.numeric(str_detect(df$email, num_match))
# flag addresses that contain an underscore
und <- c("_"); und
und_match <- str_c(und, collapse = "|"); und_match
df$und_yn <- as.numeric(str_detect(df$email, und_match))
> df
             email    one        two three num_yn und_yn
1      one@gkn.com    one    gkn.com   com      0      0
2  two132@wern.com two132   wern.com   com      1      0
3     three@fu.com  three     fu.com   com      0      0
4     four@huo.com   four    huo.com   com      0      0
5     five@hoi.net   five    hoi.net   net      0      0
6   ten@hoinse.com    ten hoinse.com   com      0      0
7   four99@huo.com four99    huo.com   com      1      0
8     two@wern.gov    two   wern.gov   gov      0      0
9    f_ive@hoi.com  f_ive    hoi.com   com      0      1
10   six@ihoio.gov    six  ihoio.gov   gov      0      0
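
As a rough sketch of the summarizing step mentioned earlier, the snippet below counts addresses per top-level domain and per host, and computes the share of addresses containing a digit within each top-level domain. It uses only base R (table() and tapply()) and rebuilds the example data so that it runs on its own. The names tld_counts, host_counts, and digit_share are illustrative and not part of the original code, and grepl("[0-9]", ...) is a base-R stand-in for the stringr digit check.

```r
# Rebuild the example data so this sketch runs on its own.
emails <- c("one@gkn.com", "two132@wern.com", "three@fu.com", "four@huo.com",
            "five@hoi.net", "ten@hoinse.com", "four99@huo.com", "two@wern.gov",
            "f_ive@hoi.com", "six@ihoio.gov")
df <- data.frame(email = emails, stringsAsFactors = FALSE)

df$two    <- sub(".*@", "", df$email)               # host
df$three  <- sub(".*\\.", "", df$email)             # top-level domain
df$num_yn <- as.numeric(grepl("[0-9]", df$email))   # 1 if the address has a digit

# Number of addresses per top-level domain, most common first.
tld_counts <- sort(table(df$three), decreasing = TRUE)

# Number of addresses per host, most common first.
host_counts <- sort(table(df$two), decreasing = TRUE)

# Share of addresses containing a digit, by top-level domain.
digit_share <- tapply(df$num_yn, df$three, mean)

print(tld_counts)
print(host_counts)
print(digit_share)
```

On a real data set the same three calls would work unchanged; only the construction of df would differ.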

What about you? If you regularly work with email addresses and have some useful insights for the rest of us, please leave a comment below. How do you usually approach a data set that is nothing more than a large number of email addresses?
