Parsing Domain Names in R with tldextract

August 4, 2014

(This article was first published on Data Driven Security, and kindly contributed to R-bloggers)

The R Language is really good at data and statistical analysis, but when
it comes to working with information security data it has a few holes
that need plugging up. Bob has been doing a couple of posts using Rcpp
to do things like Basic DNS Lookups, TXT lookups, and IPv4 Conversions.
I wanted to add to some of that work with a quick package for parsing
domain names.

While *.com, *.net and *.org top-level domains are easy
to parse, the rest of the world gets messy rather quickly. Just taking the
entry after the last dot creates problems for top-level domains like
anything in *.co.uk. To make things even more complicated, a name like
“us-west-1.compute.amazonaws.com” is treated (for name parsing) as a
top-level domain, so the domain name we’d want to process is the label
that appears immediately before us-west-1 in the full host name.
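
To make the idea concrete, here is a minimal base R sketch of longest-suffix matching. It is not the package's implementation: the helper name split_host and the four-entry suffix vector are assumptions made for illustration, and the sketch ignores the wildcard and exception rules that the real suffix list also contains.

```r
# Illustrative subset of suffixes; the real list has thousands of entries.
suffixes <- c("com", "uk", "co.uk", "us-west-1.compute.amazonaws.com")

# Hypothetical helper: split a host into subdomain, domain and tld by
# picking the longest known suffix that matches the trailing labels.
split_host <- function(host, suffixes) {
  labels <- strsplit(tolower(host), ".", fixed = TRUE)[[1]]
  n <- length(labels)
  # Try labels[i..n] for i = 2..n, longest candidate first, so at least
  # one label is always left over to serve as the domain.
  for (i in seq_len(n)[-1]) {
    candidate <- paste(labels[i:n], collapse = ".")
    if (candidate %in% suffixes) {
      sub <- if (i > 2) {
        paste(labels[seq_len(i - 2)], collapse = ".")
      } else {
        NA_character_
      }
      return(c(subdomain = sub, domain = labels[i - 1], tld = candidate))
    }
  }
  # No known suffix matched (or the host has a single label).
  c(subdomain = NA_character_, domain = NA_character_, tld = NA_character_)
}

split_host("mail.space-hoppers.co.uk", suffixes)
split_host("myapp.us-west-1.compute.amazonaws.com", suffixes)
```

In the first call the match is "co.uk" rather than "uk", because longer candidates are tried first. In the second call the whole Amazon name is the suffix, leaving "myapp" as the domain.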

Introducing TLD Extract (the R version)

It’s always easier to imitate than to reinvent, so I took some time
to read through the tldextract Python package and used it to check that
my code was executing properly during development. That is also why I
used the same name for the R package. The data for the package is drawn
from the same source as the Python package, the Public Suffix List from
the Mozilla Foundation. For convenience, I include a cached version of
the data so it can run offline.
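
For readers curious about the raw data, the Public Suffix List is a plain text file with one rule per line, blank lines, and comments that start with "//". The sketch below shows one way such lines could be reduced to a character vector of suffixes. The helper name parse_suffix_lines and the sample lines are assumptions for illustration; this is not necessarily how the package loads its data.

```r
# Hypothetical helper: turn raw Public Suffix List lines into a character
# vector of rules by dropping blank lines and "//" comments.
parse_suffix_lines <- function(lines) {
  lines <- trimws(lines)
  lines <- lines[nzchar(lines) & !startsWith(lines, "//")]
  # In the list format, a rule ends at the first whitespace.
  sub("\\s.*$", "", lines)
}

# A made-up excerpt written in the list's format.
sample_lines <- c(
  "// ac : illustrative excerpt",
  "ac",
  "com.ac",
  "",
  "// uk",
  "co.uk"
)

parse_suffix_lines(sample_lines)
# [1] "ac"     "com.ac" "co.uk"
```

The same function could be fed the output of readLines() on a downloaded copy of the list.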

Installation

To install this package, use the devtools package:

devtools::install_github("jayjacobs/tldextract")

Usage

Using the package is fairly straightforward: it returns a data frame with the
original name and separate columns for each parsed component.

library(tldextract)
# use the cached lookup data, simple call
tldextract("www.google.com")

##             host subdomain domain tld
## 1 www.google.com       www google com

# it can take multiple domains at the same time
tldextract(c("www.google.com", "www.google.com.ar", "googlemaps.ca", "tbn0.google.cn"))

##                host subdomain     domain    tld
## 1    www.google.com       www     google    com
## 2 www.google.com.ar       www     google com.ar
## 3     googlemaps.ca      <NA> googlemaps     ca
## 4    tbn0.google.cn      tbn0     google     cn

The specification for the top-level domains is cached in the package and
is viewable.

# view the cached TLD list stored in the tldnames data
data(tldnames)
head(tldnames)

## [1] "ac"     "com.ac" "edu.ac" "gov.ac" "net.ac" "mil.ac"

If the cached version is out of date and the package hasn’t been updated, the
data can be fetched manually and then passed into the function.

# get most recent TLD listings
tld <- getTLD() # optionally pass in a different URL than the default
manyhosts <- c("pages.parts.marionautomotive.com", "www.embroiderypassion.com", 
               "fsbusiness.co.uk", "www.vmm.adv.br", "ttfc.cn", "carole.co.il",
               "visiontravail.qc.ca", "mail.space-hoppers.co.uk", "chilton.k12.pa.us")
tldextract(manyhosts, tldnames=tld)

##                               host   subdomain            domain       tld
## 1 pages.parts.marionautomotive.com pages.parts  marionautomotive       com
## 2        www.embroiderypassion.com         www embroiderypassion       com
## 3                 fsbusiness.co.uk        <NA>        fsbusiness     co.uk
## 4                   www.vmm.adv.br         www               vmm    adv.br
## 5                          ttfc.cn        <NA>              ttfc        cn
## 6                     carole.co.il        <NA>            carole     co.il
## 7              visiontravail.qc.ca        <NA>     visiontravail     qc.ca
## 8         mail.space-hoppers.co.uk        mail     space-hoppers     co.uk
## 9                chilton.k12.pa.us        <NA>           chilton k12.pa.us

And there we have it!

One last thing: this is the first package I have created with unit tests.
The package is really simple, and adding unit tests seemed like a
no-brainer. After reading through Hadley Wickham’s Advanced R online
book and exploring how other packages use the testthat package, I
implemented a few simple tests. If you are creating (or about to create)
R packages, look at the source for the tldextract package for the
incredibly simple unit tests included with it!
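
As a rough idea of what such a test can look like, here is a small testthat sketch. It is not copied from the package's test suite. It assumes that both testthat and tldextract are installed and that tldextract() returns the host, subdomain, domain and tld columns shown in the output earlier in this post.

```r
library(testthat)
library(tldextract)

# Expected values are taken from the example output shown earlier in the post.
test_that("a simple host is split into its parts", {
  res <- tldextract("www.google.com")
  # as.character() guards against the columns being stored as factors.
  expect_equal(as.character(res$subdomain), "www")
  expect_equal(as.character(res$domain), "google")
  expect_equal(as.character(res$tld), "com")
})

test_that("multi-part suffixes are kept together", {
  res <- tldextract("www.google.com.ar")
  expect_equal(as.character(res$domain), "google")
  expect_equal(as.character(res$tld), "com.ar")
})
```

In a package, tests like these would normally live in files under tests/testthat/ so they run with the rest of the package checks.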
