Parsing Domain Names in R with tldextract

August 4, 2014

(This article was first published on Data Driven Security, and kindly contributed to R-bloggers)

The R language is really good at data and statistical analysis, but when
it comes to working with information security data it has a few holes
that need plugging. Bob has written a couple of posts using Rcpp
to do things like Basic DNS
and IPv4.
I wanted to add to some of that work with a quick package for parsing
domain names.

While *.com, *.net and *.org top-level domains are easy
to parse, the rest of the world gets messy rather quickly. Just taking the
entry after the last dot creates problems for top-level domains that are
made up of more than one label. Or, to make things even more complicated, the
name of “” is considered (for name
parsing) to be a top-level domain, and the domain name we’d want to
process is the name that would appear before the us-west-1 in that name.
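
To make the problem concrete, here is a minimal base R sketch that compares the "last dot" approach with a lookup against a list of known suffixes. The three suffixes, the example host name and the helper names `naive_tld()` and `split_host()` are assumptions made up for illustration; this is not the code inside the tldextract package, and it ignores the wildcard and exception rules of the real list.

```r
# Hypothetical, hand-picked suffixes for illustration only; the real
# Public Suffix List is far longer.
suffixes <- c("com", "uk", "co.uk")

# Naive approach: treat whatever follows the last dot as the TLD.
naive_tld <- function(host) {
  labels <- strsplit(host, ".", fixed = TRUE)[[1]]
  labels[length(labels)]
}

# Suffix-aware approach: try the longest candidate suffix first and
# stop at the first one that appears in the list.
split_host <- function(host, suffixes) {
  labels <- strsplit(tolower(host), ".", fixed = TRUE)[[1]]
  n <- length(labels)
  for (i in seq_len(n)) {
    candidate <- paste(labels[i:n], collapse = ".")
    if (candidate %in% suffixes) {
      domain <- if (i > 1) labels[i - 1] else NA_character_
      subdomain <- if (i > 2) {
        paste(labels[1:(i - 2)], collapse = ".")
      } else {
        NA_character_
      }
      return(c(subdomain = subdomain, domain = domain, tld = candidate))
    }
  }
  # No known suffix matched.
  c(subdomain = NA_character_, domain = NA_character_, tld = NA_character_)
}

naive_tld("www.example.co.uk")             # "uk": the "co" part is lost
split_host("www.example.co.uk", suffixes)  # tld is "co.uk", domain is "example"
```

Trying the longest candidate first is what lets "co.uk" win over plain "uk" when both are in the list.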

Introducing TLD Extract (the R version)

It’s always easier to imitate than to reinvent, so I took some time
to read through the
tldextract Python
package and used it to check that my code was executing properly during
development, which is why I used the same name for the R package. The data for the
package is drawn from the same source as the Python package, the Public
Suffix List
from the Mozilla Foundation.
For convenience, I include a cached version of the data so it can run offline.
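
For readers who have not looked at the Public Suffix List itself: it is a plain text file with one rule per line, comment lines that start with `//`, wildcard rules that start with `*.` and exception rules that start with `!`. The sketch below shows one way such lines could be turned into a data frame in base R. The sample lines are hand-written, and `parse_suffix_lines()` is a hypothetical helper, not the parser used by the package.

```r
# A few lines in the style of the Public Suffix List file
# (hand-written sample, not the real data).
psl_text <- c(
  "// comment lines start with two slashes",
  "",
  "com",
  "uk",
  "co.uk",
  "*.ck",
  "!www.ck"
)

# Drop blank lines and comments, then flag wildcard and exception rules.
parse_suffix_lines <- function(lines) {
  lines <- trimws(lines)
  rules <- lines[nzchar(lines) & !startsWith(lines, "//")]
  data.frame(
    rule = rules,
    wildcard = startsWith(rules, "*."),
    exception = startsWith(rules, "!"),
    stringsAsFactors = FALSE
  )
}

rules <- parse_suffix_lines(psl_text)
print(rules)
```

With a local copy of the list, the same function could be fed the result of `readLines()` on that file.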


To install this package, use the devtools package:



Using the package is fairly straightforward: it returns a data frame with the
original name and separate columns for each parsed component.

# use the cached lookup data, simple call

##             host subdomain domain tld
## 1       www google com

# it can take multiple domains at the same time
tldextract(c("", "", "", ""))

##                host subdomain     domain    tld
## 1       www     google    com
## 2       www     google
## 3      <NA> googlemaps     ca
## 4      tbn0     google     cn

The specification for the top-level domains is cached in the package and
is viewable.

# view and update the TLD domains list in the tldnames data

## [1] "ac"     "" "" "" "" ""

If the cached version is out of date and the package hasn’t been updated, the
data can be manually loaded and then passed into the function.

# get most recent TLD listings
tld <- getTLD() # optionally pass in a different URL than the default
manyhosts <- c("", "", 
               "", "", "", "",
               "", "", "")
tldextract(manyhosts, tldnames=tld)

##                               host   subdomain            domain       tld
## 1  marionautomotive       com
## 2         www embroiderypassion       com
## 3               <NA>        fsbusiness
## 4                  www               vmm
## 5                        <NA>              ttfc        cn
## 6                   <NA>            carole
## 7            <NA>     visiontravail
## 8        mail     space-hoppers
## 9              <NA>           chilton

And there we have it!

One last thing: this is the first package I created with unit tests.
This package is really simple, and adding unit tests seemed like a
no-brainer. After reading through Hadley Wickham’s Advanced R
online book and exploring how
other packages implement the
testthat package, I implemented a
few simple tests. If you are creating (or about to create) R packages,
look at the source for the tldextract package
for the incredibly simple unit tests included with it!
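
To give a feel for how little is needed, here is a minimal sketch of a testthat test file. It assumes the testthat package is installed, and it tests `last_label()`, a hypothetical helper defined only for this example. These are not the tests that ship with tldextract.

```r
library(testthat)

# Hypothetical helper, defined here only so there is something to test.
last_label <- function(host) {
  labels <- strsplit(host, ".", fixed = TRUE)[[1]]
  if (length(labels) == 0) NA_character_ else labels[length(labels)]
}

test_that("last_label returns the final dot-separated label", {
  expect_equal(last_label("www.example.com"), "com")
  expect_equal(last_label("localhost"), "localhost")
})

test_that("last_label returns NA for an empty string", {
  expect_true(is.na(last_label("")))
})
```

In a package, a file like this would normally live under `tests/testthat/` and test the package's exported functions rather than a helper defined in the test file.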
