Parsing Domain Names in R with tldextract

[This article was first published on Data Driven Security, and kindly contributed to R-bloggers.]

The R language is really good at data and statistical analysis, but when
it comes to working with information security data it has a few holes
that need plugging. Bob has written a couple of posts using Rcpp
to do things like Basic DNS and IPv4. I wanted to add to some of that
work with a quick package for parsing domain names.

While *.com, *.net and *.org top-level domains are easy
to parse, the rest of the world gets messy rather quickly. Just taking the
entry after the last dot creates problems for top-level domains that
span more than one label. Or, to make things even more complicated, some
longer names are considered (for name parsing) to be a top-level domain,
and the domain name we’d want to process is the name that would appear
before the us-west-1 in such a name.
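To make the problem concrete, here is a minimal sketch in base R that contrasts the naive "last dot" approach with a longest-suffix match. The three-entry suffix list and the helper names `naive_tld` and `split_host` are assumptions made for illustration; this is not the package's implementation, and the real Public Suffix List is far longer.

```r
# Illustrative suffix list only; the real Public Suffix List is far longer.
suffixes <- c("com", "uk", "co.uk")

# Naive approach: keep whatever follows the last dot.
naive_tld <- function(host) sub(".*\\.", "", host)

# Hypothetical helper: split one host name on the longest matching suffix.
split_host <- function(host, suffixes) {
  labels <- strsplit(tolower(host), ".", fixed = TRUE)[[1]]
  n <- length(labels)
  # Walk from the longest candidate suffix to the shortest; starting at
  # i = 2 always leaves at least one label for the domain.
  for (i in seq_len(n)[-1]) {
    candidate <- paste(labels[i:n], collapse = ".")
    if (candidate %in% suffixes) {
      return(list(
        subdomain = paste(labels[seq_len(i - 2)], collapse = "."),
        domain    = labels[i - 1],
        tld       = candidate
      ))
    }
  }
  # No suffix matched: report the pieces as missing.
  list(subdomain = NA_character_, domain = NA_character_, tld = NA_character_)
}

naive_tld("www.example.co.uk")             # last label only
split_host("www.example.co.uk", suffixes)  # longest suffix wins
```

Trying the longest candidate first is what lets a multi-label suffix take priority over the single label that follows the last dot.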

Introducing TLD Extract (the R version)

It’s always easier to imitate than to reinvent, so I took some time
to read through the tldextract Python package and used it to check that
my code was executing properly during development. For that reason I
used the same name for the R package. The data for the package is drawn
from the same source as the Python package: the Public Suffix List from
the Mozilla Foundation.
For convenience, I include a cached version of the data so it can run offline.
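For readers who have not looked at the Public Suffix List, it is a plain text file with one rule per line, `//` comment lines, and a few wildcard (`*.`) and exception (`!`) rules. The sketch below reads a small inline sample in that format; the sample lines and the helper name `read_suffix_rules` are assumptions for illustration, and how the package itself processes the file is not shown in this post.

```r
# A tiny inline sample in the Public Suffix List text format.
sample_list <- c(
  "// comment lines start with two slashes",
  "com",
  "",
  "co.uk",
  "*.ck",
  "!www.ck"
)

# Hypothetical helper: drop comments and blank lines, keep the rules.
read_suffix_rules <- function(lines) {
  lines <- trimws(lines)
  lines[nzchar(lines) & !startsWith(lines, "//")]
}

rules <- read_suffix_rules(sample_list)

# Separate plain rules from wildcard ("*.") and exception ("!") rules.
plain   <- rules[!grepl("^[*!]", rules)]
special <- rules[grepl("^[*!]", rules)]
plain
special
```

The same helper could be applied to the output of `readLines()` on a downloaded copy of the list.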


To install this package, use the devtools package:



Using the package is fairly straightforward: it returns a data frame with the
original name and separate columns for each parsed component.

# use the cached lookup data, simple call
tldextract("www.google.com")

##             host subdomain domain tld
## 1 www.google.com       www google com

# it can take multiple domains at the same time
tldextract(c("", "", "", ""))

##   host subdomain     domain tld
## 1            www     google com
## 2            www     google
## 3                googlemaps  ca
## 4           tbn0     google  cn
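Because the result is an ordinary data frame, the usual base R tools apply to it. The sketch below works on a hand-built stand-in with the same four columns rather than on a real call to the package; the host names are made up, and representing a missing subdomain as an empty string is an assumption.

```r
# Hand-built stand-in for a tldextract() result; the hosts are made up.
parsed <- data.frame(
  host      = c("www.example.com", "mail.example.com", "example.org"),
  subdomain = c("www", "mail", ""),
  domain    = c("example", "example", "example"),
  tld       = c("com", "com", "org"),
  stringsAsFactors = FALSE
)

# Rebuild the registered domain (domain + tld) and count hosts under each one.
parsed$registered <- paste(parsed$domain, parsed$tld, sep = ".")
counts <- table(parsed$registered)
counts
```

Grouping on the rebuilt registered domain is a common first step when summarising logs that contain many host names under a few domains.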

The specification for the top-level domains is cached in the package and
is viewable.

# view and update the TLD domains list in the tldnames data

## [1] "ac"     "" "" "" "" ""

If the cached version is out of date and the package hasn’t been updated, the
data can be loaded manually and then passed into the function.

# get most recent TLD listings
tld <- getTLD() # optionally pass in a different URL than the default
manyhosts <- c("", "", 
               "", "", "", "",
               "", "", "")
tldextract(manyhosts, tldnames=tld)

##   host subdomain            domain tld
## 1                 marionautomotive com
## 2            www embroiderypassion com
## 3                       fsbusiness
## 4            www               vmm
## 5                             ttfc  cn
## 6                           carole
## 7                    visiontravail
## 8           mail     space-hoppers
## 9                          chilton

And there we have it!

One last thing: this is the first package I have created with unit tests.
The package is really simple, so adding unit tests seemed like a
no-brainer. After reading through Hadley Wickham’s Advanced R
online book and exploring how other packages use the testthat package,
I implemented a few simple tests. If you are creating (or about to
create) R packages, look at the source for the tldextract package
for the incredibly simple unit tests included with it!
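For anyone who has not used testthat before, a test file can be as small as the sketch below. The function under test, `last_label`, is a hypothetical stand-in defined in the snippet itself; these are not the tests shipped with the tldextract package.

```r
library(testthat)

# Hypothetical function under test: return the text after the last dot.
last_label <- function(host) sub(".*\\.", "", host)

test_that("last_label returns the final label", {
  expect_equal(last_label("www.example.com"), "com")
  expect_equal(last_label(c("a.b.org", "example.net")), c("org", "net"))
})

test_that("a host with no dot is returned unchanged", {
  expect_equal(last_label("localhost"), "localhost")
})
```

In a package, files like this usually live under `tests/testthat/` and are run with `devtools::test()`.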
