Building a DGA Classifier: Part 1, Data Preparation

September 30, 2014

(This article was first published on Data Driven Security, and kindly contributed to R-bloggers)

This will be a three-part blog series on building a DGA classifier,
split into the three logical phases of building a classifier: 1) data
preparation (this post), 2) feature engineering and 3) model selection.
Before I get too far into this, I want to give a huge thank you to
Click Security for releasing a DGA classifier in Python as part of their
very nice Data Hacking GitHub repository. If you would rather see how a
classifier is built in Python, they did a great job of laying out the
steps in the code and an in-depth IPython notebook. Initially I didn’t
deviate much from their work, but as I rolled around in the data I
experimented with several different features, only to end up reducing my
feature set to a relatively simple set of three.

The DGA package and sample data are available from the DGA GitHub repository.

A little background on DGA

DGA stands for Domain Generating Algorithm, and these algorithms are
part of the evolution of malware communications. In the beginning,
malware would be hardcoded with IP addresses or domain names, and the
botnet could be disrupted by going after whatever was hardcoded. The
purpose of a DGA is to be deterministic, yet generate a whole lot of
random-looking (hard to guess) domains, of which the bot maintainer only
has to register one (or a handful) to enable the malware to phone home.
If the domain or IP is taken down, the maintainer can register a new
name from the algorithm with a new IP address and the botnet carries on.
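To make "deterministic, yet random-looking" concrete, here is a toy sketch in R. It is not the algorithm of any real malware family: the function name `toy_dga`, the seed value, the generator constants and the reserved `.example` TLD are all assumptions chosen for illustration.

```r
# Toy domain generator: the same seed always yields the same list of names,
# so a bot and its maintainer can compute the list independently.
toy_dga <- function(seed, n = 5, len = 12, tld = "example") {
  # Small multiplicative generator (illustrative constants); products stay
  # well below 2^53, so the arithmetic is exact in R's doubles.
  state <- seed %% 2147483647
  if (state == 0) state <- 1
  domains <- character(n)
  for (i in seq_len(n)) {
    chars <- character(len)
    for (j in seq_len(len)) {
      state <- (state * 16807) %% 2147483647
      chars[j] <- letters[state %% 26 + 1]   # map the state to a-z
    }
    domains[i] <- paste0(paste(chars, collapse = ""), ".", tld)
  }
  domains
}

# Hypothetical seed derived from a date; both sides would agree on it
todays_domains <- toy_dga(20140930)
print(todays_domains)
```

Running `toy_dga()` twice with the same seed gives the same names, which is exactly what lets a maintainer register just one of them ahead of time.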

The purpose of building a DGA classifier isn’t specifically to take
down botnets, but to discover and detect their use on our networks or
services. If you have a list of the domains resolved and accessed at
your organization, it is now possible to see which of those are
potentially generated and used by malware.

And where we are headed

My goal was to create a DGA classifier that was first and foremost easy
to use. So I created a package that could be loaded and used in other
code without having to think about the machine learning behind the
classifier. Hopefully it doesn’t get any easier than this:

Install and load the package from the DGA GitHub repository, then
classify a few domains:

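# known good domains (names as they appear in the output below;
# the full hostnames, including TLDs, are not shown here):
good <- c("facebook", "google", "sina",
          "twitter", "yandex", "msn")
# DGA domains generated by cryptolocker (names as in the output below)
bad <- c("kqcrotywqigo", "rlvukicfjceajm", "ibxaoddvcped",
         "tntuqxxbvxytpif", "heksblnvanyeug", "kbmqwdsrfzfqpdp")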

# classify the domains as either "legit" or "dga"
dgaPredict(c(good, bad))

##               name class prob
## 1         facebook legit 1.00
## 2           google legit 1.00
## 3             sina legit 1.00
## 4          twitter legit 1.00
## 5           yandex legit 1.00
## 6              msn legit 1.00
## 7     kqcrotywqigo   dga 1.00
## 8   rlvukicfjceajm   dga 1.00
## 9     ibxaoddvcped   dga 1.00
## 10 tntuqxxbvxytpif   dga 1.00
## 11  heksblnvanyeug   dga 0.98
## 12 kbmqwdsrfzfqpdp   dga 1.00

The function returns the domain name it extracted from each input, the
class it assigned and the probability behind that classification. You
can see from the output that all of these domains were clearly
classified (with high probability).
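As a rough sketch of how that output might be used downstream, the snippet below rebuilds the table above as a plain data frame (so it runs without the package) and keeps only the names flagged as "dga" above a cutoff. The 0.99 threshold and the variable names are assumptions for illustration, not something the package prescribes.

```r
# Stand-in for the dgaPredict() result shown above, copied by hand
res <- data.frame(
  name  = c("facebook", "google", "sina", "twitter", "yandex", "msn",
            "kqcrotywqigo", "rlvukicfjceajm", "ibxaoddvcped",
            "tntuqxxbvxytpif", "heksblnvanyeug", "kbmqwdsrfzfqpdp"),
  class = c(rep("legit", 6), rep("dga", 6)),
  prob  = c(rep(1.00, 6), 1.00, 1.00, 1.00, 1.00, 0.98, 1.00),
  stringsAsFactors = FALSE
)

# Hypothetical cutoff: only keep "dga" calls the classifier is very sure of
threshold <- 0.99
suspects <- res[res$class == "dga" & res$prob >= threshold, ]
print(suspects)
```

With this cutoff, the one name classified at 0.98 drops out; where to set the threshold depends on how many false alarms you are willing to chase.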

So that’s what these posts are going to walk through: what steps did
we go through to answer the deceptively simple question, “Is this
domain legitimate or generated by an algorithm?”

Getting and Cleaning the Data

The first major step in any classifier is getting training data. If that
term is new to you, think of training data like the answer key to a
test. We want the list of the questions (domain/host names) and the
associated answer (whether each is “legit” or “dga”). This is also
called “supervised” data, “labeled” data or “ground truth” data. In some
cases (seems like most cases in infosec), establishing reliable training
data is a huge challenge, but in this case we’re lucky. All we need is a
list of good/legitimate domains and a second list of domains generated
by an algorithm, and we should be able to get that. In the example from
Click Security, they offer several data sets that we could copy, but
we’ll seek out our own lists because classifiers like this are very
sensitive to the choices you make when gathering the training data.


For samples of legitimate domains, an obvious choice is the Alexa list
of top web sites. But it’s not ready for our use as is. If you grab the
Alexa top 1 million list and parse it, you’ll find just over 11 thousand
entries are full URLs and not just domains, and there are thousands of
domains with subdomains that don’t help us (we are only classifying on
domains here). So after I remove the URLs, de-duplicate the domains and
clean everything up, I end up with the Alexa top 965,843.
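The cleanup described above can be sketched in a few lines of base R. This is a minimal illustration, not the code used for the real data set: the `raw` vector is made up, `clean_hosts` is a hypothetical helper, and keeping only the last two labels is a naive assumption that would mishandle multi-part suffixes.

```r
# Hypothetical raw entries standing in for rows of the Alexa list
raw <- c("google.com", "example.com/some/page", "blog.example.org",
         "example.org", "Google.com")

clean_hosts <- function(x) {
  x <- tolower(trimws(x))
  # Drop full URLs: anything that still carries a path
  x <- x[!grepl("/", x, fixed = TRUE)]
  # Naive domain extraction: keep the last two dot-separated labels
  labels <- strsplit(x, ".", fixed = TRUE)
  dom <- vapply(labels,
                function(p) paste(tail(p, 2), collapse = "."),
                character(1))
  # De-duplicate while preserving the original ranking order
  unique(dom)
}

cleaned <- clean_hosts(raw)
print(cleaned)
```

A more careful version would consult the Public Suffix List instead of counting labels, since the registered domain is not always the last two labels.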

“Real World” Data from OpenDNS

After reading the post from Frank Denis at OpenDNS titled “Why Using
Real World Data Matters For Building Effective Security Models”, I
grabbed their 10,000 Top domains and their 10,000 Random domains. If we
compare those to the top Alexa domains, 6,901 of the top ten thousand
are in the Alexa data and 893 of the random domains are in the Alexa
data. I will clean that up as I make the final training data set.
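Counting that kind of overlap takes one line per list in R. The vectors below are tiny, made-up stand-ins (the `.example` names are placeholders), so the counts here have nothing to do with the real 6,901 and 893.

```r
# Made-up stand-ins for the three lists
alexa          <- c("google.com", "facebook.com", "youtube.com", "yahoo.com")
opendns_top    <- c("google.com", "facebook.com", "intranet.example")
opendns_random <- c("yahoo.com", "rare-site.example", "another.example")

# How many OpenDNS entries already appear in the Alexa data?
overlap_top    <- sum(opendns_top %in% alexa)
overlap_random <- sum(opendns_random %in% alexa)

# Entries that would be new to the "legit" pool after removing duplicates
new_legit <- setdiff(c(opendns_top, opendns_random), alexa)

print(c(top = overlap_top, random = overlap_random))
print(new_legit)
```

Using `setdiff()` when merging the sources is one simple way to make sure a domain is not counted twice in the training data.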

DGA domains

The Click Security version wasn’t very clear about where they got their
bad domains, so I decided to collect my own, and this was rather fun.
Because I work with some interesting characters (who know interesting
characters), I was able to collect several data sets from recent
botnets: “Cryptolocker”, two separate “GameOver Zeus” algorithms, and
an anonymous collection of malicious (and algorithmically generated)
domains. In the end, I was able to collect 73,598 algorithmically
generated domains.

Creating the labeled data set

Once we clean up the various sources, we want a simple, labeled data set
consisting of five columns: the full host name, the 2nd-level domain,
the top-level domain, the main class, which is either legitimate
(“legit”) or from a domain generating algorithm (“dga”), and finally a
subclass (either the specific botnet or the legit source). I have quite
a bit of data here and I will use all of it in the final model (assuming
I have enough memory to process it). However, this much data would mean
a lot of waiting for you and some rather large files to post on a blog.
So, the data set I’m making available is 10,000 domains. The “legit”
domains are composed of the top 1,000 Alexa domains, along with 4,000
randomly sampled Alexa and OpenDNS domains. The “dga” domains are
randomly sampled from the 70,000+ domains I have in my collection.
Working with a total of 10,000 samples should be enough to get some good
results while reducing the processing times in the next two parts. This
is a sample of the data in the sampledga data set available in the DGA
package.

rbind(head(sampledga, 5), 
      sampledga[sample(which(sampledga$subclass=="opendns"), 5), ],
      sampledga[sample(which(sampledga$subclass=="cryptolocker"), 5), ])

##        host           domain tld class     subclass
## 1                     google com legit        alexa
## 2                   facebook com legit        alexa
## 3                    youtube com legit        alexa
## 4                      yahoo com legit        alexa
## 5                      baidu com legit        alexa
## 976585               xt12365  cn legit      opendns
## 967852               adobesc com legit      opendns
## 980838               lotustv  cc legit      opendns
## 981307      hqcowichanvalley com legit      opendns
## 983055                online  ms legit      opendns
## 25721         brclxtulykdemb  ru   dga cryptolocker
## 24225         sdixbryxaxqrmf  ru   dga cryptolocker
## 25479         syipowvqvktasf  ru   dga cryptolocker
## 15776        chaksikvqltbdeo  ru   dga cryptolocker
## 7948            jpmwlfhrjbly  ru   dga cryptolocker
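For readers who want to assemble a similar five-column data set from their own lists, here is a minimal sketch. The helper `label_hosts` is hypothetical, the host names are assembled from the domain and tld columns shown above purely for illustration, the domain/TLD split is naive (last two labels), and drawing equal numbers from each class is an assumption.

```r
# Hypothetical helper: turn a vector of hosts into the five labeled columns
label_hosts <- function(host, class, subclass) {
  parts <- strsplit(host, ".", fixed = TRUE)
  data.frame(
    host     = host,
    # Naive split: second-to-last label is the domain, last label is the TLD
    domain   = vapply(parts, function(p) p[max(length(p) - 1, 1)], character(1)),
    tld      = vapply(parts, function(p) p[length(p)], character(1)),
    class    = class,
    subclass = subclass,
    stringsAsFactors = FALSE
  )
}

legit <- label_hosts(c("google.com", "yahoo.com", "baidu.com"),
                     "legit", "alexa")
dga   <- label_hosts(c("brclxtulykdemb.ru", "jpmwlfhrjbly.ru"),
                     "dga", "cryptolocker")

# Draw the same number of rows from each class (assumed balanced sample)
set.seed(1)
n <- min(nrow(legit), nrow(dga))
training <- rbind(legit[sample(nrow(legit), n), ],
                  dga[sample(nrow(dga), n), ])
print(table(training$class))
```

Fixing the seed makes the sample reproducible, which helps when comparing features and models in later steps.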

The sample data is available in the DGA GitHub repository. Stay tuned
for the next part as I get into “feature engineering”.
