
This will be a three-part blog series on building a DGA classifier, split into the three logical phases of building a classifier: 1) data preparation (this post), 2) feature engineering and 3) model selection. Before I get too far into this, I want to give a huge thank you to Click Security for releasing a DGA classifier in Python as part of their very nice Data Hacking GitHub repo. If you would rather see how a classifier is built in Python, they did a great job of laying out the steps in the code and in an in-depth IPython notebook. Initially I didn’t deviate much from their work, but as I rolled around in the data I experimented with several different features, only to end up reducing my feature set to a relatively simple set of 3.

The DGA package and sample data are available from the DGA GitHub repository.

### A little background on DGA

DGA stands for Domain Generating Algorithm, and these algorithms are part of the evolution of malware communications. In the beginning, malware would be hardcoded with IP address(es) or domain names, and the botnet could be disrupted by going after whatever was hardcoded. The purpose of the DGA is to be deterministic, yet generate a whole lot of random-looking (hard to guess) domains, of which the bot maintainer only has to register one (or a handful) to enable the malware to phone home. If the domain or IP is taken down, the botnet maintainer can register a new name from the algorithm with a new IP address, and the botnet is maintained.
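To make “deterministic, yet random-looking” concrete, here is a toy sketch in R. It is not the algorithm of any real malware family: the function name `toy_dga`, the date-based seed and the generator constants are all assumptions made for illustration.

```r
# Toy domain generating algorithm (illustrative only, not from real malware).
toy_dga <- function(seed_date, n = 5, len = 12, tld = "ru") {
  # Turn the date into a number so every bot computes the same starting state.
  state <- as.numeric(format(as.Date(seed_date), "%Y%m%d"))
  domains <- character(n)
  for (i in seq_len(n)) {
    chars <- character(len)
    for (j in seq_len(len)) {
      # Simple multiplicative generator; the constants are arbitrary choices.
      state <- (state * 16807) %% 2147483647
      chars[j] <- letters[state %% 26 + 1]
    }
    domains[i] <- paste0(paste(chars, collapse = ""), ".", tld)
  }
  domains
}

# The same date always yields the same list of candidate domains.
toy_dga("2014-09-30", n = 3)
```

Because the output depends only on the date, the malware and its maintainer can each compute the same list independently, and the maintainer only needs to register one of the names.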

The purpose of building a DGA classifier isn’t specifically to take down botnets, but to discover and detect their use on our networks or services. If you have a list of the domains resolved and accessed at your organization, it is now possible to see which of those are potentially generated and used by malware.

### And where we are headed

My goal was to create a DGA classifier that was first and foremost easy to use. So I created a package that could be loaded and used in other code without having to think about the machine learning behind the classifier. Hopefully it doesn’t get any easier than this:

Install the package:

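    devtools::install_github("jayjacobs/dga")

    library(dga)

    # known good domains (TLDs assumed from the names in the output below):
    good <- c("sina.com.cn", "yandex.ru", "msn.com")
    # DGA domains generated by cryptolocker (".ru" assumed for the first three):
    bad <- c("kqcrotywqigo.ru", "rlvukicfjceajm.ru", "ibxaoddvcped.ru",
             "tntuqxxbvxytpif.ru", "heksblnvanyeug.ru", "kbmqwdsrfzfqpdp.ru")

    # classify the domains as either "legit" or "dga"
    # (dgaPredict is assumed to be the package's prediction function)
    dgaPredict(c(good, bad))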

    ##               name class prob
    ## 3             sina legit 1.00
    ## 5           yandex legit 1.00
    ## 6              msn legit 1.00
    ## 7     kqcrotywqigo   dga 1.00
    ## 8   rlvukicfjceajm   dga 1.00
    ## 9     ibxaoddvcped   dga 1.00
    ## 10 tntuqxxbvxytpif   dga 1.00
    ## 11  heksblnvanyeug   dga 0.98
    ## 12 kbmqwdsrfzfqpdp   dga 1.00


The function returns the domain name it extracted from each host name, the classification assigned and the probability behind that classification. You can see from the output that all of these domains were clearly classified (with high probability).
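Since the result has `name`, `class` and `prob` columns, picking out the suspicious entries is ordinary data frame filtering. The sketch below uses a small hand-built data frame copied from the output above rather than a live call to the package; the helper `flag_dga` and its `min_prob` threshold are hypothetical names chosen for illustration.

```r
# Stand-in for a classifier result, with values copied from the output above.
res <- data.frame(
  name  = c("sina", "heksblnvanyeug", "kbmqwdsrfzfqpdp"),
  class = c("legit", "dga", "dga"),
  prob  = c(1.00, 0.98, 1.00),
  stringsAsFactors = FALSE
)

# Keep rows labeled "dga" whose probability meets a chosen threshold.
flag_dga <- function(res, min_prob = 0.9) {
  res[res$class == "dga" & res$prob >= min_prob, ]
}

flag_dga(res)
```

Raising `min_prob` trades fewer false alarms for more missed domains, so the right value depends on how much follow-up work each flagged domain costs you.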

So that’s what these posts are going to walk through: what steps did we go through to answer the deceptively simple question, “Is this domain legitimate or generated by an algorithm?”

# Getting and Cleaning the Data

The first major step in any classifier is getting training data. If that term is new to you, think of training data like the answer key to a test. We want the list of the questions (domain/host names) and the associated answer (whether each is “legit” or “dga”). This is also called “supervised” data, “labeled” data or “ground truth” data. In some cases (seems like most cases in infosec), establishing reliable training data is a huge challenge, but in this case we’re lucky. All we need is a list of good/legitimate domains and a second list of domains generated by an algorithm, and we should be able to get that. In the example from Click Security, they offer several data sets that we could copy, but we’ll seek out our own list because classifiers like this are very sensitive to the choices you make when gathering the training data.

### Alexa

For samples of legitimate domains, an obvious choice is to go to the Alexa list of top web sites. But it’s not ready for our use as is. If you grab the top 1 million Alexa domains and parse the list, you’ll find that just over 11 thousand entries are full URLs and not just domains, and there are thousands of domains with subdomains that don’t help us (we are only classifying on domains here). So after I remove the URLs, de-duplicate the domains and clean things up, I end up with the Alexa top 965,843.
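A minimal sketch of that kind of cleanup is below. It is not the exact code used for this post: the function name `clean_alexa` is hypothetical, and it only handles the URL and duplicate cases. Collapsing subdomains down to the registered domain needs a public suffix list, so it is left out here.

```r
# Rough cleanup of a vector of Alexa-style entries (illustrative sketch).
clean_alexa <- function(entries) {
  # Normalize case and surrounding whitespace so duplicates compare equal.
  entries <- tolower(trimws(entries))
  # An entry containing a slash carries a URL path, so it is not a bare domain.
  entries <- entries[!grepl("/", entries, fixed = TRUE)]
  # Drop empty strings and repeated entries.
  unique(entries[nzchar(entries)])
}

clean_alexa(c("Google.com", "google.com", "example.com/path", " yahoo.com ", ""))
```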

### “Real World” Data from OpenDNS

After reading the post from Frank Denis at OpenDNS titled “Why Using Real World Data Matters For Building Effective Security Models”, I grabbed their 10,000 Top Domains and their 10,000 Random samples. If we compare those to the top Alexa domains, 6,901 of the top ten thousand are in the Alexa data and 893 of the random domains are in the Alexa data. I will clean that up as I make the final training data set.
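Counting that kind of overlap is a one-liner with `%in%`, and `setdiff` gives the entries that are not already covered. The vectors below are tiny hypothetical stand-ins for the real lists, so the counts are not the ones quoted above.

```r
# Hypothetical toy vectors standing in for the real lists.
alexa <- c("google.com", "yahoo.com", "baidu.com", "msn.com")
opendns_top <- c("google.com", "msn.com", "lotustv.cc")

# How many OpenDNS entries already appear in the Alexa list?
n_overlap <- sum(opendns_top %in% alexa)

# Keep only the OpenDNS entries that add something new.
opendns_new <- setdiff(opendns_top, alexa)
```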

### DGA domains

The Click Security version wasn’t very clear about where they got their bad domains, so I decided to collect my own, and this was rather fun. Because I work with some interesting characters (who know interesting characters), I was able to collect several data sets from recent botnets: “Cryptolocker”, two separate “GameOver Zeus” algorithms, and an anonymous collection of malicious (and algorithmically generated) domains. In the end, I was able to collect 73,598 algorithmically generated domains.

# Creating the labeled data set

Once we clean up the various sources, we want a simple, labeled data set consisting of five columns: the full domain (host), the 2nd-level domain, the top-level domain, the main class, which is either legitimate (“legit”) or from a domain generating algorithm (“dga”), and finally a subclass (either the specific botnet or the legit source). I have quite a bit of data here and I will use all of it in the final model (assuming I have enough memory to process it). However, this much data would create a lot of waiting for you and some rather large files to post on a blog. So the data set I’m making available is 10,000 domains. The “legit” domains are composed of the top 1,000 Alexa domains, along with 4,000 randomly sampled Alexa and OpenDNS domains. The “dga” domains are randomly sampled from the 70,000+ domains I have in my collection. Working with a total of 10,000 samples should be enough to get some good results while reducing the processing times in the next two parts. This is a sample of the data in the sampledga data set available in the DGA package.

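    library(dga)
    data(sampledga)
    set.seed(2)
    # stack a few rows from each source
    # (the first argument, the top alexa rows, is an assumed reconstruction)
    rbind(sampledga[1:5, ],
          sampledga[sample(which(sampledga$subclass=="opendns"), 5), ],
          sampledga[sample(which(sampledga$subclass=="cryptolocker"), 5), ])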

    ##                        host           domain tld class     subclass
    ## 4                 yahoo.com            yahoo com legit        alexa
    ## 5                 baidu.com            baidu com legit        alexa
    ## 976585           xt12365.cn          xt12365  cn legit      opendns
    ## 980838           lotustv.cc          lotustv  cc legit      opendns
    ## 981307 hqcowichanvalley.com hqcowichanvalley com legit      opendns
    ## 983055            online.ms           online  ms legit      opendns
    ## 25721     brclxtulykdemb.ru   brclxtulykdemb  ru   dga cryptolocker
    ## 24225     sdixbryxaxqrmf.ru   sdixbryxaxqrmf  ru   dga cryptolocker
    ## 25479     syipowvqvktasf.ru   syipowvqvktasf  ru   dga cryptolocker
    ## 15776    chaksikvqltbdeo.ru  chaksikvqltbdeo  ru   dga cryptolocker
    ## 7948        jpmwlfhrjbly.ru     jpmwlfhrjbly  ru   dga cryptolocker
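For readers who want to build a table with the same five columns from their own lists, here is a rough sketch. The helper `label_hosts` is hypothetical and is not how the package builds `sampledga`. It splits on the last dot, which is an assumption that only holds for single-label TLDs such as `com` or `ru`; a host under a multi-part suffix like `.com.cn` needs a public suffix list to split correctly.

```r
# Build a labeled data frame from a vector of host names (naive split).
label_hosts <- function(hosts, class, subclass) {
  parts <- strsplit(hosts, ".", fixed = TRUE)
  # Last label is treated as the TLD, the label before it as the domain.
  tld <- vapply(parts, function(p) p[length(p)], character(1))
  domain <- vapply(parts, function(p) {
    if (length(p) >= 2) p[length(p) - 1] else NA_character_
  }, character(1))
  data.frame(host = hosts, domain = domain, tld = tld,
             class = class, subclass = subclass,
             stringsAsFactors = FALSE)
}

# Hosts taken from the sample output above.
legit <- label_hosts(c("yahoo.com", "baidu.com"), "legit", "alexa")
dga <- label_hosts(c("brclxtulykdemb.ru", "jpmwlfhrjbly.ru"), "dga", "cryptolocker")
training <- rbind(legit, dga)
```

Keeping the `subclass` column alongside `class` makes it easy to check later whether the classifier does worse on one botnet or one legit source than on the others.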


The sample data is available in the DGA GitHub repository. Stay tuned for the next part, where I get into “feature engineering”.