This is the first post in a three-part series on building a DGA classifier, split into the three logical phases of the work: 1) Data preparation (this post), 2) Feature engineering and 3) Model selection. Before I get too far into this, I want to give a huge thank you to Click Security for releasing a DGA classifier in Python as part of their very nice Data Hacking GitHub repo. If you would rather see how a classifier is built in Python, they did a great job of laying out the steps in the code and in an in-depth IPython notebook. Initially I didn’t deviate much from their work, but as I rolled around in the data I experimented with several different features, only to end up reducing my feature set to a relatively simple set of 3.
The DGA package and sample data are available from the DGA GitHub repository.
A little background on DGA
DGA stands for Domain Generating Algorithm and these algorithms are part of the evolution of malware communications. In the beginning malware would be hardcoded with IP address(es) or domain names, and the botnet could be disrupted by going after whatever was hardcoded. The purpose of the DGA is to be deterministic, yet generate a whole lot of random-looking (hard to guess) domains, of which the bot maintainer only has to register one (or a handful) to enable the malware to phone home. If the domain or IP is taken down, the botnet maintainer can register a new name from the algorithm with a new IP address and the botnet lives on.
The purpose of building a DGA classifier isn’t specifically to take down botnets, but to discover and detect their use on our networks or services. If you have a list of the domains resolved and accessed at your organization, it is now possible to see which of those are potentially generated and used by malware.
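To make "deterministic, yet random-looking" concrete, here is a toy sketch in base R. It is not the algorithm of any real botnet: the function name `toy_dga`, the date-based seed, the label lengths and the `.ru` TLD are all assumptions chosen for illustration.

```r
# Toy DGA: NOT a real malware algorithm, only an illustration of the idea.
# The same seed date always yields the same list of domains, so the bot
# and its operator can compute the list independently.
toy_dga <- function(seed_date, n = 5, tld = "ru") {
  # derive an integer seed from the date, e.g. "2014-10-01" -> 20141001
  seed <- as.integer(format(as.Date(seed_date), "%Y%m%d"))
  set.seed(seed)  # note: this resets R's global random number generator
  vapply(seq_len(n), function(i) {
    len   <- sample(12:15, 1)  # random label length
    label <- paste(sample(letters, len, replace = TRUE), collapse = "")
    paste(label, tld, sep = ".")
  }, character(1))
}

today    <- toy_dga("2014-10-01")
again    <- toy_dga("2014-10-01")
tomorrow <- toy_dga("2014-10-02")

identical(today, again)  # TRUE: same seed, same domains
```

A defender who knows the algorithm and the seed can compute the same list, which is why a takedown of one name only forces the operator to move on to the next.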
And where we are headed
My goal was to create a DGA classifier that was first and foremost easy to use. So I created a package that could be loaded and used in other code without having to think about the machine learning behind the classifier. Hopefully it doesn’t get any easier than this:
Install the package:
    # install from GitHub (only needed once), then load the package
    devtools::install_github("jayjacobs/dga")
    library(dga)

    # known good domains:
    good <- c("facebook.com", "google.com", "sina.com.cn",
              "twitter.com", "yandex.ru", "msn.com")

    # DGA domains generated by cryptolocker
    bad <- c("kqcrotywqigo.ru", "rlvukicfjceajm.ru", "ibxaoddvcped.ru",
             "tntuqxxbvxytpif.ru", "heksblnvanyeug.ru", "kbmqwdsrfzfqpdp.ru")

    # classify the domains as either "legit" or "dga"
    dgaPredict(c(good, bad))

    ##               name class prob
    ## 1         facebook legit 1.00
    ## 2           google legit 1.00
    ## 3             sina legit 1.00
    ## 4          twitter legit 1.00
    ## 5           yandex legit 1.00
    ## 6              msn legit 1.00
    ## 7     kqcrotywqigo   dga 1.00
    ## 8   rlvukicfjceajm   dga 1.00
    ## 9     ibxaoddvcped   dga 1.00
    ## 10 tntuqxxbvxytpif   dga 1.00
    ## 11  heksblnvanyeug   dga 0.98
    ## 12 kbmqwdsrfzfqpdp   dga 1.00
The function returns the domain name it extracted from each host name, the class it assigned and the probability the classifier attached to that class. You can see from the output that all of these domains were classified clearly (with high probability).
So that’s what these posts are going to walk through… what steps did we go through to answer the deceptively simple question of “Is this domain legitimate or generated by an algorithm?”
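In practice you would probably act only on the confident "dga" calls. The sketch below uses a hand-built data frame shaped like the printed output (columns `name`, `class`, `prob`). That the real return value is a data frame with exactly these column names is inferred from the output above, and the 0.99 cutoff is an arbitrary assumption.

```r
# Hand-built stand-in shaped like the dgaPredict() output shown above;
# the values are copied from that output.
preds <- data.frame(
  name  = c("facebook", "google", "heksblnvanyeug", "kbmqwdsrfzfqpdp"),
  class = c("legit", "legit", "dga", "dga"),
  prob  = c(1.00, 1.00, 0.98, 1.00),
  stringsAsFactors = FALSE
)

# Hypothetical cutoff: only flag DGA calls the model is very sure about
threshold <- 0.99
suspects  <- preds[preds$class == "dga" & preds$prob >= threshold, ]
suspects$name
```

Lowering the threshold catches more generated domains at the cost of more false alarms. Where to put it depends on how much follow-up work each flagged domain causes.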
Getting and Cleaning the Data
The first major step in any classifier is getting training data. If that term is new to you, think of training data like the answer key to a test. We want the list of the questions (domain/host names) and the associated answer (whether each is “legit” or “dga”). This is also called “supervised” data, “labeled” data or “ground truth” data. In some cases (seems like most cases in infosec), establishing reliable training data is a huge challenge, but in this case we’re lucky. All we need is a list of good/legitimate domains and a second list of domains generated by an algorithm, and we should be able to get both. In their example, Click Security offer several data sets that we could copy, but we’ll seek out our own lists because classifiers like this are very sensitive to the choices you make when gathering the training data.
For samples of legitimate domains, an obvious choice is the Alexa list of top web sites. But it’s not ready for our use as is. If you grab the top 1 million Alexa domains and parse them, you’ll find that just over 11 thousand are full URLs and not just domains, and that thousands of entries carry subdomains that don’t help us (we are only classifying on domains here). So after I remove the URLs, de-duplicate the domains and clean everything up, I end up with the Alexa top 965,843.
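The clean-up described above can be sketched in a few lines of base R. The input entries are made up, and the reduction to a registered domain is deliberately naive: `two_part` is a tiny illustrative subset of multi-label suffixes, where a real clean-up would need something like the full Public Suffix List. Collapsing subdomains onto their parent domain (rather than dropping those entries) is also an assumption.

```r
# Made-up entries in the style of a raw top-sites list
raw <- c("google.com", "example.com/some/page", "mail.google.com",
         "bbc.co.uk", "news.bbc.co.uk", "facebook.com")

# 1) drop anything that looks like a URL rather than a bare host name
hosts <- raw[!grepl("/", raw, fixed = TRUE)]

# 2) naive reduction to the registered domain: keep the last two labels,
#    or the last three when the name ends in a known two-part suffix
two_part <- c("co.uk", "com.cn", "com.au")  # illustrative subset only
to_domain <- function(host) {
  labels <- strsplit(host, ".", fixed = TRUE)[[1]]
  n <- length(labels)
  if (n <= 2) return(host)
  keep <- if (paste(labels[n - 1], labels[n], sep = ".") %in% two_part) 3 else 2
  paste(labels[(n - keep + 1):n], collapse = ".")
}

# 3) de-duplicate after the reduction
domains <- unique(vapply(hosts, to_domain, character(1), USE.NAMES = FALSE))
domains
```

De-duplicating after the reduction matters: `mail.google.com` and `google.com` only become duplicates once the subdomain is stripped.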
“Real World” Data from OpenDNS
After reading the post from Frank Denis at OpenDNS titled “Why Using Real World Data Matters For Building Effective Security Models”, I grabbed their 10,000 top domains and their 10,000 random samples. If we compare those to the top Alexa domains, 6,901 of the top ten thousand are in the Alexa data and 893 of the random domains are in the Alexa data. I will clean that up as I make the final training data set.
The Click Security version wasn’t very clear about where they got their bad domains, so I decided to collect my own, and this was rather fun. Because I work with some interesting characters (who know interesting characters), I was able to collect several data sets from recent botnets: “Cryptolocker”, two separate “GameOver Zeus” algorithms, and an anonymous collection of malicious (and algorithmically generated) domains. In the end, I was able to collect 73,598 algorithmically generated domains.
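Counting that kind of overlap is a one-liner with `%in%`, and `setdiff()` gives the domains that one list adds over the other. The two vectors below are small made-up stand-ins for the cleaned lists, not the real data.

```r
# Made-up stand-ins for the cleaned Alexa and OpenDNS lists
alexa   <- c("google.com", "facebook.com", "youtube.com", "yahoo.com")
opendns <- c("google.com", "adobesc.com", "yahoo.com", "lotustv.cc")

# how many OpenDNS domains are already in the Alexa data?
overlap <- sum(opendns %in% alexa)

# keep only the OpenDNS domains that add something new
opendns_only <- setdiff(opendns, alexa)

overlap       # 2
opendns_only  # "adobesc.com" "lotustv.cc"
```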
Creating the labeled data set
Once we clean up the various sources we want a simple, labeled data set consisting of five columns: the full domain, the 2nd-level domain, the top-level domain, the main class, which is either legitimate (“legit”) or from a domain generating algorithm (“dga”), and finally a subclass (either the specific botnet or the legit source). I have quite a bit of data here and I will use all of it in the final model (assuming I have enough memory to process it). However, this much data would mean a lot of waiting for you and some rather large files to post on a blog. So the data set I’m making available is 10,000 domains. The “legit” domains are composed of the top 1,000 Alexa domains, along with 4k randomly sampled Alexa and OpenDNS domains. The “dga” domains are randomly sampled from the 70,000+ domains I have in my collection. Working with a total of 10,000 samples should be enough to get some good results while reducing the processing times in the next two parts.
This is a sample of the data in the sampledga data set available in the DGA package:
    library(dga)
    data(sampledga)
    set.seed(2)
    rbind(head(sampledga, 5),
          sampledga[sample(which(sampledga$subclass=="opendns"), 5), ],
          sampledga[sample(which(sampledga$subclass=="cryptolocker"), 5), ])

    ##                        host           domain tld class     subclass
    ## 1                google.com           google com legit        alexa
    ## 2              facebook.com         facebook com legit        alexa
    ## 3               youtube.com          youtube com legit        alexa
    ## 4                 yahoo.com            yahoo com legit        alexa
    ## 5                 baidu.com            baidu com legit        alexa
    ## 976585           xt12365.cn          xt12365  cn legit      opendns
    ## 967852          adobesc.com          adobesc com legit      opendns
    ## 980838           lotustv.cc          lotustv  cc legit      opendns
    ## 981307 hqcowichanvalley.com hqcowichanvalley com legit      opendns
    ## 983055            online.ms           online  ms legit      opendns
    ## 25721     brclxtulykdemb.ru   brclxtulykdemb  ru   dga cryptolocker
    ## 24225     sdixbryxaxqrmf.ru   sdixbryxaxqrmf  ru   dga cryptolocker
    ## 25479     syipowvqvktasf.ru   syipowvqvktasf  ru   dga cryptolocker
    ## 15776    chaksikvqltbdeo.ru  chaksikvqltbdeo  ru   dga cryptolocker
    ## 7948        jpmwlfhrjbly.ru     jpmwlfhrjbly  ru   dga cryptolocker
The sample data is available in the DGA GitHub repository. Stay tuned for the next part, where I get into “feature engineering”.
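For readers who want to assemble a similar five-column data set from their own lists, here is a minimal base R sketch. The helper `label_hosts` is hypothetical (it is not part of the dga package) and it splits naively on a single dot, so it assumes host names of the form `domain.tld`. Drawing the same number of rows from each class is also an assumption, though it is consistent with the 10,000-domain sample described above.

```r
# Hypothetical helper: build one labeled block from "domain.tld" host names.
# Naive split: assumes exactly one dot per host (no suffixes like "co.uk").
label_hosts <- function(hosts, class, subclass) {
  parts <- strsplit(hosts, ".", fixed = TRUE)
  data.frame(
    host     = hosts,
    domain   = vapply(parts, `[`, character(1), 1),
    tld      = vapply(parts, `[`, character(1), 2),
    class    = class,
    subclass = subclass,
    stringsAsFactors = FALSE
  )
}

# Small inputs; the names are taken from the sample output above
legit <- label_hosts(c("google.com", "facebook.com", "youtube.com"),
                     "legit", "alexa")
dga   <- label_hosts(c("brclxtulykdemb.ru", "jpmwlfhrjbly.ru"),
                     "dga", "cryptolocker")
all_rows <- rbind(legit, dga)

# Draw the same number of rows from each class
set.seed(2)
n_each <- 2
pick <- unlist(lapply(split(seq_len(nrow(all_rows)), all_rows$class),
                      function(idx) idx[sample.int(length(idx), n_each)]),
               use.names = FALSE)
training <- all_rows[pick, ]
table(training$class)
```

Sampling row indices with `sample.int(length(idx), n_each)` avoids the surprise that `sample(idx, n)` treats a single number as the range `1:idx`.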