This is the first post in a three-part series on building a DGA
classifier. The series follows the three logical phases of building a
classifier: 1) data preparation (this post), 2) feature engineering and
3) model selection.
And before I get too far into this, I want to give a huge thank you to
Click Security for releasing a DGA classifier in Python as part of their
very nice Data Hacking GitHub repo. If you would rather see how a
classifier is built in Python, they did a great job of laying out the
steps in the code and in an in-depth IPython notebook.
Initially I didn’t deviate much from their work, but as I rolled around
in the data I experimented with several different features, only to end
up reducing my feature set to a relatively simple set of three.
The DGA package and sample data are available from the DGA GitHub
repository.
A little background on DGA
DGA stands for Domain Generating Algorithm, and these algorithms are
part of the evolution of malware communications. In the beginning,
malware would be hardcoded with IP address(es) or domain names, and the
botnet could be disrupted by going after whatever was hardcoded. The
purpose of the DGA is to be deterministic, yet generate a whole lot of
random-looking (hard to guess) domains, of which the bot maintainer only
has to register one (or a handful) to enable the malware to phone home.
If the domain or IP is taken down, the botnet maintainer can register a
new name from the algorithm with a new IP address and the botnet is
maintained.
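To make "deterministic, yet random-looking" concrete, here is a minimal toy sketch in base R. It is not the algorithm of any real malware family: the function name `toy_dga`, the date-based seed, the fixed label length and the reserved `.example` TLD are all assumptions made for illustration. The point is only that anyone who knows the seed can regenerate the same list.

```r
# Toy domain generator for illustration only (hypothetical, not real malware logic).
# The same seed always yields the same list, so both ends can compute it independently.
toy_dga <- function(seed, n = 5, len = 12, tld = ".example") {
  modulus <- 2147483647                 # 2^31 - 1, the Park-Miller modulus
  state <- seed %% modulus
  if (state == 0) state <- 1            # a zero state would stay at zero forever
  domains <- character(n)
  for (i in seq_len(n)) {
    chars <- character(len)
    for (j in seq_len(len)) {
      # Park-Miller "minimal standard" step; products stay exact in doubles
      state <- (state * 16807) %% modulus
      chars[j] <- letters[state %% 26 + 1]
    }
    domains[i] <- paste0(paste(chars, collapse = ""), tld)
  }
  domains
}

# Seed derived from a date, so the list changes daily but is reproducible
seed <- as.numeric(format(as.Date("2014-10-01"), "%Y%m%d"))
toy_dga(seed)
```

Calling `toy_dga(seed)` twice returns identical vectors, while a different date produces a different list. That is the property that lets a bot maintainer register just one of the names ahead of time.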
The purpose of building a DGA classifier isn’t specifically to take
down botnets, but to discover and detect their use on our networks or
services. If you have a list of domains resolved and accessed at your
organization, it is now possible to see which of those are potentially
generated and used by malware.
And where we are headed
My goal was to create a DGA classifier that was first and foremost easy
to use. So I created a package that could be loaded and used in other
code without having to think about the machine learning behind the
classifier. Hopefully it doesn’t get any easier than this:
Install the package:
devtools::install_github("jayjacobs/dga")
library(dga)
# known good domains:
good <- c("facebook.com", "google.com", "sina.com.cn",
          "twitter.com", "yandex.ru", "msn.com")
# DGA domains generated by cryptolocker
bad <- c("kqcrotywqigo.ru", "rlvukicfjceajm.ru", "ibxaoddvcped.ru",
         "tntuqxxbvxytpif.ru", "heksblnvanyeug.ru", "kbmqwdsrfzfqpdp.ru")
# classify the domains as either "legit" or "dga"
dgaPredict(c(good, bad))
##               name class prob
## 1         facebook legit 1.00
## 2           google legit 1.00
## 3             sina legit 1.00
## 4          twitter legit 1.00
## 5           yandex legit 1.00
## 6              msn legit 1.00
## 7     kqcrotywqigo   dga 1.00
## 8   rlvukicfjceajm   dga 1.00
## 9     ibxaoddvcped   dga 1.00
## 10 tntuqxxbvxytpif   dga 1.00
## 11  heksblnvanyeug   dga 0.98
## 12 kbmqwdsrfzfqpdp   dga 1.00
The function returns the domain it extracted from each name, the class
it assigned and the probability behind that classification. You can see
from the output that all of these domains were classified with high
probability.
So that’s what these posts are going to walk through… what steps did
we go through to answer the deceptively simple question of “Is this
domain legitimate or generated by an algorithm?”
Getting and Cleaning the Data
The first major step in any classifier is getting training data. If that
term is new to you, think of training data like the answer key to a
test. We want the list of the questions (domain/host names) and the
associated answer (whether each is “legit” or “dga”). This is also
called “supervised” data, “labeled” data or “ground truth” data. In some
cases (seems like most cases in infosec), establishing reliable training
data is a huge challenge, but in this case we’re lucky. All we need is a
list of good/legitimate domains and a second list of domains generated
by an algorithm and we should be able to get that. Click Security offers
several data sets with their example that we could copy, but we’ll seek
out our own lists because classifiers like this are very sensitive to
the choices you make when gathering the training data.
Alexa
For samples of legitimate domains, an obvious choice is to go to the
Alexa list of top web sites. But it’s not ready for our use as is. If
you grab the top 1 million Alexa domains and parse the list, you’ll find
just over 11 thousand entries are full URLs and not just domains, and
there are thousands of domains with subdomains that don’t help us (we
are only classifying on domains here). So after I remove the URLs,
de-duplicate the domains and clean everything up, I end up with the
Alexa top 965,843.
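As a rough sketch of that cleanup, the following base R snippet drops URL entries and de-duplicates. The `raw` vector is made-up stand-in data, `clean_hosts` is a hypothetical helper name, and treating "contains a slash" as "is a URL" is an assumption rather than the exact rule used for the real list. It deliberately leaves subdomains alone, since deciding where the registered domain starts needs a public suffix list.

```r
# Hypothetical raw entries standing in for rows of the downloaded Alexa file
raw <- c("google.com", "Google.com", "example.com/some/page",
         "facebook.com", "blog.example.org", "sina.com.cn")

clean_hosts <- function(x) {
  x <- tolower(trimws(x))                # normalise case and stray whitespace
  x <- x[!grepl("/", x, fixed = TRUE)]   # drop full URLs (anything with a path)
  unique(x)                              # de-duplicate, keeping first-seen order
}

clean_hosts(raw)
```

On the stand-in data this keeps four of the six entries: the URL is dropped and the two spellings of google.com collapse into one.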
“Real World” Data from OpenDNS
After reading the post from Frank Denis at OpenDNS titled “Why Using
Real World Data Matters For Building Effective Security
Models”,
I grabbed their 10,000 Top Domains and their 10,000 Random samples.
Compared with the Alexa data, 6,901 of the top ten thousand and 893 of
the random domains also appear in the Alexa list. I will clean up that
overlap as I build the final training data set.
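An overlap count like this is a one-liner with `%in%`. The vectors below are tiny made-up stand-ins for the real lists, and the variable names are assumptions for illustration.

```r
# Tiny hypothetical stand-ins for the cleaned Alexa list and an OpenDNS list
alexa       <- c("google.com", "facebook.com", "youtube.com", "yahoo.com")
opendns_top <- c("google.com", "yahoo.com", "adobesc.com")

in_alexa <- opendns_top %in% alexa        # TRUE where the domain is already in Alexa
overlap  <- sum(in_alexa)                 # number of shared domains

# Keep only the OpenDNS domains that add something new to the training data
opendns_only <- opendns_top[!in_alexa]
```

Dropping the shared entries from one source keeps each domain from being counted twice under different subclasses.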
DGA domains
The Click Security version wasn’t very clear about where they got their
bad domains, so I decided to collect my own, and this was rather fun.
Because I work with some interesting characters (who know interesting
characters), I was able to collect several data sets from recent
botnets: “Cryptolocker”, two separate “GameOver Zeus” algorithms, and
an anonymous collection of malicious (and algorithmically generated)
domains. In the end, I was able to collect 73,598 algorithmically
generated domains.
Creating the labeled data set
Once we clean up the various sources we want a simple, labeled data set
consisting of five columns: the full domain, the second-level domain,
the top-level domain, the main class, which is either legitimate
(“legit”) or from a domain generating algorithm (“dga”), and finally a
subclass (either the specific botnet or the legit source). I have quite
a bit of data here and I will use all of it in the final model (assuming
I have enough memory to process it). However, that much data would mean
a lot of waiting for you and some rather large files to post on a blog.
So the data set I’m making available is 10,000 domains. The “legit”
domains are composed of the top 1,000 Alexa domains, along with 4,000
randomly sampled Alexa and OpenDNS domains. The “dga” domains are
randomly sampled from the 70,000+ domains I have in my collection.
Working with a total of 10,000 samples should be enough to get some good
results while reducing the processing times in the next two parts. This
is a sample of the data in the sampledga data set available in the DGA
package.
library(dga)
data(sampledga)
set.seed(2)
rbind(head(sampledga, 5),
      sampledga[sample(which(sampledga$subclass=="opendns"), 5), ],
      sampledga[sample(which(sampledga$subclass=="cryptolocker"), 5), ])
##                        host           domain tld class     subclass
## 1                google.com           google com legit        alexa
## 2              facebook.com         facebook com legit        alexa
## 3               youtube.com          youtube com legit        alexa
## 4                 yahoo.com            yahoo com legit        alexa
## 5                 baidu.com            baidu com legit        alexa
## 976585           xt12365.cn          xt12365  cn legit      opendns
## 967852          adobesc.com          adobesc com legit      opendns
## 980838           lotustv.cc          lotustv  cc legit      opendns
## 981307 hqcowichanvalley.com hqcowichanvalley com legit      opendns
## 983055            online.ms           online  ms legit      opendns
## 25721     brclxtulykdemb.ru   brclxtulykdemb  ru   dga cryptolocker
## 24225     sdixbryxaxqrmf.ru   sdixbryxaxqrmf  ru   dga cryptolocker
## 25479     syipowvqvktasf.ru   syipowvqvktasf  ru   dga cryptolocker
## 15776    chaksikvqltbdeo.ru  chaksikvqltbdeo  ru   dga cryptolocker
## 7948        jpmwlfhrjbly.ru     jpmwlfhrjbly  ru   dga cryptolocker
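For readers who want to assemble a data set with the same five columns from their own lists, here is a minimal base R sketch. The helper names `label_hosts` and `sample_per_class` are hypothetical, and the domain/TLD split is a naive "last two labels" rule that assumes every host has at least two labels. It would mislabel multi-part suffixes such as com.cn, so treat it as a placeholder for a proper public-suffix lookup rather than the method behind sampledga.

```r
# Build the five-column layout (host, domain, tld, class, subclass) from raw hosts.
label_hosts <- function(host, class, subclass) {
  parts <- strsplit(host, ".", fixed = TRUE)
  data.frame(
    host     = host,
    # naive split: second-to-last label is the domain, last label is the TLD
    domain   = vapply(parts, function(p) p[length(p) - 1], character(1)),
    tld      = vapply(parts, function(p) p[length(p)], character(1)),
    class    = class,
    subclass = subclass,
    stringsAsFactors = FALSE
  )
}

# Draw up to n rows at random from each class so the classes stay balanced
sample_per_class <- function(df, n) {
  groups <- split(df, df$class)
  picked <- lapply(groups, function(d) d[sample(nrow(d), min(n, nrow(d))), ])
  do.call(rbind, picked)
}

legit <- label_hosts(c("google.com", "facebook.com", "yahoo.com"), "legit", "alexa")
dga   <- label_hosts(c("brclxtulykdemb.ru", "jpmwlfhrjbly.ru"), "dga", "cryptolocker")
labeled <- rbind(legit, dga)

set.seed(2)
balanced <- sample_per_class(labeled, 2)
```

Sampling per class rather than from the whole table is a design choice: it keeps the "legit" and "dga" halves the same size no matter how lopsided the raw collections are.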
The sample data is available in the DGA GitHub repository. Stay tuned
for the next part, where I get into “feature engineering”.