Enter the name of a domain to test whether Reque✘ DNS would block access to it. The demo returns the domain's type (✔ benign or ✘ malicious), its DGA class, and the prediction probability.
This demo shows our classification model in action. In reality, the model would be integrated into our full solution, described in the sections below, and would handle thousands of requests automatically.
Malware and hacking are issues that affect all of society. In 2016 alone, cyber attacks cost the US economy over $100B. These attacks harm not only the economy but also the general public.
In September 2018, Hurricane Florence struck the North Carolina shoreline, bringing Category 1 winds and massive flooding.
Hackers also sensed an opportunity.
Their target was the Onslow Water and Sewer Authority. ONWASA is responsible for providing clean water to 175,000 people in the region. The hackers slipped the trojan-horse malware Emotet onto ONWASA’s network. Once established, Emotet contacted its remote command and control servers and downloaded more malware. This malware contained ransomware which encrypted and locked all of ONWASA’s customer-facing systems – in the middle of their disaster response!
The only way to recover them was for ONWASA to pay a ransom to the hackers. With no guarantee that the hackers would honor the terms of their extortion, ONWASA refused to pay. They had to revert to responding to all requests using manual processes. It took weeks for ONWASA to recover.
This story illustrates a common attack pattern for malware. Trojan-horse malware infects a single system on a network, then contacts its command and control servers for new instructions. Once it has infected enough systems, the trojan is given new malware to execute across the infected network.
The cyber attack experienced by ONWASA is just one of many similar attacks. There are two key aspects to these attacks which are described below.
This diagram from US-CERT shows how Emotet and similar malware works. We have highlighted in green a key step in Emotet’s attack sequence: its communication with command and control servers.
Communication with command and control servers (C2 servers) allows malware to download new attack instructions or to monetize its botnet by selling access to the network of infected computers. It was this capability that delivered the ransomware that caused the most acute trouble for ONWASA.
Emotet is one of many malware variants that utilize a domain generation algorithm (DGA) to evade established malware protection strategies. These domain generating algorithms can create thousands of domains per day. The typical operation of DGA-based malware is depicted in the diagram below.
The vast majority of DGA domains do not exist. But that is not a problem for the malware operator. They only need to register a small percentage of the domains for the malware to connect successfully to its C2 server.
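To make the mechanism concrete, here is a minimal sketch of how a DGA might work. It is an illustrative toy, not Emotet's actual algorithm; the seed, hashing scheme, and domain format are all assumptions.

```python
import hashlib
from datetime import date

def toy_dga(seed: str, day: date, count: int = 10, tld: str = ".com"):
    """Generate pseudo-random domains from a shared seed and the current date.

    Both the malware and its operator can derive the same list independently,
    so the operator only needs to register a handful of the domains.
    """
    domains = []
    for i in range(count):
        material = f"{seed}-{day.isoformat()}-{i}".encode()
        digest = hashlib.md5(material).hexdigest()
        # Map hex digits to lowercase letters to build a DGA-looking label.
        label = "".join(chr(ord("a") + int(c, 16) % 26) for c in digest[:12])
        domains.append(label + tld)
    return domains

print(toy_dga("example-seed", date(2018, 10, 1), count=5))
```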
Traditionally, security teams use blacklists to block these communications. Unfortunately, blacklists are not effective against domain generating algorithms: with thousands of new domains generated every day, the lists cannot keep up.
To solve the problems presented by malware that uses domain generation algorithms, our team is building Reque✘ DNS - an intelligent DNS firewall with an integrated deep-learning model that can block requests to algorithm-generated domains. This blockage prevents the malware from communicating with its C2 server.
Most DNS servers do not block traffic, and most DNS firewalls only use blacklists of known malicious domain names to determine if a domain is malicious or benign.
Below is an updated diagram illustrating how our solution works.
The malware-infected computer never receives the IP address of the C2 server, so no additional malware is downloaded. In ONWASA's case, Reque✘ DNS would have prevented the ransomware infection.
Below is a high level diagram of our solution's architecture.
The Reque✘ architecture has three layers - training, inference, and the application.
An important aspect of this problem space is that malware keeps evolving with new algorithms, so the models and their inferences must stay up to date for the DNS server to block malicious domains effectively. We use a layered architecture that keeps up with changing datasets and evolving models while keeping the application interface the same.
In the training layer, a pipeline collects data from different sources, then cleans and merges it into a single dataset. This dataset is used to train our models. The trained models are evaluated, and the best models are promoted to the inference layer.
In the inference layer, the model is served by an inference engine built on an Apache Storm-based pipeline that listens on a Kafka bus for new requests. The inferences include a binary classification of whether a domain is malicious or benign, and a sub-classification of malicious domains into DGA classes. These inferences are published back on the Kafka bus for the application layer.
In the application layer, we have DNS resolvers with a frontend load balancer and a DNS Master. These servers connect to the inference layer over the Kafka bus using different topics.
When a computer is configured to use our Reque✘ DNS server, its requests are sent to one of these DNS resolvers. The resolvers publish any non-cached domains to a topic on the Kafka bus, where they are picked up by the inference engine. The inference engine publishes its verdicts on another topic. The DNS Master listens to these topics, updates its Response Policy Zone (RPZ) list, and publishes it to the DNS resolvers.
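The sketch below illustrates this exchange using the kafka-python client. The broker address, topic names, and message format are assumptions for illustration, not the actual Reque✘ configuration.

```python
import json
from kafka import KafkaConsumer, KafkaProducer

BROKER = "localhost:9092"  # assumed broker address

# Resolver side: publish a non-cached domain for classification.
producer = KafkaProducer(
    bootstrap_servers=BROKER,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("domains-to-classify", {"domain": "example.com"})
producer.flush()

# DNS Master side: listen for verdicts and update the RPZ block list.
consumer = KafkaConsumer(
    "domain-verdicts",
    bootstrap_servers=BROKER,
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    verdict = message.value  # e.g. {"domain": ..., "malicious": ..., "dga_class": ...}
    if verdict.get("malicious"):
        print(f"Adding {verdict['domain']} to the RPZ block list")
```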
The diagram below shows the steps from importing raw data from Cisco Umbrella and Bambenek Consulting to training our model files.
Python scripts automate the training pipeline, from data collection to model evaluation.
The inference pipeline diagram provides a detailed look at how a domain name is predicted and classified in the Reque✘ DNS system.
The inference engine, which runs on Flask, provides REST APIs to the models; these APIs also serve the web interface used in the demo. It is packaged in a Docker container and can be launched in any cloud environment.
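As a rough illustration, a Flask endpoint serving a prediction might look like the sketch below. The route name, payload shape, and `predict_domain` helper are hypothetical stand-ins for the actual Reque✘ API and model call.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict_domain(domain: str) -> dict:
    # Placeholder for the real model call: returns the binary verdict,
    # the DGA class, and the prediction probability.
    return {"domain": domain, "malicious": False,
            "dga_class": "benign", "probability": 0.01}

@app.route("/classify", methods=["POST"])
def classify():
    domain = request.get_json(force=True).get("domain", "")
    return jsonify(predict_domain(domain))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```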
Our data came from two main sources: the free Cisco Umbrella list of the top million domain names by popularity, which we assumed to be benign, and Bambenek Consulting's feeds of recent DGA domains. For the Bambenek list, we used both the default (or mixed) confidence feed, and the high confidence feed. We analyzed a total of 1.36M domain names, 1 million benign (73.5%) and 360k malicious (26.5%).
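A minimal sketch of how these two sources might be merged into a single labeled dataset is shown below. The local file names and column layout are assumptions, and the real pipeline does more cleaning than this.

```python
import pandas as pd

# Cisco Umbrella top-1M list: rank,domain (assumed to be benign).
benign = pd.read_csv("umbrella_top_1m.csv", names=["rank", "domain"])
benign = benign[["domain"]].assign(label=0)

# Bambenek DGA feed: comment lines start with '#'; the column names here
# are an approximation of the feed layout.
malicious = pd.read_csv("bambenek_dga_feed.csv", comment="#",
                        names=["domain", "description", "date", "source"])
malicious = malicious[["domain"]].assign(label=1)

# Merge, deduplicate, and shuffle into a single training dataset.
dataset = (pd.concat([benign, malicious], ignore_index=True)
             .drop_duplicates(subset="domain")
             .sample(frac=1.0, random_state=42))
dataset.to_csv("training_data.csv", index=False)
```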
We performed some EDA on the datasets to see if there were obvious differences between the names of the DGA domains and the benign domains. A few differences were apparent. Although we did not ultimately use the EDA for feature engineering, it served to confirm that the datasets were different enough for a machine learning model to distinguish them.
Character Frequency Analysis
The following chart shows the frequency of each character in each domain name set. The values have been normalized to account for the different sizes of the datasets while preserving the distinct patterns of letter frequencies.
The data sets have distinct differences in frequency counts. There are far more j's, q's, x's and z's in the malicious domain names than in the benign ones. There are also no dashes in the malicious domain names and far more periods in the benign dataset.
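The normalization is straightforward: count every character in each dataset and divide by the dataset's total character count. A minimal sketch, assuming each dataset is simply a list of domain-name strings:

```python
from collections import Counter

def char_frequencies(domains):
    """Normalized character frequencies across a list of domain names."""
    counts = Counter("".join(domains))
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}

# Tiny illustrative inputs; the real comparison uses the full datasets.
benign_freq = char_frequencies(["amazon.com", "pages.github.com"])
dga_freq = char_frequencies(["xjqzkd.pw", "qzjxnvaw.biz"])
for ch in sorted(set(benign_freq) | set(dga_freq)):
    print(f"{ch}: benign={benign_freq.get(ch, 0):.3f}  dga={dga_freq.get(ch, 0):.3f}")
```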
Hierarchy Count Analysis
We calculated the hierarchy count for each domain name. Amazon.com would have a hierarchy count of 2, while pages.github.com would have a count of 3. The following histograms compare the three datasets by hierarchy count.
The one feature that jumps out is that benign domain names are more likely to have higher hierarchy counts than the malicious domain names, while the malicious DGA domains predominantly have hierarchy counts of 2. This matches what we saw above: benign domain names contain far more periods.
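For reference, the hierarchy count used above can be computed as the number of dot-separated labels in the name; a minimal sketch:

```python
def hierarchy_count(domain: str) -> int:
    """Number of dot-separated labels: amazon.com -> 2, pages.github.com -> 3."""
    return len(domain.strip(".").split("."))

assert hierarchy_count("amazon.com") == 2
assert hierarchy_count("pages.github.com") == 3
```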
Domain Name Length Analysis
We also compared the name lengths of the domains in the datasets. The histograms below show the distribution of domain name lengths for the different datasets.
The distribution of benign domain name lengths is roughly bell-shaped, peaking around 16 characters. Both the mixed and high confidence datasets have a second peak around 30 characters. You can also see that domain names shorter than 10 characters or longer than 33 are more likely to be benign than malicious.
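The length comparison itself is simple to reproduce. Below is a minimal sketch that buckets domain-name lengths into histogram bins, assuming each dataset is a list of strings:

```python
from collections import Counter

def length_histogram(domains, bin_size=5):
    """Bucket domain-name lengths into bins of `bin_size` characters."""
    bins = Counter((len(d) // bin_size) * bin_size for d in domains)
    return dict(sorted(bins.items()))

print(length_histogram(["amazon.com", "pages.github.com",
                        "xjqzkdnvprwabcdefqrstuvwxyz.biz"]))
```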
Top Level Domain Analysis
The top level domains of .com, .net and .org were the top three for each dataset. The charts below show the relative frequency of the 15 most common top level domains. Each chart shows the top 15 ordered by one of the datasets, with the other two datasets shown in grey for comparison.
Of special note is that .ru (for Russia) is the 4th most common top level domain in the Cisco Umbrella set. The .biz domain is also far more common in the malicious datasets. And especially interesting is that the .pw domain is the 6th most common top level domain for the malicious data sets, given that it is the country code top-level domain for Palau, an island nation with a population size of roughly 22,000.
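A minimal sketch of the top-level-domain extraction behind these charts, using the `tldextract` package (the real pipeline may extract suffixes differently):

```python
from collections import Counter
import tldextract

def tld_counts(domains):
    """Count public suffixes (e.g. com, biz, pw) across a list of domains."""
    return Counter(tldextract.extract(d).suffix for d in domains)

print(tld_counts(["amazon.com", "pages.github.com", "xjqzkd.pw", "qzjxnv.biz"]))
```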
Distinct Word Analysis
We also explored Natural Language Processing techniques to segment the domain labels into smaller chunks of meaningful words, to see whether metrics based on such segmentation show any appreciable difference between the DGA and benign categories.
The strongest signal we found is in the number of distinct words in the domain name, normalized by the domain length. There is at least an order-of-magnitude difference between the two categories.
Other metrics, such as raw or normalized scoring based on vocabulary contexts and the raw or normalized count of digit characters in the domain name, do not offer any appreciable differentiation between the DGA and benign domains.
Below you can see how DGA and benign domains compare in the number of distinct words, normalized by domain length. Both linear and log scales are shown.
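A minimal sketch of this metric, using the `wordsegment` package to split a domain into candidate dictionary words; our pipeline's segmentation and scoring may differ in detail:

```python
from wordsegment import load, segment

load()  # load the word-frequency data used by the segmenter

def distinct_word_score(domain: str) -> float:
    """Number of distinct multi-letter words found, normalized by name length."""
    name = domain.replace(".", "")
    words = {w for w in segment(name) if len(w) > 1}
    return len(words) / len(name) if name else 0.0

print(distinct_word_score("pages.github.com"))   # benign-looking name
print(distinct_word_score("xjqzkdnvprw.biz"))    # DGA-looking name
```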
Our main objective was to predict whether a domain name represented a malicious website or a benign one, with an accuracy rate close to 99%. Our second objective was to predict the DGA class of domain names deemed malicious. And our third objective was to minimize our false positive rate so as not to block benign websites.
We wanted a model that can learn to capture the mechanism generating the sequence of characters in the domain name, but not necessarily memorize it. A long short-term memory (LSTM) model is well suited to this task, and it avoids the vanishing gradient problem common with plain recurrent neural networks.
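A minimal sketch of a character-level LSTM binary classifier in Keras is shown below. The layer sizes, vocabulary size, and maximum domain length are illustrative assumptions, not our production configuration.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, LSTM, Dropout, Dense

MAX_LEN = 75       # assumed maximum number of characters per domain name
VOCAB_SIZE = 40    # lowercase letters, digits, '.', '-', plus padding

model = Sequential([
    Input(shape=(MAX_LEN,)),
    Embedding(input_dim=VOCAB_SIZE, output_dim=128),
    LSTM(128),
    Dropout(0.5),
    Dense(1, activation="sigmoid"),  # P(domain is DGA-generated)
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.summary()
```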
When our DNS server receives a domain name request, the domain is classified in two stages, as described below.
We created binary and multi-class LSTM models. After evaluating these LSTM models, we compared their performance against a baseline logistic regression model, which performed significantly worse for classifying domain names.
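The sketch below shows how the two models fit together at prediction time: the binary model decides allow/block, and the multiclass model labels the DGA family only for blocked domains. The `encode` function and the model objects here are dummy stand-ins so the example runs; they are not the actual Reque✘ components.

```python
def classify_domain(domain, binary_model, multiclass_model, encode, threshold=0.5):
    """Two-stage classification: binary allow/block, then DGA family if blocked."""
    x = encode(domain)
    p_malicious = binary_model(x)          # probability the domain is DGA-generated
    if p_malicious < threshold:
        return {"domain": domain, "malicious": False, "probability": p_malicious}
    dga_class = multiclass_model(x)        # DGA family label for blocked domains
    return {"domain": domain, "malicious": True,
            "probability": p_malicious, "dga_class": dga_class}

# Dummy stand-ins so the sketch runs end to end.
encode = lambda d: d
binary_model = lambda x: 0.97 if "xjqz" in x else 0.02
multiclass_model = lambda x: "example-dga-family"
print(classify_domain("xjqzkdnvprw.biz", binary_model, multiclass_model, encode))
print(classify_domain("amazon.com", binary_model, multiclass_model, encode))
```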
In cybersecurity, it is important to have high accuracy in order to prevent incorrect blockages. Our binary model (that either allows or blocks an incoming domain request) achieves approximately 98.9% accuracy on our test data.
This accuracy is even maintained over time. With a model we trained using data from October, we were able to achieve 97.6% accuracy on data from four weeks later. The one area the October model struggled with was new malware categories. This will require us to keep our models up-to-date with the latest data.
Another important metric is the false positive rate - the number of benign domains getting classified as malicious, and hence blocked. A high false positive rate would cause legitimate sites to be blocked. This would lead to a frustrating user experience and severely limit our product’s uptake.
Our binary model achieves a 0.7% false positive rate; in other words, for every 1,000 benign domains processed, roughly 7 are incorrectly classified as malicious. While this is acceptable at this early stage, we want to drive the false positive rate even lower. Published research suggests it can be brought close to 1 in 10,000 with sufficient training data.
The resulting metrics for the 3 models we evaluated are listed in the table and chart below.
Metric | LSTM Binary | LSTM Multiclass | Logistic Regression |
---|---|---|---|
Accuracy | 0.99 | 0.96 | 0.869 |
Precision | 0.975 | 0.958 | 0.865 |
Recall | 0.988 | 0.96 | 0.869 |
F1 Score | 0.982 | 0.958 | 0.865 |
False Positives | 7 out of 1000 | 6 out of 1000 | 45 out of 1000 |
False Negatives | 3 out of 1000 | 5 out of 1000 | 86 out of 1000 |
As shown above, our binary model performs best, followed by our multiclass model (whose accuracy is slightly lower because it must distinguish among many DGA classes), while the baseline logistic regression model lags substantially behind. Not only is its accuracy much lower, the logistic regression model is also 6 to 7 times worse on false positives.
MVP Roadmap

- ✔ Select data sources
- ✔ Build binary and multiclass models
- ✔ Build solution architecture
- ✔ Automate the pipeline from data collection to model evaluation
- ✔ Build demonstration UI (requex.net)
- ⭕ Integrate solution into DNS infrastructure
- ⭕ Deploy to first customer and evaluate

Post-MVP Roadmap

- ⭕ Build customer reporting and customization portal
- ⭕ Diversify our DGA threat intelligence data
- ⭕ Build new binary and multiclass models
- ⭕ Scale our infrastructure
Our team has made significant progress towards our MVP, as shown above.
Our next phase is to focus on integrating our models into our DNS infrastructure. From there, we need to find a customer willing to try our solution.
Beyond our MVP, we have plans to build a customer portal for reporting and customization. We also want to diversify our training data and build more sophisticated models. We anticipate malware will soon use more complex domain generating algorithms and we want to be ready for them.
Ultimately, we will need to scale our infrastructure to take on more customers.
We are looking for partners who can:
We would like to extend our deepest thanks to the following for their insight, support, data and time.
- Paul Vixie, CEO, Farsight Security
- John Bambenek, Bambenek Consulting
- Joyce Shen, UC Berkeley and Tenfore Holdings
- David Steier, UC Berkeley and Carnegie Mellon
- Cisco Umbrella, for the daily top 1 million domains list
- Endgame, Inc., for research on using LSTMs for Domain Generating Algorithms