Reque DNS

Block Malicious Domain Requests Intelligently





Demo


Enter the name of a domain to test if Reque DNS would block access to it.




Type:
DGA Class:
Prediction Probability:

This demo shows our classification model in action. In the full solution, described in the sections below, the model would be integrated into a DNS service capable of handling thousands of requests automatically.







The Problem


Malware and hacking are issues that affect all of society. In 2016 alone, cyber attacks cost the US economy over $100B. These attacks affect not only the economy but the general public as well.

A Story

In September 2018, Hurricane Florence struck the North Carolina shoreline. It brought Category 1 winds and massive flooding.

Hurricane Florence

Source: Threat Post

Hackers also sensed an opportunity.

Their target was the Onslow Water and Sewer Authority. ONWASA is responsible for providing clean water to 175,000 people in the region. The hackers slipped the trojan-horse malware Emotet onto ONWASA’s network. Once established, Emotet contacted its remote command and control servers and downloaded more malware. This malware contained ransomware which encrypted and locked all of ONWASA’s customer-facing systems – in the middle of their disaster response!

The only way to recover them was for ONWASA to pay a ransom to the hackers. With no guarantee that the hackers would honor the terms of their extortion, ONWASA refused to pay. They had to revert to responding to all requests using manual processes. It took weeks for ONWASA to recover.

This story illustrates a common attack pattern for malware. Trojan-horse malware infects a single system on a network. It then contacts its command and control servers for new instructions. Once it has infected enough systems, the trojan is given new malware to execute across the infected network.

The cyber attack experienced by ONWASA is just one of many similar attacks. There are two key aspects to these attacks which are described below.

Command and Control Servers

This diagram from US-CERT shows how Emotet and similar malware works. We have highlighted in green a key step in Emotet’s attack sequence: its communication with command and control servers.

Emotet infection process

Source: US-CERT, emphasis on step 3 added

Communication with command and control servers (C2 servers) allows malware operators to deliver new attack instructions or to monetize their botnet by selling access to the network of infected computers. It was this capability that downloaded the ransomware that caused the most acute trouble for ONWASA.

Domain Generation Algorithm (DGA)

Emotet is one of many malware variants that utilize a domain generation algorithm (DGA) to evade established malware protection strategies. These domain generating algorithms can create thousands of domains per day. The typical operation of DGA-based malware is depicted in the diagram below.

DGA-based malware operation

  • Step 1: The malware sends requests to a DNS server to see if a generated domain exists.
  • Step 2: The DNS server responds with an IP address, or a message that the domain does not exist.
  • Step 3: Once the malware has the IP address, it can request an update from the C2 server.
  • Step 4: The C2 server sends an update which launches the next phase of the attack.

The vast majority of DGA domains do not exist. But that is not a problem for the malware operator. They only need to register a small percentage of the domains for the malware to connect successfully to its C2 server.

Traditionally, security teams use blacklists to block these communications. Unfortunately, blacklists are not effective against domain generating algorithms: a list of known names cannot keep pace with the thousands of new domains generated every day.
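To make the scale problem concrete, here is a minimal sketch of a date-seeded generator. It is a toy written for illustration and is not the algorithm used by Emotet or any other malware family; the function name `toy_dga`, the seed string, and the hashing scheme are all assumptions.

```python
import hashlib
from datetime import date


def toy_dga(seed, day, count=5, tld=".com"):
    """Return `count` pseudo-random domains for the given day (toy example)."""
    domains = []
    for i in range(count):
        # Hash the shared seed, the date, and a counter. Anyone who knows the
        # seed can regenerate the same list for the same day.
        material = "{}-{}-{}".format(seed, day.isoformat(), i).encode()
        digest = hashlib.sha256(material).hexdigest()
        # Map each of the first 12 hex digits to a letter in the range a-p.
        label = "".join(chr(ord("a") + int(ch, 16)) for ch in digest[:12])
        domains.append(label + tld)
    return domains


today_list = toy_dga("example-seed", date(2018, 9, 14))
tomorrow_list = toy_dga("example-seed", date(2018, 9, 15))
```

Because the list changes every day, a blacklist built from yesterday's names says nothing about today's, which is why a static list struggles against this style of malware.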


Solution


To solve the problems presented by malware that uses domain generation algorithms, our team is building Reque DNS - an intelligent DNS firewall with an integrated deep-learning model that can block requests to algorithm-generated domains. This blockage prevents the malware from communicating with its C2 server.

Most DNS servers do not block traffic, and most DNS firewalls only use blacklists of known malicious domain names to determine if a domain is malicious or benign.

Below is an updated diagram illustrating how our solution works.

DNS solution that halts communication with C2 servers

  • Step 1: The malware sends requests to a DNS server to see if a generated domain exists.
  • Step 2: The DNS server analyzes the domain request with our deep-learning model and determines that the domain is malicious. It responds the same as if the domain did not exist.

The malware-infected computer never receives the IP address of the C2 server so no additional malware is downloaded. In ONWASA’s case, Reque DNS would have prevented the infection with the ransomware.
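The two steps above can be sketched as a single decision function. This is an illustration under stated assumptions, not the Reque DNS implementation: `answer_query`, the `classify` and `resolve` callables, the response dictionary, and the 0.5 threshold are all hypothetical.

```python
def answer_query(domain, classify, resolve, threshold=0.5):
    """Answer a DNS query, hiding the address of suspected DGA domains.

    classify(domain) -> probability that the domain is malicious (assumed)
    resolve(domain)  -> IP address string, or None if the name does not exist
    """
    if classify(domain) >= threshold:
        # Same answer the client would get for a domain that does not exist.
        return {"domain": domain, "rcode": "NXDOMAIN", "address": None}
    address = resolve(domain)
    if address is None:
        return {"domain": domain, "rcode": "NXDOMAIN", "address": None}
    return {"domain": domain, "rcode": "NOERROR", "address": address}


# Stand-ins for the trained model and the upstream resolver.
def fake_classify(domain):
    return 0.97 if domain == "qxzjvkpwhtgr.biz" else 0.02


def fake_resolve(domain):
    return "192.0.2.10" if domain == "example.com" else None


blocked = answer_query("qxzjvkpwhtgr.biz", fake_classify, fake_resolve)
allowed = answer_query("example.com", fake_classify, fake_resolve)
```

From the client's side, a blocked domain and a non-existent domain look identical, so the malware gets no hint that it was detected.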

Architecture

Below is a high level diagram of our solution's architecture.


RequeX architecture


The Reque architecture has three layers - training, inference, and the application.

An important aspect of this problem space is that malware keeps changing its algorithms, so the models and their inferences must stay up to date for the DNS server to block malicious domains effectively. We use a layered architecture that can keep up with changing datasets and evolving models while keeping the application interface the same.

In the training layer, we have a pipeline that collects data from different sources, then cleans it up and merges it into a single dataset. This data is used to train our models. The trained models are evaluated and the best models are promoted to the inference layer.

In the inference layer, the model is served in an inference engine using an Apache Storm based pipeline which listens on a Kafka bus for new requests. The inferences include a binary classification of whether a domain is malicious or benign, and a sub-classification of the malicious domains. These inferences are published on the Kafka bus to serve applications.

In the application layer, we have DNS resolvers with a frontend load balancer and a DNS Master. These servers connect to the inference layer using the Kafka bus with different topics.

When a computer is configured to use our Reque DNS server, its requests are sent to one of these DNS resolvers. The resolvers publish any non-cached domains on a topic on the Kafka bus served by the inference engine. The inference engine publishes its responses on another topic on the Kafka bus. The DNS master listens to these topics, updates its Response Policy Zone (RPZ) list, and publishes it to the DNS resolvers.
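The message flow described above can be sketched with an in-memory stand-in for the Kafka bus. This is a simplified simulation, not Kafka client code: the topic names, function names, and the `is_malicious` callable are assumptions. The RPZ record format follows the convention that a `CNAME .` record asks the resolver to answer NXDOMAIN.

```python
from collections import defaultdict, deque

# In-memory stand-in for the Kafka bus: one FIFO queue per topic.
bus = defaultdict(deque)
REQUEST_TOPIC = "domain-requests"    # hypothetical topic name
RESPONSE_TOPIC = "domain-verdicts"   # hypothetical topic name


def resolver_publish(domain, cache):
    """Resolver side: publish only domains that are not already cached."""
    if domain not in cache:
        cache.add(domain)
        bus[REQUEST_TOPIC].append(domain)


def inference_step(is_malicious):
    """Inference side: consume requests and publish one verdict per domain."""
    while bus[REQUEST_TOPIC]:
        domain = bus[REQUEST_TOPIC].popleft()
        verdict = {"domain": domain, "malicious": is_malicious(domain)}
        bus[RESPONSE_TOPIC].append(verdict)


def master_update_rpz(rpz_records):
    """DNS master side: turn malicious verdicts into RPZ records."""
    while bus[RESPONSE_TOPIC]:
        verdict = bus[RESPONSE_TOPIC].popleft()
        if verdict["malicious"]:
            # In an RPZ zone, "CNAME ." means "answer NXDOMAIN for this name".
            rpz_records.add("{} CNAME .".format(verdict["domain"]))
    return rpz_records
```

Decoupling the three roles through topics means the model can be retrained or replaced without changing the resolvers, which matches the layered design described earlier.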

Training Pipeline

The diagram below shows the steps from importing raw data from Cisco Umbrella and Bambenek Consulting to training our model files.


RequeX training pipeline


Python scripts automate the training pipeline, from data collection to model evaluation.
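As a rough illustration of the collect, clean, and merge steps, the sketch below builds one labeled dataset from two text feeds. The function names are hypothetical, and the input layouts are assumptions: `rank,domain` rows for the popularity list and `domain,description,...` rows with `#` comment lines for the DGA feed. Letting the malicious label win when a domain appears in both sources is also an assumed design choice.

```python
import csv
import io
import random


def load_benign(text):
    """Parse 'rank,domain' rows and label each domain benign (0)."""
    return [(row[1].strip().lower(), 0, "benign")
            for row in csv.reader(io.StringIO(text)) if len(row) >= 2]


def load_dga_feed(text):
    """Parse 'domain,description,...' rows, skipping '#' comment lines."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = next(csv.reader([line]))
        description = fields[1] if len(fields) > 1 else "unknown"
        rows.append((fields[0].strip().lower(), 1, description))
    return rows


def build_dataset(benign_text, dga_text, seed=0):
    """Merge both sources, drop duplicate domains, shuffle reproducibly."""
    merged = {}
    # DGA rows are added last, so a malicious label overrides a benign one.
    for domain, label, source in load_benign(benign_text) + load_dga_feed(dga_text):
        merged[domain] = (domain, label, source)
    rows = list(merged.values())
    random.Random(seed).shuffle(rows)
    return rows
```

Keeping the feed description alongside the binary label preserves the information a multiclass model would need later.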

Inference Pipeline

The inference pipeline diagram provides a detailed look at how a domain name is predicted and classified in the Reque DNS system.


RequeX inference pipeline


Serving Applications

The inference engine, which runs on Flask, provides REST APIs to the models; these APIs serve the web interface used in the demo. It is packaged in a Docker container and can be launched in any cloud environment.

RequeX training pipeline


Exploratory Data Analysis


Datasets

Our data came from two main sources: the free Cisco Umbrella list of the top million domain names by popularity, which we assumed to be benign, and Bambenek Consulting's feeds of recent DGA domains. For the Bambenek list, we used both the default (or mixed) confidence feed, and the high confidence feed. We analyzed a total of 1.36M domain names, 1 million benign (73.5%) and 360k malicious (26.5%).

Analysis

We performed some EDA on the datasets to see if there were obvious differences in the names of the DGA domains and the benign domains. A few differences were apparent. Although we did not ultimately use the EDA for feature engineering, it served as a way to confirm the datasets should be different enough to predict using machine learning.

Character Frequency Analysis

The following chart shows the character frequencies in each set of domain names. The values have been normalized to account for the different sizes of the datasets, so the chart shows the differing patterns of character frequency.

Character analysis



The data sets have distinct differences in frequency counts. There are far more j's, q's, x's and z's in the malicious domain names than in the benign ones. There are also no dashes in the malicious domain names and far more periods in the benign dataset.
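A normalized character frequency of this kind can be computed in a few lines. The sketch below is an assumption about the approach, not the team's analysis code; `char_frequencies` and the sample domains are hypothetical.

```python
from collections import Counter


def char_frequencies(domains):
    """Return each character's share of all characters in the domain list."""
    counts = Counter(ch for domain in domains for ch in domain.lower())
    total = sum(counts.values())
    if total == 0:
        return {}
    # Dividing by the total makes datasets of different sizes comparable.
    return {ch: n / total for ch, n in counts.items()}


benign_freq = char_frequencies(["pages.github.com", "my-shop.example.org"])
dga_freq = char_frequencies(["qxzjvkpwhtgr.biz", "jzqxwvkd.net"])
# Characters such as "q" or "-" can then be compared across the two sets.
q_gap = dga_freq.get("q", 0.0) - benign_freq.get("q", 0.0)
```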

Hierarchy Count Analysis

We calculated the hierarchy count for each domain name. Amazon.com would have a hierarchy count of 2, while pages.github.com would have a count of 3. The following histograms compare the three datasets by hierarchy count.

Hierarchy  analysis

The one feature that jumps out is that benign domain names are more likely to have higher hierarchy counts, while the malicious DGA domains predominantly have hierarchy counts of 2. This matches what we saw above: benign domain names have far more periods in them.
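The hierarchy count as defined here is simply the number of dot-separated labels. A minimal sketch, with hypothetical function names:

```python
from collections import Counter


def hierarchy_count(domain):
    """Number of dot-separated labels, ignoring a trailing root dot."""
    return len(domain.rstrip(".").split("."))


def hierarchy_histogram(domains):
    """Map each hierarchy count to the number of domains that have it."""
    return Counter(hierarchy_count(d) for d in domains)


histogram = hierarchy_histogram(["Amazon.com", "pages.github.com", "example.org"])
```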



Domain Name Length Analysis

We also compared the name lengths of the domains in the datasets. The histograms below show the distribution of domain name lengths for the different datasets.

Character analysis



The benign domain name lengths follow a roughly bell-shaped distribution that peaks around 16 characters. Both the mixed and high confidence datasets have a second peak around 30 characters. Domain names shorter than 10 characters or longer than 33 are also more likely to be benign than malicious.

Top Level Domain Analysis

The top level domains of .com, .net and .org were the top three for each dataset. The charts below show the relative frequency of the 15 most common top level domains. Each chart shows the top 15 ordered by one of the datasets, with the other two datasets shown in grey for comparison.

Top level domain frequencies (three charts, each ordered by one of the datasets)

Of special note is that .ru (for Russia) is the 4th most common top level domain in the Cisco Umbrella set. The .biz domain is also far more common in the malicious datasets. And especially interesting is that the .pw domain is the 6th most common top level domain for the malicious data sets, given that it is the country code top-level domain for Palau, an island nation with a population size of roughly 22,000.
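A ranking like this can be reproduced by counting the last label of each domain. The sketch below is illustrative only; `top_tlds` is a hypothetical helper, and it treats the final label as the top level domain without consulting any suffix list.

```python
from collections import Counter


def top_tlds(domains, n=15):
    """Return the n most common top level domains with relative frequency."""
    tlds = Counter(
        "." + d.rstrip(".").rsplit(".", 1)[-1].lower()
        for d in domains
        if "." in d.rstrip(".")
    )
    total = sum(tlds.values())
    return [(tld, count / total) for tld, count in tlds.most_common(n)]


ranking = top_tlds(["a.com", "b.com", "c.ru", "d.pw"], n=2)
```

Using relative frequencies rather than raw counts keeps the three datasets comparable despite their different sizes.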

Distinct Word Analysis

We also explored Natural Language Processing techniques to segment the domain labels further into smaller chunks of meaningful words, to see if some metrics based on such segmentation are able to show any appreciable difference between the DGA and benign categories.

The strongest signal we found is in the number of distinct words in the domain name, normalized by the domain length. There is at least an order of magnitude of difference between the two categories.

Other metrics, such as raw or normalized scores based on vocabulary contexts and the raw or normalized count of digit characters in the domain name, do not seem to offer any appreciable differentiation between the DGA and benign domains.

Below you will see how DGA and benign domains compared in number of distinct words, normalized by domain length. Both linear and log results are shown.


Normalized Number of Distinct Words
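The segmentation method behind this metric is not described here, so the following is only a sketch of one possible approach: a greedy longest-match against a word list, followed by normalization by the domain length. The tiny `VOCAB` set and both function names are assumptions made for illustration.

```python
# Tiny hypothetical vocabulary; a real analysis would use a large word list.
VOCAB = {"face", "book", "you", "tube", "git", "hub", "pages", "mail", "shop"}


def distinct_words(domain, vocab=VOCAB):
    """Greedily segment the labels left of the TLD into vocabulary words."""
    labels = domain.lower().rstrip(".").split(".")
    text = "".join(labels[:-1])  # drop the top level domain and the dots
    longest = max((len(w) for w in vocab), default=0)
    found, i = set(), 0
    while i < len(text):
        # Try the longest candidate first, down to two characters.
        for size in range(min(longest, len(text) - i), 1, -1):
            if text[i:i + size] in vocab:
                found.add(text[i:i + size])
                i += size
                break
        else:
            i += 1  # no word starts here, move on by one character
    return found


def distinct_word_ratio(domain, vocab=VOCAB):
    """Number of distinct words found, normalized by the domain length."""
    if not domain:
        return 0.0
    return len(distinct_words(domain, vocab)) / len(domain)
```

With this vocabulary, `distinct_words("pages.github.com")` finds `pages`, `git`, and `hub`, while a string with no dictionary words yields an empty set.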


Our Model


Objective

Our main objective was to predict whether a domain name represented a malicious website or a benign one, with an accuracy rate close to 99%. Our second objective was to predict the DGA class of domain names deemed malicious. And our third objective was to minimize our false positive rate so as not to block benign websites.

The Machine Learning Model

We wanted a model that learns the mechanism generating the sequence of characters in a domain name rather than memorizing individual names. A long short-term memory (LSTM) model is well suited to this: it is a recurrent neural network designed to mitigate the vanishing gradient problem common in plain recurrent networks.

When our DNS server gets a domain name request, the process for classifying the domain goes like this:

  1. We convert the domain name characters into integer sequences based on a dictionary of characters, a process known as tokenization.
  2. The tokenized sequence goes through a character-level embedding layer which is then sent to the LSTM layer.
  3. Finally, some dropout is applied.

Building upon this framework, we first have a binary benign/malicious model to allow or block the domain request. A second model, used for DGA categorization, classifies the domain into one of 50+ DGA malware categories.

Tokenization of domain names
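The tokenization step can be sketched in plain Python. The character dictionary, the reserved indices, the fixed length of 63, and the left-padding choice are all assumptions for illustration; the actual dictionary used by the models is not specified here.

```python
import string

# Assumed character dictionary: letters, digits, hyphen, and dot.
# Index 0 is reserved for padding; one extra index covers unknown characters.
ALPHABET = string.ascii_lowercase + string.digits + "-."
CHAR_TO_INDEX = {ch: i + 1 for i, ch in enumerate(ALPHABET)}
UNKNOWN = len(ALPHABET) + 1
MAX_LEN = 63  # assumed fixed input length


def tokenize(domain, max_len=MAX_LEN):
    """Convert a domain name into a fixed-length list of integer tokens."""
    tokens = [CHAR_TO_INDEX.get(ch, UNKNOWN) for ch in domain.lower()[:max_len]]
    # Left-pad with zeros so every sequence has the same length.
    return [0] * (max_len - len(tokens)) + tokens


sequence = tokenize("pages.github.com")
```

Fixed-length integer sequences like this are the form an embedding layer expects as input.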

We created binary and multi-class LSTM models. After evaluating these LSTM models, we compared their performance against a baseline logistic regression model, which performed significantly worse for classifying domain names.

Binary and Multiclass models diagram

Model Metrics

In cybersecurity, it is important to have high accuracy in order to prevent incorrect blockages. Our binary model (that either allows or blocks an incoming domain request) achieves approximately 98.9% accuracy on our test data.

This accuracy is even maintained over time. With a model we trained using data from October, we were able to achieve 97.6% accuracy on data from four weeks later. The one area the October model struggled with was new malware categories. This will require us to keep our models up-to-date with the latest data.

Another important metric is the false positive rate: how often benign domains are classified as malicious, and hence blocked. A high false positive rate would cause legitimate sites to be blocked. This would lead to a frustrating user experience and severely limit our product’s uptake.

Our binary model achieves a 0.7% false positive rate. In other words, for every 1000 requests processed, 7 benign domains are classified as malicious. While this is acceptable at this early stage, we want to get the false positive rate even lower. Some research suggests we can get it close to 1 in 10,000 with sufficient training data.

The resulting metrics for the 3 models we evaluated are listed in the table and chart below.

Metric            LSTM Binary     LSTM Multiclass   Logistic Regression
Accuracy          0.99            0.96              0.869
Precision         0.975           0.958             0.865
Recall            0.988           0.96              0.869
F1 Score          0.982           0.958             0.865
False Positives   7 out of 1000   6 out of 1000     45 out of 1000
False Negatives   3 out of 1000   5 out of 1000     86 out of 1000

As shown above, our binary model performs best, followed by our multiclass model (which must choose among many classification categories), while the baseline logistic regression model lags substantially behind. Not only is its accuracy much lower, it also produces 6 to 7 times as many false positives.


Model metrics chart
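As a sanity check on how these metrics relate, the sketch below recomputes them for the binary model from per-1000 counts. The counts are an assumption: they take the false positive and false negative figures from the table and suppose the test set has the same 26.5% malicious share as the full dataset, which is not stated here.

```python
def metrics(tp, fp, fn, tn):
    """Standard classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        # Share of benign requests that were wrongly blocked.
        "benign_blocked_share": fp / (fp + tn),
    }


# Assumed counts per 1000 requests: 265 malicious (3 missed) and
# 735 benign (7 wrongly blocked).
binary = metrics(tp=262, fp=7, fn=3, tn=728)
```

Under these assumptions the accuracy comes out at 0.99 and the precision, recall, and F1 land within rounding distance of the table. The same counts also show that 7 false positives per 1000 total requests is a somewhat higher share when measured against benign requests only (7 out of 735).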

Roadmap


MVP Roadmap

Select data sources

Build binary and multiclass models

Build solution architecture

Automate:

  • data collection
  • model training
  • model evaluation

Build demonstration UI (requex.net)

⭕ Integrate solution into DNS infrastructure

⭕ Deploy to first customer and evaluate

Post-MVP Roadmap

⭕ Build customer reporting and customization portal

⭕ Diversify our DGA threat intelligence data

⭕ Build new binary and multiclass models

⭕ Scale our infrastructure



Our team has made significant progress towards our MVP. You can see our progress above.

Our next phase is to focus on integrating our models into our DNS infrastructure. From there, we need to find a customer willing to try our solution.

Beyond our MVP, we have plans to build a customer portal for reporting and customization. We also want to diversify our training data and build more sophisticated models. We anticipate malware will soon use more complex domain generating algorithms and we want to be ready for them.

Ultimately, we will need to scale our infrastructure to take on more customers.

Next Steps

We are looking for partners who can:

  • Provide more labeled known-good or known-malicious data feeds
  • Lend a hand with DNS expertise
  • Connect us with potential first customers
  • Expand the Reque DNS team
  • Fund us!


Our Team

  • Jason Hunsberger, Digital Product Manager, Boston Area Non-profit
  • Aniruddh Nautiyal, Staff Test Engineer, Cypress Semiconductor
  • Surya Nimmagadda, Distinguished Engineer, Juniper Networks
  • Elizabeth Shulok, Principal Engineer, Autodesk

Thank you


We would like to extend our deepest thanks to the following for their insight, support, data and time.

Paul Vixie

CEO, Farsight Security


John Bambenek

Bambenek Consulting


Joyce Shen

UC Berkeley and Tenfore Holdings


David Steier

UC Berkeley and Carnegie Mellon

Cisco Umbrella

Daily top 1 million domains


Endgame, Inc

Research on using LSTMs for Domain Generating Algorithms