How do you design ML models for malicious network detection?

Machine learning (ML) found its place in cybersecurity long ago, and it has given security teams much-needed insight into malware networks and effective ways to curb cyber attacks. Most ML-based solutions, however, are proprietary or designed for specific feature representations.

In 2017, Equifax, one of the most prominent credit reporting agencies (CRAs) in the United States, suffered a major attack that led to a data breach famous for all the wrong reasons. Personal and sensitive data belonging to roughly 148 million people was exposed. Such breaches and data risks remain prevalent regardless of the endpoint protection and other monitoring techniques deployed by enterprises worldwide.

Malicious attacks are among the most dreaded cyber attacks. At the same time, enterprises are collecting large pools of data through their own resources, and these data sets are quite useful for training machine learning models to detect malicious attacks and entities in the system.

ML techniques and models applied to network data include systems for detecting malicious domains, methods for detecting malware delivery or command-and-control communication, techniques for detecting malicious web pages, and various industrial products for enterprise threat detection.

Malware Detection Cycle:

ML can be most useful in shortening the malware detection lifecycle. But there are certain limitations to using ML algorithms for that purpose:

ML is most effective when trained through supervised learning, but the data affected by malware attacks is usually unlabeled or unknown.
False positives and ML errors need expert analysis, leading to high cost.
Network traffic is highly diverse and dynamic even under normal operating conditions.
The lack of standard benchmark data sets prevents standardized evaluation.

To overcome these limitations, certain models can be designed, and to do so we first need to know how the malware detection cycle works.

Image Source: researchgate

Malware detection begins once a certain number of machines are infected and a sample of the malware reaches one of the antivirus vendors. A detailed analysis of the sample yields an identification signature for the new malware, which is added to the vendor's antivirus database.

The signature reaches clients with the next software update. Once the update is applied, infections start to decline until the malware finally becomes obsolete. After this, malware creators start building better malware that goes undetected by the updated systems and software.

ML for Malicious Network Traffic:

Network intrusion detection is a broad area of research. Conventional tools such as Snort rely on manually generated rules for detecting well-known malware variants. With ML, augmenting such rule-based systems becomes much easier. This has led enterprises and businesses to look at securing the networks that exchange data between their servers and mobile applications developed by mobile app development companies. The power of machine learning can be leveraged to detect malware that evades the rule-based systems.

ML models can be used in three types of applications for malware detection:

Domain reputation systems using passive DNS (Domain Name System) data, such as Notos and EXPOSURE.
Command-and-control detection using NetFlow data, such as DISCLOSURE and BotFinder.
Malicious communication detection using web-proxy logs, such as ExecScent, BAYWATCH, and MADE.

We can leverage an open-source network monitoring agent to collect a number of network logs. For each Transmission Control Protocol (TCP) connection, the agent can record fields such as the timestamp, duration, source IP and port, destination IP and port, number of packets sent and received, number of bytes sent and received, and connection state. For User Datagram Protocol (UDP), an entry is recorded for every UDP packet. This method rests on a single assumption: that attackers have no control over the network logs.
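
To make the log fields listed above concrete, here is a minimal Python sketch that parses one line of a simplified, tab-separated connection log into a typed record. The column order, the field names, and the `parse_line` helper are all assumptions made for illustration; a real monitoring agent defines its own log layout.

```python
from dataclasses import dataclass

# Hypothetical column order for a simplified, tab-separated connection log.
# A real agent's log format will differ; adjust the list to match it.
COLUMNS = [
    "ts", "duration", "src_ip", "src_port", "dst_ip", "dst_port",
    "pkts_sent", "pkts_recv", "bytes_sent", "bytes_recv", "conn_state",
]


@dataclass
class ConnRecord:
    ts: float          # connection start timestamp (seconds)
    duration: float    # connection duration (seconds)
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    pkts_sent: int
    pkts_recv: int
    bytes_sent: int
    bytes_recv: int
    conn_state: str    # connection state code, kept as a raw string


def parse_line(line: str) -> ConnRecord:
    """Parse one tab-separated log line into a ConnRecord."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(parts)}")
    raw = dict(zip(COLUMNS, parts))
    return ConnRecord(
        ts=float(raw["ts"]),
        duration=float(raw["duration"]),
        src_ip=raw["src_ip"],
        src_port=int(raw["src_port"]),
        dst_ip=raw["dst_ip"],
        dst_port=int(raw["dst_port"]),
        pkts_sent=int(raw["pkts_sent"]),
        pkts_recv=int(raw["pkts_recv"]),
        bytes_sent=int(raw["bytes_sent"]),
        bytes_recv=int(raw["bytes_recv"]),
        conn_state=raw["conn_state"],
    )


# Made-up sample line, for illustration only.
sample = "1500000000.5\t1.25\t10.0.0.5\t51000\t93.184.216.34\t443\t10\t12\t800\t5200\tSF"
record = parse_line(sample)
```

Typed records like this make the later feature-extraction step simpler, because numeric fields can be summed or averaged without repeated string conversion.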

The ML Architecture for Malicious Network Detection:

As noted above, reliance on supervised learning can be a limitation in malicious network detection. For this reason, we explore an architecture in which supervised learning is optimized for malicious network detection and classification. There are certain factors to consider:

Feature representations should consider the specifics of each malware attack. For better results, connection-level features are compared with aggregated traffic statistics and temporal features.
Class imbalance should be countered, as it hinders the performance of simple linear models such as logistic regression.
Use a mix of different models, including ones such as gradient boosting that are capable of handling class imbalance. Because of this mix of capabilities, such models have a better chance than linear models of detecting malicious connections and separating them from benign ones.
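
One common way to counter the class imbalance mentioned above is to weight each class inversely to its frequency during training. The sketch below is an illustration, not the method used in the original study: it computes such weights in plain Python using the widely used "balanced" heuristic, n_samples / (n_classes * class_count). The function name and the sample label counts are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return a weight per class, inversely proportional to its frequency.

    Uses the common heuristic: n_samples / (n_classes * count_of_class).
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Made-up data set: 990 benign (0) and 10 malicious (1) connections.
labels = [0] * 990 + [1] * 10
weights = balanced_class_weights(labels)
# The rare malicious class receives a much larger weight than the benign class.
```

Weights like these can be passed to most learners as per-sample or per-class weights, so that errors on the rare malicious class cost more than errors on benign traffic.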

Image Source: arxiv

As the figure above shows, the machine learning architecture uses an open-source network agent called "Bro" (since renamed Zeek) to collect all the network logs. These logs pass through feature extraction and then into a set of different ML algorithms for training and testing. Features extracted from the data based on the specifics of attacks are used to train ML algorithms that can detect malicious connections.
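
The feature-extraction stage can be sketched as follows. This is a minimal illustration of aggregated traffic statistics over time windows, under several assumptions: connections are plain dictionaries with the hypothetical keys `ts`, `src_ip`, `dst_ip`, `bytes_sent`, and `duration`; the window is 60 seconds; and the four statistics shown are examples rather than the feature set of the original study.

```python
from collections import defaultdict


def aggregate_features(connections, window_seconds=60):
    """Group connections by (source IP, time window) and compute statistics.

    Returns a dict mapping (src_ip, window_index) to a feature dict.
    """
    buckets = defaultdict(list)
    for conn in connections:
        window = int(conn["ts"] // window_seconds)
        buckets[(conn["src_ip"], window)].append(conn)

    features = {}
    for key, conns in buckets.items():
        n = len(conns)
        features[key] = {
            "n_conns": n,
            "n_distinct_dsts": len({c["dst_ip"] for c in conns}),
            "total_bytes_sent": sum(c["bytes_sent"] for c in conns),
            "mean_duration": sum(c["duration"] for c in conns) / n,
        }
    return features


# Made-up connections for illustration.
conns = [
    {"ts": 5.0, "src_ip": "10.0.0.5", "dst_ip": "1.1.1.1", "bytes_sent": 100, "duration": 1.0},
    {"ts": 30.0, "src_ip": "10.0.0.5", "dst_ip": "2.2.2.2", "bytes_sent": 300, "duration": 3.0},
    {"ts": 65.0, "src_ip": "10.0.0.5", "dst_ip": "1.1.1.1", "bytes_sent": 50, "duration": 2.0},
]
feats = aggregate_features(conns)
```

Each resulting feature dictionary is one row of training data; the window length controls how much temporal context a single row captures.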

ML Classification of the Malware Network Detection for Learning:

Ground Truth Labelling:

According to the results of applying this architecture, attacks do not remain active during the full span of network monitoring, and how well this is captured depends on the granularity of the data labeling.

Coarse-Grained Labeling:

In this labeling scheme, all connection logs generated by a botnet IP are labeled as malicious.

Fine-Grained Labeling:

Consider an instance of a DDoS attack, called the Rbot attack, from which the IP address of the victimized machine is obtained. This IP address is then followed to detect further attacks. Every feature representation is labeled as malicious for a specific time window based on the attacks present in the connection log for that window.
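
The difference between the two labeling schemes can be sketched in a few lines of Python. This is an illustrative interpretation, not the exact procedure of the original study: connections are dictionaries with the hypothetical keys `ts`, `src_ip`, and `dst_ip`, the window is assumed to be 60 seconds, and each connection simply inherits the label of its (source IP, time window) pair.

```python
def coarse_labels(connections, botnet_ips):
    """Coarse-grained: every connection from a botnet IP is malicious."""
    return [int(c["src_ip"] in botnet_ips) for c in connections]


def fine_labels(connections, botnet_ips, victim_ip, window_seconds=60):
    """Fine-grained: a botnet IP's traffic is malicious only in time
    windows where that IP actually contacted the known victim."""
    attack_windows = {
        (c["src_ip"], int(c["ts"] // window_seconds))
        for c in connections
        if c["src_ip"] in botnet_ips and c["dst_ip"] == victim_ip
    }
    return [
        int((c["src_ip"], int(c["ts"] // window_seconds)) in attack_windows)
        for c in connections
    ]


# Made-up traffic: one bot, one victim, one benign host.
BOT, VICTIM = "10.0.0.9", "192.0.2.1"
conns = [
    {"ts": 10, "src_ip": BOT, "dst_ip": VICTIM},          # attack, window 0
    {"ts": 20, "src_ip": BOT, "dst_ip": "198.51.100.7"},  # same window as attack
    {"ts": 130, "src_ip": BOT, "dst_ip": "198.51.100.7"}, # bot idle, window 2
    {"ts": 15, "src_ip": "10.0.0.5", "dst_ip": VICTIM},   # benign host
]
coarse = coarse_labels(conns, {BOT})
fine = fine_labels(conns, {BOT}, VICTIM)
```

In this toy data the coarse scheme marks the bot's idle-period connection as malicious, while the fine-grained scheme does not, which is the kind of label noise that fine-grained labeling is meant to remove.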

Though fine-grained labeling is difficult to obtain, it certainly improves the performance of an ML model for botnet detection. The most popular ML models used for classification are gradient boosting, random forest and logistic regression.

These models and algorithms can be evaluated on the basis of metrics such as precision and recall. The imbalance in the data is measured as the ratio of malicious to legitimate samples. When this ratio is quite low, accuracy remains high even for a model that misses most attacks, which is why precision and recall are the more informative metrics.
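
A small worked example shows why accuracy alone can mislead on imbalanced data. The sketch below uses made-up numbers (990 benign and 10 malicious samples) and a deliberately useless classifier that labels everything as benign; the function name is hypothetical.

```python
def evaluate(y_true, y_pred):
    """Compute accuracy, precision, and recall for binary labels (1 = malicious)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    # Define precision/recall as 0.0 when their denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# 990 benign samples, 10 malicious samples.
y_true = [0] * 990 + [1] * 10
# A classifier that never raises an alert.
y_pred = [0] * 1000

scores = evaluate(y_true, y_pred)
# accuracy is 0.99, yet precision and recall are both 0.0
```

The classifier detects nothing, yet its accuracy is 99 percent, because the benign class dominates the count. Precision and recall expose the failure immediately.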

These models should be used in tandem, or combined into an ensemble, to create an architecture that can cope effectively with the imbalance of data, interpret the data sets, and provide supervised learning based on the feature representations obtained from network log collection and analysis.
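
One simple way to combine several models is soft voting, where the malicious-class probabilities from each model are averaged before a threshold is applied. The sketch below is only an illustration of that idea; the original text does not specify how the models are combined, and the probabilities, the threshold of 0.5, and the function name are all assumptions.

```python
def soft_vote(model_probs, threshold=0.5):
    """Average per-model malicious probabilities and threshold the result.

    model_probs: list of lists, one inner list per model, each holding the
    predicted probability of "malicious" for the same ordered samples.
    Returns (labels, averaged_probabilities).
    """
    if not model_probs:
        raise ValueError("at least one model is required")
    n_samples = len(model_probs[0])
    if any(len(p) != n_samples for p in model_probs):
        raise ValueError("all models must score the same number of samples")

    n_models = len(model_probs)
    averaged = [sum(ps) / n_models for ps in zip(*model_probs)]
    labels = [int(p >= threshold) for p in averaged]
    return labels, averaged


# Made-up outputs from three models (for example gradient boosting,
# random forest, and logistic regression) on three connections.
probs = [
    [0.9, 0.2, 0.4],
    [0.8, 0.1, 0.7],
    [0.7, 0.3, 0.6],
]
labels, averaged = soft_vote(probs)
```

On imbalanced data the threshold usually deserves tuning against precision and recall on a validation set rather than being left at 0.5.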

Conclusion:

There is no doubt that data imbalance hinders the application of ML models to malware classification in cybersecurity. For this reason, research and experimentation with different ML models, based on supervised learning through feature representations over network logs, should be encouraged.

Malware is considered to be the most dangerous of all the cyber threats present on networks today, and data-driven businesses should recognize this in order to avoid data breaches and data risks and to put safeguards in place using advanced technologies like artificial intelligence and machine learning. New data regulations have been introduced to protect against data theft but, as they say, desperate times need desperate measures!