Malware Detection and Classification in IoT Network using ANN

The Internet of Things (IoT) is an emerging technology in the modern world, and its network is expanding constantly. Meanwhile, IoT devices are a soft target and vulnerable to attackers. The battle between malware authors and security analysts is persistent, because malware is evolving constantly and thus puts pressure on researchers and security analysts to keep up with modern threats by improving their defense systems. The complexity and diversity of current malicious software present immense challenges for protecting IoT networks from malware attacks. In this paper, we explore the potential of neural networks for the detection and classification of malware using an IoT network dataset comprising a total of 4,61,043 records, of which 3,00,000 are benign and 1,61,043 are malicious. With the proposed methodology, malware is detected with an accuracy of 94.17% and classified with an accuracy of 97.08%.


INTRODUCTION
In modern times, the Internet of Things (IoT) is an interrelated network of devices, such as home appliances, traffic lights, and lamp posts, that are connected to the Internet and in which data is automatically collected from the environment by sensors and transferred over the Internet without human help or intervention.
IoT devices have a range of sensors that generate useful data without human-to-human or human-to-machine interaction [1]. The Internet of Things is known as the third industrial transition. It is defined as the "interconnection, through the Internet, from computer equipment embedded in everyday objects, allowing it to send and receive data" [2]. The market for IoT is growing at a spectacular rate, from 2 billion objects in 2006 to a forecast 200 billion by 2020. IoT sensors or appliances also gather and
IoT devices are a soft target for hackers or unauthorized users, as they are simpler to infect than regular PCs for the following reasons [8]:
• Numerous IoT devices are connected to the Internet without any security updates.
• Security is given low priority in the development of IoT devices.
• Implementing cryptographic techniques on IoT devices is computationally costly because of memory and power limitations.
• Login credentials on IoT devices, whether set by the user or by the manufacturer, are often weak.
• Vendors of IoT devices sometimes leave backdoors in place to provide remote support for the device.
• IoT devices are often connected to the Internet without passing through a firewall.
IoT software manufacturers don't routinely upgrade their apps unless the user initiates firmware updates. Due to resource constraints [9], these systems cannot run full-fledged protection protocols, so IoT devices are vulnerable to attack for longer periods (e.g. their default login keys, unpatched bugs) [10].
IoT devices operate more in an unattended environment, so there is a reasonable risk that an attacker can gain physical access to them intentionally. As a result, attackers can obtain valuable information via the communication channel by listening to the conversation secretly, since most IoT devices use wireless links. These devices do not integrate strong security features because there are restricted computing and power resources [11]. The implementation of strong security mechanisms is not only difficult due to the limited available resources but also due to non-trustful contact with the environment. Given the likelihood of compromised IoT devices in an IoT network, a comprehensive protection approach must be established based on time-to-time patching of vulnerabilities [12].
In recent years, researchers have proposed numerous methods for malware detection and classification using machine learning algorithms. These works mainly focus on malware detection on Android devices and on Windows or other operating systems; only limited work addresses malware identification in IoT networks, which has become a substantial security threat in recent times. Based on the above discussion, there is a need for an efficient technique that produces the best possible results for malware detection and classification in a shorter time.
This paper takes up the challenge of detecting and classifying malware using network traffic analysis. The main contributions of the paper are summarized as follows:
• A first ANN is proposed to detect malware by analyzing packets of the network traffic generated by the IoT network.
• A second ANN is proposed that classifies malware families based on network traffic behavior.
• The proposed methodology is compared with traditional ML algorithms, i.e., k-NN and Naïve Bayes.
• Analysis of the results shows that the proposed methodology is efficient, detecting malware with an accuracy of 94.17% and classifying malware with an accuracy of 97.08%.
This paper is organized as follows. Section 2 presents an overview of past literature. Section 3 presents related background information. Dataset description and creation are explained in Section 4, while experimental results are evaluated in Section 5. Finally, the conclusion and future work, along with a comparison, are given in Sections 6 and 7, respectively.

LITERATURE SURVEY
There are several works in the literature related to malware analysis, detection, and classification. Intrusion or malware detection is a trending area of research. However, given the resource-constrained nature of most IoT devices and their customized operating systems, traditional malware detection and prevention solutions are unlikely to fit the real world. Malware can exploit vulnerabilities in compromised IoT systems, or it can impose specific limitations on some IoT apps. Therefore, a key security requirement of the IoT network that needs to be addressed is defending against malware.
Liu et al. [8] presented a multi-layer learning framework for classification of malware by converting samples to greyscale images. Machine learning algorithms i.e., the k-Nearest Neighbour (k-NN) and Random Forest (RF) are applied on malware datasets, compared with existing work, and accuracy is improved.
Kumar and Lim [6] presented a solution for detecting malware in large-scale networks rather than on individual hosts. An ML technique is used to analyze traffic patterns for detecting malware activity in IoT devices, store those traffic patterns in a database, and apply the necessary countermeasures against the malicious activity of IoT bots, i.e., blocking the traffic generated by botnets and reporting to network administrators. The target is to identify IoT bots before the actual attack, i.e., in the scanning phase. Past work addressed PC-based bots rather than IoT bots.
Nguyen et al. [16] proposed an approach for detecting IoT botnets using Printable String Information (PSI) graphs. The dataset consists of 11,200 ELF files: 7,199 malware samples and 4,001 benign samples. Function call graphs were created from these samples, and PSI graphs were then created from the functions that were close to IoT botnets. A CNN classifies malware and benign samples for IoT devices using feature vector data from the PSI graphs that indicates the rate and direction of change in features. An accuracy of 98.7% was achieved with the PSI graph-based approach. However, this work has some limitations, as it relies on the analysis of control flow graphs, which is complex, effort-intensive, and time-consuming.
Vinayakumar et al. [17] proposed a framework, ScaleMalNet, that can handle Big Data of malicious samples. The paper also contributes novel image processing techniques for the classification of malware. Different deep and machine
Yin et al. [18] proposed a mechanism for the dynamic analysis of malware using a deep neural network that comprises three modules: the first monitors and analyzes the dynamic behavior of malware, the second processes the log files generated by the first module, and the third consists of a deep neural network, mainly a CNN, used to detect and classify malware. The dataset comprises 10,000 malware samples from 5 families, each with 2,000 samples. An accuracy of 97.3% is achieved. In this work, the number of data samples is small and an inadequate set of malware families is used.
Aman et al. [19] propose a novel framework that classifies and identifies malware samples.
Looking at the related work, researchers detect and classify malware on Windows, Android apps, and IoT devices using binary images, control flow graphs, and portable executable files, while little or no work uses network packets. Moreover, the datasets used in the literature are small, with few benign and malware samples and few malware families for classification.
Koroniotis et al. [22] and Hamza et al. [23] recently proposed network-based IoT datasets that comprise attack scenarios. However, these datasets neither cover a variety of attack types, such as ransomware and Cross-Site Scripting (XSS), nor do they contain sensor measurement data of IoT devices.
Traditional machine learning-based malware detection and classification rely on feature engineering, feature learning, and feature representation techniques that require extensive domain-level knowledge. In contrast to ML algorithms, the neural network tries to learn features from data in an incremental manner. So, there is a need for a methodology that can efficiently detect and classify malware in IoT networks using network traffic analysis.
These issues have persuaded us to come up with an IoT-related dataset that contains sensor reading data as an information source for a data-driven, IoT-based Intrusion Detection System (IDS) that properly monitors the internal behavior of IoT applications, thereby securing them from malicious activities.

THEORETICAL BACKGROUND
This section presents an overview of malware as a security threat for IoT networks, malware analysis and detection techniques, and machine learning and deep learning approaches.

Malware
Malware is defined as software that fulfills the harmful intent of an attacker. Different researchers define malware differently, for example as code that is added to a system to deliberately cause damage or subvert the actual task of the system. Malware is of various kinds, such as Trojan horses, viruses, and worms, as shown in Fig. 2. A Trojan horse is a kind of malware that is planted in a system or app by its manufacturer. The system performs its intended actions, but it also performs some unauthorized actions. A virus is a program that spreads to other programs by replication. An infected program that causes harm to other programs is called the host of the virus. The host spreads the virus to other system programs. A worm is a program that spreads by replicating and executing its own code. The difference between a worm and a virus is that the latter needs a host to cause damage. The worm spreads and tries to infect the whole network [24].
The destruction caused by malware has increased considerably in recent years. The main reason is the growing popularity of the Internet, together with an increase in the number of vulnerable machines available because of security-negligent users. Another cause is that the sophistication of malicious software has improved over time.
Detection of malicious software is traditionally based on signatures. If a signature known to belong to malware is identified in a program's code, the program can easily be detected as malware [25].
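The signature idea can be illustrated with a minimal sketch. It assumes that a signature is simply a fixed byte pattern and that detection is a substring search; the signature names and byte values below are hypothetical placeholders, not real malware signatures, and production scanners use far richer rule formats.

```python
# Hypothetical signature database: family name -> byte pattern.
# The values are illustrative placeholders only.
SIGNATURES = {
    "demo_family_a": bytes.fromhex("deadbeef"),
    "demo_family_b": b"EVIL_MARKER",
}


def match_signatures(data: bytes, signatures: dict[str, bytes]) -> list[str]:
    """Return the sorted names of all signatures whose pattern occurs in data."""
    return sorted(name for name, pattern in signatures.items() if pattern in data)


benign_sample = b"hello world, nothing to see here"
flagged_sample = b"\x00\x01" + bytes.fromhex("deadbeef") + b"\x02payload"

print(match_signatures(benign_sample, SIGNATURES))   # []
print(match_signatures(flagged_sample, SIGNATURES))  # ['demo_family_a']
```

A single changed byte inside the pattern is enough to defeat this kind of matching, which is one reason signature-based detection struggles with constantly evolving malware.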

Evolution of Malware
Threats from malware are not new, yet malware hunting, or cyber threat hunting, remains a continuing challenge. For instance, with the growing popularity of IoT devices and the lack of security protection for such devices, these devices can be vulnerable to malware attacks [26].
Malware is deliberately designed to harm a PC, server, or any other system, and it has become one of the most significant threats on the Internet. It goes by different names such as virus, Trojan, ransomware, worm, command-and-control bot, etc. [17]. With the help of modern tools, it has become easy to create new malware, resulting in a very rapid increase in the quantity of malware. Moreover, new malicious code behaves similarly to benign code, making it harder to distinguish, which poses a significant challenge to anti-virus vendors [27].
Early malware was not encrypted using complex cipher techniques and was therefore effectively identified and classified by pattern-matching some piece of code. But with the more recent ideas of polymorphism and metamorphism, such as obfuscation, malware classification [14] has become a difficult and tedious task. Polymorphic malware uses an encryption technique that encrypts the code each time it replicates, while the encryption key stays constant, which makes it simpler to identify. In contrast, metamorphic malware not only encrypts the code each time it replicates but also changes its encryption key, which makes it difficult to recognize [28].

Malware Analysis
Malware is analyzed to understand its behavior and contents. Malware analysis is the process of determining the capabilities of malware and answering questions such as how the malware functions, which machines and programs are affected, and which data is being damaged or stolen. Malware can be analyzed either by examining its code or by creating a safe environment for its execution. There are two main strategies for analyzing malware:
• Static
• Dynamic
Static analysis inspects the malware without executing the actual code [25]. The detection patterns used in static analysis include n-grams, string signatures, syntactic library calls, control flow graphs, and opcode frequency distributions. For static analysis, the executable has to be decrypted.
On the other hand, dynamic analysis inspects the behavior of the malware while its code executes in a safe and controlled environment, monitored with tools such as Wireshark, Regshot, and Capture BAT.
Malware analysis begins with basic static analysis and ends with advanced dynamic analysis.
In comparison with static analysis, dynamic analysis is more effective and does not need the executables to be disassembled. It reveals the natural behavior of malware, which is more resilient to static analysis. However, the virtual environment in which malware is executed differs from a real one, so the malware may behave differently, resulting in artificial behavior rather than its actual behavior [29].
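Of the static features mentioned in this section, byte n-grams are the simplest to illustrate. The following is a minimal sketch, assuming the sample is available as raw bytes and that overlapping n-byte windows are counted; the function names and the sample bytes are hypothetical and chosen only for illustration.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int = 2) -> Counter:
    """Count every overlapping n-byte window in data."""
    if n <= 0:
        raise ValueError("n must be positive")
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def top_ngram_features(data: bytes, n: int = 2, k: int = 3) -> list[tuple[bytes, float]]:
    """Return the k most frequent n-grams with their relative frequencies."""
    counts = byte_ngrams(data, n)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(gram, count / total) for gram, count in counts.most_common(k)]


# Six illustrative bytes give five overlapping 2-byte windows.
sample = b"\x90\x90\x90\x90\xeb\xfe"
counts = byte_ngrams(sample, n=2)
print(counts[b"\x90\x90"])  # 3
print(top_ngram_features(sample, n=2, k=1))
```

The relative frequencies can serve as a fixed-length feature vector for a classifier once a common vocabulary of n-grams is chosen across all samples.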

Malware Detection Techniques
Daily usage of the Internet comes with both pros and cons. Cybercrime is growing faster than real-world crime because of cyber-attacks that use modern malware capable of bypassing security measures. In the past, malware was simple and easy to detect, but modern malware is more complicated and difficult to detect. The signature-based approach was used before for malware detection, but it is an old methodology that cannot detect complicated modern malware [30]. New methods have also been proposed for malware detection, yet it is still impossible to detect all new malware. Malware detection involves three stages: the first is to analyze malware, the second is to extract features, and the third is to classify samples as malware or benign. Malware detection can be static as well as dynamic, i.e., malware can be detected when its code is not running as well as when it is running. Different approaches for malware detection are described below [31]:

Machine Learning
Machine learning is the branch of Artificial Intelligence (AI) in which systems learn automatically from previous and new experience without being explicitly programmed and without human interaction. Machine learning approaches can be used to classify data automatically. They are further categorized as supervised learning and unsupervised learning. The difference between the two is that in a supervised learning approach, the class label is present in the data before any learning algorithm is applied [32], whereas in an unsupervised approach, the class label is not present, so the learning algorithm has to analyze the data and assign classes by organizing it into clusters or groups of similar items.
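The distinction can be made concrete with a small sketch. It assumes two-dimensional feature vectors with hypothetical values, uses a 1-nearest-neighbour rule as the supervised example and a two-cluster k-means loop as the unsupervised example; neither is the configuration used in this paper.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbor_predict(train_points, train_labels, query):
    """Supervised: labels are given; return the label of the closest training point."""
    best = min(range(len(train_points)),
               key=lambda i: squared_distance(train_points[i], query))
    return train_labels[best]


def two_means(points, iterations=10):
    """Unsupervised: no labels; group points into two clusters by similarity."""
    centroids = [points[0], points[-1]]  # simple deterministic initialization
    assignment = []
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        assignment = [
            min((0, 1), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in (0, 1):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignment


# Hypothetical, already-scaled traffic features (illustrative values only).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels = ["benign", "benign", "benign", "malicious", "malicious", "malicious"]

print(nearest_neighbor_predict(points, labels, (7.5, 8.3)))  # malicious
print(two_means(points))  # [0, 0, 0, 1, 1, 1]
```

The supervised function needs `labels` to make a prediction, while the clustering function recovers the same two groups from the points alone but cannot name them.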

Artificial Neural Network
An Artificial Neural Network (ANN) [33] is a network of numerous small connected elements known as neurons, also called perceptrons. An ANN works on the principle of the human brain. Each neuron can make decisions, and information is transferred to other connected neurons that are organized in layers. It works as an artificial human nervous system that is used for transmitting, processing, and receiving information. An artificial neural network with one input layer for the input variables, one hidden layer, and one output layer is known as a Shallow Neural Network. An ANN with more than one hidden layer of neurons that process the inputs is known as a Deep Neural Network. In an ANN there are three kinds of layers:
• Input layer (all inputs are provided to the model through this layer)
• Hidden layer (there may be more than one, depending on the problem; used for processing the inputs received from the input layer)
• Output layer (for prediction)
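The flow of information through the three layers can be sketched as a forward pass of a shallow network. The sketch below is only illustrative: the ReLU hidden activation, the sigmoid output, the layer sizes, and all weight values are assumptions made for the example and are not taken from the proposed models.

```python
import math


def sigmoid(z: float) -> float:
    """Squash a real value into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def relu(z: float) -> float:
    """Pass positive values through and clamp negative values to zero."""
    return max(0.0, z)


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(w . x + b) for each neuron."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(x, hidden_w, hidden_b, out_w, out_b):
    """Input layer -> one hidden layer -> single output neuron."""
    hidden = dense(x, hidden_w, hidden_b, relu)
    return dense(hidden, out_w, out_b, sigmoid)[0]


# Illustrative (untrained) parameters: 3 inputs, 2 hidden neurons, 1 output.
x = [0.5, -1.0, 2.0]
hidden_w = [[0.2, -0.4, 0.1], [-0.3, 0.1, 0.5]]
hidden_b = [0.0, 0.1]
out_w = [[1.0, -1.0]]
out_b = [0.0]

score = forward(x, hidden_w, hidden_b, out_w, out_b)
print(score)  # a value in (0, 1), read as the predicted probability of the positive class
```

Adding further `dense` calls between the input and the output turns this shallow network into a deep one; training, which adjusts the weights from labelled data, is omitted here.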

DATASET DESCRIPTION AND CREATION
The dataset used in this paper is the ToN_IoT dataset [34], collected at the University of New South Wales (UNSW), Canberra, and created at their IoT Lab by Dr. Nour Moustafa. The dataset is called ToN_IoT because it consists of telemetry datasets of IIoT and IoT sensors, operating system datasets for both Ubuntu and Windows, and network traffic datasets.
Current security solutions, including threat hunting and intelligence, digital forensics, malware detection, and intrusion detection, are trending research areas in the domain of cybersecurity. With the advancement of AI, particularly deep learning, current security solutions make use of AI models, yet these are not reliable due to the diversity and complexity of recent hacking categories and the unavailability of data sources for training and validating AI models. To fill that gap, a new dataset named ToN_IoT was designed to evaluate the fidelity of current security solutions based on AI models. The testbed consists of three tiers:
• Edge (IoT and network devices)
• Fog (VMs and gateways)
• Cloud (cloud services linked with the fog and edge tiers, including visualization and data analytics)
The dataset is collected in pcap format using Wireshark and then converted to CSV format. It consists of both normal and attack scenarios. The tools used in the testbed are Security Onion, Kali Linux, Wireshark, and Bro (now named Zeek).

Statistics of Dataset
The original ToN_IoT dataset contains more than 22M records. For training and testing purposes, the original dataset is filtered to generate standard features and their labels.
The training and testing dataset consists of a total of 4,61,043 records (as shown in Table 1), with 3,00,000 normal or benign records and 1,61,043 malware records, as visualized in Fig. 3.
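The class balance implied by these counts can be checked with a short calculation. The sketch below uses only the record counts stated above; the helper name `class_shares` is hypothetical.

```python
def class_shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each class's fraction of the total number of records."""
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


# Record counts as stated in the text (written without digit grouping).
counts = {"benign": 300000, "malware": 161043}
shares = class_shares(counts)

print(sum(counts.values()))          # 461043
print(f"{shares['benign']:.2%}")     # 65.07%
print(f"{shares['malware']:.2%}")    # 34.93%
```

Since roughly 65% of the records are benign, a trivial classifier that always predicts the benign class would already reach about 65% accuracy, which is a useful baseline when reading the reported detection accuracy.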

Malware Families
Malware data consists of 9 attack families (as shown in Fig. 4).
The evaluation measures are defined below, where TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively.
• Accuracy: The ratio of correctly predicted observations to total observations.

• Precision: The ratio of correctly predicted positive observations to the total predicted positive observations.

$$\text{Precision} = \frac{TP}{TP + FP}$$
• Recall: The ratio of correctly predicted positive observations to all observations in the actual class.

$$\text{Recall} = \frac{TP}{TP + FN}$$
• F1 Score: The harmonic mean of precision and recall.

$$\text{F1 Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
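As a worked illustration of these measures, the sketch below computes them from the four confusion-matrix counts. The function name and the example counts are hypothetical and are not results from this paper; accuracy is computed as the fraction of correct predictions, with TN denoting true negatives.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict[str, float]:
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts chosen only to exercise the formulas.
metrics = classification_metrics(tp=90, tn=80, fp=10, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, precision is 90/100 and recall is 90/110, so the F1 score equals 180/210; the guards return 0.0 instead of dividing by zero when a class is never predicted or never present.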
Evaluation measures for malware classification along with the classification report (as in Table 4) are as follows: Accuracy 0.9708.
Accuracy graphs of the neural networks for malware detection and classification are shown in Fig. 7 and Fig. 8.
Fig. 7: Accuracy of the model against unknown malware detection
Fig. 8: Accuracy of the model against unknown malware classification
A comparison of the proposed algorithm with k-NN and Naïve Bayes is shown in Table 5, which shows that the ANN outperforms classical ML algorithms.

CONCLUSION & FUTURE WORK
As the diversity and range of IoT devices are rapidly expanding, it is critical to secure such devices in the network against attacks, i.e., malware. We have highlighted security challenges in IoT networks and provided background on malware evolution, analysis, detection techniques, and different approaches. Various network datasets, for example, KDDCUP99, NSL-KDD [35], and UNSW-NB15 [36], were generated for evaluating IDSs; however, they do not include any specific characteristics of IoT applications, as these datasets contain neither sensor reading data nor IoT network traffic.
Most of the recently published datasets [22,23,35,36] are network-based datasets, which primarily contain packet-level and flow-level information or a combination of both, for detecting attacks on the IoT network. However, they do not have the actual data generated from sensor readings.
This paper fills the gap left by the unavailability of a dataset that contains a variety of network attacks as well as real-world network data. In comparison with the literature [13][14][15][17], the proposed methodology is highly capable of discriminating between malware and benign samples with an accuracy of 94.17%, as well as classifying malware families with an accuracy of 97.08%, on the basis of the network traffic generated by the IoT network.
Future research involves the construction of next-generation firewalls that can act as an intermediary between external networks and IoT networks, preventing direct contact between the two. Examining and identifying advanced malware will also be taken into account in the future.