Machine Learning Based Intrusion Detection for IoT Botnet

In this article, we analyzed botnet traffic in an IoT environment using three machine learning classifiers: Logistic Regression, Support Vector Machine and Random Forest. We classified each attack in each botnet for nine devices. We calculated the Accuracy, True Positive, False Positive, False Negative, True Negative, Precision, Recall and F1 score for each algorithm. All three classifiers achieved a classification accuracy above 99%, together with a high attack detection rate. A brief analysis of the results is presented.


I. INTRODUCTION
A low estimate is that by 2025 the global worth of Internet of Things (IoT) devices will be $4 trillion, and a high estimate is $11 trillion [1]. With the development of IoT technologies, more and more devices have entered our lives, making the security of these systems a primary concern. Many of the devices used in everyday life today, for example, smart phones, wearable devices and health monitoring devices, generate vast amounts of private information but have very little security, if any, built in. The internet is complex enough to secure, and these additional insecure IoT devices make the task even more challenging [1]. Botnets are able to infiltrate any internet-connected device, from smart watches and smart kitchen appliances to corporate mainframes. The free availability of the source code of IoT botnets like BASHLITE and Mirai has led cyber attackers to try their hands at IoT malware [1]. Mirai in particular has inspired a renaissance of IoT malware and has been responsible for large-scale DDoS attacks [1]. The Mirai botnet and its variants and imitators were a wake-up call to the industry to better secure IoT devices [2].
Botnets are typically created to infect as many devices as possible, and complex botnets even self-propagate and update their behavior, finding and infecting devices automatically. Hence, botnets are very difficult to detect [3]. Another reason why botnets are difficult to detect and contain is that they lurk on devices without significantly affecting the performance of the device [3]. For example, a security camera may be part of an active botnet without either an average user or a small business being aware of it. Therefore, it is extremely important to identify botnets from the traffic of IoT devices.
In this paper, we use the dataset available in [4] to classify botnet traffic in the IoT environment. This dataset is real network traffic data, gathered from nine commercial IoT devices infected by two botnets, Mirai and BASHLITE. The data is analyzed using three classifiers, Logistic Regression (LR), Support Vector Machines (SVM) and Random Forest (RF), and classified by botnet, by attack, by device.
The rest of the paper is organized as follows: Section II presents the related works; Section III describes the dataset, including the devices used, the attack categories and the features; Section IV briefly presents the three classification algorithms used; Section V describes the experimental setup; Section VI presents the results; Section VII presents the discussion; and Section VIII presents the conclusions and future works.

II. RELATED WORKS
In this section, we group the related work into works on intrusion detection systems and works directly on IoT botnets.

A. Works on Intrusion Detection Systems
Several works have been done on intrusion detection systems. [5] designed fuzzy membership functions to solve dimensionality and anomaly mining, thereby reducing computational complexity and improving the computational accuracy of the classifier. [6] presented a dynamic coding mechanism, implementing a distributed signature-based IDS in IP-USN (IP based ubiquitous sensor networks), and used Bloom filtering for signature matching. [7] designed and developed a virtual test platform to simulate a real network environment, deploying a signature-based Snort IDS for traffic monitoring and attack detection by mirroring the traffic to the server, and developing a stream-based IDS model using machine learning. They also implemented a flow-based anomaly detection model to overcome the limitations of the signature-based IDS. [8] designed a specification-based IDS for detecting a new type of threat, the topology attack. They proposed an IDS architecture using a network monitor backbone, and described its monitoring mechanisms through an RPL finite state machine. [9] developed a deep packet anomaly detection method that can run on resource-constrained IoT devices yet can still distinguish between normal and abnormal payloads.
Ref. [10] presented a DoS detection architecture for 6LoWPAN. This architecture integrated an IDS into the framework developed within the EU FP7 project ebbits. [11] proposed an IDS framework for IoT based on 6LoWPAN, which included a monitoring system and a detection engine. SVELTE [12], primarily targeting routing attacks, used a host-based IDS in a 6LoWPAN environment. The goal of [13] was to detect DoS attacks and attack protocols for 6LoWPAN and CoAP communications and to propose an IDS framework for detecting and preventing attacks in the internet-integrated environment. An intrusion detection model based on node consumption analysis in 6LoWPAN was proposed in [14]. Irregular energy consumption of the routing scheme in the 6LoWPAN grid and the sensor nodes was used to identify malicious attacks. A malicious pattern matching engine for lightweight security systems was proposed in [15]. Two novel techniques, assisted transfer and early decision making, were proposed to reduce performance degradation due to computational power and memory limitations.
Ref. [16] proposed an event-processing IDS architecture using Complex Event Processing (CEP) technology. [17] proposed an architecture that employs a Bayesian event prediction model that uses historical event data generated by the IoT cloud to calculate the probability of future events. Based on the characteristics of the secure cloud service system, [18] proposed a secure high-order clustering algorithm that quickly searches and finds a mixed cloud density peak. The client first uses homomorphic encryption to construct the encrypted object tensor with user data, uploads it to the cloud to fully implement the proposed protocol, returning the clustering results of a random number of perturbations to the client, to eliminate the perturbations.
Kalis [19], an adaptive knowledge-driven expert intrusion detection system, which can monitor various protocols without changing existing IoT software, is a comprehensive method for IoT intrusion detection.
A real-time hybrid intrusion detection framework, including an anomaly-based and specification-based intrusion detection module, is proposed in [20]. The anomaly-based intrusion detection agent, located in the root node, uses the unsupervised optimal path forest algorithm to predict the clustering model by using incoming packets. The specification-based intrusion detection agent in the router node analyzes the behavior of its host node and sends its local result through ordinary data packets to the root node. [21] proposed a new network intrusion detection method for IoT networks based on a conditional variational autoencoder with a specific architecture, which integrates intrusion tags.

B. Works on IoT Botnets Specifically
A few works have also been done on detecting botnets on IoT devices. The authors of [22] proposed a host-based detection system based on one-class classifiers. Host-based detection techniques can be considered less realistic for IoT botnet attacks for various reasons, including the fact that we would have to rely on IoT manufacturers to install host-based anomaly detectors on their products. Also, given that IoT botnet attacks mutate at a very fast rate [2] and are becoming more complex by the day, some of these mutations will succeed in bypassing existing methods of early detection [23].
Ref. [24] used a one-class Support Vector Machine built with features such as CPU and memory usage to detect malicious activities. [25] proposed a deep learning-based botnet traffic analyzer called Botnet Traffic Shark (BoTShark) that uses only network transactions and is independent of deep packet inspection techniques to identify correlations between original features and new features in each layer of the autoencoder or CNN extracted in a cascaded manner. [26] proposed a state-of-the-art T-IDS, built on a novel randomized data partitioned learning model (RDPLM) relying on a compact network feature set and feature selection techniques, simplifying sub-spacing and multiple randomized meta-learning techniques. [27] analyzed the effectiveness of some community detection algorithms in detecting P2P botnets, especially with partial information. They showed that the approach can work with only about half of the nodes, reporting their communication graphs with only a small increase in detection errors. A method to detect compromised IoT devices included in a botnet is proposed in [28]. This method is based on logistic regression, which allows the estimation of the probability that a device initiating a connection is running a bot.
Ref. [29] empirically evaluates a network-based anomaly detection method which extracts behavior snapshots of the network and uses deep autoencoders to detect anomalies in network traffic from compromised IoT devices. [29] also presents a very good summary of work done by others on IoT-related anomalies, botnets and malware attacks.
While many of the previous works were on simulated data, in this paper we used real network traffic data, presented in [4], [29], to classify each attack in each botnet on each device using three classifiers, LR, RF and SVM.

III. DATASET DESCRIPTION
The dataset used in this paper is from UCI's machine learning repository [4]. The data is divided into 10 attacks carried out by 2 botnets, gafgyt and mirai. The 9 IoT devices are: Danmini Doorbell, Ecobee Thermostat, Ennio Doorbell, Philips B120N10 Baby Monitor, Provision PT 737E Security Camera, Provision PT 838 Security Camera, Samsung SNH 1011 N Webcam, SimpleHome XCS7 1002 WHT Security Camera, and SimpleHome XCS7 1003 WHT Security Camera.
Most of these devices were infected by both gafgyt and mirai, as can be seen in Tables I through VII; but the Ennio Doorbell and the Samsung SNH 1011 N Webcam were infected only by gafgyt, and the Philips B120N10 Baby Monitor was infected only by mirai.
Mirai is a kind of malware that can turn a computing system running Linux into a remotely controlled "zombie." This can lead to large-scale network attacks. Mirai mainly infects IoT devices such as web cameras, routers, etc. Devices infected by Mirai continuously scan the Internet for the IP addresses of IoT devices. The default username and password are used to log in to vulnerable devices, and then the Mirai software is injected. The Mirai botnet has five types of attacks: scan, ack, syn, udp, and udpplain. Scan automatically scans for vulnerable devices. Ack causes ACK flooding. Syn causes SYN flooding. UDP causes UDP flooding. UDPplain causes UDP flooding with fewer options, optimized for higher packets per second (PPS) [29].
Gafgyt (also known as BASHLITE) is a malware that infects Linux systems to initiate Distributed Denial of Service (DDoS) attacks. It mainly uses the Metasploit module to exploit known vulnerabilities in the WeMo UPnP protocol. The Gafgyt botnet also has five types of attacks: combo, junk, scan, udp, and tcp. Combo sends spam data and opens a connection to a specified IP address and port. Junk sends spam data. Scan scans the network for vulnerable devices. UDP causes UDP flooding. TCP causes TCP flooding [29].
This dataset has 23 basic features [30], which can be categorized into the following attribute types: stream aggregation, time-frame and statistics extracted from packet streams.
Stream aggregation is composed of: (i) H stats, which summarizes the recent traffic from this packet's host (IP); (ii) MI stats, which summarizes the recent traffic from this packet's host (IP + MAC); (iii) HH stats, which summarizes the recent traffic going from this packet's host (IP) to the packet's destination host; (iv) HH_jit stats, which summarizes the jitter of the traffic going from this packet's host (IP) to the packet's destination host; (v) HpHp stats, which summarizes the recent traffic going from this packet's host+port (IP) to the packet's destination host+port.
The time-frame, that is, the decay factor Lambda used in the damped window, takes the values L5, L3, L1, L0.1 and L0.01. These statistics capture the recent history of the streams.
The statistics extracted from the packet streams are: (i) weight, which includes the weight of the stream (number of items observed in recent history); (ii) mean; (iii) standard deviation; (iv) radius, which is the root squared sum of the two streams' variances; (v) magnitude, which is the root squared sum of the two streams' means; (vi) cov, which is an approximated covariance between two streams; (vii) pcc, which is an approximated correlation coefficient between two streams. These features are extracted from a total of five time windows: 100ms, 500ms, 1.5sec, 10sec, and 1min, thus totaling 115 features. More details of each feature can be seen from [30].
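The feature count stated above (23 basic features over five time windows, 115 in total) can be checked with a short sketch. The assignment of three statistics to the single-stream aggregations (H, MI, HH_jit) and seven to the two-stream aggregations (HH, HpHp) is an assumption chosen to be consistent with the stated total of 23, and the generated names are hypothetical labels that may not match the column names in the dataset files.

```python
from itertools import product

# Statistics per aggregation type. ASSUMPTION: single-stream aggregations
# carry three statistics and two-stream aggregations carry all seven; this
# split is chosen so that the total matches the 23 basic features in the text.
ONE_STREAM = ["weight", "mean", "std"]
TWO_STREAM = ONE_STREAM + ["radius", "magnitude", "cov", "pcc"]

STATS_BY_AGGREGATION = {
    "H": ONE_STREAM,
    "MI": ONE_STREAM,
    "HH": TWO_STREAM,
    "HH_jit": ONE_STREAM,
    "HpHp": TWO_STREAM,
}

# One decay factor per time window, as listed in the text.
DECAY_FACTORS = ["L5", "L3", "L1", "L0.1", "L0.01"]


def feature_names():
    """Build hypothetical feature labels of the form <aggregation>_<decay>_<stat>."""
    names = []
    for (agg, stats), decay in product(STATS_BY_AGGREGATION.items(), DECAY_FACTORS):
        names.extend(f"{agg}_{decay}_{stat}" for stat in stats)
    return names


basic_feature_count = sum(len(stats) for stats in STATS_BY_AGGREGATION.values())
names = feature_names()
print(basic_feature_count, len(names))  # 23 115
```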
The statistics are summarized from all of the traffic as follows [30]: 1) Originating from this packet's source MAC and IP address (denoted SrcMAC-IP). 2) Originating from this packet's source IP (denoted SrcIP). 3) Sent between this packet's source and destination IPs (denoted Channel). 4) Sent between this packet's source and destination TCP/UDP Socket (denoted Socket).

IV. CLASSIFIERS
Three classifiers were used in this study: Logistic Regression (LR), Support Vector Machine (SVM), and Random Forest (RF).
LR is a machine learning classifier used to model the probability of a certain class. Though LR can be extended to classify several classes, in its basic form LR uses a logistic function to model a binary dependent variable. SVM, which is relatively computationally inexpensive, is a supervised learning classifier mainly used for binary classification. In SVMs, we find the best hyperplane that divides the data into two categories, which generally gives a low generalization error. The farther a data point is from the decision boundary, the more confident we are about the prediction. The points closest to the separating hyperplane are known as support vectors.
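As a minimal illustration of the two ideas in this paragraph, the sketch below evaluates a logistic function on a linear score (the LR view) and the signed distance of a point from a linear decision boundary (the SVM view). The weights, bias and sample point are hypothetical values chosen for the example, not parameters fitted in this study.

```python
import math


def logistic(z):
    """Logistic (sigmoid) function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def linear_score(weights, bias, x):
    """Linear score w.x + b shared by both views."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def signed_distance(weights, bias, x):
    """Signed distance of x from the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(w * w for w in weights))
    return linear_score(weights, bias, x) / norm


# Hypothetical parameters and sample point (assumptions for illustration).
weights, bias = [3.0, 4.0], -5.0
x = [3.0, 4.0]

probability = logistic(linear_score(weights, bias, x))  # LR: P(class = 1 | x)
distance = signed_distance(weights, bias, x)            # SVM: larger |distance| means more confidence
print(round(probability, 6), distance)
```

A point lying exactly on the boundary has distance 0 and a logistic output of 0.5, which is the least confident prediction in both views.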
RF refers to a classifier that uses multiple trees to train and predict samples. Random forests establish a forest in a random way. After getting the forest, when a new sample is entered, each decision tree in the forest makes a separate judgment to see which class the sample should belong to (for the classification algorithm). The sample is predicted to be of the class to which it was classified the most times.
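The voting step described above can be sketched as follows. This is a simplified illustration of hard majority voting, with hypothetical per-tree votes; library implementations may combine trees differently (scikit-learn's RandomForestClassifier, for instance, averages the trees' class probability estimates).

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class label predicted by the largest number of trees.

    Ties are broken in favor of the label that appears first in the input.
    """
    counts = Counter(tree_predictions)
    label, _ = counts.most_common(1)[0]
    return label


# Hypothetical votes from five decision trees for a single sample.
votes = ["attack", "benign", "attack", "attack", "benign"]
prediction = majority_vote(votes)
print(prediction)  # attack
```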

V. EXPERIMENTAL SETUP
Since we are classifying each attack in each botnet for each device, the data was grouped by device, by botnet and then by attack. Our initial results using the three classifiers, LR, SVM, and RF, did not give good performance, mainly due to the highly imbalanced nature of the data. To address this issue, we used an almost equal number of benign (normal) and malicious data instances. The benign data, making up almost 50% of each resulting set, was randomly selected from the set of benign data and added to the malicious dataset before running the algorithms.
The data was then pre-processed using z-score normalization. Each of the classifiers (LR, SVM, and RF) was then used as a binary classifier on the normalized data, and training and prediction were performed. 80% of the data was used for training and 20% for testing. Scikit-learn was used to run the classifiers.
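One way to build such a balanced set is sketched below. The function name, the random seed, the labels (1 for malicious, 0 for benign) and the synthetic arrays are assumptions for illustration; the exact sampling procedure used in the study is not specified beyond random selection of benign data.

```python
import numpy as np


def balance_with_benign(malicious, benign, seed=0):
    """Append a random sample of benign rows, roughly equal in number to the malicious rows.

    ASSUMPTION: if fewer benign rows than malicious rows are available,
    all benign rows are used and the result stays somewhat imbalanced.
    """
    rng = np.random.default_rng(seed)
    n_benign = min(len(malicious), len(benign))
    chosen = rng.choice(len(benign), size=n_benign, replace=False)
    X = np.vstack([malicious, benign[chosen]])
    y = np.concatenate([
        np.ones(len(malicious), dtype=int),   # 1 = malicious
        np.zeros(n_benign, dtype=int),        # 0 = benign
    ])
    return X, y


# Synthetic stand-ins for one device/botnet/attack group (not real traffic).
data_rng = np.random.default_rng(42)
malicious_rows = data_rng.normal(loc=1.0, size=(300, 115))
benign_rows = data_rng.normal(loc=0.0, size=(5000, 115))

X_balanced, y_balanced = balance_with_benign(malicious_rows, benign_rows)
print(X_balanced.shape, y_balanced.mean())
```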
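The setup described above can be sketched with scikit-learn as follows. This is an illustrative sketch under stated assumptions, not the exact configuration used in the study: the data is synthetic (generated with make_classification), the hyperparameters are mostly library defaults, the split is stratified, and the z-score scaler is fitted on the training split only, which is one common choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary data with 115 features, standing in for one balanced
# device/botnet/attack group (assumption: the real data is not loaded here).
X, y = make_classification(
    n_samples=1000, n_features=115, n_informative=20, random_state=0
)

# 80% of the data for training and 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hyperparameters below are assumptions, not the values used in the paper.
classifiers = {
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}

accuracy = {}
for name, clf in classifiers.items():
    # StandardScaler performs z-score normalization: (x - mean) / std.
    model = make_pipeline(StandardScaler(), clf)
    model.fit(X_train, y_train)
    accuracy[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: accuracy = {accuracy[name]:.4f}")
```

Wrapping the scaler and classifier in one pipeline keeps the test split out of the normalization statistics; normalizing the full dataset before splitting is the other possible reading of the description.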

VI. RESULTS
Eight metrics were used to evaluate and analyze the results: True Positive (TP), where the instance is actually positive and the prediction is positive; False Positive (FP), where the instance is actually negative and the prediction is positive; True Negative (TN), where the instance is actually negative and the prediction is negative; False Negative (FN), where the instance is actually positive and the prediction is negative; and Accuracy, Precision, Recall and F1-score.
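The eight metrics named above can be computed from true and predicted labels as in the following sketch, which uses the standard definitions of accuracy, precision, recall and F1-score. The example labels are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN and FN, treating `positive` as the attack class."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def derived_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall (ADR) and F1-score from the four counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: 1 = attack, 0 = benign.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
scores = derived_metrics(*counts)
print(counts)  # (3, 1, 3, 1)
print(scores)  # every metric equals 0.75 for this example
```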
Accuracy: This is the ratio of the data classified correctly by the model (TP+TN) to the total data, given by:

Accuracy=(TP+TN)/(TP+TN+FP+FN)    (1)

Recall, also referred to as sensitivity or Attack Detection Rate (ADR): This is the effectiveness of the model in identifying an attack, that is, out of all positive cases (TP+FN) in the dataset, the proportion of positive cases (TP) correctly judged by the model, given by:

Recall=TP/(TP+FN)    (2)

Precision: This is the percentage of classified attack instances that are truly attacks, that is, out of all cases judged positive by the model (TP+FP), the proportion of real positive cases (TP), given by:

Precision=TP/(TP+FP)    (3)

F1-score: This is the harmonic mean of precision and recall, given by:

F1-score=2×Precision×Recall/(Precision+Recall)    (4)

The higher the F1-score, the more robust the classification model [24].
Tables I-VII compare the accuracy and other statistical metrics of the three classification models, LR, SVM and RF, for 7 of the devices for the different attack types. Fig. 1 and Fig. 2 present the classification accuracy of the other two devices, SimpleHome XCS7 1002 WHT Security Camera and SimpleHome XCS7 1003 WHT Security Camera. The classification accuracy is compared by classifier, LR, SVM and RF, and by attack.
International Journal of Machine Learning and Computing, Vol. 11, No. 6, November 2021

VII. DISCUSSION
From the statistical results, we observe that the best performance is given by the RF classifier, followed by LR. However, for the Provision_PT_737E Security Camera and the Provision_PT_838 Security Camera, SVM performs better than LR for the UDP attack. Though RF and LR perform better than SVM overall, the SVM results are only very slightly lower than those of RF and LR. In terms of attacks, the udp attack of the gafgyt botnet had a slightly lower classification rate than most other attacks. It would be difficult to say which attack had the best classification rate overall, since most of the classification results were very good. A couple of reasons for the good classification results might be: (i) the flow is expressed very finely and pre-processed using z-score normalization; and (ii) all features were collected in five time windows, and this data was fairly consistent across all time windows. As future work, it would be good to see whether all five time windows are necessary and which features are really important for this classification.
From these results we can also note a very high attack detection rate, well over 99% in most cases and even 100% in many cases, mostly using the RF algorithm. The Danmini Doorbell and Provision_PT_838 Security Camera had 100% ADR using the other algorithms too, mostly in the Mirai botnet. All three algorithms also had very high precision and F1 scores (one or very close to one) for almost all of the attacks.
We present the graphical results of classification accuracy for the SimpleHome XCS7 1002 WHT Security Camera and the SimpleHome XCS7 1003 WHT Security Camera. From these two figures too, we can observe that, on average, RF performed the most consistently, LR performed second best and SVM performed the least consistently, though the classification accuracy of all three algorithms was very high.

VIII. CONCLUSIONS AND FUTURE WORKS
Though the results, by botnet, for each attack on each device, for all three classifiers, show very high ADRs and classification accuracy (over 99%) with regard to determining whether an IoT device is attacked by a particular botnet, we can say that, on average, the RF algorithm performed the best and SVM performed the lowest of the three algorithms. The high F1 scores show the robustness of the three algorithms used.
This being an initial study, we used all the features in the dataset. As a follow-up study, it would be good to do feature selection and see which of the features perform the best for each attack for each device. A detailed study of the features would also be useful information. For example, it would be interesting to see if each of the attacks on the security cameras had similar characteristics or each of the attacks on the doorbells had similar characteristics, etc. This would be helpful in determining how to handle and prevent future attacks.

CONFLICT OF INTEREST
There is no conflict of interest to report.

AUTHOR CONTRIBUTIONS
Dr. Sikha Bagui helped in designing the research plan and the write-up of the paper. Xiaojian also helped in the research plan, worked on the programming and on the write-up of the paper. Dr. Subhash Bagui provided the statistical guidance.