Machine Learning Classification of Port Scanning and DDoS Attacks: A Comparative Analysis

Cyber security is a major concern in today's connected world. Across every platform of modern communication technology, whether wired, wireless, local or remote access, attackers attempt to corrupt system functionality, circumvent security measures and steal sensitive information. Among the many techniques used by attackers, port scanning and Distributed Denial of Service (DDoS) attacks are very common. In this paper, machine learning is applied to classify port scanning and DDoS attacks in a mix of normal and attack traffic. Different machine learning algorithms are trained and tested on a recently published benchmark dataset (CICIDS2017) to identify the best performing algorithms on data that contains more recent vectors of port scanning and DDoS attacks. The classification results show that all the variants of discriminant analysis and Support Vector Machine (SVM) provide good testing accuracy, i.e. more than 90%. According to a subjective rating criterion described in this paper, 9 of the 22 algorithms receive the highest rating (good) because they provide more than 85% classification (testing) accuracy. The comparative analysis is further extended to observe the training performance of the machine learning models through k-fold cross validation, Area Under Curve (AUC) analysis of the Receiver Operating Characteristic (ROC) curves, and dimensionality reduction using Principal Component Analysis (PCA). To the best of our knowledge, a comprehensive comparison of machine learning algorithms on the CICIDS2017 dataset for port scanning and DDoS attacks, considering such recent attack features, is lacking in the literature.


INTRODUCTION
Port scanning and DDoS attacks are very common techniques used by cyber attackers to scan for vulnerabilities and to exhaust the resources of a target, respectively. While port scanning is in progress, a scanning tool identifies the open ports on the target and reports the running services. After an attacker gathers sufficient information such as the IP scheme, datacenter locations and target profile during the reconnaissance phase, port scanning takes place as an early stage of the enumeration step. NMAP is the most commonly used port scanning tool worldwide [2]. A common example is scanning for ports that support Transmission Control Protocol / Internet Protocol (TCP/IP) traffic. The scanner sends a synchronization (SYN) signal to the target. If the target responds with a Synchronization & Acknowledgment (SYN+ACK) signal, the port is open. To close the connection after learning that the port is open, the scanner may send a reset (RST) signal. On the other hand, leaving the connection open may turn this scenario into a kind of denial of service attack known as a TCP SYN attack. This is shown in Fig. 1.
DDoS attacks are another major area of cyber security concern, in which attackers flood the target (also known as the victim) with malformed traffic to exhaust its resources so that the victim cannot respond to legitimate requests. DDoS is a huge problem because denial of service can be created at every layer of the Open Systems Interconnection (OSI) communication model [3]. According to the statistics from akamai.com for Summer 2018 [4], DDoS attacks increased by 16% compared with Summer 2017. In a traditional DDoS scenario, an attacker controls a number of machines that have been compromised through some exploitation such as malware. Such compromised machines are called zombies, and they are part of a malicious network called a botnet. The zombies are used to attack the victim directly. Some high powered machines in the botnet are also selected as handlers, which pass the attacker's instructions on to the zombies. The attacker's server runs the Command & Control (C&C) function, which issues the direct instructions of the attack [5]. A typical scenario of a DDoS attack is shown in Fig. 2. A DDoS attack can also adjust the rate of malicious traffic sent to the victim according to the capacity of the network's bandwidth. Therefore, when a DDoS attack is tailored to the available network, its bandwidth, and the nature of traffic acceptable on the network, it can be applied to traditional as well as more recent types of networking [6].

Fig. 2: Typical scenario of DDoS attack
Machine learning is one of today's major tools for finding optimal solutions to numerous real-world problems through a soft computing approach. Cyber security is no exception, and researchers are using several machine learning techniques to address cyber attacks and find solutions for intrusion detection. Machine learning is mainly classified into two forms, i.e. supervised learning and unsupervised learning. When data is available with labeled target values, supervised learning methods are used to find the response for new data with unknown output. Supervised learning is further divided into two major components, i.e. classification/prediction and optimization. In the former category, classification is made for discrete output whereas prediction is associated with continuous output. The latter category corresponds to machine learning approaches that use evolutionary computing either for final outputs or for supporting solutions such as feature selection [7].
In this paper, supervised machine learning is used for intrusion detection of port scanning and DDoS attacks. The classifications are obtained in a mix of normal and attack traffic. Different machine learning algorithms are trained and tested on a recently published benchmark dataset to identify the best performing algorithms on data that contains more recent vectors of port scanning and DDoS attacks. The dataset used in this research is CICIDS2017, an intrusion detection dataset created in 2017 and published in 2018 by the Canadian Institute for Cybersecurity (CIC) [8]. The relevant category of attacks has been taken from the Friday working hours scenarios of CICIDS2017. CIC is also the creator of other benchmark intrusion detection datasets such as NSL-KDD 2009 and ISCXIDS2012. CICIDS2017 is selected in this research because it contains more recent attack vectors, as opposed to older techniques that attackers no longer exercise with noticeable frequency. Other machine learning based comparative analyses are available in the literature on older benchmark datasets of intrusion detection. In contrast, the contributions of this paper are:
• Comparing different machine learning algorithms using a recent benchmark dataset of intrusion detection with a focus on port scanning and DDoS attacks. The work is carried out with a feature selection approach using correlation coefficient scores to reduce processing overhead while achieving results with a minimal performance hit.
• Providing the comparative analysis of train/test accuracies and other training statistics on specified computing resource. The analysis also covers the performance observations with cross validation, area under curve evaluation, and dimensionality reduction.
• Discussing the results in the light of recent attack vectors and proposing the future line of action.
The rest of this paper is organized as follows: Section 2 provides the related work in this research area. Section 3 provides a briefing on machine learning algorithms applied in this paper. Section 4 explains the experimental setup, and Section 5 provides classification results. Section 6 discusses the machine learning analysis, and Section 7 provides experimental observations with cross validation, area under curve evaluation, and dimensionality reduction. Finally, Section 8 provides the conclusion and future work followed by the references.

RELATED WORK
Brahmi et al. [9] worked on DARPA 98 dataset [10] with four types of attacks i.e. Scan (or Probe), Denial of Service (DoS), User to Remote (U2R) and Remote to Local (R2L). Attack detection rates were obtained with multidimensional association rule mining, where six-dimensional rule mining gave the best rates. They obtained 95% and 99% detection accuracy for Scan and DoS attacks respectively. Jemili et al. [11] worked on KDD 99 dataset [12] to first categorize normal and attack traffic with junction tree inference module. In subsequent steps for attack category, Scan and DoS attacks were detected using anomaly detection module with accuracy rates of 99% and 89% respectively. The other types of attacks showed less accuracy due to lower number of training samples. Zhang et al. [13] grouped major attacks in KDD 99 dataset (Scan and DoS) and further expanded them into four attack levels. The detection accuracies achieved for the levels were 95%, 93%, 90% and 87% with constant 1% false acceptance rate using random forest classification. In [14], the authors used similar technique of random forest classification on KDD 99 dataset. However, the dataset was normalized in this work during the preprocessing stage. A comparative study was also presented with Naïve Bayes, Decision Tree and Gaussian maximum likelihood classifiers. The detection accuracies with random forest classification obtained for Scan and DoS attacks were 76% and 97% respectively.
Gao et al. [15] determined five features to classify distributed reflection DoS attacks with the SVM algorithm. A few experimental runs reached 100% accuracy without any false positives. However, their work is limited by the small set of experiments. In [16], several feature scores including correlation ranking were input to an ensemble method of feature selection. The features exhibiting high scores under various methods easily crossed the threshold of final feature selection. In this way, the 16 most important features were determined from the CAIDA 07 dataset [17]. A comparative study was presented using algorithms such as Naïve Bayes (NB), Random Forest (RF) and Multi-Layer Perceptron (MLP). A high detection accuracy of 98.3% was achieved with MLP and the boosted feature set. In [18], a comparison was presented among different machine learning algorithms to detect the SYN flood attack (a variant of denial of service) in a virtualized cloud computing environment. An intersection process was used to identify certain important features from an extended feature space, based on statistical analysis of the TCP/IP header. With the limited set of 25 features, Naïve Bayes, J48, neural network and supervised K-means algorithms were compared. The highest accuracy of 99.995% was achieved with the J48 algorithm (a Java based decision tree algorithm in the WEKA tool).
In [19], the detection accuracy of the SVM machine learning algorithm was compared with the accuracy of SNORT, an open source intrusion detection tool. The SVM classification was applied in libsvm (SVM Library of Java). With the evaluation metrics of true positives and false positives, the SVM provided 99% detection accuracy of attacks as compared to SNORT with 89% accurate detections. Lu et al. [20] compared the RF, NB and SVM algorithms to detect the establishment of a C&C session before the launch of a DDoS attack, using a feature vector of network traffic with 55 dimensions. The traffic, comprising normal and C&C session traffic, was generated with the HTTP and IRC protocols running on ports 80 and 6667 respectively. The RF algorithm showed better detection accuracy than NB and SVM. In [21], the authors simulated modern types of DDoS attacks at the application layer along with traditional network layer attacks. The comparative analysis of attack detection was provided using the NB, RF and MLP machine learning algorithms. The highest accuracy of 98.63% was achieved with MLP, followed by 98.02% and 96.91% for RF and NB respectively. In [22], the authors simulated DDoS attacks and presented a comparative analysis of attack detection using the NB, RF, MLP, logistic regression and radial basis function algorithms. The highest accuracy of 93.67% was achieved using the NB algorithm configured with a multinomial classifier. However, their work is limited by a rather small set of data samples.
The overall analysis of related work reveals that machine learning based comparative studies of port scanning and DDoS attack classification are available in the literature either on older benchmark datasets or on simulated network traffic. Hence, a comparative analysis of different machine learning algorithms on a newer benchmark dataset of intrusion detection, with recent vectors of port scanning and DDoS attacks, is needed by the community to extend the research in this domain with information on the best performing algorithms.

Tree
Different variants of decision tree algorithm work in a way that the subsets of dataset are created based on splitting the samples with respect to target classes and separated by the most contributing feature in the dataset. In the subsequent phases, each subset is independently split into further subsets based on high contributing vectors. In this way, decision trees are built, and the system learns how to split the new data points with respect to the feature set to reach the classification results [24].
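The splitting behaviour described above can be illustrated with a minimal sketch. It assumes scikit-learn's `DecisionTreeClassifier` as a stand-in for the tree variants used in the experiments, and the two-feature synthetic data is purely hypothetical (it is not CICIDS2017 traffic).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in data: three well separated classes labelled
# 0 (normal), 1 (port scanning) and 2 (DDoS) in a two-feature space.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])
y = np.repeat([0, 1, 2], 100)

# Each internal node splits the samples on the single feature and threshold
# that best separates the classes (Gini impurity is the default criterion).
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)

# New points are routed down the learned splits to reach a class label.
new_points = np.array([[0.1, -0.2], [5.2, 0.3], [-0.1, 4.8]])
predicted = tree.predict(new_points)
print(predicted)
```

The depth limit is an arbitrary choice for the sketch; deeper trees produce finer subsets at the cost of a higher risk of overfitting.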

Discriminant Analysis
Discriminant analysis is a feature extraction technique of machine learning. From 'n' independent variables of a dataset, 'p' new independent variables (p ≤ n) are extracted which separate most of the classes of target variable. Unlike principal component analysis where variance within feature variable is considered, DA considers the classes of target variable hence it is a supervised method of feature extraction [25].
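As a rough illustration of supervised feature extraction, the sketch below uses scikit-learn's `LinearDiscriminantAnalysis` on hypothetical synthetic data. It is an assumption that this class behaves comparably to the discriminant variants used in the experiments; note also that this particular implementation limits the number of extracted variables to one fewer than the number of classes.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical data: n = 6 features, three classes whose means differ
# only along the first two features.
rng = np.random.default_rng(1)
n_features = 6
means = np.zeros((3, n_features))
means[1, 0] = 4.0
means[2, 1] = 4.0
X = np.vstack([m + rng.normal(size=(150, n_features)) for m in means])
y = np.repeat([0, 1, 2], 150)

# The class labels y are used while fitting, which is what makes the
# extraction supervised (unlike PCA, which only looks at X).
lda = LinearDiscriminantAnalysis(n_components=2)
Z = lda.fit_transform(X, y)   # p = 2 extracted variables from n = 6

train_accuracy = lda.score(X, y)
print(Z.shape, round(train_accuracy, 3))
```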

Support Vector Machine (SVM)
Support Vector Machine is quite a popular machine learning approach for predicting and classifying data in high dimensional space. SVM brings out the information of data for separating target classes in terms of introducing hyperplanes among the feature vectors in such a way that the distance between points nearest to the hyperplanes is maximized. These points lying the closest to the hyperplanes are termed as support vectors. It is a complex technique of machine learning due to high dimensional computations [26].
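The role of the support vectors and the margin can be seen in a small sketch. It assumes scikit-learn's `SVC` with a linear kernel on hypothetical two-class data; the kernel variants named in this paper (such as Fine Gaussian) are not reproduced here.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical two-class data, linearly separable for clarity.
rng = np.random.default_rng(2)
X = np.vstack([
    rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[2.0, 2.0], scale=0.5, size=(50, 2)),
])
y = np.repeat([0, 1], 50)

svm = SVC(kernel="linear", C=1.0)
svm.fit(X, y)

# Only the training points closest to the hyperplane are kept as
# support vectors; they alone determine the decision boundary.
n_support = len(svm.support_vectors_)

# For a linear kernel the margin width is 2 / ||w||.
margin = 2.0 / np.linalg.norm(svm.coef_)

predicted = svm.predict([[-2.0, -1.5], [1.5, 2.0]])
print(n_support, round(margin, 3), predicted)
```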

K-Nearest Neighbors (KNN)
K-Nearest Neighbors algorithm relies on the information provided by the 'K' number of already classified or trained points closest to the new data point in feature space. The voting mechanism decides the fate of new data point on assigning a class to it. The closeness factor to choose 'K' points is determined by some applied metric. Euclidean distance (straight line distance between two points in n-dimensional space) is usually the most common metric applied in KNN [27].
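The voting mechanism with Euclidean distance can be sketched in a few lines of NumPy. The function name `knn_predict` and the toy points are hypothetical and serve only to illustrate the idea.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Assign x_new the majority class among its k nearest training points."""
    # Euclidean (straight line) distance from x_new to every training point.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(distances)[:k]
    # Majority vote over the labels of the k closest points.
    votes = np.bincount(y_train[nearest])
    return int(np.argmax(votes))

X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

label_a = knn_predict(X_train, y_train, np.array([0.5, 0.5]))
label_b = knn_predict(X_train, y_train, np.array([5.5, 5.5]))
print(label_a, label_b)
```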

Ensemble Classifiers
Ensemble classifiers apply independent algorithms under the hood to solve classification problem with the help of individual results provided by the underlying algorithms. For example, the boosted tree ensemble classifier applies a preconfigured number of decision trees in such a way that the result of a tree will be used to boost its more contributing features in the subsequent tree. Hence, a series of decision tree results are used to find the weighted average for final classification. In the case of bagged tree ensemble classifier, independent decision trees are run in parallel to provide results for ensemble technique. The simple average or voting is used for the final classification [28].
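The difference between bagged and boosted trees can be sketched as follows. The sketch assumes scikit-learn's `BaggingClassifier` and `AdaBoostClassifier` with their default tree base learners as approximate analogues of the bagged and boosted tree ensembles; the data and the number of estimators are hypothetical.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split

# Hypothetical two-class data (for example normal vs. attack traffic).
rng = np.random.default_rng(4)
X = np.vstack([
    rng.normal(loc=[-2.0, -2.0], scale=1.0, size=(200, 2)),
    rng.normal(loc=[2.0, 2.0], scale=1.0, size=(200, 2)),
])
y = np.repeat([0, 1], 200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Bagging: independent trees fitted on bootstrap samples, combined by voting.
bagged = BaggingClassifier(n_estimators=25, random_state=0)
bagged.fit(X_train, y_train)

# Boosting: trees fitted in sequence, each one re-weighting the samples the
# previous trees misclassified; the final output is a weighted vote.
boosted = AdaBoostClassifier(n_estimators=25, random_state=0)
boosted.fit(X_train, y_train)

bagged_acc = bagged.score(X_test, y_test)
boosted_acc = boosted.score(X_test, y_test)
print(round(bagged_acc, 3), round(boosted_acc, 3))
```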
There are different datasets available for research in the domain of intrusion detection. In Table 1, the important datasets are mentioned with relevant information [29][30][31].

EXPERIMENTAL SETUP
From the CICIDS2017 dataset, a total of 512212 instances are taken from the Friday-Working hours-Afternoon scenarios of port scan and DDoS attacks. There are three classes in total, labeled 0, 1 and 2 for Normal, Port scanning and DDoS traffic respectively. 225255 instances are labeled normal, 158930 instances are port scanning traffic, and the remaining 128027 instances belong to DDoS attacks. There are 78 independent variables (features) in the default state with no missing values; however, significant feature preprocessing is required as mentioned below:
• 45 features are removed for having a correlation coefficient below 20% with respect to the dependent (target) variable. According to [32], labeling systems exist that roughly consider correlation coefficients ≤ 0.35 to represent low or weak correlations. Hence, it is assumed that all decisive variables are included in the final set of features after feature selection. This selection is made using the corrcoef function of the NumPy package for scientific computing in Python 3.
• 21 features remain in the dataset as the most significant features according to the configured value of the correlation coefficient. These 21 features are taken for the classification.
• Data normalization is done using the StandardScaler class of the scikit-learn library in Python 3. The dataset is split in a 70-30% ratio for training and testing in a randomized manner using the train_test_split function of scikit-learn. It provides 358548 samples of data for training (157701 normal, 111292 port scanning, and 89555 DDoS) and 153664 samples for testing (67554 normal, 47638 port scanning, and 38472 DDoS).
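The preprocessing steps listed above can be sketched as follows, using the functions named in the text (`numpy.corrcoef`, `StandardScaler`, `train_test_split`). The synthetic data, the variable names and the random seed are hypothetical. Fitting the scaler on the training portion only is a choice made in this sketch; the paper does not state whether normalization was fitted before or after the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the flow data: 3 informative and 5 noise features.
rng = np.random.default_rng(0)
n_samples = 1000
y = rng.integers(0, 3, size=n_samples)            # labels 0, 1, 2
informative = y[:, None] + rng.normal(scale=0.5, size=(n_samples, 3))
noise = rng.normal(size=(n_samples, 5))
X = np.hstack([informative, noise])

# Absolute Pearson correlation of every feature with the target variable.
corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                 for j in range(X.shape[1])])

# Keep the features whose correlation reaches the 20% threshold.
keep = corr >= 0.20
X_selected = X[:, keep]

# Randomized 70-30% split for training and testing.
X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y, test_size=0.30, random_state=0)

# Standardize to zero mean and unit variance (assumption: fitted on the
# training part only, then applied to both parts).
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)
print(X_selected.shape, X_train_std.shape, X_test_std.shape)
```

With a three-class integer label, the Pearson correlation treats the label as an ordinal value, so the ranking depends on how the classes are numbered.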
Classification experiments are performed in Matlab R2017a due to the availability of enriched set of algorithms in Apps section under Classification Learner. The 21 independent variables shortlisted for the classification's feature space are mentioned in Table 2.

CLASSIFICATION RESULTS
The classification results of different algorithms with accuracy scores and other parameters of training including the confusion matrices are provided in Table  3. The numbers under "Predicted Class" columns in Table 3

DISCUSSION AND ANALYSIS
The classification results obtained in Table 3 reveal that some machine learning algorithms can exhibit substandard performance in classifying port scanning and DDoS attacks even after they show good training accuracies. As testing instances are different from the training set, the considerable differences of feature vectors of the two sets can make it harder for even a trained model to show better results in terms of classification accuracy. In this analysis, the best performing algorithms can be found in terms of classification accuracy of port scanning and DDoS attacks which show good training as well as classification scores. Figure 3 shows the results of all specified machine learning algorithms.
From Fig. 3 as well as Table 3, it is observed that the Fine Gaussian variant of SVM is the best performing machine learning model among the experiments which shows 99% testing (classification) accuracy as well as 99% training accuracy. For collective analysis, it is observed that all the variants of discriminant analysis and SVM provide good classification results of port scanning and DDoS attacks. On the other hand, inefficient performance in the range of 49-69% is exhibited by the tree based models as well as KNN and most of the ensemble classifier based algorithms. However, the subspace discriminant variant of ensemble classifier provides 85.5% testing accuracy which is still competitive to other high performing algorithms.
Based on the analysis, the specified machine learning algorithms are rated in Table 4. Several other classifiers, such as the tree based models and KNN, showed performance degradation in the testing phase when new/unseen data was presented for classification. In general, it is observed that all the algorithms identify the attack traffic well in training, and the misclassifications mainly belong to the normal traffic due to its higher share in the mixed traffic together with a factor of noise in the data.
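For reference, the per-class true positive rate (TPR), false negative rate (FNR) and overall accuracy can be derived from a confusion matrix as in the sketch below. The matrix values are hypothetical and are not taken from Table 3.

```python
import numpy as np

# Hypothetical confusion matrix: rows are true classes, columns are
# predicted classes, in the order normal (0), port scanning (1), DDoS (2).
cm = np.array([
    [90, 6, 4],
    [2, 47, 1],
    [3, 2, 45],
])

true_positives = np.diag(cm)          # correctly classified per class
per_class_total = cm.sum(axis=1)      # number of true instances per class

tpr = true_positives / per_class_total    # true positive rate per class
fnr = 1.0 - tpr                           # false negative rate per class
accuracy = true_positives.sum() / cm.sum()

print(tpr, fnr, accuracy)
```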
For the analysis of training time exhibited by the specified machine learning algorithms, it is observed that training time generally increases when the number of observations processed per second by the classification algorithm decreases in the training phase. As large differences in Kobservations/sec and training time among the machine learning algorithms are observed in Table 3, they are shown in a scatter plot in Fig. 5 with values normalized between 0 and 1. It can be seen that, in general, fewer training observations per second require more training time to complete the learning phase of a model. However, some models are comparatively efficient, as they take less time to complete even with small numbers of training observations per second (e.g. a few variants of ensemble classifier and SVM from Table 3). Fast training completions are provided by the tree and discriminant analysis based algorithms. Hence in terms of fast training and high training/testing accuracy scores, discriminant analysis based machine learning models in this comparative study are found to be more accurate and efficient to classify port scanning and DDoS attacks.
Fig. 4: Average TPR and FNR of machine learning algorithms during training
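The normalization to the range between 0 and 1 used for such a scatter plot can be sketched with min-max scaling. It is an assumption that min-max scaling is the normalization meant here, and the numbers below are hypothetical rather than taken from Table 3.

```python
import numpy as np

def min_max(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        return np.zeros_like(values)   # all values identical
    return (values - values.min()) / span

# Hypothetical training statistics for four models.
obs_per_sec = [2.0, 10.0, 50.0, 400.0]      # thousands of observations/sec
train_time = [900.0, 300.0, 60.0, 5.0]      # seconds

norm_speed = min_max(obs_per_sec)
norm_time = min_max(train_time)
print(norm_speed, norm_time)
```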

COMPARATIVE ANALYSIS WITH VALIDATION, EVALUATION AND DIMENSIONALITY REDUCTION
Machine learning algorithms should not be trusted without validating their results, in order to avoid overfitting and a false sense of prediction strength. For this purpose, validation and evaluation steps are added in this paper to analyze whether the training results can be trusted and to evaluate them through accepted means. Fig. 6 explains the proposed scheme, in which k-fold cross validation and AUC analysis of ROC curves are included in the experiments. Dimensionality reduction is also considered, for which Principal Component Analysis (PCA) is used in this paper.
Fig. 6: Proposed scheme of analysis with validation, evaluation and dimensionality reduction.

K-Fold Cross Validation
K-fold cross validation is an effective tool to avoid overfitting during the training phase: the train-test split is shifted for a certain number of rounds to check that a particular split does not represent an overfitted state. This is established if the other splits also produce training accuracy close to the original one. In Fig. 7, a comparison is provided between no validation and 10-fold cross validation (k=10) of the training accuracies of the machine learning algorithms. It can be noticed that for some comparisons, the average training accuracy with cross validation drops slightly from the one without validation, as different splits can produce different accuracies (e.g. medium tree, linear discriminant, and coarse KNN). Hence, the average accuracy can be different with the validation step. In a few cases, it also increases compared with the training accuracy with no validation (e.g. weighted KNN).
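The comparison between no validation and 10-fold cross validation can be sketched as follows. The sketch assumes scikit-learn's `cross_val_score` with a stratified, shuffled 10-fold split and a linear discriminant model on hypothetical data; the fold assignment used in the experiments is not specified in this paper.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical three-class data with some overlap between the classes.
rng = np.random.default_rng(3)
means = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
X = np.vstack([m + rng.normal(size=(200, 2)) for m in means])
y = np.repeat([0, 1, 2], 200)

model = LinearDiscriminantAnalysis()

# "No validation": accuracy measured on the same data used for fitting.
no_validation_acc = model.fit(X, y).score(X, y)

# 10-fold cross validation: every fold is held out once while the model is
# fitted on the other nine, and the ten accuracies are averaged.
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
fold_scores = cross_val_score(model, X, y, cv=folds)
cv_acc = fold_scores.mean()

print(round(no_validation_acc, 3), round(cv_acc, 3))
```

A large gap between the two numbers would indicate that the accuracy without validation is optimistic; a small gap supports trusting the training result.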

AUC Analysis of ROC Curves
Area-under-curve analysis of ROC is another effective tool of evaluation in order to avoid the accuracy paradox [33]. This term refers to the fact that a machine learning algorithm can provide the accuracy score which can be valid for only an instantaneous  Table 4. In Fig. 8, both variants of discriminant analysis are plotted. In Fig. 9, three variants of SVM providing the respective highest training accuracies from Fig. 3 are plotted. In Fig. 10, the single good variant of ensemble classifier is evaluated. It can be noticed that all the mentioned algorithms show area under curve scores which tally the respective training accuracies of machine learning models, hence the accuracy paradox can be avoided. Here, the 10-fold cross validation is kept enabled for effective validation followed by the evaluation step. The curves are made with one vs. all approach i.e. normal traffic vs. attack traffic (covering both port scanning and DDoS types of attacks).
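The one vs. all evaluation can be sketched by collapsing the three labels into normal vs. attack and computing the area under the ROC curve. The sketch assumes scikit-learn's `roc_curve`, `auc` and `roc_auc_score`; the scores are hypothetical classifier outputs, and treating attack as the positive class is a choice made here for illustration.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hypothetical true labels (0 normal, 1 port scanning, 2 DDoS) and a
# hypothetical model score expressing how likely each flow is an attack.
y_true = np.array([0, 0, 1, 2, 0, 1, 2, 0])
attack_score = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.90, 0.70, 0.05])

# One vs. all: normal traffic against both attack types together.
is_attack = (y_true != 0).astype(int)

# ROC curve: true positive rate against false positive rate over thresholds.
fpr, tpr, thresholds = roc_curve(is_attack, attack_score)
auc_from_curve = auc(fpr, tpr)

# The same area computed directly from labels and scores.
auc_value = roc_auc_score(is_attack, attack_score)
print(auc_value)
```

In this toy example 15 of the 16 attack/normal pairs are ranked correctly, which gives an area of 0.9375.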

Dimensionality Reduction with PCA
Principal Component Analysis (PCA) is an unsupervised tool of dimensionality reduction. It is unsupervised because it does not take the target classes into account when reducing dimensionality. It considers the variance of the original features in a dataset and produces new features that preserve most of the variance. Hence, it is a feature extraction technique rather than a feature selection approach. The calculations are made by obtaining eigenvalues and eigenvectors from the covariance matrix [34]. In this analysis, training accuracies are obtained for all machine learning models with two PCA configurations, along with the prediction speeds. In Table 5, two different PCA settings are used, i.e. PCA explaining 85% variance and PCA explaining 90% variance. Two settings are used to allow a comparative analysis for different numbers of extracted features: five features are extracted for PCA explaining 85% variance, and six features for PCA explaining 90% variance. It can be noticed that prediction speeds are reduced in most cases compared with the full-feature analysis because, although dimensionality is reduced, 10-fold cross validation is kept enabled for effective validation along with the dimensionality reduction. Also, the prediction speed is generally lower in the 90% analysis than in the 85% analysis due to the presence of an extra extracted feature in the former case.
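The eigenvalue-based calculation and the effect of the explained-variance setting can be sketched in NumPy. The function name `pca_by_variance` and the synthetic feature variances are hypothetical; the component counts in this sketch (3 and 4) result from the toy data and are unrelated to the 5 and 6 features reported for the dataset.

```python
import numpy as np

def pca_by_variance(X, target_variance):
    """Project X onto the fewest principal components whose cumulative
    explained variance reaches target_variance (a fraction in (0, 1])."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    # eigh suits the symmetric covariance matrix; eigenvalues come ascending.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = np.cumsum(eigvals) / eigvals.sum()
    # First position where the cumulative share reaches the target.
    n_components = int(np.searchsorted(explained, target_variance) + 1)
    return X_centered @ eigvecs[:, :n_components], n_components

# Hypothetical data: 8 independent features with decreasing variance,
# giving cumulative shares of roughly 40%, 70%, 87.5%, 93.5%, ...
rng = np.random.default_rng(5)
variances = np.array([40.0, 30.0, 17.5, 6.0, 3.0, 2.0, 1.0, 0.5])
X = rng.normal(size=(5000, 8)) * np.sqrt(variances)

Z85, n85 = pca_by_variance(X, 0.85)
Z90, n90 = pca_by_variance(X, 0.90)
print(n85, n90, Z85.shape, Z90.shape)
```

Raising the target from 85% to 90% adds one extracted feature here, which mirrors the reason given above for the lower prediction speed in the 90% setting.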

CONCLUSION AND FUTURE WORK
In this paper, a comparative analysis is presented on the recently published benchmark dataset CICIDS2017 to classify port scanning and DDoS attacks in a mix of normal and attack traffic. 22 different machine learning algorithms are trained and tested to check their performance on the recent vectors of attacks. The classification results show that all the variants of discriminant analysis and SVM provide testing accuracies of more than 90%. The best accuracy score of 99% is obtained with the Fine Gaussian variant of SVM. In general, the training time increases as the number of observations processed per second decreases during the training phase. The fastest training is exhibited by the tree and discriminant analysis based algorithms. Hence in terms of fast training and high training/testing accuracies, discriminant analysis based models are more productive. In the subjective rating of algorithms, 9 algorithms receive the highest rating, i.e. good, for showing more than 85% testing accuracy. This comparative analysis is further extended to observe the training performance of the machine learning models through k-fold cross validation, AUC analysis of ROC curves, and dimensionality reduction using PCA. The 10-fold cross validation shows that the average training accuracy in some cases drops slightly from the one without validation, as different splits can produce different accuracies and so bring a slight change in the average accuracy. The AUC analysis of ROC curves shows that all the observed algorithms provide area under curve scores which tally with the respective training accuracies of the machine learning models. Finally, the dimensionality reduction with PCA explaining 85% and 90% variance, providing 5 and 6 extracted features respectively, shows that the prediction speeds can vary from those of the full-feature analysis, both because of the dimensionality reduction and because 10-fold cross validation is kept enabled to avoid overfitting.
Machine learning is recently being explored in research for effective and efficient applications in the field of information security [35,36]. In fact, security is always one of the top concerns in the development of automated communication systems [37]. Intrusion detection is one of the major domains under cyber security, and machine learning is being actively applied and tested in this area to get fruitful results [38]. In future work, more variants of machine learning models including neural networks (multilayer perceptron) will be considered in conjunction with detailed feature engineering to find enriched comparisons of machine learning algorithms on recent datasets of port scanning and DDoS attacks. In addition to this, more analysis on the techniques of dimensionality reduction will be performed to decrease the performance overhead in significant manner.