A clustering-based method for outlier detection under concept drift

Mahjabeen Tahir; Azizol Abdullah; Nur Izura Udzir; Khairul Azhar Kasmiran

doi:10.22581/muet1982.3269

Mahjabeen Tahir, Department of Computer Science and Information Technology, University Putra Malaysia (UPM), Serdang, 43400, Selangor, Malaysia
Azizol Abdullah, Department of Computer Science and Information Technology, University Putra Malaysia (UPM), Serdang, 43400, Selangor, Malaysia
Nur Izura Udzir, Department of Computer Science and Information Technology, University Putra Malaysia (UPM), Serdang, 43400, Selangor, Malaysia
Khairul Azhar Kasmiran, Department of Computer Science and Information Technology, University Putra Malaysia (UPM), Serdang, 43400, Selangor, Malaysia

Abstract

Network security threats remain a persistent challenge, motivating the exploration of alternative detection approaches. Anomaly-based strategies, unlike traditional signature-based methods, are gaining popularity for their effectiveness in detecting new attacks. However, accurately defining normal network behavior becomes increasingly difficult as the data fluctuate over time. This study introduces a two-step process for recognizing evolving anomalies in streaming network data. First, clusters are updated incrementally as new data arrive (the updating phase). Second, anomalies are identified by distinguishing outer and inner outliers using minimum and maximum density thresholds. A buffer temporarily stores incoming data to prevent normal network samples from being misclassified as anomalies. The method is implemented in Python 3 and evaluated in terms of detection rate, false positive rate, and accuracy on two popular streaming datasets (NSL-KDD and UNSW-NB15). The algorithm achieves a detection rate of 99.12% on UNSW-NB15 and a false positive rate of 7.9% on NSL-KDD. The proposed approach, CADSD (Cluster-based Anomaly Detection with Streaming Data), operates in real time without pre-training. However, it assumes that the majority of the data are normal instances, so its effectiveness may diminish during sudden spikes in attack data. Nonetheless, the method shows potential to enhance network security by promptly identifying emerging anomalies in real-time streaming data. The buffer, which prevents normal network samples from being misidentified as anomalies, underscores the novelty of the approach.
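
The two-step process described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' CADSD implementation: the abstract does not define the distance measure, the density measure, or the exact meaning of inner and outer outliers. The sketch therefore assumes Euclidean distance, a fixed cluster `radius`, density measured as a cluster's share of all points seen, and one plausible reading of the two outlier types. All class, method, and parameter names (`StreamingDetector`, `radius`, `min_density`, `max_density`, `buffer_size`, the `"pending"` label) are hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Cluster:
    centroid: list
    count: int = 1

    def absorb(self, point):
        # Incremental (running-mean) centroid update; no point history is stored.
        self.count += 1
        self.centroid = [c + (p - c) / self.count
                         for c, p in zip(self.centroid, point)]


@dataclass
class StreamingDetector:
    radius: float        # assumption: max distance at which a point joins a cluster
    min_density: float   # assumption: share of points below which a cluster is sparse
    max_density: float   # assumption: share at or above which a cluster is dense
    buffer_size: int     # assumption: unmatched points held before new clusters form
    clusters: list = field(default_factory=list)
    buffer: list = field(default_factory=list)
    seen: int = 0

    def _nearest(self, point):
        # Return the closest cluster and its distance, or (None, inf) if none exist.
        best, best_d = None, math.inf
        for cluster in self.clusters:
            d = math.dist(cluster.centroid, point)
            if d < best_d:
                best, best_d = cluster, d
        return best, best_d

    def update(self, point):
        """Updating phase: absorb the point into a cluster or hold it in the buffer."""
        self.seen += 1
        cluster, d = self._nearest(point)
        if cluster is not None and d <= self.radius:
            cluster.absorb(point)
            return
        self.buffer.append(point)
        if len(self.buffer) >= self.buffer_size:
            self._flush()

    def _flush(self):
        # Buffered points join a nearby cluster if one exists by now,
        # otherwise each seeds a new cluster.
        for point in self.buffer:
            cluster, d = self._nearest(point)
            if cluster is not None and d <= self.radius:
                cluster.absorb(point)
            else:
                self.clusters.append(Cluster(list(point)))
        self.buffer.clear()

    def label(self, point):
        """Detection phase, under this sketch's reading of the thresholds."""
        cluster, d = self._nearest(point)
        if cluster is None or d > self.radius:
            return "outer outlier"   # assumption: lies outside every cluster
        share = cluster.count / self.seen
        if share < self.min_density:
            return "inner outlier"   # assumption: lies inside a sparse cluster
        if share >= self.max_density:
            return "normal"
        return "pending"             # assumption: wait for more data before deciding
```

In this sketch a point that matches no cluster is not declared anomalous on arrival; it waits in the buffer until enough unmatched points accumulate, which is the role the abstract assigns to the buffer. For example, after feeding many points near the origin with `update`, `label((0.0, 0.0))` returns `"normal"`, while a distant point is reported as an outer outlier until neighbouring points arrive and form their own cluster.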