An Empirical Study on Anomaly Detection Using Density-based and Representative-based Clustering Algorithms

Gerard Shu Fuhnwi; Janet O. Agbaje; Kayode Oshinubi; Olumuyiwa James Peter

doi:10.46481/jnsps.2023.1364

Authors

Gerard Shu Fuhnwi
Department of Computer Science, Montana State University, Montana, USA.
Janet O. Agbaje
Department of Mathematical Science, Montana Technological University, Montana, USA.
Kayode Oshinubi
School of Informatics, Computing and Cyber Systems, Northern Arizona University, Arizona, USA.
Olumuyiwa James Peter
Department of Mathematical and Computer Sciences & Department of Epidemiology and Biostatistics, School of Public Health, University of Medical Sciences, Ondo, Nigeria.

Keywords:

Outliers, Noise points, ANN, k-means--, DBSCAN, DBSCAN++

Abstract

In data mining and statistics, anomaly detection is the process of finding data patterns (outcomes, values, or observations) that deviate from the rest of the observations or outcomes. Anomaly detection is heavily used to solve real-world problems in many application domains, including medicine, finance, cybersecurity, banking, networking, transportation, and military surveillance for enemy activities. In this paper, we present an empirical study of unsupervised anomaly detection techniques, namely Density-Based Spatial Clustering of Applications with Noise (DBSCAN), DBSCAN++ (with uniform initialization, $k$-center initialization, uniform with approximate neighbor initialization, and $k$-center with approximate neighbor initialization), and $k$-means$--$, on six benchmark imbalanced data sets. Findings from our in-depth empirical study show that $k$-means$--$ is more robust than DBSCAN and DBSCAN++ in terms of the evaluation measures (F1-score, false alarm rate, adjusted Rand index, and Jaccard coefficient) and running time. We also observe that DBSCAN performs very well on data sets with fewer data points. Moreover, the results indicate that the choice of clustering algorithm can significantly affect anomaly detection performance, and that the performance of each algorithm varies with the characteristics of the data. Overall, this study provides insight into the strengths and limitations of different clustering algorithms for anomaly detection and can help guide the selection of appropriate algorithms for specific applications.
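To make the representative-based approach concrete, the following is a minimal sketch of the general $k$-means$--$ idea as commonly described (alternate between setting aside the points farthest from their nearest center and updating the centers from the remaining points), together with two of the evaluation measures named above. It is not the implementation used in the study. The function names, the `init_centers` parameter, the toy data, and the definition of false alarm rate as FP / (FP + TN) are assumptions made for illustration.

```python
import numpy as np

def assign(X, centers, n_outliers):
    """Nearest-center labels plus a mask of the n_outliers farthest points."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    dist = d2.min(axis=1)
    is_outlier = np.zeros(len(X), dtype=bool)
    # Points with the largest distance to their nearest center are set aside.
    is_outlier[np.argsort(dist)[len(X) - n_outliers:]] = True
    return nearest, is_outlier

def kmeans_minus_minus(X, k, n_outliers, init_centers=None, max_iter=100, seed=0):
    """Hypothetical sketch of k-means--; not the authors' code."""
    X = np.asarray(X, dtype=float)
    if init_centers is None:
        # Assumption: plain random initialization from the data points.
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]
    else:
        centers = np.asarray(init_centers, dtype=float)
    for _ in range(max_iter):
        nearest, is_outlier = assign(X, centers, n_outliers)
        new_centers = centers.copy()
        for j in range(k):
            # Only non-outlier members contribute to the center update.
            members = ~is_outlier & (nearest == j)
            if members.any():
                new_centers[j] = X[members].mean(axis=0)
        converged = np.allclose(new_centers, centers)
        centers = new_centers
        if converged:
            break
    # Final assignment against the final centers.
    nearest, is_outlier = assign(X, centers, n_outliers)
    return centers, nearest, is_outlier

def f1_and_false_alarm_rate(y_true, y_pred):
    """F1-score and false alarm rate (assumed here to be FP / (FP + TN))."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    tn = np.sum(~y_true & ~y_pred)
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    far = fp / (fp + tn) if (fp + tn) else 0.0
    return float(f1), float(far)

# Toy data (illustrative only): two tight groups and two distant points.
X = [[0, 0], [0, 1], [1, 0], [1, 1],
     [10, 10], [10, 11], [11, 10], [11, 11],
     [50, 50], [-40, 30]]
y_true = [0] * 8 + [1] * 2  # 1 marks an anomaly
centers, labels, is_outlier = kmeans_minus_minus(
    X, k=2, n_outliers=2, init_centers=[[0, 0], [10, 10]])
f1, far = f1_and_false_alarm_rate(y_true, is_outlier)
print(centers, is_outlier, f1, far)
```

Both the number of clusters `k` and the number of outliers `n_outliers` must be chosen in advance, so on real data the result depends on how well those values match the data set. Random initialization can also lead to different local optima, which is why the toy example fixes the starting centers.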
