The effect of imbalance data mitigation techniques on cardiovascular disease prediction

Raphael Ozighor Enihe; Rajesh Prasad; Francisca Nonyelum Ogwueleka; Fatimah Binta Abdullahi

doi:10.46481/jnsps.2025.2385

Authors

Raphael Ozighor Enihe
Department of Computer Science, Baze University, Abuja, Nigeria https://orcid.org/0000-0001-8155-4205
Rajesh Prasad, Department of Computer Science & Engineering, Ajay Kumar Garg Engineering College, Ghaziabad, India; Department of Computer Science, University of Abuja, Abuja, Nigeria
Francisca Nonyelum Ogwueleka, Department of Computer Science, University of Abuja
Fatimah Binta Abdullahi, Department of Computer Science, University of Abuja, Abuja, Nigeria

Keywords:

Imbalance dataset, Cardiovascular disease prediction, SMOTE-TOMEK, Machine learning, Overfitting and Underfitting

Abstract

Class imbalance is a common challenge in medical datasets and can adversely affect the performance of machine learning models. This paper explores how several data imbalance mitigation techniques affect the performance of cardiovascular disease (CVD) prediction. The techniques were applied to a real-life CVD dataset of 1000 patient records with 14 features obtained from the University of Abuja Teaching Hospital, Nigeria. The data balancing techniques used are random under-sampling, the Synthetic Minority Over-sampling Technique (SMOTE), SMOTE combined with Edited Nearest Neighbours (SMOTE-ENN), and SMOTE combined with Tomek Links under-sampling (SMOTE-TOMEK). After applying these techniques, their performance was evaluated on seven machine learning models: Random Forest, XGBoost, LightGBM, Gradient Boosting, K-Nearest Neighbours, Decision Tree, and Support Vector Machine. The evaluation metrics used are precision, recall, F1-score, accuracy, and area under the receiver operating characteristic curve (ROC-AUC). Learning curve plots were also used to show the impact of the different data balancing techniques on overfitting and underfitting. The results showed that applying data balancing techniques significantly enhances the performance of machine learning models in heart disease prediction and effectively addresses overfitting and underfitting. SMOTE-TOMEK yielded the best-balanced fit as well as the highest precision, recall, F1-score, accuracy of 92%, and ROC-AUC of 96% on the Light Gradient Boosting Machine (LightGBM) model. These results underscore the critical role of data balancing in predictive modelling for heart disease and highlight the effectiveness of specific techniques and models in achieving accurate, more reliable, and generalised predictions.
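For readers unfamiliar with the best-performing technique named in the abstract, the following is a minimal, illustrative sketch of the general SMOTE-plus-Tomek-Links idea in plain Python. It is not the authors' implementation, and the paper's dataset is not used. The function names (`smote`, `tomek_links`, `smote_tomek`), the parameters `k` and `seed`, the binary 0/1 labels, and the choice to remove both members of each Tomek link are all assumptions made for illustration.

```python
import random
from math import dist  # Euclidean distance, Python 3.8+


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each placed on the segment between a
    randomly chosen minority sample and one of its k nearest minority
    neighbours (the core SMOTE interpolation step)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [j for j in range(len(minority)) if j != i]
        nearest = sorted(others, key=lambda j: dist(base, minority[j]))[:k]
        other = minority[rng.choice(nearest)]
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def tomek_links(X, y):
    """Return index pairs (i, j), i < j, that are mutual nearest neighbours
    and carry different class labels."""
    def nearest(i):
        return min((j for j in range(len(X)) if j != i),
                   key=lambda j: dist(X[i], X[j]))
    nn = [nearest(i) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]


def smote_tomek(X, y, minority_label=1, k=2, seed=0):
    """Oversample the minority class up to the majority size, then drop
    every point that takes part in a Tomek link (assumption: both members
    of a link are removed)."""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_majority = len(X) - len(minority)
    new_points = smote(minority, n_majority - len(minority), k, seed)
    X_all = list(X) + new_points
    y_all = list(y) + [minority_label] * len(new_points)
    drop = {i for pair in tomek_links(X_all, y_all) for i in pair}
    keep = [i for i in range(len(X_all)) if i not in drop]
    return [X_all[i] for i in keep], [y_all[i] for i in keep]


if __name__ == "__main__":
    # Toy two-feature data: six majority (0) and three minority (1) samples.
    X = [(0.0, 0.0), (0.5, 0.2), (1.0, 0.1), (0.2, 0.9), (0.8, 1.0),
         (1.2, 0.6), (3.0, 3.0), (3.4, 3.1), (3.1, 3.6)]
    y = [0, 0, 0, 0, 0, 0, 1, 1, 1]
    X_res, y_res = smote_tomek(X, y)
    print(len(X_res), y_res.count(0), y_res.count(1))
```

In practice a maintained library implementation would normally be preferred over a hand-written one, and resampling would be applied to the training split only, so that the test data keeps its original class distribution.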
