Addressing class imbalance in Lassa fever epidemic data, using machine learning: a case study with SMOTE and random forest

Authors

  • Osowomuabe Njama-Abang, Department of Computer Science, University of Calabar, PMB 1115, Etta Agbo Rd, Calabar, Nigeria. https://orcid.org/0000-0001-5271-1267
  • Denis U. Ashishie, Department of Computer Science, University of Calabar, PMB 1115, Etta Agbo Rd, Calabar, Nigeria
  • Paul T. Bukie, Department of Computer Science, University of Calabar, PMB 1115, Etta Agbo Rd, Calabar, Nigeria

Keywords:

Lassa fever, Machine learning, SMOTE, Random forest, Class imbalance

Abstract

Class imbalance in epidemiological datasets, particularly for rare outcomes such as Lassa fever fatalities, complicates predictive modeling. This study addresses the issue by applying the Synthetic Minority Over-sampling Technique (SMOTE) to rebalance the dataset and Random Forest for classification, while identifying significant predictors such as age, symptom severity, and residence. SMOTE balanced the dataset and mitigated the bias toward majority classes: minority-class recall for Random Forest improved from 0.60 to 1.00. Without SMOTE, models including Random Forest, XGBoost, and LightGBM achieved high accuracy (> 99%) but poor minority-class recall (≤ 0.75), confirming the challenge posed by imbalanced data. After SMOTE balancing, these models achieved 100% accuracy, precision, recall, and F1-score across the major classes. Notably, the hybrid ensemble model further improved outcomes, achieving an F1-score of 0.80 for the rarest class. These results indicate that combining SMOTE with Random Forest classifies underrepresented outcomes better than relying on Random Forest alone, demonstrating its value in developing equitable predictive tools for outbreak management.
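The two ideas in the abstract, that accuracy can hide poor minority-class recall and that SMOTE rebalances classes by interpolating between minority samples, can be illustrated with a minimal sketch. This is not the authors' pipeline: the toy data, the class labels "recovered" and "fatal", the function names `smote` and `recall`, and the choice of k = 3 neighbours are all assumptions made for illustration, and a real study would normally use a tested library implementation together with an actual classifier.

```python
import math
import random
from collections import Counter


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours (simplified SMOTE)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


def recall(y_true, y_pred, label):
    """Fraction of true `label` cases that were predicted as `label`."""
    hits = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    total = sum(t == label for t in y_true)
    return hits / total


# Hypothetical toy data: 95 "recovered" cases and 5 "fatal" cases.
rng = random.Random(42)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(95)]
minority = [(rng.gauss(3, 0.5), rng.gauss(3, 0.5)) for _ in range(5)]

# A classifier that always predicts the majority class looks accurate
# but never identifies a fatal case.
y_true = ["recovered"] * 95 + ["fatal"] * 5
y_pred = ["recovered"] * 100
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
fatal_recall = recall(y_true, y_pred, "fatal")
print(accuracy, fatal_recall)  # 0.95 0.0

# Oversample the minority class until both classes have 95 samples.
synthetic = smote(minority, n_new=len(majority) - len(minority))
balanced_labels = y_true + ["fatal"] * len(synthetic)
print(Counter(balanced_labels))
```

In practice, oversampling of this kind is applied to the training split only, so that synthetic points do not leak into the data used to evaluate recall and F1-score.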

Published

2025-08-01

How to Cite

Addressing class imbalance in Lassa fever epidemic data, using machine learning: a case study with SMOTE and random forest. (2025). Journal of the Nigerian Society of Physical Sciences, 7(3), 2586. https://doi.org/10.46481/jnsps.2025.2586

Section

Computer Science
