Optimized Breast Cancer Classification using Feature Selection and Outliers Detection



  • A. B. Yusuf, Department of Information and Communication Technology, Usmanu Danfodiyo University, Sokoto State
  • R. M. Dima, Department of Computer Science, Federal University Dutsinma, Katsina State
  • S. K. Aina, Department of Computer Science, Federal University Gashua, Yobe State


Breast cancer is the second most commonly diagnosed cancer in women worldwide. Its incidence is rising, especially in developing countries, where the majority of cases are discovered late. Breast cancer develops when malignant tumors form in breast tissue. The absence of accurate prognostic models to help physicians recognize symptoms early makes it difficult to develop a treatment plan that would help patients live longer. Machine learning techniques have recently been used to improve the accuracy and speed of breast cancer diagnosis, and the more accurate the model, the more useful it is as a diagnostic aid. The primary difficulty for machine-learning systems developed to detect breast cancer is attaining the highest possible classification accuracy and selecting the most predictive features for increasing that accuracy. As a result, breast cancer prognosis remains a challenge. This research seeks to address a flaw in an existing technique that is unable to improve the classification of continuous-valued data, particularly its accuracy and its selection of optimal features for breast cancer prediction. To address these issues, this study examines the impact of outlier removal and feature reduction on the Wisconsin Diagnostic Breast Cancer Dataset, evaluated with seven different machine learning algorithms. The results show that, once outliers were removed from the dataset, the Logistic Regression, Random Forest, and AdaBoost classifiers achieved the highest accuracy of 99.12%. When feature selection was also applied to this filtered dataset, the Random Forest and Gradient Boosting classifiers achieved accuracies of 100% and 99.12%, respectively. When compared with other state-of-the-art approaches, both proposed strategies gave higher accuracy than the unfiltered data.
The proposed architecture could be a useful tool for radiologists to reduce the number of false negatives and false positives, and so improve the efficiency of breast cancer diagnosis.
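The two preprocessing steps named in the abstract, outlier removal followed by feature selection, can be illustrated with a minimal sketch. The abstract does not say which outlier rule or feature-selection criterion was used, so the interquartile-range (Tukey fence) filter and the class-mean separation score below are assumptions chosen for illustration, and the toy rows are hypothetical values rather than records from the Wisconsin Diagnostic Breast Cancer Dataset.

```python
from statistics import mean, pstdev, quantiles


def iqr_bounds(values, k=1.5):
    """Return the (low, high) Tukey fences for one feature column."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(rows, labels, k=1.5):
    """Keep only rows whose every feature lies inside its column's fences."""
    bounds = [iqr_bounds(col, k) for col in zip(*rows)]
    kept = [
        (row, label)
        for row, label in zip(rows, labels)
        if all(low <= v <= high for v, (low, high) in zip(row, bounds))
    ]
    return [row for row, _ in kept], [label for _, label in kept]


def rank_features(rows, labels):
    """Rank feature indices by |mean(class 1) - mean(class 0)| / column std.

    This is an illustrative filter-style score (an assumption), not the
    selection method used in the paper.
    """
    scores = []
    for j, col in enumerate(zip(*rows)):
        pos = [v for v, y in zip(col, labels) if y == 1]
        neg = [v for v, y in zip(col, labels) if y == 0]
        spread = pstdev(col) or 1.0  # guard against a constant column
        scores.append((abs(mean(pos) - mean(neg)) / spread, j))
    return [j for _, j in sorted(scores, reverse=True)]


# Hypothetical toy data: feature 0 separates the classes, feature 1 is noise,
# and the last row carries an extreme value in feature 0.
rows = [
    (1.0, 5.0), (1.2, 4.0), (0.9, 6.0), (1.1, 5.5),
    (3.0, 5.2), (3.1, 4.8), (2.9, 5.9), (3.2, 4.1),
    (50.0, 5.0),
]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1]

kept_rows, kept_labels = remove_outliers(rows, labels)
ranking = rank_features(kept_rows, kept_labels)
print(len(kept_rows), ranking)
```

In a full pipeline, the filtered rows restricted to the top-ranked features would then be passed to each classifier, so that accuracy can be compared against the same classifier trained on the unfiltered data.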





How to Cite

Yusuf, A. B., Dima, R. M., & Aina, S. K. (2021). Optimized Breast Cancer Classification using Feature Selection and Outliers Detection. Journal of the Nigerian Society of Physical Sciences, 3(4), 298–307. https://doi.org/10.46481/jnsps.2021.331



Original Research