A feature selection and scoring scheme for dimensionality reduction in a machine learning task

Authors

  • Philemon Uten Emmoh, Department of Computer Science, Federal University Wukari, P.M.B 1020, Katsina-Ala Road, Wukari, Taraba State, Nigeria
  • Christopher Ifeanyi Eke, Department of Computer Science, Federal University of Lafia, P.M.B 146, Lafia, Nasarawa State, Nigeria
  • Timothy Moses, Department of Computer Science, Federal University of Lafia, P.M.B 146, Lafia, Nasarawa State, Nigeria

Keywords

Algorithm, Dataset, Dimensionality reduction, Feature selection

Abstract

The selection of important features is vital in machine learning tasks that involve high-dimensional datasets with many features: it reduces the dimensionality of a dataset and improves model performance. Most existing feature selection techniques are restricted in the kinds of datasets to which they can be applied. This study proposes a feature selection technique based on the statistical lift measure to select important features from a dataset. The proposed technique is generic and can be applied to any binary classification dataset. It was tested on a lung cancer dataset and a happiness classification dataset, where it successfully determined the most important feature subset. Its effectiveness in selecting an important feature subset was evaluated against three existing techniques, namely Chi-Square, Pearson correlation and Information Gain. Both the proposed and the existing techniques were evaluated on five machine learning models using four standard evaluation metrics: accuracy, precision, recall and F1-score. On the lung cancer dataset, the proposed technique produced predictive accuracies of 0.919, 0.935, 0.919, 0.935 and 0.935 with logistic regression, decision tree, AdaBoost, gradient boosting and random forest, respectively; on the happiness classification dataset, it produced predictive accuracies of 0.758, 0.689, 0.724, 0.655 and 0.689 with random forest, k-nearest neighbour, decision tree, gradient boosting and CatBoost, respectively, outperforming the existing techniques in both cases.
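
The abstract does not reproduce the paper's exact scoring formula, but the underlying idea is close to the classical lift measure from association rule mining. The sketch below is a minimal illustration, not the authors' implementation: it assumes features binarised to 0/1 and a 0/1 class label, scores each feature by lift = P(y = 1 | x = 1) / P(y = 1), and keeps the features whose lift deviates most from 1 (lift near 1 means the feature and the class are roughly independent). All function names and data here are hypothetical.

```python
# Minimal sketch of lift-based feature scoring for binary classification.
# Assumptions (not taken from the paper): features are binarised to 0/1,
# the label y is 0/1, and lift(x -> y=1) = P(y=1 | x=1) / P(y=1).
import numpy as np
import pandas as pd


def lift_scores(X: pd.DataFrame, y: pd.Series) -> pd.Series:
    """Score each 0/1 feature by its lift with the positive class."""
    base_rate = y.mean()                  # P(y = 1) over the whole dataset
    scores = {}
    for col in X.columns:
        fired = X[col] == 1
        if not fired.any():               # feature never fires: no evidence
            scores[col] = 1.0
            continue
        conf = y[fired].mean()            # P(y = 1 | x = 1)
        scores[col] = conf / base_rate    # lift; 1.0 means independence
    return pd.Series(scores)


def select_top_k(X: pd.DataFrame, y: pd.Series, k: int) -> list:
    """Keep the k features whose lift deviates most from 1."""
    deviation = (lift_scores(X, y) - 1.0).abs()
    return deviation.sort_values(ascending=False).head(k).index.tolist()


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    X = pd.DataFrame(rng.integers(0, 2, size=(300, 10)),
                     columns=[f"f{i}" for i in range(10)])
    y = pd.Series((X["f0"] | X["f3"]).astype(int))   # f0, f3 drive the label
    print(select_top_k(X, y, k=3))                   # expect f0 and f3 on top
```

Ranking by |lift − 1| rather than raw lift treats strongly negative associations (lift well below 1) as equally informative as strongly positive ones, which is one plausible way to turn the lift measure into a feature score.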

Published

2025-02-01

How to Cite

Emmoh, P. U., Eke, C. I., & Moses, T. (2025). A feature selection and scoring scheme for dimensionality reduction in a machine learning task. Journal of the Nigerian Society of Physical Sciences, 7(1), 2273. https://doi.org/10.46481/jnsps.2025.2273

Issue

Vol. 7 No. 1 (2025)

Section

Computer Science
