Machine learning-based feature selection for ultra-high-dimensional survival data: a computational approach

Nahid Salma; Majid Khan Majahar Ali; Raja Aqib Shamim

doi:10.46481/jnsps.2025.2810

Authors

Nahid Salma, School of Mathematical Sciences, Universiti Sains Malaysia, 11800, Pulau Penang, Malaysia | Department of Statistics and Data Science, Jahangirnagar University, Savar, 1342, Dhaka, Bangladesh
Majid Khan Majahar Ali, School of Mathematical Sciences, Universiti Sains Malaysia, 11800, Pulau Penang, Malaysia
Raja Aqib Shamim, School of Mathematical Sciences, Universiti Sains Malaysia, 11800, Pulau Penang, Malaysia | Department of Mathematics, University of Kotli, 11100, Azad Jammu and Kashmir, Pakistan

Keywords:

Ultra-high dimension, Machine Learning, Feature Selection, Renal Cell Carcinoma, Survival Data

Abstract

Ultra-high-dimensional (UHD) survival data present significant computational challenges in biomedical research, particularly in Renal Cell Carcinoma (RCC), where genomic complexity complicates risk assessment. Effective feature selection is crucial for identifying key biomarkers that improve RCC diagnosis, prognosis, and treatment. This study evaluates machine learning (ML)-based feature selection methods to address limitations in scalability, feature redundancy, and predictive accuracy in UHD RCC survival data. Gene expression data for 4,224 differentially expressed genes across 74 individuals were analyzed using eight methods: LASSO, Elastic Net (EN), Adaptive LASSO, Group LASSO, Sure Independence Screening (SIS), Iterative SIS (ISIS), Smoothly Clipped Absolute Deviation (SCAD), and Support Vector Machine (SVM). Models were assessed using Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R² values. SCAD demonstrated the best predictive performance (MSE: 529.00, RMSE: 23.00, R²: 0.69), surpassing ISIS (R²: 0.61), SIS (R²: 0.60), and EN (R²: 0.57). LASSO and Adaptive LASSO underperformed. SCAD identified 14 key genes as potential RCC biomarkers: NCAM1, ATP1B3, NAT8, MT2A, GTF2F2, X4197, GUCY2C, SLC3A1, CRYZ, DES, MT1L, NFYB, PRKAR2B, and CLIP1. Gene interaction network analysis confirmed their role in RCC progression. Despite SCAD’s strong performance, it left 31% of data variability unexplained, suggesting that hybrid ML models integrating ensemble learning, two-component regression structures, and deep learning-based feature selection could further enhance gene selection and predictive accuracy. This research supports SDG 3 (Good Health and Well-being) and SDG 9 (Industry, Innovation, and Infrastructure) by advancing precision medicine, early RCC detection, and biomedical data-driven innovations for improved clinical decision-making.
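
The three evaluation metrics named in the abstract are linked by simple arithmetic, which the following minimal Python sketch illustrates. It assumes the standard textbook definitions of MSE, RMSE, and R²; the abstract does not state the exact formulas used in the study. The function name `regression_metrics` and the toy values are hypothetical and are not taken from the paper. Only the final two lines use figures reported in the abstract.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MSE, RMSE, R^2) using the standard definitions (an assumption)."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_y = sum(y_true) / n
    # Residual and total sums of squares
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    mse = ss_res / n
    rmse = math.sqrt(mse)
    r2 = 1.0 - ss_res / ss_tot
    return mse, rmse, r2


# Toy values invented for illustration only (not data from the study).
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]
mse, rmse, r2 = regression_metrics(y_true, y_pred)

# Consistency checks on the figures reported in the abstract for SCAD.
reported_rmse = math.sqrt(529.00)             # square root of the reported MSE
unexplained_share = round(1.0 - 0.69, 2)      # share of variability not explained
```

Under these definitions the reported SCAD figures are mutually consistent: the square root of the MSE of 529.00 is the RMSE of 23.00, and an R² of 0.69 corresponds to the 31% of variability described as unexplained.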
