Robust hybrid algorithms for regularization and variable selection in QSAR studies

Authors

  • Christian N. Nwaeme, African Institute for Mathematical Sciences, Mbour-Thies, BP 1418, Senegal
  • Adewale F. Lukman, University of Medical Sciences, Ondo State, PMB 536, Nigeria | University of North Dakota, Grand Forks, ND, USA

Keywords

High dimension, QSAR, Multicollinearity, Outliers, Sparse least trimmed squares, Random forest

Abstract

This study introduces a robust hybrid sparse learning approach for regularization and variable selection. The approach comprises two distinct steps. In the first step, we split the original dataset into separate training and test sets and standardize the training data using its mean and standard deviation. We then apply either the LASSO or the sparse LTS algorithm to the training set and retain the variables with non-zero coefficients as the features of a new, reduced dataset. In the second step, the new dataset is again divided into training and test sets; the training set is partitioned into k folds and evaluated with Random Forest, Ridge, LASSO, and Support Vector Regression. We introduce novel hybrid methods and compare their performance with existing techniques. To validate the proposed methods, we conduct a comprehensive simulation study and apply them to a real-life QSAR analysis. The findings demonstrate the superior performance of the proposed estimators, with SLTS+LASSO performing best. In summary, the two-step robust hybrid sparse learning approach offers effective regularization and variable selection applicable to a wide spectrum of real-world problems.
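For readers who want a concrete picture of the two-step procedure, the following is a minimal Python sketch built on scikit-learn with synthetic placeholder data. It is an illustration of the workflow described above, not the authors' implementation: the robust sparse LTS step is stood in for by ordinary LASSO, since a sparse LTS solver (e.g. sparseLTS in the R package robustHD) is not available in scikit-learn, and all dataset sizes and hyperparameters are assumptions.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV, Ridge
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Placeholder data standing in for a QSAR descriptor matrix.
X, y = make_regression(n_samples=200, n_features=100, n_informative=10,
                       noise=5.0, random_state=0)

# Step 1: split, standardize with the training mean/sd only, and keep the
# descriptors with non-zero LASSO coefficients.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
scaler = StandardScaler().fit(X_tr)
lasso = LassoCV(cv=5, random_state=0).fit(scaler.transform(X_tr), y_tr)
selected = np.flatnonzero(lasso.coef_)     # indices of retained descriptors
X_new = X[:, selected]                     # reduced dataset for step 2

# Step 2: re-split the reduced data and score each candidate learner with
# k-fold cross-validation on the new training set.
X2_tr, X2_te, y2_tr, y2_te = train_test_split(X_new, y, test_size=0.3,
                                              random_state=1)
models = {
    "Random Forest": RandomForestRegressor(n_estimators=200, random_state=0),
    "Ridge": Ridge(alpha=1.0),
    "LASSO": LassoCV(cv=5, random_state=0),
    "SVR": SVR(kernel="rbf", C=10.0),
}
for name, model in models.items():
    pipe = make_pipeline(StandardScaler(), model)  # standardize inside each fold
    rmse = -cross_val_score(pipe, X2_tr, y2_tr, cv=5,
                            scoring="neg_root_mean_squared_error").mean()
    print(f"{name}: 5-fold CV RMSE = {rmse:.3f}")

As the abstract describes, replacing the LASSO fit in step 1 with a sparse LTS fit would yield the SLTS-based hybrids (e.g. SLTS+LASSO); the rest of the pipeline is unchanged.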


Published

2023-11-28

How to Cite

Nwaeme, C. N., & Lukman, A. F. (2023). Robust hybrid algorithms for regularization and variable selection in QSAR studies. Journal of the Nigerian Society of Physical Sciences, 5(4), 1708. https://doi.org/10.46481/jnsps.2023.1708

Issue

Vol. 5 No. 4 (2023)

Section

Original Research
