Robust hybrid algorithms for regularization and variable selection in QSAR studies

This study introduces a robust hybrid sparse learning approach for regularization and variable selection. The approach comprises two distinct steps. In the first step, we split the original dataset into separate training and test sets and standardize the training data using its mean and standard deviation. We then employ either the LASSO or the sparse LTS algorithm on the training set, selecting the variables with non-zero coefficients as the features of a new dataset. In the second step, the new dataset is divided into training and test sets. The training set is further divided into k folds and evaluated using a combination of Random Forest, Ridge, LASSO, and Support Vector Regression machine learning algorithms. We introduce novel hybrid methods and compare their performance against existing techniques. To validate the efficacy of our proposed methods, we conduct a comprehensive simulation study and apply them to a real-life QSAR analysis. The findings demonstrate the superior performance of our proposed estimators, with particular distinction accorded to SLTS + LASSO. In summary, the two-step robust hybrid sparse learning approach offers effective regularization and variable selection applicable to a wide spectrum of real-world problems.


Introduction
The Quantitative Structure-Activity Relationship (QSAR) dates back to the nineteenth century and has since been employed in different fields for risk assessment, drug discovery, toxicity prediction, and regulatory decisions. QSAR models adopt supervised machine learning models, such as regression and classification, and seek to predict a response variable, such as the biological activity of a chemical, from a set of predictors, such as the physicochemical properties of synthetic chemical drugs or theoretical molecular descriptors of chemicals [1][2][3][4]. Furthermore, mathematical and statistical QSAR models have proven to be among the best computational methods in drug discovery, saving time and resources. As a result, QSAR research is becoming more prominent in finding new drugs [5]. QSAR models have also been used to deduce the activity of a chemical compound from its structural features. Numerous studies exist on QSAR modelling. For instance, it has become an essential process in the pharmaceutical industry, albeit with certain limitations. QSAR data may include hundreds of thousands of chemical descriptors, leading to high-dimensional data. Such high-dimensional data have become more common in computational chemistry studies, where there are more molecular descriptors than molecules [6]. As a result, the significance of quantitative structure-activity relationship (QSAR) studies has increased in this field, concerning the structural characteristics of a group of chemical substances, with the goal of QSAR being to simulate various biological processes [2].
The commonly used fingerprints in QSAR modelling often result in correlated features and sparsity, with some values being zero. These issues make it challenging for QSAR-based models to achieve accurate predictions. The least squares method is not appropriate for QSAR models, since the X^T X matrix becomes non-invertible in high-dimensional data [7]. Therefore, more stable predictions in QSAR modelling are often achieved using machine learning models such as Bayesian neural networks and others [8,9].
In high-dimensional modelling, an efficient dimension reduction method is essential to provide parsimonious models with strong prediction ability and interpretation. The availability of high-dimensional statistics in computational chemistry is increasing, but the selection of molecular descriptors remains a critical challenge in QSAR investigations. The significant variation of QSAR models generally leads to poor prediction performance. Therefore, it is necessary to improve prediction accuracy by selecting only the most critical molecular predictors. Other factors, such as the optimization of the chemical shape, the modelling technique, the risk of getting stuck in local minima, redundancy, and over-fitting, also greatly influence a QSAR model's ability to make suitable predictions.
Over the past decade, there has been an increased focus on big data as researchers seek to address critical issues with QSAR models such as redundancy, over-fitting, and being stuck in local minima [10]. Since 2015, deep learning architectures have gained preference over shallow learning models. These architectures have become popular as computational drug design tools because they can detect complex statistical patterns among the vast number of descriptors extracted from various compounds. Deep learning architectures used in QSAR applications include Artificial Neural Networks (ANN), Convolutional Neural Networks, Recurrent Neural Networks, and Support Vector Machines (SVM), which utilize multiple levels of linear and nonlinear techniques.
To increase prediction accuracy and address computational issues with high-dimensional data, the objective function of the regression can be modified by adding a penalty term on the regression coefficients. However, this strategy results in a trade-off between reduced variance and increased bias. Therefore, traditional statistical topics such as regularization and variable selection have received significant attention. Ridge regression [11] is an example of a regularization technique that reduces the residual sum of squares while maintaining a predetermined range for the L2 norm of the coefficients. Ridge regression balances bias and variance to achieve optimal prediction performance but always includes all predictors in the model, failing to yield a parsimonious model. In contrast, [12] highlights that although best subset selection creates a sparse model, it is highly variable due to its inherent discreteness.
Tibshirani proposed a promising approach called the lasso [13]. The lasso is a penalized least squares method that penalizes the regression coefficients with an L1 penalty. Owing to the properties of the L1 penalty, the lasso performs continuous shrinkage and automatic variable selection simultaneously. When the lasso, ridge, and bridge regressions [14] were compared for prediction performance, Tibshirani and Fu found that none of them consistently outperformed the others [13,15]. However, given the growing importance of variable selection in contemporary data analysis, the lasso is considerably more appealing because of its ability to produce a sparse representation. Despite its limitations, the lasso has been effective in many situations. Its limitations include: (a) The lasso may select only one variable out of a set of highly correlated variables, making the selected variable somewhat arbitrary. (b) When the number of predictors is much larger than the sample size, the lasso can select at most as many variables as there are observations before it saturates.
Zou and Hastie developed the elastic-net approach by combining L2 and L1 penalties on the regression coefficients [16]. The elastic net aims to group strongly correlated variables together, resulting in their joint inclusion in or exclusion from the model. It performs best when there are high absolute pairwise correlations among the groups. In the case of correlated data, the elastic net often outperforms the lasso in terms of prediction error. However, since it does not reveal the underlying group structure in its solution, the elastic net may not perform well when the groups change and have only modest pairwise correlations. In QSAR studies, LASSO and the elastic net have yielded fascinating results in terms of variable selection, estimation, and prediction [17][18][19][20][21].
Penalized regression techniques, such as the lasso, elastic net, and others, are known to be sensitive to outliers or unusual observations, which are common problems in QSAR modelling [21]. It is essential to understand that these methods can become entirely untrustworthy with just one anomaly, which can negatively impact the prediction outcome. To address outliers in low-dimensional data, robust alternatives such as the Least Absolute Deviation (LAD) and Least Trimmed Squares (LTS) estimators are recommended [22,23]. These estimators are effective in handling outliers in the y-direction but do not perform variable selection.
To address both outlier detection and variable selection, Wang et al. [23] developed the LAD-LASSO, which adds an L1 penalty to the LAD regression for robust prediction and variable selection. More recently, the sparse least trimmed squares regression was proposed by adding an L1 penalty to the LTS regression, combining outlier detection and variable selection in a robust way [24].
Deep learning models are powerful algorithms that have shown great promise in various research fields, including the pharmaceutical industry, for addressing regression and classification problems. However, deep learning algorithms also have some drawbacks, such as high computational time, over-fitting, and a requirement for a large amount of data and memory space [25].
This study will not focus on deep learning models, such as artificial neural networks, due to the nature of the adopted data and the need for faster computation. Instead, a variety of techniques have been developed to mitigate the core limitations of deep learning models, such as long running times and high processing demands. Random Forest (RF), Bagging, and Support Vector Regression are some of the extensively utilized variable selection algorithms in computational drug design, as they offer criteria for obtaining the most crucial descriptors. Additionally, algorithms such as multivariate adaptive regression splines, Relief, and Boruta have also been used [26].
In recent studies, hybrid algorithms have been adopted to enhance prediction. For instance, Motamedi et al. [25] proposed LASSO-RF, which selects molecular descriptors using LASSO and predicts using random forest. Liu and Qin [27] developed a two-step approach by applying Lasso to the training data and then performing regularization on the selected features using elastic net and ridge regression. They concluded that the two-step algorithm produced more optimal models than LASSO alone. More recently, in a QSAR study, molecular descriptors were selected using LAD-LASSO and biological activity was predicted using artificial neural networks (ANN) [28].
This study aims to develop a new hybrid approach for selection and prediction. We selected molecular descriptors from the QSAR data using LASSO and sparse LTS and predicted biological activity using random forest, support vector, and ridge regressions. Additionally, we conducted a simulation study with high-dimensional data and contaminated the data with outliers in the response variable. Finally, we compared the performance of the algorithms using the root mean squared error, mean absolute deviation, and median absolute error. Section 2 provides an exhaustive exploration of established methodologies, while Section 3 introduces our novel approach. Section 4 focuses on the simulation studies and real-life analyses, and Section 5 gives the concluding remarks.

Literature Review: Concepts and Mathematical Model
In this section, we briefly review two important concepts in regression analysis: multicollinearity and outliers. We then discuss in detail several popular estimators introduced in the previous section: ridge regression, lasso regression, Random Forest, support vector regression, and sparse LTS. By the end of this section, the reader will have a comprehensive understanding of these estimators and their applications in regression analysis.

Regularization
Regularization is a technique to combat over-fitting that reduces generalization error while minimally affecting the training error. Overfitting often occurs when overly complex models are used to fit the training data, while underfitting happens when the model is too simple. Therefore, it is crucial to select an appropriate level of complexity for the model. However, this task is challenging, as the right complexity cannot be determined solely from the provided training data and requires careful consideration.

Different types of regularization techniques
There are different types of regularization techniques, and they affect the model very differently. Ridge regression addresses some of the limitations of linear regression. While linear regression can produce estimates with large magnitudes and high variance, ridge regression adds a constraint to the ordinary least squares (OLS) method to shrink the regression coefficients towards zero. This regularization reduces the variance of the estimates and the prediction error without overly compromising bias. Specifically, ridge regression minimizes a penalized residual sum of squares (RSS), similar to OLS, but with a penalty term that depends on a tuning parameter controlling the amount of shrinkage. As a result, the ridge regression estimates are biased but have less variance than the OLS estimates [11].

Mathematical Formulation
The coefficients for ridge regression are obtained by minimizing the residual sum of squares (RSS) subject to an additional constraint:

$$\hat{\delta}^{\,ridge} = \arg\min_{\delta}\left\{\mathrm{RSS}(\delta) + \alpha\sum_{i=1}^{N}\delta_i^2\right\},$$

where α is the tuning parameter, which we explain in the following section, and $\sum_{i=1}^{N}\delta_i^2$ is the squared norm of the vector δ. The term $\alpha\sum_{i=1}^{N}\delta_i^2$ is referred to as a "shrinkage penalty," and the L2 norm is defined by $\|\delta\|_2^2 = \sum_{i=1}^{N}\delta_i^2$. In other words, the ridge coefficients $\hat{\delta}^{\,ridge}$ minimize a penalized RSS; we refer to this as an L2 penalty, as the penalty is determined by the L2 norm [29].
To determine the δ ridge parameters, we consider two critical conventions of ridge regression. First, the intercept is not subjected to a penalty. Second, normalizing the predictors is essential. Unlike ordinary least squares (OLS) estimates, which simply scale inversely and proportionally when a predictor is multiplied by a constant, ridge coefficients can be significantly altered by such rescaling. Therefore, we standardize each predictor by subtracting the mean and dividing by the standard deviation of the corresponding variable in the training set, as recommended by Friedman et al. [30].
Next, we discuss ridge regression in matrix algebra representation. The input consists of a centered n × p matrix X and a centered n-dimensional vector y. Both X and y have zero means, and the columns of X have unit variance. We standardize the inputs before transforming the minimization problem above into an L2 penalized problem in matrix form.
The ridge coefficients then take the closed form $\hat{\delta}^{\,ridge} = (X^TX + \alpha I)^{-1}X^Ty$, where I is the identity matrix.
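As a quick illustration, the closed form above can be computed directly with numpy; the simulated data and the value of α below are arbitrary choices for demonstration:

```python
import numpy as np

def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution (X^T X + alpha I)^{-1} X^T y.
    Assumes X is centered/standardized and y is centered."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize the predictors
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.standard_normal(50)
y = y - y.mean()                            # center the response

beta_ols = ridge_coefficients(X, y, alpha=0.0)    # alpha = 0 recovers OLS
beta_ridge = ridge_coefficients(X, y, alpha=10.0)
# the penalty shrinks the coefficient vector towards zero
print(np.linalg.norm(beta_ridge) < np.linalg.norm(beta_ols))
```

Increasing α shrinks the L2 norm of the coefficient vector, trading a little bias for reduced variance.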

LASSO (L 1 Regularization)
The Lasso regularization method, short for "Least Absolute Shrinkage and Selection Operator," is a technique that extends ridge regression by introducing two key features. In contrast to ridge regression, Lasso not only shrinks coefficients but can also reduce some of them to exactly zero, which is known as "sparsity." Another distinctive feature of Lasso is that it can identify and prioritize important variables by zeroing out specific coefficients, a property known as "variable selection." Together, these properties allow Lasso to perform both regularization and variable selection simultaneously [13].

Mathematical Formulation
This section focuses on the lasso algorithm. The Lagrangian formulation of the lasso is

$$\hat{\delta}^{\,lasso} = \arg\min_{\delta}\left\{\mathrm{RSS}(\delta) + \alpha\sum_{i=1}^{N}|\delta_i|\right\},$$

where the shrinkage parameter is denoted by α. The shrinkage penalty $\sum_{i=1}^{N}|\delta_i|$ is provided by the L1 norm of the vector δ, defined as $\|\delta\|_1 = \sum_{i=1}^{N}|\delta_i|$. The predictors are normalized, and the intercept, which is calculated as $\delta_0 = \bar{y}$, is not penalized. Therefore, the main difference between the lasso and ridge regression is that the lasso uses an L1 penalty whereas ridge regression uses an L2 penalty; the L1 penalty has the effect of shrinking some coefficients exactly to zero [29]. The least angle regression (LAR) approach, for instance, is one of several strategies that can be used to solve this quadratic programming problem [13].
We examine the matrix algebra formulation to elucidate the properties of the lasso estimates. In this case, the lasso coefficients are the solution of the L1 penalized problem

$$\hat{\delta}^{\,lasso} = \arg\min_{\delta}\left\{\|y - X\delta\|_2^2 + \alpha\|\delta\|_1\right\}.$$

Contrary to ridge regression, the coefficients $\hat{\delta}^{\,lasso}$ have no closed form, since the L1 penalty imposes an absolute value constraint that is not differentiable. Due to the non-smooth nature of the constraint, the solutions to the lasso problem are nonlinear in the $y_i$ [29].
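One special case does admit a closed form: for an orthonormal design (X^T X = I), each lasso coefficient is the soft-thresholded OLS coefficient. A minimal sketch, with made-up coefficient values for illustration:

```python
import numpy as np

def soft_threshold(z, alpha):
    """Soft-thresholding operator; for an orthonormal design X (X^T X = I),
    the lasso solution is soft_threshold(beta_ols, alpha)."""
    return np.sign(z) * np.maximum(np.abs(z) - alpha, 0.0)

beta_ols = np.array([3.0, -0.4, 1.2, 0.05])
beta_lasso = soft_threshold(beta_ols, alpha=0.5)
# large coefficients are shrunk by alpha; small ones are set exactly to zero
print(beta_lasso)
```

This makes the contrast with ridge concrete: ridge rescales every coefficient, whereas the L1 penalty truncates the small ones to exactly zero.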

Sparse Least Trimmed Squares (SLTS) models
Sparse Least Trimmed Squares (SLTS) models are a modification of the LTS regression method, a robust regression technique that is effective in the presence of outliers. The goal of sparse LTS is to identify a subset of the data that produces the lowest sum of squared residuals while, at the same time, enforcing sparsity in the model. By introducing sparsity constraints, sparse LTS helps identify the most relevant variables contributing to the regression, leading to a more interpretable and efficient model. This approach is particularly useful when dealing with high-dimensional data, where many of the variables may not be relevant to the regression task at hand. Sparse LTS is widely used in various fields, including finance, biology, and engineering.

Mathematical Formulation
Let $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})'$, for $i = 1, \ldots, n$, be the p-dimensional observations on the predictor variables, and let $y_1, \ldots, y_n$ be the corresponding observations on the response. The linear regression model is examined with a regression parameter $\delta = (\delta_1, \ldots, \delta_p)'$ and error terms $\varepsilon_i \sim N(0, \gamma)$. Applied statistics often face the challenge of outliers in the data, which can significantly impact the performance of penalized estimators like the lasso, ridge, and elastic net, all of which utilize the least squares loss function. To address this issue, several dependable alternatives have been proposed in the literature. One popular approach is to use penalized M-estimators, such as those proposed by Rosset and Zhu [31], Wang et al. [23], and Li et al. [32], which are designed to be resilient against outliers in the response variable but not necessarily in the predictor space.
However, to achieve robustness against outliers in the predictor space, one can regularize appropriate robust regression techniques, such as least trimmed squares, as proposed by Rousseeuw and Van Driessen [33]. These techniques can effectively address the issue of outliers in both the response and predictor variables. Therefore, it is essential to carefully select an appropriate approach based on the nature of the data and the research question at hand to ensure accurate and reliable statistical inference.
The Least Trimmed Squares (LTS) regression is a widely recognized and extensively studied method for dealing with outliers in regression analysis [22]. It is a commonly used robust regression model due to its simple specification and efficient computation. With residuals $r_i(\delta) = y_i - x_i'\delta$, the vector of squared residuals can be denoted as $r^2(\delta) = (r_1^2, \ldots, r_n^2)'$. The LTS model minimizes the sum of squared residuals over a subset of the data containing a specified proportion of the observations, namely those with the smallest squared residuals, with δ chosen to minimize this trimmed sum. The LTS model provides a balance between robustness and efficiency, making it a useful tool in many applications where outliers are of concern.
The squared residuals are ordered as $(r^2(\delta))_{1:n} \le \cdots \le (r^2(\delta))_{n:n}$, with $h \le n$. The goal of LTS regression is to identify the subset of h observations whose least squares fit yields the lowest sum of squared residuals, i.e., to minimize $\sum_{i=1}^{h}(r^2(\delta))_{i:n}$. The choice of h determines the proportion of observations retained and can be used to control the subset length. Although LTS regression is a robust method, it does not produce sparse models, and when $h \le p$ the LTS estimate cannot be computed. To address this issue, an L1 penalty with penalty parameter α can be added to the LTS objective function, resulting in a sparse and regularized LTS model, often referred to as sparse LTS.
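The trimming idea behind (sparse) LTS can be sketched with plain concentration steps: repeatedly refit on the h observations with the smallest squared residuals. The sketch below omits the L1 penalty (sparse LTS would replace the OLS fit with a lasso fit) and uses simulated data of our own choosing:

```python
import numpy as np

def lts_fit(X, y, h, n_iter=20):
    """Plain LTS via concentration steps: refit OLS on the h observations
    with the smallest squared residuals under the current fit."""
    subset = np.arange(h)                      # arbitrary starting subset
    for _ in range(n_iter):
        beta, *_ = np.linalg.lstsq(X[subset], y[subset], rcond=None)
        subset = np.argsort((y - X @ beta) ** 2)[:h]
    return beta

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 2))
beta_true = np.array([2.0, -1.0])
y = X @ beta_true + 0.1 * rng.standard_normal(100)
y[:10] += 15.0                                 # vertical outliers in the response

beta_lts = lts_fit(X, y, h=75)
# the trimmed fit stays close to the true coefficients despite the outliers
print(np.round(beta_lts, 2))
```

After the first refit, the shifted responses acquire the largest residuals and are dropped from the subset, so the contaminated points stop influencing the estimate.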
The high breakdown point of sparse LTS has been demonstrated by Alfons et al. [24]. It is resistant to leverage points and vertical outliers. In addition to being highly robust and behaving like LTS, sparse LTS
• increases prediction accuracy by decreasing variance when the sample size is small relative to the dimension,
• ensures improved interpretability through simultaneous model selection, and
• overcomes the computational problems of traditional robust regression strategies when managing high-dimensional data.

Theorems
1. The SLTS model's breakdown point: The replacement finite-sample breakdown point is the most commonly used measure of an estimator's robustness [22]. Let $N = (X, y)$ represent the sample. The breakdown point of a regression estimator $\hat{\delta}$ is defined as

$$\varepsilon^*(\hat{\delta}; N) = \min\left\{\frac{m}{n} : \sup_{\tilde{N}}\|\hat{\delta}(\tilde{N})\|_2 = \infty\right\},$$

where $\tilde{N}$ is the corrupted data obtained from N by substituting arbitrary values for m of the original n data points.
2. Let $\varphi(x)$ be a convex and symmetric loss function with $\varphi(0) = 0$ and $\varphi(x) > 0$ for $x \ne 0$, applied componentwise as $\varphi(x) := (\varphi(x_1), \ldots, \varphi(x_n))$. Consider the regression estimator with subset size $h \le n$, where $(\varphi(y - X\delta))_{1:n} \le \cdots \le (\varphi(y - X\delta))_{n:n}$ are the order statistics of the regression loss. The breakdown point of the estimator $\hat{\delta}$ is then

$$\varepsilon^*(\hat{\delta}; N) = \frac{n - h + 1}{n},$$

and it remains the same for any loss function φ that meets the assumptions. For the SLTS estimator $\hat{\delta}_{SLTS}$ with subset size $h \le n$, for which $\varphi(x) = x^2$, the breakdown point is therefore $(n - h + 1)/n$. The breakdown point increases as h decreases, and it is possible to have a breakdown point greater than 50% by taking h small enough [24].
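The breakdown-point formula is easy to evaluate; for example, with n = 100 observations:

```python
def breakdown_point(n, h):
    """Replacement finite-sample breakdown point (n - h + 1) / n of the
    (sparse) LTS estimator with subset size h, per Alfons et al. [24]."""
    return (n - h + 1) / n

# with n = 100 and h = 75, the breakdown point is 26/100
print(breakdown_point(100, 75))
# smaller subsets give higher breakdown points; h = 50 already exceeds 50%
print(breakdown_point(100, 50))
```

So trimming a quarter of the sample (h = 75) tolerates up to 26% contamination, while shrinking the subset raises the tolerance further at the cost of efficiency.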

Random Forest
Breiman's ideas in machine learning were significantly influenced by a number of pioneering methods, including the early random subspace method of Ho [34], the geometric variable selection work of Amit and Geman [35], and the random split selection approach of Dietterich [36]. These techniques have since paved the way for more advanced methods such as boosting [37] and support vector machines, but none have been able to match the performance and versatility of random forests (RF). Random forests have proven to be highly effective in handling a large number of input variables without over-fitting, while also being simple and quick to implement, and producing highly accurate predictions. They are widely regarded as one of the most precise and reliable all-purpose learning methods available. For readers seeking a deeper understanding of random forests and related methods, the survey conducted by Genuer et al. [38] provides valuable insights and a solid foundation.
Random Forest (RF) is a powerful machine-learning technique that combines the results of multiple decision trees to produce robust and accurate predictions. In an RF model, each decision tree is built using a bootstrap sample of the training data and only a random subset of the available input features. Predictions are made by aggregating the individual tree predictions through either majority voting or averaging, depending on the task at hand.
In regression, the final predicted value is the average of the predicted values of each tree. The RF algorithm grows each tree on a bootstrap sample of the training set and uses the out-of-bag (OOB) observations to estimate the model's generalization performance. The CART algorithm is used to choose the best split at each node among a random subset of the available input features. RF models do not perform pruning and require few tuning parameters.
The predictive ability of an RF model is evaluated using the $\Phi^2_{abs}$ determination coefficient on an external validation set.
RF models also offer useful features such as out-of-bag predictions for error estimation, natural proximity estimation between two substances, and variable importance metrics based on the difference in OOB error rate when a descriptor is permuted.
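The averaging, OOB estimation, and importance ranking described above can be sketched with scikit-learn's RandomForestRegressor; the simulated data and settings are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
# only the first two descriptors carry signal
y = 2.0 * X[:, 0] - X[:, 1] + 0.1 * rng.standard_normal(200)

# oob_score=True requests the out-of-bag estimate of generalization performance
rf = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)

# in regression the forest prediction is the average of the per-tree predictions
x_new = np.zeros((1, 5))
per_tree = np.array([tree.predict(x_new)[0] for tree in rf.estimators_])
print(np.isclose(per_tree.mean(), rf.predict(x_new)[0]))

# impurity-based importances rank the two informative descriptors first
print(set(np.argsort(rf.feature_importances_)[-2:]))
```

Averaging the individual trees yields exactly the forest prediction, and the importance scores recover the informative descriptors.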
In conclusion, RF is a robust and effective technique for QSAR modelling, especially when the number of available input features is high.The ensemble nature of the method, along with its ability to handle noisy and correlated data, makes it a popular choice for many applications.

Support Vector Regression
Vladimir Vapnik is recognized as the pioneer of Support Vector Machines (SVMs), a type of supervised learning machine that generalizes well on a variety of learning patterns by using the structural risk minimization inductive principle. The structural risk minimization (SRM) approach aims to minimize both the empirical risk and the VC (Vapnik-Chervonenkis) dimension simultaneously. The theory was developed by Vapnik and his colleagues based on a separable bipartition problem. SVMs can recognize subtle patterns in large volumes of data, making them an effective method in applications such as pattern recognition [39].
SVMs are divided into two categories: support vector classification (SVC) and support vector regression (SVR). SVMs are feature space-based learning techniques that operate in high dimensions, generating prediction functions based on a subset of support vectors. The SVM model for classification depends only on a subset of the training data, since the cost function for constructing the model disregards any training points that lie outside the margin. Similarly, the SVR model depends only on a subset of the training data, because the cost function for building the model ignores any training data that lies within a threshold ε of the model prediction. SVR uses kernels, sparse solutions, and VC control over the margin and number of support vectors, similar to classification. Support Vector Regression (SVR) is the most common use of SVMs. The basic concepts of support vector machines for regression and function estimation were outlined by Vapnik et al. [40] and Smola et al. [41]. Furthermore, SVMs offer several training techniques for handling large datasets and quadratic or convex programming. The classic SV algorithm has been modified and extended with regularization and capacity control from an SV perspective. SVR is a supervised learning technique that uses a symmetrical loss function to penalize both high and low misestimates equally. To decrease the absolute values of errors, Vapnik's ε-insensitive method forms a flexible tube of small radius symmetrically around the estimated function.

Mathematical Formulation
The training data are taken as $\{(x_i, y_i), i = 1, \ldots, n\} \subset \mathbb{R}^N \times \mathbb{R}$, where $\mathbb{R}^N$ is the input pattern space. The goal of ε-SV regression is to find a function f(x) that is as flat as possible and that deviates from the targets $y_i$ by at most ε for all of the training data [39]. In the linear case, the regression function is

$$f(x) = \langle K, \Phi(x)\rangle + b,$$

where Φ(x) converts the input x to a vector in the feature space F and K is a weight vector in F. The parameters K and b are produced by resolving the optimization problem

$$\min_{K,\,b}\; \tfrac{1}{2}\|K\|^2 + C\sum_{i=1}^{n}(\varphi_i + \varphi_i^*) \quad \text{s.t.} \quad y_i - \langle K, \Phi(x_i)\rangle - b \le \varepsilon + \varphi_i,\;\; \langle K, \Phi(x_i)\rangle + b - y_i \le \varepsilon + \varphi_i^*,\;\; \varphi_i, \varphi_i^* \ge 0.$$

The optimization criterion penalizes data points whose y values depart from f(x) by more than ε. The slack variables $\varphi_i$ and $\varphi_i^*$ stand for the size of the excess deviation for positive and negative deviations, respectively [42].
By the Karush-Kuhn-Tucker (KKT) conditions, the dual variables $\sigma_i$ and $\sigma_i^*$ are both nonnegative and at most one of them is nonzero. Their difference can therefore be written as a single coefficient $\mu_i = \sigma_i - \sigma_i^*$, and $\mu_i$ determines both $\sigma_i$ and $\sigma_i^*$. For the i-th sample $x_i$, a margin function $p(x_i)$ is defined from the deviation $f(x_i) - y_i$. Combining these conditions, the samples in the training set T can be classified into three subsets, analogous to the five conditions arising in support vector classification [44]. Two of the subsets ($E_{sv}$ and $M_{sv}$) are each composed of two distinct components, depending on the sign of the error $f(x_i) - y_i$.
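A small scikit-learn sketch of ε-SVR illustrates the insensitive tube and the sparsity of the support set; the simulated data and hyper-parameters are arbitrary choices:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(X).ravel() + 0.05 * rng.standard_normal(80)

# epsilon sets the radius of the insensitive tube:
# residuals smaller than epsilon incur no loss
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1)
svr.fit(X, y)

# only the points on or outside the tube become support vectors
print(len(svr.support_), "of", len(X), "training points are support vectors")
```

Since the noise level here is well below ε, most training points fall inside the tube and the fitted model depends only on a sparse subset of the data.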

Methodology
Quantitative Structure-Activity Relationship (QSAR) prediction studies aim to discover new drug-like molecules that can be used as lead compounds. This is achieved by selecting appropriate molecular descriptors and using feature-selection algorithms to predict the biological activities of designed compounds. With the rise of Big Data, there has been increased interest in the use of deep learning models, and studies have shown the effectiveness of a robust hybrid algorithm proposed by Liu et al. [45,46]. This two-step approach involves dividing the original data set into a training and test set, scaling the training data using its mean and standard deviation, and analyzing the training set using the sparse LTS algorithm or Lasso. The variables or molecular descriptors that are shrunk to zero are eliminated, while the variables with non-zero coefficients are selected as features for the new data set. Next, the new data set is divided into a training and test set, and the training set is further divided into k folds. Sets of hyper-parameter values for various machine learning algorithms, such as Random Forest, Ridge, Lasso, and Support Vector Regression, are tuned, and the hyper-parameter setting with the optimal metric is selected as the final model. Finally, the test metric, such as root mean squared error, is obtained for the final model.
In summary, QSAR prediction studies use various techniques to discover new drug-like molecules, and the robust hybrid algorithm proposed in this study is an effective approach for handling Big Data and predicting the biological activities of designed compounds.
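The two-step procedure described above can be sketched as follows, here with LASSO as the step-one selector (sparse LTS would slot in the same way); the simulated data, the α value, and the hyper-parameter grid are illustrative assumptions, not the settings used in the study:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
X = rng.standard_normal((150, 40))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * rng.standard_normal(150)

# Step 1: split, standardize with training statistics, select features with LASSO
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
mu, sd = X_tr.mean(axis=0), X_tr.std(axis=0)
selector = Lasso(alpha=0.1).fit((X_tr - mu) / sd, y_tr)
keep = np.flatnonzero(selector.coef_)     # descriptors with non-zero coefficients

# Step 2: tune a prediction model on the reduced data with k-fold cross-validation
grid = GridSearchCV(RandomForestRegressor(random_state=0),
                    {"n_estimators": [50, 100]}, cv=5)
grid.fit(((X_tr - mu) / sd)[:, keep], y_tr)

pred = grid.predict(((X_te - mu) / sd)[:, keep])
rmse = float(np.sqrt(np.mean((pred - y_te) ** 2)))
print(len(keep), "features kept; test RMSE:", round(rmse, 3))
```

The step-one shrinkage discards the uninformative descriptors, so the step-two learner is tuned on a far smaller feature set.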

Simulation Studies & Discussion
In this section, we design three distinct experiments to evaluate and compare the performance of the proposed estimators [16,47].
The simulation model is based on the linear regression framework

$$y = X\beta + \varepsilon, \qquad \varepsilon \sim N(0, 1).$$

Here, the response variable y is generated as a linear combination of the predictor variables X and the unknown coefficients β. The random error term ε follows a normal distribution with mean 0 and variance 1, and the predictors X are generated from a multivariate normal distribution with mean 0 and covariance matrix Σ. The correlation between the predictors is specified by the parameter ρ.
To evaluate the performance of our proposed methods, we employed a standard approach of dividing each simulated dataset into three distinct parts: a training set, a validation set, and a test set.The data were split in a ratio of 60 percent for training, 20 percent for validation, and 20 percent for testing.We used the training set to fit the models and the validation set to tune the hyperparameters, which were chosen using a grid search.The test set was then used to provide an unbiased evaluation of the final model fit on the training data.
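The 60/20/20 split can be implemented with a single random permutation of the indices; n = 300 here matches the largest simulated sample size:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
idx = rng.permutation(n)                 # shuffle once, then slice
n_train, n_val = int(0.6 * n), int(0.2 * n)
train_idx = idx[:n_train]
val_idx = idx[n_train:n_train + n_val]
test_idx = idx[n_train + n_val:]
print(len(train_idx), len(val_idx), len(test_idx))
```

Shuffling once and slicing guarantees the three parts are disjoint and together cover every observation.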
We conducted simulations under three distinct cases, each with varying degrees of dimensionality. In each case, we evaluated the estimators' performance using appropriate accuracy measures and compared their results. This approach allowed us to assess the effectiveness of our proposed method under different scenarios and make reliable conclusions about its performance. Following the approach of Alao et al. [48] and Lukman et al. [49][50][51][52][53], the model was deliberately contaminated with outliers by shifting selected responses,

$$y_i^{out} = y_i + m,$$

where m represents the magnitude of the outlier, which was set to 10 in this study. Outliers were introduced in 20 percent of the response values. The contamination allowed us to assess the robustness of the proposed methods and compare their performances in the presence of outliers. The predictors X were created as $x_i \sim N(0, 1)$, independently and identically distributed.
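The contamination scheme can be sketched as follows; the sparse coefficient vector and the equicorrelated covariance below are illustrative assumptions, not the exact design used in the study:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, rho, sigma = 100, 150, 0.9, 3.0

# equicorrelated predictors: Corr(x_j, x_k) = rho for j != k
cov = np.full((p, p), rho)
np.fill_diagonal(cov, 1.0)
X = rng.multivariate_normal(np.zeros(p), cov, size=n)

beta = np.zeros(p)
beta[:5] = 1.0                          # assumed sparse coefficient vector
y = X @ beta + sigma * rng.standard_normal(n)

# contaminate 20% of the responses by shifting them by m = 10
m = 10.0
out = rng.choice(n, size=int(0.2 * n), replace=False)
y[out] += m
print(len(out), "of", n, "responses contaminated")
```

Sampling the contaminated indices without replacement ensures exactly 20 percent of the responses are shifted.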
To evaluate the performance of our models, we used metrics that are commonly employed to measure prediction accuracy. Specifically, we calculated the test root mean square error (RMSE), mean absolute deviation (MAD), and median absolute error (MAE):

RMSE = sqrt( (1/n) Σᵢ (yᵢ − ŷᵢ)² ),  MAD = (1/n) Σᵢ |yᵢ − ŷᵢ|,  MAE = median |yᵢ − ŷᵢ|.

Four levels of multicollinearity, ρ = 0.7, 0.9, 0.95, and 0.99, were considered with sample sizes (n) of 50, 100, and 300, respectively. RStudio was used to conduct the simulation study.
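In code, the three accuracy measures (using the paper's naming, where MAD denotes the mean absolute deviation and MAE the median absolute error) can be computed as:

```python
import numpy as np

def prediction_metrics(y_true, y_pred):
    """Compute the three accuracy measures used in the study.

    Note the paper's convention: MAD is the *mean* absolute deviation and
    MAE is the *median* absolute error.
    """
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean(err ** 2))   # root mean square error
    mad = np.mean(np.abs(err))          # mean absolute deviation
    mae = np.median(np.abs(err))        # median absolute error
    return rmse, mad, mae
```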
Effect of Sample Size (n) and Predictor Count: Comparing scenario 1 (n = 100, p = 150) with scenario 3 (n = 300, p = 400), it is evident that larger sample sizes and predictor counts result in lower RMSE, MAD, and MAE. This suggests that larger datasets with more predictors tend to yield improved model performance.
Estimator Performance: In the scenario with σ = 5 and n = 100, SLTS+RIG and SLTS+LASSO outperform the other estimators in terms of RMSE, MAD, and MAE, indicating their robustness in capturing the underlying relationships in the data. In the scenario with σ = 10 and n = 300, SLTS+RF demonstrates competitive RMSE, MAD, and MAE, suggesting its effectiveness in high-dimensional settings. SLTS stands out for its low MAE, especially when σ = 5, which implies that it is well suited to situations where the absolute magnitude of errors is crucial. Summary: From Tables 1-5, for the given dataset characteristics (n = 50, 100, 300; p = 100, 150, 400; σ = 3, 5, 10; ρ = 0.7, 0.9, 0.95, 0.99, respectively), SLTS+LASSO appears to be the most suitable estimation method, providing the lowest prediction errors across all three metrics. SLTS+RIG also performs exceptionally well. These results offer valuable guidance for researchers and practitioners in selecting the most appropriate modeling approach for similar datasets with high-dimensional predictors and multicollinearity.
Likewise, Tables 1, 2 & 3 present the results for Case 1, where the sampling generation technique used allowed for an accurate representation of the degree of multicollinearity through ρ. The results indicate that there is no discernible pattern in prediction error as ρ increases, demonstrating the robustness of the estimation strategies when dealing with multicollinearity using Sparse LTS. The best overall estimation method is still SLTS+LASSO, which provides a lower prediction error than SLTS+RIG. As expected, the error increases with higher σ and larger p, owing to the increase in the number of variables. Since sparsity was taken into account in this scenario, the MAE performance metric fares better than MAD, which in turn fares better than RMSE.
Next, in Cases 2 and 3, the sparsity level was still considered, and different signals were used. Tables 4 & 5 show that, unlike in Case 1, the prediction error increases as multicollinearity increases for all values of n, p, and σ. The LASSO+RF estimator performs the worst of all. SLTS+LASSO remains the superior method, providing a lower test MAE. When the grouping effect is present in Case 3, the estimators LASSO+RF and SLTS+RF produce lower errors than LASSO+SVR and SLTS.

Real-life Analysis
This study selected 65 imidazo[4,5-b]pyridine derivatives exhibiting anticancer activity from previously published research [54][55][56]. The biological activity of these compounds was measured using the IC50 value, which represents the concentration of the compound required to inhibit cell growth by 50 percent. To develop a quantitative structure-activity relationship (QSAR) model, the logarithmic scale of the IC50 values (pIC50 = −log(IC50)) was used as the response variable. Molecular structures of the 65 compounds were created using CHEM3D software, optimized using the molecular mechanics (MM2) method, and then refined with a molecular orbital package (MOPAC) module. Subsequently, 4885 molecular descriptors, covering all 29 blocks and based on the optimized molecular structures, were generated using DRAGON software (version 6.0) [2]. To ensure consistency and usefulness of the molecular descriptors, several preprocessing steps were carried out, including the exclusion of descriptors that had constant values for all compounds, the removal of descriptors in which 60 percent of the values were zeros, and the discarding of descriptors that had zero values for all compounds. Ultimately, 2540 molecular descriptors were retained for evaluating the QSAR model.
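The descriptor preprocessing rules above can be sketched as a column filter over the compound-by-descriptor matrix; the ≥ 60 percent direction of the zero-value threshold and the helper name `filter_descriptors` are assumptions:

```python
import numpy as np

def filter_descriptors(D):
    """Filter a descriptor matrix D (rows = compounds, cols = descriptors).

    Applies the three preprocessing rules described in the text: drop
    columns that are constant across all compounds, columns where at least
    60 percent of values are zero (assumed threshold direction), and
    columns that are zero for every compound (covered by the zero-fraction
    rule as well).
    """
    D = np.asarray(D, dtype=float)
    nonconstant = D.std(axis=0) > 0          # rule 1: not constant
    zero_frac = (D == 0).mean(axis=0)
    keep = nonconstant & (zero_frac < 0.6)   # rules 2 and 3: too many zeros
    return D[:, keep], keep
```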
The data were split in a ratio of 70 percent for training and 30 percent for testing. We used the training set to fit the models and tune the hyperparameters, which were chosen using a grid search; we selected the tuning parameters that minimize the cross-validation error. The molecular descriptors that are shrunk to zero are eliminated, while the descriptors with nonzero coefficients are selected as features for the new dataset. The new dataset is divided into a training and test set, and the training set is further divided into five folds. Sets of hyperparameter values for various machine learning algorithms, such as Random Forest, Ridge, Lasso, and Support Vector Regression, are tuned, and the hyperparameter configuration with the optimal metric is selected as the final model. Finally, we obtained the root mean squared error, median absolute error, and mean absolute error for the final model using the test set.
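The two-step procedure described above can be sketched with scikit-learn. Here `LassoCV` stands in for the paper's LASSO/sparse LTS screening step, and the hyperparameter grids are illustrative rather than the ones used in the study:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV, Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVR

def hybrid_fit(X, y, seed=0):
    """Sketch of the two-step hybrid approach.

    Step 1: Lasso-based screening keeps descriptors with nonzero
    coefficients (LassoCV is a stand-in for the LASSO / sparse LTS step).
    Step 2: each downstream learner is tuned by 5-fold grid search on the
    reduced data; the grids below are illustrative assumptions.
    """
    # Step 1: variable selection via cross-validated Lasso
    lasso = LassoCV(cv=5, random_state=seed).fit(X, y)
    selected = np.flatnonzero(lasso.coef_)
    X_new = X[:, selected]

    # Step 2: 70/30 split, then 5-fold grid search for each learner
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_new, y, test_size=0.3, random_state=seed)
    grids = {
        "ridge": (Ridge(), {"alpha": [0.1, 1.0, 10.0]}),
        "svr": (SVR(), {"C": [1.0, 10.0]}),
        "rf": (RandomForestRegressor(random_state=seed),
               {"n_estimators": [100]}),
    }
    models = {name: GridSearchCV(est, grid, cv=5).fit(X_tr, y_tr)
              for name, (est, grid) in grids.items()}
    return selected, models, (X_te, y_te)
```

The held-out pair `(X_te, y_te)` would then be scored with the RMSE/MAD/MAE metrics reported in Table 6.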
The prediction results are presented in Table 6, and the prediction performance is displayed in Figure 2. Selecting the molecular descriptors with Sparse LTS produced the most preferred predictions because the method is robust to outlying values. LASSO selected 48 descriptors, while Sparse LTS selected 15 as its active set. Figure 2 serves as a visual representation of the estimator performances, focusing on the root mean squared error, mean absolute deviation, and median absolute error metrics. It provides an intuitive way to assess how each estimator performs and complements the information presented in Table 6, offering a graphical perspective on their relative performance. Table 6 and Figure 2 demonstrate that SLTS+LASSO outperforms the other six approaches on the prediction metrics. These results agree with the simulation study.

Conclusion
This study presents novel and robust methods for identifying potential drug compounds and predicting their biological activities. We utilized machine learning techniques to achieve this goal, specifically by using sparse LTS or Lasso algorithms to select important molecular descriptors. The selected descriptors were then divided into training and test sets, with further subdivision of the training set into training and validation sets. We employed various machine learning algorithms, including Random Forest, Ridge, Lasso, and Support Vector Regression, to tune hyper-parameter values for the final model. We evaluated the effectiveness of these algorithms using three standard metrics: root mean square error (RMSE), mean absolute deviation (MAD), and median absolute error (MAE).
To investigate the robustness of our methods, we conducted a simulation study exploring different scenarios. Our results demonstrated that the Sparse LTS and Lasso algorithms effectively handled multicollinearity and outliers. The SLTS+LASSO hybrid estimating approach was the most effective, followed by SLTS+RIG, owing to their lower prediction errors. We found that MAE outperformed MAD and RMSE as performance metrics when sparsity was considered. We applied our methods to a QSAR example to validate our simulation results, and the findings were consistent with the simulation study.
The findings of this study contribute to the field of high-dimensional data analysis and modeling with multicollinear and outlier-contaminated data in linear models. Our methods have the potential to be used in drug discovery and development, as they can help identify potential drug compounds and predict their biological activities. Further research in this area is warranted to enhance our understanding of these methods and their potential applications.

Table 4 .
Synthetic simulation results for Case 2 & Case 3

Table 5 .
Synthetic simulation results for Case 2 & Case 3

Table 6 .
Hyper-parameter Values of the molecular descriptor for the Hybrid Methods