Robust M-Estimators and Machine Learning Algorithms for Improving the Predictive Accuracy of Seaweed Contaminated Big Data

A common problem in regression analysis using ordinary least squares (OLS) is the effect of outliers or contaminated data on the parameter estimates. A robust method that is insensitive to outliers and can handle contaminated data is therefore needed. The objectives of this study are to determine the significant parameters that determine the moisture content of seaweed after drying and to develop a hybrid model that reduces the influence of outliers. The data were collected with sensors from the v-Groove Hybrid Solar Drier (v-GHSD) at Semporna, on the south-eastern coast of Sabah, Malaysia. After including the second-order interactions, there are 435 drying parameters, each with 1914 observations. First, four machine learning algorithms, namely random forest, support vector machine, bagging and boosting, were used to rank the parameters, from which the 15, 25, 35 and 45 most important were selected. Second, hybrid models were developed using the robust M-estimators M Bi-Square, M Hampel and M Huber. The results show a significant reduction in the number of outliers and better prediction using the hybrid models for the contaminated seaweed big data. For the 45 most important drying parameters, the hybrid model bagging M Bi-Square performs best, with the lowest percentage of outliers at 4.08%.


Introduction
The purpose of regression analysis is to study the relationship between two or more independent variables and a dependent variable. Consider a multiple regression model

y = Xβ + ε,

where y is an n × 1 vector of response variables, X is the design matrix of order n × p, β is a p × 1 vector of unknown parameters and ε is an n × 1 vector of independently and identically distributed errors.
Ordinary least squares (OLS) is popularly used to estimate the unknown parameters in a regression model. According to [1,2], the OLS estimator of β is

β̂ = (XᵀX)⁻¹Xᵀy.

Observations that deviate from the general shape or pattern of the distribution are called outliers [3]. OLS estimates the relationship between the explanatory variables and the dependent variable by minimizing the sum of squared residuals [4], but it has limitations when its assumptions are violated [5]. OLS estimates can be imprecise because of high variances and covariances [6]. The presence of outliers in the data makes the LS estimator unstable, inefficient and unreliable [7]. Agricultural data contain outliers because of factors that cannot be controlled, and these outliers inflate the standard errors [4,8]. Since outliers degrade the performance of OLS, robust regression is used instead [9].
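The sensitivity of the OLS estimator to contamination can be illustrated with a small simulation. This is only an illustrative sketch, not the study's data: the design, coefficients and noise level are invented for demonstration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50
x = np.linspace(0, 10, n)
X = np.column_stack([np.ones(n), x])          # design matrix with intercept column
y = 2.0 + 1.5 * x + rng.normal(0, 0.3, n)     # true beta = (2.0, 1.5)

# OLS estimate via the normal equations: beta_hat = (X'X)^{-1} X'y
beta_clean = np.linalg.solve(X.T @ X, X.T @ y)

# Contaminate a single observation and refit
y_out = y.copy()
y_out[-1] += 50.0
beta_cont = np.linalg.solve(X.T @ X, X.T @ y_out)
```

A single gross outlier shifts both the slope and intercept estimates noticeably, which is the motivation for the robust M-estimators used later in the paper.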
When modelling data using regression analysis, various assumptions are tested, but in practice these assumptions are often violated. The error structure of the model must be checked against the necessary assumptions before prediction [10]. The researcher can transform the variables to satisfy the assumptions, but transformation cannot eradicate the outliers that affect the forecasts and the parameter estimates [11]. Data with outliers are common in the field of agriculture [11,12].
To overcome this problem, robust estimators have been introduced. M-estimation, introduced by [13], is the most common method of robust regression; it is a generalization of maximum likelihood estimation. Before applying the robust methods to reduce the influence of outliers, four machine learning algorithms, namely random forest, support vector machine, boosting and bagging, are used to select the significant parameters that determine the moisture content of the seaweed.
The major contributions of this study are: i. to determine the significant parameters for the moisture content removal of seaweed during drying and to reduce the number of outliers; ii. to propose a hybrid model that combines robust M-estimators and machine learning models to improve prediction accuracy. Figure 1 shows the flowchart of the various stages in the study.

Stage I
This involves the inclusion of all possible models.
Equation (3) can be used to compute the total number of all possible models:

n!/((n − r)! r!) + number of single factors,

where n is the number of single factors and r is the order of the interaction.
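As a quick check on this count, the formula can be evaluated directly. The paper does not state the number of single factors explicitly; n = 29 is an assumption here, chosen because 29 single factors plus their second-order interactions give exactly the 435 parameters reported in the study.

```python
from math import comb

def total_models(n, r=2):
    """Number of single factors plus all r-th order interaction terms:
    n + n! / ((n - r)! r!) = n + C(n, r)."""
    return n + comb(n, r)

# 29 single factors with second-order interactions: 29 + C(29, 2) = 29 + 406 = 435
print(total_models(29))
```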

Stage II
Test the assumptions of linear regression. The residuals vs fitted plot, the normal Q-Q plot and the Kolmogorov-Smirnov test are used to verify the assumptions. Next, each machine learning model is used to select the 15, 25, 35 and 45 most important variables, for optimization and ease of comparison, in determining the moisture content removal of the seaweed after drying. These numbers of variables were chosen because feature selection can only rank the important variables and does not tell us the number of significant factors [14]. Similarly, there is no rule for deciding the number of parameters to include in a prediction model [15]. Furthermore, the algorithms cannot tell us the number of significant variables, only the ranks [16].
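Selecting the k highest-ranking variables from a fitted model's importance scores might look like the following sketch. The data here are synthetic; the study's actual drying parameters are not reproduced.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 40))                 # stand-in for the drying parameters
y = 3 * X[:, 0] - 2 * X[:, 5] + rng.normal(0, 0.1, 300)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

k = 15                                         # as in the study: 15, 25, 35 or 45
top_k = np.argsort(rf.feature_importances_)[::-1][:k]
```

The importance scores only provide a ranking, as the text notes, so the cut-off k must be supplied by the analyst.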

Stage III
After the selection of the significant parameters, predictions are made and the validation metrics MAPE, SSE, MSE and R-square are computed. The outliers are also counted, and the robust methods are introduced to build the hybrid models.
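The validation metrics can be computed directly from the residuals. A minimal sketch, using the standard definitions (MAPE reported in percent):

```python
import numpy as np

def validation_metrics(y_true, y_pred):
    """MAPE (%), SSE, MSE and R-squared from observed and predicted values."""
    resid = y_true - y_pred
    sse = float(np.sum(resid ** 2))                       # sum of squared errors
    mse = sse / len(y_true)                               # mean squared error
    mape = float(np.mean(np.abs(resid / y_true))) * 100   # y_true must be nonzero
    sst = float(np.sum((y_true - y_true.mean()) ** 2))    # total sum of squares
    r2 = 1.0 - sse / sst
    return {"MAPE": mape, "SSE": sse, "MSE": mse, "R2": r2}
```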

Data Description
The data were collected from 8th April 2017 to 12th April 2017, between 8:00 am and 5:00 pm, during the drying of seaweed using the v-Groove Hybrid Solar Drier (v-GHSD) at Semporna, on the south-eastern coast of Sabah, Malaysia. There are 435 parameters after the inclusion of the second-order interactions in this study.

Machine learning algorithms
Machine learning can learn from data and use algorithms to understand and forecast the future [17]. Machine learning algorithms can be used to rank the explanatory variables that contribute significantly to the response variable. The high-ranking variables selected by variable importance can reduce the training time and complexity of the model and improve accuracy [18]. Four machine learning algorithms, namely random forest, support vector machine, bagging and boosting, are used in this study to determine the significant parameters for the moisture content removal of the seaweed.

Random Forest
A random forest (RF) is a combination of classification and regression trees (CARTs). It uses the majority vote (classification) or the mean forecast (regression) of all the trees [19]. It builds on the idea of bagging and is an ensemble learning method [20,21].
If L is a learning set with N pairs of features and outputs (x_i, y_i), the algorithm proceeds as follows: 1. Draw a bootstrap sample from L. 2. Grow a tree on the sample, choosing the best split at each node from a random subset of the features. 3. Add the tree to the ensemble and aggregate the predictions by majority vote or averaging. Replicate steps 1-3 till the stopping conditions are satisfied.
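The aggregation step can be observed directly in scikit-learn, where a regression forest's prediction is the average of its bootstrapped trees' predictions. This is an illustrative sketch on synthetic data, not the study's model.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(0, 0.1, 200)

rf = RandomForestRegressor(n_estimators=50, random_state=1).fit(X, y)

# Each fitted tree saw a different bootstrap sample of the training data;
# the forest's regression output is the mean of the individual tree forecasts.
tree_preds = np.stack([tree.predict(X) for tree in rf.estimators_])
forest_pred = tree_preds.mean(axis=0)
```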

Support Vector Machine (SVM)
Support vector machines can be used for regression and classification problems [22]. SVM has the capacity to reveal nonlinear relationships through kernel functions [20,23]. The SVM was developed by Cortes & Vapnik [24]; good tutorials and explanations are given by [25,26]. In support vector regression, an ε-insensitive loss function is usually minimized: beyond the bound ε a straightforward linear loss is applied, and any loss less than ε is set to zero. For instance, suppose f(x) is a linear function f(x) = β₀ + x_iᵀβ; then the loss function is given as

L_ε(y_i, f(x_i)) = max(0, |y_i − f(x_i)| − ε).

The ε is the tuning parameter, and the fit can be written as the constrained optimization problem: minimize ½‖β‖² subject to |y_i − f(x_i)| ≤ ε for all i. If there are observations that do not lie within the ε band around the regression line, there is no solution to this problem. The slack variables ζ_i and ζ*_i are therefore introduced, allowing observations to fall outside the ε band: minimize ½‖β‖² + C Σᵢ(ζ_i + ζ*_i) subject to y_i − f(x_i) ≤ ε + ζ_i, f(x_i) − y_i ≤ ε + ζ*_i and ζ_i, ζ*_i ≥ 0.
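A minimal ε-insensitive regression fit with scikit-learn's SVR; the synthetic data and the values of C and epsilon are illustrative choices, not the study's settings.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(2)
X = rng.uniform(0, 5, size=(150, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.05, 150)

# epsilon sets the width of the insensitive band around the fitted function;
# C trades off flatness of f against tolerance of points outside the band
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
pred = svr.predict(X)
```

The RBF kernel lets the regression capture the nonlinear sine relationship, illustrating the kernel capacity mentioned above.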

Boosting
Boosting is used to improve the accuracy of algorithms [27]. Boosting starts with a "base" or "weak" learning algorithm that discovers rough rules of thumb and is called many times. The base learning algorithm creates a new weak prediction rule each time it is called, and after many rounds the boosting algorithm merges these weak rules into a single forecast rule that, ideally, is significantly more precise than any of the weak rules [28]. Suppose we have the model matrix X = (X₁, X₂, . . . , X_p) ∈ ℝ^{n×p} and the outcome vector y ∈ ℝ^{n×1}. The vector of regression coefficients is β ∈ ℝ^p, the predicted value of the outcome variable is Xβ, and the residuals are ε = y − Xβ. For regression purposes, least squares boosting (LSB(ε)) gives an accurate description of the data together with regularization [27].
1. For 0 ≤ k ≤ N: 2. Establish the covariate u_{j_k} and index j_k as the column most correlated with the current residual r_k. 3. Revise the current residuals and regression coefficients as r_{k+1} ← r_k − ε u_{j_k}, β^{k+1}_{j_k} ← β^k_{j_k} + ε u_{j_k} and β^{k+1}_j ← β^k_j for j ≠ j_k.
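The updates above amount to incremental forward-stagewise least-squares boosting: at every round the coefficient of the column most correlated with the residual is nudged by a small step. A from-scratch sketch, where the step size eps and the fixed number of rounds are illustrative assumptions:

```python
import numpy as np

def ls_boost(X, y, eps=0.01, n_steps=1000):
    """Least-squares boosting: repeatedly nudge the coefficient of the
    column most correlated with the current residual (forward stagewise)."""
    n, p = X.shape
    beta = np.zeros(p)
    r = y.astype(float).copy()
    for _ in range(n_steps):
        corr = X.T @ r                    # alignment of each column with residual
        j = int(np.argmax(np.abs(corr)))  # j_k: the most correlated covariate
        step = eps * np.sign(corr[j])
        beta[j] += step                   # update only the j_k-th coefficient
        r = r - step * X[:, j]            # revise the current residual
    return beta

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0]                         # true model uses only the first column
beta = ls_boost(X, y)
```

With many small steps the coefficient of the truly active column grows toward its least-squares value while the inactive columns stay near zero, which is the regularization effect mentioned in the text.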

Bagging
Breiman [29] introduced bagging (bootstrap aggregating) to decrease the variance of classification and regression tree models. It is used to improve an existing method and leads to an improvement in accuracy. Bagging is an intensive method for stabilizing erratic estimators, and for high-dimensional data problems it can be used to find a good model. Suppose we have a predictor ϕ(x, L) to predict y from x. If there is a sequence of training sets {L_k}, each consisting of N observations drawn from the same distribution as L, the aim is to use the {L_k} to build a better predictor than the single-training-set predictor ϕ(x, L) [29]. If y is numerical, an obvious procedure is to replace ϕ(x, L) with the average of ϕ(x, L_k) over k. We take repeated bootstrap samples L^(A) from L and form ϕ(x, L^(A)). If y is continuous, the aggregated predictor is ϕ_A(x) = average_A ϕ(x, L^(A)). The L^(A) form replicate datasets in which M cases are randomly chosen from L with replacement, so each pair (y_m, x_m) can appear many times, or not at all, in any particular L^(A). The technique used to construct ϕ is an important factor in whether bagging improves precision or reliability.
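The bootstrap-and-average procedure can be sketched directly: fit one tree per bootstrap replicate L^(A) and average the forecasts. The data and tree settings here are illustrative, not the study's.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_predict(X_train, y_train, X_test, n_boot=50, seed=0):
    """Average the predictions of trees fit on bootstrap replicates L^(A)."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_train), len(y_train))  # sample with replacement
        tree = DecisionTreeRegressor(random_state=0).fit(X_train[idx], y_train[idx])
        preds.append(tree.predict(X_test))
    return np.mean(preds, axis=0)                          # aggregated predictor

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 5))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(0, 0.2, 300)
pred = bagged_predict(X[:200], y[:200], X[200:])
```

Because a fully grown tree is an unstable (high-variance) predictor, averaging over bootstrap replicates reduces the variance, which is exactly the situation where Breiman showed bagging helps.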
The plug-in principle is used to ascertain the bootstrap estimates.

Robust Estimation Method
Outliers are common in contaminated data, and determining which observations are outliers is a challenge. A robust method can deal with the influence of outliers, and contaminated data can be analyzed using robust estimation [6,30,31,32]. A robust method is used to solve the problems that outliers cause for traditional methods. To find the best robust estimation method, the M-estimators M Huber, M Hampel and M Bi-Square are compared.
The M-estimation method minimizes a function ρ(·) of the residuals. The M-estimator is defined by

β̂_M = arg min_β Σᵢ ρ(e_i(β)).

Assume the scale σ is known and let the residuals at β be e_i = y_i − βᵀx_i, where β is the p × 1 parameter vector. Differentiating the objective function with respect to β and setting the derivative to zero yields the estimating equations, in which ψ(e) = ∂ρ(e)/∂e is the influence function. For example, Huber's ρ is e²/2 for |e| ≤ k and k|e| − k²/2 for |e| > k. The weight function is then defined as w(e) = ψ(e)/e, so the estimating equations become

Σᵢ w(e_i) e_i x_ij = 0, for j = 1, 2, . . . , p, (15)

and the objective becomes the iterated reweighted least squares problem

β^(k+1) = arg min_β Σᵢ w(e_i^(k)) (y_i − βᵀx_i)²,

where k indicates the iteration number. Table 1 shows the summary of the M-estimators and their respective weight functions.
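The iterated reweighted least squares scheme can be sketched from scratch for the Huber case. The tuning constant k = 1.345 and the MAD-based scale estimate are standard conventions assumed here, not values taken from the paper.

```python
import numpy as np

def huber_weights(e, k=1.345):
    """Huber weight function w(e) = psi(e)/e: 1 for |e| <= k, k/|e| otherwise."""
    a = np.abs(e)
    return np.where(a <= k, 1.0, k / np.maximum(a, 1e-12))

def m_estimate(X, y, k=1.345, n_iter=50):
    """Huber M-estimator via iteratively reweighted least squares (IRLS)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]               # OLS starting values
    for _ in range(n_iter):
        e = y - X @ beta
        s = np.median(np.abs(e - np.median(e))) / 0.6745      # robust MAD scale
        w = huber_weights(e / s, k)
        # Weighted least squares step: solve (X'WX) beta = X'Wy
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
    return beta

rng = np.random.default_rng(6)
n = 100
x = np.linspace(0, 10, n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(0, 0.3, n)
y[-10:] += 30.0                          # contaminate the 10 highest-x points

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_huber = m_estimate(X, y)
```

On this contaminated sample the Huber fit stays near the true slope of 2 while the OLS slope is pulled strongly toward the outliers, illustrating why the hybrid models use M-estimation.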

Results and Discussion
The residuals vs fitted plot in Figure 2a shows a pattern, since the residuals are not spread out randomly; there is evidence of non-linearity and heterogeneity. Figure 2b shows the normal Q-Q plot: the residuals are not normally distributed, which also supports the result of the Kolmogorov-Smirnov test in Table 2. The possible outliers are observations 272 and 355. Observation 272 contributes more to the moisture content removal of the seaweed than the model predicts; although it is an extreme case, it still affects the moisture content removal. Observation 355 has a negative residual and contributes less to the moisture content removal of the seaweed than the model predicts.
The normality assumption is checked with a two-tailed Kolmogorov-Smirnov test. From the results in Table 2, the p-value is 2.2e-16, which is less than 0.05, so we have enough evidence to say that the residuals do not come from a normal distribution. This also explains the shape of the Q-Q plot in Figure 2.
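The Kolmogorov-Smirnov check can be reproduced in outline with scipy; here skewed synthetic residuals stand in for the study's residuals, so the specific statistic and p-value are illustrative only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
residuals = rng.exponential(size=2000) - 1.0   # skewed, clearly non-normal stand-in

# Two-sided one-sample KS test of the residuals against N(0, 1)
stat, p = stats.kstest(residuals, "norm")
```

A small p-value (below 0.05) leads to rejecting the null hypothesis that the residuals come from a normal distribution, matching the reasoning applied to Table 2.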
The results in Table 3 are the evaluation of each machine learning algorithm for the 15, 25, 35 and 45 high-ranking variables that determine the moisture content removal of the seaweed. Based on the mean absolute percentage error (MAPE), mean squared error (MSE), R² and sum of squared errors (SSE), random forest outperforms support vector machine, bagging and boosting for the 15, 25, 35 and 45 significant parameters. This confirms the results of [33], where random forest also performed better than the other methods.
Random forest gave the best performance when the 45 significant parameters that determine the moisture content of the seaweed were selected, with a MAPE of 2.125891, MSE of 7.330011, R² of 0.9732063 and SSE of 14029.64. All the validation measures (MAPE, MSE, R-square and SSE) imply that random forest obtains significantly better results in determining the moisture content removal of the seaweed. Table 4 summarizes the original models without the robust methods and the hybrid models, which combine machine learning models and robust estimation techniques. It also shows the number and percentage of outliers using the 2-sigma limit; the percentage of outliers is the share of observations outside the 2-sigma limit, reported for both the original models and the hybrid models. This sigma limit can improve output quality and eliminate sources of deficiency [34].
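The outlier percentage under a 2-sigma limit is simply the share of residuals lying more than two standard deviations from the mean. A minimal sketch:

```python
import numpy as np

def outlier_percentage(residuals):
    """Percentage of residuals outside the 2-sigma control limits."""
    mu, sigma = residuals.mean(), residuals.std()
    outside = np.abs(residuals - mu) > 2 * sigma
    return 100.0 * outside.mean()
```

Applied to each model's residuals, this yields figures such as the 4.08% reported for the bagging M Bi-Square hybrid model.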
Based on the results in Table 4, the hybrid models have lower percentages of outliers than the corresponding original models for the 15, 25, 35 and 45 highest important variables.

Conclusion
The aim of this study was to develop a hybrid model to forecast the seaweed drying parameters that determine the moisture content removal and thereby enhance the quality of the seaweed. Four predictive models (random forest, support vector machine, bagging and boosting) were combined with M Huber, M Hampel and M Bi-Square to develop hybrid models that improve the predictive accuracy for the contaminated seaweed data. In summary, the best model for determining the moisture content removal of the seaweed big data is bagging M Bi-Square: it gave the best performance because it had the lowest number of outliers, 78, while using the highest number of high-ranking variables. For future study, a hybrid model for imbalanced data or missing values can be investigated.