Robust M-estimators and Machine Learning Algorithms for Improving the Predictive Accuracy of Seaweed Contaminated Big Data

Olayemi Joshua Ibidoja; Fam Pei Shan; Mukhtar; Jumat Sulaiman; Majid Khan Majahar Ali

doi:10.46481/jnsps.2023.1137

Authors

O. J. Ibidoja
Department of Mathematics, Federal University Gusau, Gusau, Nigeria; School of Mathematical Sciences, Universiti Sains Malaysia 11800 USM, Penang, Malaysia
F. P. Shan
School of Mathematical Sciences, Universiti Sains Malaysia 11800 USM, Penang, Malaysia
Mukhtar
I-CEFORY (Local Food Innovation), Universitas Sultan Ageng Tirtayasa, Indonesia
J. Sulaiman
School of Science and Technology, Universiti Malaysia Sabah, Kota Kinabalu, Sabah, Malaysia
M. K. M. Ali
School of Mathematical Sciences, Universiti Sains Malaysia 11800 USM, Penang, Malaysia

Keywords:

Robust method, Hybrid model, Machine learning, Outliers, Big data

Abstract

A common problem in regression analysis using ordinary least squares (OLS) is the effect of outliers or contaminated data on the parameter estimates. A robust method that is insensitive to outliers and can handle contaminated data is therefore needed. The objective of this study is to identify the significant parameters that determine the moisture content of seaweed after drying and to develop a hybrid model that reduces the number of outliers. The data were collected with sensors from the v-Groove Hybrid Solar Drier (v-GHSD) at Semporna, on the south-eastern coast of Sabah, Malaysia. After including second-order interactions, there are 435 drying parameters, each with 1914 observations. First, we used four machine learning algorithms (random forest, support vector machine, bagging and boosting) to determine the significant parameters by selecting 15, 25, 35 and 45 parameters. Second, we developed the hybrid model using three robust M-estimators: M Bi-Square, M Hampel and M Huber. The results show that the hybrid model significantly reduces the number of outliers and gives better predictions for the contaminated seaweed big data. For the 45 significant drying parameters with the highest variable importance, the hybrid model bagging M Bi-Square performs best because it has the lowest percentage of outliers, 4.08%.
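To make the robust step concrete, the following is a minimal sketch of M-estimation by iteratively reweighted least squares, using the Huber and bi-square weight functions. It is not the authors' implementation. It assumes a single predictor, a toy data set with two contaminated responses, the conventional tuning constants 1.345 and 4.685, a MAD-based residual scale, and a hypothetical cut-off of three scale units for flagging outliers. The machine learning selection step and the Hampel estimator are omitted.

```python
import statistics


def huber_weight(u, c=1.345):
    """Huber weight: 1 inside [-c, c], c/|u| outside."""
    a = abs(u)
    return 1.0 if a <= c else c / a


def bisquare_weight(u, c=4.685):
    """Tukey bi-square weight: (1 - (u/c)^2)^2 inside [-c, c], 0 outside."""
    return (1.0 - (u / c) ** 2) ** 2 if abs(u) <= c else 0.0


def mad_scale(res):
    """Robust residual scale: median absolute deviation / 0.6745."""
    med = statistics.median(res)
    return statistics.median([abs(r - med) for r in res]) / 0.6745


def weighted_line(x, y, w):
    """Weighted least-squares fit of y = a + b*x; returns (a, b)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return my - b * mx, b


def residuals(x, y, a, b):
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


def m_estimate(x, y, weight, iters=50, tol=1e-10):
    """M-estimate of (a, b) by IRLS, starting from the OLS fit."""
    a, b = weighted_line(x, y, [1.0] * len(x))
    for _ in range(iters):
        res = residuals(x, y, a, b)
        s = mad_scale(res) or 1e-12  # guard against a zero scale
        w = [weight(r / s) for r in res]
        a_new, b_new = weighted_line(x, y, w)
        shift = abs(a_new - a) + abs(b_new - b)
        a, b = a_new, b_new
        if shift < tol:
            break
    return a, b


def outlier_percentage(x, y, a, b, k=3.0):
    """Share of residuals larger than k robust scale units (assumed rule)."""
    res = residuals(x, y, a, b)
    s = mad_scale(res)
    return 100.0 * sum(abs(r) > k * s for r in res) / len(res)


# Toy data (assumption): true line y = 2 + 0.5x with small deterministic
# noise; the last two responses are contaminated by adding 20.
x = [float(i) for i in range(20)]
y = [2.0 + 0.5 * xi + 0.1 * ((7 * i) % 5 - 2) for i, xi in enumerate(x)]
y[18] += 20.0
y[19] += 20.0

a_ols, b_ols = weighted_line(x, y, [1.0] * len(x))
a_hub, b_hub = m_estimate(x, y, huber_weight)
a_bis, b_bis = m_estimate(x, y, bisquare_weight)

print("OLS slope:      ", round(b_ols, 3))
print("Huber slope:    ", round(b_hub, 3))
print("Bi-square slope:", round(b_bis, 3))
print("Flagged (%):    ", outlier_percentage(x, y, a_bis, b_bis))
```

The bi-square weight drops to zero for large standardised residuals, so grossly contaminated points stop influencing the fit, whereas the Huber weight only shrinks their influence. In a hybrid model of the kind described above, the same reweighting would be applied to a multiple regression on the parameters retained by the machine learning step.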
