Identifying heterogeneity for increasing the prediction accuracy of machine learning models

Paavithashnee Ravi Kumar; Majid Khan Majahar Ali; Olayemi Joshua Ibidoja

doi:10.46481/jnsps.2024.2058

Authors

Paavithashnee Ravi Kumar
School of Mathematical Sciences, Universiti Sains Malaysia, 11800 USM Pulau Pinang, Malaysia
Majid Khan Majahar Ali
School of Mathematical Sciences, Universiti Sains Malaysia, 11800 USM Pulau Pinang, Malaysia
Olayemi Joshua Ibidoja
Department of Mathematics, Federal University Gusau, Gusau, Nigeria; School of Mathematical Sciences, Universiti Sains Malaysia, 11800 USM Pulau Pinang, Malaysia

Keywords:

Machine learning, Agriculture, Variable selection, Seaweed, Heterogeneity

Abstract

In recent years, machine learning has become increasingly important in agriculture, particularly in post-harvest monitoring for sustainable aquaculture. Heterogeneity, irrelevant variables, and multicollinearity all hinder the implementation of smart monitoring systems. This study focuses on heterogeneity among the drying parameters that determine moisture content removal during seaweed drying, a problem that has received limited attention, particularly in agriculture. A heterogeneity model within machine learning algorithms is proposed to improve the accuracy of predicting seaweed moisture content removal, and it is assessed before and after the removal of heterogeneity parameters and after the inclusion of single-eliminated heterogeneity parameters. The dataset consists of 1914 observations with 29 independent variables, of which this study uses five: temperature (T1, T4, T7), humidity (H5), and solar radiation (PY). Interactions among these variables up to second order are included, resulting in 55 variables. Variance inflation factors and boxplots are used to identify heterogeneity parameters. Two predictive machine learning models, random forest and elastic net, are then used to identify the 15 and 20 most important parameters for seaweed moisture content removal. Model performance is assessed with four evaluation metrics: MSE, SSE, MAPE, and R-squared. The results show that the random forest model outperforms the elastic net model, with higher accuracy and lower error, both before and after removing heterogeneity parameters and after reintroducing single-eliminated heterogeneity parameters. Notably, the random forest model achieves higher accuracy before heterogeneity parameters are excluded.
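As a rough illustration of the quantities named in the abstract, the following minimal sketch computes SSE, MSE, MAPE, and R-squared for a set of predictions, together with variance inflation factors for a design matrix. It is not the authors' code: the function names `regression_metrics` and `variance_inflation_factors` are hypothetical, the formulas are the standard textbook definitions, and the numbers in the usage example are invented for illustration rather than taken from the seaweed dataset.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return SSE, MSE, MAPE (in percent) and R-squared for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    sse = float(np.sum(residuals ** 2))
    mse = sse / y_true.size
    # MAPE is undefined when an observed value is zero; this sketch assumes none are.
    mape = float(np.mean(np.abs(residuals / y_true)) * 100.0)
    sst = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - sse / sst
    return {"SSE": sse, "MSE": mse, "MAPE": mape, "R2": r2}


def variance_inflation_factors(X):
    """Compute VIF_j = 1 / (1 - R_j^2) for each column j of X.

    R_j^2 comes from an ordinary least-squares regression of column j
    on all remaining columns plus an intercept.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        sst = np.sum((target - target.mean()) ** 2)
        r2 = 1.0 - np.sum(resid ** 2) / sst
        # A column that is perfectly explained by the others has infinite VIF.
        vifs.append(float("inf") if np.isclose(r2, 1.0) else float(1.0 / (1.0 - r2)))
    return vifs


if __name__ == "__main__":
    # Invented toy values, used only to exercise the functions.
    print(regression_metrics([2, 4, 6, 8], [2, 5, 6, 7]))
    print(variance_inflation_factors([[1, 1], [1, -1], [-1, 1], [-1, -1]]))
```

A large VIF for a column indicates that it is close to a linear combination of the other columns. The cut-off used to flag a parameter is a modelling choice and is not stated in the abstract.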
