Development of Predictive Model for Radon-222 Estimation in the Atmosphere using Stepwise Regression and Grid Search Based-Random Forest Regression

This work develops predictive models for estimating radon (222Rn) activity concentration in the atmosphere using grid search-based random forest regression (GS-RFR) and stepwise regression (SWR). The developed models employ meteorological parameters, namely temperature, pressure, relative and absolute humidity, wind speed and wind direction, as descriptors. Experimental data of radon concentration and meteorological parameters from two observatories of the Korea Polar Research Institute in Antarctica (King Sejong and Jang Bogo) have been employed in this work. The performance of the developed models was assessed using three different performance measuring parameters. On the basis of root mean square error (RMSE), the GS-RFR model performs better than the SWR model. At the King Sejong station, an improvement of 64.09 % and 15.19 % was obtained on the training and test datasets, respectively. At the Jang Bogo station, an improvement of 75.04 % and 28.04 % was obtained on the training and test datasets, respectively. The precision and robustness of the developed models would be of significant interest in determining the radon (222Rn) activity concentration in the atmosphere for various physical applications, especially in regions where field measuring equipment for radon is not available or measurements have been interrupted. DOI:10.46481/jnsps.2021.177


Introduction
Radon (Rn-222), the only gaseous member of the U-238 series, has been of interest to scientists since the twentieth century, when it was first suspected to be a causative agent for lung cancer among miners. The radioactive gas has been a significant subject of research among health and environmental scientists, having been characterized as a potential indoor source of air pollution. Its subsequent classification as a carcinogen has led to investigation and monitoring of the indoor concentration of the gas in several countries of the world [1][2][3][4][5][6][7][8]. The noble gas is produced by the decay of Ra-226 in bedrock and soil, migrates through soil pores to the surface by gas-phase diffusion and advection, and is removed by radioactive decay [9,10].
Due to some important characteristics of radon as a tracer of atmospheric processes, there has been a growing interest in recent decades in monitoring environmental radon. Being a noble gas, it is not chemically reactive with other elements. Its relatively low solubility in water and non-attachment to aerosols make it highly insusceptible to dry or wet atmospheric removal processes. Its half-life of 3.82 days is comparable to the lifetimes of short-lived environmental pollutants (e.g. NOx, SO2, CO, O3, CH4) and the atmospheric residence times of water and aerosols [10].
The noble gas has become very useful as a tracer of the influence of the terrestrial environment on air mass composition. Areas of application of ground-based radon observation include atmospheric, pollution and climate studies [11][12][13][14][15][16]. Anomalous behaviour of radon observed in soil and groundwater during earthquake events has been employed as a precursor for impending earthquakes [17,18].
Despite the progress made in radon instrumentation, data on atmospheric radon concentration are still largely lacking in the public domain. Africa, for instance, has only one radon observatory mentioned in the literature: an ANSTO-developed detector installed at a Global Atmosphere Watch (GAW) station at Cape Point, South Africa [10]. Ground-based radon measurement methods have not been applied to the study of atmospheric processes there as they have been in Europe. In fact, the first radon time series characterization on the continent was reported only recently [19]. Where measuring equipment is unavailable, predictive models for atmospheric radon concentration, built with machine learning methods trained on available experimental data, may be a viable way of generating synthetic data for the noble gas. Several researchers have developed theoretical models to predict radon behaviour and concentration for various conditions and applications [17,18,20,21,22]. Several studies have also reported the variation of atmospheric radon and its progeny with changes in meteorological parameters such as temperature, pressure, humidity and wind speed [23][24][25]. The authors of [26] used these meteorological variables as independent predictors in a multiple linear regression model for estimating and predicting the time series of radon and thoron progeny concentrations.
Random forest (RF) methodology is a machine learning technique developed by Leo Breiman and is useful for classification and prediction problems [27]. The algorithm samples small divisions of the data, grows a randomized tree predictor on each division, and then aggregates these predictors. It applies bootstrap aggregation and random feature selection to individual classification or regression trees for prediction [28]. Apart from the speed and ease of implementation of random forests, their predictions are remarkably accurate, and they can process very large input datasets while limiting overfitting. They also perform well with small to medium datasets [29]. Their good predictive abilities have made them highly applicable to regression and classification problems in the atmospheric sciences [30][31].
Grid search (GS) is one of the algorithms for hyperparameter optimisation and model tuning, with the aim of obtaining the most accurate results. Given a specified subset of the hyperparameter space of the training algorithm, it searches for the combination of parameters that yields the best results. To apply a grid search, boundaries need to be specified because some parameters within the algorithm's parameter space may take an unlimited range of values. The high-dimensional space problem of grid search is readily addressed by parallelizing the process, since the hyperparameters are usually not dependent on each other [32].
Multiple stepwise regression is efficient at selecting the contributing factors used in building models for statistical prediction. Its main objective is to find the combination of predictor variables that most accurately forecasts the predicted variable. The regression process begins by entering the most strongly contributing predictor variable into the prediction model. Additional variables are then added for as long as they contribute significantly to the regression equation [33,34].
The present work develops stepwise regression (SWR) and grid search-based random forest regression (GS-RFR) models through which radon concentration can be estimated and predicted using more widely available meteorological parameters as predictors: air temperature (AT), atmospheric pressure (AP), absolute humidity (AH), relative humidity (RH), wind direction (WD) and wind speed (WS). The two models are also compared in terms of performance. The proposed models will help not only to predict radon concentration but also to generate estimated or synthetic radon data that can approximate measured data for regions that lack measuring instruments for atmospheric radon but have access to meteorological data. They will also help estimate radon data for sites where measurements have been interrupted.

Description of Random Forest Regression
Following [35], a random forest is a family of N randomized regression trees. For the i-th tree, the predicted value at the query point x is denoted $m_n(x; \Theta_i, D_n)$, where $D_n$ is the training set and $\Theta_1, \ldots, \Theta_N$ are independent random variables that are independent of $D_n$. The variable $\Theta_i$ is first used to resample the training set before the individual tree is grown. Combining the trees gives the finite forest estimate for regression
$$m_{N,n}(x; \Theta_1, \ldots, \Theta_N, D_n) = \frac{1}{N} \sum_{i=1}^{N} m_n(x; \Theta_i, D_n). \tag{1}$$
In the case of classification, the random forest uses the majority vote among the classification trees. For a binary response, the forest estimate is
$$m_{N,n}(x; \Theta_1, \ldots, \Theta_N, D_n) = \begin{cases} 1 & \text{if } \frac{1}{N} \sum_{i=1}^{N} m_n(x; \Theta_i, D_n) > \frac{1}{2}, \\ 0 & \text{otherwise.} \end{cases} \tag{2}$$
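The averaging behind the finite forest estimate for regression can be illustrated with a minimal sketch. It assumes a single predictor and uses depth-1 trees (stumps) grown on bootstrap resamples; the function names `fit_stump`, `fit_forest` and `forest_predict` and the toy data are hypothetical, and the random feature selection used by full random forests is omitted because there is only one feature.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a depth-1 regression tree: one threshold and one mean per side."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        ml, mr = mean(left), mean(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    if best is None:  # only one distinct x value: predict the overall mean
        m = mean(ys)
        return lambda x: m
    _, t, ml, mr = best
    return lambda x: ml if x <= t else mr


def fit_forest(xs, ys, n_trees=25, seed=0):
    """Grow each tree on its own bootstrap resample of the training set."""
    rng = random.Random(seed)  # source of the per-tree randomness
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees


def forest_predict(trees, x):
    """Finite forest estimate: the plain average of the tree predictions."""
    return sum(tree(x) for tree in trees) / len(trees)


# Toy data (hypothetical): a step from 0 to 10 at x = 5.
xs = [float(v) for v in range(10)]
ys = [0.0 if v < 5 else 10.0 for v in xs]
trees = fit_forest(xs, ys)
low, high = forest_predict(trees, 0.0), forest_predict(trees, 9.0)
```

Each individual stump predicts either side of the step, so the forest prediction always lies between the smallest and largest training targets.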

Description of Grid Search optimization
The implementation of the grid search technique involves lower and upper bound vectors $V = (V_1, V_2, \ldots, V_q)$ and $W = (W_1, W_2, \ldots, W_q)$, respectively, defined for each component of the hyperparameter vector $H = (H_1, H_2, \ldots, H_q)$, where q is the number of hyperparameters. The optimization and parameter search procedure takes n equally spaced points within each interval $[V_i, W_i]$ of the search space, including the end points $V_i$ and $W_i$. The algorithm searches through $n^q$ possible points and, after evaluating each grid point, selects the optimum values [36].
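The search over equally spaced points described above can be sketched as follows. This is an illustration only: the helper names `linspace` and `grid_search` and the quadratic `toy_loss` are assumptions, standing in for whatever model score is being minimised.

```python
from itertools import product


def linspace(lo, hi, n):
    """Return n equally spaced points on [lo, hi], end points included."""
    if n == 1:
        return [lo]
    step = (hi - lo) / (n - 1)
    return [lo + k * step for k in range(n)]


def grid_search(objective, lower, upper, n):
    """Evaluate the objective on all n**q grid points and keep the minimum."""
    axes = [linspace(v, w, n) for v, w in zip(lower, upper)]
    best_point, best_score, evaluated = None, float("inf"), 0
    for point in product(*axes):
        evaluated += 1
        score = objective(point)
        if score < best_score:
            best_point, best_score = point, score
    return best_point, best_score, evaluated


def toy_loss(h):
    """Hypothetical objective with its minimum at h = (2, -1)."""
    return (h[0] - 2.0) ** 2 + (h[1] + 1.0) ** 2


# q = 2 hyperparameters, n = 5 points per axis, so 5**2 = 25 evaluations.
best, score, count = grid_search(toy_loss, lower=[0.0, -2.0], upper=[4.0, 2.0], n=5)
```

Because the number of evaluations grows as n to the power q, the grid resolution and the number of tuned hyperparameters have to be kept modest in practice.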

Stepwise Regression
Stepwise regression is an automatic procedure for selecting independent variables, based on forward and backward selection. Multiple linear regression (MLR) has the form
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n + \varepsilon. \tag{3}$$
In equation (3), Y is the output variable and $X_1, X_2, X_3, \ldots$ are predictor variables; $\beta_i$ are regression parameters, $\beta_0$ is an intercept and $\varepsilon$ is the random error term. The process is summarised below:
1. If, after a multiple linear regression on all n predictor variables, every variable is significant, the whole model containing all n variables is adopted.
2. Otherwise, a simple linear regression is performed with each of the n predictor variables in turn, and the variable that gives the lowest p-value for the t-test is selected.
3. A subsequent regression is performed on each of the remaining n−1 variables, keeping the variable selected in step 2 as common.
4. Step 3 is repeated, with each significant variable added to the model in a stepwise manner.
The test for significance in stepwise regression can be applied at two levels: the first for the addition of variables and the second for the removal of variables [37].
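The forward part of this procedure can be sketched in a few lines. Note the substitution: instead of the t-test p-value used in the text, this sketch admits a variable when it reduces the residual sum of squares by at least a relative threshold (`min_gain`, an assumed parameter), and it does not implement the removal step. All function names and the toy data are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def rss(columns, y):
    """Residual sum of squares of a least-squares fit with an intercept."""
    cols = [[1.0] * len(y)] + columns
    A = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
    rhs = [sum(a * b for a, b in zip(ci, y)) for ci in cols]
    beta = solve(A, rhs)  # normal equations
    return sum(
        (yi - sum(b * c[k] for b, c in zip(beta, cols))) ** 2
        for k, yi in enumerate(y)
    )


def forward_select(features, y, min_gain=0.05):
    """Add, one at a time, the variable giving the largest relative RSS drop."""
    selected, remaining = [], set(features)
    current = rss([], y)
    while remaining and current > 0:
        scores = {
            name: rss([features[s] for s in selected] + [features[name]], y)
            for name in remaining
        }
        best = min(scores, key=scores.get)
        if (current - scores[best]) / current < min_gain:
            break  # no remaining variable contributes enough
        selected.append(best)
        remaining.remove(best)
        current = scores[best]
    return selected


# Toy data (hypothetical): y depends on x1 only, plus a small disturbance.
features = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "x2": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
}
noise = [0.1, 0.1, -0.1, -0.1, -0.1, -0.1, 0.1, 0.1]
y = [2.0 * a + 1.0 + e for a, e in zip(features["x1"], noise)]
chosen = forward_select(features, y)
```

A p-value based criterion would need the t-distribution, which is why a simpler threshold is used here; the loop structure is the same either way.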

Performance measuring parameters
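Three performance measuring parameters were used to assess the developed models, namely the correlation coefficient (CC), root mean square error (RMSE) and mean absolute error (MAE). With $Y_i$ the actual output and $Y_i^*$ the predicted output for sample i, the correlation coefficient is defined as
$$CC = \frac{\sum_{i=1}^{N} (Y_i - \bar{Y})(Y_i^* - \bar{Y}^*)}{\sqrt{\sum_{i=1}^{N} (Y_i - \bar{Y})^2 \sum_{i=1}^{N} (Y_i^* - \bar{Y}^*)^2}}, \tag{4}$$
where $\bar{Y}^*$ and $\bar{Y}$ are the mean values of the predicted and actual outputs. RMSE is defined as
$$RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (Y_i - Y_i^*)^2}, \tag{5}$$
where N represents the number of samples contained in the dataset. MAE is defined as
$$MAE = \frac{1}{N} \sum_{i=1}^{N} \left| Y_i - Y_i^* \right|. \tag{6}$$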
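The three measures can be computed directly from paired lists of actual and predicted values. The sketch below uses their standard definitions; the function names and the sample numbers are illustrative only and are not taken from the datasets.

```python
from math import sqrt


def correlation_coefficient(actual, predicted):
    """Pearson correlation between actual and predicted outputs."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / sqrt(var_a * var_p)


def rmse(actual, predicted):
    """Root mean square error over the N samples."""
    n = len(actual)
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def mae(actual, predicted):
    """Mean absolute error over the N samples."""
    n = len(actual)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / n


# Hypothetical values: every prediction is 0.5 too high.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 3.5, 4.5]
cc_value = correlation_coefficient(actual, predicted)
rmse_value = rmse(actual, predicted)
mae_value = mae(actual, predicted)
```

The example shows why CC alone is not sufficient: a constant bias leaves the correlation perfect while RMSE and MAE both report the 0.5 offset.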

Description of sites
The data used in this work were published by [38] and were measured at two Korea Antarctic Research Program stations, namely King Sejong (KSG) and Jang Bogo (JBS). Measurements have been done jointly with the Australian Nuclear Science and Technology Organisation (ANSTO). The Korea Polar Research Institute (KOPRI) operates the KSG station (62.217° S, 58.783° W). The station functions as a regional World Meteorological Organisation (WMO) Global Atmosphere Watch (GAW) station. The JBS station is 10 m above sea level, with coordinates (74.623° S, 164.228° E). A detailed geographical description of the sites is given in [39].

Radon and Meteorological Data
At JBS, radon measurements have been made using a 1200 L two-filter dual-flow-loop radon detector custom built by ANSTO. The detector is installed approximately 40 m east of the main station, and air is sampled at 55 L min^-1 through a 50 mm high-density polyethylene pipe from approximately 6 m above ground level. To prevent thoron (220Rn; t_0.5 = 55.6 s) from entering the pipe and contaminating the sampled air, a 400 L delay volume is coupled within the sampling line. At approximately 170 m from the radon detector, meteorological data were collected from a 10 m tower with instrumentation composed of a sonic anemometer, a temperature and humidity probe, a barometer and a wind speed logger. In post-processing, all observations are aggregated to hourly values [39]. A radon detector similar in operation to that at JBS, but with a volume of 1500 L, was used for radon data collection at KSG, with meteorological data collected from a nearby observation system [40]. The dataset used was measured between December and February, with 1818 and 1955 datapoints for JBS and KSG respectively. Table 1 shows the statistical analysis from JBS and KSG. The mean, standard deviation and range are presented. While the mean and range describe the content of the dataset, the correlation coefficients depict the level of linear relationship between the target and predictor variables. Both tables indicate weak correlation between A_Rn and the descriptors, suggesting that a purely linear regression model may be insufficient to represent the relationship between the descriptors and the target.

SWR model
A flow chart of the stepwise process is presented in figure 1. At each step, when a variable x is added, all the predictor variables already in the model are reassessed for their significance p. Any variable whose significance has fallen below a specified threshold is removed from the model.

Building of GS-RFR model
The GS-RFR model was developed in Python. The radon concentration and the descriptors, namely air temperature (AT), atmospheric pressure (AP), absolute humidity (AH), relative humidity (RH), wind direction (WD) and wind speed (WS), were randomized and then partitioned into training and testing sets in the ratio 8:2. The RFR model was developed with the training set, while the general predictive ability of the model was assessed using the 20% test set. Randomization helps to ensure an unbiased spread of data across both the training and testing phases. The performance of the algorithm was optimized through the selection of optimum hyperparameters using grid search (GS) with cross validation. Table 3 shows the hyperparameters that were tuned, as suggested in the literature [41,42]. During hyperparameter tuning, 5-fold cross validation was used as the fitness function. The RF model with the optimum hyperparameters was then verified on the testing set.
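The data handling described above (randomization, an 8:2 split, then 5-fold cross validation on the training portion) can be sketched with the standard library alone. The function names, the seed and the rounding of fractional set sizes are assumptions; the sample count of 1955 is the KSG figure quoted earlier.

```python
import random


def train_test_split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle the sample indices and split them into training and test sets."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # randomization before partitioning
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold(indices, k=5):
    """Yield (train, validation) index lists for k-fold cross validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, validation


# 1955 hourly samples, as reported for the KSG dataset.
train_idx, test_idx = train_test_split_indices(1955)
splits = list(k_fold(train_idx, k=5))
```

In a grid search, each candidate hyperparameter combination would be scored by averaging its validation error over the five splits, and the test indices would be held back until the final verification.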

Comparison of Performance between the SWR and GS-RFR
For the two datasets, the performance of the two models developed by SWR and GS-RFR is depicted in figure 1. The predictive capabilities of the two models were assessed using the performance measuring parameters: correlation coefficient (CC), root mean square error (RMSE) and mean absolute error (MAE). Tables 4 and 5 show the estimated predictive performance of the two regression methods based on these three measures. Figure 2 compares the performance of the models on the test sets from KSG and JBS. The figures show better performance by the GS-RFR model than by the more traditional SWR. Considering RMSE, an improvement of 64.09 % and 15.19 % was obtained on the training and test datasets, respectively, at KSG, while at JBS an improvement of 75.04 % and 28.04 % was obtained on the training and test datasets, respectively (Table 6). The optimum hyperparameters of the RFR algorithm for each dataset are summarized in Table 7.
Table 4. Predictive performance of the two regression models in terms of Correlation Coefficient (CC), Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) for the KSG dataset.
The GS-RFR model presents the smallest RMSE on the two datasets employed. It also achieved the highest correlation coefficient on both training and test sets. The plots in figure 3 show the correlation between predicted and measured values of radon concentration. Given the strong correlation coefficients it achieved, the GS-RFR model appears to describe the non-linear relationship between atmospheric radon concentration and the influencing meteorological parameters well.
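The text does not state how the percentage improvement in RMSE was calculated. A common convention, assumed here, is the relative reduction in RMSE of GS-RFR with respect to SWR; the function name and the input values in this sketch are hypothetical and are not the values from Tables 4 to 6.

```python
def percentage_improvement(rmse_baseline, rmse_model):
    """Relative reduction in RMSE of a model against a baseline, in percent.

    Assumed convention: 100 * (baseline - model) / baseline.
    """
    return 100.0 * (rmse_baseline - rmse_model) / rmse_baseline


# Hypothetical RMSE values for illustration only.
example = percentage_improvement(rmse_baseline=2.0, rmse_model=1.5)
```

Under this convention a positive value means the second model has the lower error, and a value of 100 would correspond to a perfect fit.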

Conclusion
In this work, modelling of atmospheric radon was done using the more traditional stepwise regression (SWR) and a novel grid search-based random forest regression (GS-RFR). Datasets from two radon stations in Antarctica were used in the building of the models. Important factors such as air temperature, atmospheric pressure, absolute humidity, relative humidity, wind direction and wind speed were used as predictors. Comparing both models, the results show that the GS-RFR model performed better on both datasets in the training and testing phases. It presents a respective training and test improvement of 64.09 % and 15.19 % on one dataset and 75.04 % and 28.04 % on the other. Atmospheric radon data, which are finding more relevance today in the atmospheric sciences, are still scarce and not readily available. The precision and robustness of the developed models would be of significant interest in determining the radon (222Rn) activity concentration in the atmosphere for various physical applications, especially in regions where field measuring equipment for radon is not available but meteorological parameters are.