Bayesian Multilevel Models for Count Data

The traditional Poisson regression model for fitting count data is considered inadequate to fit overor under-dispersed count data and new models have been developed to make up for such inadequacies inherent in the model. In this study, Bayesian Multi-level model was proposed using the No-U-Turn Sampler (NUTS) sampler to sample from the posterior distribution. A simulation was carried out for both over-and under-dispersed data from discrete Weibull distribution. Pareto k diagnostics was implemented, and the result showed that under-dispersed and over-dispersed simulated data has all its k value to be less than 0.5, which indicate that all the observations are good. Also all WAIC were the same as LOO-IC except for Poisson in the over-dispersed simulated data. Real-life data set from National Health Insurance Scheme (NHIS) was used for further analysis. Seven multi-level models were fitted and the Geometric model outperformed other model. DOI:10.46481/jnsps.2021.168


Introduction
Count data contains non-negative integers and zero obtained within a fixed period. Various studies have been carried out on count data and modelling, [1] applied it to medical data [2] applied it to model microbiome data and many others. Frequentist and Bayesian estimation have equally been used to model count data, and the widely used is the Bayesian estimation technique. There are three major types of count data, the over-dispersed, under-dispersed and over-dispersed. More about types of count data can be found in the study by [3,4]. There are models dedicated to modelling under-dispersion due to their suitability while a model such as Negative Binomial is suitable for fitting over-dispersion. Models such as Dirichlet Process Prior, Negative Weibull can be used to fit both under-dispersion and over-dispersion. Models such as zero inflated and hurdle models can effectively handle over and under-dispersed count data with many zeros. The zero-truncated regression models are specifically designed to fit count data with no zero count. The categorized regression model is designed to fit count data that its response variable is categorized. Some of the improved techniques relative to Poisson regression model can be found in [5], [6], [3], amongst others. [7] carried out a on hidden markov model in multiple testing on dependent count data, [8] showed that the exponentiatedexponential Geometric distribution can be applied to fit underdispersed or over-dispersed count data, in the same manner [4] demonstrated that Dirichlet Process Mixture Prior of Generalized Linear Mixed Models (DPMglmm) can fit either overdispersed or under-dispersed count data well. [9] sufficiently showed that multi-level zero-inflated Poisson (ZIP) regression model can adequately fit both over-and underdispersed count data that have zero counts. The authors adopted 224 EM algorithm along with the penalized likelihood and restricted maximum likelihood (REML). [10] adopted multilevel Zeroinflated Poisson (ZIP) regression and zero-inflated negative binomial (ZINB) and applied the models to fit count data relating to decay, missing and filled teeth of children aged 12 years old. A related and recent study was carried out by [11], the authors proposed multilevel zero-inflated generalized Poisson (ZIGP) that is suitable in fitting both over-and under-dispersed count data and compared with multilevel models of zero-inflated Poisson, zero-inflated negative binomial. The result showed that the multilevel ZIGP produced more accurate parameter estimates, particularly for under-dispersed data. In this study, Bayesian multilevel modelling was proposed and implemented for some basic distributions used in fitting count data (Poisson, Negative Binomial and Geometric), zero inflated and hurdle models, and identify a most suitable model for fitting under-and over-dispersed respectively. The remaining part of this paper is sectionalized as follows; multi-level modelling is described in section 2, parameter estimation and model selection can be found in 3. Section 4 is the results of simulation study and that of real-life data. Lastly, summary and conclusion in section 5.

Multilevel Modelling
The multilevel modelling technique follows a similar process involved when fitting the Generalized Linear Model. In GLM, a link function links the response y variable to the predictor(s), same with multilevel modelling. Let ( f ) be an inverse link function that links the response variable y to the predictor, then Υ is the linear combination of the predictors transformed by the inverse link function ( f ), and d be a parametric model (distribution), the model can be simply be written as And linear predictor: The data is made up of the response variable y, X and K, while β, and θ are the model parameters to be estimated. while β is the fixed effect coefficient at population level, while is the coefficient at group-level. The Bayesian estimation technique by Monte Carlo Markov Chain (MCMC) procedure considers as a parameter relative to maximum likelihood that considers as error term [12].

Zero-Inflated distribution
If Yfollows zero-inflated Poisson (ZIP) distributions, given by Where is ω in the range0 < ω < 1 , in order to accommodate more zeros than those allowed under the Poisson assumption(ω = 0), and the case of ω < 0imply zero inflated. [9] estimated multi-level parameters of ZIP regression in the generalized linear mixed models (GLMMs) context. The authors generalized the ZIP model so that the model will be able to withstand more complex correlation structure. Zero-inflated negative binomial for counts is formed from ZIP, the mean and variance defined as: The use of regression models based on ZIP was established by [13][14][15], [5]. Following [5] we have: log(λ) = Xβ and where X and Z are matrices of covariates, β and Υare vector of parameters. Assuming two linear predictors are related in some ways, [16] provided a simplest form of (3) which is refers to the ZIP(τ)model as follows: Where τ is a scalar parameter, which implies thatω = (1 + λ −r ) −1 . Following equation (3) in multi-level case, [9] identified the extension of ZIP model to include random components w i and u i within logistic and Poisson linear predictors to take care of dependence of observations in given clusters. The random effects w i and u i are specific to the i th cluster. In a three-level hierarchical situation of Y i jk , the k th observation of the j th individual within the i th clusters is measured through random effects associated with the linear predictors as follows: The covariates a T i jk and x T i jk are not always the same α and β are the corresponding vectors of regression coefficients. s i j and v i j are variations at subject level.

Hurdle Models
If the distribution of Yfollows zero-truncated Poisson distribution it follows that: Reparametizing the zero-inflated Poisson model in equation (3) withπ [16], gave Poisson hurdle regression model is given as; Where π + = 1 − π 0 is the probability of clearing the "hurdle" and generating a non-zero count. 225

Prior Distributions
Prior distribution is specified at population and group-level. At population-level parameters have an improper prior [17]. At group level it is assumed that parameters ε comes from a multivariate normal distribution having zero mean and unknown covariance matrixΣ.
Covariances are between group-level parameters are generally of different groupings factors and assumed to be zero. By implication, Kand ε can be divided to form several matrices K i and parameter vectors ε i , where i indexes grouping factors, thus, the model can be simplified to Sometimes, it can be assumed that group-level parameters for different levels of the same grouping factors are not dependent. If the other level is indexed by j, (11) leads to: The covariance matrices M j will become the model parameters.
No-U-Turn Sampler (NUTS) by (2014) [18] is used for M j as instead of the Inverse-Wishart prior distribution used in most studies and packages. Inverse-Wishart distribution is used because it has good conjugacy characteristics for Gibbs-Sampler. The choice of Inverse-Wishart prior distribution was criticized in the studies by [19], and [20]. The parameters of M j is selected in terms of correlation matrix Ω j and a vector of standard deviations σ j through, and D σ j imply the diagonal matrix with diagonal elements σ j . Then, prior would be specified for D σ j Ω j D σ j . In the case of Ω j , LKJ-Correlation prior by [21] is used, with ζ > 0. That is, Ω j ∼ LK J (ζ) sectionParameter Estimation and Model Selection Sampling from the posterior require appropriate sampling procedure, two basic sampling procedures are discussed here. First is the Hamiltonian Monte-Carlo (HMC) Sampler, also known as Hybrid Monte-Carlo [22][23]. [18] extended HMC to No-U-Turn Sampler (NUTS), because HMC has some drawbacks as discussed by [17]. The NUTS Sampler allows setting parameters, and eliminates the need for hand-tuning, [18] stated that setting the parameters automatically makes it least efficient as compared to a well-tuned Hamiltonian Monte-Carlo. Software package by R core team (2020) was used to fit the model with brms package by [17] along with Stan processor without which the analysis cannot be run, it can be assessed on http://cran.rproject.org/bin/windows/Rtools/. The Watanabe-Akaike Information Criteria proposed (WAIC) by [24] and Leave-one-out cross validation LOO-CV by [25,26] were used for model selection in this study. The WAIC was used for estimating the out-of-sample expectation and considered an improvement upon the DIC, with WAIC, correction for effective number of parameters to adjust over-fitting is added. According to [27], WAIC can be computed in two possible ways, first is calculated using simulation θ s , s = 1, .......S and given as For the second WAIC computation approach, the variance of individual terms in the log predictive density is added up over the ndata points and express as follows: The advantages of WAIC over AIC and DIC was adequately discussed by [27] In the case of Leave-one-out cross-validation (LOO-CV) in Bayesian analysis, the data are repeatedly subdivided into a training set y train and a holdout set y holdout with the objective of fitting y train yielding a posterior distribution p train (θ) = p train (θ|y train ). The Bayesian LOO-CV estimate of out-of-sample predictive fit is and estimated as Lower WAICs and LOOs suggest better model fit.

Pareto-k-diagnostics
The shape parameter k of the generalized Pareto distribution can be used to assess the reliability and approximate convergence rate of the Pareto smoothed importance sampling (PSIS). It follows that if, k < 0.5(that is, 'good') then the central limit theorem holds. Similarly, If0.5 ≤ k < 1, (that is, 'ok') then the variance of the raw importance ratios is infinite, but the mean exists. In the same manner, If k > 0.7(that is, 'bad'), unreasonable convergence rates is observed and unreliable Monte Carlo error estimates, and finally, if k ≥ 1 (that is, 'very bad'), then neither the variance nor the mean of the raw importance ratios exists.

Simulation Study
Simulation of over and under-dispersed count data was carried out and the response count variable was obtained from Discrete Weibull distribution. On simulating count data from Discrete Weibull (DW) distribution, [28] identified that the parameter β of DW should contain the range0 ≤ β ≤ 1; irrespective the value of parameterq. For under-dispersed count data, β should be specified such that β ≥ 2, irrespective of the value of q. Analysis for simulation study was carried using software package by 226  [29], and R package "DWreg" by [30]. Random numbers consisting of 500 observations were generated and two predictors were uniformly generated in interval (0, 1) and (0, 2) for overand under-dispersion respectively. For over dispersed, beta=0.8 and for under-dispersed beta=2.1. The parameter of discrete Weibull follows that θ 0 = 0.25, θ 1 = 0.35, θ 2 = 0.5 and the corresponding equation for logit link is as follows All Pareto k estimates are good (k < 0.5) If p waic estimates greater than 0.4. We recommend trying loo Instead. Table 1 shows results for Under-dispersed simulated count data. The best two performed model are Geometric and Hurdle Poisson distribution indicated with ** and * respectively with lowest WAIC and LOO, following implementation using brms, a package for Bayesian multilevel modelling in R. from Table 1, it shows that WAIC=LOO for all the models. Table 2 shows results for Over-dispersed simulated count data. The best two performed model are negative Binomial and zero inflated negative binomial distribution indicated with ** and * respectively with lowest WAIC and LOO, following implementation using brms, a package for Bayesian multilevel modelling in R. All the model shows that WAIC=LOO expect for Poisson model.

Data Description
The data set was obtained from National Health Insurance Scheme (NHIS), and it contains excess zero count. Sample of 116 users of NHIS users was obtained from September 2016 to July 2017. Response variable is number of encounter (Encounter), out of 116 observed, eighty two (82) persons made claims. Encounter is the time a user of National Health Insurance Scheme (NHIS) visits the health facility, and possibly makes claims. The predictors include type of Encounter (type), which is either primary or (secondary= 0, primary= 1)   primary are users who registered primarily to use the health facility, while secondary are users who were referred from another health facility to that of State Hospital Ota, because of availability of specialists. Another predictor Sex (male=1 and female=0), Biological age of patients (Age). Number of drugs administered (DrugsAdm) that is, both oral and injection.
Drugs-out-of-stock (DrugsOS) is another predictor. The   idea of National Health Insurance Scheme (NHIS) is that treatments and drugs administered attract 10 percent of the total cost. It becomes disadvantage to patients having cases of Drug-sOS. Out of 116 users, 97 are primary, while 19 are secondary users, 70 out of 97 (72.16%) primary users did not make claims within the period observed. 4 out of 19 secondary users did not make claims (21.05%). Out of 116 NHIS users of the health facility, 64 are females and 52 are males. 41 females had zero claims (64.06%), while 33 males had zero claims (63.46%). 11 females and 8 males are secondary users, while 53 females and 44 males are primary users. The data can be obtained https://data.mendeley.com/drafts/6hcf5mf7fy.
Mean of the response variable (Encounter) is 0.7068, while variance is 1.7916 which potentially suggests over-dispersion.
Further test on the data shows that the dispersion parameter is δ=1.49799, indicating that the data set is over-dispersed with δ>1.The dispersion test was performed using AER package in R by [31]. From Table 3 shows results for the real-life Over-dispersed count data. The best two performed model are Geometric and Negative binomial distribution indicated with ** and * respectively with lowest WAIC and LOO, following implementation using brms, a package for Bayesian multilevel modelling in R, In all the models Waic LOO as shown in Table 6. As earlier stated; for any observation for k > 0.7indicate unreasonable convergence rates is observed and unreliable Monte Carlo error estimates. Table 4 shows that hurdle negative binomial (4) has the highest of such observations, Poisson, negative binomial, and hurdle Poisson model has 3 of such observations while Geometric, zero inflated poisson and zero inflated negbin has two (2) each. Each parameter is presented in Table 5 with the posterior mean as the 'Estimate' and the 'Est.Error' as the standard deviation of the posterior distribution, the Table also contain a two-sided 95% Credible intervals (l-95% CI and u-95% CI) established on quantiles. From Table 5, the 'intercept', 'type' and 'type:Sex"interaction has a negative posterior mean. For "type", the model predicts longer periods for encounter for secondary users than primary users; 'Sex' (0.19) accounts for more Encounter than 'Age' (0.01). "drugsOS (0.41) tells us that there significant cases of drugs-out-stock which suggest ineffectiveness of Nigeria Health Insurance Scheme.
Drawing samples from (NUTS) follows that for each parameter, Efficient Sample is a real measure of effective sample size, while Rhat is the would-be scale reduction factor on split chains (at convergence, Rhat = 1). Figure 1 shows that Encounter has positive relation with type; the figure suggests that the effect of Encounter on type is higher for secondary user of the facility since it is higher on zero end than 1. Figure 2 shows that Encounter has positive relation with Age; the figure suggests that as age increases Encounter increases Figure 3 shows that Encounter has positive relation with Sex; the figure suggests that male account for more Encounter than female. Figure 4 shows that Encounter has positive relation with type; Number of DrugsAdm increases with number of Encounter Figure 5 shows that Encounter has positive relation with type; Number of DrugsOS increases significantly with number of Encounter Figure 6 shows that as primary users of the facility increases, the number of Encounter increases across ages  Figure 7 shows that as primary users of the facility increases, the number of Encounter increases across Sexes Figure 8 shows that the density plots for the tail area of the distribution, which corresponds to l-95% CI and u-95% CI in Table 5 and trace plot, the Trace plot shows that the chains are stable.

Summary and Conclusion
In this study Bayesian multi-level model was proposed and implemented. The No-U-Turn Sampler (NUTS) sampler was used to sample from the posterior distribution, and implementations were done using the 'package brms' in R. Simulation study was carried out for both over-and under-dispersed and response variables were generated through discrete Weibull distribution while the predictors generated from Uniform distribution. Results from under-dispersed revealed that Geometric distribution is the most appropriate model to fit count data using multi-level approach. While for over-dispersed simulated data, negative binomial shows to outperform the Poisson, Geometric, hurdle-Poisson, hurdle-negbin, zero-inflated-Poisson, zero-inflated-negbin. Pareto K diagnostics shows that for under-dispersed and overdispersed simulated data all k are less than 0.5, which makes all the observations to be good, also all WAIC were the same as LOO-IC except for Poisson in the over-dispersed simulated data. Real-life data set of count of Encounter of patients from National Health Insurance Scheme was used for further analysis. The model that performs best was the Geometric distribution followed by negative binomial model. Contrary to the simulated data not all WAIC were the same as LOO-IC, except for Poisson model in the over-dispersed simulated data. The need to carry out LOO-IC was informed by having observa-   Figure 7 contains marginal plots to identify the relationship between the response variable (Encounter) and covariates; type, Sex, DrugsAdm, Age, and DrugsOS. As identified by [8] that Geometric family have the ability to model count data effectively, the same has been demonstrated in this study using Bayesian multi-level regression approach. Future direction can consider fitting multi-level regression model to fit count data using distribution such as the Weibull-exponential distribution and Exponentiated Generalized Weibull proposed by [32] and [33] respectively.