Model Fitness and Predictive Accuracy in Linear Mixed-Effects Models with Latent Clusters

In clustered data, observations within a cluster are similar to one another because they share common features that differ from those of observations in other clusters. In a given population, different clusterings may surface because correlation may occur across more than one dimension. Existing multilevel analysis techniques for the primal linear mixed-effects models (PLMMs) are limited to natural clusters, which often fail to capture the structure of real-life situations. Therefore, this paper proposes dual linear mixed models (DLMMs) for modeling unobserved latent clusters when such clusters are present in a data set, yielding appreciable gains in model fitness and predictive accuracy. The methodology develops and analyzes the DLMMs based on latent clusters derived from the natural clusters using multivariate cluster analysis. A published data set on political analysis is used to demonstrate the efficiency of the proposed models. The proposed DLMMs yielded the minimum values of the model assessment criteria (Akaike information criterion, Bayesian information criterion, and root mean squared error) and hence outperformed the classical PLMMs in terms of model fitness and predictive accuracy.


Introduction
The multilevel modeling technique follows a process similar to that involved in fitting the generalized linear model [1]. In particular, the linear mixed model (LMM) is one of the approaches for modeling normally distributed clustered data [2]. In clustered data, observations within a cluster are similar to one another because they share common features that differ from those of observations in other clusters. In a given population, different clusterings may surface because correlation may occur across more than one dimension [3]. The authors of [3] further argued that clustering is in essence a design problem, either a sampling design or an experimental design issue: even if data are collected in an unclustered way, there is still natural clustering in the population.
As an illustration from a Nigerian crime analysis, the initial data set comprised 36 states grouped into six geo-political zones and 12 Police Zonal Commands that share spatial and socio-ethnic similarities. However, the optimal number of clusters provided new structure classifications based on crime-rate similarities, different from the initial spatial and socio-ethnic similarities [4]. It is posited here that the observations in the newly formed clusters, based on multivariate clustering similarities, are more correlated than those in the natural clusters based on sampling and experimental design similarities. The former would better account for the differences between the clusters and improve model fitness and predictive accuracy.
The natural clusters and latent clusters are described as 'primal clusters' and 'dual clusters', respectively. The linear mixed-effects models (LMEMs, or simply LMMs) on the primal clusters and dual clusters are described as primal linear mixed models (PLMMs) and dual linear mixed models (DLMMs), respectively. This paper proposes efficient DLMMs for modeling data with latent clusters, yielding appreciable gains in model fitness and predictive accuracy.

The Concept of LMEMs on Latent Clusters and Model Assessment Criteria
The general concept of LMEMs on latent clusters is to maximize the correlation of observations within clusters and thereby improve model fitness and predictive accuracy. The latent clusters are formed from the natural clusters using multivariate cluster analysis. Both the natural and dual clusters contain the same observations, although the cluster structures differ; the argument for comparing models formed from the same data set with differing data structures was demonstrated by [14]. The agglomerative algorithm is a common approach in cluster analysis for classifying observations that share common properties into groups. The algorithm starts by calculating the distances between all pairs of observations, followed by stepwise agglomeration of close observations into groups. The Euclidean distance is the most commonly used distance measure for numerical data, while Ward's method is the most frequently used linkage method [5].
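The dual-cluster construction described above can be sketched with standard tools: agglomerative clustering with Ward linkage on Euclidean distances, then cutting the dendrogram into a chosen number of groups. This is an illustrative sketch on simulated data, not the paper's code; all names and values are assumptions.

```python
# Illustrative sketch (not the paper's code): deriving latent "dual"
# clusters via agglomerative clustering with Ward linkage.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(42)
# Simulated feature matrix: 30 units measured on 3 variables,
# drawn from three well-separated groups.
X = np.vstack([
    rng.normal(0.0, 1.0, size=(10, 3)),
    rng.normal(4.0, 1.0, size=(10, 3)),
    rng.normal(8.0, 1.0, size=(10, 3)),
])

# Ward linkage on Euclidean distances (scipy computes them internally).
Z = linkage(X, method="ward", metric="euclidean")

# Cut the dendrogram into 3 groups; these play the role of dual clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(sorted(set(labels)))
```

In the paper's setting, the cluster labels produced this way replace the natural grouping variable when the DLMM is fitted.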
The LMM is a linear model extended to account for dependency among clustered observations. In the biological and social sciences, model-based cluster analysis utilizes the LMM to group individuals into one of two or more clusters according to the similarity of their longitudinal behaviour [6,7]. In contrast to those studies, which utilize Expectation-Maximization algorithms for cluster formation, this study conjoins LMEMs and multivariate cluster analysis to develop efficient techniques for modeling unobserved latent groupings in a data set.
The degree of clustering in a data set is measured by the intraclass correlation (ICC), the proportion of the total variance in the data that is due to the clusters [8]. The argument in multilevel analysis on latent clusters is that increased clustering in the DLMM simultaneously reduces the indices of the model assessment criteria. Many indices are available to measure the performance of competing models [9]; the model assessment criteria in this work are the root mean squared error (RMSE), the Akaike information criterion (AIC) and the Bayesian information criterion (BIC).
The RMSE indicates the absolute fit of the model to the data; smaller values of RMSE indicate a better fit. The LMM improves model fitness and predictive performance because a multilevel model produces fitted values, ŷ, that are on average closer to the observed y than those obtained by fitting only the fixed part of the model. However, even in multilevel analysis, when the estimated random effects are biased towards zero, the fitted values are pulled in the direction of those of the fixed part of the model, which results in biased estimates. Furthermore, simpler models such as random intercept models produce relatively larger bias than more complex models such as random intercept and slope models [10]. Therefore, by analogy, the larger the estimated random effects, the closer ŷ moves to y, and hence the lower the RMSE.
The simplest information criterion widely applicable to non-nested models is the AIC [11,12]. The traditional AIC is not appropriate for clustered data, and therefore the marginal AIC (mAIC) is the most widely used criterion for model selection in LMMs [12]. The BIC [13] is related to the mAIC through the same marginal likelihood. The presence of random effects in LMMs results in smaller AIC and BIC values than in LMs [14].

The Linear Mixed Model
Consider a vector y of data from J clusters. The LMM, as defined by [2,15], is

y_j = X_j β + Z_j υ_j + ε_j, (1)

where y_j is the n_j-vector of responses for cluster j, j = 1, 2, ..., J is the cluster index, β is the p-vector of fixed effects, υ_j is the q-vector of random effects for cluster j, and X_j and Z_j are respectively the n_j × p and n_j × q matrices of covariates for the fixed and random effects, both of full rank. It is assumed that υ_j and ε_j follow independent multivariate Gaussian distributions such that [16]

υ_j ~ N(0, T), ε_j ~ N(0, σ² I_{n_j}), (2)

where T is the q × q positive definite covariance matrix of the random effects; the υ_j are assumed independent across j, the ε_j associated with different clusters are assumed independent of each other, and each ε_j is assumed independent of υ_j [15]. Integrating out the random effects yields the marginal model

y_j ~ N(X_j β, V_j), V_j = Z_j T Z_j' + σ² I_{n_j}, (3)

with marginal log-likelihood (up to an additive constant)

l(y_j | β, θ) = −(1/2) log|V_j| − (1/2)(y_j − X_j β)' V_j^{-1} (y_j − X_j β). (4)

Cluster Effects
The cluster effect, or the dependency among clustered observations, is measured by the ICC, defined as

ICC = σ²_υ / (σ²_υ + σ²_e), (5)

where σ²_υ and σ²_e are the random-effects variance and the random-error variance, respectively. For the PLMM and the DLMM, the corresponding quantities are

ICC_(P) = σ²_υ(P) / (σ²_υ(P) + σ²_e(P)) (6)

and

ICC_(D) = σ²_υ(D) / (σ²_υ(D) + σ²_e(D)). (7)
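The ICC above is a one-line computation once the variance components are available. A hedged sketch with illustrative values:

```python
# ICC as defined above: proportion of total variance due to clusters.
def icc(sigma2_v, sigma2_e):
    """sigma2_v: random-effects variance; sigma2_e: random-error variance."""
    return sigma2_v / (sigma2_v + sigma2_e)

print(icc(4.0, 1.0))  # 0.8: strong clustering
print(icc(1.0, 4.0))  # 0.2: weak clustering
```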

Root Mean Squared Error
The RMSE measures the difference between the observed data and the values predicted by the model, and is defined as

RMSE = sqrt( (1/n) Σ_{j=1}^{J} Σ_{i=1}^{n_j} (y_ij − ŷ_ij)² ), with n = Σ_{j=1}^{J} n_j, (8)

where J is the number of clusters, n_j is the number of observations in the jth cluster, and y_ij and ŷ_ij are the ith observed and estimated y in the jth cluster, respectively [17].
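A sketch of the pooled RMSE above; since the double sum runs over every observation in every cluster, the computation reduces to a mean over the stacked residuals.

```python
# RMSE pooled over all clusters: sqrt of the mean squared residual.
import numpy as np

def rmse(y, y_hat):
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))  # sqrt(4/3) ≈ 1.1547
```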

Marginal Akaike Information Criterion
The most commonly used information criterion is the AIC [11]. This criterion, which is based on the Kullback-Leibler distance, is defined as

AIC = −2 log f(y | ψ̂(y)) + 2k,

where f(y | ψ̂(y)) is the maximized likelihood and k is the number of parameters. This AIC is not appropriate for clustered data, and hence the mAIC is widely used for clustered data instead [12].
The mAIC in the LMM uses the likelihood of the implied marginal model y ~ N(Xβ, V) with V = σ² I_n + Z T Z'. The number of estimable parameters is then p + q, with β = (β_1, ..., β_p)' and q the number of unknown parameters θ in V. Thus, the mAIC is defined as

mAIC = −2 log f(y | β̂, θ̂) + 2(p + q), (9)

where f(y | β̂, θ̂) is the maximized marginal likelihood. However, the mAIC is positively biased and favours smaller models without random effects [18].

Bayesian Information Criterion
The BIC is obtained by taking the mAIC (9) and replacing the constant 2 in the penalty by log(n), to obtain

BIC = −2 log f(y | β̂, θ̂) + log(n)(p + q). (10)

This definition ensures that the BIC bears the same relationship to the mAIC for model (1) as the BIC bears to the AIC in regression, and so it should inherit some of its properties [13].
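Given the maximized marginal log-likelihood and the parameter count k = p + q, the two criteria (9) and (10) can be sketched directly; the numeric values below are purely illustrative.

```python
# mAIC (9) and BIC (10) from a maximized marginal log-likelihood.
import math

def maic(loglik, k):
    """mAIC = -2*loglik + 2*k."""
    return -2.0 * loglik + 2.0 * k

def bic(loglik, k, n):
    """BIC = -2*loglik + log(n)*k."""
    return -2.0 * loglik + math.log(n) * k

# Illustrative values: loglik = -150, k = 4 parameters, n = 100 observations.
print(maic(-150.0, 4))                 # 308.0
print(round(bic(-150.0, 4, 100), 2))   # 318.42
```

Because log(n) > 2 for n > 7, the BIC penalizes extra parameters more heavily than the mAIC, as the text's comparison of the two criteria assumes.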

Cluster Effects on Model Assessment Criteria
Cluster effects in mixed models are explained by the random-effects variance of the models, and including the random effects affects the covariance matrix V_j. As an illustration, consider a random intercept model for data in which five observations are taken on each cluster, so that n_j = 5 for all j. Then Z_j is a matrix of dimension 5 × 1 and R_j = σ² I_5, so that

V_j = Z_j T Z_j' + R_j = σ²_υ J_5 + σ² I_5, (11)

a 5 × 5 matrix whose diagonal elements are σ²_υ + σ² and whose off-diagonal elements are σ²_υ (here J_5 denotes the 5 × 5 matrix of ones). The diagonal elements, σ²_υ + σ², are the variances of the individual observations, and the off-diagonal elements, σ²_υ, are the covariances between any two observations in the same cluster. Relating the two terms, the intraclass correlation between two observations from the same cluster is σ²_υ/(σ²_υ + σ²) [14]. Denoting V_j by V_j(P) and V_j(D) for the PLMM and the DLMM respectively, if ICC_(D) > ICC_(P) as in (6) and (7), then V_j(D) > V_j(P). From (4), we respectively write the marginal log-likelihoods for the PLMM and DLMM as l(y_j | β̂, θ̂)_(P) and l(y_j | β̂, θ̂)_(D), such that

l(y_j | β̂, θ̂)_(P) = −(1/2) log|V̂_j(P)| − (1/2)(y_j − X_j β̂)' V̂_j(P)^{-1} (y_j − X_j β̂) (12)

and

l(y_j | β̂, θ̂)_(D) = −(1/2) log|V̂_j(D)| − (1/2)(y_j − X_j β̂)' V̂_j(D)^{-1} (y_j − X_j β̂). (13)

Because V̂_j(D) is larger than V̂_j(P), its inverse V̂_j(D)^{-1} is smaller than V̂_j(P)^{-1}; the quadratic form in (13) therefore shrinks, and its negative half, −(1/2)(y_j − X_j β̂)' V̂_j(D)^{-1}(y_j − X_j β̂), is higher than the corresponding term in (12). Conversely, since the logarithm of a larger determinant is larger, the first term in (13), −(1/2) log|V̂_j(D)|, is lower than the corresponding term in (12).
Although the first term in (13) decreases and the second term increases, the increase outweighs the decrease, so that

l(y_j | β̂, θ̂)_(D) > l(y_j | β̂, θ̂)_(P). (14)

The negative sign in the −2 log f(y | β̂, θ̂) entering the information criteria (9) and (10) reverses the direction of the inequality in (14), so that −2 l(y_j | β̂, θ̂)_(D) < −2 l(y_j | β̂, θ̂)_(P). Since p and q are the same in both the PLMM and the DLMM, it follows that mAIC_(D) < mAIC_(P) and BIC_(D) < BIC_(P). The LMM improves model fitness and predictive performance because it incorporates clustering effects when estimating the fixed parameters. This adjustment enables it to produce fitted values, ŷ, that are on average closer to the observed y than those obtained by fitting only the fixed part of the model [10]. In our proposal, the DLMM has a higher clustering effect than the PLMM, which enables it to produce fitted values, ŷ, that are on average closer to the observed y than those produced by the PLMM.
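A small numerical sketch of the compound-symmetry structure in the discussion above: a 5 × 5 V_j with diagonal σ²_υ + σ² and off-diagonals σ²_υ, and a check that a larger cluster effect (higher ICC at the same residual variance) increases log|V_j|, as the argument about the first term of (13) requires. The variance values are illustrative assumptions.

```python
# Compound-symmetry V_j for a random intercept model with n_j = 5.
import numpy as np

n = 5
I, J5 = np.eye(n), np.ones((n, n))

def V(sig2_v, sig2_e):
    """Covariance matrix of one cluster: sig2_v*J + sig2_e*I."""
    return sig2_v * J5 + sig2_e * I

V_P = V(1.0, 1.0)   # primal-like: ICC = 0.5
V_D = V(4.0, 1.0)   # dual-like:   ICC = 0.8, same residual variance

# Diagonal entries are variances sig2_v + sig2_e; off-diagonals are the
# within-cluster covariances sig2_v.
print(V_P[0, 0], V_P[0, 1])          # 2.0 1.0

# log-determinant grows with the cluster effect.
_, logdet_P = np.linalg.slogdet(V_P)
_, logdet_D = np.linalg.slogdet(V_D)
print(logdet_D > logdet_P)           # True
```

For this structure |V_j| = (σ²)^{n−1}(σ² + n σ²_υ), so the determinant is 6 in the first case and 21 in the second, matching the printed comparison.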

Cluster Algorithm: The Agglomerative Algorithm
The agglomerative procedure depends on the definition of the distance between two clusters. For the particular case where the metric A = diag(S⁻¹_X1X1, ..., S⁻¹_XpXp) is used to standardize the variables, the Euclidean distance d_ij between two cases i and j with variable values x_i = (x_i1, x_i2, ..., x_ip)' and x_j = (x_j1, x_j2, ..., x_jp)' is defined by

d_ij = sqrt( Σ_{k=1}^{p} (x_ik − x_jk)² / S_XkXk ),

where S_XkXk is the variance of the kth component [19].
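The standardized Euclidean distance above is available directly in scipy as `seuclidean`, which takes the per-variable variances; a sketch with illustrative values, checked against the formula:

```python
# Standardized Euclidean distance: each variable scaled by its variance.
import numpy as np
from scipy.spatial.distance import seuclidean

x_i = np.array([1.0, 10.0])
x_j = np.array([2.0, 30.0])
variances = np.array([1.0, 100.0])   # S_X1X1, S_X2X2

d = seuclidean(x_i, x_j, variances)

# Same result by the formula: sqrt(sum((x_ik - x_jk)^2 / S_XkXk))
d_manual = np.sqrt(np.sum((x_i - x_j) ** 2 / variances))
print(round(d, 4), round(d_manual, 4))  # both sqrt(5) ≈ 2.2361
```

Scaling by the variances stops a variable with large units (here the second one) from dominating the distance.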
The Ward algorithm computes the distances between groups and joins the ones that do not increase a given measure of heterogeneity "too much", so that the resulting groups are as homogeneous as possible. If two objects or groups, say P and Q, are united, the distance between this new group P + Q and a group R is computed using the following distance function:

d(R, P + Q) = [(n_R + n_P) d(R, P) + (n_R + n_Q) d(R, Q) − n_R d(P, Q)] / (n_R + n_P + n_Q).

The heterogeneity of group R is measured by the inertia inside the group, defined as the average squared distance of the group members from the group centroid,

I_R = (1/n_R) Σ_{i=1}^{n_R} d²(x_i, x̄_R),

where x̄_R is the centroid of group R [20].
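The Ward update above is a recurrence: after each merge, the new group's distance to every remaining group is obtained from three already-known pairwise distances. A sketch with illustrative inputs:

```python
# Ward's distance update: distance from group R to the merged group P+Q,
# computed from the pairwise group distances and group sizes.
def ward_update(d_RP, d_RQ, d_PQ, n_R, n_P, n_Q):
    n = n_R + n_P + n_Q
    return ((n_R + n_P) * d_RP + (n_R + n_Q) * d_RQ - n_R * d_PQ) / n

# Merging singletons P and Q, then measuring against a singleton R:
print(ward_update(d_RP=4.0, d_RQ=6.0, d_PQ=2.0, n_R=1, n_P=1, n_Q=1))  # 6.0
```

This is why the agglomeration never needs to revisit the raw data: each step works entirely from the current distance table.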

Illustration and Analysis
A published data set on political analysis was used to demonstrate the efficiency of the proposed models. The data set dcese, provided with the ceser R package, comes from [21]. It contains 299 (i = 1, 2, ..., 299) observations across 47 countries (j = 1, 2, ..., 47). The outcome variable is the effective number of electoral parties (enep). The explanatory variables are the number of presidential candidates (enpc), the proximity of presidential and legislative elections (proximity), the effective number of ethnic groups (eneg), the logarithm of the average district magnitude (logmag), and an interaction term between the logarithm of the district magnitude and the number of ethnic groups (logmag eneg = logmag × eneg).
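A hedged sketch of this model specification using statsmodels' formula interface, with a random intercept by country. The real dcese data set ships with the R package ceser; the synthetic data frame below merely reuses the column names listed in the text so the snippet runs end to end, and every numeric value in it is an invented placeholder.

```python
# Sketch (synthetic stand-in for dcese): random intercept LMM by country.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n, J = 299, 47
country = rng.integers(0, J, size=n)
u = rng.normal(scale=0.5, size=J)          # country-level random intercepts

df = pd.DataFrame({
    "country":   country,
    "enpc":      rng.normal(2.5, 1.0, size=n),
    "proximity": rng.uniform(0.0, 1.0, size=n),
    "eneg":      rng.normal(2.0, 0.5, size=n),
    "logmag":    rng.normal(1.0, 0.8, size=n),
})
# Placeholder outcome with a true enpc coefficient of 0.6.
df["enep"] = 2.0 + 0.6 * df["enpc"] + u[country] + rng.normal(size=n)

# Fixed effects as listed in the text; logmag:eneg is the interaction term.
m = smf.mixedlm("enep ~ enpc + proximity + eneg + logmag + logmag:eneg",
                data=df, groups=df["country"])
fit = m.fit()
coef_enpc = float(fit.fe_params["enpc"])
print(round(coef_enpc, 2))
```

Fitting the same formula on the primal grouping and on the dual (cluster-analysis) grouping is what the PLMM/DLMM comparison in the next subsection amounts to.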

Comparison between Primal and Dual Linear Mixed Models
We begin with a preliminary comparison of the PLMM and the DLMM using the primal and dual cluster data sets with J = 47 groups, and subsequently test the significance of the comparison. The comparison is in terms of the variance-covariance components and their impact on model fitness and predictive accuracy. The summary outputs of the models are presented in Table 1.
The summary in Table 1 reveals that while σ²_e is higher under the PLMM, σ²_υ and the ICC are relatively higher under the DLMM. There is a 61 percent decrease in σ²_e from the PLMM to the DLMM, and increases of 64 and 38 percent in σ²_υ and the ICC, respectively. The AIC, BIC and RMSE are lower under the DLMM than under the PLMM by 18, 17 and 38 percent, respectively. Hence, the proposed DLMM has increased the homogeneity of the observations within clusters and the heterogeneity of the clusters, which in turn has improved model fitness and predictive accuracy.
The PLMM and DLMM in Table 1 are described as 'full models' because they comprise both significant and non-significant explanatory variables. We now determine whether similar gains in the model assessment criteria can be obtained when only significant variables are included in the models. The models with only significant variables are described as 'reduced models'. The enpc is the only significant variable in both the PLMM and the DLMM. The summary of the reduced models is given in Table 2.
The ICC in the DLMM has increased by 38 percent over the PLMM, the same increase as in the full model. The AIC, BIC, and RMSE have smaller values under the DLMM than under the PLMM by 18, 18, and 38 percent, respectively; these percentage decreases are also almost the same as in the full model. Although the magnitudes of the AIC and BIC are reduced when non-significant explanatory variables are excluded from both the PLMM and the DLMM, the percentage difference between the PLMM and the DLMM is almost the same in the full and reduced models.
The PLMM and DLMM in Table 2 are random intercept models; we recast them as random intercept and slope models to assess the effects of increasing complexity in DLMMs. The summary of the random intercept and slope models is given in Table 3.
The ICC in the DLMM has increased by 20 percent over the PLMM, a smaller increase than in the random intercept model. The AIC, BIC and RMSE have lower values in the DLMM than under the PLMM by 17, 17 and 32 percent, respectively. Similar percentage differences are recorded between the PLMM and the DLMM as in the random intercept models; however, the difference is smaller for the RMSE.
The comparison reveals the superiority of the random intercept and slope DLMM over the random intercept DLMM in terms of model fitness. This coincides with the work of [14], in which the random intercept and slope model had a smaller AIC than the random intercept model. The predictive accuracy is higher in the DLMM than in the PLMM, and it is also higher in the random intercept and slope DLMM than in the random intercept DLMM, since higher predictive accuracy corresponds to a smaller RMSE.
The above comparison used a single sample outcome, J = 47, and hence does not constitute a statistical test. Therefore, we obtained fifteen sample combinations of the PLMMs and the corresponding DLMMs and compared their respective outcomes. Some sample combinations were replicated to explore possible outcome variability.

Assessing Clustering Effects between the PLMMs and DLMMs
The ICC is a function of σ²_υ and σ²_e; these quantities are presented in Table 4 and Figures 1 and 2.
The results show that σ²_e decreases and σ²_υ increases significantly from the PLMM to the DLMM. The decrease in σ²_e signifies homogeneity of observations, that is, increased correlation/dependency of observations within the dual clusters. The increase in σ²_υ signifies heterogeneity of clusters, that is, greater between-cluster variation. The two variance components greatly affect the ICC, which is significantly higher in the DLMMs. This signifies a stronger grouping structure in the dual clusters and stronger clustering effects in the DLMMs.

Assessing Model Fitness and Predictive Accuracy between PLMMs and DLMMs
The model assessment criteria are presented in Table 5 and Figures 4, 5 and 6.
The DLMMs have smaller AIC and BIC than the PLMMs, indicating a significant gain in model fitness in the DLMMs over the PLMMs. In addition to the relative selection of the best-fitted model carried out using the AIC and BIC, we supplemented the selection with an assessment of the models' predictive accuracy. The RMSE is significantly lower in the DLMMs, and this indicates a corresponding gain in predictive accuracy in the DLMMs over the PLMMs.

Conclusion
The paper proposed the development and analysis of the DLMM on dual clusters derived from the primal clusters. The clustering similarity in the primal clusters was based on commonly occurring phenomena or experimental designs, while the similarity in the dual clusters was based on multivariate clustering algorithms. Findings revealed that observations in the dual clusters are more correlated than those in the primal clusters. The proposed DLMM is relatively more efficient than the classical PLMM based on the model assessment criteria (AIC, BIC, and RMSE), for which the DLMM yielded the minimum values. Therefore, the proposed DLMM outperformed the classical PLMM in terms of model fitness and predictive accuracy.