 Research
 Open Access
 Published:
Simulation study for analysis of binary responses in the presence of extreme case problems
Genetics Selection Evolutionvolume 43, Article number: 41 (2011)
Abstract
Background
Estimates of variance components for binary responses in presence of extreme case problems tend to be biased due to an underidentified likelihood. The bias persists even when a normal prior is used for the fixed effects.
Methods
A simulation study was carried out to investigate methods for the analysis of binary responses with extreme case problems. A linear mixed model that included a fixed effect and random effects of sire and residual on the liability scale was used to generate binary data. Five simulation scenarios were conducted based on varying percentages of extreme case problems, with true values of heritability equal to 0.07 and 0.17. Five replicates of each dataset were generated and analyzed with a generalized prior (gprior) of varying weight.
Results
Point estimates of sire variance using a normal prior were severely biased when the percentage of extreme case problems was greater than 30%. Depending on the percentage of extreme case problems, the sire variance was overestimated when a normal prior was used by 36 to 102% and 25 to 105% for a heritability of 0.17 and 0.07, respectively. When a gprior was used, the bias was reduced and even eliminated, depending on the percentage of extreme case problems and the weight assigned to the gprior. The lowest Pearson correlations between true and estimated fixed effects were obtained when a normal prior was used. When a 15% gprior was used instead of a normal prior with a heritability equal to 0.17, Pearson correlations between true and fixed effects increased by 11, 20, 23, 27, and 60% for 5, 10, 20, 30 and 75% of extreme case problems, respectively. Conversely, Pearson correlations between true and estimated fixed effects were similar, within datasets of varying percentages of extreme case problems, when a 5, 10, or 15% gprior was included. Therefore this indicates that a model with a gprior provides a more adequate estimation of fixed effects.
Conclusions
The results suggest that when analyzing binary data with extreme case problems, bias in the estimation of variance components could be eliminated, or at least significantly reduced by using a gprior.
Background
It is well known that when the binary responses (1 = cases and 0 = controls) associated with a particular level of an effect fall within the same category, being either all ones or all zeros (known as extreme case problems or ECP), the likelihood is underidentified [1] and variance components tend to be biased. Sorensen et al. [2] noted that with an increasing number of fixed effects for a constant number of observations, a larger proportion of fixed effect levels will contain ECP. The authors further noted that no information is present in the data to estimate these fixed effects and the likelihood is illconditioned. Moreno et al. [3] reported that marginal maximum likelihood yielded biased inferences about the variance component, and that the direction of the bias depended on the amount of information associated with either fixed or random effects. Furthermore, when the Gibbs sampler was used to perform the marginalization, positively biased inferences were reported when the amount of data per fixed effect was small [3]. The authors concluded that the bias persisted when fixed effects were poorly estimated, despite the large amount of information about the variance component.
Sorensen et al. [2] hypothesized that the choice of the prior distribution for the fixed effects could be the most critical element for dealing with ECP. Hoeschele and Tier [4] conducted a simulation study which indicated that a portion of the bias could be alleviated by assuming that fixed effects follow a priori a Gaussian distribution. Likewise, Moreno et al. [3] reported a reduction in the bias when a Gaussian probability density function was assigned to the prior distribution of the fixed effects. However, this strategy may not always work as Moreno et al. [3] reported that assigning a Gaussian probability density function to the prior distribution of the fixed effects did not reduce the bias for very sparse data structures. The objective of the current study was to investigate the effect on inferences of assigning a generalized prior (gprior) with varying weights to fixed effects, first proposed by Zellner [5], for the analysis of binary data in the presence of varying percentages of ECP. A simulation using a sire model was used to test the effect of the gprior in the presence of ECP.
Methods
The threshold model is becoming a standard tool to analyze of discrete data in the field of animal breeding and genetics and extensive literature on its theoretical basis, implementation and application has been generated in the last twenty years [2, 6, 7].
Let Y_{ ijk } represent the observed binary trait taking values of 1 for cases or 0 for controls, and l_{ ijk } is an underlying continuous variable that relates to Y_{ ijk } through the following relationship:
Assuming the following mixed linear model on the liability scale:
where l, β and s are the vectors of liabilities, systematic and random effects, respectively. I is the identity matrix and X and Z are known incidence matrices.
To complete the Bayesian formulation, prior information for the model parameters must be specified. For the random effects, a multivariate normal distribution was assumed:
and a scaled inverted chisquared density was assumed for ${\sigma}_{u}^{2}$
with ν_{ u } and ${\sigma}_{0}^{2}$ set equal to 2 and the true value of the sire variance (0.05 or 0.02), respectively.
For the systematic effects, let $\mathbf{\beta}={\left[{\mathbf{\beta}}_{ECP}^{\prime},{\mathbf{\beta}}_{N\text{\_}ECP}^{\prime}\right]}^{\prime}$ where β_{ ECP }and β_{ N_ECP }are the ECP and nonECP classes. For β_{ N_ECP }, the following prior was assumed:
For β_{ ECP }, a generalized or gprior was assumed. It consists of a special form of a natural conjugate prior distribution for the parameters of multiple regression models, and can be viewed as a reference informative prior [5]. Its advantage in dealing with the ECP is that it can restrict location parameters from taking extreme large values due to an illdefined likelihood function. In its original form, a gprior is defined as a normal distribution with a zero mean and a variance proportional to the standard errors of the least square (LS) estimators of the model parameters. However, a more generalized gprior allows for a nonzero mean. The general form of a gprior is given by:
and
where ${\sigma}_{e}^{2}$ is the residual variance and was set to one, β_{ 0 } is the prior mean for the ECP classes, which could be set equal to the average of the non ECP classes, g is the relative weight given to the prior and ranges between 0 and 1, and X_{ ECP }is the incidence matrix corresponding to the ECP classes.
Following Albert and Chib [8] and Sorensen et al. [2], all conditional posterior distributions are in closed form, normal for β and s, truncated normal for l and scaled inverted Chi square for ?4?. Specifically, for β_{ ECP }, the conditional distribution is:
where ${\hat{\mathbf{\beta}}}_{ECP}$ is the estimate of the effects of the ECP classes based on data and assuming that the ${\mathit{X}}_{ECP}^{\prime}{\mathit{X}}_{ECP}$ is full rank. The vector β_{0} is, as indicated before, the mean of the normal prior for the ECP classes. Although it could be set to any reasonable value, including zero, in our case it was set equal to the average of the solutions for the nonECP classes. Note that when g = 0, the conditional distribution in (5) reduces to that obtained when a flat prior is used.
Simulation
A simulation using a sire model was carried out to investigate methods of analyzing binary data with ECP. Four overlapping generations were simulated. The base population included 50 unrelated sires and subsequent generations consisted of 150 sires each. Thus, a total of 500 sires were generated. For animals in a given generation, their sires were selected at random from the pool of sires from all previous generations. The dataset consisted of 5000 daughter records from the 500 sires with an average of around ten records per sire. Classes of the fixed effects were randomly assigned to each record.
A linear mixed model which included a fixed effect and random effects of sire and residual was used to generate the binary responses. The fixed effect has 200 classes and was drawn from a uniform distribution with on average 25 observations per class. In order to obtain a desired percentage of ECP in the data, the bounds of the uniform distributions were adjusted. For example, by decreasing the lower bound of the uniform distribution (on the negative side of the real line), more liabilities will have negative values and thus their associate binary responses will be equal to zero which, in turn, will lead to more ECP classes. However, within a given percentage of ECP, only the seed for the random number generator was changed, not the bounds of the uniform distribution. The resulting incidence rate ranged from 9 to 34%, depending on the different simulation scenarios. Transmitting ability of individuals from the base population was sampled from a normal distribution,$N\left(0,\mathit{I}{\sigma}_{u}^{2}\right)$, where I was the identity matrix and ${\sigma}_{u}^{2}=0.05\phantom{\rule{2.77695pt}{0ex}}\mathsf{\text{or}}\phantom{\rule{2.77695pt}{0ex}}0.02$. The remaining sire effects were sampled from a normal distribution with a mean equal to one half of the grandsire's transmitting ability and a variance equal to $\left(\frac{3}{4}\right){\sigma}_{u}^{2}$. The residual term was sampled from a normal distribution, $N\left(0,\mathit{I}{\sigma}_{e}^{2}\right)$, where I was the identity matrix and ${\sigma}_{u}^{2}=1.0$, resulting in a heritability of 0.17 or 0.07.
For each daughter record, the liability was calculated as the sum of all effects included in the model. Binary responses were assigned so that if the liability was greater than zero, then y_{ i } = 1, otherwise y_{ i } = 0. For each one of the two values used for the sire variance, five datasets with a varying percentage of ECP were simulated: 5% (5E), 10% (10E), 20% (20E), 30% (30E), and 75% (75E). Five replicates were simulated for each level of ECP.
Four methods to account for ECP, using a standard threshold model, were investigated: normal prior (NP), 5% gprior (5G), 10% gprior (10G), and 15% gprior (15G). All datasets were analyzed using the four methods to account for ECP; i.e., the 5% ECP data were analyzed using the NP, 5%, 10%, and 15% gprior, which are denoted as 5ENP, 5E5G, 5E10G, and 5E15G, respectively. The remaining ECP datasets are similarly denoted with the percentage of ECP listed first and then the percentage of gprior used in the analyses.
Convergence
Convergence diagnostics were based on the method of Raftery and Lewis [9], as implemented in the CODA software [10]. The required burnin period was always less than 1 300 iterations for all parameters in the analyses. Thus, a total chain length of 75 000 iterations of the Gibbs sampler was run with a conservative burnin of 25 000 iterations. The remaining 50 000 iterations were retained without thinning for postGibbs analysis. Furthermore, point estimates of the mean and standard deviation, and the high posterior density 95% [HPD (95%)] interval were obtained from the CODA software [10] for all parameters.
Results
Summaries of the posterior mean, standard deviation and HPD (95%) interval averaged over five replicates, are presented in Table 1 (when true sire variance was equal to 0.05) and Table 2 (when true sire variance was equal to 0.02). Using NP, point estimates of sire variance for 5E, 10E, and 20E, were greater than the true values but did not seem to be biased as the true value (0.05 or 0.02) was within the HPD (95%) interval. For 30E and 75E, the sire variance was grossly overestimated and biased as the true value was outside or barely within the HPD (95%) interval. In general, the sire variances decreased with an increase in the percentage of gprior used in the threshold analysis. When true sire variance was equal to 0.05, for 5E, the point estimate of the sire variance was 0.068 when NP was used; however, the point estimate decreased to 0.048 when 15G was used. This same trend was also observed for 10E (0.072 decreased to 0.050), 20E (0.082 decreased to 0.055), 30E (0.094 decreased to 0.058); and 75E (0.109 decreased to 0.069). Similar results were observed when the true value of the sire variance was 0.02 (Table 2). Analysis 15G yielded point estimates extremely similar to the true value for the 5E (0.048) and 10E (0.050) data. Furthermore, 15G yielded point estimates with values most similar to the true value and estimates with the smallest amount of bias for the 20E (0.055), 30E (0.058) and 75E (0.069) data.
A summary of the Pearson correlations between true and estimated fixed and random effects, averaged over five replicates, are presented in Table 3. For a heritability of 17%, Pearson correlations between true and estimated fixed effects for 5E (0.82), 10E (0.75), 20E (0.71), 30E (0.67), and 75E (0.49) were lowest when NP was used. The Pearson correlations between true and estimated fixed effects were virtually the same for 5E5G (0.903), with 5E10G (0.909) and 5E15G (0.909). This same trend was also observed for the 10E, 20E, 30E, and 75E data. Similar results in magnitude and trend were observed when the heritability was equal to 7%. For the sire effect, Pearson correlations between true and predicted breeding values were virtually the same across gprior weights within a given percentage of ECP.
Discussion
The results of this study indicate that point estimates of sire variance (Tables 1 and 2) obtained from 5ENP, 10ENP, and 20ENP were slightly higher than the true values, although without any indication of noticeable bias. However, point estimates of sire variance obtained from 30ENP and 75ENP were severely biased. Depending on the percentage of ECP, biases ranged from 36 to 102% when true sire variance was equal to 0.05 and from 25 to 105% when true sire variance was equal to 0.02 when NP was used. In fact, the true value for sire variance was outside the HPD (95%) interval for 75ENP, and barely inside the interval for 30ENP for both heritability values.
When a gprior was used, the bias was reduced and even eliminated, depending on the percentage of ECP and the weight assigned to the gprior. With 5G, overestimation persisted for all five ECP datasets (Tables 1 and 2). Furthermore, the true value used in the simulation was closer to the lower bound of the HPD (95%) interval, especially for 30E5G and 75E5G, indicating that point estimates of sire variance are less likely to be similar to the true value. Thus, point estimates of sire variance are biased upward. In fact, point estimates of sire variance using both heritability values were biased upward by 25 to 85% when 5G was used. Conversely, when 10G was included in the analysis, the bias was eliminated for datasets 5E (0.054) and 10E (0.057) and significantly reduced for datasets 20E (0.063); 30E (0.066); and 75E (0.071). Using 15G, bias was completely eliminated, regardless of the percentage of ECP, and only a slight overestimation was observed for 75E15G. The true value for sire variance was well centered within the HPD (95%) interval for 5E15G, 10E15G, 20E15G, 30E15G; and 75E15G further indicating that a threshold model with a 15% gprior resulted in more accurate estimation of sire variance for binary data with 5, 10, 20, 30, and 75% ECP.
The results of this study prove that when analyzing binary responses in the presence of ECP, a standard threshold model with a normal prior could lead to biased variance components, especially when the percentage of ECP is greater than 30%. Depending on the percentage of ECP in the dataset and on the weight assigned to the gprior, bias in variance component estimation could be avoided. Although these results are promising, the univariate sire model is not the typical method to analyze a binary trait. Typically, binary traits are analyzed in combination with a continuous trait(s) using an animal model rather than a sire model. Thus, further testing is needed to apply this methodology to a more realistic scenario like the use of an animal model and joint analysis of binary and continuous traits. On the practical side, it is worth mentioning that the normal prior approach could be a viable alternative when the ECP rate is low (below 30%) as reported in previous studies [3, 4]. Additionally, the results obtained using the normal prior could change if different hyperparameters or model implementations are adopted.
When compared with NP, a substantial increase in the Pearson correlations between true and estimated fixed effects were observed, for all percentages of ECP data, when 5G, 10G or 15G was used in the analysis to account for ECP (Table 3). When a 15% gprior was used instead of a normal prior, Pearson correlations between true and fixed effects increased by 11, 20, 23, 27, and 60% for 5E, 10E, 20E, 30E, and 75E, respectively. These results suggest that a standard threshold model with a normal prior was not adequate to estimate fixed effects, especially when 30% or more ECP were present. Conversely, when a gprior was used, Pearson correlations between true and estimated fixed effects were virtually the same within datasets of varying percentages of ECP. This result suggests that a standard threshold model with a gprior provides more adequate estimation of fixed effects.
There were no observable differences in Pearson correlations (Table 3) between true and predicted breeding values between using NP, 5G, 10G, and 15G within datasets of varying percentages of ECP. Unlike the correlations for the fixed effects, the increase in correlations between true and predicted breeding values when a 5% gprior was used instead of no gprior was less than 1% for 5E, 10E, 20E, 30E, and 75E. This result suggests that the presence of ECP in the fixed effects of binary responses and the weight of the gprior assigned did not impact prediction of breeding values. However, the effect of ECP on breeding value prediction could be greater in situations where an animal has only one record, which would be the case in an animal model.
In this study, the weight of the gprior was assumed known. A better approach is to estimate this parameter in the model. However, we believe that in practical situations, trying to estimate "g" could lead to the original problem of an illdefined likelihood, unless a somewhat strongly informative prior is used.
Conclusions
The results of this study prove that estimates of sire variance using a normal prior could be severely biased when a substantial percentage of ECP is present in binary data. Depending on the percentage of ECP present in the data, these upward biases of estimates ranged from 25 to 105% when a normal prior was used for the fixed effects. However, when a generalized prior was used, the bias was reduced and even eliminated, depending on the percentage of ECP and the weight assigned to the generalized prior. Pearson correlations between true and estimated fixed effects were lowest when a normal prior was used for all percentages of ECP. Furthermore, the correlations were similar when a generalized prior was used for all percentages of ECP present in the binary data. The results suggest that when analyzing binary data in the presence of ECP, bias in the estimation of variance components can be eliminated, or at least significantly reduced by using a generalized prior for the fixed effects.
Abbreviations
 5E :

5% extreme case problems
 5G :

5% generalized prior
 10E :

10% extreme case problems
 10G :

10% generalized prior
 15G :

15% generalized prior
 20E :

20% extreme case problems
 30E :

30% extreme case problems
 75E :

75% extreme case problems
 ECP :

extreme case problems
 gprior :

generalized prior
 HPD (95%) :

high posterior density 95% interval
 NP :

normal prior
References
 1.
Gelman A, Carlin JB, Stern HS, and Rubin DB: Bayesian Data Analysis. 1995, Chapman and Hall, London, England
 2.
Sorensen D, Andersen AS, Gianola D, Korsgaard I: Bayesian inference in threshold using Gibbs sampling. Genet Sel Evol. 1995, 27: 229249. 10.1186/12979686273229.
 3.
Moreno C, Sorensen D, GarcíaCortés LA, Varona L, Altarriba J: On biased inferences about variance components in the binary threshold model. Genet Sel Evol. 1997, 29: 145160. 10.1186/12979686292145.
 4.
Hoeschele I, Tier B: Estimation of variance components of threshold characters by marginal posterior modes and means via Gibbs sampling. Genet Sel Evol. 1995, 27: 519540. 10.1186/12979686276519.
 5.
Zellner A: On assessing prior distributions and Bayesian regression analysis with gprior distributions. Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti. Edited by: Goel PK, Zellner A. 1986, NorthHolland:Elsevier, 233243.
 6.
Gianola D: Theory and analysis of threshold characters. J Anim Sci. 1982, 54: 10791095.
 7.
Gianola D, Foulley JL: Sire evaluation for ordered categorical data with a threshold model. Genet Sel Evol. 1983, 15: 201223. 10.1186/12979686152201.
 8.
Albert JH, Chib S: Bayesian analysis of binary and polychotomous response data. J Amer Statist Assoc. 1993, 88: 669679. 10.2307/2290350.
 9.
Raftery AE, Lewis SM: How many iterations in the Gibbs sampler?. Bayesian Statistics 4. Edited by: Bernando JM, Berger JO, David AP, Smith AFM. 1992, Oxford: University Press, 763773.
 10.
Best N, Cowles MK, Vines K: CODA Manual version 0.30. 1995, MRC Biostatistics Unit, Cambridge, UK
Acknowledgements
The two (co)editors and two anonymous referees suggestions and comments contributed to greatly improve the final manuscript.
Author information
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
Coauthors RR and JKB designed and supervised all aspects of the study. RLS conducted the majority of the analyses. RD and EHH contributed to the drafting, results interpretation and paper formatting. All authors contributed to the drafting and editing of the manuscript. All authors read and approved the final manuscript.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Received
Accepted
Published
DOI
Keywords
 Fixed Effect
 Variance Component
 Binary Data
 Binary Response
 Sire Model