Bayesian estimation in animal breeding using the Dirichlet process prior for correlated random effects

In the case of the mixed linear model the random effects are usually assumed to be normally distributed in both the Bayesian and classical frameworks. In this paper, the Dirichlet process prior was used to provide nonparametric Bayesian estimates for correlated random effects. This goal was achieved by providing a Gibbs sampler algorithm that allows these correlated random effects to have a nonparametric prior distribution. A sampling based method is illustrated. This method which is employed by transforming the genetic covariance matrix to an identity matrix so that the random effects are uncorrelated, is an extension of the theory and the results of previous researchers. Also by using Gibbs sampling and data augmentation a simulation procedure was derived for estimating the precision parameter M associated with the Dirichlet process prior. All needed conditional posterior distributions are given. To illustrate the application, data from the Elsenburg Dormer sheep stud were analysed. A total of 3325 weaning weight records from the progeny of 101 sires were used.


INTRODUCTION
In animal breeding applications, it is usually assumed that the data follows a mixed linear model. Mixed linear models are naturally modelled within the Bayesian framework. The main advantage of a Bayesian approach is that it allows explicit use of prior information, thereby giving new insights in problems where classical statistics fail.
In the case of the mixed linear model the random effects are usually assumed to be normally distributed in both the Bayesian and classical frameworks.
According to Bush and MacEachern [3] the parametric form of the distribution of random effects can be a severe constraint. A larger class of models would allow for an arbitrary distribution of the random effects and would result in the effective estimation of fixed and random effects across a wide variety of distributions.
In this paper, the Dirichlet process prior was used to provide nonparametric Bayesian estimates for correlated random effects. The nonparametric Bayesian approach for the random effects is to specify a prior distribution on the space of all possible distribution functions. This prior is applied to the general prior distribution for the random effects. For the mixed linear model, this means that the usual normal prior on the random effects is replaced with a nonparametric prior. The foundation of this methodology is discussed in Ferguson [9], where the Dirichlet process and its usefulness as a prior distribution are discussed. The practical applications of such models, using the Gibbs sampler, has been pioneered by Doss [5], MacEachern [16], Escobar [7], Bush and MacEachern [3], Lui [15] and Müller, Erkani and West [18]. Other important work in this area was done by West et al. [24], Escobar and West [8] and MacEachern and Müller [17]. Kleinman and Ibrahim [14] and Ibrahim and Kleinman [13] considered a Dirichlet process prior for uncorrelated random effects.
Escobar [6] showed that for the random effects model a prior based on a finite mixture of the Dirichlet processes leads to an estimator of the random effects that has excellent behaviour. He compared his estimator to standard estimators under two distinct priors. When the prior of the random effects is normal, his estimator performs nearly as well as the standard Bayes estimator that requires the estimate of the prior to be normal. When the prior is a two point distribution, his estimator performs nearly as well as a nonparametric maximum likelihood estimator.
A mixture of the Dirichlet process priors can be of great importance in animal breeding experiments especially in the case of undeclared preferential treatment of animals. According to Strandén and Gianola [19,20] it is well known that in cattlebreeding the more valuable cows receive preferential treatment and to such an extent that the treatment cannot be accommodated in the model, this leads to bias in the prediction of breeding values. A "robust" mixed effects linear model based on the t-distribution for the "preferential treatment problem" has been suggested by them. The t-distribution, however, does not cover departures from symmetry while the Dirichlet process prior can accommodate an arbitrarily large range of model anomalies (multiple modes, heavy tails, skew distributions and so on). Despite the attractive features of the Dirichlet process, it was only recently investigated. Computational difficulties have precluded the widespread use of Dirichlet process mixtures of models until recently, when a series of papers (notably Escobar [6] and Escobar and West [8]) showed how Markov Chain Monte Carlo methods (and more specifically Gibbs sampling) could be used to obtain the necessary posterior and predictive distributions.
In the next section a sampling based method is illustrated for correlated random effects. This method which is employed by transforming the numerator relationship matrix A to an identity matrix so that the random effects are uncorrelated, is an extension of the theory and results of Kleinman and Ibrahim [14] and Ibrahim and Kleinman [13] who considered uncorrelated random effects. Also by using Gibbs sampling and data augmentation a simulation procedure is derived for estimating the precision parameter M associated with the Dirichlet process prior.

MATERIALS AND METHODS
To illustrate the application, data from the Elsenburg Dormer sheep stud were analysed. A total of 3325 weaning records from the progeny of 101 sires were used.

Theory
A mixed linear model for this data structure is thus given by y = Xβ +Zγ + ε (1) where y is a n × 1 data vector, X is a known incidence matrix of order n × p, β is a p × 1 vector of fixed effects and uniquely defined so that X has a full column rank p, γ is a q×1 vector of unobservable random effects, (the breeding values of the sires). The distribution of γ is usually considered to be normal with a mean vector 0 and variance-covariance matrix σ 2 γ A.Z is a known, fixed matrix of order n × q and ε is a n × 1 unobservable vector of random residuals such that the distribution of ε is n-dimensional normal with a mean vector 0 and variance-covariance matrix σ 2 ε I n . Also the vectors ε and γ are statistically independent and σ 2 γ and σ 2 ε are unknown variance components. In the case of a sire model, the q × q matrix A is the relationship (genetic covariance) matrix. Since A is known, equation (1) can be rewritten as This transformation is quite common in animal breeding. A reference is Thompson [22]. The reason for making the transformation u = Bγ is to obtain independent random effects u i (i = 1, . . . , q) and as will be shown later the Dirichlet process prior for these random effects can then be easily implemented. The model for each sire can now be written as (2) where y i is n i × 1, the vector of weaning weights for the lambs (progeny) of the ith sire. X i is a known incidence matrix of order n i × p, Z i = 1 n i z (i) is a matrix of order n i × q where 1 n i is a n i × 1 vector of ones and z (i) is the ith row of B −1 .
The model defined in (2) is an extension of the model studied by Kleinman and Ibraham [14] and Ibrahim and Kleinman [13] where only one random effect, u i and the fixed effects have an influence on the response y i . This difference occurs because A was assumed an identity matrix by them.
In model (2) and for our data set, "flat" or uniform prior distributions are assigned to σ 2 ε and β which means that all relevant prior information for these two parameters have been incorporated into the description of the model. Therefore: ε is a bounded flat prior [0, ∞] and β is uniformly distributed on the interval [−∞, +∞]. Furthermore, the prior distribution for the uncorrelated random effects u i (i = 1, . . . , q) is given by Such a model assumes that the prior distribution G itself is uncertain, but has been drawn from a Dirichlet process. The parameters of a Dirichlet process are G 0 , the probability measure, and M, a positive scalar assigning mass to the real line. The parameter G 0 , called the base measure or base prior, is a distribution that approximates the true nonparametric shape of G. It is the best guess of what G is believed to be and is the mean distribution of the Dirichlet process (see West et al. [24]). The parameter M on the contrary reflects our prior belief about how similar the nonparametric distribution G is to the base measure G 0 . There are two special cases in which the mixture of the Dirichlet process (MDP) models leads to the fully parametric case. As M → ∞, G → G 0 so that the base prior is the prior distribution for u i . Also if the true values of the random effects are identical, the same is true. The use of the Dirichlet process prior can be simplified by noting that when G is integrated over its prior distribution, the sequence of u i 's follows a general Polya urn scheme, (Ferguson [9]), that is In other words, by analytically marginalising over this dimension of the model we avoid the infinite dimension of G. So marginally, the u i 's are distributed as the base measure along with the added property that p(u i = u j i = j) > 0. It is clear that the marginalisation implies that random effects (u i ; i = 1, . . . , q) are no longer conditionally independent. See Ferguson [9] for further details. Specifying a prior on M and the parameters of the base distribution G 0 completes the Bayesian model specification. In this note we will assume that Marginal posterior distributions are needed to make inferences about the unknown parameters. This will be achieved by using the Gibbs sampler. The typical objective of the sampler is to collect a sufficiently large enough number of parameter realisations from conditional posterior densities in order to obtain accurate estimates of the marginal posterior densities, see Gelfand and Smith [10] and Gelfand et al. [11]. If "flat" or uniform priors are assigned to β and σ 2 ε , then the required conditionals for β and σ 2 ε are and The proof of the following theorem is contained in Appendix A.

Theorem 1
The conditional posterior distribution of the random effect u is where φ(.|µ, σ 2 ) denotes the normal density with mean µ and variance σ 2 , u ( ) denotes the vector of random effects for the subjects (sires) excluding subject , δ s is a degenerate distribution with point mass at s and Each summand in the conditional posterior distribution of u given in (6) is therefore separated into two elements. The first element is a mixing probability, and the second is a distribution to be mixed. The conditional posterior distribution of u can be sampled according to the following rule: is the prior distribution of u . For the procedure described in equation (8),the weights are proportional to From the above sampling rule (equation (8)) it is clearer that the smaller the residual of the subject (sire) , the larger the probability that its new value will be selected from the conditional posterior density h(u |β, σ 2 γ , σ 2 ε , u ( ) , y). On the contrary, if the residual of subject is relatively large, larger than the residual obtained using the random effect of subject j, then u j is more likely to be chosen as the new random effect for subject .
The newly generated random effects for each subject (sire) will be grouped into clusters in which the subjects have equal u 's. That is, after selecting a new u for each subject in the sample, there will be some number k, 0 < k ≤ q, of unique values among the u 's. Denote these unique values by δ r , r = 1, . . . , k. Additionally let r represent the set of subjects with a common random effect δ r . Note that knowing the random effects is equivalent to knowing k, all of the δ's and the cluster membership r. Bush and MacEachern [3], Kleinman and Ibrahim [14] and Ibrahim and Kleinman [13] recommended one additional piece of the model as an aid to convergence for the Gibbs sampler. To speed mixing over the entire parameter space, they suggest moving around the δ's after determining how the u 's are grouped. The conditional posterior distribution of the location of the cluster given the cluster structure is and the matrixZ(n × k) is obtained by adding the row values of these columns of Z that correspond to the same cluster. After generating δ ( +1) these cluster locations are then assigned to the u ( +1) according to the cluster structure.
When the algorithm is implemented without this step, we find that the locations of the clusters may not move from a small set of values for many iterations, resulting in very slow mixing over the posterior and leading to poor estimates of posterior quantities. For the Gibbs procedure described above, it is assumed that σ 2 γ and M are known. Typically the variance σ 2 γ in the base measure of the Dirichlet process is unknown and therefore a suitable prior distribution must be specified for it. Note that once this has been accomplished the base measure is no longer marginally normal.
For convenience, suppose p(σ 2 γ ) ∝ constant to present lack of prior knowledge about σ 2 γ . The posterior distribution of σ 2 γ is then an inverse gamma density The Gibbs sampling scheme is modified by sampling δ from (10) and σ 2 γ from (11). The precision or total mass parameter M of the mixing Dirichlet process directly determines the prior distribution for k, the number of additional normal components in the mixture, and thus is a critical smoothing parameter for the model. The following theorem can now be stated.

Theorem 2
If the noninformative prior p(M) ∝ M −1 is used, then the posterior of M can be expressed as a mixture of two gamma posteriors, and the conditional distribution of the mixing parameter x given M and k is a simple beta. Therefore (12) and The proof is given in the Appendix.
where the summands are simply the conditional gamma mixtures in equation (12).
Finally the correlated random effects γ as defined in equation (1) can be obtained from the simulated u's by making the transformation γ = B −1 u.
Convergence was studied using the Gelman and Rubin [12] method. Multiple chains of the Gibbs sampler were run from different starting values and the scale reduction factor which evaluates between and within chain variation was calculated. Values of this statistic near one for all the model parameters was confirmation that the distribution of the Gibbs simulation was close to the true posterior distribution.

Example: Elsenburg Dormer sheep stud
An animal breeding experiment was used to illustrate the nonparametric Bayesian procedure. The data are from the Dormer sheep stud started at the Elsenburg College of Agriculture near Stellenbosch, Western Cape, South Africa in 1940. The main object in developing the Dormer was the establishment of a mutton sheep breed which would be well adapted to the conditions prevailing in the Western Cape (winter rainfall) and which could produce the desired type of ram for crossbreeding purposes, Swart [21]. Single sire mating was practised with 25 to 30 ewes allocated to each ram. A spring breeding season (6 weeks duration) was used throughout the study. The season therefore had to be included as a fixed effect as a birth year-season concatenation. During lambing, the ewes were inspected daily and dam and sire numbers, date of birth, birth weight, age of dam, birth status (type of birth) and size of lamb were recorded. When the first lamb reached an age of 107 days, all the lambs 93 days of age and older were weaned and live weight was recorded. The same procedure was repeated every two weeks until all the lambs were weaned. All weaning weights were adjusted to a 100 day equivalent before analysis by using the following formula Weaning weight − Birth weight Age at weaning 100 + Birth weight.
As mentioned, a total of 3 325 weaning records from the progeny of 101 sires were used. In other words only a sample from the Elsenburg Dormer stud was used for calculation and illustration purposes. The model in this case is a sire model and the breeding values of the related sires are the random effects. Whenever appropriate, comparisons will be drawn in Section 3 between the Bayes estimates (using Gibbs sampling) obtained from a Matlab ® programme, as well as the restricted maximum likelihood (REML) estimates. The classical (REML) estimates were obtained by using the MTDFREML programme developed by Boldman et al. [2]. The sex of the lamb was male and female, birth status was individual, twins and triplets, age of dams were from 2 to 7 years and older and the years (season of birth) from 1980-1999.
The Gibbs sampler constructed to draw from the appropriate conditional posterior distributions is described in Section 2.1. Five different Gibbs sequences of length 404 000 were generated. The burn in period for each chain was 4000 and then every 250th draw was saved, thus giving a sample of 8000 uncorrelated draws. By examination of the scale reduction factor it was clear that convergence has been obtained. Draws 500 apart were also considered but no differences in the corresponding posterior distributions, parameter estimates or random effects could be detected.

Variance components
The estimates of the variance components obtained from REML, traditional and nonparametric Bayesian procedures are reported in Table I. Also given in Table I , the heritability coefficient. In Table II Tables I and II, it is clear that the point estimates and 95% credibility intervals of σ 2 ε using REML or Bayesian methods are for all practical purposes the same. This comes as no surprise since the posterior density of the error variance is not directly influenced by the Dirichlet process prior.    The situation for the sire variance was however, somewhat different. The posterior distributions of σ 2 γ are displayed in Figure 1. The differences in the densities are evident from the figure. These differences were expected, since the sire variance component is directly affected by the relaxation of the normal assumption.
The posterior densities of h 2 are given in Figure 2. The differences in the spread of these densities are similar to those of Figure 1, i.e. the posterior density in the case of the nonparametric Bayes method is more spread out than the corresponding density for the traditional Bayes procedure.

Fixed effects
The emphasis in breeding experiments is on the variance components and on the prediction of particular random effects, but estimation of the fixed effects is also important. Once estimates of the variance components have been obtained, they can be used to derive estimates of the fixed effects. These estimates are given in the REML/BLUE column of Table III where the estimates of the first ten fixed effects are given. The "Traditional Bayes" and "Nonparametric Bayes" estimates as well as the 95% credibility intervals were calculated using the Gibbs sampler. As expected the estimates of the fixed effects for the different methods are for all practical purposes the same. The Dirichlet process prior did not directly influence the posterior distributions of the fixed effects to the same extent as that for the random effects.
The fixed effect β 1 for example measures the expected difference in average weaning weight between male and female lambs, β 2 the expected difference between "single births" and "triplets" and β 3 between "twins" and triplets. β 6 , however, measures the expected difference in average weaning weight of lambs for dams 4 and 7 years of age and β 9 measures the expected difference in average weaning weight between the lambs born in 1980 and 1999.

Random effects
Rather than reporting the results for all 101 sires, we will focus our discussion on the worst five and best five sires. The rankings of these ten sires were the same for the REML and traditional Bayes procedures but differed somewhat from the rankings of the nonparametric Bayesian method. For example sire No. 63 had the fifth best ranking according to the traditional Bayes method but was only ranked eighth using the Dirichlet process prior method. On the contrary, sire No. 59 was ranked the fifth best using the nonparametric procedure but was only the seventh best according to REML and traditional Bayes. The breeding values and 95% credibility intervals are listed in Table IV while the estimated posterior densities of the breeding values for sires No. 35 ("best sire") and 36 ("worst sire") are illustrated in Figures 3 and 4. Unlike the densities of the fixed effects, the Dirichlet process prior had a large effect on the posterior densities of the different breeding values. Also from the posterior distributions of the selected breeding values, different facts were revealed. Not only were the credibility intervals wider in the case of the nonparametric Bayesian procedure, but the central values (modes and means) were different as well. A key difference between REML/BLUP prediction and Bayesian inference is the treatment of the variance components. To obtain the BLUP estimates, the variance components are fixed at a single value, ignoring uncertainty associated with estimating their values. The Bayesian analysis incorporates this uncertainty by averaging over the plausible values of the variance components.

Precision parameter
Let us now turn to the important parameter of the Dirichlet process, M. Recall that the parameter M, a type of dispersion parameter for the Dirichlet    process prior, is a measure of the strength in the belief that G is G 0 . Although it may be hard to quantify, M is a positive scalar that is related to how "clumpy" the data are (often called a precision parameter). Clumpy data occur when the different subjects (sires) are concentrated into a few clusters. Ibrahim and Kleinman [13] chose a range of values for the parameter M to reflect small, moderate and large departures from normality for the distribution of the random effects. In practice it is, however, difficult to select appropriate values for this parameter. In this example, we extended the results of Ibrahim and Kleinman [13] by placing a vague prior on M and simulating its posterior distribution. Recall also that M determines the prior distribution of k, the number of additional normal components in the mixture and is a critical smoothing parameter for the mixed linear model. In Figure 5, the posterior distribution of k, the number of "sire clusters" for the Nonparametric Bayesian method is illustrated. The mean value is k = 70 and the mode of the posterior distribution of M is M 0 = 110.5. As mentioned, large values of M favour the base measure G 0 as our prior, i.e. the traditional Bayes procedure. For this example M 0 is relatively small which is an indication that the nonparametric Bayes procedure will give better estimates and predictions than the traditional method. The posterior distribution of M is given in Figure 6.

DISCUSSION
There is a general feeling among many statisticians that it is desirable in many contexts to make fewer assumptions about the underlying populations from which the data are obtained than is required for a parametric analysis. It is of practical interest, therefore, to study models that are less sensitive than Gaussian ones to departures from assumptions. For example, in dairy cattle breeding, it is known that more valuable cows receive preferential treatment (better housing, hormonal treatment, better feeding, etc.) and to such an extent that the treatment cannot be accommodated in the model, this leads to bias in the prediction of breeding values. Not much work so far has been undertaken on how to cope with a preferential treatment in practice. Strandén and Gianola [19,20] applied a Bayesian approach to fit mixed linear models with t-distributed random and residual terms both in the univariate and multivariate forms. The problem with the t-distribution, however, is that it does not cover departures from symmetry. It might also be that no single parametric model could fully capture the variability inherent in the response variable, suggesting that we average over a collection of densities (see Carlin and Louis [4]). Unfortunately inference using mixture distributions is sometimes difficult since the parameters of the models are often not fully identified by the data. An appealing alternative to discrete finite mixtures would be continuous mixtures but this creates the problem of choosing an appropriate mixing distribution.
In this note and for our example, the mixing distribution was specified nonparametrically by using the Dirichlet process prior, in fact we allowed the data to suggest the appropriate mixing distribution. The Dirichlet process prior for the random effects is a more flexible prior than the t-distribution because it can accommodate an arbitrarily large range of model anomalies (multiple modes, heavy tails, skew distributions and so on).
In our opinion the best solution (for the preferential treatment of the animals problem) will be to combine the t-distribution and the Dirichlet process prior, i.e. the student t-distribution for the errors and the Dirichlet process prior for the random effects. This will result in a model that is also robust to outlying observations. At the moment, it is not yet possible to apply these robust methods to large data sets in animal breeding. Hence, if it is established that such a nonparametric model improves the estimation of variance components and fixed effects and the prediction of random effects, simpler and faster computer programmes should be developed.
We must, however, state that the discussion of the "preferential treatment of animal problems" has no linkage to our particular data set problem. The data from the Elsenburg Dormer sheep stud was mainly included for illustration and comparison purposes: i.e. to illustrate the application of the Dirichlet process prior for correlated random effects (which as far as we know has never been done before) and to compare the corresponding results of the REML, traditional Bayes and nonparametric Bayesian procedures. We are, however, quite sure that the Dirichlet process prior will in the future play an important role in cases where undeclared preferential treatment of animals occurs.

CONCLUSION
In this note we have applied a Bayesian nonparametrics method to the mixed linear model. A Gibbs sampling procedure to calculate posterior distributions for the genetic parameters of this model has been presented. The important contribution of this paper revolves around the nonparametric prior distribution, the Dirichlet process prior, for the random effects and to correctly model and interpret the estimated effects and variance components.
It is clear from the example that the error variance and fixed effects are not directly influenced by the Dirichlet process prior. The situation for the sire variance and breeding values is, however, quite different. The posterior densities for the nonparametric Bayes method are more spread out. These differences are to be expected, since the sire variance component and random effects are directly affected by the relaxation of the normal assumption.
It is also well known that the value of the precision parameter M largely influences the posterior distribution of the random effects, the sire variance and the heritability coefficient. The value of M will further determine whether the estimates from the Dirichlet process behave like the standard Bayes estimate or like a nonparametric likelihood estimator. Further research and simulation experiments should be done to determine if the Dirichlet process prior can be used for solving the "undeclared preferential treatment of animals" problem.