Gibbs sampling, adaptive rejection sampling and robustness to prior specification for a mixed linear model

Echantillonnage de Gibbs, echantillonnage de rejet adapte et robustesse relative a l'a priori specifie pour un modele lineaire mixte. On utilise de plus en plus des methodes de Monte-Carlo basees sur des chaines de Markov pour faire des inferences sur des distributions marginales a posteriori des parametres de modeles genetiques quantitatifs. Cet article concerne l'application d'une telle methode, l'echantillonnage de Gibbs, a des inferences bayesiennes sur des parametres d'un modele lineaire mixte normal, quand on impose une contrainte sur les valeurs relatives des composantes de variance. Deux distributions a priori incorporant cette contrainte sont proposees. Pour l'une d'elles, la technique de l'echantillonnage de rejet adapte est employee pour echantillonner dans une distribution conditionnelle a posteriori d'un rapport de variance. Des donnees simulees dans un modele pere equilibre sont utilisees pour comparer differentes mises en ouvre de l'echantillonnage de Gibbs et aussi les inferences basees sur les deux distributions a priori. Les distributions marginales a posteriori des parametres sont donnees comme illustrations. Les exemples numeriques suggerent qu'il n'est necessaire d'eliminer aucun resultat d'iteration et que des inferences similaires sont faites en utilisant l'un ou l'autre des deux a priori.


INTRODUCTION
Parameters such as heritability, and predictors of individual breeding values, depend on the unknown variance components in quantitative genetic models. The wide array of methods available for estimating variance components includes ANOVA, likelihood-based methods and Bayesian methods. A difficulty with ANOVA methods which has concerned many authors is that ANOVA estimates may lie outside the parameter space, for example estimates of heritabilities which are negative or greater than one. The use of such estimates for genetic parameters can lead to very inefficient selection decisions.
Owing to rapid advances in computing technology, methods based on likelihood have become common in animal breeding; in particular REML (Patterson and Thompson, 1971) has been widely used. REML estimators are by definition always within the parameter space, and they are consistent, asymptotically normal and efficient (Harville, 1977). However, interval estimates for variances based on the asymptotic distribution of REML estimators can include negative values (Gianola and Foulley, 1990).
In recent years, Bayesian methods have been applied to variance component estimation in animal breeding (Harville, 1977;Foulley et al, 1987;Cantet et al, 1992;Wang et al, 1993Wang et al, , 1994. For normal models, the joint posterior distribution of the dispersion parameters can be derived, but numerical integration techniques are required to obtain the marginal distributions of functions of interest. Gibbs sampling is a Markov chain Monte-Carlo method which operates by generating samples from a sequence of conditional distributions; under suitable conditions, its equilibrium distribution is the joint posterior distribution. Gilks et al (1996) provide a thorough introduction to the Gibbs sampler. The method is applied to variance component estimation in a general univariate mixed linear model by Wang et al (1993). Also, Wang et al (1994) obtain marginal inferences about fixed and random effects, variance components and functions of them through a scalar version of the Gibbs sampling algorithm. The objective of this paper is to broaden the application of Gibbs sampling in an animal breeding context by considering prior distributions which incorporate a constraint on the relative values of variance components: such a constraint might arise from restricting heritability to values less than one. The next section defines the model to be considered and sets out two possible forms for the prior distribution: the second of these is in terms of one variance and a variance ratio. The following section describes how the techniques of Gibbs sampling and adaptive rejection sampling can be applied to inferences about model parameters under the two specifications. Simulated data from a balanced univariate one-way sire model are used to demonstrate these methods, to compare three implementations of Gibbs sampling and to examine the robustness of the inferences to the choice of prior.

Model
We consider inferences about model parameters for a univariate mixed linear model of the form where y and e are n-vectors representing observations and residual errors, 0 is a p-vector of 'fixed effects', u is a q-vector of 'random effects', X is a known n x p matrix of full column rank and Z is a known n x q matrix. Here e is assumed to have the distribution N n (0, Q e I), independently of fl and u. Also u is taken to be N 9 (0, Q u G) with G a known positive definite matrix.

Restrictions on variance ratios
In animal breeding applications, interest may be in making inferences about ratios of variances or functions thereof, rather than about individual variance components themselves. For example, if u and y represent vectors of sire effects and of phenotypic values recorded on the sires' offspring in unrelated paternal half-sib families, and 'Y denotes the variance ratio or 2/ 0 ,2, then the heritability, h 2 , of the trait is an increasing function of -y given by h! = 4/(1 + -y-1 ). The constraint that the heritability is between 0 and 1 restricts 'Y to lie in the interval (0, ! ) . A method of inference which ignores such restrictions may lead to ridiculous estimates of heritability. More generally, we might consider restricting -Y to an interval (0, g), corresponding to the constraint that 0 < Q! < g Q e, where g is a known constant.
Bayesian formulation with variance parameters Q e and Q! u 2 In the above model, the assumption made about the distribution of u must be supplemented by prior distributions for fl, Q e and or u 2. In specifying their form, we might aim to reflect the animal breeder's prior knowledge while permitting convenient calculations on the resulting posterior distributions. The prior distribution of fl is taken here to be uniform, corresponding to little being known about its value initially. Wang et al (1993,1994) make the same prior assumption about P (in a model with further random effects), and consider taking ae and a to be independent with scaled inverse-x 2 distributions X -2 (v e , (v e k e )-1 ) and X -2 (v u , (v u k u )-1 ); ke and k § ! can then be interpreted as prior expectations of or-2 and a : ;¡ 2 , respectively, and v. and v u as precision parameters analogous to degrees of freedom. This prior assumption may be modified to take account of an upper bound on the variance ratio, so that o,2and Q u are not independent, but have a joint prior density of the form The conditional density of y given !, u and afl for the model given in , so the joint posterior density is given by It is shown in the Appendix that if v u is positive and n + v e exceeds p then the marginal posterior distribution of q is proper and hence this joint distribution is also.
To implement the Gibbs sampling algorithm, we require the conditional posterior distributions of each of (3, u, u 2 and a given the remaining parameters, the socalled full conditional distributions, which are as follows (using the notation [. 1.] for a conditional distribution) (truncated at ga!). Apart from the use here of proper prior distributions for the variances, these are similar to the conditional distributions given in Wang et al (1993). Methods for sampling from these distributions are well known.
An alternative Bayesian formulation with prior independence of 0 ' e 2 and As Wang et al (1993) show for a model with a slightly simpler prior for Q e and o,,2,, the prior specification in terms of these two parameters is a convenient one for applying the method of Gibbs sampling. It is also easily generalized when several traits are recorded. It may not, however, be the most useful specification for eliciting the prior opinions of animal breeders about parameter values. The animal breeder may prefer to think in terms of the heritability of a trait, equivalent for a paternal half-sib model to the variance ratio 7 = a£ lafl defined above. A second function of the two variances must also be considered in order to specify their joint prior distribution, and or may be an appropriate choice because inferences about Q e 2 from previous data are likely to be much more precise than those about 0' u 2. To e specify a joint prior distribution for the sire and residual variance components, it may therefore be convenient to assign priors to Q e and the variance ratio y. We thus consider the alternative parameter vector (p, u, ae, q), but make the same prior assumptions for and u as before, noting that ou becomes -YU2 . We can incorporate an upper bound g on the range of q more naturally than with the previous formulation by using the family of Beta distributions on the interval (0, g): this is a fairly flexible family with probability density of the form If y is taken to be independent of P and a e a priori, and a § assumed to be x-2 (v e , ( Le k e )-1 ) then the joint posterior density is given by The full conditional posterior distributions for fl and u are the same as [4] and !5!, while that for <7! is The full conditional density for y is proportional to This family is not a well-known one, so it is not immediately clear how to sample from it. However, the adaptive rejection sampling technique of Gilks and Wild (1992) provides a reasonably efficient method for sampling from the distribution of ln 7 .
Given these two methods of including the restriction on &OElig;!/ &OElig;;, the question arises of how posterior inferences, especially about q, are affected when the second prior specification is used instead of the first. To answer this, we consider alternative specifications in which the priors for the variance parameters are similar: to match these distributions we choose a and b in [8] to give -y the same upper and lower quartiles under both. Details are given in the Appendix.

Gibbs sampling
Markov chain Monte-Carlo methods, particularly Gibbs sampling, are now widely used for making inferences from posterior distributions of parameters.  review the Gibbs sampler and other Monte-Carlo methods,  illustrate the use of the Gibbs sampler on a range of statistical models, including variance components models. Wang et al (1993,1994) give two versions of the Gibbs sampler for a univariate normal mixed linear model assuming prior independence of the variance components.
The Gibbs sampler algorithm generates samples iteratively from the full conditional distributions of disjoint subsets of the parameters given the current value of the complementary subset.
The Thus, subsets of the parameters are sampled in an arbitrary order and this cycle defines the transition from 0(°) to B( 1 ). Subsequent cycles replace 0( l ) to give further vectors B( 2 ), 0(!) , and so on.
Opinions differ on how the sequence of vectors used in subsequent analyses should be defined. The following three methods of implementation have been proposed, and are compared on simulated data in the next section.
(a) A single long chain: one generates a single run of the chain as practised by Geman and Geman (1984), ie, (vii) repeat (ii)-(vi) m times using updated values and store all the values.
A common modification is to discard vectors from the start of the sequence to allow it to approach equilibrium. The remaining vectors 0( l ), ... , , 0Cm ) approximate simulated values from the posterior distribution of 0, but successive vectors may be correlated and the sequence may be strongly influenced by the starting values.
(b) Equally spaced samples: to reduce serial correlations, one can choose suitable integers k and m, perform a single run of length km, and then form a sample from every kth vector, ie, use o( k ), 0( 2k ), ... , 0(!!) .
(c) Multiple short chains: by contrast,    perform m runs each of length l, with different starting values, forming a sample by using the last iterate from each run.
In support of implementation (c), Gelman and Rubin (1992) argue that, even if using one long run may be more efficient, it may still be important to use several different starting points. It has been common practice to discard a substantial number of initial iterations, and to use implementations (b) and (c) (especially (b), storing only every kth, usually 10th or 20th, iterate). The results of Raftery and Lewis (1992) suggest that in many cases this is rather profligate.
For any individual parameter, the collection of m values can be viewed as a simulated sample from the appropriate marginal distribution. This sample can be used to calculate summaries, such as the marginal posterior mean, or to estimate marginal posterior distributions of parameters or of functions of them using kernel density estimation. We use kernel density estimation below with normal kernels and window width suggested by Silverman (1986).

Adaptive rejection sampling
The adaptive rejection sampling technique introduced by Gilks and Wild (1992) is a method of rejection sampling intended for situations in which evaluation of a univariate density is computationally expensive. It uses a squeezing function (Devroye, 1986), and is adaptive in the sense that the rejection envelope and squeezing function (which are piecewise exponential) are revised as sampling proceeds, and approach the density function. In Gibbs sampling, the full conditional distribution for each parameter changes as the iteration proceeds, so that usually only one value is required from a particular density. Adaptive rejection sampling has the advantage that it does not require an optimization step to locate the mode of the density before a sample can be generated. The method requires that the density is log-concave, that is that the derivative of the logarithm of the density is monotone decreasing. It is shown in the Appendix that the density of y in expression !11! is guaranteed to be log-concave only if a is greater than !q 2 + 1, but that if we transform to 6 = In&dquo;( then the full conditional density of 6 is log-concave if b is at least one. The condition on a is not a reasonable one, since a prior with a maximum at zero must have a no greater than one, for example. However, a value of b less than one implies that the prior density in [8] has a maximum at its upper bound, g (which might correspond to a heritability of one), and it is not unreasonable to exclude such values. Hence we use adaptive rejection sampling with ln 7 .

APPLICATION TO A ONE-WAY CLASSIFICATION
The methods described above were applied to a one-way sire model of the form This is intended to represent observations of a trait on the d half-sib offspring of each of s sires: /3 and the u i are the overall expectation and the ith sire effect, respectively. It is a particular case of the model given in [1] in which p and q equal 1 and s, and X, Z and G become I sd (a vector of Is with length sd), diag(l d , ... , l d ) and Is. The full conditional distributions for {3, u, o,2 and Q u defined in expressions [4]-[7] become N ! I where y,, Y.. and u. denote the vector of family means and the overall means of the yg and the u i .
Four sets of data were generated using this model with parameter values 0 = 0, a2 = 0.975, er! = 0.025 and hence h 2 = 0.1 with 25 families and 20 offspring per sire. ANOVA estimates are summarized in table I: the data sets are ordered by the estimates of heritability, which range from -0.073 to 0.306. The first ANOVA estimate of a 2 is negative, making conventional inference about this parameter difficult.
Analysis using the prior for Q e and Q u u 2 Bayesian analyses of the four data sets are given using the prior distribution in [2] with v! = vu = I, k! = 0.975, ku = 0.025 and g 1 The prior distribution for with v, = v u = 1, k e = 0.975, k u = 0.025 and 9 = 3. The prior distribution for Q e and au is proper and leads to a proper joint posterior distribution (as shown in the Appendix), but the small values chosen for v, and Vu correspond to a very dispersed distribution; this is intended to reveal any problems with convergence of the Gibbs sampler algorithm. The sample data provide little information about Q u u as there are only 25 sires. The algorithm is carried out for the four sets of data with the following implementations, which each provide 1 000 Gibbs samples: (a) a single run of 1000 iterations; (b) taking every (1) 10th, (2) 20th and (3) 30th value in single runs of length 10 000, 20 000 and 30 000, respectively, using only one starting value for the parameter vector; (c) short runs of length (1) 10, (2) 20 and (3) 30, storing the last iterate and replicating this process 1 000 times in each case.
The parameter values used to generate the data were also employed as starting values for all three implementations. No iterates were discarded since the true parameter values were known.
The resulting marginal posterior means and standard deviations for {3, 0 &dquo;;, a2 e q and h 2 for the four data sets are shown in table II. This table reveals little difference between the implementations in the marginal posterior means, except for a slight tendency for standard deviations to be lower using (a) for {3, a u, 2 -y and h 2 Figures 1-4 show the estimated marginal posterior densities for {3, 0 &dquo;;, or2 and h 2 2 using density estimators with normal kernels under the prior specification of this subsection (with implementation (a)) and that considered in the next subsection.
Analysis assuming prior independence of 0 & d q u o ; ; e 2and 7 We now compare the above analyses with those based on the second prior specification, which assumes prior independence of Q e and q and prior density [8] for !y; we use adaptive rejection sampling on the full conditional posterior distribution of 6 = In q, as defined in the Appendix. The Appendix also describes how the values of a and b in the Beta prior for -y are determined from those of v,, v u , k, and k!,.
At each Gibbs sampling iteration, adaptive rejection sampling requires at least two points for the construction of the rejection envelope and squeezing function (Gilks and Wild, 1992). The parameter space for 6 is unbounded on the left, so Gilks and Wild (1992) advise that the smallest of these initial points be chosen where the derivative of the log posterior density is positive. The 15th and 85th percentiles of the sampling density from the previous iteration were tried as initial points, additional points being supplied if the derivatives at these values had the same sign. Three evaluations of the log posterior density of 6 were required on average at each iteration of adaptive rejection sampling.
Convergence of the Gibbs sampler, assessed by monitoring summary statistics for each parameter based on every ten iterations, had clearly occurred before 1 000 iterations. The marginal posterior summaries shown in table II and the density estimates for {3, o, 2, afl and h 2 in figures 1-4 are based on this number of iterations.
The results obtained using the two prior specifications agree very closely, especially those for /3 and ce 2 DISCUSSION Our results from using Gibbs sampling on a balanced one-way sire model with the prior density of expression [2] for 0 & d q u o ; ; and 0 & d q u o ; ; suggest that it is unnecessary to discard all but the 10th, 20th or 30th iterate; doing so is probably quite wasteful. A much larger proportion of the iterates should be used unless storage is an issue. Using 1000 successive iterates seems adequate to make inferences about the parameters of interest, although the slightly lower standard deviations obtained with implementation (a) may reflect the need to discard 'burn-in' iterations. We encountered no difficulties with the convergence of the Gibbs sampling algorithms used. With real data, we would use REML estimates as starting values for the Gibbs iterations and monitor convergence diagnostics to decide how many 'burnin' iterations to discard (Gilks et al, 1996). The prior distribution which assumes u2 and -y to be independent provides a potentially useful alternative to the usual prior in which U2 and Q u are taken to be independent. The adaptive rejection sampling algorithm of Gilks and Wild (1992) is found to work efficiently with the resulting full conditional distribution for In q.
The sensitivity of a Bayesian analysis to the choice of prior hyperparameters and to different functional forms of prior can be investigated. For our simulated data, the use of an alternative form of prior distribution led to very similar marginal posterior inferences on each parameter. The use of the alternative prior distribution for Q e e 2 and -y can be extended to models with further variance components, for example if [1] is generalized to where v is an r-vector of random effects, T is a known n x r matrix and v is taken to be N,(0, Q v H) independently of fl, u and e with H a known matrix. If !, !e and U2 v are taken to be mutually independent a priori, with uniform, x-2 (v e , (veke)-1) and x-2 ( L v, (v v k v )-1 ) distributions, respectively, and q is independent of them with the Beta distribution defined in [8], then the full conditional distributions for !, u, v, !e e and Q v are, respectively, where e denotes a 2/ 0 ,2, while that for q is the same as !11!.