Bayesian inference about dispersion parameters of univariate mixed models with maternal effects: theoretical considerations

Summary - Mixed linear models for maternal effects include fixed and random elements, and dispersion parameters (variances and covariances). In this paper a Bayesian model for inferences about such parameters is presented. The model includes a normal likelihood for the data, a "flat" prior for the fixed effects and a multivariate normal prior for the direct and maternal breeding values. The prior distribution for the genetic variancecovariance components is in the inverted Wishart form and the environmental components follow inverted prior distributions. The kernel of the joint posterior density of the dispersion parameters is derived in closed form. Additional numerical and analytical methods of interest that are suggested to complete a Bayesian analysis include MonteCarlo Integration, maximum entropy fit, asymptotic approximations, and the TierneyKadane approach to marginalization. maternal effect / Bayesian method / dispersion parameter


INTRODUCTION
Mixed linear models for the study of quantitative traits include, in addition to fixed and random effects, the necessary dispersion parameters. Suppose one is interested in making inferences about variance and covariance components. Except in trivial cases, it is impossible to derive the exact sampling distribution of estimators of these parameters (Searle, 1979) so, at best, one has to resort to asymptotic results. Theory (Cramer, 1986) indicates that the joint distribution of maximum likelihood estimators of several parameters is asymptotically normal, and therefore so are their marginal distributions. However, this may not provide an adequate description of the distribution of estimators with finite sample sizes. On the other hand, the Bayesian approach is capable of producing exact joint and marginal posterior distributions for any sample size (Zellner, 1971; Box and Tiao, 1973), which give a full description of the state of uncertainty posterior to data.
In recent years, Bayesian methods have been developed for variance component estimation in animal breeding (Gianola and Fernando, 1986;Macedo and Gianola, 1987;Carriquiry, 1989;Gianola et al 1990a, b). All these studies found analytically intractable joint posterior distributions of (co)variance components, as Broemeling (1985) has also observed. Further marginalization with respect to dispersion parameters seems difficult or impossible by analytical means. However, there are at least 3 other options for the study of marginal posterior distributions: 1), approximations; 2), integration by numerical means; and 3), numerical integration for computing moments followed by a fit of the density using these numerically obtained expectations. Recent advances in computing have encouraged the use of numerical methods in Bayesian inference. For example, after the pioneering work of Kloek and Van Dijk (1978), Monte Carlo integration (Hammersley and Handscomb, 1964;Rubinstein, 1981) has been employed in econometric models (Bauwens, 1984;, seemingly unrelated regressions (Richard and Steel, 1988) and binary responses (Zellner and Rossi, 1984).
Maternal effects are an important source of genetic and environmental variation in mammalian species (Falconer, 1981). Biometrical aspects of the associated theory were first developed by Dickerson (1947), and quantitative genetic models were proposed by Kempthorne (1955), Willham (1963Willham ( , 1972 and Falconer (1965). Evolutionary biologists have also become interested in maternal effects (Cheverud, 1984;Riska et al, 1985;Kirkpatrick and Lande, 1989;Lande and Price, 1989).
There is extensive animal breeding literature dealing with biological aspects and with estimation of maternal effects (eg, Foulley and Lefort, 1978;Willham, 1980;Henderson, 1984Henderson, , 1988. Although there are maternal sources of variation within and among breeds, we are concerned here only with the former sources. The purpose of this expository paper is to present a Bayesian model for inference about variance and covariance components in a mixed linear model describing a trait affected by maternal effects. The formulation is general in the sense that it can be applied to the case where maternal effects are absent. The joint posterior distribution of the dispersion parameters is derived. Numerical methods for integration of dispersion parameters regarded as &dquo;nuisances&dquo; in specific settings are reviewed. Among these, Monte Carlo integration by &dquo;importance sampling&dquo; (Hammersley and Handscomb, 1964;Rubinstein, 1981) is discussed. Also, fitting a &dquo;maximum entropy&dquo; posterior distribution (Jaynes, 1957(Jaynes, , 1979 using moments obtained by numerical means (Mead and Papanicolaou, 1984;Zellner and Highfield, 1988) is considered. Suggestions on some approximations to marginal posterior distributions of the (co)variance components are given. Asymptotic approximations using the Laplace method for integrals  are also described as a means for obtaining approximate posterior moments and marginal densities. Extension of the methods studied here to deal with multiple traits is possible but the algebra is more involved.

THE BAYESIAN MODEL
Model and prior assumptions about location parameters The maternal animal model (Henderson, 1988) considered is: where y is an n x 1 vector of records and X, Z o , Z m and E m are known, fixed, n x p, n x a, n x a and n x d matrices, respectively; without loss of generality, the matrix X is assumed to have full-column rank. The vectors !, a o , a m and Cm are unknown fixed effects, additive direct breeding values, additive maternal breeding values and maternal environmental deviations, respectively. The n x 1 vector e o contains environmental deviations as well as any discrepancy between the &dquo;structure&dquo; of the model (XR+ Z o a o + Z m a m + E m e m ) and the data y. As in Gianola et al (1990b), the vectors p,a o , a m and e m are formally viewed as location parameters of the conditional distribution yl P, a o , a m , e m , but a distinction is made between 13 and the other 3 vectors depending on the state of uncertainty prior to observing data. It is assumed a piiori that P follows,a uniform distribution, so as to reflect vague prior knowledge on this vector. Polygenic inheritance is often assumed for a = [a!, a!]' (Falconer, 1981;Bulmer, 1985) so it is reasonable to postulate a prio 7 i that a follows the multivariate normal distribution: where G is a 2 x 2 matrix with diagonal elements o, Ao 2 and a A 2 & d q u o ; & d q u o ; the variance components for additive direct and maternal genetic effects, respectively, and off diagonal elements QAoA.&dquo;,,, the covariance between additive direct and maternal effects. The a x a positive-definite matrix A has elements equal to Wright's coefficients of additive relationship or twice Melecot's coefficients of co-ancestry (Willham, 1963). Maternal environmental deviations, presumably caused by the joint action of many factors having relatively small effects are also assumed to be normally, independently distributed (Quaas and Pollak, 1980;Henderson, 1988) as: where u5! is the maternal environmental variance. It is assumed that a priori p, a and Cm are mutually independent. For the vector y, it will be assumed that: where (1'!o is the variance of the direct environmental effects. It should be noted that [1-4J complete the specification of the classical mixed linear model (Henderson, 1984), but in the latter, distributions [2] and [3] have a frequentist interpretation.
A simplifying assumption made in this model, for analytical reasons, is that the direct and maternal environmental effects are uncorrelated.

Prior assumptions about variance parameters
Variance and covariance components, the main focus of this study, appear in the distributions of a, e m and e o . Often these components are unknown. In the Bayesian approach, a joint prior distribution must be specified for these, so as to reflect uncertainty prior to observing y. &dquo;Flat&dquo; prior distributions, although leading to inferences that are equivalent to those obtained from likelihood in certain settings (Harville, 1974(Harville, , 1977 can cause problems in others (Lindley and Smith, 1972;Thompson, 1980;Gianola et al, 1990b). In this study, informative priors of the type of proper conjugate distributions (Raiffa and Schlaiffer, 1961) are used. A prior distribution is said to be conjugate if the posterior distribution is also in the same family For example, a normal prior combined with a normal likelihood produces a normal posterior (Zellner, 1971;Box and Tiao, 1973). However, as shown later for the variance-covariance structure under consideration, the posterior distribution of the dispersion parameters is not of the same type as their joint prior distribution. This was also found by Macedo and Gianola (1987) and by Gianola et al (1990b) who studied a mixed linear model with several variance components employing normal-gamma conjugate prior distributions.
An inverted-Wishart distribution (Zellner, 1971;Anderson, 1984; will be used for G, with density: where G * = !c9Gh. The 2 x 2 matrix G h of &dquo;hyperparameters&dquo;, interpretable as prior values of the dispersion parameters, has diagonal elements s2 0 and s 2 M , and off-diagonal elements 5!!,.. The integer !a9 is analogous to degrees of freedom and reflects the &dquo;degree of belief&dquo; on G h (Chen, 1979). Choosing hyperparameter values may be difficult in many applications. Gianola et al (1990b) suggested fitting the distribution to past estimates of the (co)variance components by eg a method of moments fit. For traits such as birth and weaning weight in cattle there is a considerable number of estimates of the necessary (co)variance components in the literature (Cantet et al, 1988). Clearly, the value of G h influences posterior inferences unless the prior distribution is overwhelmed by the likelihood function (Box and Tiao, 1973).
Similarly, as in Hoeschele et al (1987) the inverted x 2 distribution (a particular case of the inverted Wishart distribution) is suggested for the environmental variance components, and the densities are: The prior variances s2 m and s 20 are the scalar counterparts of G n , and no and nm are the corresponding degrees of belief. The marginal distribution of any diagonal element of a Wishart random matrix is X 2 (Anderson, 1984). Likewise, the marginal distribution of the diagonal of an inverted-Wishart random matrix is inverted X Z (Zellner, 1971). Note that the 2 variances in [6] and [7] cannot be arranged in matrix form similar to the additive (co)variance components in G to obtain an inverted Wishart density, unless no = n, n . Setting ng I n o and n m to zero makes the prior distributions for all (co)variance components &dquo;uniformative&dquo;, in the sense of Zellner (1971).

Joint posterior density of all parameters
The posterior density of all parameters (Zellner, 1971;Box and  Posterior density of the (co)variance components To obtain the marginal posterior distribution of G, u5! and O'!o, 0 must be integrated out of (10). This can be accomplished noting that the second exponential term in [10] is the kernel of the (p + 2a + d)-variate normal distribution and the variance-covariance matrix is non-singular because X has full-column rank.
The remaining terms in [10] do not depend on 0. Therefore, with R o being the range of 0, using properties of the normal distribution we have: The marginal posterior distribution of all (co)variance components then is: The structure of [11] makes it difficult or impossible to obtain by analytical means the marginal posterior distribution of G, o,2 . E or or2 E,,,. Therefore, in order to make marginal posterior inferences about the elements of G or the environmental variances, approximations or numerical integration must be used. The latter may give accurate estimates of posterior moments, but in multiparameter situations computations can be prohibitive.
There are 2 basic approaches to numerical integration in Bayesian analysis. The first one is based on classical methods such as quadrature Smith, 1982, 1988;Wright, 1986). Increased power of computers has made Monte Carlo numerical integration (MCI), the second approach, feasible in posterior inferences in econometric models (Kloek and Van Dijk, 1978;Bauwens, 1984;Bauwens and Richard, 1985; and in other models (Zellner and Rossi, 1984;Geweke, 1988;Richard and Steel, 1988). In MCI the error is inversely proportional to N l/2 , where N is the number of points where the integrand is evaluated (Hammersley and Handscomb, 1964;Rubinstein, 1981). Even though this &dquo;convergence&dquo; of the error to zero is not rapid, neither the dimensionality of the integration region nor the degree of smoothness of the function evaluated enter into the determination of the error (Haber, 1970). This suggests that as the number of dimensions of integration increases the advantage of MCI over classical methods should also increase. A brief description of MCI in the context of maternal effects models is discussed next.  (Kloek and Van Dijk, 1978;Bauwens, 1984, Zellner andRossi, 1984;Richard and Steel, 1988) is called &dquo;importance sam-pling&dquo; (Hammersley and Handscomb, 1964;Rubinstein, 1981). Let I(r) be a known probability density function defined on the space of T; I(r) is called the importance sampling function. Following Kloek and Van Dijk (1978)  where the expectation is taken with respect to the importance density I(r).
Using a standard Monte Carlo procedure (Hammersley and Handscomb, 1964;Rubinstein, 1981), values of r are drawn at random from the distribution with density I(r). Then the function M(r) is evaluated for each drawn value of r, r j (i = 1, ... , m) say. For sufficiently large m: The critical point is the choice of the density function 7(F). The closer I(r) is to p(r ! y, H), the smaller is the variance of M(r), and the number of drawings needed to obtain a certain accuracy (Hammerley and Handscomb, 1964;Rubinstein, 1981).
Another important requirement is that random drawings of r should be relatively simple to obtain from 7(F) (Kloek and Van Dijk, 1978;Bauwens, 1984). For location parameters, the multivariate normal, multivariate and matric-variate t and poly-t distributions have been used as importance functions (Kloek and Van Dijk, 1978;Bauwens, 1984;Bauwens and Richard, 1985;Richard and Steel, 1988;. Bauwens (1984) developed an algorithm for obtaining random samples from the inverted Wishart distribution. There are several problems yet to be solved and the procedure is still experimental (Richard and Steel, 1988). However, results obtained so far make MCI by importance sampling promising (Bauwens, 1984;Zellner and Rossi, 1984;Richard and Steel, 1988; 7 i (r) = prior density of G ((5! times k l , the integration constant), I 2 (r) = prior density of U2 E ((6! times k 2 , the integration constant), 1 3 (r) = prior density of u5! [7] times k 3 , the integration constant).

Then:
where k o is the constant of integration of (11!. Evaluating E(r ! y, H) then entails the following steps: a) draw at random the elements of r from distributions with densities h (r) (inverted Wishart), 1 2 (r) (inverted x 2 ) and I 3 (r) (inverted X Z ). This can be done using, for example, the algorithm of Bauwens (1984 (Misztal, 1990) can be employed to advantage in the calculation of the determinant. This procedure can be used to calculate any function of r. For example, the posterior variance-covariance matrix is: so the additional calculation required would be evaluating M'(T) = rr'M o .

MAXIMUM ENTROPY FIT OF MARGINAL POSTERIOR DENSITIES
A full Bayesian analysis requires finding the marginal posterior distribution of each of the (co)variance components. Probability statements and highest posterior density intervals are obtained from these distributions (Zellner, 1971;Box and Tiao, 1973). Marginal posterior densities can be obtained using the Monte Carlo method (Kloek and Van Dijk, 1978) but it is computationally expensive. An alternative is to compute by MCI some moments (for instance, the first 4) of each parameter, and then fit a function that approximates the necessary marginal distribution. A method that gives a reasonable fit, &dquo;Maximum entropy&dquo; (ME), has been used by Mead and Papanicolaou (1984) and Zellner and Highfield (1988). Choosing the ME distribution means assuming the &dquo;least&dquo; possible (Jaynes, 1979), ie, using information one has but not using what one does not have. An ME fit based on the first 4 moments implies constructing a distribution that does not use information beyond that conveyed by these moments. Jaynes (1957) set the basis for what is known as the &dquo;ME formalism&dquo; and found a role for this to play in Bayesian statistics.
The entropy (W) of a continuous distribution with density p(x) is defined (Shannon, 1948;Jaynes, 1957Jaynes, , 1979 to be: The ME distribution is obtained from the density that maximizes [20] subject to t'he conditions: r where p o = 1 (by definition of a proper density function) and JLi (i = 1, ... , 4) are the first 4 moments of the distribution of x. Zellner and Highfield (1988) expressed the function to be maximized as the Lagrangian: where the l i (i = 0,...,4) are Lagrange multipliers and I = !lo, ll, d2, d3, l4!'. Note that [22] involves integrals whose integrands depend on the unknown function p(x), 'and on functions of it (log p(x)). Rewrite (Hildebrand, 1972)  Hence, the condition for a stationary point is: plus the 5 constraints given in (21!. From (26!, the density of the ME distribution of x has the form: To specify the ME distribution completely 1 must be found. Zellner and Highfield (1988)  Expanding G i (l) with a Taylor series about 1 0 , a trial value for 1, and retaining the linear terms leads to: These derivatives are simply moments (with negative sign) of the maximum entropy distribution.
Putting in [29] and setting this equal to [28] one obtains the linear system in 1: This system can be solved for h ( j = 0,1, ... , 4) to obtain a new set of trial values and, thus, an iteration is established. Defining and observing that 0 <_ i + j <_ 8, the above system can be written in matrix notation as: This system is solved for 8 1 ' l to obtain IN = Ilt-11 + 1 ft] , the vector of new trial values. Iteration continues until 5 becomes appropriately small. Zellner and Highfield (1988) (1987), Gianola et al (1990b) and Macedo and Gianola (1987) showed how this could be done with a ' simpte algorithm based on first derivatives. Additional algorithms can be constructed using second derivatives, and the necessary expression are given in the Appen.dix. The solutions can be viewed as weighted averages of REML &dquo;estimators&dquo; of dispersion parameters and of the hyperparameters G h , sE o and si m' Let the modal values so obtained be 1i, !2 and !2 or i', in compact. ' .°& 2 E. and &2 E 'n, or r, in compact.
-Consider approximations to the marginal density of G because this matrix contains the parameters of primary interests. One can write: , where p(u5!, u5! y, H) is the posterior density of u5! , u5! obtained after integrating G out of [11]. It seems impossible to carry out this integration analytically. Following ideas in Gianola and Fernando (1986), we propose as first approximation: It would be better to use the modal values of p(u5! , u 5! y, H) rather than (TJ.:m and 3!, but finding this distribution does not seem feasible. Using [32] in [11] one obtains: It should be noted that now 6 = f (G, &' E &dquo;&dquo; a2 E !) and t = h(G, a2 E &dquo;&dquo; 1?2 E ).
Then, the MCI method can be used to compute moments of (33J. The additional degree of marginalization with respect to [11] achieved in this approximation may be small, but savings in computing accrue because drawing values of uk m and o-E o from I 2 (r) and I 3 (r), respectively, is no longer necessary.
In the second approximation, we write the expression in the exponent of [33] as: In the preceding, replace and using the preceding developments in [33] we write, after neglecting IW I W+EI-1/ 2 This density is in the inverted Wishart form, with parameters n. = ng + a and G*, provided G* is positive definite. If not, one can &dquo;bend&dquo; this matrix following the ideas of Hayes and Hill (1981 A third approximation would be writing [34] as so we would have an inverted Wishart distribution with hyperparameters n9 = n 9 + a and G. If G is obtained with an algorithm that guarantees positive semidefiniteness such as EM (Dempster et al, 1977), this would circumvent the potential problem posed by G* in (34!. The fourth approximation involves the matrix of second derivatives (C, say) of the logarithm of [11] with respect to the unique elements of G, u5! and o, 2 E o and then evaluating C at 1i, %5! and %5!. The second derivatives are in the Appendi!. Invoking the asymptotic normality property of posterior distributions (Zellner, 1971), one would approximately have : where it is assumed that the matrix -C = f(l#, %5! , %5!) has full rank. The method of Tierney and Kadane (1986) continues as follows. Let r&dquo;! be the posterior mode (which is also the maximum of L), L'(r) and L&dquo;(r) be the first and second derivatives of L with respect toT and let ( 1 &dquo;2 = -1/ LI/ (r m ) ' Using a Taylor series expansion for nL(r) about r&dquo; L we have: Noting that and retaining terms up to second-order, the expansion becomes: Using this, the denominator in [45] can be approximated as: In the same way, if r* is the maximum of L * and Q *2 = -1/ L * &dquo; (r : n) An interesting aspect of this approximation is that only first and second order derivatives are needed, and this is less tedious than other approximations suggested by eg, Mosteller and Wallace (1964) and Lindley (1980), requiring evaluation of third derivatives. The posterior variance can also be approximated by finding the posterior mean of g 2 . The only modification needed is to define L * as The multiparameter case When r is a vector, as in this paper, [48] generalizes to: where r! and I'm maximize L * and L, respectively, and H * and H are minus the inverse matrices of second derivatives of L * and L with respect to r, evaluated at r! and I' m , respectively.

Marginal posterior densities
The method can also be used to approximate marginal posterior densities of individual parameters of T. Partition r' as !T1, r ,2 j. If the order of r is p, say, then T 2 is of order p -1 (4 in our case). The marginal posterior density of T l is: where 1 r(r l, r 2 ) is the joint posterior density of r. From preceding developments, the denominator in [51] is expressible as: where I 'm is the mode of the posterior distribution of T, and L&dquo;(r .. ) is the matrix of second derivatives of L with respect to 7 r. Then: where l(r m ) is the log-likelihood evaluated at r m . Hence, [53] becomes: Consider now the numerator of (51!, and write it as: is a function where T 1 is fixed. Define r 2m (r l ) to be the (p -1) x 1 vector that maximizes this function. This maximizer can be found employing the derivatives in the appendix. Then, similar to (53!, we can write. where B&dquo;(r l , r 2m (r¡)) is the (p &mdash; 1) x (p -1) matrix of second derivatives of B with respect to r 2 .
Taking the ratio between [56] and [54] the posterior density of T 1 in [51] is approximately: The moments of the posterior distribution of I'i, must be found numerically . employing the methods discussed in earlier sections.

Remarks
It has been shown that the method of  has less error than the usual normal approximation centered at the posterior mode with the order of approximation being O(n-2 ). However, it also requires that the functions to be expanded be either unimodal or dominated by a single mode, so sample size must be sufficiently large for this to hold.
The requirement that g(T) be a positive function is restrictive. Tierney and Kadane (1986) pointed out that for the approximation to be accurate for a function g taking both positive and negative values, the posterior distribution of g must be concentrated almost entirely on one side of the origin. However, Tierney et al (1988) extended the method to apply to expectations and variances of non-positive functions. To obtain a second-order approximation to E[g(r)], they used the method of  to approximate the moment generating function E{exp [!(F)]}, whose integrand is positive, and then the result was differentiated.
Another difficulty arises in the approximation [49] to the posterior variance of g(r). Unless computations are made with sufficient precision, [49] can have a large error or turn up negative. Similar problems can arise in the computations of posterior covariance, ie as a covariance matrix computed from [58] may not be positive semi-definite.

CONCLUSION
This paper presents theory and techniques for carrying out a Bayesian analysis of dispersion parameters in a univariate model for maternal effects. Hower, implementation of the methods suggested here poses difficulties to quantitative geneticists interested in analysis of large data sets. The development of feasible computing techniques is a challenge to researchers in the area of application of numerical methods to animal breeding.
Research is underway to identify more promising algorithms to approximate marginal moments of posterior distributions, a non-trivial problem as new techniques are developed and there is little indication on the choice to make for estimating (co)variance components under &dquo;non-exchangeability&dquo; of model [1]. Recently   and Gelfand et al (1990) described the Gibbs sampler, a potential competitor of the methods presented here.

APPENDIX
First and second derivatives of the log-posterior of all (co)variance components The log of (11! is: Let M' = (0 ( I 2a 0] be a 2a x (p + 2a + d) matrix such that M'i = a. In the same way, N' = [0 [ 0 [ Id] be a d x (p + 2a + d) matrix such that N'i = The 0 represents a matrix of appropriate order with all elements equal to zero.
To simplify the derivation, we will decompose (A.1! into components, take derivatives with respect to an element of G(g ij say), (Tkm or !Eo, and collect results to obtain the desired expressions. !, Derivatives of (y'y -0 W'y) The term y'y does not depend on r. The other term is 8'W't/ = y'WCW'y, so that: where E ij is a 2 x 2 matrix with all elements equal to zero, with the exception of a one in position i,j. Note that if e i (e j ) is a 2 x 1 vector with a 1 in the i-th (j-th) position E2! = eie!. The notation D(M l , ... , M,l stands for a block diagonal