Variance estimation from integrated likelihoods (VEIL)

A method of variance component estimation in univariate mixed linear models based on the exact or approximate posterior distributions of the variance components is presented. From these distributions, posterior means, modes and variances can be calculated exactly, via numerical methods, or approximately, using inverted X 2 distributions. Although particular attention is given to a Bayesian analysis with $at priors, informative prior distributions can be used without great additional difficulty. Implementation of the exact analysis can be taxing from a numerical point of view, but the computational requirements of the approximate method are not greater than those of REML. variance components / maximum likelihood / Bayesian methods Résumé-Estimation des composantes de la variance à partir de vraisemblances intégrées. Cet article présente une méthode d'estimation des composantes de la variance, pour des modèles linéaires mixtes univariés, qui est fondée sur des expressions exactes et approchées de la distribution a posteriori de ces composantes. On peut calculer les espérances, modes et variances de ces distributions, de façon exacte par des méthodes numériques ou de façon approchée en utilisant des lois de X 2 inverses. Bien que l'attention soit centrée sur une analyse bayésienne faisant intervenir des a priori uniformes, le recours à des lois a priori informatives est envisageable sans difficulté notoire additionnelle. La mise en &oelig;uvre de l'analyse exacte peut être contraignante d'un point de vue numérique, mais la méthode approchée ne demande pas plus de calculs que la résolution du REML. composantes de la variance / maximum de vraisemblance / méthodes bayésiennes.


INTRODUCTION
noted that likelihood-based methods of variance component estimation have gained favor among quantitative geneticists. In particular, the procedure known as "restricted maximum likelihood", or REML for short (Thompson, 1962;Patterson and Thompson, 1971), is now widely regarded in animal breeding as the method of choice. The approach is based on maximizing with respect to the variances only the part of the likelihood function (normality is assumed) that does not depend on fixed effects. In so doing, Patterson and Thompson (1971) obtained an estimator that &dquo;accounts for the degrees of freedom lost in the estimation of fixed effects&dquo; which, according to their reasoning, is not accomplished by full maximum likelihood (ML'. With balanced data, the REML estimating equations reduce to those used in estimation by analysis of variance (ANOVA) so that if the ANOVA estimates are within the parameter space, these are REML as well.
The basic idea underlying REML is obtaining a likelihood-based estimator (thus retaining the usual asymptotic properties) while reducing the bias of ML. However, REML estimators are biased as well and because these are constructed from a partial likelihood, one should expect larger variance of estimates than with ML. A more reasonable comparison is in terms of a loss function such as mean-squared error, but Monte Carlo studies (for example, Swallow and Monahan, 1984) have given ambiguous results. In fact, in a purely fixed model with a single location parameter and 1 variance component, the ML estimator has smaller mean-squared error than REML throughout the whole parameter space, ie, REML is inadmissible under quadratic loss for this model. Lacking evidence to the contrary, a similar sampling behavior can be expected at least in some other models. Searle (1988) puts it concisely : &dquo;It is difficult to be anything but inconclusive about which of ML and REML is the pneferred method&dquo;.
The 2 procedures have the same asymptotic properties (although their asymptotic variance is different) and unknown small sample distributions, a feature shared by all sampling theory estimators of variance components. Although ML and REML estimates are defined inside the appropriate parameter space, their asymptotic distribution (normal) can generate interval estimates that include negative values. This potentially embarrassing phenomenon is often overlooked in discussions of likelihood based methods. Harville (1974) showed that REML corresponds to the mode of the marginal posterior density of the variance components in a Bayes analysis with flat priors for fixed effects and variance components. The fixed effects (0) are viewed in REML as nuisance parameters and, therefore are eliminated from the full likelihood by integration so that the restricted likelihood is a marginal likelihood (Dawid, 1980), seemingly a more appropriate term. This likelihood does take into account the error incurred in estimating ,0 because it can be viewed as a weighted average of likelihoods of variance components evaluated at all possible values of ,Q, the weight function being the posterior density of the fixed effects (proportional to the marginal likelihood of ,0 in this context).  gave a Bayesian justification of REML when estimating breeding values with unknown variance components. The same, by virtue of the symmetry of the Bayesian analysis, applies to estimation of fixed effects.
There are 2 aspects of REML that have not received sufficient discussion. First, the method produces joint modes rather than marginal modes. If the loss function is quadratic, the optimum Bayes estimator is the posterior mean, and marginal modes provide better approximations to this than joint modes (O'Hagan, 1976). Second, in some problems not all variance parameters have equal importance. For example, suppose that there is interest in making inferences about the amount of additive genetic variance in a population and that the statistical description of the problem requires a model that, in addition to fixed effects, includes random herd, additive, dominance, permanent environmental and temporary effects. In this situation, the marginal or restricted likelihood of the variance components involves 5 dimensions, 1 for each of the variances, yet all components other than the additive one should be regarded as nuisance parameters. Carrying the logic of Patterson and Thompson (1971) 1 step further, REML would not take into account the error incurred in estimating the nuisance variances and, therefore, only the part of the likelihood that is a function of the additive genetic variance should be maximized. Construction of this likelihood employing classical statistical arguments seems impossible.
The objective of this note is to present a new method of variance component estimation in univariate mixed linear models based on the principle of finding the marginal posterior density of each of the variances. From this, means, modes, variances and higher order moments of the distribution of variance components can be calculated exactly, using numerical methods, or analytically via approximations. Particular attention is given to the Bayesian analysis with flat priors for fixed effects and variance components.

Model
Consider the mixed linear model: where : y: data vector; X, Z i : known incidence matrices. For i = c, Z i = I, an identity matrix of appropriate order; ( 3: p x 1 vector of uniquely defined fixed effects (X has full-column rank); u i : q i x 1 random vector distributed as N(O, Ai!2), where A i is a known square, positive-definite matrix and Q2 (i = 1, 2, ... , c) is a variance component. It is assumed that all u i and Uj are mutually independent. For i = c, A! is often taken to be an identity matrix of appropriate order; this will be assumed in the present paper.

The marginal distribution:
gives rise to the &dquo;full&dquo; likelihood function, from which ML estimators of P and of each of the variance components can be obtained.

Bayesian elements
In the Bayesian treatment of model (1), prior subjective uncertainty about P is often described via the &dquo;flat&dquo; prior density: and the normal distributions assigned independently to the u i 's are viewed as prior probability distributions as well. To complete the model, assignment of prior probability distributions to the variance components is necessary. In this paper, and for the sole purpose of showing how a higher degree of marginalization can be achieved than in REML, independent &dquo;flat&dquo; prior distributions are assigned to each of the variances !2. With this formulation, the joint posterior density of P and of variances becomes: so that there is no information about fixed effects and variance parameters beyond that contained in the likelihood function. If (4) is maximized jointly with respect to (3, u!, !z ! ! ! ! ! u § one obtains the maximum likelihood estimates of all parameters. If (4) is integrated with respect to p, the marginal (restricted) likelihood function of the variance components is obtained (Harville, 1974). This contains no information about !3 and maximization of this marginal likelihood with respect to the variances gives restricted maximum likelihood estimates of these parameters.

REPRESENTATION OF THE MARGINAL (RESTRICTED)
LIKELIHOOD As shown by Gianola et al (1990aGianola et al ( , 1990b, integration of (4) with respect to / 3 gives: where: and is the solution to Henderson's mixed model equations. Although (5) is a posterior density, it is strictly proportional to the marginal likelihood of the variance components.

POSTERIOR DENSITY OF THE RESIDUAL VARIANCE AND OF RATIOS OF VARIANCE COMPONENTS
Consider the one-to-one reparameterization from o, 2 ,0-22 ,...,0,2 to and Because maximum likelihood estimates are invariant under one-to-one reparameterizations, maximization of (5) with respect to the 0 -2 or to the a-parameters gives the same estimates. However, and arguing from a likelihood veiwpoint, it is unclear how (5) can be factorized into separate components for each of the old or new parameters. Hence, estimates of a particular uf or a i from (5) are obtained in the presence of the remaining or 2 1 or a's. To the extent that these are viewed as nuisance parameters, REML would not take into account the &dquo;degrees of freedom lost&dquo; in the estimation of such parameters.
In the Bayesian approach parameters are viewed as random variables, so a change from a a 2 to an a-parameterization must be made using theory of transformation of random variables. The absolute value of the determinant of the Jacobian matrix of the transformations indicated in (6a) and (6b) is: Using this in (5) gives as joint posterior density of the a's: The range of the a's depends on the model in question. The a! parameter is always larger than 0. In an &dquo;animal&dquo; model with 2 variance components, the ratio between the environmental and the additive genetic variances can vary between 0 and infinity. On the other hand, in a &dquo;sire&dquo; model, the corresponding ratio varies between 3 and infinity.

POSTERIOR DENSITY OF THE RATIOS OF VARIANCE COMPONENTS
With the above parameterization, a! can be integrated out analytically using inverted-gamma integrals (Box and Tiao, 1973) to obtain: Although the joint posterior density (8) is not in any obviously recognizable form, the components of the modal vector, or a-values maximizing (8) can be calculated. Let the logarithm of (8) be L, so that: where: The mixed model &dquo;residual&dquo; sum of squares B(.) depends on the variance ratios through the solution vector b. Following Macedo and Gianola (1987) and suppressing the dependence of B(.) on the a's in the notation, we have for i = 1,2,...,c-1: Setting (10) to zero and rearranging: with: Above: is a block-diagonal matrix, and Noting the special form of P i one obtains: where C ii is defined below. Using (12a) and (12b) in (11) and rearranging gives, for i = 1,2,...,c -1: Above, and Ci i is the partition of (W'W+E)-1 corresponding to the u i effects. Note that q = > 4 and n > !+2c are required. The expression in (13) can be interpreted as the ratio between an &dquo;estimator&dquo; of o,2 and an &dquo;estimator&dquo; of !2. Because (13) is not explicit in the a's, the expression must be iterated upon, thus generating a sequence of successive approximations. For example, starting with a!O] (i = 1, 2, ... , c&mdash;1) one can solve the mixed model equations, evaluate the right hand-side of (13), obtain a new set of a's and repeat the calculations until convergence.
In addition to first order algorithms such as the one suggested above, computing strategies based on second derivatives are possible. These are tedious and are not presented here. The important point is that the components of the mode of the joint posterior density of a l , a 2 , ... , a!_1 can be obtained without great conceptual di!culty. Hereafter, these components will be denoted as 5i,Q'2,5e-i. Contrary to REML, account is being taken here of the &dquo;degree of freedom&dquo; lost in estimating a! (a! in the original parameterization).

Residual component
Consider density (7) and write it as: The first conditional density in (14) is obtained from (7) by regarding a c as a variable and the remaining a's as constants. Hence: This is exactly the kernel of the density of an inverted x 2 random variable with parameters: and as before. The complete density (Zellner, 1971) is: and the mean, mode and variance of this distribution can be shown by standard techniques to be: ' provided that n &mdash; p &mdash; 2c > 4. From (14), the marginal density of a, is: This is a weighted average of an infinite number of inverted x 2 densities as in (16), the weight function being the posterior density (8). Whereas the marginal density of a c cannot be written explicitly, its mean and variance can be calculated by numerical means. For example, note that: because the expectation is calculated with respect to the distribution a l , a 2 , ..., -; a!-1 ! y. This is the Bayes estimator minimizing expected posterior quadratic loss. Evaluating the expectation in (19) requires solving Henderson's mixed model equations a large, technically an infinite, number of times. Monte-Carlo (Kloek and van Dijk, 1978) or numerical (Smith et al, 1987) integration procedures are needed to complete the calculation of (19).

Other components
Now consider the parameterization: The Jacobian matrix of the transformation from a parameterization in terms of all a-components to one in terms of !2 and the variance ratios as indicated above is a i . The joint posterior density of uf and a l , a 2 , ... , a!_1 is then from (7): As in the preceding section, given the ratios of variance components, and provided that each of the variances !2 takes values between 0 and oo, one obtains: where: Again, (22) is the density of an inverted X 2 random variable with the features: Using the argument of (18), the marginal density of a? is a weighted average of an infinite number of inverted X 2 densities, and the posterior density of the ratios of variance components, (8), is the weight function. The marginal posterior density cannot be written in explicit form, but its moments can be evaluated numerically. For example: where the expectation is taken with respect to the distribution c!l, a 2 , ... , a!_1 [ y.
As discussed earlier, carrying out this calculation requires numerical integration or Monte Carlo techniques.

APPROXIMATE MARGINAL DISTRIBUTION OF EACH OF THE VARIANCE COMPONENTS
Because of the numerical difficulties arising in the marginal analysis of the variance components, it would be desirable to develop approximations that permit an analytical treatment of the resulting distributions. If the posterior density of the ratios of variance components is peaked or symmetrical about its mode, one can resort to the well known approximations (Box and Tiao, 1973; and where the a's are the components of the mode of the posterior distribution of the variance ratios, with density as in (8). The marginal densities of the variance components can be approximated using densities (16) and (22), but with the parameters now being: where 0 is the solution to Henderson's mixed model equations evaluated at S i ,S 2 )&dquo;-) S c -i-Because the densities are in inverted x 2 form, one can readily obtain their moments and construct point estimates of the variance components in a Bayesian decision-theoretical context. For example, for a quadratic loss function, the estimator that minimizes expected quadratic loss is the posterior mean, or: and Estimators minimizing &dquo;all or none&dquo; posterior loss are: and The posterior variances can be approximated by evaluating expressions S ' in (17c) and sf in (23c) at the a values. In fact, all the approximate marginal distributions can be depicted graphically by plotting densities (16) and (22), evaluated at the a values, thus giving a full (approximate) solution to the problem of inferences regarding variance components in a univariate mixed linear model. Harville (1990) stated: &dquo;A more extensive use of Bayesian ideas by animal breeders and other practitioners is desirable and is more feasible from a computational standpoint than commonly thought... The Bayesian approach can be used to devise prediction procedures that are more sensiblefrom both a Bayesian and frequentist perspective -than those in current use&dquo;.

DISCUSSION
In this paper we have used the Bayesian mechanism of integrating nuisance parameters to construct marginal likelihoods from which marginal inferences about each of the variance components can be made. In this respect, the method advanced in this paper goes beyond REML in degree of marginalization. In REML, the fixed effects are viewed as nuisance parameters and are integrated out of the likelihood (Harville, 1974) so inferences about variance components are carried out jointly.
In the present paper fixed effects and other variance components also regarded as nuisances are integrated out so that inferences about individual variances of interest can be completed after taking into account, exactly or approximately, the error due to not knowing all nuisance parameters. Hence, practitioners that accept R.EML in terms of the argument advanced by Patterson and Thompson (1971), ie, taking into account degrees of freedom &dquo;lost&dquo;, should also feel comfortable with our approach because additional degrees of freedom (stemming from the nuisance variance components) are also taken into account.
When fixed effects are viewed as nuisance parameters, it is fortunate that the likelihood can be separated into a component that depends on the fixed effects plus an &dquo;error contrast&dquo; part (Patterson and Thompson, 1971). However, in likelihood inference it seems impossible to eliminate nuisance parameters other than by ad hoc methods or without recourse to Bayesian ideas. A possible method for dealing with nuisance variances would be treating all random effects other than that of interest as fixed, and then estimating the appropriate component by some translation invariant procedure such as REML. This suggests that the distribution of the resulting statistic would not depend on parameters of the distribution of the nuisance random effects. On the other hand, the variance of interest would be estimated jointly with the residual variance; in most instances this should cause little difficulty because of the large amount of information available about the residual component. It seems intriguing to examine the sampling properties of this method.
An interesting question is the extent to which the proposed method differs from REML. It is known that ML and REML can differ appreciably when the number of fixed effects is large relative to the number of observations. Likewise, the difference between estimates obtained with the new method and REML should be a function of the number of variance components in the model and of the amount of information about such components contained in the sample. For the sake of brevity, we will refer to the new method as &dquo;VEIL&dquo; (standing for &dquo;variance estimation from integrated likelihoods&dquo;).
The finite sampling properties such as bias, variance and mean squared error of VEIL are unknown, but the same holds for REML and ML, except for some trivial models. Differences among methods depend on data structures and parameter values, and can only be assessed via simulation experiments. When the approximate analysis described in this paper is used, the goodness of VEIL will depend on the extent to which the density Q'i,a2)-&dquo;)Q'c-i ! y is peaked; this assumption surely does not hold when there is limited information in the data about the variance ratios. In this case, the exact approach is strongly recommended over the approximate one. With respect to large sample properties, the work of Cox (1975) on partial likelihoods suggests that the properties of maximum likelihood estimates apply to maximum marginal likelihood estimates as well. A good illustration is precisely REML relative to ML. Hence, the large sample properties of estimates obtained from the highly marginalized likelihoods presented here should be the same as those of REML although, of course, the parameters of the asymptotic distribution must differ. For example, the asymptotic variance-covariance matrix of ML estimates of variance components is different from that of REML estimates.
From a Bayesian viewpoint, the method described in this paper is a marked improvement over ML and REML provided the objective of the analysis is making marginal inferences about individual components of variance, as the other parameters are treated as nuisances. As stated earlier, the marginal densities of the variance components are approximately inverted X 2 . This allows to generate a complete posterior distribution from which moments can be calculated as needed, and probability statements can be obtained at least by numerical procedures. The inverted x 2 distribution is defined only over the positive part of the real line so the embarrassing situations caused by encountering point estimates that fall outside of the parameter space, or negative values of the variances yielding a positive likelihood (Harville and Callanan, 1990) are avoided.
It is interesting to observe that the (approximate) marginal posterior densities of the variance components appear in the inverted x2 form. In a fixed model, the marginal posterior distribution of the residual variance is exactly inverted X Z and its modal value is the usual unbiased estimator (identical to REML in this case). This implies that this distribution is suitable for specifying marginal prior knowledge about variance parameters as done, for example, by Lindley and Smith (1972) and H6schele et al (1987). For instance, suppose that the prior distribution of Q2 is taken as inverted x2 with parameters v o and so. Then the marginal posterior density of Q2 can be written as: Taking into account that the integrated likelihood is approximately proportional to (22) when the a's are replaced by the a's, (29) can be put as: where: and Hence, the marginal posterior density of component a? remains in the inverted X 2 form. This &dquo;conjugate&dquo; property of the inverted x2 density can be used to advantage when computing the marginal density of a variance component from a large data set. If the data can be partitioned into independent pieces, repeated use of (30) with parameters updated as in (31a) and (31b) will lead to the (approximate) marginal posterior density of the variance components. Each of these pieces of data should have a relatively large amount of information about the variance ratios so that replacing the variance ratios a i by the modal values ix i can be justified. Note in (31a) and (31b) that while the &dquo;degree of belief&dquo; parameter accumulates additively, the parameter s 2 is updated in the form of a weighted average which is a common feature in Bayesian posterior analysis.
The computations required for completing the approximate analysis are similar to those of REML. As illustrated in (13), an iteration constructed on the basis of successive approximations requires at each round the solution vector and the inverse of the coefficient matrix of Henderson's equations. These are also the computing requirements arising in first and second order algorithms for REML (Harville and Callanan, 1990). Derivative free algorithms should be explored as well. On the other hand, carrying out the exact analysis requires numerical integration and the mixed model equations would need to be solved at each of the steps of integration. This may not be feasible in a large class of models. In summary, the exact method can be applied only in data sets of moderate size, but it is precisely in this situation where it would be most useful.
As stated by Searle (1988), the sampling distributions of analysis of variance and likelihood based estimates of variance components are unknown and will probably never be because of analytical intractability. It would be intriguing to study the extent to which the inverted x2 distributions with densitites as in (16) and (22) can provide a description of the variability encountered over repeated sampling among estimates of variance components obtained with the present method. In such a study the QZ 's would be estimates obtained in different (independent) samples and v and s! would be parameters of the sampling distribution. In other words, perhaps the small sample distribution of point estimates obtained with VEIL can be suitably fitted by an inverted X 2 distribution.