A link function approach to heterogeneous variance components

- This paper presents techniques of parameter estimation in heteroskedastic mixed models having i) heterogeneous log residual variances which are described by a linear model of explanatory covariates and ii) log residual and log u-components linearly related. This makes the intraclass correlation a monotonic function of the residual variance. Cases of a homogeneous variance ratio and of a homogeneous u-component of variance are also included in this parameterization. Estimation and testing procedures of the corresponding dispersion parameters are based on restricted maximum likelihood procedures. Estimating equations are derived using the standard and gradient EM. The analysis of a small example is outlined to illustrate the theory. &copy; Inra/Elsevier, Paris heteroskedasticity / mixed model / maximum likelihood / EM algorithm Résumé - Une approche des composantes de variance hétérogènes par les fonctions de lien. Cet article présente des techniques d’estimation des paramètres intervenant dans des modèles mixtes caractérisés i) par des logvariances résiduelles décrites par un modèle linéaire de covariables explicatives et ii) par des composantes u et e liées par une fonction affine. Cela conduit à un coefficient de corrélation intraclasse qui varie comme une fonction monotone de la variance résiduelle. Le cas d’une corrélation constante et celui d’une composante u constante sont également inclus dans cette paramétrisation. L’estimation et les tests relatifs aux paramètres de dispersion correspondants sont basés sur les méthodes du maximum de vraisemblance restreint (REML). Les équations à résoudre pour obtenir ces estimations


INTRODUCTION
A previous paper of this series [4], presented an EM-REML (or ML) approach to estimating dispersion parameters for heteroskedastic mixed models. We assumed i) a linear model on log residual (or e) variances, and/or ii) constant u to e variance ratios.
There are different ways to relax this last assumption. The first one is to proceed as with residual variances, i.e. hypothesize that the variation in log u-components or of the u to e-ratio depends on explanatory covariates observed in the experiment, e.g. region, herd, parity, management conditions, etc. This is the so-called structural approach described by San Cristobal et al. [23], and applied by Weigel et al. [28] and De Stefano [2] to milk traits of dairy cattle.
Another procedure consists in assuming that the residual and u-components are directly linked via a relationship which is less restrictive than a constant ratio. A basic motivation for this is that the assumption of homogeneous variance ratios or intra class correlations (e.g. heritability for animal breeders) might be unrealistic [19] although very convenient to set up for theoretical and computational reasons (see the procedure by Meuwissen et al. [16]). As a matter of fact, the power of statistical tests for detecting such heterogeneous heritabilities is expected to be low [25] which may also explain why homogeneity is preferred. The purpose of this second paper is an attempt to describe a procedure of this type which we will call a link function approach referring to its close connection with the parameterization used in GLM theory [3,14].
The paper will be organized along similar lines as the previous paper [4] including i) an initial section on theory, with a brief summary of the models and a presentation of the estimating equations and testing procedures, and ii) a numerical application based on a small data set with the same structure as the one used in the previous paper [4].

Statistical model
It is assumed that the data set can be stratified into several strata indexed by (i = 1, 2, ... , I) representing a potential source of heteroskedasticity. For the sake of simplicity, we will consider a standardized one-way random (e.g. sire) model as in Foulley [4] and Foulley and Quaas [5].
where y i is the (n i x 1) data vector for stratum i; j3 is a (p x 1) vector of unknown fixed effects with incidence matrix X i , and e i is the (n 2 x 1) vector of residuals. The contribution of the systematic random part is represented by O &dquo;uiZ iU * where u * is a (q x 1) vector of standardized deviations, Z i is the corresponding incidence matrix and < 7u , is the square root of the u-component of variance, the value of which depends on stratum i. Classical assumptions are made for the distributions of u * and e i , i.e. u * N N(0, A), e i N N(O, or 2 1,,,), and E(u * eD = 0.
The influence of factors causing the heteroskedasticity of residual variances is modelled along the lines presented in Leonard [13] and Foulley et al. [6,7] via a linear regression on log-variances: where 5 is an unknown (r x 1) real-valued vector of parameters and p' is the corresponding (1 x r) row incidence vector of qualitative or continuous covariates.
Residual and u-component parameters are linked via a functional relationship or equivalently where the constant T equals exp(a).
The differential equation pertaining to [3ab], i.e. (d C7u jC 7uJ -b(dC7ejC7eJ = 0 is a scale-free relationship which shows clearly that the parameter of interest in this model is b. Notice the close connection between the parameterization in equations [2] and [3ab] with that used in the approach of the 'composite link function' proposed by Thompson and Baker [24] whose steps can be summarized as follows: i ) (C7ui,C7eJ' = f(a,b,C7e J ; ii) C7ei = exp(? 7 i), and qi = (112)p' 6 . As compared to Thompson and Barker, the only difference is that the function f in i) is not linear and involves extra parameters, i.e. a and b.
The intraclass correlation (proportional to heritability for animal breeders) is an increasing function of the variance ratio p i = ou i /!e.. In turn p i increases or decreases with u, 2 depending on b > 1 or b < 1, respectively, or remains constant (b = 1) since dpi/p i = 2(b -l)do'e!/o'e!. Consequently the intraclass correlation increases or decreases with the residual variance or remains constant (b = 1). For b = 0, the u-component is homogeneous figure 1.

EM-REML estimation
The basic EM-REML procedure [1,18] proposed by Foulley and Quaas (1995) for heterogeneous variances is applied here. Letting / ' ' y' )' e=(e' e' e' .. e' )' and 'y = (6', T, b)', g i 1 the EM algorithm is based on a complete data set defined by x = (p , u * , e')' and its loglikelihood L(y; x). The iterative process takes place as in the following. The E-step is defined as usual, i.e. at iteration [t], calculate the conditional expectation of L(y; x) given the data y and y = y' l as shown in Foulley and Quaas [5], reduces to where E1 t ] (.) is a condensed notation for a conditional expectation taken with respect to the distribution of x in Q given the data vector y and y = advantageous to reduce the number of inner iterations, even up to only one. This is an application of the so called 'gradient EM' algorithm the convergence properties of which are almost identical to standard EM [12].
The algebra for the first and second derivatives is given in the Appendix. These derivatives are functions of the current estimates of the parameters y = ' Yl 'l, and of the components of E!t](eiei) defined at the E-step.
Let those components be written under a condensed form as: with a cap for their conditional expectations, e.g.
These last quantities are just functions of the sums X'y i , Z'yi, the sums of squares y §y i within strata, and the GLS-BLUP solutions of the Henderson mixed model equations and of their accuracy [11], i.e.
where ' 1 Thus, deleting [t] for the sake of simplicity, one has:  (7). For grouped data (n i observations in subclass i with the same incidence matrices X i = l ni x' and Zi = 1,,iz'), formulae (8) [22], A has an asymptotic chi-square distribution with r = dim ,f2dim S -2 0 degrees of freedom.
Under model (1), an expression of -2L(y; y) is: The theoretical and practical aspects of the hypotheses to be tested about the structural component 5 have been already discussed by Foulley et al. [6,7], San Cristobal et al. [23] and Foulley [4].
As far as the functional relationship between the residual and u-components is concerned, interest focuses primarily on the hypotheses i) a constant variance ratio (b = 1), and ii) a constant u-component of variance (b = 0) [2,16,22,28]. with Rank(P)-2 degrees of freedom.

Numerical example
For readers interested in a test example, a numerical illustration is presented based on a small data set obtained by simulation. For pedagogical reasons, this example has the same structure as that presented in Foulley [4], i.e. it includes two crossclassified fixed factors (A and B) and one random factor (sire).
The model used to generate records is: where a is a general mean, a i , 1 3 j are the fixed effects of factors A(i = 1, 2) and B(j = 1,2,3), sk the standardized contribution of male k as a sire and 1/2se ) that of male as a maternal grand sire.
Except for T = 0.001016 and b = 1.75, the values chosen for the parameters are the same as in Foulley [4]. The data set is listed in table I. The issue of model choice for location and log-residual parameters will not discussed again; they are kept the same, i.e. additive as in the previous analysis. 1) additive for both log Q e and log or 2 ; 2) additive for log u2 and log as = a + b log a,; 3) constant variance ratio (b = 1); 4) constant sire variance (b = 0). In this example, models (3) and (4) were rejected as expected whatever the alternatives, i.e. models (1) or (2). Model (2) was acceptable when compared to (1) thus illustrating that there is room between the complete structural approach and the constant variance ratio model. The corresponding estimates of parameters are shown in table IIL Estimates of the functional relationship are T = 0.001143 and b = 3.0121, this last value being higher than the true one, butnot surprisingly in this small sample -not significantly different (A = 1.5364 and P-value = 0.215).

DISCUSSION AND CONCLUSION
This paper is a further step in the study of heterogeneous variances in mixed models. It provides a technical framework to investigate how the u-component of variance and the intra-class correlation varies with the residual variance.
This has been an issue for many years in the animal breeding community. For instance for milk yield, the assumption of a constant heritability across levels of environmental factors (e.g. countries, regions, herds, years, management conditions) has generated considerable controversy: see Garrick and Van Vleck [8], Wiggans and VanRaden [29]; Visscher and Hill [26], Weigel et al. [28] and DeStefano [2]. Maximum likelihood computations are based, here, on the EM algorithm and different simplified versions of it (gradient EM, ECM). This is a powerful tool for addressing problems of variance component estimation, in particular those of heterogeneous variances [4,5,7,20,21]. It is not only an easy procedure to implement but also a flexible one. For instance, ML rather than REML estimators can be obtained after a slight modification of the E-step resulting for grouped data in where M uu is the u x u block of the coefficient matrix of the Henderson mixed model equations.
Such an extension will be easy to make via the ECM (expectation conditional maximization) algorithm [15] in its standard or gradient version along the same lines as those described in Foulley [4]. However caution should be exercised in applying the gradient ECM, for this algorithm no longer guarantees convergence in likelihood values. Other alternatives might be considered as well such as the average information-REML procedure [10,17].
In conclusion, the likelihood framework provides a powerful tool both for estimation and hypothesis testing of different competing models regarding those problems. However, additional research work is still needed to study some properties of these procedures especially from a practical point of view, for example the power of testing such assumptions as b = 1.