Inferences on homogeneity of between-family components of variance and covariance among environments in balanced cross-classified designs

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers. L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés. Inferences on homogeneity of between-family components of variance and covariance among environments in balanced cross-classified designs Jean Louis Foulley, Daniel Hébert, R.L. Quaas

Summary -Estimation and testing of homogeneity of between-family components of variance and covariance among environments are investigated for balanced cross-classified designs. The variance-covariance structure of the residuals is assumed to be diagonal and heteroskedastic. The testing procedure for homogeneity of family components is based on the ratio of maximized log-restricted likelihoods for the reduced (hypothesis of homogeneity) and saturated models. An expectation-maximization (EM) algorithm is proposed for calculating restricted maximum likelihood (REML) estimates of the residual and between-family components of variance and covariance. The EM formulae to implement this are iterative and use the classical analysis of variance (ANOVA) statistics, ie the between-and within-family sums of squares and cross-products. They can be applied both to the saturated and reduced models and guarantee the solutions to be in the parameter space. Procedures presented in this paper are illustrated with the analysis of 5 vegetative and reproductive traits recorded in an experiment on 20 full-sib families of black medic (Medicago lupulina L) tested in 3 environments. Application to pure maximum likelihood procedures, extension to unbalanced designs and comparison with approaches relying on alternative models are also discussed. genotype X environment interaction / heteroskedasticity / expectation-maximization / restricted maximum likelihood / likelihood ratio test La procédure de test d'homogénéité des composantes familiales repose sur le rapport des vraisemblances restreintes maximisées sous les modèles réduit (hypothèse d'homogénéité) et saturé. Un algorithme d'espérance-maximisation (EM) est proposé pour calculer les estimations du maximum de vraisemblance restreinte (REML) des composantes résiduelles et familiales de variance et de covariance. Les formules EM à appliquer sont itératives et utilisent les statistiques classiques de l'analyse de variance (ANOVA), c'est-à-dire les sommes de carrés et coproduits interet intrafamilles. Elles s'appliquent à la fois aux modèles réduit et saturé et garantissent l'appartenance des solutions à l'espace des paramètres. Les méthodes présentées dans cet article sont illustrées par l'analyse de 5 caractères végétatifs et reproductifs mesurés lors d'une expérience portant sur 20 familles de pleins frères testées dans 3 milieux chez la minette (Medicago lupulina L). L'application INTRODUCTION There is a great deal of interest today in quantitative and applied genetics in heterogeneous variances. Ignoring such heterogeneity, as is usually done, may substantially affect the reliability of genetic evaluation and thus reduce the efficiency of selection (Hill, 1984;Visscher and Hill, 1992).
There is concern not only about estimating dispersion parameters for heteroskedastic models, but also about testing hypotheses for the real degree of heterogeneity which can be expected from experimental results. In this respect, Visscher (1992) investigated the statistical power of the likelihood ratio test in balanced half-sib designs for detecting heterogeneity of phenotypic variance and intra-class correlation between environments.
In that approach, the (family) correlation between environments (p) is assumed to be equal to 1, and heterogeneity of between-family components of covariance among environments in only due to scaling of variances.
The aim of this paper is to extend that approach to the case of true genotype by environment interactions (p # 1). Our attention will be focused on: i) crossclassified balanced designs; and ii) the null hypothesis involving homogeneity of between-family components of variance and covariance between environments. This variance-covariance structure has been widely used for analyzing family data recorded in different environments, in particular due to its close link with a 2factor classification model (ie family and environment) with interaction (Mallard et al, 1983;Foulley and Henderson, 1989). Moreover, even for balanced designs, the estimation of the 2 parameters involved in this simple structure via maximum likelihood procedures has no analytical solution in the general case when no assumption is made about the residual variances. This motivated the proposal made in this study to use the expectation-maximization (EM) algorithm (Dempster et al, 1977) to solve the problem.

Generalities
Let us assume that the records from the balanced cross-classified layout family (or genotype) x environment can be written as: where y2!! is the performance of the kth progeny (or individual) (k = 1, 2, ... , n) of the jth family (or genotype) ( j = 1, 2, ... , s) evaluated in the ith environment (i = 1, 2, ... , p) ; b ij is the random effect of the jth family in the ith environment, assumed normally distributed, such that Var(b ij ) = a B i, Cov(bij,bi ' j) = O'Bii ,, for i ! i', and Cov(bi!,bi!!!) = 0 for j # j' and any i and i'; and e2!k is a residual effect pertaining to the kth progeny in the subclass ij, assumed A!77D(0, <7!) viz, normally and independently distributed with mean zero and variance U2 wi Actually, this approach consists of considering the expression of the trait in different environments (i, i') as that of 2 genetically related traits with a coefficient of correlation pii, = asi!!/!B!!s!&dquo; (Falconer, 1952).
In a given environment (i), this 1-way linear model generates the classical ANOVA statistics, ie the between-family (SB!, B i ) and within-family (Swi, W i ) sums of squares and mean squares, respectively, whose distributions are proportional to chi-squares: Due to the cross-classified structure of the design, one also has to consider a sum (S B ..,) and a mean (BZi!) between-family cross-product for each (ii') combination of environments: has a Wishart distribution, denoted W(r, s -1), with parameters (s &mdash; 1) and r = Eyv + nE B , thus generalizing to a matrix of between-family sums of squares and cross-products the (o,2 w i +naBi)X!s_1! distribution arising in (3a!. In the 1-dimensional case, the set of SB! and S w . are independent, location invariant sufficient statistics for a-1 and w i similarly the matrices S B and Syv = Diag{5w,} have the same property for  The null hypothesis consists of assuming the homogeneity of the between-family components of variance ('di, or 2 = CT 2 ) and covariance (Vi # i', a-Bii' = C B ) as postulated in many analyses of genotype by environment experiments (Dickerson, 1962;Yamada, 1962;Mallard et al, 1983 (Meyer, 1990 ;-! (auaas, 1992). The basic principle is to treat the unobservable random variables b ij and e2!! as missing data. Actually, the EM algorithm will not be applied directly to the model described in [1] and [2] but after a spectral decomposition of E B according to its eigenvalues and vectors, ie: In this formula, A = Diagf6 il is the (p x p) matrix of eigenvalues 6 i , with 6 i repeated as many times as its multiplicity order, and U = (U 1 , U 2 , ... , Ui, ... , Up) is the (p x p) matrix of the corresponding p normed eigenvectors U i of E B (U'U = I P ). Under the special form shown in !11!, E B has only 2 distinct eigenvalues: with multiplicity orders 1 and (p &mdash; 1) respectively. Moreover, the matrix U of eigenvectors does not depend on the values of 6 1 and 6 2 , U' being the Helmert matrix of order p, see for example Searle, 1982 (p 71 and 322) for more details about such matrices. For instance, for p = 3, Due to the invariance property of (RE)ML estimators, the one-to-one transformation in [14a] and [14b] allows us to change the parameterization from (< T1 ,C B ) to (6 1 ,b 2 ), or more conveniently to (<!i,T) where T = 6 1 + (p &mdash; 1)6 2 , the back transformation is: From the spectral decomposition of E $ , the model in [2] can be written as: where U is defined as before and the vector f j = { fi! is such that f j N N(0,A).
Using the Dempster et al (1977) terminology, a complete data set x can be constructed from / -1, f j , e jk for j = 1,2,...,s and k = 1, 2, ... , n, whereas the incomplete data set is the vector y of observations. ' Let us first consider the case of E B . If the f j 's were known, sufficient statistics for 6 1 and T would be, under the normality assumption: REML would then be obtained by equating the expectation of these sufficient statistics, ie: to their calculated values (M step). Actually, these sufficient statistics are not directly observable and the EM algorithm proceeds first by estimating them by taking their conditional expectation given the observed data set (E step). Since such an estimation depends on the value of the unknown parameters, the procedure is iterative and consists of implementing the 2 usual steps:

Testing procedures
Hypotheses of interest concern the vector 0 of parameters involved in the matrices of between-family (E B ) and within-family (Ey!) components of variance and covariance between environments. The theory of the generalized likelihood ratio can be applied to that purpose, as already proposed by Foulley et al (1990,1992), Shaw (1991) and Visscher (1992)  Here e contains p(p + 3)/2 parameters corresponding to p residual components of variance and p(p + 2)/2 between-family components of variance and covariance between environments whilst Oo has p+2 2 parameters only (p residual components, <T1 and C B ), so that r = !p(p+ 1)/2! -2.
In the Neyman-Pearson approach of hypothesis testing, H o is rejected at the a level if the calculated value of A(y) exceeds a critical value À c such that Pr(xr > À c ) = a. However, the likelihood ratio statistic .!(y) in [24] can also be interpreted as the difference in degree of fit via maximum likelihood procedures by 2 models: a reduced model(R) with parameter vector 0 E O o and a full model (F) with 0 E O encompassing both the null hypothesis and its alternative. In the theory of significance testing (Kempthorne and Folks, 1971), this statistic is also used as a measure of strength of evidence against the reduced model or the null hypothesis. The lower the probability under H o of exceeding this statistic evaluated from the data (also referred to as the P-value or significance level or size of the test), the stronger the evidence against H o .

Example
Data used here to illustrate the procedures are from an experiment carried out in Montpellier (south west of France) on 20 full-sib families of black medic (Medicago lupulina L) tested in 3 different environments (control, harvesting and competition treatments). The experimental design was described in detail by H6bert (1991). There were 2 replicates per environment and the 20 genotypes were randomly allocated to each replicate. Thirty-six traits were recorded and the variable used was the mean of the 5 plants cultivated in each replicate so that p = 3, s = 20 and n = 2. Basic ANOVA statistics for the between-family and within-family sums of squares and cross-products are given in table I for a subset of 5 traits. Firstly, the null assumption that the diagonal terms of E w were equal was tested via a Bartlett's test based on ANOVA mean squares statistics. P values were 0.007, 0.08, 1.4 x 10-7 , 8 x 10'! and 0.04 so that this assumption can be reasonably rejected (except perhaps for trait 2).
Test statistics about E B and estimates of E B and £w under both the reduced and saturated models are given in table II. P-values for vegetative yield traits, represented here by dry matter weight (trait No 3) and dry matter weight/max plant size diameter (trait No 4), were very low, indicating a large heterogeneity in genetic variation between evironments with full-sib variances substantially reduced in the harvesting (i = 1) and competition (i = 3) environments compared with the control (i = 2). In contrast, the harvesting and competition environments do not generate a meaningful level of stress compared with the control for the expression of genetic variation of days to 1st ripe pod (trait No 2) and relative pod weight (trait No 5). These 3 environments then behave as 'exchangeable', as statisticians would say. In this example, genetic correlations between environments were rather high and it would have been interesting to test for some traits (eg No 1 and 5) using the assumption that these correlations are equal to unity by Visscher's (1992) procedures.

DISCUSSION
This paper describes a further contribution to the solution of the problem of testing homogeneity of between-family components of variance and covariance between environments in the case of balanced cross-classified designs. The testing procedure is based on the likelihood ratio test as already advocated by Shaw (1987) in quantitative genetics. This study extends that of Visscher (1992), which was restricted to the case of pure scaling effects between environments (ie all genetic correlations between environments equal to one).
The choice of an EM algorithm for computing REML estimates of E B and E w under the null hypothesis allows us to make explicit the equations of the iterative process to implement via formulae based on the usual ANOVA statistics. This algorithm does not require any constraint on the value of these ANOVA statistics ( eg B -W can be non-positive definite) provided the starting values for E B are within the parameter space. A simple reason for this is that the E phase under the restricted model involves the conditional expectation (given the data) of sums of squares, eg £ f£ and e e !! as estimators of variance. Because E (L f l f ly, A = 8!t! ) is j 7 k j , always positive definite (Foulley et al, 1987), this property of the EM algorithm is also true under the saturated model; it can then be used to provide REML estimates of E B and E w when ANOVA estimators are not permissible (eg for traits 1, 3, 4 and 5 in the example). Some authors such as Anderson (1984) and Shaw (1987) [11]) not only by the assumption of homoskedastic residual variances but also by its restriction to a positive covariance (C B ) between environments.
For instance, for trait 1 (days to flowering), EM-REML estimates of variance and covariance components obtained from the algorithm described in [17]  The EM algorithm is a first-order procedure and therefore has close relationships with a maximization procedure based on zeroed first derivatives. As shown in Appendix B in the case of E B , the difference between the formulae to implement in the 2 iterative procedures consists of replacing (s -1)B + r!t! in [19] by sB in the 1st derivative algorithm (B being a sufficient statistic for r in the saturated model). Again, the use of EM guarantees staying in the parameter space whereas there is no obvious proof of that for the 1st derivative algorithm. Moreover, the EM approach turns out to be easier to understand and to use than the other one, which relies mainly on algebraic tricks. As far as REML estimates of £w are concerned, a functional iteration algorithm based on first derivatives was also proposed in Appendix B due to the lack of an obvious analogue of the EM formulae.
Finally, this EM reasoning can be extended to an unbalanced structure of data and to additional nuisance fixed effects cross-classified with family effects, ie to an extended version of model [1] such as where 13 is a vector of fixed effects including the ith classification for environment and effects of other factors to be adjusted for, and x'ij k is the corresponding incidence (row) vector (1990,1992) and San Cristobal et al (1993) for calculating REML estimates of variances in the presence of heterogeneous residual components. However, the procedure derived in this paper remains definitively more general, for instance, it can also be easily applied to a non-diagonal structure of E w using formulae [17] to [22] unchanged and [23] slightly modified into E!+l] = S2!t!/ns. This paper deals with a null hypothesis of constant between-family variance and covariance. In some instances, a more appropriate null hypothesis would be a constant between-family correlation (p) between environments (a-B ii ' = /9<?'B,<!B,/) and/or of constant intraclass correlation [a-1i = t(a-1i + or2 wi )]. Testing procedures for these assumptions will be reported in a separate article. &dquo;