A criterion for measuring the degree of connectedness in linear models of genetic evaluation

Summary - A criterion for measuring the degree of connectedness between factors arising in linear models of genetic evaluation is derived on theoretical grounds. Under normality and in the case of 2 fixed factors ($\theta$, $\phi$), this criterion is defined as the Kullback-Leibler distance between the joint distribution of the maximum likelihood (ML) estimators of contrasts among $\theta$ and $\phi$ levels respectively and the product of their marginal distributions. This measure is extended to random effects and mixed linear models. The procedure is illustrated with an example of genetic evaluation based on an animal model with phantom groups.


INTRODUCTION
The development of artificial insemination in livestock and the potential for using sophisticated statistical BLUP methodology (Henderson, 1984, 1988) gave new impetus for across-herd or station genetic evaluation and selection procedures, eg reference sire systems in beef cattle (Foulley et al, 1983; Baker and Parratt, 1988) or sheep (Miraei Ashtiani and James, 1990) and animal model evaluation procedures in swine (Bichard, 1987; Kennedy, 1987; Webb, 1987).
In this context, concern about genetic ties among herds or stations is becoming increasingly important although, from a theoretical point of view, complete disconnectedness among random effects can never occur, as explained in detail by Foulley et al (1990). Petersen (1978) introduced a test for connectedness among sires based on the property of the "sire x sire" information matrix after absorption of herd-year-season equations. Fernando et al (1983) proposed an algorithm to search for connected groups in a herd-year-season by sire layout which was based on the physical approach of connection developed by Weeks and Williams (1964). This view was also taken up by Tosh and Wilton (1990) to define an index of degree of connectedness for a factor in an N-way cross classification. Foulley et al (1984, 1990) reviewed the definition and problems relevant to this concept. They offered a method for determining the level of connectedness between 2 levels of a factor by relating the sampling variance of the corresponding contrast under the full model to its value under a model reduced by the factors responsible for unbalancedness.
The purpose of this paper is 2-fold: i) to extend this procedure defined for a specific contrast to a global measure of connectedness among levels of a factor; ii) to set up a theoretical framework to justify such a measure on mathematically rigorous grounds.

METHODOLOGY
Our starting point is the following basic property: if observations in each level of some factor (ie $\theta$) are equally distributed across levels of another factor (ie $\phi$), BLUE estimators of the contrasts $\theta_i - \theta_{i'}$ and $\phi_j - \phi_{j'}$ are orthogonal under an additive fixed linear model with independent and homoscedastic errors. This property is lost under an unbalanced distribution, up to an ultimate stage consisting of what is called disconnectedness or confounding between the 2 factors. This suggests the idea of measuring the degree of connectedness by some distance between the current status of the layout and the first "orthonormal" one, following the terminology of Calinski (1977) and Gupta (1987). The Kullback-Leibler distance

$$I_{12}(x) = \int p_1(x) \ln [p_1(x) / p_2(x)] \, dx$$

between 2 probability densities $p_1(x)$ and $p_2(x)$ turns out to be a natural candidate for measuring such a distance (Kullback, 1968, 1983).

The model assumed is a linear model with additive fixed effects and NIID (normally, identically and independently distributed) residuals:

$$y = X_\theta \theta + X_\phi \phi + X_\lambda \lambda + e, \qquad e \sim N(0, \sigma^2 I_N) \qquad [1]$$

where $y$ is an N x 1 data vector, $\theta$, $\phi$ and $\lambda$ are vectors of fixed effects and $X_\theta$, $X_\phi$ and $X_\lambda$ are the corresponding incidence matrices. Without loss of generality, we will assume a full rank parameterization in vectors $\theta$ and $\phi$ pertaining to factors $\theta$ and $\phi$ and resulting in contrasts such as $\theta_i - \theta_1$ and $\phi_j - \phi_1$ so that:

$$\theta = \{\theta_i - \theta_1\}; \quad i = 2, \ldots, m_\theta \qquad [2a]$$

$$\phi = \{\phi_j - \phi_1\}; \quad j = 2, \ldots, m_\phi \qquad [2b]$$

where $m_\theta$ and $m_\phi$ are the numbers of levels for the factors $\theta$ and $\phi$ respectively.
The vector $\lambda$ in [1] designates remaining effects of the model. In a 2-way crossclassified design (eg mean $\mu$, "treatment" and "block"), one has $X_\lambda \lambda = c \, 1_N$ with $c = \mu + \theta_1 + \phi_1$, but this parameterization turns out to be more general and may include one or several extra factors.
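As a numerical illustration of the Kullback-Leibler distance under normality, the following Python sketch evaluates it in closed form for two Gaussian densities sharing the same mean, and applies it to a joint covariance matrix versus its block-diagonal counterpart (ie the product of the marginals referred to in the Summary). The function names, the use of NumPy and the 3 x 3 matrix `C` are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def kl_gaussian(S1, S2):
    """Kullback-Leibler distance between N(m, S1) and N(m, S2) (same mean m)."""
    S1 = np.asarray(S1, dtype=float)
    S2 = np.asarray(S2, dtype=float)
    d = S1.shape[0]
    _, logdet1 = np.linalg.slogdet(S1)
    _, logdet2 = np.linalg.slogdet(S2)
    trace_term = np.trace(np.linalg.solve(S2, S1))  # tr(S2^{-1} S1)
    return 0.5 * (trace_term - d + logdet2 - logdet1)

def kl_joint_vs_marginals(C, r):
    """Distance between the joint density and the product of its marginals.

    C is the joint covariance matrix; its first r rows/columns belong to the
    first set of estimators, the remaining ones to the second set.
    """
    C = np.asarray(C, dtype=float)
    C_indep = C.copy()
    C_indep[:r, r:] = 0.0  # drop the cross-covariances: independent marginals
    C_indep[r:, :r] = 0.0
    return kl_gaussian(C, C_indep)

# Hypothetical covariance matrix: 2 contrasts for one factor, 1 for the other
C = np.array([[2.0, 1.0, 0.5],
              [1.0, 2.0, 0.5],
              [0.5, 0.5, 1.0]])
D = kl_joint_vs_marginals(C, r=2)
```

Because the second covariance matrix is the block-diagonal part of the first, the trace term equals the dimension and the distance reduces to half a log-ratio of determinants.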
Degree of connectedness is assessed through the Kullback-Leibler distance between the joint density $f(\hat\theta, \hat\phi)$ of the ML (maximum likelihood) estimators $\hat\theta$ and $\hat\phi$ and the product $f(\hat\theta) f(\hat\phi)$ of their marginal densities.

Similarly, by substituting $\phi$ for $\theta$: and, finally, from the last term in [11], one has:

Six remarks are worth mentioning at this stage:

1) As shown by formulae [10], [13], [14] and [15], one may talk equivalently about connectedness between $\theta$ and $\phi$ as well as connectedness of $\theta$ (or among $\theta$ levels) due to the incidence of $\phi$ (or connectedness of $\phi$ due to the incidence of $\theta$) in a model including $\theta$, $\phi$ and $\lambda$, using the terminology of Foulley et al (1984, 1990). This terminology is also in agreement with that taken up by statisticians (Shah and Yadolah, 1977).
2) It is interesting to notice that the variance $C_{\theta\theta.\phi}$ of the conditional distribution of $\hat\theta$ given $\hat\phi$ is also the variance of the marginal distribution of $\hat\theta$ under the reduced model ($\theta$, $\lambda$). This leads to view the ratio of determinants in [13] in the same way as Foulley et al (1990), ie using their notation:

$$D = \tfrac{1}{2} \ln \left( |C_F| / |C_R| \right) \qquad [16]$$

where $C_F$ and $C_R$ are the C matrices pertaining to $\hat\theta$ under the full (F) model in [1] and the reduced model (R) without $\phi$ respectively. Moreover, the $\gamma$ coefficient defined as:

$$\gamma = \exp(-2D) = |C_R| / |C_F| \qquad [17]$$

generalizes the $\gamma_{ii'}$ coefficient of connectedness introduced by Foulley et al (1990) for the contrast $\theta_i - \theta_{i'}$; it varies similarly from $\gamma = 0$ (or $D = +\infty$) in the case of complete disconnection to $\gamma = 1$ (or $D = 0$) in the case of perfect connection (ie orthogonality).

3) Let us consider the characteristic equation:

$$|C_{\theta\theta.\phi} - k \, C_{\theta\theta}| = 0 \qquad [18]$$

The roots $k_i$ of [18] are the eigenvalues of $C_{\theta\theta}^{-1} C_{\theta\theta.\phi}$ or $C_F^{-1} C_R$ so that:

$$\gamma = \prod_{i=1}^{r_\theta} k_i = k_g^{r_\theta} \qquad [19]$$

where $k_g$ is the geometric mean of the $k_i$'s and $r_\theta = \dim(C_{\theta\theta})$. Hence $\ln \gamma = r_\theta \ln k_g$, which is the justification to standardize $D$ and $\gamma$ to:

$$D^* = D / r_\theta, \qquad \gamma^* = \gamma^{1/r_\theta} \qquad [20]$$

so as to take into account the numbers of elements in $\theta$ to be estimated when comparing degree of connectedness of factors differing in number of levels. This standardization procedure is analogous to that proposed by Sölkner and James (1990) for comparing statistical efficiency of crossbreeding experiments involving different numbers of parameters. In that respect, $\gamma^*$ can be interpreted as a kind of average measure of connectedness for $(\hat\theta_i, \hat\phi)$ among all pairs of levels of the factor $\theta$ due to the incidence of the nuisance factor $\phi$ for a fixed effect model (see the Appendix). Since $\gamma$ is equal to both $|C_{\theta\theta}^{-1} C_{\theta\theta.\phi}|$ and $|C_{\phi\phi}^{-1} C_{\phi\phi.\theta}|$, one can standardize with respect to $r_\theta$ or as well to $r_\phi$, depending on the factor which we are interested in.

4) An alternative form to [18] is:

$$|C_{\theta\phi} C_{\phi\phi}^{-1} C_{\phi\theta} - \rho^2 C_{\theta\theta}| = 0 \qquad [21]$$

the roots of which, $\rho_i^2 = 1 - k_i$, turn out to be the squared canonical sampling correlations between $\hat\theta$ and $\hat\phi$. Since the (non zero) roots of [21] are also the (non zero) roots of $|C_{\phi\theta} C_{\theta\theta}^{-1} C_{\theta\phi} - \rho^2 C_{\phi\phi}| = 0$, they satisfy the equation $|C_{\phi\phi.\theta} - (1 - \rho^2) C_{\phi\phi}| = 0$. Thus $\gamma$ can be expressed as:

$$\gamma = \prod_i (1 - \rho_i^2) \qquad [22]$$

with $\rho_i = 0$ (ie $k_i = 1 - \rho_i^2 = 1$) for $i = r_\theta + 1, r_\theta + 2, \ldots, r_\phi$ if $r_\theta < r_\phi$, or for $i = r_\phi + 1, r_\phi + 2, \ldots, r_\theta$ if $r_\phi < r_\theta$.

5) The presentation was restricted to 2 factors, $\theta$ and $\phi$. It can be extended to more than 2 classifications. For instance, with 3 factors $\theta$, $\phi$, $\psi$, one can consider the Kullback-Leibler distance between $f(\hat\theta, \hat\phi, \hat\psi)$ and $f(\hat\theta) f(\hat\phi, \hat\psi)$. The resulting D coefficient can be expressed as $D = \tfrac{1}{2} \ln (|I_{\theta\theta,\lambda}| / |I_{\theta\theta,\phi\psi\lambda}|)$ and interpreted as the degree of connectedness of $\theta$ due to fitting $\phi$ and $\psi$ in the complete model ($\theta$, $\phi$, $\psi$, $\lambda$).

6) This approach developed for models with fixed effects can be extended to mixed models as well. A first obvious extension consists of taking $\lambda$ in [1] (or part of it) as a vector of random effects. The only change to implement in computing the matrix in [7] is to carry out an absorption of $\lambda$ equations which takes into account the appropriate structure of this vector. Actually this can be easily done using the mixed model equations of Henderson (1984).
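The quantities discussed in remarks 2 to 4 can be computed from a partitioned sampling covariance matrix of the two sets of estimated contrasts. The Python sketch below is only an illustration under stated assumptions: the function names, the dictionary keys and the 3 x 3 matrix `C` are hypothetical, and NumPy is assumed to be available. It forms the conditional variance of the first set given the second, takes the roots k_i as eigenvalues, and derives gamma, D and their standardized versions; a second function returns the squared canonical correlations.

```python
import numpy as np

def split_blocks(C, r_theta):
    """Partition C into its theta-theta, theta-phi and phi-phi blocks."""
    C = np.asarray(C, dtype=float)
    return C[:r_theta, :r_theta], C[:r_theta, r_theta:], C[r_theta:, r_theta:]

def connectedness(C, r_theta):
    """Roots k_i, gamma, D and standardized values for theta given phi."""
    C_tt, C_tp, C_pp = split_blocks(C, r_theta)
    # Conditional variance of theta-hat given phi-hat
    C_tt_p = C_tt - C_tp @ np.linalg.solve(C_pp, C_tp.T)
    # Roots k_i: eigenvalues of C_tt^{-1} C_tt.p
    k = np.linalg.eigvals(np.linalg.solve(C_tt, C_tt_p)).real
    gamma = float(np.prod(k))
    D = -0.5 * float(np.log(gamma))
    return {
        "k": np.sort(k),
        "gamma": gamma,
        "D": D,
        "gamma_star": gamma ** (1.0 / r_theta),
        "D_star": D / r_theta,
    }

def canonical_rho2(C, r_theta):
    """Squared canonical correlations between the two sets of estimators."""
    C_tt, C_tp, C_pp = split_blocks(C, r_theta)
    M = np.linalg.solve(C_tt, C_tp @ np.linalg.solve(C_pp, C_tp.T))
    return np.sort(np.linalg.eigvals(M).real)

# Hypothetical covariance matrix: 2 theta contrasts, 1 phi contrast
C = np.array([[2.0, 1.0, 0.5],
              [1.0, 2.0, 0.5],
              [0.5, 0.5, 1.0]])
res = connectedness(C, 2)
rho2 = canonical_rho2(C, 2)
```

For this matrix the product of the (1 - rho^2) terms agrees with the product of the k_i roots, as expected from the relation between the two characteristic equations.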
In more general mixed models, one has to keep in mind that from a statistical point of view, connectedness is an issue only for factors considered as fixed (Foulley et al, 1990). In other words, in a model without group effects, BLUP of sire transmitting abilities or individual genetic merits always have solutions whatever the distribution of records across herd-year-seasons and other fixed effects.
Nevertheless, the phenomenon of non orthogonality between the estimation of a contrast of fixed effects and the error of prediction in some level of a random effect still exists and may be addressed in the same way as outlined previously. For instance, to measure degree of connectedness between one random factor $u = \{u_i\}$; $i = 1, 2, \ldots, m_u$ (eg sire) and one fixed factor $\phi$ (eg herd), it suffices to consider in [3] its error of prediction from BLUP, ie replace $\theta$ in [2a] by $\Delta = \{\Delta_i = \hat u_i - u_i\}$. All the above formulae apply since the derivation of [10] or [16] requires tr(QC) = 0 (see [9c]), which results from general properties of the I and C matrices ([8], [9a] and [9b]) that do not refer to any particular structure (fixed or random) of the vectors of parameters. Again, the only computational adjustment to make is to view the corresponding I matrices as coefficient matrices of Henderson's mixed model equations (Henderson, 1984) after absorption of the equations in $\lambda$. In fact, this extension fully agrees with the role played by $|C|$ in the theory of Bayes D-optimality (see eg DasGupta and Studden, 1991).
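The absorption step can be sketched numerically for the simplest case, a purely fixed two-way layout analysed by ordinary least squares. In the Python sketch below, the helper names (`absorb`, `d_measure`, `layout`), the cell counts and the column ordering are assumptions chosen for illustration. D is taken as half the log-ratio of the determinants of the information matrix for the first factor after absorbing the remaining effects only (reduced model) and after absorbing the second factor as well (full model); the residual variance cancels in this ratio. A mixed model would require the appropriate mixed model coefficient matrix instead of X'X, which is not attempted here.

```python
import numpy as np

def absorb(M, keep, drop):
    """Schur complement: equations in `drop` absorbed into those in `keep`."""
    M = np.asarray(M, dtype=float)
    K = M[np.ix_(keep, keep)]
    if len(drop) == 0:
        return K
    KD = M[np.ix_(keep, drop)]
    DD = M[np.ix_(drop, drop)]
    return K - KD @ np.linalg.solve(DD, KD.T)

def d_measure(X, theta, phi, lam):
    """D for the columns `theta` due to fitting the columns `phi`."""
    coef = X.T @ X  # least squares coefficient matrix
    info_reduced = absorb(coef, theta, lam)     # model without phi
    info_full = absorb(coef, theta, phi + lam)  # full model
    _, ld_reduced = np.linalg.slogdet(info_reduced)
    _, ld_full = np.linalg.slogdet(info_full)
    return 0.5 * (ld_reduced - ld_full)

def layout(counts):
    """Build X = [1, treatment-2 indicator, block-2 indicator] from cell counts."""
    rows = []
    for (t, b), n in counts.items():
        rows += [[1.0, float(t), float(b)]] * n
    return np.array(rows)

# Hypothetical 2 x 2 layouts: balanced, then unbalanced cell counts
X_bal = layout({(0, 0): 2, (0, 1): 2, (1, 0): 2, (1, 1): 2})
X_unb = layout({(0, 0): 2, (0, 1): 1, (1, 0): 1, (1, 1): 2})

D_bal = d_measure(X_bal, theta=[1], phi=[2], lam=[0])
D_unb = d_measure(X_unb, theta=[1], phi=[2], lam=[0])
```

In the balanced layout the two contrasts are orthogonal and D vanishes, whereas the unbalanced cell counts give a strictly positive D.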

NUMERICAL EXAMPLE
A small hypothetical data set is employed to illustrate the procedure.
The layout (table I) consists of a pedigree of 8 individuals (A to H) with performance records on 7 of them (B to H) varying according to sex ($s_i$; $i = 1, 2$), year ($a_j$; $j = 1, 2, 3$) and herd ($h_k$; $k = 1, 2$). Unknown base parents (a to h) were assigned to 3 levels of a group factor ($g_l$; $l = 1, 2, 3$). Data of this layout are analyzed according to an individual (or "animal") genetic model (Quaas and Pollak, 1980) accommodated to the so-called accumulated grouping procedure of Thompson (1979), Quaas and Pollak (1982), Westell (1984) and Robinson (1986) (see Quaas, 1988 for a synthetic approach to this procedure). Using classical notations, this model can be written as:

$$y = X\beta + Zu + e \qquad [23a]$$

or, using distributions:

$$y \mid \beta, u \sim N(X\beta + Zu, \sigma_e^2 I) \qquad [23b]$$

$$u \mid g \sim N(Qg, A\sigma_a^2) \qquad [23c]$$

where $y$ is the data vector, $\beta$ is the vector of fixed effects (sex, year, herd), $u$ is the random vector of breeding values, and $X$ and $Z$ are the corresponding incidence matrices. The vector $u$ of breeding values has expectation $Qg$ and variance $A\sigma_a^2$, where $Q$, defined as in Quaas (1988), assigns proportions of genes from the 3 levels of group (vector $g$) to the 8 identified individuals, $A$ is the so-called numerator relationship matrix among those individuals and $\sigma_a^2$ is the additive genetic variance. Using Quaas' notations, $u$ can be alternatively written as:

$$u = Qg + u^*$$

with $u^* \sim N(0, A\sigma_a^2)$ being the random vector of the within-group breeding values.

The (full rank) parameterization chosen here is:
The grouping strategy of base animals is an issue of great concern for animal breeders due to the possible confounding or poor connectedness with other fixed effects in the model (Quaas, 1988). Therefore, it is of interest to look at the degree of connectedness between this group factor and other fixed effects, or equivalently at the degree of connectedness among group levels due to the incidence of other fixed effects. In this example, 3 fixed factors (in addition to group) were considered, which are sex (S), year (A) and herd (H), and their incidence on connectedness of groups can be assessed separately (S, A, H) or jointly (S + A, A + H, H + S, S + A + H). From notations in [1], degree of connectedness of G due to A is based on:

The corresponding information matrix is obtained from the coefficient matrix derived by Quaas (1988) for a mixed model having the structure described in [23a], [23b] and [23c]. Letting the vector of unknowns be $(\beta', g', u')'$, this coefficient matrix is given by:

In this example, the matrices involved in [26] are:

Elements in the first column of $Q$ within brackets are deleted in the computations due to the parameterization chosen in [24a] and [24b]. $A^{-1}$ is half stored with non zero elements being:

A* may also be calculated directly from Quaas' rule (Quaas, 1988).

Connectedness between groups due to the incidence of the other fixed effects was assessed under the full model using Quaas' system in [26], and also for a $u^*$ deleted model ($y = X\beta + ZQg + e$), then using the ordinary least squares equations. Numerical results are given in table II. In this example, the main sources of disconnectedness are, in decreasing order: herd, year and sex, the first factor being by far the most important one since the $\gamma^*$ values associated with herd are 0.312, 0.247, 0.272 and 0.239 when this factor is considered alone, and with year, sex and year plus sex respectively.
Actually, this result is not surprising on account of the grouping procedure based on parents in groups 2 and 3 coming out of different herds. One may also notice that D values for combinations of factors exceed the sum of D values for single factors. For instance, D is equal to 1.433 for S + A + H vs $\sum D$ = 1.316 when the 3 factors are taken separately. Results for the purely fixed model ($u^*$ deleted) are in close agreement with those of the full model. This procedure of ignoring $u^*$ effects for investigating linkage among groups was first advocated by Smith et al (1988) due to its relative ease of computation in large field data sets.
The extension of the theory to the measure of degree of connectedness of random factors is illustrated in this example by calculations of $D$ and $\gamma^*$ for breeding values (table II). Sources of unbalancedness rank as previously, but the average level of connectedness ($\gamma^*$ = 0.574) for breeding values is higher than for groups ($\gamma^*$ = 0.239) due to prior information (Foulley et al, 1990).
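The figures quoted for groups can be cross-checked with a few lines of Python. The sketch below assumes that r = 2 contrasts are estimated for the 3 group levels (a consequence of the full rank parameterization) and that the standardized coefficient is exp(-2D/r), which follows from gamma = exp(-2D) and from taking the r-th root of gamma; the function and variable names are illustrative only.

```python
import math

def gamma_star_from_d(D, r):
    """Standardized connectedness coefficient from D and the number r of contrasts."""
    return math.exp(-2.0 * D / r)

def d_from_gamma_star(gamma_star, r):
    """Inverse relation: D recovered from the standardized coefficient."""
    return -0.5 * r * math.log(gamma_star)

r_groups = 2  # assumption: 3 group levels give 2 estimable contrasts
gs_all = gamma_star_from_d(1.433, r_groups)  # D quoted for S + A + H
print(round(gs_all, 3))
```

With D = 1.433 for S + A + H, the result rounds to 0.239, which is the standardized value quoted in the text for herd combined with year plus sex.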
The theory also applies to specific contrasts among effects as originally proposed by Foulley et al (1984, 1990). The degree of connectedness for pair comparisons among breeding values then reduces simply to the ratio of the prediction error variance of the pair comparison $\delta_{ii'} = u_i - u_{i'}$ under a reduced model (R) with some effects deleted (in table III, all fixed effects except mean and group) to its value under the full model (F). Figures shown reflect a great heterogeneity in the pattern of degree of connectedness. This diversity can usually be well explained by looking at the levels of factors which differ or are shared by the individuals compared. For instance, B and F are closely connected ($\gamma^*$ = 0.840 and 0.808 in I and II respectively) because they are in the same herd and share close proportions of genes from the 3 groups of base parents (0.5, 0 and 0.5 from groups 1, 2 and 3 respectively in B vs 0.375, 0.125 and 0.5 in F). On the contrary, D and G, which come from different herds and have 3/4 of their genes originating from different groups (groups 2 and 3 respectively), are poorly connected ($\gamma^*$ = 0.047 and 0.064 in I and II respectively). Moreover, $\gamma^*$ values computed according to both procedures (exact or approximate definition) are in good agreement in this example, although it is difficult to draw general conclusions from such a limited example.
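For a single pair comparison, the ratio described above can be sketched as follows in Python. The two 2 x 2 matrices are hypothetical prediction error (co)variance matrices, not values from table III, and the function name `pair_gamma` is an assumption made for this illustration.

```python
import numpy as np

def pair_gamma(C_R, C_F, i, j):
    """Ratio of the variance of the contrast (i minus j) under R to that under F."""
    C_R = np.asarray(C_R, dtype=float)
    C_F = np.asarray(C_F, dtype=float)
    c = np.zeros(C_F.shape[0])
    c[i], c[j] = 1.0, -1.0  # contrast vector for the pair (i, j)
    return float(c @ C_R @ c) / float(c @ C_F @ c)

# Hypothetical matrices; C_F - C_R is positive semi-definite by construction
C_F = np.array([[2.0, 0.5],
                [0.5, 2.0]])
C_R = np.array([[1.0, 0.5],
                [0.5, 1.0]])

g01 = pair_gamma(C_R, C_F, 0, 1)
```

A value close to 1 indicates that fitting the extra effects hardly inflates the variance of the comparison, while a value close to 0 indicates poor connectedness for that pair.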

DISCUSSION AND CONCLUSION
This paper provides a theoretical framework for the definition of an objective criterion for measuring the degree of connectedness between factors involved in Gaussian linear models of genetic evaluation. The procedure proposed herein is based upon the assessment of non-orthogonality between estimators of contrasts (or errors of prediction for random effects) via the Kullback-Leibler distance.
This measure offers great flexibility since it can be employed for a particular comparison among levels of some factor or for a global evaluation of their degree of connectedness. Applications of these criteria to degree of connectedness among sires in a reference sire system based on planned artificial inseminations with link bulls have already been made in France (Foulley et al, 1990; Hanocq et al, 1992; Laloe et al, 1992).

where $C_R$ and $C_F$ are the same as in [16]. This criterion appears also in statistical inference on variance-covariance matrices as the so-called Stein loss function (Anderson, 1984; Loh, 1991). Here, it can be interpreted as the Kullback-Leibler distance between the marginal density $f(\hat\theta)$ of $\hat\theta$ and its conditional density $f(\hat\theta \mid \hat\phi)$, given the value of the parameter $\hat\phi$.
The feasibility of our procedure is determined by the ability to compute the logarithm of the determinant of a coefficient matrix after possible absorption of some factors, as required by other statistical procedures based on the likelihood function. In the current context of genetic evaluation with the animal model, an application of this procedure to phantom groups might be feasible using, at least, the model ignoring $u^*$ as a first approximation.
In that respect, it has also been suggested (Kennedy and Trus, 1991) to look at the elements of the coefficient matrix $X'ZQ$, whose relative values in row k provide the expected proportions of genes out of the different levels of groups contributing to the corresponding level of the kth fixed effect. In our example, these values are as follows:

These figures show a more unbalanced distribution across herd and/or year than across sex levels. Notice that this matrix gives the distribution of data according to groups for each factor separately. No account is taken of the joint distribution of data between those factors. In this model, this means that the factors sex and group are not perfectly connected due to the slightly unbalanced proportions observed. As a matter of fact, $\hat g_2 - \hat g_1$ is correlated to $\hat s_2 - \hat s_1$ and $\hat g_3 - \hat g_1$ in the "sex + group" model whereas they are uncorrelated in the full model (see table II).

The $\gamma^*$ criterion applied to breeding values measures how the C matrix of variances of prediction errors is reshaped due to the incidence of an unbalanced distribution of data across the nuisance factors. This change in C implies a related change in the variance covariance matrix of estimated breeding values, which influences the selection differential. Accuracy of selection is also expected to be altered. In this respect, insufficient connectedness can be compared to some extent to some non-optimum selection procedure which ignores, or does not weight properly, some sources of information, eg within family selection vs index selection. More research is needed in this field to quantify the amount of genetic progress which may be lost due to reduction in the degree of connectedness.
For fixed effects, connectedness is directly related to the unbiasedness requirement. This is especially true for group effects in the animal model, for which much concern has been raised (Smith et al, 1988; Quaas, 1988; Canon et al, 1992). The criterion developed here may help to check whether differences between groups in a particular model can be reasonably captured by the data structure. If not, one will have to reconsider the grouping procedure, or one may be tempted to put prior information on group effects, ie to treat them as random as suggested by Foulley et al (1990). In any case, one will have to compare different models and there are now specific statistical procedures available to do that in animal breeding (Wada and Kashiwagi, 1990).

APPENDIX
ie the expectation with respect to the distribution of $\hat\theta_1$ of the conditional expectation of $\ln R(\hat\theta_2, \hat\phi \mid \hat\theta_1)$ taken with respect to the distribution of $\hat\theta_2$, $\hat\phi$ given $\hat\theta_1$. This conditional expectation is by definition a D-measure noted $D(\hat\theta_2, \hat\phi \mid \hat\theta_1)$; because this is again a constant (see [10]) which does not depend upon $\hat\theta_1$, [A.3b]

Similarly, $\gamma(\hat\theta, \hat\phi) = \exp[-2D(\hat\theta, \hat\phi)]$ can be expressed as the following product:

and equivalently, after permutation of $\hat\theta_1$ and $\hat\theta_2$, as:

Thus, letting $\Gamma(\hat\theta_j, \hat\phi)$ such that:

one has:

and $\gamma^*(\hat\theta, \hat\phi) = [\gamma(\hat\theta, \hat\phi)]^{1/2}$ can be interpreted either as the geometric mean of the $\Gamma(\hat\theta_j, \hat\phi)$ coefficients in [A.6], or as the geometric mean of all possible $\gamma(\hat\theta_i, \hat\phi \mid \hat\theta_j)$ coefficients (including the unconditional ones). For three elements $\hat\theta_i$, $\hat\theta_j$, $\hat\theta_k$ in $\hat\theta$, one would have:

and similarly for 4 elements $\hat\theta_i$, $\hat\theta_j$, $\hat\theta_k$, $\hat\theta_l$.

These formulae can be easily extended to any number of elements $r_\theta$ in $\hat\theta$. Formula [A.7] applies and the coefficient of the power pertaining to the $\gamma(\hat\theta_i, \hat\phi \mid \ldots)$ term given k variables $\hat\theta_j$ in $\Gamma(\hat\theta_i, \hat\phi)$ is then $1 / C_{r_\theta - 1}^{k}$, ie the inverse of the coefficient for the kth power in the binomial expansion of order $r_\theta - 1$.