Bias in multiple genetic correlation from half-sib designs

Biais de la correlation genetique multiple (R 2 ) dans un plan experimental avec demi-freres. Des carres et des co-produits moyens entre peres et intra-pere ont ete simules pour etudier le biais du R 2 genetique (defini comme le carre de la correlation multiple entre un caractere et (q - 1) autres caracteres calculee a partir d'une estimee de la matrice des covariances genetiques), dans des schemas experimentaux equilibres comprenant des demi-freres. Des equations approximatives de prediction de ce biais ont ete etablies dans le cas d'une correlation nulle dans la population. Dans le cas, le biais est a peu pres inversement proportionnel aux degres de liberte d'estimation des composantes paternelles, a la precision de l'epreuve de descendance (implicite), et proportionnel a (q - 1). Si on utilise une regression genetique multiple basee sur un grand nombre de caracteres et/ou un petit nombre de peres, on s'expose a une perte de reponse a la selection par rapport a l'utilisation d'une regression basee sur les vrais parametres de la population.


INTRODUCTION
In animal breeding, some traits are difficult or impossible to measure on animals that we want to select. For example, traits may be sex-limited (eg, litter size in pigs, milk production in dairy cattle), or animals may be too old by the time the trait is expressed (eg, herdlife in dairy cattle).
One way to predict these traits of interest is by using a regression on traits that are easier to measure. Such traits may be physiological predictors, genetic markers, or general traits which are cheaper or easier to measure (eg, type traits in dairy cattle to predict herdlife). In practice, the regression will most likely be a genetic regression, ie predicting the estimated breeding value (EBV) of the trait of interest from EBV of other traits. The use of multiple genetic markers to predict some quantitative trait is also a form of multiple (genetic) regression.
One parameter which summarizes the precision of the genetic regression is the multiple genetic correlation p 9 , or rather its square, p9, which is more convenient to use. We define R9 as an estimate of p9. However, it is well known that the estimate (R 2 ) of the squared multiple correlation coefficient (p 2 ) from phenotypic regression is biased (Fisher, 1924). The aim of this study is to investigate the behaviour of R9 9 2 for balanced half-sib designs. For given population structures, intensity of selection, and relative economic values, p9 determines the responses to selection (eg, Sales and Hill, 1976). Hence by investigation of the behaviour of R9 we can also give examples of the consequences for selection response. 9 METHODS Throughout we assume multivariate normality of observations.

Phenotypic regression
We have q traits in total, N observations, and predict the kth trait from (q -1) other traits. Fisher (1924Fisher ( , 1928 and Wishart (1931) showed that: where p 2 = square of population multiple correlation, F = a hypergeometric function, and For p 2 = 0 , If the population correlation is not zero, approximations to the mean and variance of R 2 are Genetic regression Simulation Between (B) and within (W) sire matrices of mean squares and mean crossproducts of order q were simulated by sampling from independent Wishart distributions, where n is the number of progeny per sire, s is the number of sires, df w = s(n -1), and df b = (s -1). E is the within-sire residual covariance matrix, and It is the between-sire (genetic) covariance matrix. An estimate of the sire covariance matrix (which is one-fourth of the genetic covariance matrix) is Parameter estimates were forced to be in the parameter space (ie genetic correlations between -1 and 1, and heritabilities between 0 and 1) by attenuating estimates. First G and W were diagonalised, The eigenvalues of G, D i , correspond to canonical heritabilities, h2 ci = 4D i/ (D i +1).
Without loss of generality, assume that we wish to predict trait 1 from the other (q&mdash;1) traits (the predictors) using estimates of the genetic and residual covariances matrices, p2 is defined as (1 -REL 1 )(1 -RELp) I / ls(n -i )RELpRELI 11 [6] with REL! = n/(n + A ), A = (4 &mdash; /!.!)//!, and REL!, the 'reliability' pertaining to the (q -1) predictors. (The definition of REL j is a standard expression of the reliability of a progeny test with n progeny and heritability h § . ) If s(n -1) is large, then the simplest approximation to Rg is These approximations are appealing because of their similarity to !3!. (NB: [6] and [7] reduce to (q-1)/(s-1) for large n.) Equation [7] indicates that the expected value of the estimate of p 2 is approximately the number of variates used in the genetic regression divided by the 'effective number of sires'.
In some cases, for example when we deal with genetic markers, the heritabilities of the (q &mdash; 1) predictors, and their correlations with each other, may be known a priori. If the covariances among the predictors are zero, and their heritabilities are equal, then, after some algebra, Equation [8] suggests an adjusted estimate of pg,

Examples for phenotypic R Z 2
In table I, the exact mean and standard deviation of R 2 are given for various combinations of p2, q and N (using results from Wishart (1931)). As was shown in the previous section, these values correspond to the limiting case of very large progeny group sizes in half-sib designs. For most combinations the bias in R 2 is small, although for relatively few observations (N = 10Q and N = 200) and a large number of traits (q = 10 and q = 20), the bias and standard deviation of R 2 can be large. For example, when p 2 = 0 and q = 20, the mean and standard deviation of R 2 ( x 100) for N = 100 are 19.2 and 5.5 respectively (table II). Even for N = 400, the mean R 2 is nearly 0.05.
Examples for R9 when p 2 = 0 In table II, simulation results, and their predictions, are shown for various combinations of s, q, and n. The predictions were made according to (6!, using population parameters. In all cases the heritability of all traits was 0.25. In general, predictions and simulation results agreed reasonably, although for small n and s, and large q, the prediction tends to be too low. For example, for s = 100, q = 20 and n = 25, the average R2 from simulation was 0.93, whereas the prediction was only 0.49. Predicting herdlife from type traits Various authors have found associations between type traits and herdlife or survival in dairy cattle (eg, Rogers et al, 1988;Brotherstone and Hill, 1991;Boldman et al, 1992;Short and Lawlor, 1992). Most analyses were from sire models with many type traits analysed simultaneously. A typical value for the heritability of functional herdlife (= HL = herdlife adjusted for milk production) is 0.05. Equation [7] was applied to the situation where (functional) herdlife is predicted from a range of type traits, with h 2 of herdlife of 0.05 and h 2 of type traits of 0.30. Average predicted Rg ( x 100) for p 2 = 0, q = 20 and n = 50, were 61.9, 30.8, 15.4, 7.7 and 3.8 for s = 100, 200, 400, 800, and 1600, respectively.
In practice the EBV for milk yield may be combined with the EBV for herdlife (predicted from EBV of type traits) in an overall selection index. The efficiency of such an index was investigated using results from Short and Lawlor (1992). Their estimated genetic and phenotypic covariance matrices of HL and 15 type traits (hence, q = 16) for grade Holsteins were assumed to be the population covariance matrices. For each simulation, the estimated covariance matrices (with s = 1400 and n = 33) were used to create a selection index combining milk with HL. It was assumed that the h 2 for milk yield was known (h 2 = 0.25), and that milk yield and HL were independent (it is a separate issue what the correlation between adjusted herdlife and milk yield really is, since the adjustment is usually at the phenotypic level). Further assumptions were that the selection index was based on 50 progeny for milk yield and type traits, and that relative economic weights of milk/HL were 2:1 (in genetic standard deviation units). These results are presented in table III. The (assumed) genetic pg was 0.37, which follows directly from the results from Short and Lawlor (1992). The average Rg from simulation was 0.81, with a proportion of 0.58 of the simulated genetic covariance matrices that were attenuated. The optimum selection index (using population covariance matrices) resulted in a correlation between index and goal (r IH ) of 0.813. The achieved r! was on average 0.795, and the predicted r IH (assuming the estimated covariance matrices are the true ones) was 0.82 (table III). Hence, although the genetic Rg 9 2 was severely overestimated, the loss in response was small (0.795/0.813 = 0.978 efficiency). Ignoring type traits altogether gives an r IH of 0.785. Finally, using a selection index with milk yield and HL itself results in r IH = 0.826.

DISCUSSION
For half-sib population structures, average R2obtained from simulation and from prediction equations were compared for different number of sires, number of traits, and number of progeny per sire. In general, there was good agreement, although with a large number of traits (q) and small number of sires (s), average R 2 from simulation were larger than predicted. The reason for this is 2-fold. First, ligher order terms from the Taylor series which are not taken into account are likely to be proportional to q 2 , so that the prediction would be too low. Second, for combinations of large q and small n, the probability of non-positive-definite matrices and hence attenuation is higher (Hill and Thompson, 1978). After attenuation, the assumption of E(G) = B11 is not valid anymore, and the prediction will be out. For s = 100, q = 20 and n = 25, including higher order terms in the prediction (terms not shown) gave a predicted Rg of 0.58.
In table IV simulation results are presented separately for those replicates whose estimated covariance matrices were attenuated, and for those for which no attenuation was required (ie G = (B -W)/n). For nearly all combinations of parameters, the average Rg was nearly 1.0 for when covariance matrices were attenuated. This can be explained as follows: when the (B -W) is non-positivedefinite, a linear combination of all traits exists with zero genetic variance, and, therefore, any single trait may be predicted from a linear combination of all other traits with an accuracy of unity. Consider the bivariate case when the linear combination l l y l + 1 2 y 2 has zero variance, Var(l l y l + 1 2Y2 ) = a + 2cov + b = 0.
Hence, cov = &mdash;(a+b)/2, and r = -(a+b)/2(ab)1!!. The last term is always < -1, unless a = b. Hence, on the original scale, the correlation between y l and y 2 is < -l, which will be forced to -1, and the resulting R 2 will be 1.0. The same principle holds when for more than 2 traits, ie when y 2 itself is a linear combination of more than 2 traits. This has implications for inferences drawn from REML estimation, because most REML algorithms in practice do require estimates to be within the parameter space. Therefore, one should be very cautious in drawing inferences about functions of parameter estimates (such as R!) from large estimated covariance matrices.
Because the mean Rg depends on whether covariance matrices are attenuated, a refinement of the prediction equations is to predict the proportion of estimates for which this occurs. This was beyond the scope of this study, but Hill and Thompson (1978, and references therein) addressed that issue. Meyer and Hill (1983) found large losses in response for s = 100, n = 4(8) and 2 or 4 traits of equal importance when estimated covariance matrices were used in a selection index. Losses in response were much smaller when 'bending' was applied to the between-sire covariance matrix.
Overestimation of the multiple correlation coefficient from a multiple regression of (estimated) breeding values on genetic marker scores has similarities with the topic addressed in this study. When estimating associations between genetic markers and quantitative traits we have to specify what kind of population the sample is from. Usually association studies are either from populations derived from crosses between divergent lines (or inbred lines) or within families in completely outbred populations. When dealing with crosses from different breeds or inbred lines, the bias in phenotypic R 2 applies since linkage disequilibrium will be across the population. For half-sib designs in outbred populations essentially the bias in the within-sire R z is of interest because regressions of phenotypes on markers are within families. However, these cases are extremes. In practice, we may deal with a population which was created by hybridization a number of generations ago, and in that case it would not be unreasonable to look for genetic markers that explain some of the between-sire variance. A thorough study of the bias in R 2 from using genetic markers, taking into account the discrete nature of marker scores and linkages between markers and quantitative trait loci was outside the scope of this study. Sales and Hill (1976) derived losses in response to selection when including worthless marker traits in a selection index. For marker-assisted selection in a population created by recent hybridization, re-sampling of data after choosing an initial set of markers (Lande and Thompson, 1990) should reduce the bias in R 2 . However, although the individual marker effects may be estimated without bias in the subsequent sample (a result of Lande and Thompson's proposal), their combined effect, as measured by the R 2 , may still be biased. This could lead to a loss in response to selection compared to using the true marker effects because information from markers will usually be combined with phenotypic information in a selection index so that an upward bias in the R 2 from markers will result in too much weight given to the marker information.
In general, obtaining unbiased estimates of p2 is intractable, because the mean of Rg depends on the unknown population parameters in a complex way (ie first and second derivatives of R2 with respect to estimates of individual variance components in the Taylor series). In very limited cases, prior information about (co)variances can be used to adjust R 9 2. For example, if the heritabilities for all traits are known, and the (q -1) predictors are known to be uncorrelated, Equation [9] can be used to adjust R2. Table V shows simulation results using !9!. The adjustment works well, expect for large q and small s. The reasons for the poor performance of the adjustment for q = 20 and s = 100 are the same as before, ie higher order terms in the Taylor series are ignored and the probability of attenuation is higher.
Although the genetic R2 for predicting herdlife may be severely overestimated, the effect on loss in response to selection seems small. This is because the relative economic weight for HL was assumed to be half that of milk yield, and because the heritability of HL was small. For the example of Short and Lawlor (1992), h 2 of HL was only 0.04. Hence, even if we think we can accurately predict HL when in fact the prediction is inaccurate, response to selection is only reduced slightly because the prediction of HL gets a low weight in the overall selection index. Still, the loss in efficiency (2.2% for the example) should be compared to the maximum gain obtained by including type traits (0.813/0.785 = 3.6% extra gain in the example).
Thus, only about one-third of the maximum achievable gain was obtained. Finally, it seems undesirable to include traits in the selection index for which the estimated parameters may be subject to large error.