Approximating selection differentials and variances for correlated selection indices

On propose des formules de calcul approche des differentielles de selection et des variances d'index de selection apres selection directionnelle quand les candidats a la selection ont des index distribues normalement et correles de maniere quelconque. Ces formules ont pour base celles etablies en cas d'equicorrelation entre observations et font intervenir des polynomes des variables suivantes : taux de selection, coefficient de correlation moyen et ecart type moyen de ce coefficient par observation. Les coefficients des polynomes sont calcules apres ajustement a des donnees simulees. Les situations simulees font varier la structure des correlations (1, 2 ou 3 coefficients de correlation intra-classe, de valeurs 0,3 a 10,99), le nombre de familles (1, 2, 5 ou 10), la taille de famille (constante ou non) et le taux de selection (de 0,5 a 50%). En moyenne, 90% du biais introduit en ignorant les correlations entre observations est corrige par nos formules de prediction des differentielles de selection et des variances des observations selectionnees. Des comparaisons sont effectuees avec d'autres methodes de correction proposees pour des structures de correlation particulieres.

F Phocas JJ Colleau Institut national de la recherche agronomique, station de génétique quantitative et appliquée, 78352 Jouy-en-Josas cedex, France (Received 1st June 1994; accepted 1st August 1995) Summary -Empirical formulae were derived to approximate selection differentials and variances of the selected estimated breeding values when the estimated breeding values of the candidates for directional selection are multinormally distributed and correlated in any manner. These formulae extended the well-known exact basic form for the equicorrelated case, taking into account selection pressure, average pairwise correlation coefficient and average standard deviation of pairwise correlation per observation, through polynomials fitted to simulated data. Simulations were carried out for different correlation structures (1, 2 or 3 different intra-class correlations per family, ranging from 0.3 to 0.99), for different numbers of independent families (1, 2, 5 or 10), for constant or variable family size and for selection pressures ranging from 0.5 to 50%. On average, 90% of the bias occurring when ignoring correlations between observations was removed by our prediction formula of selection differential or variance of selected observations. Comparisons with other correction methods, which assume special correlation structures, were also carried out.

INTRODUCTION
The relative efficiencies of alternative breeding schemes can be assessed through deterministic predictions. Both selection and limited size of breeding populations lead to complex consequences for genetic gains, so that unbiased predictions are difficult to obtain (see review by Verrier et al, 1991, for example). An important consequence is that estimated breeding values (EBVs) of candidates are correlated (through genetic relationships and for statistical reasons, because EBVs are obtained from the same set of observations). However, a very common assumption is that candidates correspond to independent observations from an infinite population. Consequently, genetic gains are overestimated because selection differentials and variances of EBVs between selected candidates are overestimated. The amount of bias can differ according to breeding scheme and correctness of comparisons between schemes can be impaired. Burrows (1972) provided an accurate, easy to implement, approximation of selection differentials when independent candidates are drawn from a finite population. When the number of observations is larger than 5, it leads to errors which are always smaller than 2%, and usually smaller than 1%. Conversely, no exact method has been found to take into account any correlated structure among normally distributed observations. Owen and Steck (1962) gave the exact solution for equicorrelated multivariate normal distribution. If we define uniform families as families of identical size and identical within-family correlation structure, Hill (1976) and Rawlings (1976) provided the exact solution for the case of uniform independent families of within-family equicorrelated observations. Since this solution uses multiple numerical integration, they proposed ad hoc approximations which were relatively poor for high intra-class correlations (over 0.6) and severe selection pressures (below 10%). Rawlings' empirical formula was based on Owen and Steck's result for the equicorrelated case. Perez-Enciso and Toro (1991) proposed a method to account for any variance-covariance structure among indices. For equal variances, their method corresponded to Rawling's approximation. Meuwissen (1991) improved Rawlings' approximation for the case of several uniform families and found an extension for uniform full-sib families nested within uniform half-sib families. His correction was very accurate for the breeding schemes examined, ie assuming a hierarchical mating design. However, it cannot be generalized to any correlation structure.
The purpose of this paper is to provide approximation formulae for both the selection differential and the variance of EBVs of selected candidates, assuming no specific correlation structure but assuming that variances of EBVs are constant across candidates. Keeping Meuwissen's basic idea, these formulae are derived by fitting an extended Owen and Steck's formulae to simulated data.

General form
Selection differential Rawlings' (1976) formula consists of using Owen and Steck's (1962) exact formula for selection differential when population is split into independent and uniform families: I o is the standardized selection differential for finite independent observations and depends on n (number of candidates), p (selection rate), 7 oo (selection differential for infinite independent observations) through Burrows' approximation: r is the average pairwise correlation coefficient and is equal to for f families of size s with within-family correlation coefficient p. We suggest here a generalization of Rawlings' formula for any correlation structure, taking into account the following parameters: 1) the selection pressure (p $ 0.5); 2) the average pairwise correlation coefficient: where n is the number of candidates and pi! is the correlation between EBVs of candidates i and j; 3) the average standard deviation of the pairwise correlation coefficients involving a given candidate When variances of EBVs are standardized to 1, the analytical expression of the approximation proposed is: where P stands for a polynomial of variates p, r, and a r . In the equicorrelation situation (ar = 0), Owen and Steck's exact results still hold with such an approximation. Rawlings (1976) compared his approximate correction with exact results obtained from numerical integration and found that the discrepancies between them increased when correlations increased. This justified a further correction term including r. Introduction of parameters Qr and p was basically justified by the fact that Rawlings' approximation is less and less accurate when the variability of pg increases and/or p decreases. The polynomial form of the approximation was considered to be the simplest to implement when no analytical underlying theory is referred to.

Variance of selected EBVs
Owen and Steck's (1962) exact result for the equicorrelated case is V c = (1r)T!o where Y o is the variance of selected independent observations. Burrows (1972) showed that population size hardly affects this last variance. Therefore, U o is calculated as for an infinite population.
where X o o is the selection threshold in an infinite population.
The analytical expression of the approximation is: where Q is a polynomial of variates p, r, ar cancelling out when a r = 0.
Data examination showed that the first part of the approximation, V, = (1r)V o accounted for the major part of variance reduction induced by a correlated structure. The second part of the approximation was introduced as a multiplier factor because observation on calibration data sets showed that this method provided positive approximations for variances. Expressions ensuring positivity in any situation, such as (1 -1')V a exp(polynomial), were not able to provide a good fit. We will comment further on this point.

Fitting polynomial coefficients
Different structures were generated to provide variation for r and Qr . For a given structure, 5 000 replicates were generated. Subsamples corresponding to different selection rates (p) were extracted. The basic observed values I obs and V obs were respectively the averaged values of selected candidates (selection differentials) and the pooled value of within-replicate variances of selected candidates.
Only p values equal to or lower than 0.50 were investigated since the following equations exist: Therefore, if p were greater than 0.5, the prediction should hold for p * = 1p and back solution for p would be given by the above formulae.
Dependent combined observed values from several combinations of data structure x selection rate were analysed to test a polynomial regression, using the SAS procedure 'General Linear Models' (SAS/STAT User's Guide, 1990).
To estimate coefficients of the polynomial P, the dependent variate y was such that: which corresponded to For the polynomial Q, dependent variate z was such that V obs = (1 -r) V a (1 + z) which corresponded to Testing goodness of fit Polynomials of degrees 5 and 6 were tested for P and Q, respectively. They provided better adjustments (R-square values) than polynomials of lower degrees. Fitting higher polynomials led to singularities in our data sets. Only significant polynomial coefficients on p, r, a r and higher degrees of these variates were considered for use in correction formulae.
In addition to the R-square values provided by the model, relative errors incurred with different procedures were considered: -from treating variates as independent 1) for selection differentials UI = 100 J o -Jobs I.bs 2) for variance of selected observations U v = 100 V o -Vo b , ; V obs from correction attempts according to different formulae 1) for selection differentials Fi = 100 Ih -Jobs I Jobs where F is a generic letter corresponding to R, M, P (Rawlings, Meuwissen and polynomial formulae, respectively) 2) for variance of selected observations Fv = 100 1 VF -Vobsl !obs with F corresponding to B (Owen and Steck, 1962) or P (polynomial formula).
Absolute values of ratios are used because correction formulae sometimes lead to overcorrection, ie negative values of relative errors. Rawlings' formula often corresponds to overestimation. Regression formulae such as Meuwissen's and polynomial formulae lead to overestimates in some cases or underestimates in others.
Correction inefficiency corresponds to the ratio of errors still remaining after correction, compared with errors incurred with no correction at all.
Correction inefficiencies for selection differentials correspond to ratios Fj = F II U I , where F stands for alternative correction formulae. Correction inefficiencies for variances correspond to ratios F! = Fv/U!r. When reading tables, small values are favourable when considering either errors or correction inefficiencies.

SIMULATED DATA SETS
Calibration data sets Two sets of simulated data were generated and pooled to estimate coefficients of the polynomials involved in the previous formulae. These data sets were chosen in order to represent a large variation for the correlation structure among EBVs. For that purpose, values of intra-class correlations were arbitrarily taken without considering real breeding scheme structures. An n-candidate layout was simulated as a set of n correlated standardized normal variates, the basic normal variates representing EBVs. In such a simulation, there is no need to simulate performances leading to these EBVs.

Data set 1
In the first data set, 40 candidates for selection were simulated; 1 200 situations were examined according to the number (1, 2 or 5) of independent groups, called 'families', and the size of these groups (constant or variable). Possible contributions of families, when family size is not constant, are shown in table I. Selection pressures were 50, 40, 30, 20, 10 and 5%. Furthermore, 3 correlation structures were simulated.
In the first correlation structure, candidates of the same family were equicorrelated. This could correspond to full-sib of half-sib family structures. Cases with 1 family were not simulated since the exact result is known. Intra-class correlation values considered are shown in table II.
In the second correlation structure, 2 different intra-class correlations were considered within each family. This corresponded to the nested full-half sib family structure analysed by Meuwissen (1991); each family of half-sibs was made up of several groups of full-sibs. The number of full-sibs was 2 in each group. The 9 considered pairs of intra-class correlation coefficients between full-and half-sibs are shown in table II.
In the third correlation structure, 3 different intra-class correlation values per family were considered because each family was split into 2 subgroups. Correlations within sub-groups were r n and r 22 respectively. Correlation between sub-groups was r 12 . This could correspond to subgroups with different information, although strictly speaking, this would lead to heterogeneity of variance. The 6 combinations (r il , r 22 , r 12 ) considered are shown in table II. Sub-group 1 represented 25, 50 or 75% of each family.

Data set 2
In the second data set, 315 situations involving many more candidates (200) and more severe selection pressures (0.5, 1, 5%) were simulated. The number of families was 1, 2, 5 or 10. Heterogeneity for family size is shown in table I. Correlation structures varied according to the same principle as in data set 1 but values were not quite the same (see table II). Sub-group 1 represented 25 or 75% of each family.

Cross-validation data sets
The aim of these data sets is to validate the prediction formulae for correlation structures different from those used for fitting the polynomials. This is a way to test the prediction abilities and robustness of the fitted polynomial equations.
Four situations relative to breeding schemes (10 000 replicates per situation) were considered to derive different structures of correlations among indices. A BLUP animal model evaluation was used to rank animals.
Beef cattle breeding schemes (2 situations) Correction formulae were tested on a simulated selection nucleus for beef crossbreeding on dairy cattle (Phocas et al, 1995, unpublished results).
Generation 1 consisted of 186 dams born from 31 unrelated sires. Generation 2 was produced by mating these dams to 3 sires (1 calf per dam). A BLUP evaluation was implemented, assuming that males or females of generation 2, females of generation 1, males of generation 1, and males of generation 0 were recorded for a single trait.
The first situation corresponded to h 2 = 0.25 and the second corresponded to h 2 = 0.10.
The efficiency of our correction formulae was tested on generation 2, when selecting replacement females (p = 46/93) and males (p = 3/93) and for both heritabilities. Candidates for selection can be unrelated, half-sibs (same sire), cousin (same maternal grand-sire) or both at the same time. For h 2 = 0.25, values for r and Qr were 0.176 and 0.246. For h 2 = 0.10, the corresponding values were 0.223 and 0.309.

Dairy cattle breeding schemes (2 situations)
Intensive breeding schemes using embryo transfer and putting emphasis on pedigree selection are likely to induce high correlations between EBVs of candidates and to reduce effective selection differentials. These schemes are referred to as multiple ovulation and embryo transfer (MOET) schemes (Nicholas and Smith, 1983).
The efficiency of the proposed correction formulae was tested on the 192 females of generation 2, born from 4 sires and 48 dams. Each dam was mated to 2 different sires (factorial mating design). Each mating produced 2 females (and 2 males). The 48 dams were assumed to be recorded on milk yield (h 2 = 0.25) and to be born from 4 sires, unrelated to sires of generation 2. Female candidates of generation 2 were assumed to be evaluated according to a BLUP procedure, to produce replacement females or replacement males.
An 'adult' MOET (first situation) was mimicked assuming generation 2 was recorded (1 lactation per individual). In this situation, relevant r and a, were 0.160 and 0.220, respectively. If a 'juvenile' MOET (second situation) was implemented, females of generation 2 were not recorded before selection. In our layout, all the progeny (4 individuals) of the same dam had the same EBV. Therefore, selection was not carried out among 192 individual EBVs but among 48 EBVs groups; the corresponding r and Qr were 0.137 and 0.251.

Fitting selection differentials
Polynomial P was estimated from the observed data on 1 515 (ie 1200 + 315) basic situations. However examination of the results showed that high values of r(r > 0.6) were detrimental to goodness of fit. Therefore, we restricted data adjustment to the 1 383 situations where r was smaller than 0.6.
Coefficients of the polynomial of degree 5 shown in the Appendix were found to be significant. Similarity of coefficients suggested some grouping and the variate transformation d = r &mdash; QT . Examination of the new results suggested further additional variate transformations e = r(l -r) and b = d(1 -4.2e). This led to only 5 significant regression coefficients without loss of accuracy, as compared to the first adjustment. This polynomial was: with where estimation standard errors are in parentheses. In these conditions, the Rsquare value for this polynomial adjustment was found to be 0.85. Table III shows that the average relative error (P I ) was only 2.5% compared with 26.4% when no correction is used and with 7.1% when Rawling's correction is made. In 96% of cases, relative errors were smaller than 10%, whereas this occurred in only 20% of cases when no correction was used and 79.5% of cases when Rawling's correction was implemented. The average correction inefficiency rate of the polynomial adjustment (Pj&dquo;) was 10%, which meant that 90% of the bias occurring with no correction for correlated EBVs was removed. Only 77% of this bias was removed by Rawlings' formula.  Rawlings' correction (12.3%).
Comparison with Meuwissen's formulae was possible on the 681 situations (out of the 1383 simulated) with 1 or 2 intra-class correlation coefficients. These results are shown in table V. Average pairwise correlation coefficient was smaller than 0.5 for each situation with constant family size and 1 or 2 correlations (table Va). For these cases, Meuwissen's formulae were really better than ours: whereas Meuwissen's average error was smaller than 1% with a maximal error of 4%, the average error incurred with the polynomial formula was nearly 4 and 2% for 1 or 2 correlation cases, respectively. When family sizes were heterogeneous (table Vb), performances of our formula were maintained whereas those of Meuwissen's prediction, assuming a constant average family size, deteriorated and became worse than ours.

Fitting variances of selected observations
Only 606 situations of data set 1 (40 candidates) corresponding to r < 0.6 and constant family size were examined to adjust polynomial Q.
The results obtained suggested fitting p &mdash; 0.5 instead of p. Finally, polynomial Q was: with where the values in parentheses are the estimation standard errors. The R-square value of adjustment was found to be 0.84. Table VI shows that large relative errors for variance of selected EBVs were observed on the simulated data. Considering candidates as independent led to an However, polynomial approximation for variances cannot be considered as safe as for selection differentials. Firstly, we were unable to find reasonable adjustment when data sets included variable family sizes. Secondly, in 2 cases out of the 606 analyzed, corresponding to 0.99 intra-class correlations, our correction led to errors higher than those incurred with no correction. Thirdly, the theoretical form of equation [2] does not preclude negative predictions.
Examination of values of Q according to p and a r showed that positive values of approximated variances are obtained for any selection rate as soon as a r is smaller than 0.35, and for selection rates higher than 2% for Qr between 0.35 and 0.5. The polynomial approximation should not be used for a r greater than 0.5.

Cross-validation
The examples chosen correspond to situations where ignoring correlation between EBVs would lead to substantial relative errors: 10-20% for selection differentials and 30-130% for variances of the selected candidates (table VII). Rawlings' formula was found to be satisfactory (relative errors about 2%) when estimating selection differentials from moderate selection pressures (25-50%). Relative errors increased (6-12%) when selection was more severe (p around 2-4%). Owen and Steck's formula for predicting variances decreased biases but relative errors were still high (5-100%).
The efficiency of the polynomial formula was comparable to that of Rawlings' for moderate selection rates but was clearly superior for more severe selection, because relative errors by our formula in that case did not increase very much and were around 1-3%. Our polynomial formula for approximating variances was not entirely satisfactory but succeeded in giving better results than Owen and Steck's. The range of relative errors for variances was 0-50%; variances were overestimated for moderate selection pressures and underestimated for severe selection pressures.

DISCUSSION AND CONCLUSION
The objective of this work was to provide approximate expressions for selection differentials and corresponding variances on EBVs, easy to calculate and robust for any correlation structure between EBVs which are multinormally distributed. Hill (1977) proved that selection differential can be used to predict selection response when animals are ranked and selected on an optimum selection index.
Although no absolute proof can be given of the validity of our empirical approach for any situation, the moderate prediction errors observed on the calibration data sets (involving a very large diversity of situations) and on the cross-validation data sets (quite different from the former ones) lead one to think that these formulae are relatively robust and might be used for deterministic prediction on breeding schemes, especially when factorial mating designs are implemented and/or family sizes are variable (see, for instance, artificial insemination and natural service families).
However, particular situations should be addressed: 1) When r, the average pairwise correlation coefficient, is greater than 0.6, the situation corresponds to a population with all members closely related (for instance, a family with many full-sibs without own peformance information) and Rawlings' formula should be used.
2) With moderate selection pressures (p greater than 20%), although the polynomial approximation led to similar results, Rawlings' formula is recommended for the sake of simplicity.
3) With a hierarchical mating design leading to full-sibs nested within half-sibs, with families of constant size and constant intra-class correlation coefficients, Meuwissen's formula is preferred. The situation is quite clear for variance corrections since our methods is much better than Owen and Steck's formula and the major part (88%) of the bias occurring with no correction is removed. However, the errors are still large. Important restrictions are that family size should be constant and that a r , the average standard deviation of the pairwise correlation coefficients involving a given candidate, should not exceed 0.5. For these reasons, further improvement should be investigated. Additional heuristic research is needed to provide relevant approximations for selection differentials and variances of the selected candidates when variances are not constant (Perez-Enciso and Toro, 1991) and/or when EBVs of candidates do not have the same expectation, due to mixing of age cohorts, for instance. Such problems are very commonly encountered when attempting to implement deterministic predictions of genetic response.