A sampling method for estimating the accuracy of predicted breeding values in genetic evaluation

A sampling-based method for estimating the accuracy of estimated breeding values using an animal model is presented. Empirical variances of true and estimated breeding values were estimated from a simulated n-sample. The method was validated using a small data set from the Parthenaise breed with the estimated coefficient of determination converging to the true values. It was applied to the French Salers data file used for the 2000 on-farm evaluation (IBOVAL) of muscle development score. A drawback of the method is its computational demand. Consequently, convergence can not be achieved in a reasonable time for very large data files. Two advantages of the method are that a) it is applicable to any model (animal, sire, multivariate, maternal effects...) and b) it supplies off-diagonal coefficients of the inverse of the mixed model equations and can therefore be the basis of connectedness studies.


INTRODUCTION
The accuracy of predicted breeding values may be assessed by prediction error variance (PEV) (e.g. [16]) or by other criteria which are functions of PEV such as the coefficient of determination (CD) (e.g. [17]) also defined as the squared correlation between a true genetic merit and its estimate [4,25]. PEV and CD were first used to evaluate the accuracy of the estimated breeding value of each animal (PEV; e.g. [10,26] CD; e.g. [4,25,27]). Then, they were extended to connectedness studies. In these studies, genetic comparability of two animals or two populations of animals could be assessed by measuring the PEV [4,16] or CD [17,18] of contrast between their genetic merits.
In theory, PEV and CD are derived from the elements of the inverse of the coefficient matrix of the mixed model equations. In practice, however, the number of animals to be evaluated is generally too large for this coefficient matrix to be inverted, and the elements of the inverse have to be approximated.
Attention has mainly been focused on diagonal elements and, therefore, on individual PEV or CD. Approximations have usually been found using analytical methods. Typically, diagonal elements are adjusted for connections to parents, progeny and fixed effects, and the reciprocal of resulting coefficients provides an approximation of the diagonal elements of the inverse [1,12,19]. Recently, Jamrozik et al. [15] applied such a method to random regression models. Analytical methods have been developed to approximate accuracies of prediction resulting from multiple trait analyses as well [8][9][10].
The partitioned matrix theory and sparse matrix inversion methods [20,21] have also been proposed to calculate the accuracies of random effects from a single trait animal model with direct and maternal effects [24]. These authors [24] also proposed a method to approximate these values in a reduced computing time.
Another approach for estimation of accuracies could be the use of samplingbased techniques such as Bootstrap [2] or Gibbs sampling [7], now increasingly more useful due to the availability of inexpensive and powerful computers.
The aim of this paper was to show how a simple sampling method could be used to calculate an approximate CD. This method was validated using an animal model with a sub-sample of data recorded on the French Parthenaise breed. It was then applied to all the Salers breed animals involved in the French on-farm evaluation of 2000.

Models
Consider a Gaussian mixed linear model with one random factor and a residual effect: y = Xb + Zu + e (1) where y is the performance vector of dimension n, b the fixed effect vector, u the random effect vector, e the residual vector, and X and Z the incidence matrices which associate elements of b and u with those of y.
The variance structure for this model is: and where A is the numerator relationship matrix, and the scalars σ 2 a and σ 2 e are the additive and residual variance components, respectively. The BLUP (Best Linear Unbiased Prediction) of the breeding values u, denotedû, is the solution of: Z MZ + λA −1 û = Z My (4) where λ = σ 2 e /σ 2 a and M = I−X(X X) − X . M is a projection matrix orthogonal to the vector subspace spanned by the columns of X: MX = 0.
The variance structure of u andû is [13]: and The accuracies of estimated breeding values (û) may be given by prediction error variances (PEV) or by other functions derived from PEV such as the CD. The PEV of the estimated breeding value of an animal i is: (7) or PEV(i, i) = var(u i ) − cov(u i ,û i ) (8) where u i andû i are the true and estimated breeding value of i, respectively, and V uu (i, i) and V uû (i, i) are the i-th diagonal elements of matrices V uu and V uû , respectively. The CD of i is: Since Vûû = V uû [13], individual CD may also be calculated as: CD(i, i) is therefore the squared correlation between the true and predicted breeding values of i [4,25]: Estimating PEV or CD by including formulas (5) and (6) in formulas (7) or (9) requires the approximation of diagonal elements of the matrices A and C uu , as shown by e.g. [1,19,24,27]. By using a sampling technique, estimating PEV or CD from formula (8) or (10) involves the empirical estimation of variances and covariances of predicted and true genetic values. Importantly, such a strategy can be implemented without any complex matrix computation. By extension, this method may be easily used to estimate off-diagonal elements of A and C uu which are of interest to study genetic connectedness between animals or populations (herds, years, countries...). The precision of a comparison between the genetic merits of animals or groups of animals can be estimated by looking at PEV [5,16] or at CD [17,18] of the corresponding contrast. This contrast may be seen as a linear combination of breeding values (x u) where x is a vector whose elements sum to 0 [17] e.g., the contrast between breeding values of two animals i and j is: The PEV of x u is: (11) and its CD is: Finally, this method may be used to estimate individual PEV or CD and PEV or CD of a comparison within or between any random variable in the model such as maternal effect, permanent environment effect... by replacing the breeding values in formulas (8), (10), (11) or (12) by the desired variable.

Sampling method algorithm
The method consists of estimating the different variances involved in formulas (8) or (10). These estimates are obtained from the empirical distribution of u andû using a sampling process. The inbreeding of the parents was ignored to simplify the procedures of simulation.

Simulation of vector u
The vector of breeding values (u) is normally distributed with a variance matrix Aσ 2 a , whose order can reach more than 10 6 . Current random number generators cannot draw vectors with such complicated multivariate distributions. Nevertheless, a vector accounting for the particular pattern of the matrix A can be easily derived using a method such as the one described by Foulley and Chevalet [3], for example. This method is regularly used in simulation studies and is briefly described here: 1. First, animals involved in the simulation are sorted chronologically, from the oldest to the youngest. Hence, the parents' breeding values are simulated before those of their progeny. 2. A breeding value u i is randomly generated for each animal i from a normal distribution which depends on the status of i's parents j and k: If j and k are unknown, then u i is generated from N(0, σ 2 a ); If one parent, say j, is known, then u i is generated from N(0.5u j , 0.75σ 2 a ); If j and k are known, then u i is generated from N(0.5u j + 0.5u k , 0.5σ 2 a ). At the end of the process, the vector u = {u i } is actually distributed according to the multivariate Gaussian distribution N(0, Aσ 2 a ).

Simulation of vector y
Since the estimation of variance matrices does not depend on fixed effects, these effects are set to 0 without loss of generality. The performance of each performance recorded animal t is then equal to y t = u t + e t , where e t is randomly generated from the Gaussian distribution N(0, σ 2 e ). Performances of the non-recorded animals are not simulated.

Simulation of vectorû
The vectorû is then obtained by solving the mixed model equations (formula 4) using the simulated performances (y).

The sampling process and variances estimations
Repeating this process n times produces 3 n-vectors for each animal i: i are respectively the value of the k-th replicate of y i , u i andû i . According to the Glivenko-Cantelli theorem (e.g. [6]), the empirical distributions of u i andû i converge to their true distributions as n increases. Empirical variances and covariances are, therefore, computed for each animal i from the n replicates of u i andû i (u i andû i ).
The empirical variances and covariances structure between u andû is given by: PEV or CD are then estimated by replacing the variance component formulas (7) or (9) by these empirical estimates. PEV or CD of any contrasts can also be estimated by computing directly their own empirical variances and covariances without actually computing all the other off-diagonal elements of the matrices.
NAG subroutines were used for drawing random numbers [22].

Validation of the method
Validation of this method was done in a sub-sample of the data used on the French on-farm evaluation, IBOVAL, for the Parthenaise breed. The trait analysed was the muscular development score at weaning, and the model used in this present study was the model used in the real IBOVAL evaluation [14].
The data set consisted of 1 592 Parthenay animals among whom 970 were performance recorded. Contemporary groups (38 levels), and four fixed effect factors were included in the model. The heritability was equal to 0.28.
The limited size of the data set allowed the estimation of the true CD by inversion of the coefficient matrix of the mixed model equations (formula 6).
The approximate CD based on formula (10) were estimated by solving the mixed model equations for 500, 1 500, 5 000 or 25 000 replicates of y, u andû. BLUP were estimated using an iteration method involving successive overrelaxation (SOR). A relaxation parameter of 1 was used for the first six iterations, 1.2 from Iteration 7 to Iteration 40, and 1.5 from Iteration 41 until convergence. The process stopped when the convergence criterion reached 10 −4 . The convergence criterion was: was the vector containing the BLUE and the BLUP (according to formula (1):b (k) andû (k) ) from the k-th iteration.

Application of the method
The method to estimate CD by simulation was applied to the Salers breed animal model for muscular development score at weaning. This data set was used for the 2000 IBOVAL evaluation. It consisted of 291 965 animals among whom 234 615 were performance recorded. The model for evaluation included the contemporary group effect (8 654 levels), sex (2 levels), calving season (8 levels), sire breed (2 levels), dam parity combined with age at first calving (18 levels), scoring status (4 levels: not weaned, just weaned, weaned, unknown), calf particular individual situation (2 levels: favoured in view to the agricultural shows; normal) and calf rearing management method (4 levels). Details of the model are given by the Institut de l'élevage and INRA in [14].
In order to test the repeatability of the results, estimated CD from 10 samples of 100 replicates of y, u andû were compared. Such comparisons were also done within 10 samples of 200, 300, 400 and 500 replicates.
Finally, comparisons of estimated CD with 300 replicates by decreasing the convergence criterion from 10 −4 to 10 −3 were made to test the loss of precision with respect to the gain of rapidity.
All the computation used a RISC 595 supercomputer with a CPU of 133 MHz.

Validation of the method
The true CD ranged between 0 and 0.852, with a mean of 0.297 and a standard deviation of 0.173 (Tab. I).
When the number of replicates increased, the correlation between the estimated and the true CD increased. Concurrently, the maximum deviation and the mean absolute deviation between the estimated and the true CD decreased (Tab. I). Consequently, the percentage of large deviations (the difference between true and estimated CD was greater than 0.05) dramatically decreased from 12.3% to 0.4% with 500 and 25 000 replicates respectively. These results confirmed that the empirical estimators of CD converged to the true values of CD as the number of replicates increased.

Application of the method
The large size of the data set prevented estimating the true CD by inversion of the coefficient matrix of the mixed model equations. Consequently, CD values from 6 000 replicates were treated as optimal simulated CD against which other results could be compared. These optimal estimated CD values ranged between 0 and 0.990, with a mean of 0.411 and a standard deviation of 0.120 (Tab. II). For a fixed convergence criterion, the duration of one replicate increased as the size of the problem increased. Consequently, the number of replicates could not be very high if the method was to be run in a reasonable time. A compromise had to be reached between the convergence criterion and the number of replicates.
In the Salers application, about 33 replicates were run per hour with the convergence criterion set at 10 −4 . The statistics presented in Table II and Figure 1 confirmed that the higher the number of replicates was, the more the approximate CD converged to the optimal estimated ones (obtained with 6 000 replicates). Nevertheless, a satisfactory approximation of the CD was obtained with 300 replicates in approximately 9 h. Actually, more than 78% of these deviations were lower than 0.050 (Fig. 1) and the correlation between the optimal and estimated CD reached 0.95 (Tab. II). In this situation, the highest deviations occurred when the optimal CD were midrange (from 0.25 to 0.55; Fig. 2). When the optimal CD were higher than 0.55, the deviations noticeably decreased.
Consequently, since the sires' optimal CD were slightly higher on average than the whole optimal CD (0.55 versus 0.41 on average), their CD tended to be better estimated. Their largest deviation was 0.15 and more than 86% of these deviations were lower than 0.050 (Fig. 3). Moreover, the larger the sires' family was, the higher was the sires' optimal CD and the better estimated were these CD. Among the sires with more than 100 tested progeny, mainly the sires used for artificial insemination, the largest deviation was 0.083 and 98% of these deviations were lower than 0.050 (Fig. 3). Absolute deviation mean and maximum deviation between the optimal and estimated CD according to the optimal CD value (Salers application).  The distributions of CD from the 10 samples of 100 replicates were very similar showing the repeatability of the results. Nevertheless, the animals whose CD were the worst estimated were not always the same. The same results were obtained with 10 samples of 200, 300, 400 and 500 replicates.
Finally, the results given in Table III and Figure 4 showed that decreasing the convergence criterion from 10 −4 to 10 −3 increased the number of replications  per hour by 2.5 without a significant loss of precision: the correlation with optimal CD decreased very slightly from 0.946 to 0.94, maximum deviation and mean absolute deviation were only up to 0.01 and 0.002 respectively whereas, CPU time noticeably decreased from about 9 h to 3 h 30 min. Other reductions of CPU time might be obtained by using iterative methods to solve the mixed model equations such as Chebyshev acceleration or the conjugate gradient [11] which are more optimal than the one involving SOR which was used.

CONCLUSION
This method is very simple in its principle and in its application. Long chains of conditional stochastic draws such as those used in Gibbs sampling schemes, for instance in Jamrozik et al. [15] are not needed. Obviously, this method is very computationally demanding. Even in a context of inexpensive and high-speed computing, it could be a limiting factor. It can, however, be applied to estimate CD for small to medium-size analyses, with about 300 000 animals included in the evaluation, which corresponds to seven beef breeds currently evaluated in France (Blonde d'Aquitaine, Salers, Aubrac, Gasconne, Parthenaise, Maine-Anjou and Bazadaise). If the data set is too large, considering for instance reduced animal models [23] instead of animal models could reduce the problem.
Furthermore, this method is quite general and could be easily adopted for other genetic evaluation models, particularly for multivariate models or models including maternal effects or many fixed effects.  CD estimated from 6 000 replicates.
An extremely interesting extension of this kind of method could concern connectedness studies [4,17,18]. Whatever the method used to address these problems, off-diagonal coefficients of the PEV matrix (C uu ) [5,16] and offdiagonal coefficients of the relationship matrix (A) [18] are needed to estimate CD of a comparison. No analytical method can deliver these coefficients. According to Misztal [20], using a sparse matrix inversion algorithm still requires a prohibitively large amount of CPU time with a Cray 2 supercomputer when the number of equations is larger than 100 000. Sparse matrix inversion algorithms which provide a reduction in the number of calculations, have been developed but only for approximate diagonal elements of the inverse alone [21,24]. The sampling method theoretically allows the estimation of the entire matrices (C uu and A) and, then, the computation of any desired CD or PEV of comparison between breeding values of sires, or genetic levels of years, herds, countries,... It is, however, obviously unrealistic because of the prohibitively large amount of storage space needed to stock all the coefficients. However, by looking at only some contrasts of interest between genetic merits of some chosen animals or genetic level of some chosen herds, can reduce problems. Nevertheless, further development of supercomputers in terms of computing speed and storage ability will certainly make computer intensive methods such as sampling methods more and more useful.