Semi-parametric estimates of population accuracy and bias of predictions of breeding values and future phenotypes using the LR method

Background Cross-validation tools are used increasingly to validate and compare genetic evaluation methods but analytical properties of cross-validation methods are rarely described. There is also a lack of cross-validation tools for complex problems such as prediction of indirect effects (e.g. maternal effects) or for breeding schemes with small progeny group sizes. Results We derive the expected value of several quadratic forms by comparing genetic evaluations including “partial” and “whole” data. We propose statistics that compare genetic evaluations including “partial” and “whole” data based on differences in means, covariance, and correlation, and term the use of these statistics “method LR” (from linear regression). Contrary to common belief, the regression of true on estimated breeding values is (on expectation) lower than 1 for small or related validation sets, due to family structures. For validation sets that are sufficiently large, we show that these statistics yield estimators of bias, slope or dispersion, and population accuracy for estimated breeding values. Similar results hold for prediction of future phenotypes although we show that estimates of bias, slope or dispersion using prediction of future phenotypes are sensitive to incorrect heritabilities or precorrection for fixed effects. We present an example for a set of 2111 Brahman beef cattle for which, in repeated partitioning of the data into training and validation sets, there is very good agreement of statistics of method LR with prediction of future phenotypes. Conclusions Analytical properties of cross-validation measures are presented. We present a new method named LR for cross-validation that is automatic, easy to use, and which yields the quantities of interest. The method compares predictions based on partial and whole data, which results in estimates of accuracy and biases. Prediction of observed records may yield biased results due to precorrection or use of incorrect heritabilities. Electronic supplementary material The online version of this article (10.1186/s12711-018-0426-6) contains supplementary material, which is available to authorized users.


Background
Models for genetic evaluation are an oversimplification of reality that usually holds only in the short run and in closely-related populations. Their properties are rarely well known, which can lead to unexpected results. For instance, initial applications of genomic predictions of breeding values (GEBV) in dairy cattle led to biases, with young "genomic" selected bulls with high GEBV being overpredicted, as verified by posterior progeny testing [1][2][3]. As a result, further use of GEBV in the dairy industry required extensive cross-validation and a more formal analytical framework [4][5][6].
The introduction of new methods for genetic or genomic evaluation raises the question of model choice (comparing across models) and model quality (features of a particular model). Thus, we need tools to rank, understand and quantify the behavior of prediction models in an "animal breeding" context. The need for these tools has dramatically increased with the implementation of genomic selection, given its built-in encouragement to take riskier decisions such as selection of unproven young candidates, in particular in dairy cattle. The method that is most commonly used to check genomic predictions is some form of crossvalidation, a test that was rarely used in pedigree-based genetic evaluation studies, which relied primarily on progeny testing (but see [7,8]). In genomic prediction, cross-validation studies are indeed the norm [4,9,10].
Cross-validation tests rely on either one of two approaches: (1) comparing (G)EBV or predicted phenotypes to (pre-corrected) observed phenotypes, deregressed proofs, or yield deviations [9]; or (2) comparing (G)EBV to highly accurate EBV from progeny testing. Another approach, which is in between the two above approaches, is based on daughter yield deviations (DYD; [6]), which are close to highly accurate EBV if heritability is high and the number of daughters is large. Cross-validation is very useful but there are some concerns about the quality or adequacy of these approaches for several reasons, including: (a) the need to pre-correct phenotypes; (b) the growing difficulty to obtain unbiased estimates of DYD with the increasing use of non-progeny tested bulls selected based on GEBV; and (c) their inadequacy for indirect predictions such as those of maternal effects, for which there is no direct observation related to the animal. Apparent contradictions exist, such as lower accuracy of GEBV than that of pedigree EBV [5,11], or accuracies higher than 1 for lowly heritable traits. For a detailed review of cross-validation in animal breeding and its metrics, we refer the reader to our review [12].
Cross-validation is a good tool but has some limitations as discussed above. Thus, there is an increasing need for a simple general tool that can be used in several complex scenarios, including for traits with a low heritability (e.g. reproductive and fitness traits), for indirectly observed traits (random regression coefficients, maternal effects, GxE interactions), and with limited size of progeny test groups (e.g., pigs). Here, we propose to complement cross-validation approaches with semiparametric procedures based on the classical theory of genetic evaluation.
Semiparametric procedures based on the mixed model equations are appealing because they combine theory, which we know is approximately and/or asymptotically correct, with model-free evidence from data. In the 1990's, there was some effort to develop such procedures [13]. Reverter et al. [14] showed that the amount of change in EBV from one genetic evaluation to the next (i.e., with the arrival of "new" data) was predictable. In parallel, bias in across-country predictions [15,16] led to the introduction of the Interbull tests [17], which draw on a similar idea. This family of methods has been used to check unbiasedness of predictions and, in the case of the Interbull tests, relies heavily on the availability of progeny tests based on large numbers of daughters.
In this work, we draw on analytical results from [14] and present theoretical features of semi-parametric procedures, namely method LR (from "linear regression"). These procedures are a series of statistics, which describe the change of predictions from "old" to "recent" evaluations that can be used to compute and compare population accuracies and biases of (genomic) predictions. We also explore analytical properties of the ability to predict future phenotypes, sometimes called "predictivity". Then, we illustrate the method with deterministic results for simple designs and for experimental beef cattle data.
This work proposes estimates of the "population" accuracy, which is the correlation between true (TBV) and estimated breeding values (EBV) across individuals in a population. Population accuracy is relevant to compare the predictive ability of models and to maximize genetic progress. This work does not propose methods to estimate individual accuracies, which are a measure of the risk when choosing a particular animal for breeding [18].

Methods: analytical developments
We propose to test the quality of evaluation methods using cross-validation tests based on successive EBV of a set of "focal" individuals (a validation cohort). These "focal" individuals can be the whole population [14,19] or a set of "focal" individuals of interest, such as "genomic" candidates for selection [6].
We will use the convention that var(x) refers to a scalar, the variance of a random element from a single realization of random vector x (in other words, where n is the size of x ), whereas Var(x) refers to the variance-covariance matrix of elements in x during conceptual repetitions. We use a similar convention for cov x, y and r x, y , which are scalars that represent the covariance and correlation across elements in x and y , whereas Cov x, y is a matrix.

Definition of population accuracy, bias, and dispersion
Let u be the true breeding value (TBV) and û an estimated breeding value (EBV) of a single individual. The classical definition of accuracy is the correlation r u,û for one individual across conceptual repeated sampling [20], which is a measure of the expected magnitude of the change in EBV with increasing information. Accuracies are also used to forecast genetic progress in a selection scheme [18,21,22]. This use applies to large unrelated populations, and made sense at the time of selection-index based selection (e.g. selecting boars based on family information). However, for the joint evaluation of all animals, the relevant measure according to Bijma [18] is "the correlation between true i.e. TBV and EBV in the candidates for selection, which is a property of a population, not of an individual". This "population accuracy" (we will use this term hereafter) is the correlation r u,û = cov u,û / var û var(u) across a series of individuals. Accordingly, bias is defined as the difference of means ū −ū and dispersion as the slope of the regression of u on û : cov u,û /var û . Indeed, in practice, proxies to these empirical measures are used in crossvalidation studies. In other words, accuracy measures the ability to rank individuals within the focal set of individuals, taking the possible relatedness within the sample into account [23,24], as well as the buildup of the Bulmer effect that reduces genetic variance and makes evaluation more difficult [18,25].
Note that the three quantities accuracy, bias, and dispersion are defined as scalars, i.e. and have distributions, i.e. over conceptual repetitions r u,û have themselves a mean and a variance.
We also use indicators of (self-)relationships and of genetic variances within the sample. If the relationship matrix across focal individuals is K , then we use diag(K) −K = 1 +F − 2f where F is the inbreeding coefficient and 2f is the relationship between individuals ( f can be understood as coancestry), and the bar operators imply averages, i.e. X is the average across elements of X . The statistic diag(K) −K was used by [26] to describe the decrease in genetic variance due to relationships in a related but unselected population. For selected populations, even of infinite size, there is a further decrease in genetic variance due to the Bulmer effect [18,27], and we will use σ 2 where k is the reduction due to selection and σ 2 u,∞ is the genetic variance at equilibrium in a population under selection. The equivalence between Henderson's [28] results for the decrease in genetic variance in a selected population and σ 2 u,∞ = (1 − k)σ 2 u was shown (in simplified settings) by [27,29].

Statistics to test the quality of evaluation methods in brief
Consider successive evaluations with "partial" and "whole" data ( û p and û w , respectively), which is based on the use of "old" ( p ) and "recent + old" ( w from "whole") phenotype data, respectively. Note that in the following, û p and û w have the same dimension and may be a subset of "focal" individuals (e.g. the young candidates for selection) or the number of animals in the entire dataset (i.e., in the relationship matrix). In general, the breeder is concerned with the population accuracy of candidates for selection, because higher population accuracy of selection candidates implies greater genetic progress. Typically, focal individuals have no phenotype (or offspring phenotyped) in p but have phenotype (or offspring phenotyped) in w , but this is not a requirement for the proposed method. Reverter et al. [14] described the amount of change that is expected in consecutive genetic evaluations of individuals as a function of their respective accuracies, and they proposed statistics to check biases in genetic evaluations. The proposed criteria were very beneficial because (1) they do not require knowledge of the TBV, only the EBV from successive evaluations, and (2) they do not require knowledge of adjustment factors to pre-correct phenotypes.
In general, assumptions are: Cov û w ,û p = Var û p , and Cov u −û p ,û p = Cov u −û w ,û w = 0 . Henderson [28] proved that Cov u,û = Var û even in the presence of selection, which when coupled with the results in [14] yields Cov û w ,û p = Var û p . Intuitively, this holds if "old" errors in prediction ( u −û p ) are uncorrelated with "new" information, which in turn holds if the model takes selection correctly into account. Another assumption, which will be shown later in this paper, is that the set of focal individuals is sufficiently large and "diverse" (for instance, there are several full-sib families and not just one). The derivations of Reverter et al. [14] referred to the individual case (e.g. r u,û ) and not to sets of individuals (e.g. r u,û ) that are used for cross-validation. We extend their results as shown below, which leads to the following main results.
1. The statistic µ wp =û p −û w , has an expected value of 0 if the evaluation is unbiased. 2. The regression of EBV obtained with "whole" ( w ) data on EBV estimated with "partial" ( p ) data b w,p = cov(û w ,û p) var(û p) has an expectation, E b w,p = 1 if there is no over/under dispersion. 3. The correlation of EBV based on partial and whole data, ρ p,w = cov(û p ,û w ) √ var(û w )var(ûp) , is a function of their respective accuracies ( acc ), with an expected value Legarra and Reverter Genet Sel Evol (2018) 50:53 E ρ w,p ≈ acc p acc w , where acc is the population accuracy (correlation between TBV and EBV across animals). 4. The covariance of EBV based on partial and whole data is a function of the squared accuracy (reliability) of the partial EBV, ρ 2 Cov w,p = cov(û w ,û p) 5. The slope of the regression of EBV based on partial on EBV based on whole data, b p,w = cov(û w ,û p) var(û w ) is, on expectation, a function of the respective accuracies E b p,w = acc 2 p acc 2 w that is, the expectation of the slope is proportional to the relative increase in average reliabilities from EBV based on partial to EBV based on whole data.

Proofs of the adequacy of the statistics
In the following, we prove that the statistics described above are related to bias, slope and accuracies. We make repeated use of the following results for biquadratic forms [30]: consider random vectors n J is the centering matrix [31]. Considering our problem, we make the hypothesis that the two genetic evaluations (e.g. males before ("partial") and after ("whole") progeny testing) have different means: Note that the meaning of the "expected mean of EBV û " is unclear under selection. For instance, the last generation is expected to have means higher than 0, but these means will differ for males (heavily selected) and females (less selected). We will assume that the focal individuals include sets of animals that are comparable, i.e. under repeated sampling they have the same average genetic level. For instance, if 1% of the elite females and 10% of the elite bulls are selected, offspring from these animals should have on average the same genetic level across conceptual repetitions of the breeding scheme; the actual animals that are selected will differ but the genetic gain will on average be the same. We also assume (as commonly done) that, because of selection, Var(u) = G = Kσ 2 u,∞ ( K is a relationship matrix) instead of the non-selection case, Var(u) = G = Kσ 2 u . This implies that Henderson's [28] description of decrease in variance due to the Bulmer effect is correct and that it can be summarized by a single parameter σ 2 u,∞ [18,25,27]. Following classical notation: Var(u) = G = Kσ 2 u and Var û − u = C uu , and the matrix of prediction error variances and covariances (PEV and PEC) can be described as [14,28].
where C uu p and C uu w are matrices of PEV and PEC for the partial and whole analysis, respectively. These expressions assume that genetic evaluation deals correctly with the decrease in genetic variance due to selection [28,32] in which case Var û p = Cov û p ,û w . From here, we derive expectations of several possible quadratic forms that are combined to produce estimators of bias, dispersion and accuracy. In principle, genetic evaluation does not need to be based on mixed models (statistics can be computed regardless of the procedure), but our results only hold if the variances and covariances of estimators and true values are as described above. Average inbreeding and relationships in K are also needed. Ideally, the evaluation is based on conditional means such that the properties described before hold. Precision of the estimators of accuracy and bias depends on the distributional properties of the EBV and TBV, which can be derived when assuming multivariate normality but we have not attempted to do so herein.

Averages of estimated breeding values to estimate bias
It is straightforward to show that E 1 ′û p n = µ p ,

Quadratic forms of estimated breeding values
For the method R of covariance estimation, it is rec- is included in the quadratic forms, especially in the presence of selection [33]: However, these weighted quadratic forms lead to estimators that are difficult to understand. Hence, in the following, we will use "unweighted" quadratic forms.
The quadratic form using not-centered û p , û w has expectation: where n is the number of individuals, 1 +F is the average self-relationship, PEV p = diag C uu p is the average prediction error variance and σ 2 u,∞ is the genetic variance. It is worth noting that the classical definition of individual accuracy is based on [20]. Thus, the expression above for E û ′ pû w is a function of individual expected average reliabilities acc 2 i , but also of means µ w , µ p .
To remove dependence of the quadratic form above on means, it makes sense to use centered û w and û p : n J is the centering matrix [31]. By its properties, S1 = 0 and S ′ S = S ′ = S , and therefore is a function of the average self-relationships 1 + F minus the average relationship between individuals, 2f , and PEV minus PEC. Inclusion of relationships between individuals results in the corresponding reduction in genetic variance due to inbreeding to be accounted for, i.e., if as usual F ≈ f , then 1 +F − 2f σ 2 u = 1 −F σ 2 u , which shows the decrease in genetic variance [26,34,35]. Similarly, PEV p − PEC p considers the fact that estimates of û are correlated across individuals (the so-called "co-reliabilities" [25]), showing that there is little value in having high individual accuracy if predictors are correlated across individuals.
The remaining quadratic forms needed for our developments are: In the remainder of this paper, we assume that the expectation of a ratio of quadratic forms is equal to the ratio of the expectations. The "Appendix" shows that this holds when the number of individuals included in the statistics is large (several hundred or more) or when they are not structured into very large sibships. Otherwise, as shown in the "Appendix", both the true regression coefficient b = cov û p , u /var û p and its estimator b = cov û p ,û w /var û p have an expectation less than 1, even when the model is perfect and the EBV have the right dispersion.

Quadratic forms of estimated and true breeding values
Here, we give an alternative definition of the population accuracy, i.e. the expected correlation of EBV and TBV in a sample, as a ratio of quadratic forms: Using this has expectation: The denominator 1 +F − 2f σ 2 u,∞ corresponds to the expected genetic variance in the focal population and takes the reduction in variance due to relationships 1 +F − 2f and selection ( σ 2 u,∞ ) into account. With all these elements, we can compute the expectation of the derived statistics, as done in the following.

Regression of EBV from whole data on EBV from partial data
is composed of two quadratic forms.
When assuming that the expectation of the ratio is equal to the ratio of the expectations, Note that this expectation involves PEV and off-diagonal PEC. Importantly, it must hold that Var û p = Cov û p ,û ′ w (as usually assumed).

Correlation of EBV from whole and EBV from partial data
This statistic is composed of three quadratic forms and assuming that the square root of the expectation is equal to the expectation of the root, it gives: Therefore, ρ w,p is a direct estimator of the increase in population accuracy of EBV from partial to whole data, acc p acc w .

Estimation of accuracy from the covariance of EBV based on whole and EBV based on partial data
We can get a direct estimator of accuracy (and not of ratios of accuracies) from cov û p ,û w = 1 n û p −ū p ′ û w −ū w , from which we can derive the statistic: u,∞ Legarra and Reverter Genet Sel Evol (2018) 50:53 with expectation acc 2 p as follows Thus, ρ 2 cov w,p is a direct estimate of squared population accuracy of EBV based on partial data, which we call ρ 2 cov w,p since it is an estimator of the squared accuracy (a squared correlation) based on the covariance between û p and û w . This statistic requires an estimator of σ 2 u,∞ that can be obtained by modelling the selection scheme [18] or be explicitly estimated [34].

Regression of EBV from partial data on EBV from whole data
with expectation: which is a function of squared population accuracies, i.e.
although ρ w,p and b p,w need not be equal for single realizations, i.e. for the analysis of one particular dataset.

Effect of over/underdispersion of breeding values on statistics
Statistics used to compute slopes and accuracies deal well with regular bias ( ū p � =ū w ) because the û p and û w are centered. However, overdispersion (inflation) of EBV is a frequent phenomenon [6]. To consider a simple case, assume that EBV based on partial and whole data are uniformly scaled by regression coefficients θ 2 p and θ 2 w , with θ 2 p > θ 2 w ≥ 1 (i.e., there is more overdispersion with less data or with old data), resulting in: yielding, e.g., The regression of EBV from whole on partial data, not equal to 1 but equal to the ratio of dispersions. Thus, a value of b w,p < 1 (as often observed for genomic predictions) may indicate overdispersion of EBV based on partial data but also underdispersion of EBV based on whole data. The reverse regression of EBV from partial on whole is a ratio of dispersions and reliabilities. Finally, the correlation has the following expected value: retrieving a ratio of accuracies. Thus, the statistic ρ w,p (correlation of "whole data" and "partial data" EBV) is an estimator of change in accuracy and is not affected by this very simplistic form of overdispersion. Note that equivalent biases result when θ 2 w > θ 2 p ≥ 1 , i.e. when there is more overdispersion with more data or with recent data. Thus, regression of EBV from whole on partial data informs about over/underdispersion, regression of EBV from partial on whole data can be interpreted as a function of accuracies, and the correlation of EBV from partial and whole data is useful as a ratio of accuracies.

Predictivity: correlation of EBV with precorrected data
A very common strategy in cross-validation tests is to compare predictions of EBV with precorrected phenotypes for the predicted individuals [9,10], i.e. using r y * new ,û p , where y * new is the precorrected "new" data available in the whole data. It is, however, not clear whether this is a valid estimator of accuracy and what the effect of precorrection is. Here we derive some results that show that the use of precorrected data can be problematic in some cases: many levels of the main environmental effect or wrong variance components.
Precorrected data are obtained with the whole dataset using where β is typically a BLUE estimator of fixed effects. In fact, Var y * = VPV for [36], which leads to: where C ββ w is the PEV of fixed effects in β obtained from analysis of the whole dataset. Now, we will consider only new data that are not in the partial dataset and assume for simplicity one record per individual. We further assume that the new data are only affected by a single fixed effect (say contemporary group), such that: The covariance of EBV with y * new can be obtained as follows: Because by orthogonality, Cov û p , Xβ = 0 ( [37] equation 5.28), and where Cov û p , u = G − C uu p , and Cov û p , e new = 0 , the latter because EBV based on partial data do not influence e new (again, assuming there is no effect of selection). Therefore, which yields the following expectations for n individuals: which is equal to 1 as expected, and the correlation is equal to: with expectation: Thus, the cross-validation correlation of EBV with precorrected phenotypes depends on population accuracy, heritability, and errors in estimates of fixed effects. If fixed effects are estimated with high precision C ββ w ≈ 0 and off-diagonals (both in relationships and in PEV) are negligible, then: If we divide the square of this by the population heritability where h 2 is heritability in the base population [9]. However, if there has been selection, using acc ≈ population accuracy because σ 2 u,∞ < σ 2 u and h 2 ∞ < h 2 . Using the "dairy" example in [18] h ∞ = 0.67 . The latter can, in turn, be translated to an "unselected accuracy" of 0.82 [18,27].
There is a second and not negligible source of bias due to C ββ w � = 0 . For a single fixed effect, matrix XC ββ w X ′ contains var β i (the variance of the estimate of the effect that affects the i-th record) on the diagonal and cov β i ,β j on off-diagonals (the covariance of the estimates of the effects that affect the i-th and j-th records). We will assume that covariances of estimates across levels of the fixed effect are negligible (this is not true if relatives are spread across fixed effects). For a balanced design with n records in y * new , n i = n/m records for each of the m levels of the fixed effect, and with records ordered within level, the structure of XC ββ w X ′ is:  50:53 where var β 1 = var β 2 = . . . = var β m = var β .
Also, we will assume that var β i = Thus, if there are several levels of the fixed effect, the estimate of the cross-validation accuracy will have an upward bias, which is greater for a smaller number of records per contemporary group. This bias is due to the assumption that the precorrection is perfect. For instance, for n = 500 and 25 contemporary groups, the bias is an extra 5% apparent accuracy. For n "large" and m "not small", bias is approximately 1 n i , i.e., inversely proportional to the size of the contemporary group, which does not disappear with increasing n.

Comparison with current Interbull validation procedures
The Interbull method [6] uses a simple regression that can be written as 2DYD = 1b 0 + b 1ûp + ǫ , where DYD are daughter yield deviations (computed with the whole dataset) and act as pseudo-data for bulls. Elements of ǫ are assumed to be independent across bulls with variance inversely proportional to the equivalent number of daughters (this can be viewed as DYD having different heritabilities across bulls). Thus, this setting is similar to the previous section on predictivity. The above proofs apply and the expected value of b 1 is 1, although, using â = DYD −b 1ūp does not yield a correct estimate of µ p − µ w , i.e. bias, unless b 1 = 1 . Also, the expected value of r 2 û p , DYD is acc 2 rel , where rel is the average reliability of the EBV of bulls based on progeny. Here, as in the analysis on predictivity, off-diagonals are ignored, which should not affect results if progeny numbers are large enough. .

Markers considered as "new" data: pedigree BLUP and (SS) GBLUP
The addition of marker genotypes to a pedigree-based BLUP genetic evaluation can also be viewed as having "more data", e.g. on a correlated trait [38,39]. Thus, a way to check the increase in accuracy from adding marker information (e.g. from BLUP to GBLUP) is to view the data with marker genotypes as "whole" and the data without markers as "partial". Using G to refer to EBV with markers and A to EBV without markers, this yields: i.e., the lower the correlation between genomic EBV and pedigree EBV, the higher the extra accuracy from genomic data. This assumes that Cov û G ,û A = Var û A , as assumed by [39], which sounds reasonable but has been formally proved only for a single marker that is fitted as a correlated trait [38]. The procedure above uses the same phenotypes for the evaluations with either G or A . An alternative procedure may be to compare the increase in accuracy from "partial" to "whole" in both approaches. In this case, to compare EBV from a genomic-based method (GBLUP or SSGBLUP) with EBV from a pedigree-based method (PBLUP), we suggest the following procedure: 1. Compute EBV with all data ("whole") using the method that is deemed to be optimal; we will assume that this is GBLUP. 2. Choose a cutoff date and create a partial dataset by setting phenotypes after cutoff date to missing; 3. Compute GEBV based on the partial data using GBLUP; 4. For "focal" individuals (i.e., the validation cohort): , and ρ GBLUP p,w that describe respectively bias, dispersion and accuracy of EBV from GBLUP; 5. Compute PEBV based on "partial data" and using PBLUP; 6. Compute statistics µ PBLUP p,w , b PBLUP w,p , and ρ PBLUP p,w that describe respectively bias, dispersion and accuracy of PEBV from PBLUP; 7. The statistic ρ PBLUP p ,GBLUP p quantifies the inverse of the relative increase in accuracy from PBLUP to GBLUP in the partial data; 8. The statistic ρ PBLUP w ,GBLUP w quantifies the inverse of the relative increase in accuracy from PBLUP to GBLUP in the whole data. Legarra and Reverter Genet Sel Evol (2018) 50:53 Data: example using beef cattle data

Animal population, genotypes and phenotypes
The statistics described above were tested in a reallife dataset. We used genetic and phenotypic resources (for details see Table 1) from Brahman cows (N = 995) and bulls (N = 1116) that have been widely described in the recent literature [40][41][42]. Yearling body weight (YWT) computed from the average of all body weights recorded between 300 and 420 days of age was used as the phenotype. The 2111 Brahman cattle were genotyped using either the Illumina BovineSNP50 BeadChip (Illumina Inc., San Diego, CA; [43]) or the BovineHD panel (Illumina Inc., San Diego, CA) that includes more than 770,000 single nucleotide polymorphisms (SNPs). Animals that were genotyped using the lower density array had their genotypes imputed to higher-density, as described previously [44]. The imputation was performed on 30 iterations of BEAGLE [45], using 519 individuals genotyped using the BovineHD chip as reference. After imputation, we retained genotypes on 729,068 SNPs, of which 651,253 were mapped to autosomal chromosomes and had a minor allele frequency (MAF) higher than 1% and were used to build the genomic relationship matrix (GRM) according to Method 1 in [46].

Procedure to generate partial datasets and cross-validation statistics
The data described above comprised the whole dataset. One thousand partial datasets were generated by setting a random 50% of records missing. It is worth noting that these animals are contemporaries (the resource population spans a few years and animals are not descendants from each other) and, therefore, there are no issues related to selection.
A simple breeding value mixed-model was used for the analysis of YWT with the fixed effects of contemporary group (combination of sex, year and location), age of dam at birth in year classes, and age at measurement as a covariate, and the random additive polygenic effects and residuals as random effects. Variance components estimates and BLUPs of breeding values were obtained using the Qxpak5 software [47]. All datasets were analyzed using both the pedigree-based numerator relationship matrix (NRM) and the SNP-based genomic relationship matrix (GRM). Table 2 lists the 16 statistics that were used to compare EBV from the whole and partial datasets. Note that in order to highlight the impact of the data partitioning, some of these statistics were computed separately for the individuals in the whole and the partial datasets, in the same context as 'reference' and 'validation' individuals, respectively. For instance, ρ v w,p is the correlation between the EBV obtained using the whole dataset and the EBV obtained using the partial dataset, but computed only by using the validation individuals that have missing phenotypes in the partial dataset, i.e. the random 50% individuals with omitted phenotypes in the 'partial' dataset. The EBV of those animals are predicted in the partial dataset using parent average (i.e. using pedigree) or using genomic information from relatives (i.e. using the GRM). In the whole dataset, they are predicted using own records. Table 3 provides summary metrics (mean, standard deviation, minimum and maximum) for the 16 statistics across the 1000 partial datasets obtained using the NRM and the GRM. The means are also presented in the bar diagram of Fig. 1. Notable changes from using NRM versus GRM were the 66.5% increase in the estimated heritability (from 0.260 to 0.433), the 21.4% increase in ρ v w,p (from 0.550 to 0.668) and the 4.1-fold increase in r y v ,û v (from 0.076 to 0.312). Figure 2 presents a heatmap of the correlation matrix among the 16 statistics obtained using the NRM and the GRM. The individual values are provided in Additional file 1: Tables S1 and S2. We observed a strong negative correlation ( r < −0.90 in all cases) between the heritability estimates and the regressions of EBV from whole on EBV from partial data (i.e. b w,p , b r w,p and b v w,p ). This is consistent with the expectation of over-and under-dispersion for regression values < 1.0 and > 1.0, respectively.

Results
One metric of interest is the correlation of EBV with precorrected phenotype (i.e., r y * new ,û p denoted here as r y v ,û v ) since this is one of the most frequent measures of accuracy in cross-validation studies. Quite encouraging is the high correlation observed between r y v ,û v and the regressions of EBV from whole on EBV from partial data (i.e. b w,p , b r w,p and b v w,p ), which ranged from 0.604 to 0.746, as well as the high positive correlation of r y v ,û v with the correlations between "whole' on "partial' (i.e. ρ w,p , ρ r w,p and ρ v w,p ), with a maximum correlation of 0.806 between r y v ,û v and ρ v w,p . These results illustrate that the proposed metrics, particularly ρ v w,p , are also estimators of the accuracy of EBV based on the partial (earlier) data (termed acc p in our algebraical derivations).
Striking is the novel finding of the strong negative correlation of r y r ,û r (where û r are "reference" animals in the training dataset) with r y v ,û v . The former is bound  Variance of the difference between whole and partial EBV computed within validation samples Table 3 Summary metrics (mean, standard deviation, minimum and maximum) for the 16 statistics across the 1000 partial datasets (each one setting a random 50% as missing phenotypes) and obtained using either the pedigree-based NRM or the SNP-based GRM   Table 2 for a description of the statistics and to Supplementary Tables 1 and 2 for the actual correlation values to be high since it reflects the prediction's goodness of fit when computed on the data that is used to build the prediction, and averaged to 0.849 and 0.898 when using the NRM and the GRM, respectively (Table 3). However, the negative correlation of r y r ,û r with r y v ,û v indicates that when the breeding value model is particularly good at fitting the reference ('old') data (reflected in part by a high heritability estimate), this strong fitting ability disappears when applied to the validation ('new') data, which seems to imply overfitting (by chance). Indeed, a very strong correlation ( r = 0.933) was observed between the estimate of heritability and r y r ,û r , and a moderately strong negative correlation (r = − 0.543) between the estimate of heritability and r y v ,û v ( Fig. 2 and Additional file 1: Table S2). Importantly, these problematic relationships were not observed with either ρ r w,p or ρ v w,p . Finally, we explored the changes in 'consecutive predictions' , which are represented here by the move from partial (old) to whole (new) data. We used the absolute difference between predictions (statistics d r w,p and d v w,p ) and the variance of the difference of predictions ( Vd r w,p and Vd v w,p ) and explored the relationships of these with the previous 12 statistics. Please note the strong negative correlation (r = − 0.838) between d v w,p and ρ v w,p . This contrasts with the not so strong correlation (r = − 0.548) between d v w,p and r y v ,û v (Fig. 3). Between two competing measures of accuracy, the measure that is more closely related to changes in predictions will be preferred. Based on this and our results, we conclude that ρ v w,p is better than r y v ,û v . Thompson [13] outlined and discussed methods for the statistical validation of genetic models for genetic evaluation [14,17,48]. He emphasized the need for the statistical models to be based on genetic considerations. Today, different genetic considerations (e.g. oligogenic vs polygenic models) may lead to different prediction models, in particular in the area of genomic selection. Thus, the question "which model is best?" is today more important than ever. In this work, we attempt to provide quantitative geneticists with a set of tools to make their own decisions. Why do animal breeders aim at having predictions that are unbiased in both senses, i.e. µ p − µ w = 0 and b w,p = 1 ? Practically, to avoid suboptimal "biased" decisions, e.g. choosing too few or too many, or simply the wrong set of, young animals. Theoretically, best predictors, defined as conditional expectations, have optimal selection properties [49,50], and therefore we should aim for models (not necessarily linear) that yield such best predictors. In practice, unbiasedness is a property that holds on expectation: for any real dataset, from one evaluation to the next, there will be small deviations; for instance, µ p − µ w may differ from 0 just because of small noises. However, it is important to ascertain if these deviations are large (and affect the practice of selection) or not.

Discussion
In selection, the expected genetic gain at the stage of selection is �G = 1 n Σ(EBV s ) − 1 m Σ(EBV c ) =ū s −ū c , i.e. the average EBV of the "s" selected animals minus the average EBV of the "c" animals candidates to selection. To avoid surprises (over-or under-estimation of Fig. 3 Scatter plot of the relationship of the absolute difference in EBV between the whole and partial datasets ( d v w,p ) with the correlation of the EBV based on the partial data with the adjusted phenotypes ( r y v ,û v ; left panel) and the correlation between EBV based on the whole and partial data ( ρ v w,p ; right panel) across the 1000 partial (random 50%) beef cattle datasets selected animals), we need ū s =ū s , i.e., the estimate of the mean and the true mean of breeding values should be the same for selected animals. For this to hold, we must avoid two kinds of systematic errors: bias (wrong estimate of genetic trend) and over-/under-dispersion, which is often incorrectly referred to as "bias" in animal breeding literature. If selection is by truncation on EBV, the true mean after selection, under multivariate normality, is µ T =ū s = 1 ′ u /n + irσ u , where 1 ′ u /n is the mean of all selection candidates and irσ u is the genetic gain. This genetic mean is (implicitly) predicted before selection as µ E = 1 ′û /n + iσû . For µ T = µ E to hold, we need an unbiasedness condition (i.e. ū =ū among all selection candidates) and a second condition that is σû = rσ u . The latter condition, however, only holds if cov u,û = var û , which amounts to the regression coefficient cov(u,û) var(û) to be 1. However, the equality cov u,û = var û holds under quite restrictive conditions [30,33]. In a frequentist context, Henderson [28,32] proved that selection can be ignored if the model is correct, selection is contained "in the data", and under the assumption of multivariate normality. In a Bayesian context, Sorensen et al. [34] proved that selection can be ignored if the evaluation model is correct. However, models are rarely correct, at most they are robust. In particular, the widely used animal model that includes unknown parent groups [51] is biased by construction, because genetic groups are due to genetic selection but fitted as fixed effects, which ignores established genetic theory [52].
It may be argued that for the results in [14] to hold (roughly, future errors in prediction are not correlated to current errors in prediction), future data does not need to depend on past data. This is, however, not the case if there is selection: unborn progeny of unselected animals do not yield data. In principle, models should consider selection correctly, if all information is included. A counterexample where, old data affect future errors of prediction is as follows. Consider EBV ( u p ) of a young bull based on one record of the dam, with var u p = h 4 4 , and a subsequent EBV based on n progeny records ( u w ) but not on maternal performance. Then, cov u p , u w = 1 8 h 4 2n n+ , which is not equal to var u p because the dam performance was not included in u w . If there is no selection, there is no problem, but this is rarely the case, and it is actually selection that creates bias due to an increase in the genetic level of the trait and a reduction in genetic variance.
Thus, we see the process of estimation of accuracy and bias of EBV by our proposed method LR as a double process. First, checking of the model in order to have a model that empirically has the "best" properties (estimation of bias); and then, estimation of its accuracy. We propose the following two-step praxis approach. First, to ascertain as best as possible that models are empirically unbiased using the statistics µ p − µ w and b w,p which should have values 0 and 1, respectively-perhaps using, if not all, many animals (as in the original paper of Reverter et al. [14]). Second, for all models that are empirically unbiased, accuracies can be compared based on the proposed statistics, which rely on unbiasedness.
Still, there is a problem in method LR, and in all methods that rely on linear regression of "predictands" (pseudo-TBV from accurate progeny testing, less accurate EBV or precorrected records) on "predictors" (typically EBV). As shown in the "Appendix", due to family structure and the not complete accuracy of EBV, the true value of the regression of TBV on EBV, i.e. the "true" b , has an expectation lower than 1, E(b) < 1 . Accordingly, regression of "whole data" EBV (or of precorrected data) on "partial data" may seem to indicate bias: E b < 1 because for the "true" b , E(b) < 1 . In other words, EBV may appear to be over-dispersed when they are actually not, which holds for method LR and for any other similar method such as "predictivity" or the Interbull tests, since it is a fundamental property of the crude regression of a vector of TBV u on a vector of EBV û . It seems relevant to assess, in practice, the extent of this inequality E(b) < 1 , since evaluations are often scaled such that b (actually its estimate) is equal to 1, which implies that EBV may be too much deflated after the scaling. However, we will not address these points here, since this should be the subject of a simulation study that goes far beyond this paper. The deviation of E(b) from 1 is important if the cohort, or focal group, is small and related, and it does not depend on the quality of the "predictand". Therefore, our recommendation is to use large cohorts for validation. This bias inherent to cross-validation analysis deserves further examination in future studies.
Fixing the models to observe constraints on estimated bias should be based on rigorous genetic or statistical arguments (i.e. re-estimating variance components and heritabilities), rather than quick fixing procedures such as multiplying by constants, manipulating relationships or changing hyper-parameters of prior distributions. For instance, [53] found empirically that equaling statistics of A 22 and G provided unbiased predictions, but this has a genetic interpretation of modelling selection and drift from the base to the genotyped population [54,55].
In the analysis of genetic trend for litter size in pigs, Sorensen et al. [48] also emphasized "forward" crossvalidation for model checking, using what we called in this paper "predictivity", instead of relying solely on model-based predictions. Recently, Putz et al. [56] tested by simulation several methods to validate accuracies by cross-validation. They reported poor performance of comparisons of (in our notation) û w and û p , without realizing that r(û w ,û p ) is not an estimator of accuracy but of ratios of accuracies. In addition, they did not simulate selection, in which case theoretical accuracy is equal to validation accuracies.
We have shown that precorrection of phenotypes using whole data may bias the result of predictivity. This is particularly relevant for small contemporary groups such as in dairy or beef cattle as opposed to, say, sheep or aquaculture species. Some measure of error in precorrection due to estimation of contemporary groups should be reported in cross-validation results.
Although the ranking of methods should be similar, estimates of population accuracies may be biased. Comparing û w and û p , as we propose in this work, might be a better option, although it involves more parametric assumptions.
One final consideration involves discussing the difference between population and individual accuracy. Quoting [18]: "For response to selection, the [population] accuracy should reflect the correlation between true and EBV in the candidates for selection, which is a property of a population, not of an individual. For the stability of EBV, the accuracy should reflect the standard error of an EBV, which can be defined for a single individual. " Our work deals with population accuracies, not with individual accuracies. The former are useful for model selection and for genetic gain; the latter are useful for individual decisions. The population accuracy is not a function of individual accuracies. For instance, consider full sibs that are evaluated by using parent average and for which their parents are known exactly: individual accuracy is 0.71. However, population accuracy is 0, since all full-sibs have exactly the same parent average. Thus, population accuracies involve both individual reliabilities and co-reliabilities [24,25].
The expression for the covariance of bilinear forms is ( C is the covariance matrix across x i ; p 58 [30]). Applied to our case, Cov x ′ 1 A 12 x 2 , x 3 A 34 x 4 = Cov û p Sû w ,û p Sû p , this yields (as Cov û p ,û w = Cov û p ,û p = K): The terms linked to the means disappear, as before, because they have the form µ1 ′ S ′ KS1µ which has a value of 0. The expression Cov(X, Y ) = 2 n 2 tr(SKSK) can be computed explicitly for ideal cases. A slightly more enlightening expression is, after algebra, twice the average variance within rows minus the variance of rows means of K = G − C uu p . Putting all together results in: This is always positive, which means that the b estimated as the linear regression of û w , but also the "true" b of the regression of true EBV u on û p , has an expectation less than 1, even when the model is correct, contrary to common assertions. The expectation of b is actually with value equal to 1 means that the estimators û p are deflated-too much regressed. This raises questions on the use of cross-validation to choose the best model for evaluation. The underestimation depends on the total number of individuals in the focal set, on their relationships (on G ) and the accuracies and co-reliabilities on the "partial" dataset (on C uu p ) but it does not depend on the final reliabilities on C uu w (which implies that the derivation applies for TBV). Inclusion of sibs increases systematic error. For instance, n = 100 with half-sibs of size 10 and information in "partial" evaluation equal to 1 observation with h 2 = 0. Setting n = 100 with families of five half-sibs results on E(b) = 0.96 . These systematic errors deserve further exploration (e.g. properties of the estimators for different accuracies and family structures)-but this is out of the scope of this paper.

Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.