An analytical framework to derive the expected precision of genomic selection

Background Formulae to predict the precision or accuracy of genomic estimated breeding values (GEBV) are important when modelling selection schemes. Simple versions of such formulae have been proposed in the past, based on a number of simplifying hypotheses, including absence of linkage disequilibrium and linkage between loci, a population made up of unrelated individuals, and that all genetic variability of the trait is explained by the genotyped loci. These formulae were based on approximations that were not always clear. The objective of this paper is to offer a unique framework to derive equations that predict the precision of GEBV from the size of the reference population and the heritability of and number of QTL controlling the quantitative trait. Results The exact formulation of the precision of GEBV involves the expectation of the inverse of a linear function of the genomic matrix, which cannot be calculated from simple algebra but can be approximated using a Taylor polynomial expansion. First order approximations performed better than the initial prediction equations published in the literature. Second order approximations produced almost perfect estimates of precision when compared to results obtained when simulating situations that agreed with the assumptions that were required to derive the precision equations. Using this proposed framework, we present several generalizations, including multi-trait genomic evaluation. Conclusions Although further improvements are needed to account for the complexity of practical situations, the equations proposed here can be used to derive the precision of GEBV when comparing breeding schemes a priori. Electronic supplementary material The online version of this article (10.1186/s12711-017-0366-6) contains supplementary material, which is available to authorized users.


Background
After the seminal work of Meuwissen et al. [1], who provided statistical methods to exploit linkage disequilibrium (LD) between genotyped marker loci and quantitative trait loci (QTL) in animal and plant breeding, as previously proposed by Lande and Thompson [2], genomic selection was launched, which has since revolutionised both research in quantitative and applied genetics and practical breeding plans. The benefits of this technology are considerable in dairy cattle (e.g. [3][4][5][6][7][8]) and dairy cattle breeders very rapidly changed their schemes in order to adopt genomic selection methods. Thus, it became possible to improve the reliability of estimated breeding values (EBV) at a young age, avoid costly and lengthy progeny tests, and limit the detrimental evolution of inbreeding. However, the application of genomic selection was not so clear in other breeding sectors, for various reasons: the high relative costs of genotyping (compared to the value of reproducers), the limited size of the (reference) populations required to calibrate the effects of single nucleotide polymorphisms (SNPs), and the fact that basic schemes were already organised with short generation intervals (e.g. [9][10][11][12][13]).
Mathematical models to describe and evaluate breeding plans can be useful to decide whether a breeding scheme based on genomic evaluations should be implemented or not e.g. [14]. These models are often based on stochastic simulations, in which the characteristics of single individuals (their genotypes at a number of SNPs, including QTL located across the genome, and their phenotypes for traits influenced by QTL) are generated, in order to produce data files that can be used as in "real life" (e.g. [15,16]). Alternatively, models that describe populations at a higher level (generations, cohorts, classes of reproducers defined by their role in the scheme) offer a more rapid and flexible alternative to evaluate alternate breeding programs. In such approaches, deterministic equations link population characteristics such as heritability, mean LD, replacement rates, and the number of genotyped individuals to expected genetic progress by unit of time. Some of the most important equations in these models are the formulae that predict the precision of genomic EBV (GEBV). Analyses of simulations and real data have clearly demonstrated that the precision of GEBV depends on the structure of the reference population and the characteristics of the marker set used. The size of this reference population, its diversity, the genetic distance between the reference population and the group of selection candidates genotyped, the number of markers, and the degree or strength of LD are the main factors that influence this precision [17][18][19][20][21][22][23][24][25][26][27][28][29].
A very simple formula to obtain the precision of GEBV was given by Daetwyler et al. [17], based on a number of simplifying hypotheses that included: absence of LD and linkage between loci, a population made up of unrelated individuals, and all genetic variation of the trait is explained by the genotyped loci. Under this approach, the regression of phenotypes on SNP genotypes was performed one locus at a time. This equation has been widely used and cited more than 100 times in the literature. Adjustments have since been proposed to deal with the distribution of marker allele frequencies [20], include dependence between marker loci through the definition of an effective number of independent loci [30], include the proportion of genetic variance explained by markers [22], and account for a smaller error variance when multiple marker loci are considered simultaneously [8]. Brard and Ricard [31] reviewed and challenged these formulae, using the results reported in 13 publications based on real data. They showed that the size of the reference population and the number of independent segments had a considerable impact on precision, and that the different formulae produced very different results. Other situations were explored by Hayes et al. [21] by considering dependence between the reference and candidate populations, by Wientjes et al. [32], who studied multi-population scenarios, and by Elsen [33], who suggested opportunities for the more systematic exploration of dependence between SNPs and between individuals.
In the present paper, using the simple situation that was initially studied by Daetwyler et al. [17], we propose a framework to derive equations that predict the precision of GEBV based on the size of the reference population, and the heritability of and number of QTL controlling the quantitative trait. We are interested in the expectation of the precision of GEBV, before implementing possible genotyping and selection schemes, as a tool for optimizing resources. With this prior approach, the variability summarized when computing the expectation of GEBV precision comes from marker locus polymorphisms as well as from QTL and environmental random effects. After demonstrating the performance of the solutions obtained, we explore extensions to more complex situations. Ten equations are successively proposed: (1) a general formulation of the expectation of the precision of GEBV; (2) the Daetwyler et al. [17] equation that assumes that the error variance is not modified after correction for SNP effects; (3) the Daetwyler et al. [17] equation that accounts for the corresponding reduction error variance; (4) an approximation of Eq.
(1) based on a Taylor series expansion; (5) and (9) applications of Eq. (4) to the first order, assuming that all SNPs contribute equally to the genetic variance (Eq. 5) or that their effects share the same prior variance (Eq. 9); (6) an extension of Eq. (5) to the second order; (7) and (8) applications of Eq. (6) when assuming that distribution of allele frequencies is uniform (Eq. 7) or U-shaped (Eq. 8); (10) an extension to the multivariate situation.

Proposed framework Notations and hypotheses
A list of abbreviations is in Table 1. Genomic predictions are based on a set of M biallelic SNPs, with alleles A k and B k at locus k, the frequency of allele B k being f k . All SNPs are assumed to be in linkage equilibrium. The reference population, which is considered to be a random subset of a larger population, is made up of N unrelated individuals, which are genotyped and phenotyped. We are interested in the precision of the GEBV of a selection candidate that is not related to individuals in the reference population, but belongs to a selection population that is another subset of the larger population. The GEBV is derived as a SNP best linear unbiased prediction (BLUP) based on SNP genotypes.
The random elements of the prediction model y = Xβ + e are as follows: y is a vector of phenotypes recorded in the reference population, assumed to be centred at zero. β is a vector of SNP effects and is randomly distributed with a mean of Note that based on this matrix, the β k effects are supposed to be uncorrelated. X is the genotype matrix defined by X ik = n ik − 2f k , where n ik ∊ {0, 1, 2} is the number of B k alleles carried by individual i at locus k. We assume that allele frequencies f k are known. The expectation of X ik is null, and its variance is σ 2 k = 2f k 1 − f k . Under linkage equilibrium between SNPs, the expectation of matrix X ′ X is All genetic variability is assumed to be explained by the SNPs. e is a vector of residuals with a mean of 0 and covari- g = wβ is the true genomic breeding value of the candidate to be predicted, with w the vector of SNP genotypes, defined as the rows in X. The variance of w is v(w) = F. Assuming all genetic variability is explained by the SNPs, For a given set of genotypes X, variance

Expected precision of GEBV
Four sources of variation underlie the correlation between genomic breeding values (g) and their prediction (ĝ): the SNP genotypes (X and w), their effects (β), and the environmental effects (e). Quite often, we are interested in the precision of GEBV, given the population genotypes (X and w), and the randomness arising from the variability of β and e, i.e. r 2 (X, , which is a function of matrices X and w. A priori, before genotyping, for instance when different SNP chip densities or reference population sizes are compared, the criterion of interest is . This is the situation explored in this paper. The denominator in the previous equation for the precision of the GEBV is the genetic variance in the selection population: The variances of SNP effects are not known but must be estimated (e.g. [1]). In our a priori estimation of the precision of the GEBV, simplifying assumptions are needed. Following VanRaden [19], all variances of SNP effects are assumed to be equal to σ 2 β and, thus, σ 2 g = σ 2 β k σ 2 k . Alternatively, following Wientjes et al. [32], all SNPs contribute equally to σ 2 e = σ 2 g , i.e. σ 2 k σ 2 β k = σ 2 g /M. This is the situation considered in the present paper. The ratio σ e 2 /σ g 2 will be denoted by λ.
Vector of residuals The numerator of the equation for preci- Since β and X are independent from w and E[w] = 0 Finally

Approximation of the precision of GEBV proposed by Daetwyler et al. [17]
In their derivation of the precision of GEBV, Daetwyler et al. [17] considered marker effects as both random and fixed effects. With our notations, they as a fixed effect estimator of β. In this con- However, when considering β as a random effect, they used cov β,β − β = 0, giving cov g, g = v g , , i.e. the inverse of the clas- . Assuming that SNPs are in linkage equilibrium, have uncorrelated effects, and independence between SNP effects and genotypes When the reference population size is suffi- Initially, Daetwyler et al. [17] assumed inconsistently that both σ 2 p = 1 ("assuming the phenotypic variance is 1") giving σ 2 g = h 1 0 and σ 2 e = 1 ("for the present, we shall conservatively take σ 2 e = 1"). Since the candidate and reference individuals belong to the same population, var w j = var x ij and v g = h 2 0 + M/N, which gives: (2) A correction was proposed to relax the approximation σ 2 e = 1, which resulted in an upward correction of r 2 . The idea was to replace σ 2 e = 1 by σ 2 , giving a quadratic equation in r 2 and An alternative derivation of Eq. (2) was proposed by Wientjes et al. [32]. The main idea was that, assuming all SNPs are independent, their effects can be estimated in single random effect models, with y = X k β k + e k for locus k, , and the precision of GEBV is

Another approach to calculate the precision of GEBV
In the following, we do not assume that σ Y 2 = σ e 2 or σ 2 Y = σ 2 e k , and a unique multi QTL random model (y = Xβ + e) is used to describe relationships between phenotype and genotype. As in Wientjes et al. [32], we assume that σ 2 k σ 2 β k = σ 2 g /M. In Eq. (1), the expectation of the inverse of matrix X ′ X + appears. This matrix can be broken down into diagonal D = E X ′ X + and non-diagonal elements E = X ′ X − E X ′ X . As in Goddard et al. [22] and Elsen [33], a Taylor series expansion for matrix E is used to find approximations: Because D −1 is not random and E X [E] = 0 , the second order approximation is the precision of GEBV can be approximated by:

First order approximation
Matrix D −1 is diagonal with terms 1 In the first order, r 2

Second order approximation
When the size of the reference population is limited, elements of matrix X ′ X differ from their expectations: nonzero non-diagonal terms are present even if the SNPs are in linkage equilibrium and diagonal elements diverge from the genetic variances. The second order approximation of Eq. (4) partly captures these deviations. In Eq. (4), matrices F and D −1 are both diagonal. Thus, we only need the diagonal of matrix P = D −1 ED −1 E when computing the traces. Additional file 1 shows that terms order approximation of the precision of GEBV is thus which, using δ k = σ k 2 (N + Mλ), simplifies to: This expression includes genetic variances σ 2 k , which in practice can be estimated from the genomic data available. A priori, when these variances are not available, we can approximate the last term by using E 1 A general situation is a uniform distribution of the frequencies between f min , the minimum minor allele frequency of genotyped SNPs (MAF), and 1 − f min (i.e. a probability density function f f k = A practical approximation of the expected precision of GEBV is thus:

Numerical comparison of estimates of the precision of GEBV
Equations (1), (2), (3), (5) and (7) were evaluated by simulating data corresponding to the hypotheses that underlie their development: unrelated reference and candidate individuals, SNPs in linkage equilibrium, GEBV from a SNP-based BLUP model, and all causal SNPs are included in the SNP panel. Heritability ranged from 0.1 to 0.7, genotypes were available for 500 or 1000 SNPs and the size of the reference population ranged from 1000 to 10,000.
The minimum MAF (f min ) was 0.025, 0.05, 0.075 or 0.10. One hundred replicates were simulated for each scenario, with allele frequencies generated for each. Computations were made in FORTRAN with the help of the Nag library [34].
The results obtained when M = 500, f min = 0.05 and h 0 2 = 0.3 are in Fig. 1. Table 2 shows the effects of heritability and f min on the results. They showed consistently that (i) the prediction proposed by Daetwyler et al. [17] is closer to the "true" precision given by Eq. (1) when the size of the reference population is limited, (ii) the first order approximation proposed in the present paper (Eq. (5)) improves when the size of this population increases, i.e. when the assumptions that underlie these equations are more realistic, and (iii) the second order approximations of Eqs. (3) and (7) were nearly perfect in all cases, with a slight advantage for those of Daetwyler et al. [17] in most cases.

Extensions
The framework that we propose is flexible to accommodate alternative situations without major problems, as illustrated in the following.

Distribution of allelic frequencies
In our a priori approach leading to Eq. . Figure 2 shows the values of these expectations for different minimum MAF. When the minimum MAF is higher than 5%, this correction factor E 1 σ 2 k is almost 2 and, in most cases, the expected precision of GEBV is close to: Goddard [20] also derived the precision of GEBV in this case of a U-shaped distribution of allele frequencies. In his formulation (formula (8) of his paper), the expectation of the precision of GEBV depends on the where M e is the effective number of SNPs genotyped. Replacing log (2N e ) by −log f min , as above, and assuming all SNPs are unlinked, resulting in M e = M, we compared the two approaches by using simulated data, as explained in the previous section. Results in Table 3 suggest that, in most cases, Goddard's formula  (7), when 500 SNPs are genotyped, minimum minor allele frequency is 0.05, and heritability is 0.3 underestimates the precision, while the two other formulations are close to the expected value obtained from simulation.

Distribution of SNP effects
In previous developments, it was assumed that each SNP had its own distribution of effects with a variance of σ 2 β k .
This was the condition assumed by Meuwissen et al. [1] when defining BayesA and BayesB Markov chain Monte Carlo approaches to genomic evaluation. This was justified in practice because the authors did not work at a single locus level but considered haplotypes of markers around each tested position, while theoretical justifications were given in the Bayesian LASSO context by Park and Cassela [35]. Alternatively, Meuwissen et al. [1] considered a unique variance σ 2 β under the GBLUP approach, which is also the case for the fitted SNPs in model BayesC π [36]. The assumption of an equal contribution of each SNP to the genetic variance is no longer valid and variance σ β 2 is linked to genetic variance by σ 2 g = k σ 2 k σ 2 β . In this case, approximations of Eq. (1) for the expected precision of GEBV can be obtained using the same approach as before, using a matrix Taylor series expansion. As shown in Additional file 2, the first order approximation is given by: . An illustration of the quality of this approximation is in Fig. 3. The quality of the first order approximation of Eq. (1), which always overestimates precision, increased with population size and heritability and appeared to be satisfactory when N ≥ 5000.

Multivariate prediction
A simple generalisation of the expected precision of GEBV can be obtained when retaining the previous assumption. A total of n c traits are recorded in the reference population and this information is used to predict the global genetic value of the candidate: γ = j a j g j = ag, where a is a vector of n c economic weights and g the column vector of n c genetic values, i.e. g j = wβ j . Vector β j is a vector of the M SNP effects on trait j (β ′ j = β j1 , . . . , β jM ). We assume that the vector of genotypes (w) is the same for all traits. All previous assumptions are retained: all SNPs have an effect on all traits, all SNPs have an equal contribution to genetic variance for each trait, and individuals are unrelated. SNP effects are distributed with specific prior variances of σ 2 β jk , with zero correlations between SNPs. It is also assumed that the effects of SNP k on traits j and j ′ are correlated, with a covariance of σ 2 The objective is to predict precision r 2 = cov(γ ,γ ) 2 v(γ )v(γ ) , where γ = aĝ, with ĝ being the vector of GEBV. Thus, we need the variance v ĝ of these GEBV, a n c × n c matrix. As detailed in Additional file 3, this variance is estimated at the first order using: which is an obvious generalisation of the equivalent equation in the single-trait situation, which led to Eq. (5).

Discussion
Using classical statistical theory, the expected precision of GEBV based on marker-based BLUP was derived simply. Numerical approximations, based on a matrix Taylor series expansion, were produced for simple situations. From simulations that were consistent with the assumptions corresponding to these situations, these approximations performed similarly to and often better than the formulae for precision of GEBV that were previously published. However, the framework developed here is simpler and enables direct generalisations.
The first order approximation proposed here (Eq. 5) differs from formula (2)   the error term variance is defined when a single SNP effect is estimated. In Eq. (2), it was assumed that this error term variance is the total phenotypic variance, because when estimating a unique SNP effect, all other SNP effects participate to the error term. Too much error is assumed with this approximation and the precision is under evaluated. Equation (5) behaves as if all other SNP effects were perfectly estimated, limiting the error term to the only non-genetic part. This gives an overestimation of the GEBV precision. The second order approximations try to correct for these under-or overestimations: Eq. (3) replaces the 1 − h 0 2 term of Eq. (5) by 1 − r 2 h 0 2 , which corrects for the non-perfect estimation of other SNPs effects. Equation (7) accounts for the lack of orthogonality between the SNPs.
Asymptotic behaviours of first order approximations are not the same: when all the observed variability has an (additive) genetic origin, i.e. when h 0 2 = 1, formula (2) simplifies to N N +M , while our Eq. (5) predicts a perfect precision of GEBV. This discrepancy disappears when correcting Eq. (2) for non-perfect estimation of other SNPS (Eq. 3). With this correction Eq. 3 predicts a perfect precision of GEBV when h 0 2 = 1. In spite of being algebraically very different, the second order approximations underlying the Eqs. (3) and (7) worked very similarly and produced results that were very close to those observed from simulations.
The hypotheses that underlie the equations derived here are strong and efforts should be made to overcome these constraints. First, it was assumed that all genetic variability is explained by the SNPs included in the evaluation. Although this is increasingly true as the size of SNP chips grows towards a full knowledge of genomes by resequencing and imputation, other polymorphisms, including copy number variations (CNV), may play a role and the genotype information obtained is still far from sufficient to fully explain genetic variability. It has been suggested [8,22] that the proportion b of the genetic variance explained by the markers should be taken into account through a reduction in the heritability (from h 2 to bh 2 ) in the equations used and, using path coefficient theory, through a regression of precision of GEBV by b (from r 2 to br 2 ). This is easily implemented in the equations provided in this paper.
A second central hypothesis was independence between SNPs. With the current sizes of SNP chips, which will be even larger in the future, close SNPs are in LD and cannot be considered to be independent. This dependence means that non diagonal terms of E X ′ X are non-null, with E X ′ X kl = 2N kl , where kl is the LD between SNPs k and l. Equations can be derived for this situation, based on principles similar to the theory given here, but they are cumbersome, e.g. [33]. The concept of effective independent chromosomal segments has been discussed [17] and formalised [20] as an alternative to the true number of markers. The idea is that the precision of the genomic prediction model "depends on the variation in the realised relationship between pairs of animals" [20]. Then, the effective number of loci is defined as the "number of independent loci that gives the same variance of realised relationships as obtained in the more realistic situation" [20]. Solutions have also been proposed to estimate the effective number of loci, from population genetics considerations [20,21,37] or from real data (e.g. [32]).
A third assumption was the absence of relationships between individuals in the reference population and between candidates in the reference population. A formalisation of situations where individuals are related was proposed [33] but only for the first order approximations of precision. Although relationships between reference individuals and candidates were accounted for by using this first order approximation, this was not the case for the structure of the reference population itself. Therefore, further efforts are needed, which is particularly important since it is clear that (1) genomic predictions of breeding values arise only partly from historical LD and increase in precision when individuals in the reference population and candidates are more closely related [26,[38][39][40], and (2) the structure of the reference population is a key factor in the precision on GEBV, e.g. [41].
The predicted variances of SNP effects calculated by Eq. (10) in the multivariate situation were obtained under strong assumptions. First, it was assumed that GEBV are computed using a multivariate approach that considers correlations between the effects of SNPs on different traits. However, in practice, GEBV are often computed using single-trait algorithms. In our formulation, this is equivalent to omitting the off-diagonal terms in matrices B kk and E when estimating the SNP effects β. In this case, the variance of those effects, and v ĝ , do not simplify to the equations derived in the case studied. A second important assumption was that all SNPs contributed equally to genetic covariances, as a direct extension of the single trait situation studied. The alternative assumption of unique (regardless of the SNPs) variances and covariances (e.g. σ 2 β jk = σ 2 β j ∀k) is also possible, as described in the previous section. Both these assumptions are, however, questionable, in particular because genetic correlations lower than 1 suggest that only a limited proportion of the SNPs (and underlying QTL) affect all traits. Extra prior information about the underlying genetic architecture of these correlations would be useful in this regard.
A few other assumptions are used in the current paper, including the additivity and i.i.d. of QTL effects, and the use of GBLUP. As long as the objective is to model and optimise breeding plans, then only relative values will be of interest and we assume that these assumptions are not critical for those comparisons.

Conclusions
The objective of this paper was to provide a clear framework to derive predictive equations to estimate the precision of GEBV. Such equations can generate results in a second and thus enable the optimisation of a breeding program design through intensive numerical exploration. Not the entire complexity of practical breeding programs was included in the simple formulae derived here and in previously published papers. The purpose was to support the a priori comparison of breeding schemes, rather than to evaluate actual breeding schemes. The exact formulation of precision involves the expectation of the inverse of a linear function of the genomic relationship matrix, which cannot be calculated from simple algebra but can be approximated by a Taylor series development, as was already suggested by Goddard [20]. Second order approximations produced nearly perfect estimates of this precision, when compared to the results obtained by simulating data that are in agreement with the assumptions required to obtain the equations to estimate the precision of GEBV. We proposed several generalisations for the estimates of precision for this initial case, including multi-trait evaluation. Other situations can also be derived within the framework presented here.