 Research Article
 Open Access
 Published:
Approximated prediction of genomic selection accuracy when reference and candidate populations are related
Genetics Selection Evolution volume 48, Article number: 18 (2016)
Abstract
Background
Genomic selection is still to be evaluated and optimized in many species. Mathematical modeling of selection schemes prior to their implementation is a classical and useful tool for that purpose. These models include formalization of a number of entities including the precision of the estimated breeding value. To model genomic selection schemes, equations that predict this reliability as a function of factors such as the size of the reference population, its diversity, its genetic distance from the group of selection candidates genotyped, number of markers and strength of linkage disequilibrium are needed. The present paper aims at exploring new approximations of this reliability.
Results
Two alternative approximations are proposed for the estimation of the reliability of genomic estimated breeding values (GEBV) in the case of nonindependence between candidate and reference populations. Both were derived from the Taylor series heuristic approach suggested by Goddard in 2009. A numerical exploration of their properties showed that the series were not equivalent in terms of convergence to the exact reliability, that the approximations may overestimate the precision of GEBV and that they converged towards their theoretical expectations. Formulae derived for these approximations were simple to handle in the case of independent markers. A few parameters that describe the markers’ genotypic variability (allele frequencies, linkage disequilibrium) can be estimated from genomic data corresponding to the population of interest or after making assumptions about their distribution. When markers are not in linkage equilibrium, replacing the real number of markers and QTL by the “effective number of independent loci”, as proposed earlier is a practical solution. In this paper, we considered an alternative, i.e. an “equivalent number of independent loci” which would give a GEBV reliability for unrelated individuals by considering a subset of independent markers that is identical to the reliability obtained by considering the full set of markers.
Conclusions
This paper is a further step towards the development of deterministic models that describe breeding plans based on the use of genomic information. Such deterministic models carry low computational burden, which allows design optimization through intensive numerical exploration.
Background
The effectiveness of genomic selection comes from the possibility of predicting breeding values on unphenotyped and young animals [1]. Genomic selection promised and proved to be extremely efficient and beneficial for dairy cattle (e.g. [2–7]), but debate continues for other species and production sectors (e.g. [8–12]). A key criterion to decide whether or not selection schemes (also referred to here as breeding plans) should include genomic information is the reliability of the genomic predictor. It was clearly shown that this reliability depends on the structure of the reference population and on the characteristics of the marker set used. The size of this reference population, its diversity, the genetic distance between the reference and the group of selection candidates genotyped, the number of markers, and the degree or strength of the linkage disequilibrium are the main factors that influence this reliability [13–23].
An extensive literature exists on the mathematical modeling of selection schemes prior to their implementation, in order, for instance, to optimize their design, or to evaluate the usefulness of new technologies such as embryo transfer, sperm selection, DNA markers and others (e.g. [24] for a review). These models account for factors such as selection intensities and maintenance or loss of genetic variability. Among these parameters, the precision of breeding value estimates is central. To model genomic selection schemes, equations that predict this reliability as a function of the factors cited above are needed (e.g. [6, 25, 26]).
The quantitative influence of these factors (size of the reference population, its diversity, etc.) was assessed by simulation studies [18–21, 27, 28]. An equation that predicts the reliability of genomic evaluation in the very simple situation of independent quantitative trait loci (QTL), that are perfectly marked by single nucleotide polymorphisms (SNPs) and populations (reference and candidates) of unrelated individuals was derived [13]. This approach was extended to the case when only a part of the genetic variability is imperfectly marked by SNPs [16, 19], and the situation of nonindependence between reference and candidate populations was explored [17]. It was demonstrated that genomic information captures historical linkage disequilibrium, shortterm linkage between QTL and markers and additive relationships between reference and candidate individuals, the equation of the reliability accounting for these three phenomena being derived in a very simple case of one QTL marked by a single SNP [22].
A Taylor expansion of a matrix inverse involved in the reliability formula was suggested [18], which led to the algebraic development of an approximation. This approximation seems to work well in the simple situation but lacks generality. In this paper, an alternative approximation is proposed, opening a way to include nonindependence between reference and candidate populations, and between markers.
After a formalization of the genomic selection context, the principles that underlie these approximations are presented and their properties are compared by using a simple example. Then, the new approximation is derived when reference and candidate animals are related. This is illustrated by some numerical examples. Finally, the extension to the linkage disequilibrium situation is described.
Methods
General framework
Although the prediction equations derived below were based on a number of simplifying assumptions, it is important to first draw a complete description of the biological framework, as a basis to subsequently simplify the discussion.
The SNP effects are estimated in a reference population, P_{r}, comprising n _{ r } individuals. The genomic estimated breeding values (GEBV) are calculated for a population of candidates for selection and used in breeding, P_{c}, comprising n _{ c } individuals.
Let \( {\mathbf{\mathcal{P}}} = \left( {{\mathbf{\mathcal{P}}}_{r} ,{\mathbf{\mathcal{P}}}_{c} } \right) \) the population structure (including pedigree relationships between individuals and marker allele frequencies, but not including genotypes and phenotypes).
Individuals are characterized by their genotypes at n _{ M } markers (observed) and at n _{ Q } QTL (unknown). Alleles will be noted A _{ m } and B _{ m } for the marker m and A _{ q } and B _{ q } for the QTL q. Let a _{ tim } ∊ {0, 1, 2} and a _{ tiq } ∊ {0, 1, 2} be the numbers of B _{ m } (and respectively, B _{ q }) alleles that an individual i from population P_{t} (P_{r} or P_{c}) carries at marker m (respectively, QTL q). Let p _{ tm } and p _{ tq } be the frequencies of alleles B _{ m } and B _{ q } in P_{t}.
Genotypic values will be assigned to the different markers and QTL genotypes. Following [18], genotypes will be coded as x _{ tim } = a _{ tim } − 2p _{ tm } and w _{ tiq } = a _{ tiq } − 2p _{ tq }. Different codifications can be proposed [15]. In particular, as described for instance in [29], genotypic values may be standardized, i.e. x _{ tim } = (a _{ tim } − 2p _{ tm })/σ _{ tm } and w _{ tiq } = (a _{ tiq } −2p _{ tq })/σ _{ tq }, with variances σ ^{2}_{ tm } = 2p _{ tm }(1 − p _{ tm }) and σ ^{2}_{ tq } = 2p _{ tq }(1 − p _{ tq }). Most of the following developments are given with the first codification here, and the results with the second codification are described in a specific section.
These genotypic values are assembled in matrices X (dim (X) = (n _{ r } + n _{ c }) × n _{ M }) and W (dim (W) = (n _{ r } + n _{ c }) × n _{ Q }). Submatrices corresponding to subpopulations will be noted in the following way: X ^{′} = (X ^{′}_{ r } , X ^{′}_{ c } ) and W ^{′} = (W ^{′}_{ r } , W ^{′}_{ c } ).
The genetic model assumes additivity of QTL effects. The additive genetic value of an individual is described as \( g_{ti} = \sum\nolimits_{q = 1}^{{n_{Q} }} {w_{tiq} \alpha_{q} } \) and, in general, \( {\mathbf{ g = W\alpha }} \). The phenotypic values when observed are y = g + ɛ.
A statistical model describes the performances in the reference population as random variables for which the expectations are linear combinations of SNP effects: \( y_{ri} = \sum\nolimits_{m = 1}^{{n_{S} }} {X_{rim} \beta_{m} + e_{ri} } \) and, in general, y = Xβ + e.
In these models, the SNP (or QTL) effects may be considered as fixed, or random. Since the number of SNPs is much bigger than the number of individuals (n _{ M } ≫ n _{ r }) the second solution is generally chosen in the statistical model (but not always see [1, 13]).
In the random model, a distribution \( {\mathcal{L}}\left( {{\varvec{\uptheta}}_{{\upbeta }} ,{\mathbf{V}}_{{\upbeta }} } \right) \) (respectively \( {\mathcal{L}}\left( {{\varvec{\uptheta}}_{\upalpha} ,{\mathbf{V}}_{\upalpha} } \right) \)) of the SNP (respectively QTL) effects is assumed, with θ _{ β } (respectively, θ _{α}) being the vector of expectations and V _{β} (respectively, V _{α}) being the matrix of variances. For a full description of the variability, the V _{β} and V _{α} matrices are each subdivided into four blocks corresponding to the reference and candidate populations and their covariances. Covariances between the α and β vectors have also to be considered. Most generally, the SNP (QTL) effects are supposed i.i.d. giving V _{β} = Iσ ^{2}_{β} (V _{α} = Iσ ^{2}_{α} ). The interpretation of these QTL effects is nicely debated in Gianola et al. [30]. In the frequentist view, we simply have to imagine that QTL effects are randomly sampled from a distribution with a σ ^{2}_{α} variance. In the Bayesian context, the prior variability of the SNP effects was most generally described as heteroskedastic or even coming from mixtures of SNPs with or without an effect on the trait.
The expectations \( {\varvec{\uptheta}}_{{\upbeta }} \left( {{\varvec{\uptheta}}_{{\upalpha }} } \right) \) are generally assumed equal to zero, but when information about population history is available (in particular, when we know it is a mixed population), nonzero values should be considered.
The vector q = Xβ is a quantity similar but not equal to the genetic value g. Its element q _{ ti } is the molecular score of individual i in population t. This vector may be segmented in two parts: \( {\mathbf{q^{\prime}}} = \left( {{\mathbf{q^{\prime}_{{{r}}}}} ,{\mathbf{q^{\prime}_{{\mathbf{c}}} }}} \right) \).
Since the variances may be defined within a population, we have \( v\left( {{\mathbf{q}}_{{\mathbf{r}}} {\mathbf{X}}} \right) = {\mathbf{X}}_{{\mathbf{r}}} {\mathbf{V}}_{{{\bf\upbeta r}}} {\mathbf{X}}^{\prime}_{{\mathbf{r}}} \), and v(q _{ c }X) = X _{ c } V _{βc} X ^{′}_{ c } . The residual variance is v(e) = I σ ^{2}_{ e } . Assuming that the distribution of marker effects is centered (\( {\varvec{\uptheta}}_{{\varvec{\upbeta}}} = {\bf 0} \)) and i.i.d. (\( {\mathbf{V}}_{{{\upbeta r}}} = {\mathbf{I}}\sigma_{{{\upbeta r}}}^{2} \, {\text{and}} \,{\mathbf{V}}_{{{\upbeta c}}} = {\mathbf{I}}\sigma_{{{\upbeta c}}}^{2} \)), and extending Gianola et al. [30], we have \( v\left( {q_{ri} } \right) = {\text{E}}_{{\mathbf{X}}} \left[ {v\left( {q_{\text{ri}} {\mathbf{X}}} \right)} \right] = \sigma_{\beta r}^{2} \sum\nolimits_{m} {2p_{mr} \left( {1  p_{mr} } \right)} = \sigma_{\beta r}^{2} \sum\nolimits_{m} {\sigma_{mr}^{2} = \sigma_{\beta r}^{2} \tau_{r} } \) in the reference population, and v(q _{ ci }) = σ ^{2}_{βc} τ _{ c } in the candidate population. Assuming that the distribution of the marker effects and genotypes are the same in P_{r} and P_{c}, i.e. \( p_{rm} = p_{cm} = p_{m} , p_{rq} = p_{cq} = p_{q} \), thus τ _{ r } = τ _{ C } = τ and σ ^{2}_{ βr } = σ ^{2}_{ βc } = σ ^{2}_{ β } , we define σ ^{2}_{ q } = τσ ^{2}_{β} . Thus, \( v\left( {{\mathbf{q}}{\mathbf{X}}} \right) = \frac{1}{\tau }{\mathbf{XX^{\prime}}}\sigma_{{q}}^{2} \). These equations hold even if the markers are in linkage disequilibrium (LD) as shown in Eq. A2 from Gianola et al. [30].
We note σ ^{2} as the total phenotypic variance, i.e. σ ^{2} = σ ^{2}_{ q } + σ ^{2}_{ e } , and ν ^{2} as the proportion of this variance explained by the molecular score \( \left( {\nu^{2} = \frac{{\sigma_{q}^{2} }}{{\sigma^{2} }}} \right) \). The ratio \( \frac{{\sigma_{q}^{2} }}{{\sigma_{e}^{2} }} \) will be noted γ.
The SNP effects β may be estimated in different ways. The genomic best linear unbiased prediction (BLUP) will only be considered here, with \( {\hat{\varvec{\upbeta} }} = {\text{cov}}\left( {{\varvec{\upbeta}},{\mathbf{y}}} \right){\text{var}}\left( {\mathbf{y}} \right)^{  1} {\mathbf{y}} \). Classically, this equation becomes \( {\hat{\varvec{\upbeta }}} = \sigma_{{{\upbeta }}}^{2} {\mathbf{X}}^{\prime}_{{\mathbf{r}}} \left[ {\sigma_{{{\upbeta }}}^{2} \left( {{\mathbf{X}}_{{\mathbf{r}}} {\mathbf{X}}^{\prime}_{{\mathbf{r}}} + {\mathbf{I}}{{\uplambda }}_{{{\upbeta }}} } \right)} \right]^{  1} {\mathbf{y}} = \left( {{\mathbf{X}}^{\prime}_{{\mathbf{r}}} {\mathbf{X}}_{{\mathbf{r}}} + {\mathbf{I}}{{\uplambda }}_{{{\upbeta }}} } \right)^{  1} {\mathbf{X}}^{\prime}_{{\mathbf{r}} }{\mathbf{y}} \) with λ_{β} = σ ^{2}_{e} /σ ^{2}_{β} . The linear combination \( \hat{\varvec{q}}_{\varvec{c}} = {\mathbf{X}}_{\varvec{c}} {\hat{\varvec{\upbeta}}} \) is the GBLUP vector for candidates in P_{c}. It must be emphasized that these estimations and predictions are conditional on the genotypic structures defined by X (X _{ r } and \( {\mathbf{X}}_{\varvec{c}} \)).
Given X, the reliability of the GBLUP is \( r^{2} \left( {g_{ci} ,\hat{q}_{ci} {\mathbf{X}}} \right) = \frac{{cov^{2} \left( {g_{ci} ,\hat{q}_{ci} {\mathbf{X}}} \right)}}{{v\left( {g_{ci} {\mathbf{X}}} \right)v\left( {\hat{q}_{ci} {\mathbf{X}}} \right)}} \).
In [16], the reliability is described (Eq. 6 in [16]) as \( r\left( {g_{ci} ,\hat{q}_{ci} } \right) = r\left( {g_{ci} ,q_{ci} } \right) \times r\left( {q_{ci} ,\hat{q}_{ci} } \right) \), by ignoring the conditioning on X. In Goddard et al. [18], the reliability is described as \( r_{{g_{ci} ,\hat{q}_{ci} }}^{2} = \frac{{v\left( {\hat{q}_{ci} } \right)}}{{v\left( {g_{ci} } \right)}} = \frac{{v\left( {q_{ci} } \right)}}{{v\left( {g_{ci} } \right)}}\frac{{v\left( {\hat{q}_{ci} } \right)}}{{v\left( {q_{ci} } \right)}} \). In this formulation, \( \frac{{v\left( {q_{ci} } \right)}}{{v\left( {g_{ci} } \right)}} \) is the proportion of the genetic variance explained by the markers and \( \frac{{v\left( {\hat{q}_{ci} } \right)}}{{v\left( {q_{ci} } \right)}} \) is the accuracy of estimated marker effects. This is similar to the \( {\text{qr}}_{{\hat{Q}}} \) reported by Dekkers et al. [25]. All these reliability formulae are approximations since \( cov^{2} \left( {g_{ci} ,\hat{q}_{ci} } \right) = cov^{2} \left( {\sum w_{ciq} \alpha_{q} ,\sum x_{cis} \hat{\beta }_{s} } \right) \ne v\left( {\hat{q}_{ci} } \right) = v\left( {\sum x_{cis} \hat{\beta }_{s} } \right) \), in general.
Situation analyzed in this paper
In the following, ignoring the difficulty that was mentioned above, we will assume \( r^{2} \left( {q_{\text{ci}} ,\hat{q}_{ci} {\mathbf{X}}} \right) = \frac{{cov^{2} \left( {q_{ci} ,\hat{q}_{ci} {\mathbf{X}}} \right)}}{{v\left( {q_{ci} {\mathbf{X}}} \right)v\left( {\hat{q}_{ci} {\mathbf{X}}} \right)}} = \frac{{v\left( {\hat{q}_{ci} {\mathbf{X}}} \right)}}{{v\left( {q_{ci} {\mathbf{X}} } \right)}} \). We are interested in a single candidate in \( \varvec{P}_{\varvec{c}} \) with a x _{ c } vector of marker genotypes.
Formulae were simplified in two ways. (1) the i index of the candidate was omitted in the following developments: the genetic value of the candidate is noted q _{ c }, estimated by \( \hat{q}_{c} = cov\left( {q_{c} ,{\mathbf{y}}} \right)v\left( {\mathbf{y}} \right)^{  1} {\mathbf{y}} \), and its precision is \( r^{2} \left( {q_{\text{c}} ,\hat{q}_{c} {\mathbf{X}}} \right) = \frac{{{\text{v}}(\hat{q}_{c} {\mathbf{X}})}}{{{\text{v}}\left( {q_{\text{c}} {\mathbf{X}}} \right)}} \), with \( v\left( {q_{\text{c}} {\mathbf{X}}} \right) = \sigma_{{{\upbeta }}}^{2} {\mathbf{x}}_{{\mathbf{c}}} {\mathbf{x}}^{\prime}_{{\mathbf{c}}} \) and \( v\left( {\hat{q}_{c} {\mathbf{X}}} \right) = \sigma_{{{\upbeta }}}^{2} {\mathbf{x}}_{{\mathbf{c}}} {\mathbf{X}}^{\prime}_{{\mathbf{r}}} \left( {{\mathbf{X}}_{{\mathbf{r}}} {\mathbf{X}}^{\prime}_{{\mathbf{r}}} + {\mathbf{I}}{{\uplambda }}_{{{\upbeta }}} } \right)^{  1} {\mathbf{X}}_{{\mathbf{r}}} {\mathbf{x}}^{\prime}_{{\mathbf{c}}} \) (where \( {\mathbf{x}}_{{\mathbf{c}}} \) is a row vector); (2) the r indices of reference individuals were most often omitted, which resulted in y _{ i } for their phenotypes and q _{ i } for their molecular scores.
In fact, our objective was to estimate the expectation of this precision across the variation domain of X _{ r } and x _{ c } given the pedigree structure \( \left( {{\mathbf{\mathcal{P}}}_{\varvec{r}} ,{\mathbf{\mathcal{P}}}_{\varvec{c}} } \right):\;{\text{E}}_{{\mathbf{X}}} \left[ {r^{2} \left( {q_{\text{c}} ,\hat{q}_{c} {\mathbf{X}}} \right){\mathbf{\mathcal{P}}}} \right] \). It will be noted \( E\left[ {r_{{q_{\text{c}} ,\hat{q}_{\text{c}} }}^{2} } \right] \).
The following approximation was made: \( {\text{E}}\left[ {r_{{q_{\text{c}} ,\hat{q}_{\text{c}} }}^{2} } \right] = \frac{{{\text{E}}_{{\mathbf{X}}} \left[ {{\text{v}}(\hat{q}_{c} {\mathbf{X}})} \right]}}{{{\text{E}}_{{\mathbf{X}}} \left[ {{\text{v}}\left( {q_{\text{c}} {\mathbf{X}}} \right)} \right]}} = \frac{{{\text{E}}\left[ {{\text{v}}\left( {\hat{q}_{c} } \right)} \right]}}{{{\text{E}}\left[ {{\text{v}}\left( {q_{\text{c}} } \right)} \right]}} \).
Let A be the pedigree relationship matrix between individuals in \( {\mathbf{\mathcal{P}}} \). Its blocks are \( {\mathbf{A}} = \left( {\begin{array}{*{20}c} {{\text{a}}_{\text{cc}} } & {{\mathbf{A}}_{{{\mathbf{cr}}}} } \\ {{\mathbf{A}}_{{{\mathbf{rc}}}} } & {{\mathbf{A}}_{{{\mathbf{rr}}}} } \\ \end{array} } \right) \). Let \( {\mathbf{G}}^{ *} = {\mathbf{XX^{\prime}}} = \left( {\begin{array}{*{20}c} {{\mathbf{x}}_{{\mathbf{c}}} {\mathbf{x}}^{\prime}} & {{\mathbf{x}}_{{\mathbf{c}}} {\mathbf{X}}^{\prime}_{{\mathbf{r}}} } \\ {{\mathbf{X}}_{{\mathbf{r}}} {\mathbf{x}}^{\prime}_{{\mathbf{c}}} } & {{\mathbf{X}}_{{\mathbf{r}}} {\mathbf{X}}^{\prime}_{{\mathbf{r}}} } \\ \end{array} } \right) \), which results in \( {\mathbf{V}} = \frac{1}{\tau }{\mathbf{G}}^{ *} \sigma_{q}^{2} + {\mathbf{I}}\sigma_{e}^{2} \). It must be noted that the σ ^{2}_{ e } term in the diagonal of the V submatrix corresponding to the candidate population is artificial since candidates are not phenotyped.
We have E[G*] = A τ. The limits of this equality will be discussed below. As indicated above, the denominator of the expected reliability \( {\text{E}}_{{\mathbf{X}}} \left[ {v\left( {q_{\text{c}} {\mathbf{X}}} \right)} \right] \), is τσ ^{2}_{β} = σ ^{2}_{ q } . Approximating \( {\text{E}}\left[ {v\left( {\hat{q}_{c} } \right)} \right] \) by E[cov(q _{ c }, y)]E[v(y)]^{−1} E[cov(y, q _{ c })] is useless because it makes an oversimplification of the relationships between the reference and the candidate population: it considers separately the marginal distributions of x _{ c } X ^{′}_{ r } and (X _{ r } X ^{′}_{ r } + Iλ_{β})^{−1}, while these random matrices are correlated. Estimating directly E[cov(q _{ c }, y)v(y)^{−1} cov(y, q _{ c })] seems impossible in the general case. The approach of Goddard et al. [18] avoids this difficulty, i.e. the variance \( v\left( {\hat{q}_{c} {\mathbf{X}}} \right) = \sigma_{{{\upbeta }}}^{2} {\mathbf{x}}_{{\mathbf{c}}} {\mathbf{x}}^{\prime}_{{\mathbf{c}}} + \sigma_{e}^{2}  \frac{1}{{\left\{ {{\mathbf{V}}^{  1} } \right\}_{{\varvec{cc}}} }} \), and V ^{−1} is approximated by a second degree Taylor expansion (\( {\mathbf{V}}^{  1} \sim {\varvec{\Lambda}}\left( {\mathbf{X}} \right) \)), giving \( v\left( {\hat{q}_{c} {\mathbf{X}}} \right)\sim \sigma_{{{\upbeta }}}^{2} {\mathbf{x}}_{{\mathbf{c}}} {\mathbf{x}}^{\prime}_{{\mathbf{c}}} + \sigma_{e}^{2}  \frac{1}{{{\varvec{\Lambda}}_{\text{cc}} \left( {{\mathbf{x}}_{{\mathbf{c}}} ,{\mathbf{X}}_{{\mathbf{r}}} } \right)}} \).
Alternative approximations of the reliability
Extension of Goddard’s formula
In their “heuristic approximation for V ^{*−1}”, Goddard et al. [18] considered the situation where unrelated individuals are included in the reference and candidate populations, that is E[G ^{*}] = I τ and \( {\mathbf{G}}^{ *} = {\mathbf{I}}\tau + {\mathbf{E}} \), with E, a “noise” matrix centered on the null matrix \( {\bf 0}. \) A direct extension of their development would be the following. The matrix \( {\mathbf{V}} = \frac{1}{\tau }{\mathbf{G}}^{ *} \sigma_{q}^{2} + {\mathbf{I}}\sigma_{e}^{2} \) can be written as:
V = σ ^{2}_{ e } (I + A γ)[I + D γ],
with \( {\mathbf{D}} = \left( {{\mathbf{I}} + {\mathbf{A}}\gamma } \right)^{  1} \left( {\frac{1}{\tau }{\mathbf{G}}^{ *}  {\mathbf{A}}} \right) = {\mathbf{T}}\left( {\frac{1}{\tau }{\mathbf{G}}^{ *}  {\mathbf{A}}} \right) \),
and \( \gamma = \frac{{\sigma_{q}^{2} }}{{\sigma_{e}^{2} }} \). Thus, \( {\mathbf{V}}^{  1} = \frac{1}{{\sigma_{e}^{2} }}\left[ {{\mathbf{I}} + {\mathbf{D}}\gamma } \right]^{  1} {\mathbf{T}} \). The inverse matrix [I + D γ]^{−1} will be approximated using a Taylor series. It must be emphasized that the Taylor series I − D γ + (D γ)^{2} − (D γ)^{3} + ··· converges towards [I + D γ]^{−1} only if the highest Eigen value of D γ is smaller than 1, i.e. if \( \left( {{\mathbf{D}}\gamma } \right)^{\text{t}} \to {\bf 0} \) when t → ∞.
The second order approximation of V ^{−1} is equal to \( \frac{1}{{\sigma_{e}^{2} }}\left( {{\mathbf{I}}  {\mathbf{D}}\gamma + {\mathbf{D}}^{2} \gamma^{2} } \right){\mathbf{T}} \). As E[D] = 0 and \( {\text{E}}\left[ {{\mathbf{D}}^{2} } \right] = {\mathbf{T}}\left( {\frac{1}{{\tau^{2} }}{\text{E}}\left[ {{\mathbf{G}}^{ *} {\mathbf{TG}}^{ *} } \right]  {\mathbf{ATA}} } \right), \) its expectation \( {\text{E}}\left[ {\varvec{\Lambda}} \right] = \frac{1}{{\sigma_{e}^{2} }}\left( {{\mathbf{I}}  {\text{E}}\left[ {\mathbf{D}} \right]\gamma + {\text{E}}\left[ {{\mathbf{D}}^{2} } \right]\gamma^{2} } \right){\mathbf{T}} \) i.e. \( {\text{E}}\left[ {\varvec{\Lambda}} \right] = \frac{1}{{\sigma_{e}^{2} }}\left( {{\mathbf{I}}  \gamma^{2} {\mathbf{TATA}} + \frac{{\gamma^{2} }}{{\tau^{2} }}{\text{E}}\left[ {{\mathbf{TG}}^{ *} {\mathbf{TG}}^{ *} } \right]} \right){\mathbf{T}} \).
Finally, the reliability of the candidate GBLUP is approximated by:
A difficulty with this approximation comes from the T term. As an example, consider a reference population composed of n _{ r } halfsibs of the candidate, \( {\mathbf{T}} = {{\upxi }}{\mathbf{I}} + {{\uppsi }}{\mathbf{J}} \) with \( {{\upxi }} = \frac{4}{4 + 3\gamma } \). As \( {\mathbf{T}}^{\varvec{t}} = {{\upxi }}^{\varvec{t}} {\mathbf{I}} + \left[ {n_{r}^{t} {{\upxi }}^{\varvec{t}} + \cdots } \right]{\mathbf{J}} \), the J coefficient will tend to ∞ as soon as \( n_{r} {{\upxi }} = \frac{{4n_{r} }}{4 + 3\gamma } > 1 \), a very realistic situation. Thus, the convergence of the Taylor series will be a balance between the increase of \( {\mathbf{T}}^{\varvec{t}} \) and decrease of \( \left[ {{\mathbf{D}}\gamma } \right]^{\varvec{t}} \).
Another approximation of the reliability
Principle
Using the classical matrix inversion lemma, the variance \( v\left( {\hat{q}_{{\text{c}}} {\mathbf{x}}_{{\mathbf{c}}} ,{\mathbf{X}}_{{\mathbf{r}}} } \right) = \sigma _{\upbeta }^{2} {\mathbf{x}}_{{\mathbf{c}}} {\mathbf{X}}^{\prime}_{{\mathbf{r}}} \left( {{\mathbf{X}}_{{\mathbf{r}}} {\mathbf{X}}^{\prime}_{{\mathbf{r}}} + {\mathbf{I}}\uplambda _{\upbeta } } \right)^{{  1}} {\mathbf{X}}_{{\mathbf{r}}} {\mathbf{x}}^{\prime}_{{\text{c}}} \) may also be defined as \( {\boldsymbol{v}}\left( {\hat{q}_{c} {\mathbf{x}}_{{\mathbf{c}}} ,{\mathbf{X}}_{{\mathbf{r}}} } \right) = \sigma_{{{\upbeta }}}^{2} {\mathbf{x}}_{{\mathbf{c}}} {\mathbf{x}}^{\prime}_{{\mathbf{c}}}  \sigma_{\text{e}}^{2} {\mathbf{x}}_{{\mathbf{c}}} \left( {{\mathbf{X}}^{\prime}_{{\mathbf{r}}} {\mathbf{X}}_{{\mathbf{r}}} + {\mathbf{I}}{{\uplambda }}_{{{\upbeta }}} } \right)^{  1} {\mathbf{x}}^{\prime}_{{\mathbf{c}}} \).
\( {\mathbf{X}}^{\prime}_{{\mathbf{r}}} {\mathbf{X}}_{{\mathbf{r}}} \) is a very large matrix \( \left( {n_{M} \times n_{M} } \right) \) that describes the LD between markers: its elements tend to be smaller when they are more distant from the diagonal.
Elements of E[X ^{′}_{ r } X _{ r }] are the following: \( {\text{E}}\left[ {{\mathbf{X}}^{\prime}_{{\mathbf{r}}} {\mathbf{X}}_{{\mathbf{r}}} } \right]_{\text{ml}} = {\text{E}}\left[ {\mathop {\sum\nolimits_{\text{i}} {\left( {a_{im}  2p_{m} } \right)\left( {a_{il}  2p_{l} } \right)} }\limits_{{}} } \right] = 2n_{r} \Updelta_{ml}, \) with \( \Updelta_{ml} \) the LD between loci m and l.
E[X ^{′}_{ r } X _{ r }]_{mm} = E[ ∑ _{i}(a _{ im } − 2p _{ m })^{2}] = n _{ r } σ ^{2}_{m} . Let \( {\mathbf{C}} = {\mathbf{I}}{{\uplambda }}_{{{\upbeta }}} + n_{r} {\text{diag}}\left[ {\sigma_{1}^{2} , \ldots ,\sigma_{{n_{M} }}^{2} } \right] \), the X ^{′}_{ r } X _{ r } + Iλ_{β} matrix may be written as:
\( {\mathbf{X_r^\prime}} {\mathbf{X}}_{{\mathbf{r}}} + {\mathbf{I}}{{\uplambda }}_{{{\upbeta }}} = \left[ {\left( {{\mathbf{X}}^{\prime}_{{\mathbf{r}}} {\mathbf{X}}_{{\mathbf{r}}}  n_{r} {\text{diag}}\left[ {\sigma_{1}^{2} , \ldots ,\sigma_{{n_{M} }}^{2} } \right]} \right){\mathbf{C}}^{  1} + {\mathbf{I}}} \right]{\mathbf{C}} \), which results in:
X ^{′}_{ r } X _{ r } + Iλ_{β} = [I + B]C.
The convergence of the Taylor series I − B + B ^{2} − B ^{3} + ··· to (I + B)^{−1} depends on the structure of the B matrix, which varies depending on the sample. However, we can examine the case of its expectation E[B].
E[B]_{ mm } = 0 and \( {\text{E}}\left[ {\mathbf{B}} \right]_{ml} = \frac{{2n_{r} \Updelta_{ml} }}{{{{\uplambda }}_{{{\upbeta }}} + n_{r} \sigma_{\text{l}}^{2} }} \). The ratio λ_{β} is proportional to the number of markers (\( {{\uplambda }}_{{{\upbeta }}} = n_{M} \frac{{\bar{\sigma }_{\text{m}}^{2} \sigma_{e}^{2} }}{{\sigma_{\text{q}}^{2} }}) \) and dominates the denominator when n _{ M } ≫ n _{ r }. The (m, l) term in E[B]^{2}, i.e. \( {\text{E}}\left[ {\mathbf{B}} \right]_{{\varvec{ml}}}^{2} = \sum\nolimits_{k} {\frac{{4n_{r}^{2} \Updelta_{mk} \Updelta_{kl} }}{{\left( {{{\uplambda }}_{{{\upbeta }}} + n_{r} \sigma_{\text{k}}^{2} } \right)\left( {{{\uplambda }}_{{{\upbeta }}} + n_{r} \sigma_{\text{l}}^{2} } \right)}}} \), is of order \( \frac{1}{{n_{M} }} \). Thus, we expect the Taylor series to converge to (I + E[B])^{−1}.
First order approximation
At the first order, \( v\left({\hat{q}_{c} {\mathbf{x}}_{{\mathbf{c}}} ,{\mathbf{X}}_{{\mathbf{r}}} } \right) \sim \sigma_{{{\upbeta }}}^{2} {\mathbf{x}}_{{\mathbf{c}}} {\mathbf{x}}^{\prime}_{{\mathbf{c}}}  \sigma_{\text{e}}^{2} {\mathbf{x}}_{{\mathbf{c}}} {\mathbf{C}}^{  1} \left( {{\mathbf{I}}  {\mathbf{B}}} \right){\mathbf{x}}^{\prime}_{{\mathbf{c}}} \) and the expectation of the reliability of the candidate GBLUP is approximated by \( \tilde{\tilde{\text{E}}}\left[ r_{q_{{\text{c}}} ,\hat{q}_{{\text{c}}} }^{2} \right] = 1  \frac{{\sigma _{{\text{e}}}^{2} {\text{E}}\left[ {{\mathbf{x}}_{{\mathbf{c}}} {\mathbf{C}}^{{  1}} \left( {{\mathbf{I}}  {\mathbf{B}}} \right){\mathbf{x}}^{\prime}_{{\mathbf{c}}} } \right]}}{{\sigma _{\upbeta }^{2} {\text{E}}\left[ {{\mathbf{x}}_{{\mathbf{c}}} {\mathbf{x}}^{\prime}_{{\mathbf{c}}} } \right]}} \).
Using \( {\mathbf{x}}_{{\mathbf{c}}} {\mathbf{C}}^{  1} {\mathbf{X}}^{\prime}_{{\mathbf{r}}} = \left({\begin{array}{*{20}l} {\sum\nolimits_{\text{m}} {\frac{{{\text{x}}_{\text{cm}} {\text{X}}_{{{\text{r}}1{\text{m}}}} }}{{{{\uplambda }}_{{{\upbeta }}} + {\text{n}}_{\text{r}} {{\sigma }}_{\text{m}}^{2} }}} } & \cdots & {\sum\nolimits_{\text{m}} {\frac{{{\text{x}}_{\text{cm}} {\text{X}}_{{{\text{rn}}_{\text{r}} {\text{m}}}} }}{{{{\uplambda }}_{{{\upbeta }}} + {\text{n}}_{\text{r}} {{\sigma }}_{\text{m}}^{2} }}} } \\ \end{array} }\right) \), the last term is: \( \sum\nolimits_{i} {\left( {\mathop \sum \nolimits_{m} \frac{{x_{cm} {\text{X}}_{rim} }}{{\lambda_{\beta } + n_{r} \sigma_{m}^{2} }}} \right)^{2} } .\) Finally, the expectation is:
Results
Application in the case of independent markers
This situation either assumes low density marker information, or corresponds to the idea of an effective number of loci that was developed by Goddard [16, 31]. In the first case, the proportion of the genetic variance explained by the markers \( \frac{{v\left( {q_{ci} } \right)}}{{v\left( {g_{ci} } \right)}} \) is small and this quantity should be considered when estimating the genomic precision.
First approximation
Using the notation X ^{′} = (x ^{′}_{ c } , X ^{′}_{ r } ), the (i, j) element of \( {\mathbf{G}}^{ *} {\mathbf{TG}}^{ *} \) is: \( \left\{ {{\mathbf{XX^{\prime}TXX^{\prime}}}} \right\}_{\text{ij}} = \sum\nolimits_{l} {\sum\nolimits_{k} {t_{kl} \left( {\sum\nolimits_{m} {X_{im} X_{km} } } \right)\left( {\sum\nolimits_{m} {X_{jm} X_{lm} } } \right)} }. \) Thus, elements of E[XX ^{′} TXX ^{′}] will involve expectations of fourth level moments of X _{ im } within m joint distributions: E[\( {\text{E}}\left[ {X_{im}^{2} X_{jm} X_{km} } \right], {\text{E}}\left[ {X_{im}^{2} X_{jm}^{2} } \right], {\text{E}}\left[ {X_{im}^{3} X_{jm} } \right] \) X _{ im } X _{ jm } X _{ km } X _{ lm }], and E[X ^{4}_{ im } ]. Defining and \( \tau _{2} = \sum _{m} [2p_{m} (1  p_{m} )]^{2} \) a _{ ij } the coancestry coefficient between individuals i and j, we found that [See Additional file 1]: \( \left\{ {{\text{E}}\left[ {{\mathbf{XX^{\prime}TXX^{\prime}}}} \right]} \right\}_{\text{ij}} = \sum\nolimits_{l} {\sum\nolimits_{k} {t_{kl} \left( {\frac{1}{2}\tau \alpha_{ijkl}^{1111}  \frac{1}{4}\tau_{2} \gamma_{ijkl}^{1111} + 4a_{ik} a_{jl} \left[ {\tau^{2}  \tau_{2} } \right]} \right)} }, \) where parameters \( \alpha_{ij \cdots K}^{{d_{i} d_{j} \cdots d_{K} }} \) and \( \gamma_{ij \cdots K}^{{d_{i} d_{j} \cdots d_{K} }} \) are functions of the probabilities of the identity states between gametes of \( ij \cdots K \) individuals at marker m (Table 1). In the summations above, when individuals are repeated (e.g. i = j), the corresponding exponents are summed (e.g. α ^{1111}_{ iiil } = α ^{31}_{ il } ).The resulting X _{ im } moments are in Table 2.
Second approximation
The expectations \( {\text{E}}[ {x_{cm}^{2} x_{rim}^{2} }]\) and \({\text{E}}\left[ {x_{{cm}}^{2} x_{{rim}} x_{{cl}} x_{{ril}} } \right] \)are also obtained from the coefficients in Table 1, i.e.: \( {\text{E}}\left[ {x_{cm}^{2} x_{rim}^{2} } \right] = \frac{1}{2}\sigma_{m}^{2} \alpha_{ci}^{22}  \frac{1}{4}\sigma_{m}^{4} \gamma_{ci}^{22} \) and, when markers are independent, E[x _{ cm } X _{ rim } x _{ cl } X _{ ril }] = E[x _{ cm } X _{ rim }].E[x _{ cl } X _{ ril }] = 4a ^{2}_{ ci } σ ^{2}_{m} σ ^{2}_{l} . Let \( \rho_{m} = \frac{{n_{r} \sigma_{m}^{2} }}{{\lambda_{\beta } + n_{r} \sigma_{m}^{2} }} \). After some algebra, it appears that:
where \( \bar{a}_{ci}^{2} \), \( \bar{\alpha}_{ci}^{22} \) and \( \bar{\gamma }_{ci}^{22} \) are the means of the corresponding coefficients, considering all possible i reference individuals.
Parameter estimation
The parameters τ = ∑_{ m } σ ^{2}_{ m } and τ _{2} = ∑_{ m } σ ^{4}_{ m } that appear in the first approximation, and the parameters \( \sum\nolimits_{m} {\rho_{m} } \), ∑_{ m } ρ ^{2}_{ m } and \( \sum\nolimits_{m} {\frac{{\rho_{m}^{2} }}{{\sigma_{m}^{2} }}} \) that appear in the second approximation, are unknown. Their expectations can be derived by making assumptions about the distribution of the marker allele frequencies. They were derived assuming either a uniform distribution of allele frequencies or the Ushaped distribution of allelic frequencies proposed by Goddard [16]: \( f( p) = {k \mathord{\left/ {\vphantom {k {2{\text{p}}\left( {1  {\text{p}}} \right) }}} \right. \kern0pt} {2{\text{p}}\left( {1  {\text{p}}} \right) }} \) with the constant k estimated as 1/log2N _{ e }, N _{ e } being the effective size of the reference population. The expectations of the parameters are in Table 3. The corresponding algebra is detailed in Additional file 2.
The parameters τ and τ _{2} are linked to the number M _{ e } of independent segments. This quantity M _{ e } was defined by Goddard [16] as the number of independent chromosomal segments which would give the same variance of genomic covariances \( c_{ij} \) between individuals i and j as that observed, i.e. when LD exists. Conditional on the genotypic observation, the genomic covariance between two individuals is cov(q _{ i },q _{ j }X) = σ ^{2}_{β} ∑_{q}X_{iq}X_{jq} = c _{ ij }. Thus, v_{X}(c _{ ij }) = σ ^{4}_{β} v[∑_{q}X_{iq}X_{jq}], or v_{X}(c _{ ij }) = σ ^{4}_{β} (∑_{q}v(X_{iq}X_{jq}) + ∑_{q}∑_{q′≠q}cov(X_{iq}X_{jq},X_{iq′}X_{jq′})). When the markers are in linkage equilibrium, the covariance term is null, and \( {\text{v}}_{\text{X}} \left( {c_{ij} } \right) = \sigma_{{{\upbeta }}}^{4} \left[ {\frac{1}{2}\tau \alpha_{ij}^{22}  \frac{1}{4}\tau_{2} \gamma_{ij}^{22}  \frac{1}{4}{\text{a}}_{\text{ij}}^{2} \tau_{2} } \right] \). If individuals are unrelated, α ^{22}_{ ij } = 0, γ ^{22}_{ ic } = −4 and a_{ij} = 0. Thus, v_{X}(c _{ ij }) = σ ^{4}_{β} τ _{2}. As σ ^{2}_{ q } = σ ^{2}_{β} τ, \( {\text{v}}_{\text{X}} \left( {c_{ij} } \right) = \sigma_{q}^{4} \frac{{\tau_{2} }}{{\tau^{2} }} \). From the appendix in the paper of Goddard [16], this variance is v_{X}(c _{ ij }) = σ ^{4}_{ q } /M _{ e }. Thus:
It must be emphasized that M _{ e }, which depends on the variability of allele frequencies, is not the number of markers n _{ M }.
The case of unrelated individuals
The first approximation gives results similar to Goddard et al. [18] when individuals are not related. In this case, A = I then \( {\mathbf{T}} = \frac{1}{1 + \gamma }{\mathbf{I}} = \frac{{\sigma_{e}^{2} }}{{\sigma_{q}^{2} + \sigma_{e}^{2} }}{\mathbf{I}} = \frac{{\sigma_{e}^{2} }}{{\sigma^{2} }}{\mathbf{I}} \). The ratio \( \frac{\gamma }{1 + \gamma } = \frac{{\sigma_{q}^{2} }}{{\sigma^{2} }} = \nu^{2} \) is the proportion of the phenotypic variance explained by the molecular score.
T being diagonal, this equation simplifies to \( \left\{ {{\text{E}}\left[ {{\mathbf{G}}^{ *} {\mathbf{TG}}^{ *} } \right]} \right\}_{\text{cc}} = \sum\nolimits_{k} {t_{kk} \left( {\frac{1}{2}\tau \alpha_{ck}^{22}  \frac{1}{4}\tau_{2} \gamma_{ck}^{22} + 4a_{ck}^{2} \left[ {\tau^{2}  \tau_{2} } \right]} \right)}, \) with \( t_{kk} = \frac{1}{1 + \gamma } \), \( \alpha_{ck}^{22} = 0 \;{\text{and}}\; \gamma_{ck}^{22} =  4\;{\text{if}} \;c \ne k \), \( \alpha_{cc}^{22} = \alpha_{c}^{4} = 2 \;{\text{and}} \;\gamma_{c}^{4} = 0 \), \( {\text{a}}_{\text{cc}} = {\text{a}}_{\text{kk}} = \frac{1}{2} \) and a_{ck} = 0. Hence \( \left\{ {{\text{E}}\left[ {{\mathbf{G}}^{ *} {\mathbf{TG}}^{ *} } \right]} \right\}_{\text{cc}} = \frac{1}{1 + \gamma }\left\{ {\tau + \tau^{2}  \tau_{2} + n_{r} \tau_{2} } \right\} \), and \( {\text{E}}\left[ {{\varvec{\Lambda}}_{\text{cc}} } \right] = \frac{1}{{\sigma_{e}^{2} }}\left\{ {\frac{1}{1 + \gamma } + \frac{{\gamma^{2} }}{{\left( {1 + \gamma } \right)^{3} }}\left( {\frac{1}{\tau } + \left( {n_{r}  1} \right)\frac{{\tau_{2} }}{{\tau^{2} }}} \right)} \right\} \). \( \begin{aligned} {\text{E}}\left[ {v\left( {\hat{q}_{c} } \right)} \right] &= {\text{E}}\left[ {{\mathbf{V}}_{\text{cc}}^{ *} } \right]  \frac{1}{{{\text{E}}\left[ {{\varvec{\Lambda}}_{\text{cc}} } \right]}}\\ &= \sigma^{2}  \sigma^{2} \frac{1}{{1 + \nu^{4} \left( {\frac{1}{\tau } + \frac{{n_{r}  1}}{{M_{e} }}} \right)}} \\ & = \sigma^{2} \frac{{\nu^{4} \left( {\frac{1}{\tau } + \frac{{n_{r}  1}}{{M_{e} }}} \right)}}{{1 + \nu^{4} \left( {\frac{1}{\tau } + \frac{{n_{r}  1}}{{M_{e} }}} \right)}}. \\ \end{aligned} \) If we neglect \( \frac{1}{\tau }  \frac{1}{{M_{e} }} \) and use \( \nu^{2} = \frac{{\sigma_{q}^{2} }}{{\sigma^{2} }} \), we get \( {\text{E}}\left[ {v\left( {\hat{q}_{c} } \right)} \right] = \sigma_{q}^{2} \frac{{\nu^{2} \frac{{n_{r} }}{{M_{e} }}}}{{1 + \nu^{4} \frac{{n_{r} }}{{M_{e} }}}} \), which is similar but not identical to the equation in Goddard et al. [18] (\( \sigma_{q}^{2} \frac{{\nu^{2} \frac{{n_{r} }}{{M_{e} }}}}{{1 + \nu^{2} \frac{{n_{r} }}{{M_{e} }}}} \)). Finally, the precision is estimated as:
In this situation of unrelatedness between the candidate and the reference population, the second approximation simplifies to \( {{\tilde{\tilde{E}}}}\left[ {r_{{q_{\text{c}} ,\, \hat{q}_{c} }}^{2} } \right] = 1  \lambda_{\beta } \frac{{E\left[ {\mathop \sum \nolimits_{m} \rho_{m} } \right]}}{{n_{r} \tau }} \). From Table 3, we have \( E\left[ {\sum\nolimits_{m} {\rho_{m} } } \right] = n_{M} \frac{k}{\omega }\theta \) with \( \theta = \log \left( {\left {\frac{1 + \omega }{1  \omega }} \right} \right),\omega = \sqrt {1 + 4h} \), \( h = {{{{\uplambda }}_{{{\upbeta }}} } \mathord{\left/ {\vphantom {{{{\uplambda }}_{{{\upbeta }}} } {2n_{r} }}} \right. \kern0pt} {2n_{r} }} \) and k = 1/ log 2N _{ e }. As λ _{ β } = τ/γ, we found:
Nonindependence between reference and candidate population, a simple example
We consider the situation of a candidate that is the son of one of the n _{ r } individuals in P_{r} (say the first in the list) while still assuming that reference individuals are unrelated. In this situation, the pedigree relationship matrix is \( \left( {\begin{array}{*{20}c} {\begin{array}{*{20}c} 1 & {0.5 } \\ {0.5} & 1 \\ \end{array} } & {\bf 0} \\ {\bf 0} & {{\mathbf{I}}_{{{\text{n}}_{\text{r}}  1}} } \\ \end{array} } \right) \), which results in a T matrix \( \left( {\begin{array}{*{20}c} {\begin{array}{*{20}c} a & b \\ b & a \\ \end{array} } & {\bf 0} \\ {\bf 0} & {\frac{1}{1 + \gamma }{\mathbf{I}}_{{{\text{n}}_{\text{r}}  1}} } \\ \end{array} } \right) \) with \( \gamma = \frac{{\sigma_{q}^{2} }}{{\sigma_{e}^{2} }}, a = \frac{1 + \gamma }{{\left( {1 + \gamma } \right)^{2}  {1 \mathord{\left/ {\vphantom {1 {4\gamma^{2} }}} \right. \kern0pt} {4\gamma^{2} }}}} \) and \( b =  \frac{{{\gamma \mathord{\left/ {\vphantom {\gamma 2}} \right. \kern0pt} 2}}}{{\left( {1 + \gamma } \right)^{2}  {1 \mathord{\left/ {\vphantom {1 {4\gamma }}} \right. \kern0pt} {4\gamma }}}} \). Applications of formulae (2) and (3) are described in Additional file 3. The expected approximate precision with the first approach is:
where \( c1 = \left( {{\text{a}} + {\text{b}}} \right)^{3} + \left( {{\text{a}}^{2} + {\text{b}}^{2} } \right)\frac{1}{2}{\text{a}} \), \( c2 = \frac{1}{4}{\text{a}}\left( {{\text{b}}^{2}  {\text{a}}^{2} } \right) \) and \( c3 = {\text{a}}^{2} + {\text{b}}^{2} + \frac{1}{2} {\text{ab}} \). And with the second approach:
Alternative genotypes codification
In all the previous developments, genotypes were coded x _{ tim } = a _{ tim } − 2p _{ tm } and w _{ tiq } = a _{ tiq } − 2p _{ tq }. Alternatively, we could define x _{ tim } = (a _{ tim } − 2p _{ tm })/σ _{ tm } and w _{ tiq } = (a _{ tiq } − 2p _{ tq })/σ _{ tq }. The relation between genetic and marker variances becomes σ ^{2}_{ q } = n _{ M } σ ^{2}_{β} and the relation between pedigree and genomic matrices becomes E[G*] = A n _{ M }. Thus, formulae (1) and (2) are still valid when replacing τ by n _{ M }. The \( {\text{E}}\left[ {X_{im}^{{d_{i} }} X_{jm}^{{d_{j} }} \cdots X_{Km}^{{d_{K} }} } \right] \) elements derived in Additional file 1, need to be divided by \( \sigma_{m}^{{d_{i} + d_{j} + \cdots + d_{K} }} \). Table 2 gives the expectations with this alternative codification of genotypes. The quantity {E[XX ^{′} TXX ^{′}]}_{ij} has to be changed, using \( {{\zeta }} = \frac{1}{{n_{M} }}\sum\nolimits_{m} {\frac{1}{{\sigma_{m}^{2} }}} \). We have: \( \sum\nolimits_{\text{m}} {{\text{E}}\left[ {X_{im} X_{km} X_{jm} X_{lm} } \right] = \frac{{n_{M} }}{2}{{\zeta \alpha }}_{\text{ijkl}}^{1111}  \frac{{n_{M} }}{4}{{\gamma }}_{\text{ijkl}}^{1111} } \), ∑_{m}E[X _{ im } X _{ km }] = 2n _{ M } a _{ ik }, and ∑_{m}(E[X _{ im } X _{ km }]E[X _{ jm } X _{ lm }]) = 4n _{ M } a _{ ik } a _{ jl }. Thus:
When applied to the case of unrelated individuals and no LD, i.e. when \( t_{kk} = \frac{1}{1 + \gamma } \), \( \alpha_{ck}^{22} = 0 \;{\text{and}}\; \gamma_{ck}^{22} =  4 \;{\text{if}}\; c \ne k \), \( \alpha_{cc}^{22} = \alpha_{c}^{4} = 2 \;{\text{and}}\; \gamma_{c}^{4} = 0 \), \( {\text{a}}_{\text{cc}} = {\text{a}}_{\text{kk}} = \frac{1}{2} \) and a_{ck} = 0, we have:
which gives:
i.e. \( {\text{E}}\left[ {{\varvec{\Lambda}}_{\text{cc}} } \right] = \frac{1}{{\sigma^{2} }}\left\{ {1 + \nu^{4} \left( {\frac{{{{\zeta }} + n_{R}  1}}{{n_{M} }}} \right)} \right\} \) and \( {\tilde{\text{E}}}\left[ {r_{{q_{\text{c}} ,\hat{q}_{\text{c}} }}^{2} } \right] = \frac{{\nu^{2} \frac{{{{\zeta }} + n_{R}  1}}{{n_{M} }}}}{{1 + \nu^{4} \frac{{{{\zeta }} + n_{R}  1}}{{n_{M} }}}} \).
Based on Additional file 2, the expectation of ζ parameter is \( \frac{k}{4}\left[ {2\log \left( {N_{e}  1} \right) + 2\frac{{N_{e} \left( {N_{e}  2} \right)}}{{N_{e}  1}}} \right] \) for a Ushaped distribution of alleles frequencies and log (N _{ e } − 1) for a uniform distribution.
The case of markers in linkage disequilibrium
So far, following Goddard [16], we considered the situation of n _{ M } independent segments that each carries a single QTL in LD with a single marker. More typically, the genomic information consists of a large number of nonindependent markers. This nonindependence comes from longterm effects due to bottlenecks, mutations, migrations, etc. and shortterm effects due to family structure.
Effective and equivalent numbers of independent loci
We based our developments on the very fruitful concept of the effective number of loci that Goddard defined as “the number of independent loci that gives the same variance of realized relationships as that obtained in the more realistic situation” (Goddard [16] appendix). Since our objective was to predict the reliability of GEBV, we now suggest the alternative definition of an “equivalent number of independent loci” which would give the reliability of GEBV for unrelated individuals when considering a subset of independent markers that would be identical to the reliability obtained when considering the full set of markers. From the derivation of the reliability given previously, defining x ^{i}_{ c } and \( {\mathbf{X}}_{{\mathbf{r}}}^{{\mathbf{i}}} \) as the genotype meed \( {\text{E}}_{{{\mathbf{x}}_{{\mathbf{c}}} ,{\mathbf{X}}_{{\mathbf{r}}} }} \left[ {v\left( {\hat{q}_{c} {\mathbf{x}}_{{\mathbf{c}}} ,{\mathbf{X}}_{{\mathbf{r}}} } \right)} \right] = {\text{E}}_{{{\mathbf{x}}_{\text{c}}^{\text{i}} ,{\mathbf{X}}_{\text{r}}^{\text{i}} }} \left[ {v\left( {\hat{q}_{c} {\mathbf{x}}_{{\mathbf{c}}}^{{\mathbf{i}}} ,{\mathbf{X}}_{{\mathbf{r}}}^{{\mathbf{i}}} } \right)} \right]. \) With a few simplifying assumptions (identical distribution of genotypes in the reference and candidate populations and equal genotypic variance at all loci) a simple formula can be derived [see Additional file 4]:
where tr[M] is the trace of matrix M.
Once marker allele frequencies and betweenmarker LD are estimated in a population of interest, the equivalent number of independent loci which can be estimated from formula (9) and this parameter can be used in models that predict the genetic gain expected from a genomic selection scheme applied to this population.
In the more general situation, prior to the observation of the X _{ r } matrix, a simple approximation for \( n_{{M_{i} }} \) is obtained assuming equal variances σ ^{2}_{ m } = s ^{2}, and using the relation between expected LD and effective population size N _{ e } as derived by Sved [32]: \( {\text{E}}\left[ {2\Updelta_{ml} } \right] = {{\sigma_{m} \sigma_{l} } \mathord{\left/ {\vphantom {{\sigma_{m} \sigma_{l} } {\sqrt {1 + 4N_{e} d_{lm} } }}} \right. \kern0pt} {\sqrt {1 + 4N_{e} d_{lm} } }} \) with d _{ lm } the distance between ordered loci l and m, such that \( d_{lm} = {{\left {l  m} \rightL} \mathord{\left/ {\vphantom {{\left {l  m} \rightL} {n_{M} }}} \right. \kern0pt} {n_{M} }} \) with L the genome length in Morgan. With those hypotheses, let U = tr[(γn _{ R } R + n _{ M } I)^{−1}] with \( {\mathbf{R}}_{\text{ml}} = \sqrt {{{n_{M} } \mathord{\left/ {\vphantom {{n_{M} } {\left( {n_{M} + 4N_{e} \left {l  m} \rightL} \right)}}} \right. \kern0pt} {\left( {n_{M} + 4N_{e} \left {l  m} \rightL} \right)}}} \).
In this simplified situation, the equivalent number of loci is [See Additional file 4]:
Towards an exact treatment of linkage disequilibrium
For a complete treatment of the LD situation, it is necessary to estimate the expectations of the product of four genetic values. For instance, with the second approximation [formula (2)], we need to compute E[x _{ cm } X _{ rim } x _{ cl } X _{ ril }]. Let \( X_{im} = g_{imf} + g_{imd} \), where g _{ imf } and g _{ imd } are the “values” of the alleles transmitted to individual i by its father and its dam, with g _{ imf } and \( g_{imd} = \left( {0 \;or\; 1} \right)  p_{m} \). They will be called allelic values in the following. Equivalent terms are defined for x _{ cl },\( x_{cm} \) and X _{ il }. The random variable M _{ cls } is the allele of individual c received from s at locus \( l \left(f\, {\rm or}\, d \right) \). \( M_{cmt} , M_{ilu}\, {\text{and}}\, M_{imv} \) are defined similarly.
For the candidate c as for the reference individual i, the pair of genetic values may originate from the same parent (and coded on the same chromosome) or not, giving four types of \( \left( {g_{cls} ,g_{cmt} ,g_{ilu,} g_{imv} } \right) \) vectors. In type 1 (\( s = t \, and \, u = v \)), both alleles (belonging to loci \( m\; {\text{and}} \;l \)) of each pair of loci (one for c and one for i) are on the same chromosome (may be from the two fathers, the two dams, c’s father and i’s dam or i’s father and c’s dam). In type 2 (\( s = t \, and \, u \ne v \)), both alleles (belonging to loci \( m \;{\text{and}} \;l \)) of the candidate are on the same chromosome, while alleles of the reference individual i are not on the same chromosome.Type 3 (\( s \ne t \, and \, u = v \)) is the reverse from type 2. In type 4 (\( s \ne t \, and \, u \ne v \)), alleles of loci \( m\; {\text{and}} \;l \) of both individuals and i are on different chromosomes.
For each of these situations, the identity by descent (IBD) status between alleles at locus \( m \) on chromosomes ct and iv, and at locus \( l \) on chromosomes cs and iu are considered. There are four, as follows:
Thus, 16 terms involved in E[x _{ cl } x _{ cm } X _{ il } X _{ im }] are given by:
As described in Additional file 5, only seven \( {\text{E}}\left[ {g_{cls} g_{cmt} g_{ilu} g_{imv} {\mathcal{S}}_{k} } \right] \) are nonnull (Table 4). Principles on which the probabilities φ ^{stuv}_{ k } are estimated and basic examples are described in Additional file 5.
As an illustration, we consider again the situation of a candidate (c), that is the son of one of the n _{ r } individuals in P_{r} and assume that c ^{’s}dam is unrelated to the sire. In formula (2), the summation over the reference individuals i comprises a single term for the sire of the candidate and n _{ r } − 1 terms for the individual that are unrelated to the c members of this reference population.
Based on Additional files 1 and 5, expectations involved in the precision formulae (2) are:
E[x ^{2}_{ cm } X ^{2}_{ rim } ] = p _{ m }(1 − p _{ m }), and
\( \begin{aligned} {\text{E}}\left[ {x_{cl} x_{cm} X_{il} X_{im} } \right] \hfill \\ = \left( {1  p_{m} } \right)\left( {1  p_{l} } \right)p_{m} p_{l} + \Updelta_{lm} \left( {1  2p_{l} } \right)\left( {1  2p_{m} } \right) \hfill \\ + 2\Updelta_{lm}^{2} \left[ {r_{ml} \left( {\frac{{p_{m}^{3} + \left( {1  p_{m} } \right)^{3} }}{{p_{m} \left( {1  p_{m} } \right)}} + \frac{{p_{l}^{3} + \left( {1  p_{l} } \right)^{3} }}{{p_{l} \left( {1  p_{l} } \right)}}} \right) + \left( {1  r_{ml} } \right)\left( {1  2p_{m} } \right)\left( {1  2p_{l} } \right)} \right], \hfill \\ \end{aligned} \) when i is the sire of c; and E[x ^{2}_{ cm } X ^{2}_{ rim } ] = 4[p _{ m }(1 − p _{ m })]^{2} and \( {\text{E}}\left[ {x_{cl} x_{cm} X_{il} X_{im} } \right] = 4\Updelta_{lm}^{2} \left( {1  2p_{m} } \right)\left( {1  2p_{l} } \right) \), when i and c are unrelated.
Numerical evaluation
Simulation of allele frequencies
In the following numerical evaluation of the formulae derived above, allele frequencies were simulated following an inverse transform sampling (e.g. [32]): \( n_{M} \) allele frequency cumulative distribution function values u _{ m } were simulated in a uniform \( {\mathcal{U}}\left( {0,1} \right) \), and corresponding allele frequencies p _{ m }, i.e. such as \( u_{m} = \int_{{1/2n_{r} }}^{{p_{m} }} {f\left( p \right)dp} \), computed by \( p_{m} = \frac{{\left( {2n_{r}  1} \right)^{{\left( {2u_{m}  1} \right)}} }}{{1 + \left( {2n_{r}  1} \right)^{{\left( {2u_{m}  1} \right)}} }} \).
Basic situation: no LD and unrelated individuals
Convergence of Taylor series and quality of expectation of the reliability approximations were tested for different population sizes (\( n_{r} = 500, 1000, 1500 \;{\text{and}}\; 2500 \)), numbers of markers (\( n_{M} = 50, 100, 250, 1000, 1500, 2000 \;{\text{and}}\; 2500 \)) and proportions of the phenotypic variance explained by the molecular score (\( \nu^{2} = 0.1, 0.4 \;{\text{and}} \;0.7 \)). Given the set of allele frequencies \( p_{m} \left( {m = 1 \ldots n_{M} } \right) \), genotypes X of n _{ r } + 1 individuals were generated and the G matrix was built. The reliability of the candidate individual GEBV, \( r^{2} = \frac{{{\text{v}}(\hat{q}_{c} {\mathbf{X}})}}{{{\text{v}}\left( {q_{\text{c}} {\mathbf{X}}} \right)}} \) was computed as described in the section «Situation analyzed» , as well as approximations considering 1–10 elements in the Taylor series I − D γ + D ^{2} γ ^{2} − D ^{3} γ ^{3}··· The convergence of the series as predicted by the value (lower or higher than 1) of the matrix’s largest eigenvalue was checked numerically, by estimating the mean of this largest eigenvalue from five simulations in each case studied (\( n_{r} = 200\; {\text{to}} \;1000 ;n_{M} = 100 \;{\text{to}}\; 2000 \;{\text{and}} \;\nu^{2} = 0.1, 0.4, 0.7). \)
This limited number of replications was chosen after observation of a very limited variance of this eigenvalue. Finally, the asymptotic values of the suggested approximations [formulae (5) and (6)] were computed using the number of independent segments as described by [4]. The process was repeated 50 times and the means of those exact or approximated reliabilities computed.
Figure 1a and b illustrates the convergence of the Taylor series when 2000 markers are used, and Tables 5 and 6 give the results for both approximations when ν ^{2} = 0.4. The Taylor series converged when the proportion ν ^{2} of the phenotypic variance explained by the molecular score was low, with oscillations and divergence observed when \( \nu^{2} = 0.4 \) or 0.7 with the first approximation and ν ^{2} = 0.7 with the second approximation. These observations were in accordance with the deviation to one of the largest eigenvalue of the matrix involved in the series (Fig. 2a, b). When the series converged, the approximations rapidly reached a plateau, at the 3rd (respectively, 2nd) order for the first (respectively, second) approximation.
Table 6 shows that the second Taylor series converges always when \( \nu^{2} = 0.4 \). The proposed approximation was generally biased upwards. This overestimation of the precision was generally limited but increased as the number of markers and the reference population size decreased. The maximum overestimation observed was 37.5 % (0.22 instead of 0.16 with a standard error less than 0.02). Based on the results in Table 5, it appears that when the first Taylor series converges, the proposed approximation is also slightly overestimated. The expectation of the approximations, as given in formulae (5) and (6) are very close to the observations.
No LD and nonindependence between reference and candidate population
The quality of the approximation was tested as above, by considering the case of a candidate having one of its parents in the reference population and all other individuals being unrelated. Tables 7 and 8, which summarize the results of the simulation, show that the second approximation is still the most efficient (systematic convergence of the Taylor series and consistency between first order approximation and its expectation). Again, an overestimation of about 20 % is observed with this approximation.
Example of the use of the second approach
As an illustration of formula (3) different situations that differ in the relationships between the candidate and reference populations were compared. Coefficients of formula (3) were estimated using the elements in Table 3. An effective reference population size of 200, the genotyping of 10,000 markers and a heritability of 0.4 were assumed. Scenarios included no individuals related to the candidate in the reference population, its sire, both parents, 1–10 halfsibs (or uncles), and a combination of parental and halfsib information.
The results are in Fig. 3. The linearity of the precision increases with the number of halfsibs, which is consistent with the approximation, but unsatisfactory, as discussed below.
Equivalent number of independent loci
This number was computed using formula (8), for various effective population sizes (\( N_{e} = 100 \;{\text{to}} \;1000 \)), heritabilities (\( h^{2} = 0.1 \;{\text{to}}\; 0.5 \)), total numbers of loci (\( n_{M} = 1000 \;{\text{to}}\; 10,000) \) and reference population sizes (\( n_{R} = 1000 \;{\text{to}}\; 2500 \)).
Figure 4 shows how equivalent numbers of independent loci (\( n_{{M_{i} }} \)) vary with the total number of markers (\( n_{M} ) \) and reference population size (n _{ R }). As n _{ M } increases, the number \( n_{{M_{i} }} \) rapidly converges to a value which strongly depends on the size of the reference population size. This dependence on n _{ R } of the equivalent number of independent loci does not exist in the Goddard’s effective number of loci and clearly shows the difference in nature between these concepts. Three phenomena, observed when considering the extreme case of two markers (see Additional file 5), explain this behavior:

(1)
The trace T of (E[X ^{′}_{ r } X _{ r }] + λ_{β} I)^{−1} is a decreasing function of n_{ r }: as a consequence, the larger is the population size, the smaller is T, which is proportional to the marker effects conditional variances \( v\left(\varvec{\beta}\right)  cov\left( {\varvec{\beta},{\mathbf{y}}} \right)v\left( \varvec{y} \right)^{  1} cov( {{\mathbf{y}},\varvec{\beta}}) \)) and the higher is the variance of the estimated molecular score (\( {\text{v}}\left( {q_{c} {\mathbf{y}}} \right) = {\mathbf{x}}_{{\mathbf{c}}} cov\left( {\varvec{\beta},{\mathbf{y}}} \right)v\left( \varvec{y} \right)^{  1} cov\left( {{\mathbf{y}},\varvec{\beta}} \right){\mathbf{x}}^{\prime}_{{\mathbf{c}}} ) \).

(2)
The trace T is always higher in the situation of LD than for independent markers \( \left( {{\text{T}}_{LD} > {\text{T}}_{LE} } \right) \).

(3)
The rate of decrease is higher for T_{ LD } than for T_{ LE }. On the whole, the reliability for a given number of observed markers corresponds to the reliability that is reached with a larger number of independent loci when the size of the reference population is larger.
Figure 5 indicates that the equivalent number of independent loci increases with heritability and effective population size. This last observation was expected since with larger effective population sizes, the LD between two loci decreases and this increases the effective number of loci. The effect of heritability is less direct.
Discussion
The objective of this paper was to explore approximations of the precision of genomic selection when the selection candidate has relatives in the reference population. Two approximations were developed and numerically compared.
These approximations were based on Taylor expansions of a matrix inverse M ^{−1}. In both cases, the initial matrix is the sum of the identity matrix and a perturbation (M = I + E). Convergence of these series is not guaranteed and depends on the behavior of the perturbation (I − E + E ^{2} − E ^{3} → (I + E)^{−1} if \( {\mathbf{E}}^{\text{t}} \to 0 \) when t → ∞). With the first approximation, derived from the appendix in [18], this convergence failed when the number of makers was too small (less than 1500 in our example) or the heritability was greater than 0.1. This was only observed when ν ^{2} = 0.7 with the second approximation. This is fully consistent with the deviation to one of the largest eigenvalue of the E matrix.
The expectation of the proposed approximation, when data were simulated with the model corresponding to the hypotheses underlying their algebraic development, was very close to the mean value after 50 simulations. Thus, extremely fast estimation of the precision is possible, which allows intensive optimization and comparison of selection schemes.
When individuals are unrelated and markers are in linkage equilibrium, we obtain an estimation of the GEBV accuracy which differs from that of Goddard et al. [18]. This is surprising since that approach was said to be based on the Taylor approximation used here. Their formula may be obtained in a simpler way [see Additional file 6]. However, relaxing the assumption of “absence of betweenindividual relationship” is not straightforward using this approach.
A strong limit of our new approximation comes from the limitation to the first order term of the Taylor series. Deriving algebra was only possible at this stage. The side effect is that no genotypic covariance terms between reference individuals appear in this approximation. As a consequence, only the direct relationships between candidate and reference individuals play a role in the estimation, but not the structure within the reference population. This is unfortunate, because accuracies of genomic prediction are obviously affected by the construction of the reference population. Our last numerical example, in which there is a linear trend with the number of halfsibs, reveals this drawback: two halfsibs of the candidates are treated as unrelated and the information that they carry is just the double of that of a single halfsib. Future developments should focus on this limitation, for instance to derive the expectation of the x _{ c } C ^{−1} B ^{2} x ^{′}_{ c } term.
The Ushaped density function \( f( p) \) of allele frequencies was defined as in [16]\( . \) A Beta distribution \( {\mathcal{B}}( {\phi_{a} ,\phi_{b} }) \) for the allele frequencies was assumed by Gianola et al. [30], following Wright [34]. Assuming that the frequency distribution is centered on 0.5, i.e. Φ _{ a } = Φ _{ b } = Φ, this quantity Φ can be adjusted to fit the distribution of Goddard. Using the Chi^{2} test as a fitting option, we observed that the adjusted \( \hat{\phi } \) rapidly decreased as the population size increased (Fig. 6), with a slower and slower evolution as the population size grew larger (with \( n_{r} = 200{,}000 \) the adjusted \( \hat{\phi } \) is 0.9750000). Using a Beta distribution could give more generality to the results. If the expectation of τ and τ_{2} are easily derived from the moments generating function of Beta distribution (\( {\text{E}}[ {{\tau }}] = \frac{{n_{r} a}}{2a + 1} \) and \( {\text{E}} [ {{{\tau }}_{2} } ] = n_{r} \frac{{4a^{2} + 16a + 18}}{{4a^{2} + 8a + 3}} \)), deriving the expectation of parameters \( \sum\nolimits_{m} {\rho_{m} } \), ∑_{ m } ρ ^{2}_{ m } and \( \sum\nolimits_{m} {\frac{{\rho_{m}^{2} }}{{\sigma_{m}^{2} }}} \) is not simple. However, these quantities are quite easily obtained by numerical integration. Thus, adjusting a Beta distribution to observed allele frequencies and numerically computing formula (3) parameters would be a feasible and more versatile implementation of our second genomic precision approximation.
Our work focused on the BLUP precision of the molecular score \( r^{2} \left( {q_{ci} ,\hat{q}_{ci} {\mathbf{X}}} \right) = \frac{{v\left( {\hat{q}_{ci} } \right)}}{{v\left( {q_{ci} } \right)}} \) but left aside the proportion of the genetic variance that is captured by the markers \( \left( {\frac{{v ( {q_{ci} } )}}{{v ( {g_{ci} } )}}} \right) \). This last term could be treated as in Goddard et al. [18]: \( \frac{{v\left( {q_{ci} } \right)}}{{v ( {g_{ci} })}} = b = \frac{{n_{M} }}{{n_{M} + M_{e} }} \) with M _{ e } the number of independent segments. As noted in the section on the general framework, the quantity \( \frac{{v ( {\hat{q}_{ci} } )}}{{v ( {g_{ci} } )}} = b \times r ( {q_{ci} ,\hat{q}_{ci} {\mathbf{X}}} ) \) is only an approximation of these GEBV reliabilities i.e. \( r^{2} \left( {g_{ci} ,\hat{q}_{ci} {\mathbf{X}}} \right) = \frac{{cov^{2} \left( {g_{ci} ,\hat{q}_{ci} {\mathbf{X}}} \right)}}{{v\left( {g_{ci} {\mathbf{X}}} \right)v\left( {\hat{q}_{ci} {\mathbf{X}}} \right)}} \). Equality between those quantities is obtained when \( {\mathbf{X}} = {\mathbf{W}} \)(identity between statistical and genetical models), a condition assumed in Goddard [16] where markers and QTL are modeled as a series of uncorrelated pairs.
All the developments shown in this paper are based on the hypothesis that the reliability of GEBV based on nonindependent markers for a trait controlled by \( n_{Q} \) QTL that are in incomplete LD with the markers can be approached by the reliability of GEBV when there are n _{ M } independent segments that carry a single QTL in LD with a single marker. A few difficulties arose when applying this approach proposed by Goddard [16]. How many independent markers should be considered? The reasoning in Goddard [16] was based on the idea of an effective number of loci (M _{ e }) corresponding to a given variance of realized relationships. Here, we proposed the alternative equivalent number of independent loci (M _{ i }) which corresponds to a given reliability. We showed that this M _{ i } number depends on the size of the reference population and on heritability, a dependence that does not occur with M _{ e }. If we invert the argument, controlling the level of realized relationships variance with the effective number of loci (M _{ e }) does not seem to be a good approach to control the estimated GEBV reliability.
As detailed by Hayes et al. [17], the effective number of independent chromosome segments depends on the population structure. The higher is the mean relationship level, the smaller is this effective number. However, we suggest the use of this number as estimated from a set of unrelated individuals, or of its expectation prior to any observation, assuming independence between individuals. Without formal proof, the idea was that longterm LD was considered by using an effective (or equivalent) number of independent loci while shortterm nonindependence was taken into account with our formalization of the matrix’s expectations that is developed in Additional file 1. A complete proof of the procedure is still needed.
Regardless of the definition of \( M_{e} \) or M _{ i }, there is no reason that the number of independent loci must equal the number of QTL, which is unknown, contrary to the hypothesis about pairs of markerQTL (in practice, since the QTL effects are random variables, many segments will only have very small effects on the trait, thus simulating the more likely situation of a limited number of “real” QTL). Equating \( {\mathbf{X}} \) and W as well as σ ^{2}_{β} and σ ^{2}_{α} has no clear justification. The variance \( v\left( {\hat{q}_{c} {\mathbf{X}}} \right) \) of the molecular score should not be σ ^{2}_{β} x _{ c } X ^{′}_{ r } (X _{ r } X ^{′}_{ r } + Iλ_{β})^{−1} X _{ r } x ^{′}_{ c } but σ ^{2}_{α} x _{ c } X ^{′}_{ r } (X _{ r } X ^{′}_{ r } + Iλ_{β})^{−1}(W _{ r } W ^{′}_{ r } + Iλ_{α})(X _{ r } X ^{′}_{ r } + Iλ_{β})^{−1} X _{ r } x ^{′}_{ c } . This other formula assembles two sets of unknown parameters: the variances σ ^{2}_{α} and σ ^{2}_{β} , and the genotypes X and W. It is often assumed that \( {{\sigma }}_{{{\upbeta }}}^{2} = \sigma_{g}^{2} /\left( {n_{M} \bar{\tau }} \right) \) (e.g. [1]), which results in an overestimation of the \( {{\uplambda }}_{{{\upbeta }}} \) parameter since LD is not considered. Working on the number of independent loci (\( M_{e}\, {\text{or}}\,M_{i} \)) apparently solves this difficulty. The QTL variance \( {{\sigma }}_{{{\alpha }}}^{2} = \sigma_{g}^{2} /\left( {n_{Q} \bar{\tau }} \right) \) could be derived based on a hypothesis about the number of QTL. The situation is more difficult for the genotype matrices since the W _{ r } matrix is not observed.
If the framework considered so far (n _{ M } markersQTL pairs with strong LD within pairs and no LD between pairs) is partly retained, a slight improvement is possible considering the element b of the genetic variability explained by SNPs. The idea would be to replace, in the formulae used in this paper, σ ^{2}_{q} by b × σ ^{2}_{g} . Element b can be derived by considering that the markers’ (β) and QTL’ (α) effects are fixed in the genetic and statistical models. Leaving aside the singularity of X ^{′}_{ r } X _{ r } when the number of SNPs is large, the marker effects are now estimated by \( {\hat{\varvec{\upbeta }}} = \left( {{\mathbf{X}}^{\prime}_{{\mathbf{r}}} {\mathbf{X}}_{{\mathbf{r}}} } \right)^{  1} {\mathbf{X}}^{\prime}_{{\mathbf{r} }}{\mathbf{y}} \) and the molecular score defined as \( {\hat q}= {\mathbf{X}}_{{\mathbf{r}}} {\hat{\varvec{\upbeta }}} \), while the genetic value was g = W _{ r } α. Given the genotype matrices, the sample genetic variability is v _{ g } = α ^{′} W ^{′}_{ r } W _{ r } α and the sample molecular score variability y ^{′} X _{ r }(X ^{′}_{ r } X _{ r })^{−1} X ^{′}_{ r } y with an expectation v _{ q } = α ^{′} W ^{′}_{ r } X _{ r }(X ^{′}_{ r } X _{ r })^{−1} X ^{′}_{ r } W _{ r } α. The part of the genetic variability explained by the SNPs is the ratio b = v _{ q }/v _{ g }.
Expectations of the matrices’ product elements \( \left\{ {{\mathbf{X}}^{\prime}_{{\mathbf{r}}} {\mathbf{X}}_{{\mathbf{r}}} } \right\}_{{\varvec{ml}}} \) are \( 2n_{r} \Updelta_{ml} \) off diagonal and 2n _{ r } p _{ m }(1 − p _{ m }) = n _{ r } σ ^{2}_{ m } in the diagonal, with similar expressions for \( {\mathbf{W^{\prime}_{\mathbf{r}}}} {\mathbf{X}}_{{\mathbf{r}}} \) and W ^{′}_{ r } W _{ r } elements.
Following Goddard [16], approximating expectations of the matrices’ functions by the function of their expectation, and assuming that (1) markers are independent, (2) each QTL q is in LD with only one marker m(q), with a LD value \( \Updelta_{qm\left( q \right)} \), and (3) individuals are unrelated: v _{ g } = n _{ r } ∑ _{ q } α ^{2}_{ q } σ ^{2}_{ q } , we get \( v_{q} \sim 4n_{r} \sum\nolimits_{q} {\frac{{\Updelta_{qm\left( q \right)}^{2} }}{{\sigma_{m\left( q \right)}^{2} }}\alpha_{q}^{2} } = n_{r} \sum\nolimits_{q} {r_{qm\left( q \right)}^{2} \alpha_{q}^{2} \sigma_{q}^{2} } \), and \( {\text{b}} = \frac{{\mathop \sum \nolimits_{q} r_{qm\left( q \right)}^{2} \alpha_{q}^{2} \sigma_{q}^{2} }}{{\mathop \sum \nolimits_{q} \alpha_{q}^{2} \sigma_{q}^{2} }} \), corresponding to Eq. (4) in [16].
The ratio b is the weighted mean of LD r ^{2}. Unfortunately, neither α ^{2}_{ q } nor σ ^{2}_{ q } are known. The unweighted mean \( \frac{{\mathop \sum \nolimits_{q} r_{qm\left( q \right)}^{2} }}{{n_{q} }} = \bar{r}^{2} \) may be a fruitful approximation. Following Sved [33], the expectation of \( r_{qm\left( q \right)}^{2} \) is \( \frac{1}{{1 + 4N_{e} c}} \) with c being the distance, in Morgan, between the QTL and its marker. Let L be the total length of the genome, and assume an equal distance L/n _{ M } between each successive marker \( {\text{b}}\sim \int_{0}^{{L/2n_{M} }} {\frac{1}{{1 + 4N_{e} c}}\frac{1}{{{L \mathord{\left/ {\vphantom {L {2n_{M} }}} \right. \kern0pt} {2n_{M} }}}}dc = \frac{{n_{M} }}{{2N_{e} L}}\left[ {{ \log }\left( {1 + {{2N_{e} L} \mathord{\left/ {\vphantom {{2N_{e} L} {n_{M} }}} \right. \kern0pt} {n_{M} }}} \right)} \right]} \).
The expectation of the reliability \( {\text{E}}\left[ {r_{{q_{\text{c}} ,\hat{q}_{\text{c}} }}^{2} } \right] \), which is a ratio of variances \( {\text{E}}_{{\mathbf{X}}} \left[ {{\text{v}}\left( {\hat{q}_{c} {\mathbf{X}}} \right)/{\text{v}}\left( {q_{\text{c}} {\mathbf{X}}} \right)} \right] \) was approximated by the ratio of the variance expectations \( {\text{E}}_{{\mathbf{X}}} \left[ {{\text{v}}\left( {\hat{q}_{c} {\mathbf{X}}} \right)} \right]/{\text{E}}_{{\mathbf{X}}} \left[ {{\text{v}}\left( {q_{\text{c}} {\mathbf{X}}} \right)} \right] \). The usual second degree approximation (E[N/D] = E[N]/E[D] − cov[N, D]/E^{2}[D] + v[D]E[N]/E^{3}[D]) could not be used here due to algebra complexity. However, in the case of unrelated individuals and independent markers, numerical evaluation of the difference between exact and approximated results for various reference population sizes and numbers of markers shows a very small underestimation of the reliability (Table 9).
The theory presented here was developed by considering a single selection candidate. When candidates are diversely related to the reference population, as suggested in Goddard et al. [18], the candidates should be examined one by one. Moreover, nonindependence between candidates should be considered. A further step towards the modeling of genomic selection could be an approximation of the mean genetic values of selected individuals when GEBV reliabilities are heterogeneous.
A few other hypotheses were made in this paper, including additivity and i.i.d. of QTL effects, and the use of GBLUP. As long as the objective is to model and optimize breeding plans, only relative values are interesting and we assumed that these hypotheses were not critical.
Conclusions
The objective of this paper was to provide a further step towards the development of deterministic models that describe genomic breeding plans. Such deterministic models carry low computational burden and thus allow design optimization through intensive numerical exploration.
We proposed two alternative approximations of the estimation of GEBV reliability in the case of nonindependence between candidate and reference populations. Both were derived from the Taylor series heuristic approach suggested by Goddard [16]. A numerical exploration of their properties showed that the series were not equivalent in terms of convergence to the exact reliability, that the approximations may overestimate GEBV precision and that they perfectly converged toward their theoretical expectations.
Formulae derived for these approximations were simple to handle in the case of independent markers. A few parameters that describe the markers’ genotypic variability (allele frequencies, linkage disequilibrium) can be estimated from genomic data corresponding to the population of interest or estimated after assumption about their distribution.
When markers are not in linkage equilibrium (i.e. there is LD), replacing the real number of markers and QTL by an effective or equivalent number of independent loci, as proposed by Goddard [16] and Hayes et al. [17], is a practical solution. Research efforts are still needed to overcome some strong limits of this approach.
References
Meuwissen THE, Hayes BJ, Goddard ME. Prediction of total genetic value using genomewide dense marker maps. Genetics. 2011;157:1819–29.
Schaeffer LR. Strategy for applying genomewide selection in dairy cattle. J Anim Breed Genet. 2006;123:218–23.
König S, Simianer H, Willam A. Economic evaluation of genomic breeding programs. J Dairy Sci. 2009;92:382–91.
McHugh N, Meuwissen THE, Cromie AR, Sonesson AK. Use of female information in dairy cattle genomic breeding programs. J Dairy Sci. 2011;94:4109–18.
de Roos P, Schrooten WC, Veerkamp RF, van Arendonk JAM. Effects of genomic selection on genetic improvement, inbreeding, and merit of young versus proven bulls. J Dairy Sci. 2011;94:1559–67.
Pryce JE, Goddard ME, Raadsma HW, Hayes BJ. Deterministic models of breeding scheme designs that incorporate genomic selection. J Dairy Sci. 2010;93:5455–66.
Meuwissen THE, Hayes BJ, Goddard ME. Accelerating improvement of livestock with genomic selection. Annu Rev Anim Biosci. 2013;1:221–37.
Sonesson AK, Meuwissen THE. Testing strategies for genomic selection in aquaculture breeding programs. Genet Sel Evol. 2009;41:37.
IbánẽzEscriche N, Fernando RL, Toosi A, Dekkers JCM. Genomic selection of purebreds for crossbred performance. Genet Sel Evol. 2009;41:12.
Wolc A, Arango J, Settar P, Fulton JE, O’Sullivan NP, Preisinger R, et al. Persistence of accuracy of genomic estimated breeding values over generations in layer chickens. Genet Sel Evol. 2011;43:33.
Tribout T, Larzul C, Phocas F. Efficiency of genomic selection in a purebred pig male line. J Anim Sci. 2012;45:4164–76.
Shumbusho F, Raoul J, Astruc JM, Palhiere I, Elsen JM. Potential benefits of genomic selection on genetic gain of small ruminant breeding programs. J Anim Sci. 2013;91:3644–57.
Daetwyler HD, Villanueva B, Woolliams JA. Accuracy of predicting the genetic risk of disease using a genomewide approach. PLoS One. 2008;3:e3395.
Legarra A, RobertGranié C, Manfredi E, Elsen JM. Performance of genomic selection in mice. Genetics. 2008;180:611–8.
Van Raden PM. Efficient methods to compute genomic predictions. J Dairy Sci. 2008;91:4414–23.
Goddard ME. Genomic selection: prediction of accuracy and maximisation of long term response. Genetica. 2009;136:245–57.
Hayes B, Visscher P, Goddard M. Increased accuracy of artificial selection by using the realized relationship matrix. Genet Res. 2009;91:47–60.
Goddard ME, Hayes BJ, Meuwissen THE. Using the genomic relationship matrix to predict the accuracy of genomic selection. J Anim Breed Genet. 2011;128:409–21.
Buch LH, Kargo M, Berg P, Lassen J, Sørensen AC. The value of cows in reference populations for genomic selection of new functional traits. Animal. 2011;6:880–6.
Clark SA, Hickey JM, Daetwyler HD, van der Werf JHJ. The importance of information on relatives for the prediction of genomic breeding values and the implications for the makeup of reference data sets in livestock breeding schemes. Genet Sel Evol. 2012;44:4.
Wientjes YCJ, Veerkamp RF, Calus MPL. The effect of linkage disequilibrium and family relationships on the reliability of genomic prediction. Genetics. 2013;193:621–31.
Habier D, Fernando RL, Garrick DJ. Genomic BLUP decoded: a look into the black box of genomic prediction. Genetics. 2013;194:597–607.
Erbe M, Gredler B, Seefried FR, Bapst B, Simianer H. A function accounting for training set size and marker density to model the average accuracy of genomic prediction. PLoS One. 2013;8:e81046.
Weller JI. Economic aspects of animal breeding. London: Chapman & Hall; 1994.
Dekkers JCM. Prediction of response to markerassisted and genomic selection using selection index theory. J Anim Breed Genet. 2007;124:331–41.
Daetwyler HD, PongWong R, Villanueva B, Woolliams A. The impact of genetic architecture on genomewide evaluation methods. Genetics. 2010;185:1021–31.
Habier D, Fernando RL, Dekkers JCM. The impact of genetic relationship information on genomeassisted breeding values. Genetics. 2007;177:2389–97.
Pszczola M, Strabel T, Mulder HA, Calus MPL. Reliability of direct genomic values for animals with different relationships within and to the reference population. J Dairy Sci. 2012;95:389–400.
Wientjes YCJ, Veerkamp RF, Bijma P, Bovenhuis H, Schrooten C, Calus MPL. Empirical and deterministic accuracies of acrosspopulation genomic prediction. Genet Sel Evol. 2015;47:5.
Gianola D, De Los Campos G, Hill WG, Manfredi E, Fernando R. Additive genetic variability and the bayesian alphabet. Genetics. 2009;183:347–63.
Visscher PM, Medland SE, Ferreira MAR, Morley KI, Zhu G, Cornes BK, et al. Assumptionfree estimation of heritability from genomewide identitybydescent sharing between full siblings. PLoS Genet. 2006;2:e41.
Sigman K. Lecture notes introduction to discretetime Markov chains. http://www.columbia.edu/~ks20/stochasticI/stochasticIMCI.pdf. 2009.
Sved JA. Linkage disequilibrium and homozygosity of chromosome segments in finite populations. Theor Popul Biol. 1971;2:125–41.
Wright S. The distribution of gene frequencies in populations. Proc Natl Acad Sci USA. 1937;23:307–20.
La Gillois M. relation d’identité en génétique. Ann Inst Henri Poincaré. 1964;B2:1–94.
Harris DL. Genotypic covariances between inbred relatives. Genetics. 1964;50:1319–48.
Jacquard A. Logique du calcul des coefficients d’identité entre deux individus. Populations. 1966;21:751–76.
Acknowledgements
This work was partly done when the author was on sabbatical leave in the Animal Genetic and Breeding Unit (AGBU) in Armidale, Australia. This sabbatical was supported by a grant from AGBU and from INRA (métaprogramme Selgen). Andrew Swan, Julius van der Werf, Mike Goddard, Anne Ricard and Bruno Goffinet are thanked for their many useful comments. Rob Banks is particularly thanked for his help at many levels.
Competing interests
The author declares that he has no competing interests.
Author information
Authors and Affiliations
Corresponding author
Additional files
12711_2016_183_MOESM1_ESM.pdf
Additional file 1. Computation of \( {\text{E}}\left[ {{\text{X}}_{\text{im}}^{{{\text{d}}_{\text{i}} }} {\text{X}}_{\text{jm}}^{{{\text{d}}_{\text{j}} }} \cdots {\text{X}}_{\text{Km}}^{{{\text{d}}_{\text{K}} }} } \right] \) as a function of betweenchromosome identity coefficients, in the case of independent markers. Using an extension of the identity coefficients theory, the document shows how to compute the elements of the expectation E[XX^{′}TXX^{′}] (i.e. \( {\text{E}}\left[ {X_{i} X_{j} } \right], {\text{E}}\left[ {X_{i} X_{j}^{2} } \right], {\text{E}}\left[ {X_{i} X_{j}^{3} } \right], {\text{E}}\left[ {X_{i}^{2} X_{j}^{2} } \right], \) \( {\text{E}}\left[ {X_{i} X_{j} X_{k}^{2} } \right] \) and E[X _{ i } X _{ j } X _{ k } X _{ l }]) when markers are independent and individuals are related.
12711_2016_183_MOESM2_ESM.pdf
Additional file 2. Expectations of \( \mathop \sum \limits_{m} \sigma_{m}^{2} , \mathop \sum \limits_{m} \sigma_{m}^{4} , \mathop \sum \limits_{m} \rho_{m} \), ∑_{ m } ρ ^{2}_{ m } , ∑_{ m } ρ ^{2}_{ m } /σ ^{2}_{ m } and ∑_{ m }1/σ ^{2}_{ m } . The expectations of the listed quantities are computed assuming either a Ushaped or a uniform distribution of allele frequencies.
12711_2016_183_MOESM3_ESM.pdf
Additional file 3. Precision formulae when the candidate is related to reference individuals. The approximated formulae derived in the main text are applied to the case of a candidate for which the sire belongs to the reference population.
12711_2016_183_MOESM4_ESM.pdf
Additional file 4. Equivalent numbers of independent loci. (1) equation of the equivalent number of independent loci which gives the precision \( {\text{E}}\left[ {r_{{q_{\text{c}} ,\hat{q}_{c} }}^{2} } \right]\sim \frac{{{\text{E}}_{{\mathbf{X}}} \left[ {{\text{v}}(\hat{q}_{c} {\mathbf{X}})} \right]}}{{{\text{E}}_{{\mathbf{X}}} \left[ {{\text{v}}\left( {q_{\text{c}} {\mathbf{X}}} \right)} \right]}} \) obtained with the total number of nonindependent markers; (2) simple approximation in a very simplified situation; (3) relations between number of independent markers and size of the reference population.
12711_2016_183_MOESM5_ESM.pdf
Additional file 5. The case of markers in linkage disequilibrium. Derivation of the crossed terms expectation of genomic values (E[x _{ cl } x _{ cm } X _{ il } X _{ im }]) in the situation of LD between loci l and m. Many examples are given for diverse situations.
12711_2016_183_MOESM6_ESM.pdf
Additional file 6. Another demonstration of Goddard et al. [18] accuracy. A complete demonstration is given using the notations of the present paper.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Elsen, JM. Approximated prediction of genomic selection accuracy when reference and candidate populations are related. Genet Sel Evol 48, 18 (2016). https://doi.org/10.1186/s1271101601833
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1271101601833
Keywords
 Genomic Selection
 Candidate Population
 Genomic Estimated Breeding Values (GEBV)
 Independent Loci
 Reference Population Size