Consider (19), focusing on the fitness function \(H\left( {\mathbf{y}}_{obs},{{\varphi ,\theta }}\right)\) defined in (20). The distribution \(\left[ {{\mathbf{y}} }_{miss}|{\mathbf{y}}_{obs},{\theta}\right]\) follows from the statistical model assumed. For example, if the joint distribution of \({\mathbf{y}}_{miss}\) and \({\mathbf{y}}_{obs}\), given \({\theta}\), is normal, the conditional distribution is normal as well, with mean vector and covariance matrix readily derived from theory. Next, one would need to assume a selection model representing the distribution of the missing data pattern \(\left( {\mathbf {r}}\right)\). Four examples are presented below to illustrate concepts; they are motivated by those described in [22] and adapted to a Bayesian perspective.
Example 1: selection based on records not available for analysis
Country B buys frozen semen of m out of n bulls \(\left( m<n\right)\) in country A. The m bulls chosen exceed a minimum threshold of “performance” t based on information provided by country A. Country B develops a breeding program using such m bulls and collects records. Only records from country B are available for analysis. The performances in the two countries are regarded as distinct traits [34], a concept that has been employed in global dairy cattle breeding.
Assuming conditional (given \({\theta}\)) independence between records, the data generating model for the m records observed in country B is:
$$p\left( {\mathbf {y}}_{obs}|{\theta}\right) = { \prod \limits _{i=1}^{m}} p\left( y_{iB}|{\theta}\right) ,$$
(24)
where \(y_{iB}\) is the performance of bull i (\(i=1,2,\ldots,m\)). Following [22], the conditional probability of selection involves two binary indicator variables, \(r_{iA}\) and \(r_{iB},\) where the values 0 and 1 mean “missing” and “observed”, respectively. For \(i=1,2,\ldots,n\), where n is the number of bulls in country A,
$$\Pr \left( r_{iA}=0|{\mathbf {y}}_{obs},{\mathbf {y}}_{miss}\right) =1\text { and }\Pr \left( r_{iB}=1|{\mathbf {y}}_{obs},{\mathbf {y}}_{miss}\right) =1_{\left( t,\infty \right) }\left( y_{iA}\right) .$$
(25)
The value \(r_{iA}=0\) for all i means that no records from the exporting country (A) are available (“missing”); \(r_{iB}=1\) means that records on bull i are observed in country B only if bull i exceeded selection threshold t in country A.
The joint density of all data (observed and missing) and of \({\mathbf {r}}\) is:
$$p\left( {\mathbf {y}}_{obs},{\mathbf {y}}_{miss},{\mathbf {r}}|{\theta}\right) = {\prod \limits _{i=1}^{m}} p\left( y_{iA},y_{iB}|{\theta}\right) 1_{\left( t,\infty \right) }\left( y_{iA}\right) {\prod \limits _{i=m+1}^{n}} p\left( y_{iA},y_{iB}|{\theta}\right) 1_{\left( -\infty ,t\right) }\left( y_{iA}\right) .$$
(26)
The missing data include all n records from country A, as well as the \(n-m\) records that would have been observed in country B had the corresponding bulls performed above threshold t in A. Integrating Eq. (26) with respect to the missing data yields:
$$\begin{aligned} & p\left( {\mathbf {y}}_{obs},{\mathbf {r}}|{\theta}\right) \\&=\left\{ { \prod \limits _{i=1}^{m}} { \int } p\left( y_{iA},y_{iB}|{\theta}\right) 1_{\left( t,\infty \right) }\left( y_{iA}\right) dy_{iA}\right\} \left\{ { \prod \limits _{i=m+1}^{n}} { \iint } p\left( y_{iA},y_{iB}|{\theta}\right) 1_{\left( -\infty ,t\right) }\left( y_{iA}\right) dy_{iA}dy_{iB}\right\} \\& = { \prod \limits _{i=1}^{m}} p\left( y_{iB}|{\theta}\right) \left[ { \prod \limits _{i=1}^{m}} \Pr \left( y_{iA}>t|y_{iB},{\theta}\right) { \prod \limits _{i=m+1}^{n}} \Pr \left( y_{iA}<t|{\theta}\right) \right] . \end{aligned}$$
(27)
An analysis that uses the observed data alone ignores the information on the selection process carried by the two terms between brackets.
Here, \({{\varphi}}=t\) (known) is the only parameter that governs the missing data process. The posterior density after accounting for selection is:
$$\begin{aligned}&p\left( {\theta} {|} {\mathbf{y}}_{obs},\mathbf {r}, Hyp\right) \\&\propto { \prod \limits _{i=1}^{m}} p\left( y_{iB}|{\theta}\right) p\left( {\theta} {|}Hyp\right) \left[ { \prod \limits _{i=1}^{m}} \Pr \left( y_{iA}>t|y_{iB},{\theta}\right) { \prod \limits _{i=m+1}^{n}} \Pr \left( y_{iA}<t|{\theta}\right) \right] \\&\propto p\left( {\theta} {|} {\mathbf{y}}_{obs},Hyp\right) H\left( {\mathbf {y}}_{obs},t,{\theta}\right) , \end{aligned}$$
(28)
where \(H\left( {\mathbf {y}}_{obs},t,{\theta}\right)\) is the term in brackets.
This simple selection scheme produces an analytically intractable problem. In the absence of selection, let performances in countries A and B have the bivariate distribution:
$$\left[ \begin{array}{c} y_{iA}\\ y_{iB} \end{array} \right] |\mu _{A},\mu _{B},{{\Sigma}{\sim}}N\left( \left[ \begin{array}{c} \mu _{A}\\ \mu _{B} \end{array} \right] ,\left[ \begin{array}{cc} \sigma _{A}^{2} & \rho \sigma _{A}\sigma _{B}\\ \rho \sigma _{A}\sigma _{B} & \sigma _{B}^{2} \end{array} \right] \right) ;i=1,2, \ldots n.$$
(29)
Assume such vectors are mutually independent, with \(\mu _{A}\) and \(\mu _{B}\) being country means, \(\rho\) the correlation coefficient between performances, and \(\sigma _{A}^{2}\) and \(\sigma _{B}^{2}\) the variances in countries A and B, respectively. Suppose all parameters are known, save for \(\mu _{B}\), the mean performance in the importing country. The conditional distribution \(\left[ y_{iA}|y_{iB},{\theta}\right]\) is:
$$y_{iA}|y_{iB},{{\theta}{{\sim }}}N\left( \mu _{A.B}=\mu _{A}+b\left( y_{iB}-\mu _{B}\right) ,v_{A.B}=\sigma _{A}^{2}\left( 1-\rho ^{2}\right) \right) ;$$
(30)
above, \(b=\dfrac{\rho \sigma _{A}\sigma _{B}}{\sigma _{B}^{2}}\) is the regression of performance in A on that in B. Using (28) and assigning a flat prior to \(\mu _{B}\), the posterior density is:
$$\begin{aligned} p\left( \mu _{B}{|}{\mathbf{y}}_{obs},\mathbf {r},Hyp\right)&\propto \exp \left[ -\frac{ { \sum \limits _{i=1}^{m}} \left( y_{iB}-\mu _{B}\right) ^{2}}{2\sigma _{B}^{2}}\right] { \prod \limits _{i=1}^{m}} \left[ 1-\Phi _{i,A.B}\left( t\right) \right] { \prod \limits _{i=m+1}^{n}} \Phi _{i,A}\left( t\right) \\&\propto \exp \left[ -\dfrac{ { \sum \limits _{i=1}^{m}} \left( y_{iB}-\mu _{B}\right) ^{2}}{2\sigma _{B}^{2}}\right] { \prod \limits _{i=1}^{m}} \left[ 1-\Phi _{i,A.B}\left( t\right) \right] , \end{aligned}$$
(31)
where \(\Phi _{i,A.B}\) is the distribution function of Eq. (30), which depends on \(\mu _{B}.\) The expression involving \(\Phi _{i,A}\), the distribution function of performance in country A, is absorbed by the integration constant. If selection were ignored, the posterior distribution would be proportional to the Gaussian kernel above with mean (mode) \({\widehat{\mu }}_{B}=\dfrac{ { \sum \limits _{i=1}^{m}} y_{iB}}{m}.\) Under selection, however, the posterior mean cannot be written in closed form and locating the mode requires an iterative procedure. If \(\sigma _{A}^{2}=\sigma _{B}^{2}=1\) and \(\mu _{A}=0,\) for instance, the posterior density takes the form:
$$p\left( \mu _{B}{|}{\mathbf{y}}_{obs},\mathbf {r},Hyp\right) =\frac{\exp \left\{ -\frac{1}{2}\left[ m\left( \mu _{B}-{\widehat{\mu }}_{B}\right) ^{2}-2f\left( \mu _{B}\right) \right] \right\} }{ { \int } \exp \left\{ -\frac{1}{2}\left[ m\left( \mu _{B}-{\widehat{\mu }}_{B}\right) ^{2}-2f\left( \mu _{B}\right) \right] \right\} d\mu _{B}}$$
(32)
where
$$f\left( \mu _{B}\right) = { \sum \limits _{i=1}^{m}} \log \left[ 1-\Phi _{i}\left( \dfrac{t-\rho \left( y_{iB}-\mu _{B}\right) }{\sqrt{1-\rho ^{2}}}\right) \right] .$$
(33)
To illustrate how accounting for selection may lead to correct Bayesian inference, we simulated \(n=1000\) pairs from a bivariate standard normal distribution with \(\rho =0.8\). Selection operated on trait A by picking individuals with phenotypes above the mean, or 1, 1.5 and 2 standard deviations above the mean. Such selection produced samples of sizes \(m=494, 155, 53\) and 18 on which performance for trait B was available. The only parameter treated as unknown was \(\mu _{B}\), with a flat prior attached to it. If selection were ignored, the posterior distribution would be \(\mu _{B}|y_{B}\sim N\left( {\widehat{\mu }}_{B},m^{-1}\right) ;\) if selection is accounted for, the posterior density is as in (32). Figure 1 shows that ignoring selection grossly overstated the true value of the parameter (0). In contrast, the true value was assigned appreciable density in the “correct” posterior distributions, irrespective of the selection intensity applied. Note that the fitness function (missing data process) employed corresponds exactly to how selection was simulated; an incorrect formulation of the selection process would probably have produced distorted inferences. The example illustrates that a proper Bayesian analysis may capture true parameter values in situations of non-ignorable selection when the latter is modeled properly.
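The computation is simple enough to reproduce. The following Python sketch (hypothetical code, not that used for Fig. 1) simulates the scheme under the 1 standard-deviation threshold and evaluates the posterior density of Eqs. (32) and (33) on a grid:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, rho, t = 1000, 0.8, 1.0  # t = 1: selection 1 SD above the mean on trait A

# Simulate n pairs (y_A, y_B) from a standard bivariate normal, correlation rho
yA = rng.standard_normal(n)
yB = rho * yA + np.sqrt(1.0 - rho**2) * rng.standard_normal(n)

# Selection: trait B is recorded only for individuals with y_A > t
yB_obs = yB[yA > t]
m = yB_obs.size
mu_naive = yB_obs.mean()  # posterior mean (and mode) if selection is ignored

def log_post(mu):
    """Log of the unnormalized posterior, Eqs. (32)-(33), with sigma_B = 1."""
    f = norm.logsf((t - rho * (yB_obs - mu)) / np.sqrt(1.0 - rho**2)).sum()
    return -0.5 * m * (mu - mu_naive) ** 2 + f

grid = np.linspace(-1.5, 1.5, 2001)
lp = np.array([log_post(mu) for mu in grid])
post = np.exp(lp - lp.max())
post /= post.sum() * (grid[1] - grid[0])  # numerical analogue of Eq. (32)
print(m, mu_naive, grid[post.argmax()])
```

With \(t=1\), the naive mean sits near \(\rho \phi \left( t\right) /\left[ 1-\Phi \left( t\right) \right] \approx 1.2\), whereas the mode of the corrected posterior lies close to the true value of 0.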
Example 2: pre-selected samples
The example is motivated by a situation in animal and plant breeding that has arisen during the genomic era and that produces what is called “pre-selection bias”. It was studied by Patry and Ducrocq [35] in a dairy cattle setting, but from a perspective different from the one employed in our paper. In New Zealand dairy cattle, Winkelman et al. [36] proposed a method of genetic evaluation that combined a Gaussian kernel constructed from genomic information with features of single-step BLUP methodology. The procedure did not use the notions of missing data or of fitness functions. With real records, they found that the proposed methodology delivered a smaller predictive bias and a higher predictive correlation than a previously used procedure that “blended” pedigree and genomic information. The correlation improved by 1–2% for production traits, but negligibly for traits such as fertility or longevity. In a simulation study, [37] produced 15 generations of selection. At that point, parents were preselected with various degrees of intensity using either parental averages, a genome-based choice, or at random. Subsequently, they estimated genomic breeding values of preselected animals with a single-step BLUP procedure that excluded all the information from pre-culled individuals. They were not able to detect bias in the evaluations of such animals. Wang et al. [38] considered the impact of genomic selection on estimates of variance components obtained using different similarity kernels, employing simulation and real data from broilers. When genotyping was at random, estimates obtained with a single-step model did not exhibit bias in the simulated data sets; otherwise, estimates had a marked bias. The impact of such bias on estimated breeding values was noticeable. It is unclear to what extent the results from these three studies generalize to other forms of selection, since no general prescription was considered. Simulations provide “local” guidance only: results may change drastically if the assumptions adopted or the structure of the data are varied. These researchers, however, seemed aligned with the view that accounting for the history of the selection process as completely as possible can attenuate the impact of selection on inference and prediction, rendering selection quasi-ignorable.
When genomic selection began to be applied [10], decisions had to be made on which individuals (e.g., bulls) to genotype for single nucleotide polymorphisms (SNPs). Due to the high cost of SNP chips, not all candidates for selection could be genotyped; a similar situation occurs now with next-generation sequence data or with expensive epigenomic or metabolomic measurements. Suppose that m out of n (\(m<n\)) dairy bulls that possess pedigree-based estimates of breeding value are chosen for genotyping, with “genomic breeding values” estimated as if the m bulls were randomly sampled. A “pre-selection bias” is expected to accrue since the m bulls chosen may not be representative of the current population. Can the distortion in inference be tempered analytically?
For illustration, a wheat yield data set examined in several studies and available in the BGLR package was used [39,40,41]. The dataset spans \(n=599\) inbred lines of wheat, each genotyped for \(p=1279\) binary markers that denote presence or absence of an allele at a locus. The target phenotype was wheat grain yield in each line planted in “environment 1”. The dataset also includes a pedigree-based additive relationship matrix, \({\mathbf {A}},\) of size 599 \(\times\) 599 and several of the lines are completely inbred. A genomic relationship matrix [17] among lines was built as \({\mathbf {G}}=\mathbf {XX}^{\prime }/p\) where \({\mathbf {X}}\) was the \(599\times 1279\) matrix of marker codes (0,1) with each column centered and standardized. Using the entire dataset, i.e., without any selection practiced, genetic (pedigree or genome based) and residual variance components were estimated via maximum likelihood. The random effects models used were \({\mathbf {y}}={\mathbf {a}}+{\mathbf {e}}\) and \({\mathbf {y}}={\mathbf {g}}+{\mathbf {e}} ^{*}\) in pedigree-based and genome-enabled analyses, respectively, where \({\mathbf {a}}\) and \({\mathbf {g}}\) are pedigree and genomic breeding values to be learned, with \({\mathbf {e}}\) and \({\mathbf {e}}^{*}\) being model residuals. We posed \(\mathbf {a|}\sigma _{a}^{2}\sim N\left( {\mathbf {0}},{\mathbf {A}}\sigma _{a}^{2}\right) ,\) \(\mathbf {e|}\sigma _{e}^{2}\sim N\left( {\mathbf {0}} ,{\mathbf {I}}\sigma _{e}^{2}\right)\) as mutually independent, and likewise for \({\mathbf {g}}{{|}}\sigma _{g}^{2}\sim N\left( {\mathbf {0}},{\mathbf {G}}\sigma _{g} ^{2}\right) ,\) \({\mathbf {e}}^{*}{{|}}\sigma _{e^{*}}^{2}\sim N\left( {\mathbf {0}},{\mathbf {I}}\sigma _{e^{*}}^{2}\right) ;\) the \(\sigma ^{2\prime }s\) are variance components. Using the maximum likelihood estimates of variances as true values, best linear unbiased predictions of \({\mathbf {a}}\) and \({\mathbf {g}}\) were calculated, \(\widehat{{\mathbf {a}}}\) and \(\widehat{{\mathbf {g}} },\) respectively. Under a Bayesian framework the posterior distributions of the pedigree and genomic breeding values are \({\mathbf {a}}{|}{\mathbf{y}},\sigma _{a} ^{2},\sigma _{e}^{2}{\sim}N\left( \widehat{{\mathbf {a}}},{\mathbf {C}} _{a}\right)\) and \({\mathbf {g}}{|}{\mathbf {y}},\sigma _{g}^{2},\sigma _{e^{*}} ^{2}{\sim}N\left( \widehat{{\mathbf {g}}},\mathbf {C}_{g}\right)\), with variance components treated as known hyper-parameters. Here,
$$\widehat{{\mathbf {a}}}=\left( {\mathbf {I}}+\frac{\sigma _{e}^{2}}{\sigma _{a}^{2} }{\mathbf {A}}^{-1}\right) ^{-1}{\mathbf {y}}\text { and }\widehat{{\mathbf {g}} }=\left( {\mathbf {I}}+\frac{\sigma _{e^{*}}^{2}}{\sigma _{g}^{2}} {\mathbf {G}}^{-1}\right) ^{-1}\mathbf {y},$$
(34)
give the posterior expectations and
$${\mathbf {C}}_{a}=\left( {\mathbf {I}}+\frac{\sigma _{e}^{2}}{\sigma _{a}^{2} }{\mathbf {A}}^{-1}\right) ^{-1}\sigma _{e}^{2}\text { and }{\mathbf {C}}_{g}=\left( {\mathbf {I}}+\frac{\sigma _{e^{*}}^{2}}{\sigma _{g}^{2}}{\mathbf {G}}^{-1}\right) ^{-1}\sigma _{e^{*}}^{2},$$
(35)
are the posterior covariance matrices. In a frequentist setting, the posterior means correspond to \(BLUP\left( {\mathbf {g}}\right)\) and \(BLUP\left( {\mathbf {a}}\right) ,\) whereas \({\mathbf {C}}_{a}\) and \({\mathbf {C}}_{g}\) are interpreted as prediction error covariance matrices.
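To make these computations concrete, a minimal Python sketch is given below (function names are ours and hypothetical; X denotes the \(599\times 1279\) marker matrix of the wheat dataset, assumed already loaded):

```python
import numpy as np

def genomic_relationship(X):
    """G = ZZ'/p, with Z the column-wise centered and standardized marker
    matrix; assumes no monomorphic markers (zero-variance columns)."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    return Z @ Z.T / X.shape[1]

def posterior_mean_cov(y, K, var_k, var_e):
    """Posterior mean (BLUP) and covariance under y = u + e, with
    u ~ N(0, K*var_k) and e ~ N(0, I*var_e); Eqs. (34) and (35).
    K is the pedigree (A) or genomic (G) relationship matrix."""
    C = np.linalg.inv(np.eye(y.size) + (var_e / var_k) * np.linalg.inv(K))
    return C @ y, C * var_e  # Eq. (34); Eq. (35)
```

The same routine serves both models: \(\widehat{{\mathbf {a}}}\) follows from setting \(K={\mathbf {A}}\), and \(\widehat{{\mathbf {g}}}\) from \(K={\mathbf {G}}\).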
Using the 599 values in \(\widehat{{\mathbf {a}}},\) we selected lines with pedigree breeding value estimates larger than the threshold \(t=\) 0.20, 0.40 or 0.60, resulting in 231, 122 and 60 “top” lines, respectively. The posterior distributions of \({\mathbf {g}}\) were calculated before and after selection (ignoring the missing data process but using the same variance components) and compared. As depicted in Fig. 2, the analysis based on selected lines tended to overstate estimates of genomic breeding values relative to those obtained without selection. Ignoring selection introduces a selection “bias” that is impossible to evaluate because the true breeding values are unknown, except when data are simulated. Letting \(M_{a}\) and \(M_{g}\) denote pedigree and genome-based models, respectively, note that
$$\begin{aligned} E\left( \widehat{{\mathbf {a}}}|\mathbf {a},M_{a}\right)&=\left( {\mathbf {I}}+\frac{\sigma _{e}^{2}}{\sigma _{a}^{2}}{\mathbf {A}}^{-1}\right) ^{-1}E\left( {\mathbf {y}}|{\mathbf {a}}\right) =\left( {\mathbf {I}}+\frac{\sigma _{e}^{2}}{\sigma _{a}^{2}}{\mathbf {A}}^{-1}\right) ^{-1}\mathbf {a}, \end{aligned}$$
(36)
$$\begin{aligned} E\left( \widehat{{\mathbf {g}}}|\mathbf {g},M_{g}\right)&=\left( {\mathbf {I}}+\frac{\sigma _{e^{*}}^{2}}{\sigma _{g}^{2}}{\mathbf {G}}^{-1}\right) ^{-1}E\left( {\mathbf {y}}|{\mathbf {g}}\right) =\left( {\mathbf {I}}+\frac{\sigma _{e^{*}}^{2}}{\sigma _{g}^{2}}{\mathbf {G}}^{-1}\right) ^{-1} \mathbf {g}. \end{aligned}$$
(37)
Hence, both \(\widehat{{\mathbf {a}}}\) and \(\widehat{{\mathbf {g}}}\) have an “epistemic” bias [24]. Such bias differs from the notion used by Henderson [42, 43], who defined “prediction unbiasedness” as \(E\left( \widehat{{\mathbf {a}}}\right) =E\left( {\mathbf {a}}\right)\) under \(M_{a}\) or \(E\left( \widehat{{\mathbf {g}}}\right) =E\left( {\mathbf {g}}\right)\) under \(M_{g}\), i.e., posterior means (BLUP) are unbiased for the mean of the prior distributions, but not for the estimands \({\mathbf {a}}\) and \({\mathbf {g}}\). Selection introduces an additional distortion relative to “true” breeding values, pedigree or genome-defined.
How does one account for the distortion in inference? Our representation of selection follows the protocol employed in the example, and threshold t is the only parameter governing selection here. Let “sel” and “nsel” denote selected and unselected individuals, respectively. Following (28), and assuming that \(\left[ \widehat{{\mathbf {a}}}|\mathbf {y,g}\right] =\left[ \widehat{{\mathbf {a}}}|{\mathbf {g}}\right]\) (i.e., given \(\mathbf {g},\) \(\widehat{{\mathbf {a}}}\) is independent of \({\mathbf {y}}\)), the posterior density of the genomic breeding values after accounting for selection is:
$$\begin{aligned}&p\left( \mathbf {g|y}_{obs},\mathbf {r},Hyp\right) \\&\quad \propto { \prod \limits _{i=1}^{m}} p\left( y_{i}|g_{sel,i}\right) p\left( \mathbf {g|}Hyp\right) \left[ { \prod \limits _{i=1}^{m}} \Pr \left( {\widehat{a}}_{sel,i}>t|{\mathbf {g}}_{sel}\right) { \prod \limits _{i=m+1}^{n}} \Pr \left( {\widehat{a}}_{nsel,i}<t|{\mathbf {g}}_{nsel}\right) \right] \\&\quad \propto \left[ { \prod \limits _{i=1}^{m}} p\left( y_{i}|g_{sel,i}\right) p\left( {\mathbf {g}}_{sel}{|} Hyp\right) \right] \times \left[ { \prod \limits _{i=1}^{m}} \Pr \left( {\widehat{a}}_{sel,i}>t|g_{sel,i}\right) \right. \\&\quad \left. { \prod \limits _{i=m+1}^{n}} \Pr \left( {\widehat{a}}_{nsel,i}<t|g_{nsel,i}\right) \right] p\left( {\mathbf {g}}_{nsel}\mathbf {|g}_{sel},Hyp\right) . \end{aligned}$$
(38)
The first term in brackets is the posterior density of the genomic breeding values calculated from selected data only but ignoring selection. The joint fitness function [second term in brackets in (38)] assumes that the choice of an individual for genotyping is based only on whether or not t is exceeded, independently of what happens with other individuals, but conditionally on the unknown genomic breeding value of the individual in question. Integrating (38) with respect to \({\mathbf {g}}_{nsel}\) produces
$$\begin{aligned}&p\left( {\mathbf {g}}_{sel}{|}{\mathbf{y}}_{obs},\mathbf {r},Hyp\right) \\&\propto p\left( {\mathbf {g}}_{sel}{|}{\mathbf{y}}_{obs},Hyp\right) { \prod \limits _{i=1}^{m}} \Pr \left( {\widehat{a}}_{sel,i}>t|g_{sel,i}\right) \\&\times { \int } { \prod \limits _{i=m+1}^{n}} \Pr \left( {\widehat{a}}_{nsel,i}<t|g_{nsel,i}\right) p\left( {\mathbf {g}} _{nsel}\mathbf {|g}_{sel},Hyp\right) d{\mathbf {g}}_{nsel}. \end{aligned}$$
(39)
Since unselected individuals are not genotyped, the distribution \(p\left( {\mathbf {g}}_{nsel}\mathbf {|g}_{sel},Hyp\right)\) cannot be constructed: it would be Gaussian, with mean \({\mathbf {G}}_{nsel,sel}{\mathbf {G}}_{sel,sel}^{-1}{\mathbf {g}}_{sel}\) and covariance matrix \({\mathbf {G}}_{nsel,nsel}-{\mathbf {G}}_{nsel,sel}{\mathbf {G}}_{sel,sel}^{-1}{\mathbf {G}}_{sel,nsel}\), but the blocks of \({\mathbf {G}}\) involving unselected individuals are unavailable. Hence, the integral in (39) does not convey information on \({\mathbf {g}}_{sel}\) and is treated as a constant. This is an important matter and may adversely affect the ability to account for selection. Finally,
$$p\left( {\mathbf {g}}_{sel}{|}{\mathbf{y}}_{obs},\mathbf {r},Hyp\right) \propto \exp \left[ -\frac{1}{2\sigma _{e^{*}}^{2}}\left( {\mathbf {g}}_{sel} -\widetilde{{\mathbf {g}}}_{sel}\right) ^{\prime }\widetilde{{\mathbf {C}}} _{g,sel}^{-1}\left( {\mathbf {g}}_{sel}-\widetilde{{\mathbf {g}}}_{sel}\right) +f_{sel}\left( {\mathbf {g}}_{sel}\right) \right] ,$$
(40)
where,
$$\widetilde{{\mathbf {g}}}_{sel}=\widetilde{{\mathbf {C}}}_{g,sel}{\mathbf {y}}_{sel},$$
(41)
$$\widetilde{{\mathbf {C}}}_{g,sel}=\left[ {\mathbf {I}}_{sel,sel}+\frac{\sigma _{e^{*}}^{2}}{\sigma _{g}^{2}}{\mathbf {G}}_{sel,sel}^{-1}\right] ^{-1},$$
(42)
and
$$f_{sel}\left( {\mathbf {g}}_{sel}\right) = { \sum \limits _{i=1}^{m}} \log \left[ \Pr \left( {\widehat{a}}_{sel,i}>t|g_{sel,i}\right) \right] .$$
(43)
We note that it might be possible to approximate the distribution \(\left[ {\mathbf {g}}_{nsel}\mathbf {|g}_{sel},Hyp\right]\) by making an imputation from pedigree information, as in single-step methods [44]; however, this is a technical matter beyond the scope of our paper.
Simplifying assumptions are required in order to proceed. A canonical case is one where individuals are independently and identically distributed. Without selection, let
$$\left[ \begin{array}{c} g_{i}\\ {\widehat{a}}_{i} \end{array} \right] \sim N\left( \left[ \begin{array}{c} 0\\ 0 \end{array} \right] ,\left[ \begin{array}{cc} \sigma _{g}^{2} & \rho _{g{\widehat{a}}}\sigma _{g}\sigma _{{\widehat{a}}}\\ \rho _{g{\widehat{a}}}\sigma _{g}\sigma _{{\widehat{a}}} & \sigma _{{\widehat{a}}}^{2} \end{array} \right] \right) ;i=1,2,\ldots,n,$$
(44)
where \(\rho _{g{\widehat{a}}}\) is the expected correlation between unknown genomic breeding value and pedigree-based posterior mean (BLUP here) and the \(\sigma ^{2}\) are variance parameters. In real applications \(\rho _{g{\widehat{a}}}\) actually varies over candidates due to unequal amounts of information. In the simplest case, \(\sigma _{{\widehat{a}}}=\sqrt{Var\left( h_{a}^{2}y_{i}\right) }=h_{a}^{2}\sqrt{\sigma _{a}^{2}+\sigma _{e}^{2}}\), where \(\sigma _{a}^{2}\) and \(\sigma _{e}^{2}\) pertain to the pedigree-based model and \(h_{a}^{2}=\dfrac{\sigma _{a}^{2}}{\sigma _{a}^{2}+\sigma _{e}^{2}}\) is heritability. Then, \(E\left( {\widehat{a}}_{i}|g_{i}\right) =\rho _{g{\widehat{a}}}\dfrac{\sigma _{{\widehat{a}}}}{\sigma _{g}}g_{i},\) and \(Var\left( {\widehat{a}}_{i}|g_{i}\right) =h_{a}^{4}\left( \sigma _{a}^{2}+\sigma _{e} ^{2}\right) \left( 1-\rho _{g{\widehat{a}}}^{2}\right)\) for all i. Dispersion parameter estimates were \(\sigma _{a}^{2}=0.2859,\sigma _{e} ^{2}=0.5761,h_{a}^{2}=0.3316\) and \(\sigma _{g}^{2}=0.5315\). Since there is no information on \(\rho _{g{\widehat{a}}}\), using the unselected data we crudely estimated \(\rho _{{\widehat{g}}{\widehat{a}}}\) at 0.82 and took \(\rho _{g{\widehat{a}}}=0.75\) for the example. In order to account somehow for the fact that individuals were not identically distributed, the following modifications of the previous formulae were made using BLUP theory: \(\sigma _{{\widehat{a}}}\rightarrow \sqrt{\sigma _{{\widehat{a}}_{i}}^{2}}\); \(\sigma _{g}\rightarrow \sqrt{\sigma _{g_{i}}^{2}}\) and \(Var\left( {\widehat{a}}_{i}|g_{i}\right) =\sigma _{{\widehat{a}}_{i}}^{2}\left( 1-\dfrac{\sigma _{{\widehat{a}}_{i}}^{2}}{\sigma _{g_{i}}^{2}}\right)\) for \(i=1,2,\ldots,n\). Here, for example, \(\rho _{g_{i}{\widehat{a}}_{i}}\) is the correlation value specific to individual i and \(\sigma _{g_{i}}\) is the square root of the appropriate diagonal element of \({\mathbf {G}}\sigma _{g}^{2}\). The posterior density of genomic breeding values after selection was therefore:
$$\begin{aligned} & p \left({\mathbf {g}}_{sel} | {\mathbf{y}}_{obs},{\mathbf{r}}, {Hyp}\right) \\ & \quad \propto \exp \left\{ -\frac{{1}}{{2\sigma_{e^*}^{2}}} \left({\mathbf{g}}_{sel} - {\widetilde{\mathbf{g}}}_{sel} \right)^{\prime} {\widetilde{\mathbf{C}}}_{g,sel}^{-1} \left( {\mathbf{g}}_{sel} -{\widetilde{\mathbf{g}}}_{sel}\right) + \sum\limits_{i=1}^{m} \log [ 1-\Phi ( z_{i}) ] \right\} , \end{aligned}$$
(45)
where
$$z_{i}=\frac{t-\rho _{g_{i}{\widehat{a}}_{i}}\dfrac{\sigma _{\widehat{a_{i}}} }{\sigma _{g_{i}}}g_{i}}{\sqrt{\sigma _{{\widehat{a}}_{i}}^{2}\left( 1-\dfrac{\sigma _{{\widehat{a}}_{i}}^{2}}{\sigma _{g_{i}}^{2}}\right) }}.$$
(46)
and
$$f_{sel}\left( {\mathbf {g}}_{sel}\right) = { \sum \limits _{i=1}^{m}} \log \left[ 1-\Phi \left( z_{i}\right) \right] .$$
(47)
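For concreteness, the \(z_{i}\) of Eq. (46) translate directly into code; in this hypothetical Python sketch, the individual-specific correlations and standard deviations described above are passed as arrays:

```python
import numpy as np

def z_values(g_sel, t, rho_i, sd_ahat_i, sd_g_i):
    """z_i of Eq. (46) for each selected individual, given its genomic value
    g_i, the threshold t, and individual-specific dispersion parameters."""
    cond_mean = rho_i * (sd_ahat_i / sd_g_i) * g_sel      # E(a_hat_i | g_i)
    cond_var = sd_ahat_i**2 * (1.0 - sd_ahat_i**2 / sd_g_i**2)
    return (t - cond_mean) / np.sqrt(cond_var)
```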
The posterior distribution is not of a recognizable form, and Markov chain Monte Carlo sampling may be considered for inference of \({\mathbf {g}}_{sel}\). A candidate-generating distribution in a Metropolis scheme [25, 45] could be
$${\mathbf {g}}_{sel}^{*}{|}\mathbf {y},\sigma _{a}^{2},\sigma _{e} ^{2}{\sim}N\left( \widetilde{{\mathbf {a}}}_{sel}{,}\left( {\mathbf {I}}_{sel}+\frac{\sigma _{e}^{2}}{\sigma _{a}^{2}}{\mathbf {A}} _{sel,sel}^{-1}\right) ^{-1}\sigma _{e}^{2}\right) ,$$
(48)
where \(\widetilde{{\mathbf {a}}}_{sel}=\left( {\mathbf {I}}_{sel} +\frac{\sigma _{e}^{2}}{\sigma _{a}^{2}}{\mathbf {A}}_{sel,sel}^{-1}\right) ^{-1}{\mathbf {y}}_{sel}\). If \({\mathbf {g}}_{prop}^{*}\) is a draw from the proposal distribution, the probability of moving from state \({\mathbf {g}}^{now}\) to \({\mathbf {g}}_{prop}^{*}\) is \(\alpha =\min \left[ R\left( {\mathbf {g}} ^{now},{\mathbf {g}}_{prop}^{*}\right) ,1\right]\), where
$$R\left( {\mathbf {g}}^{now},{\mathbf {g}}_{prop}^{*}\right) =\exp \left[ -\frac{1}{2\sigma _{e^{*}}^{2}}Q\left( {\mathbf {g}}^{now},{\mathbf {g}} _{prop}^{*}\right) +\Delta _{sel}\left( {\mathbf {g}}^{now},{\mathbf {g}} _{prop}^{*}\right) \right] ,$$
(49)
and
$$\begin{aligned}&Q\left( {\mathbf {g}}^{now},{\mathbf {g}}_{prop}^{*}\right) \\&\quad =\left( {\mathbf {g}}_{prop}^{*}-\widetilde{{\mathbf {g}}}_{sel}\right) ^{\prime }\widetilde{{\mathbf {C}}}_{g,sel}^{-1}\left( {\mathbf {g}}_{prop}^{*}-\widetilde{{\mathbf {g}}}_{sel}\right) -\left( {\mathbf {g}}^{now} -\widetilde{{\mathbf {g}}}_{sel}\right) ^{\prime }\widetilde{{\mathbf {C}}} _{g,sel}^{-1}\left( {\mathbf {g}}^{now}-\widetilde{{\mathbf {g}}}_{sel}\right) , \end{aligned}$$
(50)
$$\Delta_{sel} \left( {\mathbf{g}}^{now},\,{\mathbf{g}}_{prop}^{*}\right) = {\sum\limits _{i=1}^{m}} \log \frac{1-\Phi \left( z_{i}^{prop}\right) }{1-\Phi ( z_{i}^{now} ) }.$$
(51)
The next state in the chain is given by the rule (U is a uniform random deviate):
$${\mathbf {g}}^{new}=\left\{ \begin{array}{c} {\mathbf {g}}_{prop}^{*}\text { if }U\le R\left( {\mathbf {g}}^{now} ,{\mathbf {g}}_{prop}^{*}\right) \\ {\mathbf {g}}^{now}\text { otherwise} \end{array} \right. .$$
(52)
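A minimal sketch of such a sampler follows; for simplicity it replaces the proposal of Eq. (48) with a symmetric random walk, so that the acceptance ratio reduces to the ratio of the target in Eq. (45) at the proposed and current states, mirroring Eqs. (49) to (51). The helper z_of_g (an assumption of this sketch) is taken to return the \(z_{i}\) of Eq. (46) for a candidate vector:

```python
import numpy as np
from scipy.stats import norm

def log_target(g, g_tilde, C_inv, var_estar, z_of_g):
    """Log of the unnormalized target in Eq. (45)."""
    d = g - g_tilde
    return -0.5 / var_estar * d @ C_inv @ d + norm.logsf(z_of_g(g)).sum()

def metropolis(g0, g_tilde, C_inv, var_estar, z_of_g, n_iter, step, rng):
    """Random-walk Metropolis; 'step' scales the proposal standard deviation."""
    g = g0.copy()
    lp = log_target(g, g_tilde, C_inv, var_estar, z_of_g)
    draws = np.empty((n_iter, g.size))
    for it in range(n_iter):
        prop = g + step * rng.standard_normal(g.size)  # symmetric proposal
        lp_prop = log_target(prop, g_tilde, C_inv, var_estar, z_of_g)
        if np.log(rng.random()) <= lp_prop - lp:       # accept w.p. min(R, 1)
            g, lp = prop, lp_prop
        draws[it] = g
    return draws
```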
A Metropolis sampler with Eq. (48) as a proposal-generating process was used to estimate the posterior distribution with density as given in Eq. (45) in scenarios where genotypes were available only for individuals whose \(\widehat{{\mathbf {a}}}\) values exceeded 0.2, 0.4 and 0.6, producing 231, 122 and 60 selected lines, respectively. The posterior distribution of genomic breeding values of the 599 lines prior to selection was \({\mathbf {g}}{|}\mathbf {y},\sigma _{g}^{2},\sigma _{e^{*}}^{2}{\sim}N\left( \widehat{{\mathbf {g}}},\mathbf {C}_{g}\right) ,\) as presented earlier. We also computed posterior distributions of the genomic breeding values from the data on selected lines while ignoring the selection process. Metropolis sampling was done by running three chains (one per selection intensity) of length 100,000 each. After diagnostic tests [45, 46], a conservative burn-in period of 20,000 samples was adopted. Subsequently, a single long chain of 480,000 iterations was run, with 380,000 samples used for inference, per setting. Figure 3 (left panels) depicts scatter plots of posterior means of genomic breeding values in the absence of selection (GBLUP without selection, y-axis) versus either GBLUP ignoring selection or posterior means accounting for selection (x-axis). Ignoring selection tended to overstate the estimates of genomic breeding values calculated without selection and from a larger sample (\(n=599\) versus \(n=231, 122, 60\) in the selection schemes). Accounting for selection produced estimated genomic breeding values (posterior means denoted as PM in the plots) that were even further away from those calculated without selection. The three panels at the right of Fig. 3 show larger differences between posterior means without selection (GBLUPunsel) and with selection accounted for (PM on the y-axis) than between GBLUPunsel and GBLUPsel, i.e., with the selection process ignored. Our way of accounting for selection produced larger absolute errors (taking GBLUPunsel as reference) than when selection was ignored, i.e., GBLUPsel. Posterior densities of the genomic breeding values of lines 5, 81, 206, 309, 343 and 499 are presented in Fig. 4; the data with selection pertain to the setting where 60 lines had been selected out of the 599 candidates. Densities ignoring selection (in red) tended to match those obtained without selection (in blue) more closely than did the densities obtained with a selection model incorporated into the inference (green). However, there was much uncertainty within each of the settings, leading to overlap, although the “green” density function was centered further to the right along the x-axis than the “blue” or “red” densities.
The following messages can be extracted from the example. First, paradoxically, our attempt at accounting for selection seemed to distort inference on the selected lines beyond what was obtained by ignoring selection: there was a noticeable overstatement of estimated breeding values. Second, it was not easy to account for selection even when the process leading to missing data was known. For instance, part of the information on selection had to be ignored because of the inability to write the conditional distribution of genomic breeding values of unselected individuals, given those of the selected ones. That may be what “single-step” methods (e.g., [44]) implicitly do by including ungenotyped (but pedigreed) individuals in the analysis. Third, we employed variance parameters estimated prior to selection, because the small sizes of the selected samples of lines precluded sufficiently precise estimates. Finally, at least in small samples, it is always difficult to disentangle the impact of non-randomness from that of noise, in view of the uncertainty remaining after analysis, irrespective of whether selection is ignored or modelled explicitly.
Example 3: multiple-trait sequential selection
A multiple-trait sequential selection scenario is described using two traits as an illustration, for simplicity. There is a set of S candidates (e.g., lines) with phenotypes \(y_{11},y_{12},\ldots,y_{1S}\) available for trait 1, where the first subscript denotes the trait. A subset \(S^{+}\) of the candidates is chosen to be measured for a second trait; the complementary subset \(S^{-}\) contains candidates that have phenotypes for trait 1 but not for trait 2. The dataset presented for analysis contains only the pairs \(\left\{ y_{1i},y_{2i}\right\}\) in \(S^{+}\); pairs in \(S^{-}\) are missing. Here, \({\mathbf{y}}_{obs}=\left( {\mathbf {y}}_{1S^{+}}^{\prime },{\mathbf {y}} _{2S^{+}}^{\prime }\right) ^{\prime }\) and \({\mathbf{y}}_{miss}=\left( {\mathbf {y}}_{1S^{-}}^{\prime },{\mathbf {y}}_{2S^{-}}^{\prime }\right) ^{\prime }\).
To define the distribution of the vector \({\mathbf {r}}\) describing the missing data pattern, an assumption about the selection process must be made. It will be assumed that a candidate is measured for trait 2 if its phenotype for trait 1 is larger than some \({{\varphi}}=y_{1,\min }\), a “minimum threshold of performance” for trait 1. Following [22], the conditional probability of selection (“fitness”) for candidate i is:
$$\Pr \left( r_{i}=1{{|}}{\mathbf{y}}_{obs}{,{\mathbf{y}} }_{miss}{,{\theta},}{{\varphi}}\right) =I_{\left( {{\varphi}} =y_{1,\min },\infty \right) }\left( y_{1i}\right) ;i=1,2,\ldots,S,$$
(53)
where \({\theta}\) are the unknown model parameters; \(I_{\left( y_{1,\min },\infty \right) }\left( y_{1i}\right) =1\) if \(y_{1i}>y_{1,\min }\) and 0 if \(y_{1i}\le y_{1,\min }\). If, given \({\theta}\), pairs \(\left\{ y_{1i},y_{2i}\right\}\) are mutually independent over individuals, the complete dataset and \({\mathbf {r}}\) have the joint density:
$$\begin{aligned}&p\left( {\mathbf{y}}_{obs},{\mathbf{y}}_{miss} ,{\mathbf {r}}|{{\varphi}}{,{\theta}}\right) \\&= { \prod \limits _{i\epsilon S^{+}}} p\left( y_{1i},y_{2i}|{\theta}\right) { \prod \limits _{i\epsilon S^{-}}} p\left( y_{1i},y_{2i}|{\theta}\right) I_{\left( -\infty ,{{\varphi}}\right) }\left( y_{1i}\right) \\&= { \prod \limits _{i\epsilon S^{+}}} p\left( y_{1i},y_{2i}|{\theta}\right) { \prod \limits _{i\epsilon S^{-}}} p\left( y_{2i}|y_{1i},{\theta}\right) p\left( y_{1i}|{\theta}\right) I_{\left( -\infty ,{{\varphi}}\right) }\left( y_{1i}\right) . \end{aligned}$$
(54)
Integrating out the missing data, i.e., \(\left\{ y_{1i},y_{2i}\right\}\) in \(S^{-}\), yields:
$$\begin{aligned}&p\left( {\mathbf{y}}_{obs},{\mathbf {r}}|{{{\varphi}},{\theta}}\right) \\&= { \prod \limits _{i\epsilon S^{+}}} p\left( y_{1i},y_{2i}|{\theta}\right) { \prod \limits _{i\epsilon S^{-}}} { \int } \left[ { \int } p\left( y_{2i}|y_{1i},{\theta}\right) dy_{2i}\right] I_{\left( -\infty ,{{\varphi}}\right) }\left( y_{1i}\right) p\left( y_{1i}|{\theta}\right) dy_{1i} \\&= { \prod \limits _{i\epsilon S^{+}}} p\left( y_{1i},y_{2i}|{\theta}\right) { \prod \limits _{i\epsilon S^{-}}} \Pr \left( y_{1i}<{{\varphi}}|{\theta}\right) . \end{aligned}$$
(55)
The posterior density is therefore:
$$\begin{aligned}&p\left( {\theta} {|} {\mathbf{y}}_{obs},\mathbf {r},Hyp\right) \\&\propto { \prod \limits _{i\epsilon S^{+}}} p\left( y_{1i},y_{2i}|{\theta}\right) p\left( {\theta}{|}Hyp\right) { \prod \limits _{i\epsilon S^{-}}} \Pr \left( y_{1i}<{{\varphi}}|{\theta}\right) \\&=\frac{p\left( {\theta} {|} {\mathbf{y}}_{obs},Hyp\right) \exp \left\{ { \sum \limits _{i\epsilon S^{-}}} \log \left[ \Pr \left( y_{1i}<{{\varphi}}|{\theta}\right) \right] \right\} }{ { \int } p\left( {\theta} {|} {\mathbf{y}}_{obs},Hyp\right) \exp \left\{ { \sum \limits _{i\epsilon S^{-}}} \log \left[ \Pr \left( y_{1i}<{{\varphi}}|{\theta}\right) \right] \right\} d{\theta}}. \end{aligned}$$
(56)
The wheat dataset was employed again to give a numerical illustration. The 599 lines have records in each of four distinct environments. To represent a scenario where selection does not occur, we fitted a four-variate linear model with the performances of the lines in different environments treated as distinct traits [34]. The model was:
$$\left( \begin{array}{c} {\mathbf {y}}_{1}\\ {\mathbf {y}}_{2}\\ {\mathbf {y}}_{3}\\ {\mathbf {y}}_{4} \end{array} \right) =\left( \begin{array}{c} {\mathbf {g}}_{1}\\ {\mathbf {g}}_{2}\\ {\mathbf {g}}_{3}\\ {\mathbf {g}}_{4} \end{array} \right) +\left( \begin{array}{c} {{\delta}}_{1}\\ {{\delta}}_{2}\\ {{\delta}}_{3}\\ {{\delta}}_{4} \end{array} \right) ,$$
(57)
where \({\mathbf {y}}_{i}\) \(\left( i=1,2,3,4\right)\) is a \(599\times 1\) vector of phenotypes for trait i, the \({\mathbf {g}}_{i}^{\prime }s\) are genomic breeding values for the trait (based on the 1279 markers), and \({{\delta}}_{i}\) is a trait-specific vector of model residuals. Prior assumptions were:
$$\begin{aligned} \left( \begin{array}{c} {\mathbf {g}}_{1}\\ {\mathbf {g}}_{2}\\ {\mathbf {g}}_{3}\\ {\mathbf {g}}_{4} \end{array} \right) |{\mathbf {G}}_{0}&\sim N\left( \left( \begin{array}{c} {\mathbf {0}}\\ {\mathbf {0}}\\ {\mathbf {0}}\\ {\mathbf {0}} \end{array} \right) ,{\mathbf {G}}_{0}\otimes {\mathbf {G}}\right) , \end{aligned}$$
(58)
$$\left( \begin{array}{c} {{\delta}}_{1}\\ {{\delta}}_{2}\\ {{\delta}}_{3}\\ {{\delta}}_{4} \end{array} \right) |{\mathbf {R}}_{0} \sim N\left( \left( \begin{array}{c} {\mathbf {0}}\\ {\mathbf {0}}\\ {\mathbf {0}}\\ {\mathbf {0}} \end{array} \right),{\mathbf {R}}_{0}\otimes {\mathbf {I}}_{599}\right),$$
(59)
where \({\mathbf {G}}\) is the genomic relationship matrix among the 599 lines; \({\mathbf {G}}_{0}\) is a \(4\times 4\) between-trait genomic covariance matrix and \({\mathbf {R}}_{0}\) is a \(4\times 4\) residual covariance matrix; residuals were independent between individuals, but a full covariance structure within individuals was posed. The four traits correspond to different locations so environmental correlations are expected to be null. However, we specified an unstructured \({\mathbf {R}}_{0}\) to account for correlations that are potentially created by non-additive genetic effects not accounted for in our additive genetic model, but known to exist for wheat yield. The two covariance matrices were estimated using a crude maximum likelihood procedure with ad-hoc adjustments to ensure positive-definiteness (a less crude but more involved algorithm would have given estimates that would not require any such adjustment). The estimates were subsequently taken as known and treated as hyper-parameters. The matrices (rounded values) were:
$${\mathbf {G}}_{0}=\left[ \begin{array}{cccc} 0.831 & -0.319 & -0.247 & -0.350\\ -0.319 & 0.750 & -0.195 & -0.213\\ -0.247 & -0.195 & 0.757 & -0.176\\ -0.350 & -0.213 & -0.176 & 0.752 \end{array} \right] ,$$
(60)
and
$${\mathbf {R}}_{0}=\left[ \begin{array}{cccc} 0.830 & -0.225 & -0.289 & -0.202\\ -0.225 & 0.872 & -0.330 & -0.317\\ -0.289 & -0.330 & 0.918 & -0.352\\ -0.202 & -0.317 & -0.352 & 0.895 \end{array} \right] .$$
(61)
The matrix of phenotypic correlations between traits \(\left( {\mathbf{{\Psi }}}\right)\), derived from \({\mathbf {V}}_{0}={\mathbf {G}}_{0}+{\mathbf {R}}_{0},\) was:
$${\mathbf{{\Psi }}}=\left[ \begin{array}{cccc} 1 & -0.329 & -0.321 & -0.334\\ -0.329 & 1 & -0.316 & -0.322\\ -0.321 & -0.316 & 1 & -0.318\\ -0.334 & -0.322 & -0.318 & 1 \end{array} \right] .$$
The adjustment for positive-definiteness produced estimates of phenotypic correlations that were negative and similar across pairs of traits. Traits turned out to be negatively correlated at the genetic, residual and phenotypic levels. The correlation values should not be interpreted inferentially, as the adjustment was done solely to facilitate calculation and for illustrative purposes.
Prior to selection, each line had a grain yield in each environment. The data available for analysis post-selection (phenotypes centered and standardized) included the lines exceeding the performance thresholds \(t_{1}=0\), \(t_{2}=0.6\) and \(t_{3}=1.3\) units above the mean in environments 1, 2 and 3, respectively, so only lines with performance above the minimum values for the three traits had records in environment 4. Only 10 such lines met the “culling levels”, so phenotypes presented to the hypothetical analyst consisted of a \(10\times 4\) matrix, lines in rows and traits in columns. Ignoring selection, the posterior distribution (given \({\mathbf {G}}_{0}\) and \({\mathbf {R}}_{0}\)) of the genomic breeding values using the selected dataset is multivariate normal, with the mean vector:
$$\widetilde{{\mathbf {g}}}_{sel}=\left( \begin{array}{c} \widetilde{{\mathbf {g}}}_{1,sel}\\ \widetilde{{\mathbf {g}}}_{2,sel}\\ \widetilde{{\mathbf {g}}}_{3,sel}\\ \widetilde{{\mathbf {g}}}_{4,sel} \end{array} \right) =\left( {\mathbf {G}}_{0}\otimes {\mathbf {G}}_{sel,sel}\right) \left( {\mathbf {G}}_{0}\otimes {\mathbf {G}}_{sel,sel}+{\mathbf {R}}_{0}\otimes {\mathbf {I}} _{10}\right) ^{-1}\left( \begin{array}{c} {\mathbf {y}}_{1,sel}\\ {\mathbf {y}}_{2,sel}\\ {\mathbf {y}}_{3,sel}\\ {\mathbf {y}}_{4,sel} \end{array} \right) ,$$
(62)
and covariance matrix
$${\mathbf {C}}_{g,sel}=\left( {\mathbf {G}}_{0}\otimes {\mathbf {G}}_{sel,sel}\right) \left[ {\mathbf {I}}-\left( {\mathbf {G}}_{0}\otimes {\mathbf {G}}_{sel,sel} +{\mathbf {R}}_{0}\otimes {\mathbf {I}}_{10}\right) ^{-1}\left( {\mathbf {G}} _{0}\otimes {\mathbf {G}}_{sel,sel}\right) \right] .$$
(63)
Above, \(\widetilde{{\mathbf {g}}}_{i,sel}\) is a \(10\times 1\) vector of posterior means of genomic breeding values in the ten selected lines for trait i; \({\mathbf {G}}_{sel,sel}\) is the \(10\times 10\) genomic relationship matrix between selected lines, and \({\mathbf {y}}_{i,sel}\) is their vector of phenotypes.
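Given the small dimensions here (the stacked system is \(40\times 40\)), the posterior summaries in Eqs. (62) and (63) can be formed explicitly; a hypothetical Python sketch:

```python
import numpy as np

def multitrait_posterior(y_stack, G0, Gss, R0):
    """Posterior mean and covariance of stacked genomic values, Eqs. (62)-(63).
    y_stack: phenotypes stacked trait by trait, (y_1', ..., y_4')';
    Gss: genomic relationships among selected lines; G0, R0: trait-level
    genomic and residual covariance matrices."""
    n = Gss.shape[0]
    K = np.kron(G0, Gss)                    # Cov(g) = G0 (x) G_sel,sel
    V = K + np.kron(R0, np.eye(n))          # phenotypic covariance
    W = K @ np.linalg.inv(V)
    return W @ y_stack, K - W @ K           # Eq. (62); Eq. (63)
```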
Under the multivariate selection scheme adopted, Eq. (56) is expressible as:
$$\begin{aligned} & p\left({\theta {|}} {\mathbf {y}}_{obs},\mathbf {r}, Hyp\right) \\ &\propto { \prod \limits _{i\epsilon S^{+}}} p\left( y_{1i},y_{2i},y_{3i},y_{4i}|{\theta}\right) p\left({\theta {|}}Hyp\right) { \prod \limits _{i\epsilon S^{-}}} \Pr \left( y_{1i}<t_{1},y_{2i}<t_{2},y_{3i}<t_{3}|{\theta}\right) \\& \propto { \prod \limits _{i\epsilon S^{+}}} p\left( y_{1i},y_{2i},y_{3i},y_{4i}|{\theta}_{sel}\right) p\left( {\theta}_{sel}{|}Hyp\right) \\ & \times { \prod \limits _{i\epsilon S^{-}}} \Pr \left( y_{1i}<t_{1},y_{2i}<t_{2},y_{3i}<t_{3}|{\theta} _{nsel}\right) p\left({\theta}_{nsel}|{\theta}_{sel} ,Hyp\right) \\ & \propto p\left( {\theta}_{sel}{|}{\mathbf {y}}_{obs},Hyp\right) { \prod \limits_{i\epsilon S^{-}}} \Pr \left( y_{1i}<t_{1},y_{2i}<t_{2},y_{3i}<t_{3}|{\theta} _{nsel}\right) p\left({\theta}_{nsel}|{\theta}_{sel} ,Hyp\right) . \end{aligned}$$
(64)
Above,
$$p\left( {\theta}_{sel}{|}{\mathbf{y}}_{obs},Hyp\right) \propto \exp \left[ -\frac{1}{2}\left( {\mathbf {g}}_{sel}-\widetilde{{\mathbf {g}}}_{sel}\right) ^{\prime }{\mathbf {C}}_{g,sel}^{-1}\left( {\mathbf {g}}_{sel}-\widetilde{{\mathbf {g}} }_{sel}\right) \right] ,$$
(65)
$$\begin{aligned}&\Pr \left( y_{1i}<t_{1},y_{2i}<t_{2},y_{3i}<t_{3}|{\theta} _{nsel}\right) \\&= { \int \limits _{-\infty }^{t_{1}}} { \int \limits _{-\infty }^{t_{2}}} { \int \limits _{-\infty }^{t_{3}}} \phi \left( {\mathbf {y}}_{i,nsel}|{\mathbf {g}}_{i,nsel},\left[ \begin{array}{ccc} R_{11} & R_{12} & R_{13}\\ R_{21} & R_{22} & R_{23}\\ R_{31} & R_{32} & R_{33} \end{array} \right] ,t_{1},t_{2},t_{3}\right) d{\mathbf {y}}_{i,nsel} \\&=\Phi \left( {\mathbf {y}}_{i,nsel}|{\mathbf {g}}_{i,nsel},{\mathbf {R}}_{0} ,t_{1},t_{2},t_{3}\right) ;i\epsilon S^{-}, \end{aligned}$$
(66)
where \(\phi \left( .\right)\) is the density of a trivariate normal distribution with mean vector \({\mathbf {g}}_{i,nsel}=\left( g_{1i,nsel},g_{2i,nsel},g_{3i,nsel}\right) ^{\prime }\), \(R_{ij}\) is the appropriate element of \({\mathbf {R}}_{0},\) and \(\Phi \left( .\right)\) is a trivariate normal distribution function. Further, \(p\left( {\theta}_{nsel}|{\theta}_{sel},Hyp\right)\) is the density of a multivariate normal distribution with mean vector
$${\mathbf {m}}_{n|s}=\left( {\mathbf {G}}_{0}\otimes {\mathbf {G}}_{nsel,sel}\right) \left( {\mathbf {G}}_{0}^{-1}\otimes {\mathbf {G}}_{sel,sel}^{-1}\right) {\mathbf {g}}_{sel}=\left( {\mathbf {I}}_{4}\otimes {\mathbf {G}}_{nsel,sel} {\mathbf {G}}_{sel,sel}^{-1}\right) {\mathbf {g}}_{sel},$$
(67)
and covariance matrix
$${\mathbf {V}}_{n|s}={\mathbf {G}}_{0}\otimes {\mathbf {G}}_{nsel,nsel}-{\mathbf {G}} _{0}\otimes {\mathbf {G}}_{nsel,sel}{\mathbf {G}}_{sel,sel}^{-1}{\mathbf {G}}_{sel,nsel}$$
(68)
Collecting terms,
$$p\left( {\theta} {|} {\mathbf{y}}_{obs},\mathbf {r},Hyp\right) \propto \exp \left[ -\frac{1}{2}\left( Q_{sel}+Q_{nsel}\right) +\Delta \left( {\mathbf {y}} _{nsel},{\mathbf {g}}_{nsel}\right) \right]$$
(69)
where
$$\begin{aligned} Q_{sel}= & \left( {\mathbf {g}}_{sel}-\widetilde{{\mathbf {g}}}_{sel}\right) ^{\prime }{\mathbf {C}}_{g,sel}^{-1}\left( {\mathbf {g}}_{sel}-\widetilde{{\mathbf {g}} }_{sel}\right) , \end{aligned}$$
(70)
$$\begin{aligned} Q_{nsel}= & \left( {\mathbf {g}}_{nsel}-{\mathbf {m}}_{n|s}\right) ^{\prime }{\mathbf {V}}_{n|s}^{-1}\left( {\mathbf {g}}_{nsel}-{\mathbf {m}}_{n|s}\right) , \end{aligned}$$
(71)
$$\begin{aligned} \Delta \left( {\mathbf {y}}_{nsel},{\mathbf {g}}_{nsel}\right)= & { \sum \limits _{i\epsilon S^{-}}} \log \left[ \Phi \left( {\mathbf {y}}_{i,nsel}|{\mathbf {g}}_{i,nsel},{\mathbf {R}} _{0},t_{1},t_{2},t_{3}\right) \right] . \end{aligned}$$
(72)
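Each term of \(\Delta\) in Eq. (72) is a trivariate normal rectangle probability, Eq. (66). A hypothetical sketch using scipy's (numerically approximated) multivariate normal distribution function:

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_fitness_nsel(g_nsel, t, R0_upper3):
    """Sum over unselected lines of log Pr(y1 < t1, y2 < t2, y3 < t3 | g),
    Eqs. (66) and (72). g_nsel: (k x 3) genomic values for traits 1-3;
    t: the three thresholds; R0_upper3: upper-left 3 x 3 block of R0."""
    return sum(np.log(multivariate_normal.cdf(t, mean=gi, cov=R0_upper3))
               for gi in g_nsel)
```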
To estimate the posterior distribution, a Metropolis algorithm was tailored as in Example 2. Here, the proposal distribution (four trait-specific blocks, each of dimension 599) was multivariate normal with mean vector:
$$\left[ \begin{array}{c} {{{\mu }}}_{1}\\ {{{\mu }}}_{2}\\ {{{\mu }}}_{3}\\ {{{\mu }}}_{4} \end{array} \right] =\left[ \begin{array}{c} G_{11}{\mathbf {A}}\left( G_{11}\mathbf {A+}2R_{11}{\mathbf {I}}_{599}\right) ^{-1}{\mathbf {y}}_{1}\\ G_{22}{\mathbf {A}}\left( G_{22}\mathbf {A+}2R_{22}{\mathbf {I}}_{599}\right) ^{-1}{\mathbf {y}}_{2}\\ G_{33}{\mathbf {A}}\left( G_{33}\mathbf {A+}2R_{33}{\mathbf {I}}_{599}\right) ^{-1}{\mathbf {y}}_{3}\\ G_{44}{\mathbf {A}}\left( G_{44}\mathbf {A+}2R_{44}{\mathbf {I}}_{599}\right) ^{-1}{\mathbf {y}}_{4} \end{array} \right] {,}$$
(73)
and block-diagonal covariance matrix
$${\mathbf {D}}={\oplus}_{i=1}^{4}G_{ii}{\mathbf {A}}\left[ {\mathbf {I}}_{599} - \left( G_{ii}\mathbf {A+}2R_{ii}{\mathbf {I}}_{599}\right) ^{-1} G_{ii}{\mathbf {A}}\right] .$$
(74)
The proposal distribution used an overdispersed covariance matrix. We ran four separate chains: two had 2500 iterations each, and the third and fourth had 5000 and 40,000 iterations, respectively. The R metric [46] indicated that convergence had been attained by about iteration 1500. Conservatively, the last 500 iterations of chains 1 and 2 were saved for inference; likewise, 3000 and 38,000 iterations were saved from chains 3 and 4, respectively. Posterior distributions of the genomic breeding values of the 10 selected lines were estimated from the 42,000 samples available for inference. Figure 5 displays posterior densities for 6 of the 10 selected lines. Ideally, we would expect the blue and green curves to separate from the red curve (selection ignored). Because selection was so severe (10 lines out of an initial 599), the three Bayesian analyses displayed great uncertainty, especially the one in which the selection process was accounted for. This example suggests that, even when the selection or dropout process is known, accounting for it may not be fruitful.
Example 4: nor-optimal selection is not ignorable
Often, selection is of a stabilizing form, aimed at moving the population towards some optimum value. A model for such selection was introduced by [26] and used later by [29, 47]. Let \({\mathbf {y}} =\left[ \begin{array}{cc} {\mathbf {y}}_{1}^{\prime },&{\mathbf {y}}_{2}^{\prime } \end{array} \right] ^{\prime }\) be a vector of random variables; some of its components may be unobservable. Without selection, assume:
$$\left[ \begin{array}{c} {\mathbf {y}}_{1}\\ {\mathbf {y}}_{2} \end{array} \right] \sim N \left({\mathbf {m}}=\left[ \begin{array}{c} {\mathbf {m}}_{1}\\ {\mathbf {m}}_{2} \end{array} \right] ,{\mathbf {V}}=\left[ \begin{array}{cc} {\mathbf {V}}_{11} & {\mathbf {V}}_{12}\\ {\mathbf {V}}_{21} & {\mathbf {V}}_{22} \end{array} \right]\right),$$
(75)
where the \({\mathbf {m}}^{\prime }\)s are mean vectors and the \({\mathbf {V}}^{\prime }\)s are covariance matrices. Selection operates on \({\mathbf {y}}_{1}\) through the Gaussian fitness function
$$H({\mathbf {y}}_{1})=\exp \left[-\frac{1}{2}({\mathbf {y}}_{1}-{\lambda})^{\prime }{\boldsymbol{\Gamma}}^{-1}({\mathbf {y}}_{1}-{\lambda})\right],$$
(76)
where \({\lambda}\) is a vector-valued “optimum” and the positive-definite matrix \(\mathbf{\Gamma}\) describes the sharpness of multivariate selection. The fitness function is symmetric about \({\lambda}\) and has a maximum value of 1 when \({\mathbf {y}}_{1}={\lambda}\); values of \({\mathbf {y}}_{1}\) “far away” from \({\lambda}\) confer lower fitness. In a single-variable situation, fitness takes the form \(H\left( y\right) =\exp [-\dfrac{\left( y-\lambda \right) ^{2}}{2\gamma }]\). A smaller \(\gamma\) implies a sharper decay in fitness when y deviates from \(\lambda\); larger values denote a more gentle selection, with no selection at all if \(\gamma =\infty\). Write
$$H({\mathbf {y}}_{1})=\exp \left[-\frac{1}{2}({\mathbf {y}}-{\lambda}_{0})^{\prime }{\boldsymbol{\Gamma}}_{0}^{-}({\mathbf {y}}-{\lambda}_{0})\right],$$
(77)
where \({\lambda}_{0}^{\prime }=\left[ \begin{array}{cc} {\lambda}^{\prime }&{\mathbf {0}} \end{array} \right]\); \({\mathbf {0}}\) is a null vector with order equal to that of \({\mathbf {y}}_{2}\) and \({\boldsymbol{\Gamma}}_{0}^{-}=\left[ \begin{array}{cc} {\boldsymbol{\Gamma}}^{-1} & {\mathbf {0}}\\ {\mathbf {0}} & {\mathbf {0}} \end{array} \right]\). The density of \({\mathbf {y}}\) after selection acting on \({\mathbf {y}}_{1}\) (ignoring dependence on parameters in the notation) is:
$$p_{s}({\mathbf {y}}_{1},{\mathbf {y}}_{2})=\frac{H({\mathbf {y}}_{1})p({\mathbf {y}} _{1},{\mathbf {y}}_{2})}{\iint H({\mathbf {y}}_{1})p({\mathbf {y}}_{1},{\mathbf {y}} _{2})d{\mathbf {y}}_{1}d{\mathbf {y}}_{2}}=\frac{H({\mathbf {y}}_{1})}{\overline{H} }p({\mathbf {y}}_{1},{\mathbf {y}}_{2});$$
(78)
equivalently, the density in “survivors” is:
$$p_{s}({\mathbf {y}}_{1},{\mathbf {y}}_{2})\propto H({\mathbf {y}}_{1})p({\mathbf {y}} _{1},{\mathbf {y}}_{2}).$$
(79)
Using Eq. (77), the post-selection density becomes:
$$p_{s}({\mathbf {y}}_{1},{\mathbf {y}}_{2})\propto \exp \left\{-\frac{1}{2} \left[{({\mathbf {y}}-{\lambda}}_{0})^{\prime}{\boldsymbol{\Gamma}}_{0}^{-} ({\mathbf {y}}-{\lambda}_{0})+({\mathbf {y}}-{\mathbf {m}})^{\prime }{\mathbf {V}} ^{-1}({\mathbf {y}}-{\mathbf {m}})\right]\right\}.$$
(80)
Combining the two quadratic forms on \({\mathbf {y}}\) in Eq. (80) yields
$$\begin{aligned}&\left( {\mathbf {y}}-{\lambda}_{0}\right) ^{\prime}{\boldsymbol{\Gamma}}_{0}^{-}\left( {\mathbf {y}}-{\lambda}_{0}\right) +\left( {\mathbf {y}}-{\mathbf {m}}\right) ^{\prime }{\mathbf {V}}^{-1}\left( {\mathbf {y}}-{\mathbf {m}}\right) \\&\quad = \left( {\mathbf {y}}-{\mathbf {m}}_{s}\right) ^{\prime }\left({\boldsymbol{\Gamma}}_{0} ^{-}+{\mathbf {V}}^{-1}\right) \left( {\mathbf {y}}-{\mathbf {m}}_{s}\right) {+({\lambda}}_{0}-\mathbf {m)}^{\prime}{\boldsymbol{\Gamma}}_{0}^{-}\left({\boldsymbol{\Gamma}}_{0}^{-}+{\mathbf {V}}^{-1}\right) ^{-1}{\mathbf {V}}^{-1} {({\lambda}}_{0}-\mathbf {m)} \end{aligned}$$
(81)
Above,
$${\mathbf {m}}_{s}=\left({\boldsymbol{\Gamma}}_{0}^{-}+{\mathbf {V}}^{-1}\right) ^{-1}({\boldsymbol{\Gamma}}_{0}^{-}{\lambda}_{0}+{\mathbf {V}}^{-1}{\mathbf {m}})$$
(82)
Since the second component of Eq. (81) does not involve \({\mathbf {y}}\), (80) can be written as:
$$p_{s}({\mathbf {y}}_{1},{\mathbf {y}}_{2})\propto \exp \left\{-\frac{1}{2}\left( {\mathbf {y}}-{\mathbf {m}}_{s}\right) ^{\prime }\left({\boldsymbol{\Gamma}}_{0}^{-} +{\mathbf {V}}^{-1}\right) \left( {\mathbf {y}}-{\mathbf {m}}_{s}\right) \right\}.$$
(83)
Hence, the joint distribution of \({\mathbf {y}}\) remains normal among survivors of selection, but with different parameters. Therefore, the post-selection distribution is:
$$[{\mathbf {y}}]_{s}\sim N[{\mathbf {m}}_{s},{\mathbf {V}}_{s}=({\boldsymbol{\Gamma}}_{0}^{-}+{\mathbf {V}}^{-1})^{-1}].$$
(84)
Note that
$${\mathbf {V}}_{s}=({\boldsymbol{\Gamma}}_{0}^{-}+{\mathbf {V}}^{-1})^{-1}={\mathbf {V}} ({\mathbf {I}}+{\boldsymbol{\Gamma}}_{0}^{-}{\mathbf {V}})^{-1}$$
(85)
where \(({\mathbf {I}}+{\boldsymbol{\Gamma}}_{0}^{-}{\mathbf {V}})^{-1}\) is related to the “coefficient of centripetal selection” S [29]. In a scalar situation, \(({\mathbf {I}}+{\boldsymbol{\Gamma}}_{0} ^{-}{\mathbf {V}})^{-1}=\dfrac{\gamma }{V+\gamma }=1-S\) gives the fraction of the variance \(\left( V\right)\) remaining after selection, and \(S=\dfrac{V}{V+\gamma }\) measures the fraction removed by nor-optimal selection.
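The nor-optimal scheme is easy to verify by simulation: since \(H\le 1\), retaining each draw with probability \(H(y)\) samples exactly from the post-selection density (78). A scalar Python sketch with hypothetical parameter values:

```python
import numpy as np

rng = np.random.default_rng(7)
m0, V = 0.0, 1.0       # pre-selection mean and variance
lam, gam = 2.0, 1.5    # optimum and sharpness of the nor-optimal fitness

y = rng.normal(m0, np.sqrt(V), 200_000)
H = np.exp(-0.5 * (y - lam) ** 2 / gam)         # Eq. (76), scalar form
surv = y[rng.uniform(size=y.size) < H]          # survive with probability H(y)

S = V / (V + gam)                               # coefficient of centripetal selection
print(surv.mean(), m0 * (1 - S) + lam * S)      # empirical vs Eq. (82) mean
print(surv.var(), V * (1 - S))                  # empirical vs Eq. (85) variance
```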
We use a canonical setting to show that selection is not ignorable, i.e., selection must be modelled for appropriate inference. Assume that the mean and the additive genetic and environmental variance components prior to selection are known. The setting is \(y=u+e,\) where \(u\sim N(0,h^{2})\) and \(e\sim N(0,1-h^{2})\) are independently distributed standardized breeding values and environmental effects, respectively, and \(h^{2}\) is trait heritability. The marginal distribution of phenotypes is \(y\sim N(0,1).\) Without selection,
$$\left[ \begin{array}{c} y\\ u \end{array} \right] \sim N\left( \left[ \begin{array}{c} 0\\ 0 \end{array} \right] ,\left[ \begin{array}{cc} 1 & h^{2}\\ h^{2} & h^{2} \end{array} \right] \right),$$
(86)
\(E(u|y)=h^{2}y\), and \(Var(u|y)=h^{2}(1-h^{2}).\) After a round of phenotypic selection towards \(\lambda\) with sharpness \(\gamma\), the joint distribution of breeding values and phenotypes remains bivariate normal. Post-selection,
$$E_{s}\left[ \begin{array}{c} y\\ u \end{array} \right] =\left[ \begin{array}{c} \lambda S\\ h^{2}\lambda S \end{array} \right] ,$$
(87)
and
$$Var_{s}\left[ \begin{array}{c} y\\ u \end{array} \right] =\left[ \begin{array}{cc} 1-S & h^{2}\left( 1-S\right) \\ h^{2}\left( 1-S\right) & h^{2}(1-h^{2}S) \end{array} \right] ,$$
(88)
respectively, where \(S=\dfrac{1}{1+\gamma }\). The additive variance is a fraction \(1-h^{2}S\) of the genetic variation prior to selection. Post-selection, the best predictor [42, 48] of u is:
$$E_{s}\left( u|y\right) =h^{2}\lambda S+\frac{h^{2}\left( 1-S\right) }{1-S}\left( y-\lambda S\right) =h^{2}y,$$
(89)
and
$$Var_{s}\left( u|y\right) =h^{2}(1-h^{2}S)-\frac{h^{4}\left( 1-S\right) ^{2}}{1-S}=h^{2}\left( 1-h^{2}\right) .$$
(90)
The parameters of the conditional distribution \(\left[ u|y\right]\) are as they were prior to selection, so selection is ignorable for the purpose of learning u from y.
Suppose now that the model is \(y_{i}=\mu +u_{i}+e_{i}\) with \(u\sim N(0,h^{2})\) and \(e\sim N(0,1-h^{2})\) as before; \(h^{2}\) is known but \(\mu\) is unknown. Given a sample of size n, without selection, the posterior distribution of \(\mu\) after assigning a flat prior to the latter parameter is \(\mu |{\mathbf {y}}{\sim}N\left( \overline{y},n^{-1}\right)\) where \(\overline{y}=\dfrac{{\sum \nolimits _{i=1}^{n}} y_{i}}{n}\) is the maximum likelihood estimator of \(\mu\) [25]. The analyst, however, is presented with a sample of m individuals known to belong to a population subjected to stabilizing selection towards \(\lambda\) with coefficient of selection S. After selection, the marginal distribution of the phenotypes is:
$$N\left( \mu _{s}=\mu \left( 1-S\right) +\lambda S,V_{s}=\left( 1-S\right) \right) .$$
(91)
If a flat prior is assigned to \(\mu _{s},\) then \(\mu _{s}|{\mathbf {y}}{\sim}N\left( \dfrac{{ \sum \nolimits _{i=1}^{m}} y_{i}}{m},(1-S)m^{-1}\right)\). What is the posterior distribution of \(\mu\)? Changing variables \(\mu _{s}\longrightarrow \mu\),
$$\begin{aligned} p_{s}\left( \mu |{\mathbf {y}}\right)&\propto \left( 1-S\right) p_{s}\left( \mu _{s}|{\mathbf {y}}\right) \\&\propto \exp \left\{ -\frac{m}{2(1-S)}\left[ \mu \left( 1-S\right) +\lambda S-\dfrac{ { \sum \limits _{i=1}^{m}} y_{i}}{m}\right] ^{2}\right\} \\&\propto \exp \left\{ -\frac{m(1-S)}{2}\left[ \mu -\frac{\overline{y}-\lambda S}{1-S}\right] ^{2}\right\} . \end{aligned}$$
The preceding implies that the posterior distribution of \(\mu\) after selection changes to \(\left[ \mu |{\mathbf {y}}\right] _{s}\sim N\left[ \dfrac{\overline{y}-\lambda S}{1-S},\dfrac{1}{m(1-S)}\right]\), where \(\overline{y}\) is now the average of the m records. In short, the missing data process must be considered from the point of view of inferring the mean of the base population, but can be ignored for learning the additive genetic value u.
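A closing numerical check (with hypothetical values of \(\mu\), \(\lambda\), S and m) illustrates the point: the naive posterior mean estimates \(\mu (1-S)+\lambda S\) rather than \(\mu\), whereas the change of variables above recovers \(\mu\):

```python
import numpy as np

rng = np.random.default_rng(3)
mu_true, lam, S, m = 1.0, 2.0, 0.4, 5000   # hypothetical values

# Survivors' phenotypes are marginally N(mu(1-S) + lam*S, 1-S), Eq. (91)
y = rng.normal(mu_true * (1 - S) + lam * S, np.sqrt(1 - S), m)
ybar = y.mean()

print(ybar)                        # naive posterior mean: biased for mu
print((ybar - lam * S) / (1 - S))  # corrected posterior mean, near mu_true
# Corrected posterior variance: 1/(m*(1-S)), versus (1-S)/m for mu_s above
```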