Dataset
We used phenotypic and genotypic data from three PB lines and their three-way terminal CB, provided by Topigs Norsvin and Norsvin. Sires from a Landrace line (LR) were mated to dams from a Large White line (LW) to produce F1 sows, which were crossed with sires from a synthetic boar line (S) to produce the three-way CB (i.e. Sx(LRxLW)). In total, we used 17,100 animals from line S, 6611 animals from line LR, 8587 animals from line LW, and 4173 three-way CB. The synthetic boar line S is a combination of the LW and Piétrain breeds and was created around 1975 [10]. All purebreds were housed in a nucleus environment, and the crossbreds in a commercial environment, possibly giving rise to GxE interaction between PB and CB performances. Phenotypes were pre-corrected for fixed effects and common litter effects using a larger dataset during the routine genetic evaluation of Topigs Norsvin. The traits included average daily gain during the test period between 30 and 120 kg (TGR), lifetime daily gain (LGR, from birth until 120 kg), daily feed intake during the test period (DFI), backfat (BFE) and loin depth (LDE). In the PB animals, BFE and LDE were measured at the end of the test period using an ultrasound device. In the CB animals, BFE and LDE were measured on the carcass after slaughter at the slaughterhouse.
About 80% of the animals were genotyped with the Illumina 50K single nucleotide polymorphism (SNP) chip and the remaining 20% were genotyped with a custom Illumina 25K SNP chip. All genotypes were imputed to 50K within each purebred line using the FImpute v2.2 software [11]. First, the genotypes were imputed within the purebred lines. Second, the imputed genotypes from all purebred lines were used as a reference population to impute the genotypes of crossbred animals. For quality control, we excluded SNPs with a GenCall rate lower than 0.15 (Illumina Inc., 2005) or a call rate lower than 0.95, SNPs that showed a strong deviation from Hardy–Weinberg equilibrium (χ2 > 600), SNPs that were located on sex chromosomes, and unmapped SNPs. The positions of the SNPs were based on the Sscrofa11.1 assembly. Finally, SNPs with a minor allele frequency (MAF) lower than 0.01 in any of the lines were excluded, yielding 35,595 SNPs that were used in all the analyses. All genotyped animals had a frequency of missing genotypes lower than 0.05 and were therefore retained for further analyses.
Derived expressions for the bounds of \({r}_{pc}\)
In a previous study, Duenk et al. [9] derived expressions that predict \({r}_{pc}\) from the genetic variance components of the purebred parental lines. They focused on predicting the decrease of \({r}_{pc}\) due to non-additive effects in a two-way cross, considering two scenarios (genetic models). The first genetic model assumed that there were only additive and dominance genetic effects, and the second genetic model assumed that there were only additive and additive-by-additive (AxA) epistatic genetic effects. The first step was to express the average effects for CB performance (\({\alpha }_{CB}\)) in the parental line of interest (i.e. the focal line) in terms of average effects for PB performance in the parental lines, because the difference between these average effects determines the value of \({r}_{pc}\) [2, 9]. The results showed that for the dominance model, \({\alpha }_{CB}\) in the focal line is equal to the average effect for PB performance in the mated line. As a result, the dominance model may represent a lower bound for \({r}_{pc}\), because the maximum difference between the average effects for PB and CB performances is then bounded by the difference in allele frequencies between the parental lines. For the AxA epistatic model, \({\alpha }_{CB}\) in the focal line is equal to the mean of the average effects for PB performance in both parental lines. As a result, the epistatic model may represent an upper bound of \({r}_{pc}\), because the minimum difference between the average effects for PB and CB performances is bounded by the difference in allele frequencies between the parental line and the cross. Duenk et al. [9] showed that the derived expressions indeed indicate approximate lower and upper bounds of \({r}_{pc}\), regardless of the actual genetic model.
Predicting bounds of \({r}_{pc}\)
The present study builds on the work of Duenk et al. [9] and aims at validating the derived equations for the bounds of \({r}_{pc}\) in real data by evaluating whether the estimated \({r}_{pc}\) falls between the predicted bounds. We predicted the bounds of \({r}_{pc}\) for each parental line based on information from all parental lines, following the expressions for the lower and upper bounds of \({r}_{pc}\) in a three-way cross derived by Duenk et al. [9]. These equations were derived by expressing the additive genetic values of the different groups of animals as products of genotype matrices with vectors of average effects in the parental lines, to allow re-writing the variance components that appear in Eq. (1) in terms of variance components in the PB lines. Full derivations are presented in the appendix of Duenk et al. [9]. The predicted lower bound of \({r}_{pc}\) in sire line S for the three-way CB is:
$${r}_{pc,S}^{L}=\frac{{\sigma }_{S(S),S\left(LW\right)}+{\sigma }_{S(S),S\left(LR\right)}}{{\sigma }_{S}\sqrt{\left({\sigma }_{S\left(LW\right)}^{2}+{\sigma }_{S\left(LR\right)}^{2}+2{\sigma }_{S\left(LW\right),S\left(LR\right)}\right)}},$$
(3)
where \({\sigma }_{S}\) is the standard deviation of the additive genetic values of individuals in line S for the trait expressed in line S, \({\sigma }_{S\left(LW\right)}^{2}\) and \({\sigma }_{S\left(LR\right)}^{2}\) are the variance of the additive genetic values of individuals in line S for the trait expressed in line LW or LR, respectively, and \({\sigma }_{S\left(LW\right),S\left(LR\right)}\) is the covariance between the additive genetic values of individuals in line S for the trait expressed in line LW and LR. For across-line variance components such as \({\sigma }_{S\left(LW\right)}^{2}\), we use the term focal line to refer to the line for which \({r}_{pc}\) is predicted (S), and the term trait line to refer to the line where the trait is expressed (LW). The predicted upper bound of \({r}_{pc}\) in sire line S uses the same parameters and is:
$${r}_{pc,S}^{U}=\frac{{\sigma }_{S}^{2}+0.5{\sigma }_{S(S),S\left(LW\right)}+0.5{\sigma }_{S(S),S\left(LR\right)}}{{\sigma }_{S}\sqrt{\left({\sigma }_{S}^{2}+0.25{\sigma }_{S(LW)}^{2}+0.25{\sigma }_{S(LR)}^{2}+{\sigma }_{S(S),S\left(LW\right)}+{\sigma }_{S(S),S\left(LR\right)}+0.5{\sigma }_{S(LW),S\left(LR\right)}\right)}},$$
(4)
where the parameters are the same as those in Eq. (3).The bounds of \({r}_{pc}\) in lines LW and LR were predicted in a similar way, but using expressions for dam lines of a three-way CB. The predicted lower bound of \({r}_{pc}\) in dam line LW is:
$${r}_{pc,LW}^{L}=\frac{{\sigma }_{LW(LW),LW(S)}}{{\sigma }_{LW}{\sigma }_{LW(S)}},$$
(5)
where \({\sigma }_{LW}\) is the standard deviation of the additive genetic values of individuals in line LW for the trait expressed in line LW, \({\sigma }_{LW\left(S\right)}^{2}\) is the variance of the additive genetic values of individuals in line LW for the trait expressed in line S, and \({\sigma }_{LW\left(LW\right),LW\left(S\right)}\) is the covariance between the additive genetic values of individuals in line LW for the trait expressed in line LW and S. The predicted upper bound of \({r}_{pc}\) in dam line LW is:
$${r}_{pc,LW}^{U}=\frac{{\sigma }_{LW\left(LW\right), LW\left(S\right)}+0.5{\sigma }_{LW}^{2}+0.5{\sigma }_{LW(LW),LW(LR)}}{{\sigma }_{LW}\sqrt{\left({\sigma }_{LW(S)}^{2}+0.25{\sigma }_{LW}^{2}+0.25{\sigma }_{LW\left(LR\right)}^{2}+{\sigma }_{LW\left(LW\right), LW\left(S\right)}+{\sigma }_{LW\left(S\right),LW(LR)}+0.5{\sigma }_{LW(LW),LW(LR)}\right)}},$$
(6)
where \({\sigma }_{LW\left(LR\right)}^{2}\) is the variance of the additive genetic values of individuals in line LW for the trait expressed in line LR, \({\sigma }_{LW\left(S\right),LW\left(LR\right)}\) is the covariance between the additive genetic values of individuals in line LW for the trait expressed in line S and LR, \({\sigma }_{LW\left(LW\right),LW\left(LR\right)}\) is the covariance between the additive genetic values of individuals in line LW for the trait expressed in line LW and LR, and the other parameters are the same as those in Eq. (5). The expression for the lower and upper bounds of \({r}_{pc}\) in dam line LR are obtained by replacing subscript LW with LR, and vice versa, in Eqs. (5) and (6).
Approximating variance components
Equations (3) to (6) show that we need the genetic variance components of the parental lines, which are not usually available, such as the additive genetic standard deviation in individuals of line S, for the trait expressed in line LW (\({\sigma }_{S\left(LW\right)}\)). These across-line variance component cannot be estimated directly, because there are no phenotypes of the individuals from line S for the trait expressed in line LW. Therefore, here we compare three methods for approximating such variance components.
The first method (VCGEBV) relies on the assumption that the GEBV are estimates of true additive genetic values. Thus, the GEBV may be used as proxies for the additive genetic values in the focal line, and the required variance components can be approximated by computing the covariances between these GEBV and their variances. A disadvantage of VCGEBV is that the variance of GEBV is expected to be smaller than the variance of the additive genetic values (\({\sigma }_{a}^{2}\)) due to shrinkage (i.e. the variance of GEBV is equal to \({r}^{2}{\sigma }_{a}^{2}\), where \({r}^{2}\) is the reliability of GEBV), resulting in biased estimates of variance components. Thus, we included a second method (VCGEBV-S), where the GEBV were corrected for shrinkage by dividing them by the square root of their reliabilities.
The third method for approximating variance components was based on REML estimates (VCREML). It uses estimates of genetic variances resulting from a univariate analysis of the trait line, and estimates of genetic covariances resulting from pairwise bivariate analyses between parental lines. On the one hand, this method ignores that the required across-line variance components should correspond to individuals in the focal line, and instead uses the components estimated in the trait line. The across-line variance in the focal line can differ from the variance in the trait line due to differences in genotype frequencies between lines. On the other hand, REML estimates are generally accurate, and commonly reported. Furthermore, this method may result in accurate approximations when the genotype frequencies in the focal line are similar to those in the trait line.
Note that, for each of these methods, all variance components in Eqs. (3) to (6) were approximated using the method described, including the within-line variance components such as \({\sigma }_{S}\).
Methods based on GEBV (VCGEBV and VCGEBV
-S)
With VCGEBV, the required variance components are approximated by using the GEBV as proxies for true additive genetic values. We first analysed phenotypes in each PB line separately using a univariate model, while including genotypes of all PB lines to obtain GEBV for all PB animals. For example, for the phenotypes in line S, the model was:
$${\mathbf{y}}_{S}={\mu }_{S}+\mathbf{Z}{\mathbf{u}}_{S}+{\mathbf{e}}_{S},$$
(7)
where \({\mathbf{y}}_{S}\) is the vector of corrected phenotypes, \({\mu }_{S}\) is the mean, \({\mathbf{u}}_{S}\) is the vector of additive genetic values with incidence matrix \(\mathbf{Z}\), and \(\mathbf{e}\) is the vector of independent random residuals. The additive genetic values were distributed as \({\mathbf{u}}_{S} \sim N\left(\mathbf{0},{\sigma }_{s}^{2}\mathbf{G}\right)\), where \({\sigma }_{s}^{2}\) is the additive genetic variance in line S, and \(\mathbf{G}\) is the multi-breed genomic relationship matrix that includes all PB animals and is constructed following Wientjes et al. [12]. Note that this approach is equivalent to estimating marker effects in line S, and then multiplying these estimates with marker genotypes of all PB animals [13, 14]. The model results in GEBV for all individuals from lines S, LW, and LR, for the trait expressed in line S (\({\widehat{\mathbf{u}}}_{All(S)}\)). The same approach was used to estimate GEBV of all PB animals for the trait expressed in lines LW (\({\widehat{\mathbf{u}}}_{All(LW)}\)) and LR (\({\widehat{\mathbf{u}}}_{All(LR)}\)), using the phenotypes of those lines.
The variances that appear in Eqs. (3) to (6) were approximated as the standard deviations of GEBV of individuals in the corresponding focal line. For example, \({\sigma }_{LW\left(S\right)}\) was approximated as \(\sqrt{var\left({\widehat{\mathbf{u}}}_{LW\left(S\right)}\right)}\), where \({\widehat{\mathbf{u}}}_{LW(S)}\) is a subset of \({\widehat{\mathbf{u}}}_{All(S)}\). The covariances that appear in Eqs. (3) to (6) were approximated as the covariance between GEBV of individuals in the focal line. For example, \({\sigma }_{LW\left(S\right), LW\left(LR\right)}\) was approximated as \(cov\left({\widehat{\mathbf{u}}}_{LW\left(S\right)}, {\widehat{\mathbf{u}}}_{LW\left(LR\right)}\right)\), where \({\widehat{\mathbf{u}}}_{LW\left(S\right)}\) and \({\widehat{\mathbf{u}}}_{LW\left(LR\right)}\) are the GEBV of individuals in line LW, for the trait expressed in line S and line LR, respectively. The advantage of VCGEBV is that the resulting variance components relate correctly to the individuals in the focal line.
Variances and covariances of GEBV can be poor approximations of genetic variance components, because GEBV are subject to shrinkage. For example, the variance of GEBV is a factor \({r}^{2}\) of the corresponding additive genetic variance, where \({r}^{2}\) is the reliability of the GEBV [15]. Therefore, with the second method (VCGEBV-S), we used the same approach as VCGEBV, but we divided the GEBV by the square-root of their individual reliabilities to correct for shrinkage. The individual reliabilities were computed by the MTG2 program [16], using the prediction error variance [17]. This procedure aims at scaling the GEBV such that their variance is independent of their reliabilities. The variances of and the covariances between the scaled GEBV were used to approximate the parameters in Eqs. (3) to (6). Note that this approach is similar to the approach of Calo et al. [18] and Blanchard et al. [19], except that here we perform scaling at the level of the GEBV while allowing for different individual reliabilities, whereas in [18] and Blanchard et al. [19] scaling is performed at the level of the variances and covariances using the sum of the reliabilities across individuals.
Method based on REML estimates (VCREML)
With the third method (VCREML), we used genomic REML estimates of variance components within and between lines to approximate the required parameters. Approximated variances and standard deviations resulted from univariate genomic analyses in the trait lines. For example, \({\sigma }_{S\left(LW\right)}\) and \({\sigma }_{LR\left(LW\right)}\) were approximated as the REML estimate of the additive genetic standard deviation in line LW (\({\sigma }_{LW}\)). The model used for this analysis was:
$${\mathbf{y}}_{LW}={\mu }_{LW}+\mathbf{Z}{\mathbf{u}}_{LW}+{\mathbf{e}}_{LW},$$
(8)
where \({\mathbf{y}}_{LW}\) is the vector of corrected phenotypes in line LW, \({\mu }_{LW}\) is the mean, \({\mathbf{u}}_{LW}\) is the vector of additive genetic values with incidence matrix \(\mathbf{Z}\), and \({\mathbf{e}}_{LW}\) is the vector of independent random residuals. The additive genetic values were distributed as \({\mathbf{u}}_{LW} \sim N\left(\mathbf{0},{\sigma }_{LW}^{2}{\mathbf{G}}_{LW}\right)\), where \({\sigma }_{LW}^{2}\) is the additive genetic variance in line LW, \({\mathbf{G}}_{LW}\) is the genomic relationship matrix and is constructed following method 2 of VanRaden [20]. The REML estimate \({\widehat{\sigma }}_{LW}\) is used as an approximation for the standard deviation of the additive genetic values for the trait expressed in line LW (i.e. \({\sigma }_{LW}\), \({\sigma }_{S\left(LW\right)}\) and \({\sigma }_{LR\left(LW\right)}\)). The same approach was used to approximate variance components for the trait expressed in lines S (\({\sigma }_{S}\), \({\sigma }_{LW(S)}\) and \({\sigma }_{LR(S)}\)) and LR (\({\sigma }_{LR}, {\sigma }_{S(LR)}\) and \({\sigma }_{LW(LR)}\)), using data of those lines.
Approximate covariances resulted from pairwise bivariate genomic analyses for each combination of two parental lines. For example, the covariance \({\sigma }_{S\left(LW\right),S(LR)}\) was approximated as the REML estimate of the additive genetic covariance between lines LW and LR (\({\sigma }_{LW,LR}\)), resulting from the bivariate analysis of performance in lines LW and LR. The model can be written as [21, 22]:
$$\left[\begin{array}{c}{\mathbf{y}}_{LW}\\ {\mathbf{y}}_{LR}\end{array}\right] =\left[\begin{array}{cc}{\mathbf{1}}_{LW}& {\mathbf{0}} \\ {\mathbf{0}} & {\mathbf{1}}_{LR}\end{array}\right]\left[\begin{array}{c}{\mu }_{LW}\\ {\mu }_{LR}\end{array}\right]+\left[\begin{array}{cc}{\mathbf{Z}}_{LW} & 0\\ 0& {\mathbf{Z}}_{LR}\end{array}\right]\left[\begin{array}{c}{\mathbf{u}}_{LW}\\ {\mathbf{u}}_{LR}\end{array}\right]+\left[\begin{array}{c}{\mathbf{e}}_{LW}\\ {\mathbf{e}}_{LR}\end{array}\right],$$
(9)
where \(\mathbf{y}\) is the vector of corrected phenotypes, \(\mu\) is the mean, \(\mathbf{1}\) is a column vector of 1s, \(\mathbf{u}\) is the vector of additive genetic values with incidence matrix \(\mathbf{Z}\), and \(\mathbf{e}\) is the vector of independent random residuals. Subscripts denote whether the terms relate to LW or LR animals. The distribution of the additive genetic values was:
$$\left[\begin{array}{c}{\mathbf{u}}_{LW}\\ {\mathbf{u}}_{LR}\end{array}\right]\sim N\left(\left[\begin{array}{c} {\mathbf{0}} \\ {\mathbf{0}} \end{array}\right], \left[\begin{array}{cc}{\sigma }_{LW}^{2}& {\sigma }_{LW,LR}\\ {\sigma }_{LW,LR}& {\sigma }_{LR}^{2}\end{array}\right]\otimes \mathbf{G}\right),$$
(10)
where \({\sigma }_{LW}^{2}\) is the additive genetic variance in line LW, \({\sigma }_{LR}^{2}\) is the additive genetic variance in line LR, and \({\sigma }_{LW,LR}\) is the additive genetic covariance between lines LW and LR. The genomic relationship matrix between all individuals (\(\mathbf{G}\)) was calculated following Wientjes et al. [12]. The variance components were estimated using REML in the MTG2 software [16]\(.\) The same approach was used to estimate covariances for each combination of two parental lines.
In addition to REML estimates of genetic variances and covariances, the bivariate analyses resulted in estimates of genetic correlations between parental lines (computed as \({\widehat{r}}_{g}=\frac{{\widehat{\sigma }}_{{1,2}}}{{\widehat{\sigma }}_{1}{\widehat{\sigma }}_{2}}\)). We report these genetic correlations in the Results section because they are related to (the predicted bounds of) \({r}_{pc}\). Both \({r}_{g}\) and \({r}_{pc}\) depend on the size of the non-additive genetic effects and the differences in allele frequencies between parental lines [9, 23].
Estimating \({r}_{pc}\) and validation of predicted bounds
We estimated \({r}_{pc}\) for each of the parental lines using a bivariate model that treats PB and CB performances as different, but genetically correlated traits. Note that we did not combine data from all PB lines in a single analysis, but we estimated \({r}_{pc}\) for each parental line, separately, using three bivariate models. The statistical model was similar to the bivariate model that was used to estimate REML estimates of covariances between lines (Eq. (9)), except that the data used were from PB and CB animals, instead of from lines LW and LR. The variance components were estimated using REML in the MTG2 software [16], and \({r}_{pc}\) was estimated as \({\widehat{r}}_{pc}=\frac{{\widehat{\sigma }}_{PB,CB}}{{\widehat{\sigma }}_{PB}{\widehat{\sigma }}_{CB}}.\)
We validated the predicted bounds by comparing them with the estimated \({r}_{pc}.\) Note that, in these comparisons, we consider the predicted bounds to be correct if the interval of the estimated \({r}_{pc}\) plus and minus one standard error of the estimate overlapped with the interval between the predicted bounds. Although the predicted bounds may capture the estimated \({r}_{pc}\), the bounds may not be very informative if they cover an extremely large range. For example, a predicted lower bound of − 1 and a predicted upper bound of 1 will always capture estimated \({r}_{pc}\), but such a prediction is not useful at all. Thus, we computed the range of bounds as the difference between the predicted upper and lower bounds of \({r}_{pc}\).