Comparison of molecular breeding values based on within- and across-breed training in beef cattle

Background Although the efficacy of genomic predictors based on within-breed training looks promising, it is necessary to develop and evaluate across-breed predictors for the technology to be fully applied in the beef industry. The efficacies of genomic predictors trained in one breed and utilized to predict genetic merit in differing breeds based on simulation studies have been reported, as have the efficacies of predictors trained using data from multiple breeds to predict the genetic merit of purebreds. However, comparable studies using beef cattle field data have not been reported. Methods Molecular breeding values for weaning and yearling weight were derived and evaluated using a database containing BovineSNP50 genotypes for 7294 animals from 13 breeds in the training set and 2277 animals from seven breeds (Angus, Red Angus, Hereford, Charolais, Gelbvieh, Limousin, and Simmental) in the evaluation set. Six single-breed and four across-breed genomic predictors were trained using pooled data from purebred animals. Molecular breeding values were evaluated using field data, including genotypes for 2227 animals and phenotypic records of animals born in 2008 or later. Accuracies of molecular breeding values were estimated based on the genetic correlation between the molecular breeding value and trait phenotype. Results With one exception, the estimated genetic correlations of within-breed molecular breeding values with trait phenotype were greater than 0.28 when evaluated in the breed used for training. Most estimated genetic correlations for the across-breed trained molecular breeding values were moderate (> 0.30). When molecular breeding values were evaluated in breeds that were not in the training set, estimated genetic correlations clustered around zero. Conclusions Even for closely related breeds, within- or across-breed trained molecular breeding values have limited prediction accuracy for breeds that were not in the training set. For breeds in the training set, across- and within-breed trained molecular breeding values had similar accuracies. The benefit of adding data from other breeds to a within-breed training population is the ability to produce molecular breeding values that are more robust across breeds and these can be utilized until enough training data has been accumulated to allow for a within-breed training set.


Background
One key advantage of genomic predictors is that they can be estimated early in the life of the animal and thus allow for increased accuracy of estimated breeding values (EBV), particularly for young animals, which have not yet produced progeny. However, the benefit of the inclusion of genomic predictions into EBV estimates is proportional to the amount of genetic variation that is explained by the genomic predictor [1]. To date, in beef cattle, the American Angus Association [2], Australian Angus Association, American Hereford Association [3], American Brahman Breeders Association [4], Australian Brahman Breeders Association [5], and American Simmental Association [6,7] exploit molecular information in their National Cattle Evaluations and associations for other breeds are moving towards this goal. Although the efficacy of within-breed trained genomic predictors looks promising [3][4][5][6][7][8], it is necessary to develop and evaluate across-breed predictors for the technology to be fully applied in the beef industry. Simulation studies have reported the efficacy of genomic predictors trained in one breed and utilized to predict genetic merit in differing breeds, as well as the efficacy of predictors trained using data from multiple breeds and then used to predict the genetic merit of purebreds that were either included or excluded from training data [9][10][11]. De Roos and colleagues [12] showed that by combining training populations, more accurate genomic predictions could be developed, particularly when the subpopulations had not diverged for more than a few generations and for lowly heritable traits. However, research in dairy cattle has shown that when the subpopulations diverged, genomic predictors from a multi-breed training population did not have higher accuracy than predictors from single-breed training sets, except when evaluation occurred in a breed that was not represented in the training set, in which case adding multiple breeds increased the accuracy of predictors, compared to using a singlebreed training set [13]. Work in other species [14] has shown that population structure can account for a substantial portion of the accuracy of genomic predictors but accounting for this structure can decrease the reliability of across-breed genomic predictors. Our objectives were to derive and evaluate genomic predictors using genotypes from the Illumina (San Diego, CA) BovineSNP50 platform for growth traits (weaning and yearling weights) in single-breed and multi-breed training data sets and evaluate them on field data.

Training populations
A total of three within-breed and two across-breed training populations were used: 1. A multiple-breed training population, which will be referred to as the MB population, that included five breeds (Angus, Hereford, Limousin, Red Angus, and Simmental) from a database containing 50K genotypes assembled from purebred beef cattle breeds. Animals in the database were primarily artificial insemination (AI) sires that had a substantial influence on their respective breeds and EBV with reasonably high accuracies. The only exception was the Limousin breed, for which a large number of DNA samples originated from previous DNA testing, such that only about 58% of the genotyped animals were AI sires. The MBV were trained on de-regressed EBV [15] (including weights to account for variable accuracy) for all five breeds together. Breed was fitted in the model as a fixed effect because EBV used in the training set were provided separately by each breed association, each using their own genetic base. In order to achieve a reasonable degree of independence, this training subset excluded any animal that was in the evaluation population. 2. Single-breed Angus (AN), Hereford (HH), or Limousin (LM) subsets that contributed to the MB MBV, collectively referred to as the single-breed training population (SB). The difference between MB and SB is that training was performed separately for each breed, as opposed to simultaneously for all five. The MBV trained on these single-breed training sets were computed both for subsets of the evaluation population that included the same breeds as the training set and subsets that included different breeds. 3. A multiple-breed training population consisting of AI sires from 13 breeds with high accuracy EBV and referred to as MB_2K. De-regressed EBV (adjusted for base differences and including weights to account for variable accuracy) were treated as phenotypes for training, following the methods of [15]. Due to the heterogeneity of breed-specific variance components, de-regressed breeding values and their associated weighting factors were scaled to standardize genetic variance as described by [16]. This training set represents a published across-breed prediction set from the US Meat Animal Research Center 2000 Bull Project [16], and was used for comparison with the MB MBV presented here. The MB_2K training set had 16 Limousin animals in common with the MB and Limousin SB sets. This small degree of overlap between the training sets was due to some animals having been genotyped by more than one research institution, since the data used for training the MB and SB sets did not include genotypes from the 2000 Bull Project.
The numbers of animals per breed in the SB, MB and MB_2K sets are presented in Table 1. For each trait, there were five training analyses (SB AN, SB HH, SB LM, MB, MB_2K). Each training analysis resulted in a prediction equation (a vector of estimated additive allelic effects corresponding to each SNP on the 50K chip), from which an MBV for each genotyped animal in the evaluation population could be computed. Genomic prediction equations were derived using GenSel [17], delivered via the Bioinformatics to Implement Genomic Selection (BIGS) platform (http://bigs.ansci.iastate.edu/). No pre-analysis filtering of SNPs (single nucleotide polymorphisms) based on minor allele frequency (MAF) was performed. The MB_2K predictions used a BayesCπ model [18]. The MB and SB predictions used a BayesC model with π set to 0.99 because the US beef industry (i.e., American Simmental Association and American Hereford Association) applied this approach to derive the genomic predictors that are included in National Cattle Evaluations. The de-regressed EBV for a genotyped individual was modeled as the sum of a fixed breed effect, the SNP effects times its genotype covariates, plus a random residual with variance σ 2 e /W, with W weights from [15]. The SNP effects had a prior distribution where a SNP effect was zero with probability π or was sampled from a normal distribution with a mean of zero and a SNP effect variance of σ 2 g with probability 1-π. The SNP effect variance and the residual variance had scaled inverse Chi-squared prior distributions.
Parameter π had a uniform (0,1) prior distribution in the BayesCπ model, but was assumed known in the BayesC model.

Evaluation population
The evaluation population consisted of genotyped animals from seven breeds in the herds of 24 seedstock producers from the Northern Plains region of the US plus any genotyped animals in their 4-generation pedigrees. The numbers of genotyped animals in the evaluation populations are summarized in Table 2. When evaluating the MB_2K MBV, animals in the evaluation population that were included in the MB_2K training set, had their MBV excluded from the analysis. Data was either extracted from existing breed association databases or using DNA samples extracted from semen or hair samples and did not require an approved animal use and care protocol.
The accuracies of the various MBV trained as described above, were evaluated based on the estimated genetic correlations between each of those MBV and the corresponding phenotypes in the evaluation population. Correlations were estimated using bivariate mixed linear models in which the traits were the MBV and the corresponding phenotypic trait. The genetic correlations between the trait and MBV reflect the accuracies of MBV since the square of these correlations represents the proportion of genetic variance explained by the genomic information [1,2,6,8,16,19]. The field data from the evaluation population contained weaning and yearling weights from 48 158 and 46 429 animals born in 2008 or later, respectively, with 128 050 animals in the pedigree, of which 2277 were genotyped and therefore had MBV. The average accuracy of the EBV of genotyped animals ranged from 0.44 for yearling weight in the Limousin breed to 0.84 for weaning weight in the Charolais breed.
The model for the phenotypic trait in the bivariate model included fixed effects for contemporary group,   [20]. Variance components for weaning and yearling weights were also estimated using single-trait linear mixed models based only on phenotypic data. Typical values for a collection of genetic correlations are reported based on the interquartile range that captures the middle 50% of the estimates.

Results and discussion
Genetic parameters for weaning and yearling weights estimated using single-trait analyses of the evaluation population field data from 24 herds are presented in Table 3. In general, heritability estimates were moderate to high and within the range of the estimates reported in the literature and summarized by Koots et al. [21], although some of the estimates of direct-maternal genetic correlations are greater than expected based on literature [22]. Estimates of genetic correlations between each MBV from the SB populations and its corresponding phenotypic trait are presented in Table 4. The MBV evaluated in this project generally accounted for less than 25% of the genetic variation (r g 2 ) in weaning and yearling weights ( Table 4). The estimated genetic correlations for the SB AN-trained MBV for weaning (0.36 ±0.07) and yearling (0.51±0.07) weights were similar to previously reported estimates i.e. ranging from 0.33 to 0.52 for weaning weight and from 0.34 to 0.64 for yearling weight [23,24]. With the exception of the SB HHtrained yearling weight MBV evaluated in Hereford, within-breed genetic correlations evaluated in the same breed as used for training were greater than 0.28, with typical values ranging from 0.36 to 0.42. However, the SB HH-trained yearling weight MBV performed poorly when evaluated in the same breed, with an estimated genetic correlation of 0.06±0.22. This may be an artifact of the fact that all Hereford field data used in evaluation were from a single herd. Figure 1 contains box plots of the genetic correlations of SB MBV with phenotypes, evaluated either in the same breed as that used for training or in a different breed. Unlike the moderate genetic correlations obtained when the within-breed MBV were evaluated in the same breed as used for training, the genetic correlations tended to be more variable and were centered close to zero when evaluation was in a different breed. The smaller genetic correlations are consistent with the expectation that the predictive power of the MBV decreases as the genetic distance between the animals used in training and evaluation increases [10,11,25] and implies that within-breed trained MBV are of minimal value for the genetic evaluation of other breeds.
Estimates of the genetic correlations between MB MBV and corresponding phenotypic traits are in Table 5. For breeds with both SB and MB MBV, the estimated genetic correlations tended to be similar (Figure 2), which suggests that either could be used with similar levels of efficacy. Considering the large contribution of AN to the training set, the estimated genetic correlations Genetic parameters are h 2 = direct heritability, m 2 = maternal heritability, r am = direct-maternal genetic correlation; genetic parameters were estimated in the field data set using pedigree-based REML.
for the SB and MB trained MBV were very similar. However, based on Table 5, the use of an MB trained MBV in a breed that was not included in the training data, is not advisable.
The breeds evaluated here represent populations that have diverged over many generations, approximately 200 years since breed formation occurred. Previous work by Pryce et al. [13] using the Holstein, Jersey, and Fleckvieh breeds showed that combining divergent subpopulations in the training set does not improve the accuracy of genomic predictors over within-breed derived predictors. Pryce et al. [13] reported that the accuracy of the predictors for milk genomic breeding values in Holstein based on training in Fleckvieh was equal to 0.22 but increased to 0.42 when a second breed (Jersey) was added to the training set, which suggests that the addition of several other breeds to the training set to predict a breed that was not in the training set is beneficial. However, the same results were not consistently seen here.
Typical values for the estimated genetic correlations for the MB_2K MBV across the seven breeds ranged from 0.25 to 0.35 for weaning weight and from 0.15 to 0.50 for yearling weight. These estimates are slightly lower than previously reported for within-breed trained MBV for growth traits [11,24].
Similar to the SB trained MBV results, estimates of genetic correlations for MB trained MBV were close to zero when evaluated in a breed that was not included in the training set (Table 5). For the two breeds (Charolais and Gelbvieh) not included in the MB training set, estimated genetic correlations for Gelbvieh tended to be low to moderate. The robustness of the MB trained MBV in Gelbvieh could be due to it having closer genetic ties with breeds included in training via crossbred animals in the pedigrees. However, the average AN contribution to the Gelbvieh evaluation data was only 8.45%. When evaluated in the same breed as used for training, SB trained MBV for weight traits based on the BovineSNP50 have higher accuracy than EBV based only on pedigree and performance information. Animals in the pedigree of the field data evaluation population bulls were excluded from training; genetic correlations and their standard errors are in bold characters when the MBV was evaluated in the breed in which it was trained.  Scatter plots of the relationship between MBV obtained from the SB and MB populations in Angus, Hereford, and Limousin breeds are presented in Figure 3 for weaning weight and in Figure 4 for yearling weight. Results show a strong linear association between the SB and MB trained MBV when applied to the same breed as used for training the SB MBV. This indicates that, on an individual breed basis, the SB and MB MBV account for much of the same variability in that breed.
There was a reduction in the proportion of genetic variance explained (r 2 g ), and thus a reduction in the variability of MBV among animals, when SB MBV were applied to animals from a different breed (Table 4), illustrating that these MBV do not separate animals in terms of genetic merit because they do not account for a substantial portion of the genetic variance. The decrease in variability indicates that SNPs that explained variability well in the training breed did not in another breed, even in a closely related breed. This result is consistent with the finding of Gibbs et al. [2] that haplotypes longer than 250 kb are conserved across closely related breeds [26]. An alternative explanation is that some breeds may be more diverse, i.e. having a greater number of haplotypes and low frequencies of these haplotypes. Furthermore, due to selection or drift, some breeds may be fixed, or close to fixation, for certain SNPs. The greater robustness of the MB trained MBV in explaining genetic variation across breeds compared to the SB trained indicate that there are multiple collections of SNPs that can capture the underlying genetic variability and that the addition of other breeds to the training set allows the model to select the collection of SNPs that works best across multiple breeds.
The average model incorporation frequencies, i.e. the proportion of iterations of the MCMC chain with which an individual SNP enters the model, were within 0.0012 of the prior model frequency of 0.01 = 1 -π for all SNPs and for each MBV. The numbers of SNPs with model frequencies and effects that exceeded various thresholds are presented in Table 6 and Table 7. The number of SNPs in the SB trained MBV that exceeded the prior model frequency of 0.01 was greater than that in the corresponding MB trained MBV. However, within the reduced set of SNPs in the MB trained MBV that exceeded the prior model frequency threshold, there  This decrease in the number of informative SNPs may also be due to the greater number of animals in the MB training data compared to the SB training data, resulting in a decrease in the noise associated with sampling in the MCMC algorithm. As suggested by [27], one benefit of adding breeds to the training set is the possibility of identifying SNPs that are in strong linkage disequilibrium with the QTL and with an allelic phase that is preserved across multiple breeds. Although this might help to identify important genomic regions harboring QTL, it is not associated with a noticeable increase in accuracy of MBV. The similar performance in individual breeds of SB and MB trained MBV supports the fact that common QTL are tracked across multiple breeds. If common QTL were not tracked, we would have expected a drop in performance when adding data from other breeds, because that would simply add noise to the data. The increase in the number of SNPs with high model frequencies and large effects in the across-breed trained MBV also supports the conclusion that common QTL are being tracked across breeds, since effects that were spread across several SNPs in each within-breed trained MBV are being assigned to a smaller set of SNPs across breeds. The MB trained MBV were, however, not consistently better than SB trained MBV.

Conclusions
The accuracy of within-or across-breed trained MBV are substantially lower for breeds that are not included in the training set, since the estimated genetic correlations between trait MBV and their corresponding phenotypes cluster around zero. This is true even for breeds that are closely related, such as Angus and Red Angus. The addition of training data from other breeds produces an MBV where the SNP effects are concentrated onto a smaller set of informative SNPs. However, the accuracy of the across-breed trained MBV is similar to the within-breed trained MBV.