Comparison of alternative approaches to single-trait genomic prediction using genotyped and non-genotyped Hanwoo beef cattle
- Joonho Lee†1View ORCID ID profile,
- Hao Cheng†1, 2,
- Dorian Garrick1, 3, 4,
- Bruce Golden4,
- Jack Dekkers1,
- Kyungdo Park5,
- Deukhwan Lee6 and
- Rohan Fernando1Email authorView ORCID ID profile
© The Author(s) 2017
Received: 20 June 2016
Accepted: 9 December 2016
Published: 4 January 2017
Genomic predictions from BayesA and BayesB use training data that include animals with both phenotypes and genotypes. Single-step methodologies allow additional information from non-genotyped relatives to be included in the analysis. The single-step genomic best linear unbiased prediction (SSGBLUP) method uses a relationship matrix computed from marker and pedigree information, in which missing genotypes are imputed implicitly. Single-step Bayesian regression (SSBR) extends SSGBLUP to BayesB-like models using explicitly imputed genotypes for non-genotyped individuals.
Carcass records included 988 genotyped Hanwoo steers with 35,882 SNPs and 1438 non-genotyped steers that were measured for back-fat thickness (BFT), carcass weight (CWT), eye-muscle area, and marbling score (MAR). Single-trait pedigree-based BLUP, Bayesian methods using only genotyped individuals, SSGBLUP and SSBR methods were compared using cross-validation.
Methods using genomic information always outperformed pedigree-based BLUP when the same phenotypic data were modeled from either genotyped individuals only or both genotyped and non-genotyped individuals. For BFT and MAR, accuracies were higher with single-step methods than with BayesB, BayesC and BayesCπ. Gains in accuracy with the single-step methods ranged from +0.06 to +0.09 for BFT and from +0.05 to +0.07 for MAR. For CWT, SSBR always outperformed the corresponding Bayesian methods that used only genotyped individuals. However, although SSGBLUP incorporated information from non-genotyped individuals, prediction accuracies were lower with SSGBLUP than with BayesC (π = 0.9999) and BayesB (π = 0.98) for CWT because, for this particular trait, there was a benefit from the mixture priors of the effects of the single nucleotide polymorphisms.
Single-step methods are the preferred approaches for prediction combining genotyped and non-genotyped animals. Alternative priors allow SSBR to outperform SSGBLUP in some cases.
Since breeding technologies using genome-wide single nucleotide polymorphism (SNP) panels became available, genomic selection was rapidly adopted for improvement of livestock and has replaced the traditionally used pedigree-based best linear unbiased prediction (PBLUP). The BayesA and BayesB hierarchical Bayesian models with locus-specific variances were proposed by Meuwissen et al. . BayesB can accommodate mixture models in which SNPs have zero effects with probability π [2, 3]. When π = 0, BayesB is known as BayesA. BayesC is another widely-used Bayesian mixture model, in which a common variance is used for all SNPs instead of locus-specific variances , and a modification of that method known as BayesCπ treats π as an unknown parameter with a uniform prior distribution .
In general, the number of individuals with genomic information is a small subset of the individuals represented in the population with pedigree and phenotypic information. “Single-step” methodologies were developed to take advantage of all pedigree, phenotypic and genomic information simultaneously [6, 7]. The single-step genomic BLUP (SSGBLUP) method uses a relationship matrix that is computed from marker and pedigree information. SSGBLUP was shown to yield a similar or higher accuracy compared to methods using only genotyped individuals [8–10]. Fernando et al.  proposed a class of single-step Bayesian regression methods (SSBR) to extend SSGBLUP to incorporate BayesB-like models for SNP effects (SSBR-B). Similar extensions of SSGBLUP with BayesC-like models result in SSBR-C and SSBR-Cπ. SSBR methods may promise higher prediction accuracies and provide computational benefits when many animals are genotyped. In SSGBLUP, the distribution of marker effects conditional on the variance of marker effects is assumed univariate normal, whereas in SSBR, the prior for marker effects can follow a t-distribution, a double exponential distribution or mixture distributions, which may be advantageous in some situations.
In this paper, prediction accuracies from PBLUP, BayesB, BayesC, BayesCπ, SSGBLUP and SSBR-B, SSBR-C, SSBR-Cπ were compared in terms of cross-validation accuracies.
Young Hanwoo bulls are routinely progeny-tested in batches at the Hanwoo Improvement Center (Seo-San, Chungnam, South Korea). DNA samples were collected from steers that included the progeny-tested offspring from the 46th to 51st selection batches. SNP genotypes were determined using Illumina Bovine SNP50 v1 (50 k) or Bovine HD (778 k) beadchips (Illumina, CA).
Carcass records were recorded at harvest at about 24 months of age. The carcass traits used in the analyses were back-fat thickness (BFT), carcass weight (CWT), eye-muscle area (EMA), and marbling score (MAR). Park et al.  reported heritabilities of 0.50, 0.30, 0.42 and 0.63 for BFT, CWT, EMA and MAR, respectively. Approval from the ethics committee was not required for these data since they were obtained from an existing industry database.
Of the 44 k SNPs that are included on both the 50 and 778 k beadchips, only autosomal SNPs with known map location were used. For quality control, SNPs that departed from the Hardy–Weinberg equilibrium (p < 10−6) based on a Chi square test, or had a minor allele frequency (MAF) lower than 0.01, or a missing rate higher than 0.1 were excluded from further analysis. For the genotyped animals, SNPs with missing genotypes were imputed using Beagle 3.3 . After these quality controls, 35,882 SNPs remained for analyses.
The numerator relationship matrix (NRM) based on pedigree information and the genomic relationship matrix (GRM) based on SNP genotypes were compared. Nineteen individuals, which showed unreasonable deviations between the NRM and GRM coefficients that were probably due to errors in the DNA sampling, were eliminated. Among these 19 individuals, five appeared to have been genotyped twice with different ID since their GRM relationship coefficients were near 1.0 while their NRM relationship coefficients were close to 0. For the other 14 individuals, either the GRM relationship coefficients were near 0 while those of the NRM were near 0.25 as would be the case for mistakenly recorded half-sib individuals, or the GRM relationship coefficients were near 0.25 while those of the NRM were near 0 as would be the case for half-sibs mistakenly recorded as unrelated. After elimination of these suspect individuals, the correlation coefficient between NRM and GRM increased from 0.856 to 0.866. Finally, 988 genotyped individuals remained for genomic prediction with a mean MAF of 0.243 and mean observed heterozygosity of 0.326.
Additional carcass records for 1438 non-genotyped progeny-tested steers were collected from the 39th to the 51st selection batches for the single-step and PBLUP analyses. Ancestors of the 2426 individuals with carcass records contributed to an 11-generation pedigree file that included 9637 animals.
Genotyped individuals were assigned to five mutually exclusive groups for cross-validation. K-means clustering based on pedigree relationship coefficients was used to minimize the relatedness between training and validation sets . The five groups included 172, 280, 199, 139 and 198 individuals, respectively. Each group was used as the validation set while the remaining genotyped individuals were included in the training set. In SSGBLUP, SSBR and PBLUP with phenotypes on all animals, non-genotyped individuals were included in the training set. Phenotypes were pre-adjusted for contemporary group and age effects using multiple-trait PBLUP because animals from some progeny-test batches were assigned to different groups and because some analyses included additional non-genotyped animals from the same batches as genotyped animals.
Single-trait statistical models
Bayesian methods using only genotyped animals
The prediction model based on the GRM constructed from one iteration was referred to as SSGBLUP-II and the GRM constructed from five iterations was referred to as SSGBLUP-III. To remove singularity, GRM can be blended with NRM  but this was not done in our study, nor were residual polygenic effects separately modeled in either SSGBLUP or SSBR. Instead, diagonal and off-diagonal elements of G were separately scaled so that their means equal the corresponding means of A gg, which is expected to remove the singularity of GRM in SSGBLUP that is introduced by centering the SNPs.
Single-step Bayesian regression methods
The π values in the subsequent analyses for BayesB, BayesC, SSBR-B and SSBR-C were chosen such that they provided the highest accuracies from fivefold cross-validation. Accuracies in BayesB and BayesC were compared using various π values i.e. 0.9999, 0.999, 0.995, 0.99, 0.98 and, then, in steps from 0.95 to 0.6 decreasing by 0.05.
Analyses were performed with GenSel  for BayesB, BayesC and BayesCπ methods using only genotyped animals. Estimated breeding values of PBLUP and SSGBLUP were obtained using the software BLUPF90  modified for genomic analyses . For SSBR methods, JWAS the Julia package for whole-genome analyses  was used.
For each validation set, prediction accuracy was calculated as the correlation between the vector of adjusted phenotypes and the vector of estimated breeding values, divided by the square root of trait heritability. Prediction accuracies from these fivefold cross-validation sets were pooled to obtain a single prediction accuracy that was relevant to the method and trait by weighting each of the five validation correlations by the number of individuals in that set. Regressions of adjusted phenotype on estimated breeding value were calculated for all prediction methods.
Genome-wide association studies
Genome-wide association studies (GWAS) were performed using the BayesB method with the π value that had given the highest prediction accuracy, in order to describe the genetic architecture for different traits in terms of window variance .
The π values that maximized the cross-validation accuracies in BayesB were 0.95, 0.98, 0.95, and 0.6 for BFT, CWT, EMA and MAR, respectively, and were used in SSBR-B. The corresponding π values in BayesC were 0.98, 0.9999, 0.98, and 0.6 for BFT, CWT, EMA and MAR, respectively, and were used in SSBR-C.
Several windows showed distinctly larger effects than the rest of the genome for BFT and EMA, but the window with the largest effect explained only about 1% of the genetic variance. For MAR, the windows showed smaller effects than those for BFT and EMA with the most significant window explaining less than 0.3% of the genetic variance. These results show that, for BFT, EMA and MAR, many QTL each with a small effect are widely distributed across the whole genome, which is consistent with the infinitesimal model. In contrast, for CWT, one window on chromosome 4 and two windows on chromosome 14 explained together more than 15% of the genetic variance while the other windows showed small effects. Using single-SNP association tests, Lee et al.  found similar results that indicated that SNPs on chromosome 14 were strongly associated with CWT in Hanwoo beef cattle. These differences in genomic architecture between the four traits probably explain the difference in the pattern of prediction accuracy between CWT and the three other traits as shown in Fig. 1. BayesB, which shrinks QTL with small effects to a greater extent than BayesC, may capture QTL with large effects better and therefore yield higher prediction accuracies . BayesB and BayesC methods with a high π value tend to capture the same few QTL with large effects, thus their similar prediction accuracies.
Prediction accuracies of models SSGBLUP-I and SSBR-C (π = 0) without estimated variances were identical and equal to 0.351 for BFT, 0.415 for CWT, 0.413 for EMA and 0.377 for MAR as expected since these models with assumed variance parameters are equivalent in terms of prediction of breeding values . In practice, variance components are often treated as unknown and are estimated in a separate analysis, e.g. restricted maximum likelihood (REML) followed by GBLUP, or jointly with an informative prior, e.g. BayesB, SSBR-B, etc. The variances of additive genetic effects, SNP effects and residual effects were estimated in the subsequent analyses described below.
Genomic methods versus pedigree-based BLUP
Methods using genomic information always outperformed PBLUP with the same phenotypic data. Using data from only genotyped animals, accuracies were higher with BayesB, BayesC and BayesCπ than with PBLUP-G for all traits. When data from both genotyped and non-genotyped individuals were used, prediction accuracies of the single-step methods were higher than those of PBLUP for all traits.
Single-step methods versus BayesB, BayesC and BayesCπ
For BFT and MAR, prediction accuracies of the single-step methods were higher than those of BayesB, BayesC and BayesCπ. Gains in accuracy with the single-step methods ranged from +0.06 to +0.09 for BFT and from +0.05 to +0.07 for MAR, whereas for EMA, there was no advantage and only a slight gain in accuracy was observed in PBLUP versus PBLUP-G. For CWT, SSBR always outperformed the corresponding Bayesian methods using only genotyped individuals and the gains in accuracy were +0.05 (SSBR-C (π = 0) vs. BayesC (π = 0)), +0.01 (SSBR-C (π = 0.9999) vs. BayesC (π = 0.9999)), +0.10 (SSBR-Cπ vs. BayesCπ) and +0.04 (SSBR-B (π = 0.98) vs. BayesB (π = 0.98)). However, although information from non-genotyped individuals was incorporated, for CWT prediction accuracy of SSGBLUP was lower than that of BayesC (π = 0.9999) and BayesB (π = 0.98) due to the benefits of mixture priors of the SNP effects for this particular trait.
Comparisons between single-step methods
Regression coefficient of adjusted phenotype on estimated breeding values for backfat (BFT), carcass weight (CWT), eye-muscle area (EMA) and marbling (MAR) traits
SSBR-C (π estimation)
SSBR-B (π = chosen)a
SSBR-C (π = chosen)b
SSBR-C (π = 0)
BayesC (π estimation)
BayesB (π = chosen)a
BayesC (π = chosen)b
BayesC (π = 0)
Prediction accuracies of all methods using genomic information were higher than those of pedigree-based BLUP. However, the degree of superiority of genomic selection differed between methods and traits.
We hypothesize that the advantage of including phenotypic observations from non-genotyped animals into an analysis using phenotypic observations from genotyped animals would be similar for pedigree methods (PBLUP compared to PBLUP-G) and for genomic methods (SSBR-C compared to BayesC). Simultaneous use of all pedigree, phenotypic and genomic information in single-step methods improved prediction accuracy relative to methods that only use data from genotyped animals for all traits, except EMA. For EMA, there was similarly little benefit from including the extra data in the PBLUP analyses (compared to PBLUP-G).
Both SSBR and SSGBLUP methods showed similar prediction accuracies when the genetic architecture appeared to approach the infinitesimal model as was the case for BFT, EMA, and MAR. However, for CWT, prediction accuracies of the SSBR methods were higher than those of SSGBLUP when there were only a few QTL with large effects. For that trait, the SSBR methods benefited from the use of the mixture priors.
The largest benefit of the SSBR methods was reached when an appropriate π was applied. However, it is computationally intensive to find this value of π through cross-validation. Methods for estimating π are beneficial, but they require large datasets. An appropriate π was more critical for the Bayesian methods that only used genotyped individuals than for the SSBR methods. For example, differences in prediction accuracies between BayesC (π = 0.9999) and BayesCπ reached values of 0.10 but only of 0.01 between SSBR-C (π = 0.9999) and SSBR-Cπ. Presumably, priors become less important in the single-step analyses where more data are used.
Three factors can result in increased accuracy. First, the inclusion of genomic information, which was revealed when genomic methods were compared to pedigree-based BLUP. Second, the use of additional phenotypic information from including non-genotyped individuals, which was shown by comparing Bayesian methods using only genotyped animals with their corresponding single-step methods. Third, the use of methods that exploit genomic regions with large effects, as was found for one of the four traits using either mixture priors or iterative weighted methods for computing GRM.
SSGBLUP with iterative calculation of weighted genomic matrices had the disadvantage that it reduced prediction accuracy and increased bias for traits that were not associated with genomic regions with large effects, whereas the Bayesian models with mixture priors performed comparably regardless of the genomic architecture. SSGBLUP with iterative calculation of weighted genomic matrices shrinks small effects to zero, and more so with each additional iteration. There is no statistical basis to determine the optimal number of iterations except by trial and error, and neither one nor five iterations resulted in improvements in this dataset.
In this study, which is based on a small population of Hanwoo cattle, prediction accuracy was higher for all genomic evaluations compared to pedigree-based BLUP. In such a situation, where the genomic reference population is relatively small, single-step methods, which can routinely account for genomic regions with large effects when they are present, are recommended for additional gains in accuracy.
The “single-step” methodologies, which take advantage of all pedigree, phenotypic and genomic information simultaneously, give similar or higher prediction accuracies compared to methods using only genotyped individuals. Compared to SSGBLUP, the SSBR methods showed additional benefit for the CWT trait, which is associated with QTL with large effects. There is no disadvantage in using SSBR methods for all traits.
JL and HC conceived the study, undertook the analysis and wrote the draft. DG, RF, JD, BG contributed to the analysis. DG, RF contributed to the final version of manuscript. All authors read and approved the final manuscript.
This work was supported by a grant from the Next Generation BioGreen 21 Program (PJ01111502), Rural Development Administration, Republic of Korea. This work was supported by the US Department of Agriculture, Agriculture and Food Research Initiative National 2 Institute of Food and Agriculture Competitive Grant No. 2015-67015-22947.
The authors declare that they have no competing interests.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Meuwissen THE, Hayes BJ, Goddard ME. Prediction of total genetic value using genome-wide dense marker maps. Genetics. 2001;157:1819–29.PubMedPubMed CentralGoogle Scholar
- Garrick DJ, Dekkers JCM, Fernando RL. The evolution of methodologies for genomic prediction. Livest Sci. 2014;166:10–8.View ArticleGoogle Scholar
- Cheng H, Qu L, Garrick DJ, Fernando RL. A fast and efficient Gibbs sampler for BayesB in whole-genome analyses. Genet Sel Evol. 2015;47:80.View ArticlePubMedPubMed CentralGoogle Scholar
- Kizilkaya K, Fernando RL, Garrick DJ. Genomic prediction of simulated multibreed and purebred performance using observed fifty thousand single nucleotide polymorphism genotypes. J Anim Sci. 2010;88:544–51.View ArticlePubMedGoogle Scholar
- Habier D, Fernando RL, Kizilkaya K, Garrick DJ. Extension of the Bayesian alphabet for genomic selection. BMC Bioinformatics. 2011;12:186.View ArticlePubMedPubMed CentralGoogle Scholar
- Legarra A, Aguilar I, Misztal I. A relationship matrix including full pedigree and genomic information. J Dairy Sci. 2009;92:4656–63.View ArticlePubMedGoogle Scholar
- Fernando RL, Dekkers JCM, Garrick DJ. A class of Bayesian methods to combine large numbers of genotyped and non-genotyped animals for whole-genome analyses. Genet Sel Evol. 2014;46:50.View ArticlePubMedPubMed CentralGoogle Scholar
- Misztal I, Aggrey SE, Muir WM. Experiences with a single-step genome evaluation. Poult Sci. 2013;92:2530–4.View ArticlePubMedGoogle Scholar
- Lourenco DAL, Misztal I, Tsuruta S, Aguilar I, Ezra E, Ron M, et al. Methods for genomic evaluation of a relatively small genotyped dairy population and effect of genotyped cow information in multiparity analyses. J Dairy Sci. 2014;97:1742–52.View ArticlePubMedGoogle Scholar
- Lourenco DA, Tsuruta S, Fragomeni BO, Masuda Y, Aguilar I, Legarra A, et al. Genetic evaluation using single-step genomic best linear unbiased predictor in American Angus. J Anim Sci. 2015;93:2653–62.View ArticlePubMedGoogle Scholar
- Park B, Choi T, Kim S, Oh SH. National genetic evaluation (system) of Hanwoo (Korean native cattle). Asian Australas J Anim Sci. 2013;26:151–6.View ArticlePubMedPubMed CentralGoogle Scholar
- Browning SR, Browning BL. Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am J Hum Genet. 2007;81:1084–97.View ArticlePubMedPubMed CentralGoogle Scholar
- Saatchi M, McClure MC, McKay SD, Rolf MM, Kim J, Decker JE, et al. Accuracies of genomic breeding values in American Angus beef cattle using K-means clustering for cross-validation. Genet Sel Evol. 2011;43:40.View ArticlePubMedPubMed CentralGoogle Scholar
- Wang H, Misztal I, Aguilar I, Legarra A, Muir WM. Genome-wide association mapping including phenotypes from relatives without genotypes. Genet Res (Camb). 2012;94:73–83.View ArticleGoogle Scholar
- Su G, Christensen OF, Janss L, Lund MS. Comparison of genomic predictions using genomic relationship matrices built with different weighting factors to account for locus-specific variances. J Dairy Sci. 2014;97:6547–59.View ArticlePubMedGoogle Scholar
- Calus MPL, Schrooten C, Veerkamp RF. Genomic prediction of breeding values using previously estimated SNP variances. Genet Sel Evol. 2014;46:52.View ArticlePubMedPubMed CentralGoogle Scholar
- Aguilar I, Misztal I, Johnson DL, Legarra A, Tsuruta S, Lawlor TJ. Hot topic: a unified approach to utilize phenotypic, full pedigree, and genomic information for genetic evaluation of Holstein final score. J Dairy Sci. 2010;93:743–52.View ArticlePubMedGoogle Scholar
- Misztal I, Tsuruta S, Strabel T, Auvray B, Druer T, Lee DH. BLUPF90 and related programs (BGF90). In: Proceedings of the 7th world congress on genetics applied to livestock production: 19–23 August 2002; Montpellier. 2002.
- Cheng H, Garrick DJ, Fernando RL. JWAS: Julia implementation of whole-genome analyses software using univariate and multivariate Bayesian mixed effects model. http://QTL.rocks (2016).
- Wolc A, Arango J, Jankowski T, Dunn I, Settar P, Fulton JE, et al. Genome-wide association study for egg production and quality in layer chickens. J Anim Breed Genet. 2014;131:173–82.View ArticlePubMedGoogle Scholar
- Lee SH, Choi BH, Lim D, Gondro C, Cho YM, Dang CG, et al. Genome-wide association study identifies major loci for carcass weight on BTA14 in Hanwoo (Korean cattle). PLoS One. 2013;8:e74677.View ArticlePubMedPubMed CentralGoogle Scholar
- Wolc A, Arango J, Settar P, Fulton JE, O’Sullivan NP, Dekkers JC, et al. Mixture models detect large effect QTL better than GBLUP and result in more accurate and persistent predictions. J Anim Sci Biotechnol. 2016;7:7.View ArticlePubMedPubMed CentralGoogle Scholar