Using the Pareto principle in genome-wide breeding value estimation
© Yu and Meuwissen; licensee BioMed Central Ltd. 2011
Received: 23 September 2010
Accepted: 1 November 2011
Published: 1 November 2011
Genome-wide breeding value (GWEBV) estimation methods can be classified based on the prior distribution assumptions of marker effects. Genome-wide BLUP methods assume a normal prior distribution for all markers with a constant variance, and are computationally fast. In Bayesian methods, more flexible prior distributions of SNP effects are applied that allow for very large SNP effects although most are small or even zero, but these prior distributions are often also computationally demanding as they rely on Monte Carlo Markov chain sampling. In this study, we adopted the Pareto principle to weight available marker loci, i.e., we consider that x% of the loci explain (100 - x)% of the total genetic variance. Assuming this principle, it is also possible to define the variances of the prior distribution of the 'big' and 'small' SNP. The relatively few large SNP explain a large proportion of the genetic variance and the majority of the SNP show small effects and explain a minor proportion of the genetic variance. We name this method MixP, where the prior distribution is a mixture of two normal distributions, i.e. one with a big variance and one with a small variance. Simulation results, using a real Norwegian Red cattle pedigree, show that MixP is at least as accurate as the other methods in all studied cases. This method also reduces the hyper-parameters of the prior distribution from 2 (proportion and variance of SNP with big effects) to 1 (proportion of SNP with big effects), assuming the overall genetic variance is known. The mixture of normal distribution prior made it possible to solve the equations iteratively, which greatly reduced computation loads by two orders of magnitude. In the era of marker density reaching million(s) and whole-genome sequence data, MixP provides a computationally feasible Bayesian method of analysis.
Genomic selection (GS) is currently being adopted by the dairy cattle breeding industries around the world . Genome-wide breeding value (GWEBV) prediction plays a pivotal role for this new technology. Its accuracy depends on the statistical methods used, the genome, the population structure, and trait heritability. GWEBV estimation methods are categorized based on the assumptions of their prior distributions of marker effects. Genome-wide BLUP (GBLUP) methods e.g. , assume a normal prior distribution for all marker loci with a constant variance. In Bayesian methods, a more flexible prior distribution of SNP effects can be applied that allows for a few but with very large SNP effects whilst most are small or even zero. However, Bayesian methods often use Monte Carlo Markov chain (MCMC) algorithms which make them computationally demanding.
Meuwissen et al.  proposed BayesB for the estimation of SNP effects, which assumes that a fraction (1 - π) of the SNP have no effect and π SNP have an effect with a t-distributed prior that is more thick tailed than the normal distribution, i.e. it allows for a few SNP with very large effects and many SNP with small ones. Based on the work of  and  suggesting that normal priors give similar results as t-distributed priors, Luan et al.  used a mixture of two normal distributions as a prior, with probability π of the SNP effects coming from a normal distribution with a large variance and with probability (1 - π) from a normal distribution with a small variance. This was also justified by the observation that in practice GBLUP yields high accuracy , which suggests that the best predictions are obtained if the SNP with small effects are not neglected. This mixture prior distribution has two unknown parameters, π, the variance of the SNP with large effects. These parameters are difficult to estimate partly because the true distribution of the SNP effects is probably not a mixture of two normal distributions.
where M is the number of SNP, such that . Thus, assuming the overall genetic variance is known, applying of the Pareto principle reduces the number of unknown parameters of the prior distribution from 2 to 1, i.e. π.
The aim of this paper is to present this novel approach using the Pareto principle applied on individual marker loci (MixP), and to compare it with other single SNP based GWEBV prediction methods in a real Norwegian Red cattle (NRF) pedigree, and on a real wheat dataset.
The MixP method
For the estimation of SNP effects, we used the Iterative Conditional Expectation (ICE) algorithm of Meuwissen et al. . The ICE algorithm is similar to the Iterative Conditional Mode (ICM) algorithm of [8, 9], except that it estimates the mean instead of the mode of the SNP effects. The latter is because the posterior of the SNP effects may be bimodal , in which case estimation of the mode does not make much sense (both modes may be quite far away from the mean). In order to simplify the estimation process, we will use a mixture of two normal distributions, with a 0 mean and variances and , respectively, as a prior for the SNP effects, instead of the Spike and Slab distribution [11, 12] that was used by .
where is the likelihood of the data , i.e. , where . Analogously, L2iis defined with replacing . In the numerator, , where is the standard BLUP estimate of g i assuming that the prior variance of g i is . Analogously, is calculated assuming a prior variance of .
Thus, is the weighted mean of BLUP estimates of g i assuming that the prior variance of g i is either or and where the weights equal the posterior probability of either belonging to the first (πL1i) or to the second distribution ((1 - π)L2i). As described above, the calculation of L1irequires the inversion of the V1imatrix, but the Appendix shows how to avoid this matrix inversion.
Iteration is continued until the sum of the squares of the changes become less that 10-5 times the sum of the squares of the solutions. We refer this method as MixP.
Other statistical methods
Two other estimation methods were compared. BayesB is implemented as in , where the SNP with a small effect are assumed to come from a normal distribution with a small variance (instead of having no effect as in ). The variance of these small effects is sampled, and thus these small effects may be seen as modeling the polygenic background genes. As in the original BayesB, some SNP are allowed to have a big effect. The number of iterations for BayesB is 10000, and that for burn-in is 3000. The degrees of freedom for the inverse χ2 distribution is 4.2. GBLUP is also used, as described in .
Norwegian Red cattle
The Norwegian Red cattle pedigree used in this study consisted of 19523 individuals distributed over eight generations kindly provided by GENO AS (http://www.geno.no). The data also contained an identifier indicating whether a bull was genotyped or not. A total of 2165 bulls were genotyped including 104 imported bulls. We based our simulations on this real pedigree and assumed that the genotyped animals were those in the pedigree. The cattle population data was first sorted out to make sure that the parents were placed before their offspring. The 1915 oldest genotyped bulls were marked as the training set, and the youngest 250 bulls were marked as the evaluation set, i.e. for which EBV are predicted. The total genetic variance V g and number of QTL were assumed as known parameters in this study.
The parents of unknown origin were sampled from an ideal population of effective size 200 in each scenario, which was simulated for 10000 generations to achieve a mutation drift balance and linkage disequilibrium between the loci. The genome consisted of 1 Morgan/108 base-pairs. The mutation rate was 10-8 per base-pair per meiosis. Markers and QTL loci were randomly selected amongst those with an allele frequency greater than 0.05. The number of markers corresponded to those inherited from the ideal population, whereas various numbers of QTL and heritabilities were simulated. QTL effects were sampled from a Laplace distribution with a 0 mean and scale parameter = 1. The genotypes of the last generation were gene-dropped into the sorted real population pedigree. No additional mutation occurs at this stage. The 1915 training bulls were genotyped and phenotyped, and the 250 evaluation bulls were only genotyped. The heritabilities, , used here should be interpreted as per chromosome heritabilities, i.e. they are about 30 times smaller than the total trait heritabilities (or reliabilities in case of daughter-yield-deviations).
Three scenario parameters, i.e., chromosome heritability , marker density (Nmkr = 100, 200, 500, 1000, 1500), and number of QTL per chromosome (NQTL = 5, 30, 100) varied in the simulation study. QTL were sampled randomly from the mutated loci in the ideal population simulation. Marker loci were then sampled randomly with a minimum allele frequency (MAF) of 0.05. Each scenario was repeated 100 times. The simulated genetic and environmental variances were also used to analyze the data by GBLUP, BayesB and MixP. The π value used for MixP was NQTL/Nmrk, i.e. the ratio of number of QTL simulated to that of markers used.
Real data analysis
The publicly available wheat dataset as described in  was used to test the methods on a real dataset. This dataset consisted 599 wheat lines. Grain yields in four different environments were recorded. These lines were typed at 1,447 loci. Markers with an allele frequency between 0.05 and 0.95 were used for the analysis. Ten replicates of a ten-fold cross-validation were done by randomly and evenly dividing the lines into 10 groups. In each replicate, each group was used as a validating set in turn. The correlation coefficients between GWEBV and phenotypes were averaged across all the validation folds and replicates for grain yield in each environment. The estimates of π for method MixP were optimized through a grid search with a step size 0.01 between 0.01 and 0.50. The heritability of grain yields used for the analysis was 0.34, 0.30, 0.41 and 0.37 for environment 1 to 4, respectively .
Accuracy of GWEBV estimations
From Figures 1 and 2, we can see that results from MixP and BayesB were very similar and better than the GBLUP methods. A comparison between Figures 1 and 2 showed that the accuracy increased with and Nmkr. Accuracy trends on NQTL differed: the accuracy of MixP and BayesB decreased with increasing NQTL, as was found by . The GBLUP method was as good as BayesB and MixP when the number of QTL is 100.
It may also be noted that the accuracy of GBLUP was very much independent of the number of QTL, which is expected since GBLUP does not give extra weight to the SNP with big effects. Hence, it does not benefit from the fact that a few genes have a big effect. This can also be seen from the formula for the accuracy of GBLUP, as shown by  and , which depends on the number of independent segments in the genome, but not on the actual number of QTL.
Results with the wheat dataset
Accuracies of GBLUP, MixP and BayesB for the prediction of grain yield in four environments called 1-4 in wheat
MixP (π = 0.2)
Relative computer CPU times using GBLUP, MixP and BayesB (the CPU time of GBLUP was set to 1
Number of SNP
Here we applied the Pareto principle to estimate genome-wide EBV (GWEBV). Because a simple prior distribution was used, i.e. a mixture of normal distributions, an iterative method to calculate the BayesB type of GWEBV was obtained, which was called MixP. MixP yielded accuracies that were very similar to BayesB (Figures 1 and 2). The latter is not surprising because the prior distributions of both MixP and BayesB are mixture distributions, putting a lot of weight on a few SNP with large effects and little weight on many SNP with small effects.
For 100 or more QTL per chromosome, i.e. for more than 3000 QTL in a 30 chromosome genome with a marker density of 1500 per chromosome, there was no significant difference between MixP and GBLUP. Thus, the advantage of allowing for large SNP effects decreased when the number of QTL became large, which was also observed by Daetwyler et al. . Because the estimated π values for grain yield were high (between 0.16 and 0.49), it appears that grain yield was also affected by many QTL, which explained the small differences in accuracy between MixP and GBLUP (Table 1). With the same wheat dataset, Crossa et al.  reported a slightly higher accuracy by a few percents than that found here, but this may be explained by the fact that they simultaneously fitted a polygenic effect, which was not done here. The use of the Pareto principle reduced the number of hyper-parameters, i.e. parameters of the prior distribution, from 3 (π, , and ) to 2 (π and the total genetic variance V g ). We will assume here that the total genetic variance was known. An estimate of V g could be obtained easily using the GBLUP model and a variance component estimation, which would provide an approximately unbiased estimate. If the total genetic variance was known, the number of parameters would be reduced to 1 (π), which could be estimated by cross-validation  in real data situations.
The reduction in number of parameters due to the use of the Pareto principle avoids the need for a multi-parameter cross-validation (for π, , and ), which implies searching through a grid of parameter combinations. If however the multi-parameter cross validation results in a parameter combination that does not adhere to the Pareto principle and is significantly better than the best Pareto principle parameter combination, multi-parameter cross-validation should be preferred to the Pareto principle. The latter will require a very large dataset to clearly demonstrate a significant deviation of the Pareto rule.
The mean of the posterior distribution was used here to estimate SNP effects and to predict genetic values, which implemented the ICE algorithm of . Alternatively, the posterior mode could have been used, which would have resulted in the ICM algorithm [8, 9]. However, the use of a mixture of normals as a prior may result in a bimodal posterior distribution, see . In the case of a mixture of Laplacian distributions, bimodality did indeed occur . In the case of a bimodal posterior, the mode of the distribution depends on which of the two modes happens to have the higher probability density, and may thus differ widely even when the posterior distributions are quite similar. It is well established in statistics that the posterior mean maximizes the accuracy of the SNP effects and consequently yields most accurate GEBV. Since this argument did not depend on the posterior being bimodal or not, we estimated the mean of all the considered posterior distributions, as described by equation (3).
Alternatively, the two hyper-parameters (π, ) can be estimated by MCMC estimation which 1) increases computation time significantly, and 2) may result in hyper-parameter estimates that are adapted to the current data, and will not perform well when used to predict records outside the current data, as is needed for genomic selection. An advantage of the MCMC algorithm is that the sampling also results in standard errors of the estimates. In the current algorithm, standard errors may be obtained from a parametric bootstrapping approach . In this approach, replicated simulated data sets are obtained, where the estimates of the real data are to be taken as the true effects and randomly sampled error terms from the normal distribution are added to obtain simulated records. By estimating the SNP effects of the replicated data sets, the standard deviation of the SNP effects is obtained.
The Pareto principle assumes a small but non-zero effect for the SNP with small effects, whereas the original BayesB of  assumed no effect for such SNP. The assumption that the small SNP effects are real, implicitly implies that a polygenic effect is fitted where the marker-based relationship matrix is used to model the relationships between the animals. In the current and other simulation studies, this is usually not needed, as there are no background genetic effects next to the simulated QTL. However in real data, the fitting of a polygenic effect was found to outperform methods that did not allow for such polygenic effects . In fact, in many cases, the fitting of only one polygenic effect, i.e. GBLUP, was found to yield the highest accuracy . However, it may be expected that the increasing density of the SNP chips facilitates pinpointing QTL positions and methods such as BayesB and MixP will become relatively more accurate. It may also be noted that with π = 0.5, MixP reduces to GBLUP, i.e. optimizing the value of π can automatically result in GBLUP.
where N is the number of training records, L is the genome size, 4N e L is the actual number of chromosome segments from population genetics results, v is the ratio of effective and actual segments, so 4N e Lv is the effective number of segments. This proportional reduction of the genome size and heritability can greatly reduce the simulation time, i.e., only around 1/30 of the computer time was needed here. Thus, using , we can compare the results obtained here with other results using a full genome size.
The main advantage of MixP over BayesB is that MixP is more than two orders of magnitude faster (Table 2). This is due to the fact that BayesB requires MCMC sampling and thus many cycles in order to obtain an accurate estimate of the mean of the parameters, whereas MixP requires a limited number of iterations. Hence, MixP will be able to handle high density SNP chips such as the currently available bovine 500-800k SNP chips, whereas the MCMC based methods such as BayesB are computationally too demanding to apply to half a million or more SNP and many thousands of training records.
Calculation of multivariate log-likelihood using Henderson's mixed model equations
Thus, the calculation of y'V-1y requires only some products of vector (y'y and y'b) and the calculation of , which was needed anyway in order to update the solutions in MixP.
We thank GENO Breeding Company for providing the pedigree of the Norwegian Red cattle. We also thank the funding from the Norwegian Research Council. The helpful comments of three anonymous reviewers are gratefully acknowledged.
- Hayes BJ, Bowman PJ, Chamberlain AJ, Goddard ME: Invited review: Genomic selection in dairy cattle: progress and challenges. J Dairy Sci. 2009, 92: 433-443. 10.3168/jds.2008-1646.View ArticlePubMedGoogle Scholar
- Meuwissen TH, Hayes BJ, Goddard ME: Prediction of total genetic value using genome-wide dense marker maps. Genetics. 2001, 157: 1819-1829.PubMed CentralPubMedGoogle Scholar
- Habier D, Fernando RL, Dekkers JCM: Genomic selection using low-density marker panels. Genetics. 2009, 182: 343-353. 10.1534/genetics.108.100289.PubMed CentralView ArticlePubMedGoogle Scholar
- Meuwissen THE, Solberg TR, Shepherd R, Woolliams JA: A fast algorithm for BayesB type of prediction of genome-wide estimates of genetic value. Genet Sel Evol. 2009, 41: 2-10.1186/1297-9686-41-2.PubMed CentralView ArticlePubMedGoogle Scholar
- Luan T, Woolliams JA, Lien S, Kent M, Svendsen M, Meuwissen TH: The accuracy of genomic selection in Norwegian Red cattle assessed by cross-validation. Genetics. 2009, 183: 1119-1126. 10.1534/genetics.109.107391.PubMed CentralView ArticlePubMedGoogle Scholar
- VanRaden PM, Van Tassell CP, Wiggans GR, Sonstegard TS, Schnabel RD, Taylor JF, Schenkel FS: Invited review: reliability of genomic predictions for North American Holstein bulls. J Dairy Sci. 2009, 92: 16-24. 10.3168/jds.2008-1514.View ArticlePubMedGoogle Scholar
- Juran JM: The non-Pareto principle; mea culpa. 1975, Quality Progress, 89-Google Scholar
- Besag J: On the statistical analysis of dirty pictures. J R Stat Soc [Ser B]. 1986, 48: 259-302.Google Scholar
- Meuwissen THE: Accuracy of breeding values of 'unrelated' individuals predicted by dense SNP genotyping. Genet Sel Evol. 2009, 41: 35-10.1186/1297-9686-41-35.PubMed CentralView ArticlePubMedGoogle Scholar
- Park T, Casella G: The bayesian lasso. J Amer Statistical Assoc. 2008, 103: 681-686. 10.1198/016214508000000337.View ArticleGoogle Scholar
- George EI, McCulloch RE: Variable selection via Gibbs sampling. J Amer Statistical Assoc. 1993, 88: 881-889. 10.2307/2290777.View ArticleGoogle Scholar
- Ishwaran H, Rao JS: Spike and slab variable selection: frequentist and Bayesian strategies. Ann Statist. 2005, 730-773.Google Scholar
- Habier D, Fernando RL, Dekkers JCM: The impact of genetic relationship information on genome-assisted breeding values. Genetics. 2007, 177: 2389-2397.PubMed CentralPubMedGoogle Scholar
- Hayes B, Goddard ME: The distribution of the effects of genes affecting quantitative traits in livestock. Genet Sel Evol. 2001, 33: 209-229. 10.1186/1297-9686-33-3-209.PubMed CentralView ArticlePubMedGoogle Scholar
- Crossa J, de los Campos G, Pérez P, Gianola D, Burgueño J, Araus JL, Makumbi D, Singh RP, Dreisigacker S, Yan J, Arief V, Banziger M, Braun HJ: Prediction of genetic values of quantitative traits in plant breeding using pedigree and molecular markers. Genetics. 2010, 186: 713-724. 10.1534/genetics.110.118521.PubMed CentralView ArticlePubMedGoogle Scholar
- Daetwyler HD, Pong-Wong R, Villanueva B, Woolliams JA: The impact of genetic architecture on genome-wide evaluation methods. Genetics. 2010, 185: 1021-1031. 10.1534/genetics.110.116855. [http://view.ncbi.nlm.nih.gov/pubmed/20407128]PubMed CentralView ArticlePubMedGoogle Scholar
- Daetwyler HD, Villanueva B, Woolliams JA: Accuracy of predicting the genetic risk of disease using a genome-wide approach. PLoS One. 2008, 3: e3395-10.1371/journal.pone.0003395.PubMed CentralView ArticlePubMedGoogle Scholar
- Goddard M: Genomic selection: prediction of accuracy and maximisation of long term response. Genetica. 2009, 136: 245-257. 10.1007/s10709-008-9308-0.View ArticlePubMedGoogle Scholar
- Stone M: Cross-validatory choice and assessment of statistical predictions. J R Stat Soc Ser B Stat Methodol. 1974, 36: 111-147.Google Scholar
- Efron B, Tibshirani RJ: An introduction to the bootstrap. 1994, Chapman & Hall/CRC Monographs on Statistics & Applied ProbabilityGoogle Scholar
- Henderson CR: Applications of linear models in animal breeding. 1984, University of GuelphGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.