Estimates of missing heritability for complex traits in Brown Swiss cattle
© Román-Ponce et al.; licensee BioMed Central Ltd. 2014
Received: 24 January 2013
Accepted: 28 April 2014
Published: 4 June 2014
Genomic selection estimates genetic merit based on dense SNP (single nucleotide polymorphism) genotypes and phenotypes. This requires that SNPs explain a large fraction of the genetic variance. The objectives of this work were: (1) to estimate the fraction of genetic variance explained by dense genome-wide markers using 54 K SNP chip genotyping, and (2) to evaluate the effect of alternative marker-based relationship matrices and corrections for the base population on the fraction of the genetic variance explained by markers.
Two alternative marker-based relationship matrices were estimated using 35 706 SNPs on 1086 dairy bulls. Both pedigree- and marker-based relationship matrices were fitted simultaneously or separately in an animal model to estimate the fraction of variance not explained by the markers, i.e. the fraction explained by the pedigree. The phenotypes considered in the analysis were the deregressed estimated breeding values (dEBV) for milk, fat and protein yield and for somatic cell score (SCS).
When the dEBV were not sufficiently accurate (reliability thresholds of 50 or 70%), the estimated fraction of the genetic variance explained by the markers was around 65% for yield traits and 45% for SCS. Scaling marker genotypes with locus-specific frequencies of heterozygotes slightly increased the variance explained by markers, compared with scaling with the average frequency of heterozygotes across loci. When the two relationship matrices were fitted separately, the estimated fraction of the genetic variance explained by the markers followed the same trends, but the results were underestimated. With less accurate dEBV, the fraction of the genetic variance explained by markers was underestimated, which is probably an artifact of the dEBV having been estimated with a pedigree-based animal model.
When using only highly accurate dEBV, the proportion of the genetic variance explained by the Illumina 54 K SNP chip was approximately 80% for Brown Swiss cattle. These results depend on the SNP chip used and on the family structure of the population, i.e. denser SNP panels and closer family relationships are expected to result in a higher fraction of the variance explained by the SNPs.
Genome-wide dense marker arrays that are available for livestock populations cover all chromosomes with dense single nucleotide polymorphism (SNP) markers. Many dairy cattle populations are currently being genotyped using these arrays [2–4]. The main objective is to apply genomic selection (GS). GS allows prediction of the genetic merit of young animals based on marker information in the absence of own performance data. The marker effects are estimated in a reference population, which must have both genotypic and phenotypic records. In the case of dairy bulls, phenotypic data come from genetic evaluations in the form of daughter yield deviation (DYD) or deregressed estimated breeding values (dEBV).
Identical by descent (IBD) alleles are alleles that descend from a common ancestor in the base population. The coefficient of coancestry between two animals is defined as the probability that two alleles sampled at random, one from each animal, are IBD, and twice the coancestry is defined as their numerator relationship. This approach leads to a matrix of relationships based on pedigree information. This matrix is fundamental for estimating genetic parameters of complex traits, such as heritability (defined as the proportion of the phenotypic variance in a population that is attributed to additive genetic effects). The relationship matrix based on pedigree data traces back to a base population, for which parents are unknown and whose members are considered unrelated, unselected and non-inbred. The choice of the base population affects the estimate of the additive genetic variance.
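The definition above (numerator relationship equal to twice the coancestry, with unrelated and non-inbred base animals) can be illustrated with the classical tabular method. This is a minimal sketch for small pedigrees, not the inbreeding algorithm of Meuwissen and Luo used later in the study; the function name and the five-animal pedigree are hypothetical.

```python
def numerator_relationship(pedigree):
    """Tabular method for the numerator relationship matrix A.

    pedigree: list of (animal, sire, dam) tuples, ordered so that parents
    appear before their offspring. Unknown parents are given as None and
    are treated as unrelated, non-inbred base animals.
    """
    index = {animal: k for k, (animal, _, _) in enumerate(pedigree)}
    n = len(pedigree)
    A = [[0.0] * n for _ in range(n)]
    for i, (_, sire, dam) in enumerate(pedigree):
        s = index.get(sire)  # None if the sire is unknown
        d = index.get(dam)   # None if the dam is unknown
        if (s is not None and s >= i) or (d is not None and d >= i):
            raise ValueError("parents must be listed before their offspring")
        # Relationship with each earlier animal: mean of the relationships
        # of that animal with the two parents.
        for j in range(i):
            a_js = A[j][s] if s is not None else 0.0
            a_jd = A[j][d] if d is not None else 0.0
            A[i][j] = A[j][i] = 0.5 * (a_js + a_jd)
        # Diagonal: 1 + inbreeding, where inbreeding is half the
        # relationship between the parents.
        a_sd = A[s][d] if (s is not None and d is not None) else 0.0
        A[i][i] = 1.0 + 0.5 * a_sd
    return A


# Hypothetical pedigree: animals 3 and 4 are full sibs, animal 5 is their offspring.
pedigree_example = [
    (1, None, None),
    (2, None, None),
    (3, 1, 2),
    (4, 1, 2),
    (5, 3, 4),
]
A_example = numerator_relationship(pedigree_example)
for row in A_example:
    print(["%.3f" % value for value in row])
```

In this example the full sibs have a relationship of 0.5 (coancestry 0.25), and their offspring has a diagonal element of 1.25, i.e. an inbreeding coefficient of 0.25.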
However, the relationship matrix can also be estimated from genome-wide genetic markers such as panels of SNPs [10–12]. Methods have been developed to construct such marker-based relationship matrices [12–15]. Recently, these relationship matrices have been used to dissect the additive genetic variance of complex traits.
The proportion of the genetic variance not captured by markers ($C_{miss}$) represents the variance that cannot be used by GS and affects the maximum accuracy that can be achieved by GS. The term ‘missing heritability’ describes the fact that marker-phenotype associations identified in genome-wide association studies do not explain all the genetic variance in complex traits (e.g. height in humans). Some strategies have been proposed to reduce $C_{miss}$: (1) increasing the sample size in order to also detect genes with smaller effects, (2) expanding the studies to non-European samples in human genetics, (3) enlarging the collection of phenotypes to explore gene-gene interactions, (4) changing the structure of the training population, mainly in terms of the relatedness of the included individuals, and (5) moving to the genomic selection approach instead of estimating the marker effect for each SNP individually [13, 19, 20]. In animal breeding, some results suggest that the Illumina Bovine54K chip array (Illumina Inc., San Diego, CA) does not capture all the additive genetic variation for all dairy traits [21–23], even when using the GS approach, which estimates all SNP effects simultaneously.
The main objective of this study was to estimate the fraction of the genetic variance not explained by the 54 K Illumina SNP chip. Two alternative marker-based relationship matrices were used for analysis.
Genotypic and phenotypic data
The phenotypic data available were the EBV for fat yield (FAT), milk yield (MILK), protein yield (PROT) and somatic cell score in milk (SCS) for each bull, which were calculated by the Italian National Association of Brown Swiss (ANARB). The EBV were deregressed as proposed by Garrick, in order to eliminate the shrinkage contained in the EBV and to remove ancestral information. The deregressed EBV (dEBV) were used as phenotypic records for the bulls, with heritability equal to the reliability of the EBV.
Three subsets were formed according to the reliability of EBV as follows: animals with a reliability of at least 50% for each trait; animals with a reliability greater than 70% for each trait; animals with a reliability of at least 90% for each trait.
Relationship matrices: A and G
A pedigree file was extracted from the Italian Brown Swiss herd book. Pedigree was traced back five generations and the pedigree file included 6826 entries. The completeness of the pedigree was 100% up to the grandparents, and decreased to ~90% thereafter. The equivalent number of known generations, as calculated by the software Pedig, was on average 5.14 and the median was 5.23. The pedigree file was used to estimate the additive genetic relationships (A) with an adapted version of the procedure proposed by Meuwissen and Luo, as implemented in ASREML.
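The marker-based counterpart of A is the genomic relationship matrix G. The two scalings compared in this study, the average frequency of heterozygotes across loci ($G_V$) and locus-specific frequencies of heterozygotes ($G_Y$), can be sketched as follows. The exact formulas are assumptions: $G_V$ is taken to follow VanRaden's first method, $G_Y$ applies the locus-specific scaling to all elements including the diagonal, and allele frequencies are estimated from the genotyped animals themselves. The function name and the toy genotypes are hypothetical.

```python
def genomic_relationship_matrices(genotypes):
    """Return (G_V, G_Y) for genotypes coded 0/1/2 (animals x loci).

    G_V: centred genotype cross-products divided by the sum over loci of
         the expected heterozygosity 2p(1-p).
    G_Y: each locus is scaled by its own 2p(1-p) before averaging over loci.
    Allele frequencies p are estimated from the data (an assumption).
    """
    n = len(genotypes)
    m = len(genotypes[0])
    freqs = [sum(row[k] for row in genotypes) / (2.0 * n) for k in range(m)]
    het = [2.0 * p * (1.0 - p) for p in freqs]
    if any(h == 0.0 for h in het):
        raise ValueError("monomorphic loci must be removed first")
    # Centre the genotypes by twice the allele frequency.
    Z = [[row[k] - 2.0 * freqs[k] for k in range(m)] for row in genotypes]
    sum_het = sum(het)
    G_V = [[sum(Z[i][k] * Z[j][k] for k in range(m)) / sum_het
            for j in range(n)] for i in range(n)]
    G_Y = [[sum(Z[i][k] * Z[j][k] / het[k] for k in range(m)) / m
            for j in range(n)] for i in range(n)]
    return G_V, G_Y


# Hypothetical genotypes: 4 animals, 3 loci.
genotypes_example = [
    [0, 1, 2],
    [1, 1, 0],
    [2, 0, 1],
    [1, 0, 1],
]
G_V_example, G_Y_example = genomic_relationship_matrices(genotypes_example)
print("G_V diagonal:", [round(G_V_example[i][i], 3) for i in range(4)])
print("G_Y diagonal:", [round(G_Y_example[i][i], 3) for i in range(4)])
```

The two matrices coincide when all loci have the same allele frequency; they differ most when allele frequencies vary widely, because $G_Y$ gives more weight to loci with rare alleles.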
Correction for the base population
where $F_{st}$ is the average inbreeding in the population, i.e. the average of the diagonal elements of G minus 1, and $F_{is}$ is the inbreeding of animal i relative to the population average inbreeding $F_{st}$, which is calculated as:
where $\phi_{jis}$ is the kinship of animals j and i relative to the base population inbreeding, $F_{st}$.
Estimation of variance components
The following model was fitted:

$$\mathbf{y} = \mathbf{1}\mu + \mathbf{Z}_1\mathbf{a} + \mathbf{Z}_2\mathbf{u} + \mathbf{e}$$

where $\mathbf{y}$ is the vector of the dEBV; $\mu$ is the overall mean; $\mathbf{Z}_1$ and $\mathbf{Z}_2$ are the incidence matrices for pedigree-based and genomic random animal effects, respectively; $\mathbf{a}$ is the vector of the random additive genetic animal effects using the pedigree-based relationship matrix, with $\mathbf{a} \sim N(\mathbf{0}, \mathbf{A}\sigma^2_a)$; $\mathbf{u}$ is the vector of random additive genetic effects using the genomic relationship matrix, with $\mathbf{u} \sim N(\mathbf{0}, \mathbf{G}\sigma^2_u)$; and $\mathbf{e}$ is the vector of random residual effects. Because the number of daughters per bull was high for all bulls, the reliabilities of the dEBV were high and varied little between bulls, so a homogeneous error variance structure was assumed.
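Variance components for a model with these terms are typically estimated by restricted maximum likelihood (REML), which is the criterion used by ASReml. As a sketch, assuming that the overall mean is the only fixed effect and that the residual variance $\sigma^2_e$ is homogeneous as stated above, the quantities involved are:

```latex
\begin{aligned}
\mathbf{V} &= \mathbf{Z}_1\mathbf{A}\mathbf{Z}_1'\,\sigma^2_a
            + \mathbf{Z}_2\mathbf{G}\mathbf{Z}_2'\,\sigma^2_u
            + \mathbf{I}\,\sigma^2_e, \\
\mathbf{P} &= \mathbf{V}^{-1}
            - \mathbf{V}^{-1}\mathbf{1}
              \left(\mathbf{1}'\mathbf{V}^{-1}\mathbf{1}\right)^{-1}
              \mathbf{1}'\mathbf{V}^{-1}, \\
\ell_R(\sigma^2_a, \sigma^2_u, \sigma^2_e) &=
  -\tfrac{1}{2}\left[\log|\mathbf{V}|
  + \log\left|\mathbf{1}'\mathbf{V}^{-1}\mathbf{1}\right|
  + \mathbf{y}'\mathbf{P}\mathbf{y}\right] + \text{const.}
\end{aligned}
```

When A and G are very similar, $\sigma^2_a$ and $\sigma^2_u$ are hard to separate and the likelihood surface can be flat in the direction that trades one for the other, which may lead to large standard errors or convergence problems in small datasets.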
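The fraction of the genetic variance not explained by the markers was calculated as:

$$C_{miss} = \frac{\sigma^2_a}{\sigma^2_g}, \qquad \sigma^2_g = \sigma^2_u + \sigma^2_a$$

where $\sigma^2_g$ is the total genetic variance, $\sigma^2_u$ is the variance due to marker-based relationships and $\sigma^2_a$ is the variance due to pedigree-based relationships.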
The two additive genetic variances were also estimated by fitting each separately: the additive genetic animal variance using the pedigree-based relationship matrix ($\sigma^2_{a0}$) and the additive genetic variance using the genomic relationship matrix ($\sigma^2_{u0}$). These estimates were used to calculate an alternative estimate for the fraction of genetic variance not addressed by the markers on the SNP chip ($C_{miss2}$) as follows:

$$C_{miss2} = 1 - \frac{\sigma^2_{u0}}{\sigma^2_{a0}}$$

The estimate $C_{miss2}$ has the advantage that $\sigma^2_{a0}$ is known to yield an unbiased estimate of the genetic variance, but it has the disadvantage that $\sigma^2_{u0}$ is likely to include more genetic variance than that explained by QTL that are in LD with the markers. For example, if only some of the chromosomes contain markers, these markers can explain genetic variance on the unmarked chromosomes, because the markers trace family relationships. If, in the latter case, the pedigree-based relationship matrix is fitted simultaneously with the marker-based relationship matrix, the variance due to the unmarked chromosomes is expected to be included in the polygenic variance, $\sigma^2_a$, because the pedigree-based relationship matrix more closely resembles the family relationships at the unmarked chromosomes than the marker-based matrix does, since relationships at the marked chromosomes may (randomly) deviate from the pedigree. Thus, $C_{miss2}$ is expected to underestimate the fraction of missing genetic variance.
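The two estimators can be compared with a few lines of code. This is a minimal sketch that assumes $C_{miss} = \sigma^2_a / (\sigma^2_a + \sigma^2_u)$ for the joint model and $C_{miss2} = 1 - \sigma^2_{u0} / \sigma^2_{a0}$ for the separate models; the function names and the variance values are hypothetical and are not results from this study.

```python
def c_miss(sigma2_a, sigma2_u):
    """Missing fraction when A and G are fitted together.

    sigma2_a: variance attributed to pedigree-based relationships.
    sigma2_u: variance attributed to marker-based relationships.
    """
    total = sigma2_a + sigma2_u
    if total <= 0.0:
        raise ValueError("total genetic variance must be positive")
    return sigma2_a / total


def c_miss2(sigma2_a0, sigma2_u0):
    """Missing fraction when A and G are fitted in separate models.

    sigma2_a0: genetic variance from the pedigree-only model.
    sigma2_u0: genetic variance from the marker-only model.
    """
    if sigma2_a0 <= 0.0:
        raise ValueError("pedigree-based variance must be positive")
    return 1.0 - sigma2_u0 / sigma2_a0


# Hypothetical variance components (arbitrary units).
joint = c_miss(sigma2_a=20.0, sigma2_u=80.0)
separate = c_miss2(sigma2_a0=100.0, sigma2_u0=90.0)
print(f"C_miss  (joint model)     = {joint:.3f}")
print(f"C_miss2 (separate models) = {separate:.3f}")
```

With these made-up values the marker-only model absorbs part of the variance that the joint model assigns to the pedigree, so $C_{miss2}$ is smaller than $C_{miss}$, which is the direction of bias described above.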
Descriptive statistics for deregressed estimated breeding values (dEBV) and reliabilities ($r^2$) for production traits
Proportion of genetic variance not explained by markers
Proportion of genetic variance not explained by markers ($C_{miss}$) ± standard error (SE) for dEBV for production traits
0.363 ± 0.069
0.373 ± 0.068
0.363 ± 0.072
0.369 ± 0.070
0.305 ± 0.074
0.337 ± 0.076
0.357 ± 0.074
0.342 ± 0.077
0.358 ± 0.075
0.199 ± 0.101
0.245 ± 0.098
0.345 ± 0.077
0.363 ± 0.074
0.344 ± 0.078
0.357 ± 0.076
0.206 ± 0.098
0.235 ± 0.095
0.486 ± 0.095
0.532 ± 0.091
0.492 ± 0.101
0.530 ± 0.097
0.061 ± 0.197
Proportion of genetic variance not explained by markers ($C_{miss2}$) for dEBV for production traits
Results for dMILK, dPROT and dSCS were similar to those described above for dFAT for both genomic relationship matrices. Estimates of $C_{miss}$ for dMILK70 and dPROT70 hardly differed from those for dMILK50 and dPROT50, respectively. The subsets with dEBV90 resulted in estimates of $C_{miss}$ of 0.199 (±0.101) for dMILK90 and 0.206 (±0.098) for dPROT90 when using $G_Y$. These estimates were not significantly different from those obtained with the larger datasets for the same traits (dEBV50 or dEBV70), although they were systematically lower for all traits.
The highest estimates of $C_{miss}$ were obtained for dSCS50, with 0.532 (±0.091) for $G_V$. When using $G_Y$, the corresponding $C_{miss}$ estimate was lower (0.486 ± 0.095). The smallest $C_{miss}$ estimate was obtained for dSCS90: 0.061 (±0.197) using $G_Y$. The variance component analysis with $G_V$ on the same dataset did not converge. This was the smallest dataset and, although its average reliability was the highest, the estimate of $C_{miss}$ was not significantly different from 0.
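How "not significantly different from 0" was judged is not stated here. A minimal sketch, assuming a simple Wald-type comparison of each estimate with its standard error and a normal approximation (which is only rough for a ratio bounded between 0 and 1), uses the estimates quoted in the text; the function names are hypothetical.

```python
import math


def wald_z(estimate, se):
    """Ratio of an estimate to its standard error."""
    return estimate / se


def two_sided_p(z):
    """Two-sided p-value under a standard normal approximation."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Estimates and standard errors quoted in the text.
reported = {
    "dSCS50, G_V": (0.532, 0.091),
    "dSCS50, G_Y": (0.486, 0.095),
    "dMILK90, G_Y": (0.199, 0.101),
    "dSCS90, G_Y": (0.061, 0.197),
}
for label, (estimate, se) in reported.items():
    z = wald_z(estimate, se)
    print(f"{label}: z = {z:.2f}, approximate p = {two_sided_p(z):.3f}")
```

Under this approximation the dSCS90 estimate is well within one standard error of 0, whereas the dSCS50 estimates are more than five standard errors away from 0.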
In general, estimates of $C_{miss2}$ decreased as the reliability of the dEBV increased. Estimates of $C_{miss2}$ differed from estimates of $C_{miss}$, probably because $C_{miss2}$ is expected to underestimate the fraction of the missing genetic variance.
We estimated the fraction of the genetic variance not accounted for by the SNPs in the marker panel ($C_{miss}$) based on the Illumina 54 K SNP chip for complex traits in dairy cattle. The results showed that the estimates of $C_{miss}$ depended on the reliability of the phenotypes considered, i.e. the dEBV used as response values. When the accuracy of the dEBV increased, i.e. when the correlation between the dEBV and the true breeding value increased, the proportion of the genetic variance explained by SNPs tended to increase. When the reliability of the dEBV is low, the family/pedigree information contributes greatly to the estimation of the EBV, which results in a larger fraction of the variance being explained by A and, in turn, in an upward bias of $C_{miss}$. Because the estimates of $C_{miss}$ are expected to be overestimated due to the use of (family information in) dEBV, the best estimates of $C_{miss}$ are obtained for datasets with high reliabilities, which resulted in estimates around 0.2. This implies that the maximum accuracy of genomic estimated breeding values (GEBV) is $\sqrt{1 - C_{miss}} \approx 0.9$, which agrees with the result of Daetwyler, who studied the increase in the accuracy of GEBV with increasing training population sizes.
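The arithmetic behind the maximum accuracy of about 0.9 can be checked directly. The sketch below applies the square root of $1 - C_{miss}$ to the rounded value of 0.2 and to the two high-reliability estimates quoted in the results; the function name is hypothetical.

```python
import math


def max_gebv_accuracy(c_miss):
    """Upper bound on GEBV accuracy when a fraction c_miss of the
    genetic variance is not captured by the markers."""
    if not 0.0 <= c_miss <= 1.0:
        raise ValueError("c_miss must lie between 0 and 1")
    return math.sqrt(1.0 - c_miss)


for label, value in [
    ("rounded estimate", 0.2),
    ("dMILK90, G_Y", 0.199),
    ("dPROT90, G_Y", 0.206),
]:
    print(f"{label}: C_miss = {value:.3f}, "
          f"maximum accuracy = {max_gebv_accuracy(value):.3f}")
```

Because of the square root, the bound is fairly insensitive to small changes in $C_{miss}$ in this range: all three values give a maximum accuracy between 0.89 and 0.90.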
For all production traits, the fraction of the genetic variance not explained by the SNPs was significantly different from 0, even when the phenotypes were very accurate (reliability > 90%) and were, therefore, very close to the true breeding values. Correction for the base population did not affect the fraction of the genetic variance explained by markers for any of the marker-based relationship matrices used here. The differences in $C_{miss}$ estimates between $G_V$ and $G_Y$ were negligible for all traits and all subsets. Similarly, when using EBV instead of dEBV (results not shown), the results were virtually the same.
If original performance records for production and SCS phenotypes are used to estimate $C_{miss}$, instead of dEBV, the upward biases mentioned above are not expected to occur. The error variances would be higher than when using dEBV, but the value of $\sigma^2_a$ would not be inflated, because family information does not contribute to an animal's own phenotype (in contrast to dEBV phenotypes).
The sources of phenotypic information used in genomic analyses are very heterogeneous and range from individuals with highly reliable information, i.e. progeny-tested bulls, to animals with phenotypes of low accuracy, i.e. young cows. To take these differences in reliability into account in a weighted analysis, it is necessary to know the value of $C_{miss}$ for each phenotype. In addition, a polygenic effect must be included in the model to account for unmarked genetic effects. Knowledge of the fraction of the genetic variance not explained by markers is also required to predict the accuracy of the genomic predictions for each individual in the population, since it affects the maximum accuracy that can be achieved.
The base population correction of the genomic relationship matrix generally affected neither the proportion of genetic variance captured by markers, nor the genetic variance captured by the pedigree-based relationship matrices, which agrees with [17, 30] but not with . The latter authors, however, scaled the relationships in the opposite direction, i.e. when G relationships were too high, they scaled all relationships downwards, which further decreased the differences in relationships that were already small since relationships are bound by a maximum of 1 (and vice-versa when G relationships were too small). Moreover, the correction for the base population facilitates the integration of relationship matrices A and G into a single matrix (H), according to Legarra et al., Christensen and Lund, and Meuwissen et al.
We also estimated $C_{miss2}$ using the pedigree-based estimate of genetic variance. The denominators of $C_{miss}$ and $C_{miss2}$ were significantly different from each other, but both estimates revealed that the genomic relationship matrix could explain more than 95% of genetic variance if sufficiently reliable phenotypes are used (with reliabilities greater than 95%).
It should be noted that the estimates of $C_{miss}$ and $C_{miss2}$ depend on the SNP chip used, i.e. denser SNP chips are expected to yield lower estimates of $C_{miss}$ and $C_{miss2}$ (a larger fraction of the variance is explained by the SNPs), and also on the family structure of the population. Populations with more closely related individuals are expected to show high LD between SNPs and QTL, even when these are physically quite far apart, and, therefore, to yield lower estimates of $C_{miss}$. The population structure of the Italian Brown Swiss population reflects that of a typical dairy breeding population and, thus, our results probably also apply to other dairy breeding populations.
The fraction of genetic variance explained by genetic markers from high-density SNP panels was significantly different from 0 for the complex traits analyzed when the phenotypes were not highly accurate. The minimum fraction of the genetic variance not explained by the markers ($C_{miss}$) was equal to 0.2, which was estimated based on the most accurate phenotypes. This value agrees with other values reported in the literature. Correcting the genomic relationship matrix for the variance of the allele frequency of each locus ($G_Y$), instead of for the average frequency of heterozygotes ($G_V$), hardly explained any additional genetic variance. Our estimate of $C_{miss}$ of 0.2 implies that about 80% of the genetic variance is explained by the Illumina 54 K SNP chip. Values for $C_{miss}$ are expected to depend on the density of the chip (a denser SNP chip is expected to explain a larger fraction of the genetic variance) and on family relationships in the population, i.e. closer family relationships are expected to reduce $C_{miss}$.
The helpful comments of three reviewers are gratefully acknowledged. We gratefully acknowledge the Italian Brown Cattle Breeders’ Association (ANARB) for collecting, handling and sharing data. The research leading to these results has received funding from the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 222664 (“Quantomics”). This article reflects only the authors’ views and the European Community is not liable for any use that may be made of the information contained herein.
- Matukumalli LK, Lawley CT, Schnabel RD, Taylor JF, Allan MF, Heaton MP, O’Connell J, Moore SS, Smith TPL, Sonstegard TS, Van Tassell CP: Development and characterization of a high density SNP genotyping assay in cattle. PLoS ONE. 2009, 4: e5350.
- Berry DP, Kearney F, Harris B: Genomic selection in Ireland. Interbull Bull. 2009, 39: 29-34.
- Schenkel FS, Sargolzaei M, Kistemaker G, Jansen GB, Sullivan P, Van Doormaal BJ, VanRaden PM, Wiggans GR: Reliability of genomic evaluation of Holstein cattle in Canada. Interbull Bull. 2009, 39: 51-58.
- VanRaden PM, Van Tassell CP, Wiggans GR, Sonstegard TS, Schnabel RD, Taylor JF, Schenkel FS: Invited review: Reliability of genomic predictions for North American Holstein bulls. J Dairy Sci. 2009, 92: 16-24.
- Meuwissen THE, Hayes BJ, Goddard ME: Prediction of total genetic value using genome-wide dense marker maps. Genetics. 2001, 157: 1819-1829.
- Calus MPL: Genomic breeding values prediction: Methods and procedures. Animal. 2010, 4: 157-164.
- Wright S: Coefficients of inbreeding and relationship. Am Nat. 1922, 56: 330-338.
- Malécot G: Les Mathématiques de l’Hérédité. 1948, Paris: Masson et Cie.
- van der Werf JH, de Boer IJ: Estimation of additive genetic variance when base populations are selected. J Anim Sci. 1990, 68: 3124-3132.
- Fernando RL: Genetic evaluation and selection using genotypic, phenotypic and pedigree information. Proceedings of the 6th World Congress on Genetics Applied to Livestock Production: 11–16 January 1998; Armidale. 1998, 26: 329-336.
- Habier D, Fernando RL, Dekkers JCM: The impact of genetic relationship information on genome-assisted breeding values. Genetics. 2007, 177: 2389-2397.
- VanRaden PM: Efficient methods to compute genomic predictions. J Dairy Sci. 2008, 91: 4414-4423.
- Christensen OF, Lund MS: Genomic prediction when some animals are not genotyped. Genet Sel Evol. 2010, 42: 2.
- Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, Madden PA, Heath AC, Martin NG, Montgomery GW, Goddard ME, Visscher PM: Common SNPs explain a large proportion of the heritability for human height. Nat Genet. 2010, 42: 565-569.
- Meuwissen THE, Luan T, Woolliams JA: The unified approach to the use of genomic and pedigree information in genomic evaluations revisited. J Anim Breed Genet. 2011, 128: 429-439.
- Lee SH, Goddard ME, Visscher PM, van der Werf JHJ: Using the realized relationship matrix to disentangle confounding factors for the estimation of genetic variance components of complex traits. Genet Sel Evol. 2010, 42: 22.
- Dekkers JC: Prediction of response to marker-assisted and genomic selection using selection index theory. J Anim Breed Genet. 2007, 124: 331-341.
- Maher B: Personal genomes: The case of the missing heritability. Nature. 2008, 456: 18-21.
- Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A, Cho JH, Guttmacher AE, Kong A, Kruglyak L, Mardis E, Rotimi CN, Slatkin M, Valle D, Whittemore AS, Boehnke M, Clark AG, Eichler EE, Gibson G, Haines JL, Mackay TFC, McCarroll SA, Visscher PM: Finding the missing heritability of complex diseases. Nature. 2009, 461: 747-753.
- Makowsky R, Pajewski NM, Klimentidis YC, Vazquez AI, Duarte CW, Allison DB, de los Campos G: Beyond missing heritability: prediction of complex traits. PLoS Genet. 2011, 7: e1002051.
- Garrick DJ, Taylor JF, Fernando RL: Deregressing estimated breeding values and weighting information for genomic regression analyses. Genet Sel Evol. 2009, 41: 55.
- Daetwyler HD: Genome-Wide Evaluation of Populations. PhD Thesis. 2009, Wageningen: Wageningen University.
- Haile-Mariam M, Nieuwhof GJ, Beard KT, Konstantinov KV, Hayes BJ: Comparison of heritabilities of dairy traits in Australian Holstein-Friesian cattle from genomic and pedigree data and implications for genomic evaluations. J Anim Breed Genet. 2013, 130: 20-31.
- Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ, Sham PC: PLINK: a toolset for whole-genome association and population-based linkage analysis. Am J Hum Genet. 2007, 81: 559-575.
- Boichard D, Maignel L, Verrier E: The value of using probabilities of gene origin to measure genetic variability in a population. Genet Sel Evol. 1997, 29: 5-23.
- Meuwissen THE, Luo Z: Computing inbreeding coefficients in large populations. Genet Sel Evol. 1992, 24: 305-313.
- Gilmour AR, Gogel BJ, Cullis BR, Thompson R: ASREML User Guide Release 3.0. 2009, Queensland, Australia: The Department of Primary Industries and Fisheries.
- Goddard ME, Hayes B, Meuwissen THE: Using the genomic relationship matrix to predict the accuracy of genomic selection. J Anim Breed Genet. 2011, 128: 409-421.
- Butler D, Cullis B, Gilmour A, Gogel B: ASReml-R Reference Manual, Version 3. 2009, Queensland, Australia: The Department of Primary Industries and Fisheries.
- Sorensen DA, Kennedy BW: Estimation of genetic variances from unselected and selected populations. J Anim Sci. 1984, 59: 1213-1223.
- Forni S, Aguilar I, Misztal I: Different genomic relationship matrices for single-step analysis using phenotypic, pedigree and genomic information. Genet Sel Evol. 2011, 43: 1.
- Legarra A, Aguilar I, Misztal I: A relationship matrix including full pedigree and genomic information. J Dairy Sci. 2009, 92: 4656-4663.
- Jensen J, Su G, Madsen P: Partitioning additive genetic variance into genomic and remaining polygenic components for complex traits in dairy cattle. BMC Genet. 2012, 13: 44.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.