- Open Access
Estimation of heritability from limited family data using genome-wide identity-by-descent sharing
© Ødegård and Meuwissen; licensee BioMed Central Ltd. 2012
- Received: 1 February 2012
- Accepted: 8 May 2012
- Published: 8 May 2012
In classical pedigree-based analysis, additive genetic variance is estimated from between-family variation, which requires the existence of larger phenotyped and pedigreed populations involving numerous families (parents). However, estimation is often complicated by confounding of genetic and environmental family effects, with the latter typically occurring among full-sibs. For this reason, genetic variance is often inferred based on covariance among more distant relatives, which reduces the power of the analysis. This simulation study shows that genome-wide identity-by-descent sharing among close relatives can be used to quantify additive genetic variance solely from within-family variation using data on extremely small family samples.
Identity-by-descent relationships among full-sibs were simulated assuming a genome size similar to that of humans (effective number of loci ~80). Genetic variance was estimated from phenotypic data assuming that genomic identity-by-descent relationships could be accurately re-created using information from genome-wide markers. The results were compared with standard pedigree-based genetic analysis.
For a polygenic trait and a given number of phenotypes, the most accurate estimates of genetic variance were based on data from a single large full-sib family only. Compared with classical pedigree-based analysis, the proposed method is more robust to selection among parents and for confounding of environmental and genetic effects. Furthermore, in some cases, satisfactory results can be achieved even with less ideal data structures, i.e., for selectively genotyped data and for traits for which the genetic variance is largely under the control of a few major genes.
Estimation of genetic variance using genomic identity-by-descent relationships is especially useful for studies aiming at estimating additive genetic variance of highly fecund species, using data from small populations with limited pedigree information and/or few available parents, i.e., parents originating from non-pedigreed or even wild populations.
- Haplotype Block
- Additive Genetic Variance
- Additive Genetic Effect
- Effective Locus
- Common Environmental Effect
Estimates of additive genetic variance are commonly based on data from large pedigreed populations incorporating all known relationship information. Additive genetic relationships can be defined as twice the identity-by-descent (IBD) probability of two randomly drawn alleles, which can be estimated from pedigree data. The advantages of these pedigree-based analyses are that they do not require any knowledge about the genetic architecture of the traits and that the additive relationships are easily inferred from a known pedigree. However, these methods also have some major limitations. First, such analyses ignore relationships beyond those included in the known pedigree. Second, the assumed relationships are expected relationships (based on expected sharing of IBD alleles) rather than actual relationships. In fact, the pedigree relationship is exact only under an infinitesimal model , i.e., assuming that the additive genetic effects of the quantitative traits are controlled by an infinite number of unlinked loci. Under a more realistic finite-locus model (and assuming that some of the loci are linked), the actual relationships will be distributed around the expectation, with variable relationships among full-sibs and other relatives . By assuming (incorrectly) homogeneous relationships among the same type of relatives (e.g., sibs), in pedigree-based analyses, the genetic (co)variance components are estimated based on between-family variation only, since the Mendelian sampling deviations of non-parents cannot be separated from the residual (or permanent environmental) effects on the same animals . Estimation of genetic variance based on pedigree relationships is further complicated by the fact that common environmental effects may be important for some relatives, especially full-sibs (e.g., maternal environment, rearing environment, litter effects, etc.), which means that the genetic variance must be estimated from covariances among phenotypes of more distant relatives (e.g., half sibs, cousins, etc.).
Due to linkage between loci within the same chromosome, parents tend to pass on long segments of DNA to their offspring. Hence, the "effective" number of segregating loci within a full-sib family will be much lower than the corresponding number for the whole population, even for species with a larger genome. For example, recent reports have indicated that the effective number of segregating loci among full-sib pairs in humans is only about 80 [4, 5]. When the effective number of segregating loci is low, the actual relationships among full-sibs vary substantially among sib-pairs. Visscher et al.  have estimated that actual relationships among human full-sibs vary from 0.37 to 0.62, and used these relationships to quantify the additive genetic variation of human height based on within-family segregation only, i.e., free from non-genetic factors. In this study, the heritability values were based on more than 3000 sib pairs. With such a large dataset, including numerous families, the main challenge is not to estimate between-family variation, but rather to separate genetic effects from other effects that act on a family level. Visscher et al.  pointed out that one limitation of their method was that it required large datasets with densely genotyped individuals. Indeed, for a sib-pair design (twin study), a large number of full-sib pairs would be needed. However, for livestock, aquaculture species and laboratory animals, population structures are usually very different from those in humans, with much larger progeny groups of either full- or half sibs (or both). Therefore, the aim of the current study was to test whether genetic variance could be accurately estimated with relatively small datasets and a limited number of families, using a population structure typical of a high fecundity species (e.g., insects, crustaceans, fish or poultry), and whether the results could also be generalized to species in which only one of the sexes (usually males) has a large reproductive potential (e.g., mammalian livestock).
Genomic identity-by-descent relationships
Data sets and data structures
Simulated data structures
Animals per family
Input variance components
All combinations of structures and scenarios were run in 50 replicates and the results were averaged over replicates. However, for single-family structures, in which all animals are necessarily within the same family environment, there was no practical difference between Scenarios 1 and 2 (the environmental family effect will be included in the overall mean). Hence, 1200 different datasets were generated and analyzed.
where L is a lower triangular Cholesky decomposition of G, and z is a vector of standard normal deviates of length N (number of animals in the dataset). This assumes that (1) genetic variance is evenly distributed across the genome, and (2) gene effects are normally distributed, or that the aggregated effect of many genes, i.e., the breeding values, are approximately normally distributed, even when individual gene effects are not (due to the central limit theorem). It is also assumes that the different founder alleles at an "effective locus" have unique allelic effects, because an "effective locus" contains many genes and thus contains a unique combination of alleles at these genes. All datasets were generated using the MATLAB® software http://www.mathworks.com.
where Xβ includes fixed effects of each family, or in absence of common environmental family effects, the overall mean only. If fitted, common environmental family effects were included as fixed effects due to the fact that the number of families included was very small (and the number of observations per family large) and thus the associated variance was difficult to estimate.
Model 1: Genomic IBD animal model
In this model, the additive genetic effects were assumed: where G was calculated over all genomic regions, i.e. it was assumed that the inheritance of the DNA segments from parents to offspring could be accurately traced using marker information, and genetic variance was evenly distributed over genome segments (irrespective of the simulation scenario). The genomic IBD animal model is equivalent to a genome-wide gametic model, i.e., a model in which the original gametes received from the sire and dam and their associated actual relationships are reconstructed using genomic data . The relationship between two individuals in the animal model is twice the average of the four gametic relationships (coancestry) for the two individuals, and the animal genetic variance is twice the gametic variance.
Model 2: Pedigree-based animal model
This is the classical animal model, assuming where A is the numerator relationship matrix (inferred from the pedigree). Furthermore, since the classical model uses only between-family variation to estimate additive genetic effect variance, environmental effects common to full-sibs could not be included in the model for this simple data structure (irrespective of whether they were present or not).
For both models, variance components were estimated with restricted maximum likelihood methodology using the ASREML software package .
The genomic IBD model assumes that all animals are genotyped with a sufficiently dense marker map covering the entire genome. However, in some studies, selective genotyping of phenotypically extreme (high/low) animals within each family may be used to save costs. This may be a useful approach for QTL (Quantitative Trait Loci) detection, but our aim was to evaluate whether such data could be used to estimate quantitative genetic variation as well. In these analyses, we assumed a single family with 200, 500 or 1000 full-sibs, and for which only individuals with phenotypes deviating more than one residual standard deviation from the mean were genotyped. However, because including only the genotyped (phenotypically extreme) animals in the analysis would probably yield overestimated variance components, the non-genotyped animals were also included in the analysis. For the analyses, genomic IBD relationships among genotyped individuals were combined with pedigree relationships of non-genotyped individuals in a common relationship matrix [8, 9].
When assuming no common environmental variance (Scenario 2), the pedigree-based analyses were also unbiased but they were less precise than the genomic IBD analyses (Figure 3). As expected, if common environmental effects were not included in the data, the precision of the estimated heritability was improved, in particular for the smallest datasets using the classical model, while the precision of the IBD model was unaffected for the largest datasets (1000 individuals). The differences between the two models were most pronounced with larger datasets with a few families. For the IBD genomic model, within-family variation dominated estimation of genetic variance, and thus reducing family sizes to give room for more families led to more imprecise estimates of genetic variance.
This study shows that tracing genomic IBD relationships using genomic information has clear advantages, not only for prediction of individual breeding values  but also for estimation of genetic (co)variance components. Both the current and earlier studies have shown that genetic variance can be estimated based on within-family variation. In contrast, estimation of genetic variance in a classical genetic analysis is based only on between-family variation. Hence, for the latter, it is imperative that genetic and non-genetic family effects are properly separated by the model, which puts major limitations on the usefulness of family data, e.g., resemblance among full-sibs may also be due to similarities in the environment. Furthermore, for an accurate estimation of genetic variance in a classical model, many families must be included in the study and selection of data should be avoided. However, by using actual IBD sharing among sibs instead of expected relationships, genetic variation can be quantified solely from within-family variation , which also facilitates proper separation of genetic and non-genetic family effects (as the latter do not affect within-family variation). The current study shows that with the genomic IBD approach, genetic variance can be accurately inferred from a single family, and for a given number of observations, including more families gives less accurate results (even in the absence of common environmental effects). For a classical analysis (and in absence of common environmental effects), Falconer and Mackay  showed that the optimal family size for a specific number of observations under a full-sib design was n = 2/h2. However, using the genomic IBD approach, this formula is no longer valid. For a genomic IBD analysis, including only a single large family will maximize the precision of the predicted Mendelian sampling deviations of all family members, and there is also no need to separate genetic and environmental family effects. Furthermore, the number of families is necessarily lower than the number of individuals, and prediction of Mendelian sampling deviations is therefore more informative than prediction of family means with respect to genetic variance. Estimation of genetic variance from family means (classical model) is also sensitive to selection of data (both within and among families), and failure to account for this is expected to give downward biased estimates of the genetic variance. In populations undergoing artificial selection, parents of the phenotyped animals are usually selected, and unbiased genetic analysis requires that the selection history is properly included in the data, i.e., the analysis should involve data on multiple generations, which is not always available. However, selection among parents will have little impact on the within-family genetic variance (in absence of inbreeding). Since the genomic IBD model uses mainly within-family variation, it is expected to be more robust to selection among parents.
Even with selective genotyping, the genomic IBD model could estimate genetic variance relatively precisely for large families (N = 1000), although there was a tendency towards overestimated heritability values and less precise results for smaller family sizes. This may be explained by the fact that only the phenotypically most extreme individuals are genotyped and only these are informative with respect to partitioning of residual and Mendelian sampling variances.
The current study assumes that inheritance of the haplotype blocks from parents to offspring is known. In real data, this is never the case but we may observe genome-wide marker genotypes and this information can be used to trace inheritance of the haplotype blocks. Furthermore, since the number of recombinations per gamete is limited, sharing of haplotype blocks within a family can be estimated with a high degree of accuracy, even with sparsely distributed genome-wide markers . Reconstructing paternal and maternal haplotype blocks is equivalent to reconstructing the original gametes received from the sire and dam, making the genomic IBD animal model and a genomic IBD gametic model equivalent.
In species with a low reproductive potential, the proposed single-family mating design is of little relevance, since large full-sib groups cannot be produced. However, some species have a high reproductive potential among males, while the reproduction of females is often limited (e.g., in mammalian livestock). For such species, a sire gametic model may be more relevant. In such a model, the sires' gametes are reconstructed and the actual relationships between them estimated. Genetic variance can then be estimated from variation among the sires' gametes (half the within-family genetic variance), rather than variation among individuals (sires' and dams' gametes). In this model, genetic variation due to the dams' gametes will be included in the random residual term and genetic variance may be estimated from samples of offspring (e.g., daughters of dairy bulls).
The proposed method can also be generalized to IBD tracing in more complex pedigrees, i.e., beyond a single generation, allowing information from various types of relatives to be exploited by linkage analysis e.g., . More distant relatives are generally less related but their relationships are also expected to deviate more from their expected values . Hence, distant relatives may provide additional value to estimate genetic variance, especially for populations with smaller full- and half-sib groups. However, tracing haplotype blocks over multiple generations will be more challenging (shorter DNA blocks due to more recombination) and will require denser marker maps for accurate tracing compared with the one-generation (sib) approach.
The proposed method will underestimate the total genetic variance in cases where a fraction of the genome is not covered by the markers, i.e., if some of the "effective loci" are not accounted for in the G matrix. For instance, if a fraction q of the genome is not covered by markers, the total variance will also be underestimated by a fraction q when the single full-sib family design is used. When the design contains several families, the between-family variances are quite accurately predicted, even if part of the genome is not covered by markers, which will recover some of the underestimation. The underestimation may be completely recovered by including a polygenic effect in the model, which has a covariance structure equal to the pedigree-based relationship matrix, requiring several families of data.
Scenarios 1 and 2 in the current study assumed that genetic variance is evenly distributed over genomic regions, as assumed in the genomic BLUP (GBLUP) model . The main difference between the GBLUP model and the genomic IBD model is that the first model uses identity-by-state (IBS) relationships, while the latter uses IBD relationships (based on marker alleles traced back to a common ancestor). The assumption that genetic variance is distributed evenly across genomic regions has been shown to be an appropriate approximation for a number of traits [15, 16]. However, there are also examples of the opposite assumption, e.g., genetic variation in resistance against infectious pancreas necrosis in Atlantic salmon seems largely controlled by a single major QTL [17, 18]. For the latter type of traits, some of the underlying assumptions of both the pedigree-based and genomic IBD models are violated. First, within-family genetic variance will vary greatly among families, depending on the actual parental genotypes ("effective alleles") for the genomic region that primarily affects the trait (although it will still on average be ); in the example on Atlantic salmon, the within-family genetic variance will depend on whether or not the parents segregate for the major QTL. Second, IBD relationships in the most important linkage group(s) will dominate genetic covariance between relatives, not the overall genomic or expected (pedigree-based) IBD relationships. Still, even for such data, the genomic IBD model could estimate genetic variance more accurately than the classical pedigree-based analysis. Hence, although the genomic IBD relationships are not necessarily representative of the genetic covariance structure among sibs in this situation, they are still more informative than the pedigree-based relationships. In this setting, the differences between the classic pedigree and genomic IBD models increased with the size of the dataset (no practical difference with 200 individuals but a substantial difference with 1000 individuals). However, estimation of genetic variance within a single family was, as expected, highly prone to sampling effects. In Scenario 3, the real number of different breeding values represented within a single full-sib family is actually limited to four (two "effective alleles" per parent), which explains the large between-replicate deviations in the estimated heritability. Thus, in real data, for which the underlying genetics of the trait is generally unknown, it is recommended to use more than one family for quantitative genetic analysis, even when applying the genomic IBD approach.
The proposed genomic IBD method is particularly relevant for quantitative genetic studies aiming at estimating additive genetic variance of highly fecund species, using data on populations with limited pedigree information and/or few available parents. For example, genetic variance may be estimated based on a few full-sib-families with parents sampled from the wild or from non-pedigreed domesticated populations. In principle, the genomic IBD model (or equivalent gametic model) requires only a single large family for proper and accurate estimation of heritability for quantitative traits. In contrast, classical pedigree-based estimation requires the establishment of a sizeable pedigreed population consisting of numerous full- and (preferably) half-sib families to produce estimates with acceptable accuracy. Furthermore, the proposed genomic IBD model is expected to be less affected by selection among parents and will facilitate the separation of genetic and non-genetic family effects (e.g., effects of common rearing).
This work was supported by the grant 203699 (New statistical tools for integrating and exploiting complex genomic and phenotypic data sets), financed by the Research Council of Norway. The helpful comments of two anonymous reviewers are gratefully acknowledged.
- Fisher RA: The correlation between relatives on the supposition of Mendelian inheritance. Trans Royal Soc Edinburgh. 1918, 52: 399-433.View ArticleGoogle Scholar
- Hill WG: Variation in genetic composition in backcrossing programs. J Heredity. 1993, 84: 212-213.Google Scholar
- Ødegård J, Meuwissen THE, Heringstad B, Madsen P: A simple algorithm to estimate genetic variance in an animal threshold model using Bayesian inference. Genet Sel Evol. 2010, 42: 29-10.1186/1297-9686-42-29.PubMed CentralView ArticlePubMedGoogle Scholar
- Gagnon A, Beise J, Vaupel JW: Genome-wide identity-by-descent sharing among CEPH siblings. Genet Epidem. 2005, 29: 215-224. 10.1002/gepi.20090.View ArticleGoogle Scholar
- Visscher PM, Medland SE, Ferreira MAR, Morley KI, Zhu G, Cornes BK, Montgomery GW, Martin NG: Assumption-free estimation of heritability from genome-wide identity-by-descent sharing between full siblings. PLoS Genet. 2006, 2: e41-10.1371/journal.pgen.0020041.PubMed CentralView ArticlePubMedGoogle Scholar
- Goddard M: Genomic selection: prediction of accuracy and maximisation of long term response. Genetica. 2009, 136: 245-257. 10.1007/s10709-008-9308-0.View ArticlePubMedGoogle Scholar
- Gilmour AR, Gogel BJ, Cullis BR, Thompson R: ASReml user guide release 3.0. 2009, Hemel Hempstead: VSN International Ltd, 30Google Scholar
- Aguilar I, Misztal I, Johnson DL, Legarra A, Tsuruta S, Lawlor TJ: Hot topic: A unified approach to utilize phenotypic, full pedigree, and genomic information for genetic evaluation of Holstein final score. J Dairy Sci. 2010, 93: 743-752. 10.3168/jds.2009-2730.View ArticlePubMedGoogle Scholar
- Christensen O, Lund M: Genomic prediction when some animals are not genotyped. Genet Sel Evol. 2010, 42: 2-10.1186/1297-9686-42-2.PubMed CentralView ArticlePubMedGoogle Scholar
- Meuwissen THE, Hayes BJ, Goddard ME: Prediction of total genetic value using genome-wide dense marker maps. Genetics. 2001, 157: 1819-1829.PubMed CentralPubMedGoogle Scholar
- Falconer DS, Mackay TFC: Introduction to quantitative genetics. 1996, Essex: Longman Group LtdGoogle Scholar
- Habier D, Fernando RL, Dekkers JCM: Genomic selection using low-density marker panels. Genetics. 2009, 182: 343-353. 10.1534/genetics.108.100289.PubMed CentralView ArticlePubMedGoogle Scholar
- Fernando R, Grossman M: Marker assisted selection using best linear unbiased prediction. Genet Sel Evol. 1989, 21: 467-477. 10.1186/1297-9686-21-4-467.PubMed CentralView ArticleGoogle Scholar
- Hill WG, Weir BS: Variation in actual relationship as a consequence of Mendelian sampling and linkage. Genet Res (Camb). 2011, 93: 47-64. 10.1017/S0016672310000480.View ArticleGoogle Scholar
- Visscher PM, Macgregor S, Benyamin B, Zhu G, Gordon S, Medland S, Hill WG, Hottenga J-J, Willemsen G, Boomsma DI, Liu Y-Z, Deng H-W, Montgomery GW, Martin NG: Genome partitioning of genetic variation for height from 11,214 sibling pairs. Am J Human Genet. 2007, 81: 1104-1110. 10.1086/522934.View ArticleGoogle Scholar
- Yang J, Manolio TA, Pasquale LR, Boerwinkle E, Caporaso N, Cunningham JM, de Andrade M, Feenstra B, Feingold E, Hayes MG, Hill WG, Landi MT, Alonso A, Lettre G, Lin P, Ling H, Lowe W, Mathias RA, Melbye M, Pugh E, Cornelis MC, Weir BS, Goddard ME, Visscher PM: Genome partitioning of genetic variation for complex traits using common SNPs. Nat Genet. 2011, 43: 519-525. 10.1038/ng.823.PubMed CentralView ArticlePubMedGoogle Scholar
- Moen T, Baranski M, Sonesson AK, Kjøglum S: Confirmation and fine-mapping of a major QTL for resistance to infectious pancreatic necrosis in Atlantic salmon (Salmo salar): population-level associations between markers and trait. BMC Genomics. 2009, 10: 368-10.1186/1471-2164-10-368.PubMed CentralView ArticlePubMedGoogle Scholar
- Houston RD, Haley CS, Hamilton A, Guy DR, Mota-Velasco JC, Gheyas AA, Tinch AE, Taggart JB, Bron JE, Starkey WG, McAndrew BJ, Verner-Jeffreys DW, Paley RK, Rimmer GSE, Tew IJ, Bishop SC: The susceptibility of Atlantic salmon fry to freshwater infectious pancreatic necrosis is largely explained by a major QTL. Heredity. 2010, 105: 318-327. 10.1038/hdy.2009.171.View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.