Using the realized relationship matrix to disentangle confounding factors for the estimation of genetic variance components of complex traits
Genetics Selection Evolution volume 42, Article number: 22 (2010)
In the analysis of complex traits, genetic effects can be confounded with non-genetic effects, especially when using full-sib families. Dominance and epistatic effects are typically confounded with additive genetic and non-genetic effects. This confounding may cause the estimated genetic variance components to be inaccurate and biased.
In this study, we constructed genetic covariance structures from whole-genome marker data, and thus used realized relationship matrices to estimate variance components in a heterogeneous population of ~2200 mice for which four complex traits were investigated. These mice were genotyped for more than 10,000 single nucleotide polymorphisms (SNP) and the variances due to family, cage and genetic effects were estimated by models based on pedigree information only, aggregate SNP information, and model selection for specific SNP effects.
Results and conclusions
We show that the use of genome-wide SNP information can disentangle confounding factors to estimate genetic variances by separating genetic and non-genetic effects. The estimated variance components using realized relationship were more accurate and less biased, compared to those based on pedigree information only. Models that allow the selection of individual SNP in addition to fitting a relationship matrix are more efficient for traits with a significant dominance variance.
Complex traits are important in evolution, human medicine, forensics and artificial selection programs [1–4]. Most complex traits show a mode of inheritance that may be caused by many functional genes with additive and dominance effects, and possibly epistatic interactions, and environmental effects [5, 6].
Traditionally, pedigree information has been used to estimate heritabilities and genetic effects for complex traits [7–10]. In many family studies, non-genetic factors such as familial or shared environmental effects can be confounded with genetic factors. In particular for full-sibs there is confounding between shared environmental effects, additive genetic effects and non-additive genetic effects.
Recently, it has become feasible to generate individual genotype information on large numbers of single nucleotide polymorphisms (SNP) across the whole genome, and genome-wide association studies have been performed in a number of species [12, 13]. It is expected that SNP and causal genes will be in linkage disequilibrium (LD), making it possible to genetically dissect variation in complex traits in a more effective way. Indeed, it has been shown that whole-genome dense SNP analyses can provide extra benefits compared to classical approaches based on pedigree information only.
In this study, we propose novel strategies that utilize dense SNP data for the genetic dissection of complex traits. First, we estimate a realized relationship matrix based on aggregate SNP information [16–18]. The realized relationship matrix in a classical mixed linear model makes it possible to obtain more accurate and reliable estimates for the narrow sense heritability, compared to traditional pedigree-based analysis [19, 20]. Second, we explicitly search for additional additive and dominance effects that may not have been already captured, by using a Bayesian model selection approach. In the process, a stochastic model selection of random SNP effects is carried out nested in a mixed linear model with additive polygenic effects. Additional genetic effects found in this process make it possible to estimate additive genetic and dominance variances with greater precision for some traits which have significant dominance effects. We examine the estimates by using a validation step where unobserved phenotypes in an independent validation set are predicted. We use phenotypic data for four complex traits and genotypic data for ~2200 mice with ~11,000 SNP across the whole genome.
Publicly available data including pedigree, genotypic and phenotypic information on heterogeneous stock mice were used (http://gscan.well.ox.ac.uk/). The total number of animals was 2,296 from 85 unrelated families. The available pedigree spanned four generations. In this complex pedigree, there were 172 full-sib families with an average size of ~11 (SD ~8). The mice were reared in a total of 536 cages, and the number of animals per cage ranged from two to seven. This number was used as a cage density factor in the analyses. Figure 1 describes the family structure for one of the 85 unrelated families, which contains 44 members and five nuclear (full-sib) families. Cage information is displayed below each animal when known and indicates a fair degree of confounding between cages and families. Genotypes were available for 12,112 SNP on most animals in the pedigree, and we used the 11,730 SNP located on the autosomal chromosomes. The sex chromosomes were excluded because modeling them would complicate the analyses without greatly changing the estimates. The phenotypes were already adjusted for environmental fixed effects, e.g. sex, age, year and season [21, 22]. However, the effects due to cage, cage density and family were further modeled with and without using information on SNP and additive polygenic effects. Four complex traits were investigated, i.e. coat color (CC) (a score from light to dark), weight at 10 weeks (WT), recovery from ear punctuation (REP), and freezing time during cue (FDC). These traits were chosen for the following reasons: CC has a number of major genes with relatively large effects and a small environmental variance, WT is a typical quantitative trait with the variance probably affected by numerous genes, REP is a quantitative trait with a moderate heritability, and FDC is a quantitative trait with a low heritability.
Preliminary analysis for each trait
The intra-class correlation of phenotypes for groups having relationship k based on pedigree information was estimated (k = 1/16, 1/8, 1/4 and 1/2). For example, the intra-class correlation for the group with relationship k = 1/2 was that for full-sibs. However, for relationship k = 1/16, 1/8, and 1/4, it was difficult to group and classify them because of the complicated pedigree structure. In order to estimate intra-class correlations for the group with relationship k, pairs of relationship k were used, but in a way that there were no relationships between individuals of different pairs, i.e. relationship = k within each pair and relationship = 0 for individuals of different pairs. Because of this restriction, not all pairs of relationship k could be used simultaneously. Therefore, we sampled 10,000 independent pairs for each relationship k for each trait. The number of pairs for relationship k, and the average number of pairs in 10,000 samples are given in Table 1. The variance between these sampled pairs scaled by total variance would be the intra-class correlation  for individuals having a relationship k. Estimated intra-class correlations were averaged over the 10,000 sampling sets. These correlations are, approximately, the summary statistics that are modeled in the variance component analyses.
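The pair-sampling procedure described above can be sketched as follows. This is a minimal illustration rather than the code used in the study: the function names, the greedy sampling order and the `is_related` callback are assumptions, and the intra-class correlation is computed with the standard one-way analysis of variance estimator for groups of size two.

```python
import random

def pair_intraclass_correlation(pairs):
    """Between-pair variance scaled by total variance (one-way ANOVA, group size 2)."""
    n = len(pairs)
    grand_mean = sum(a + b for a, b in pairs) / (2 * n)
    # Between-pair mean square, n - 1 degrees of freedom
    msb = 2.0 * sum(((a + b) / 2 - grand_mean) ** 2 for a, b in pairs) / (n - 1)
    # Within-pair mean square, n degrees of freedom
    msw = sum((a - b) ** 2 for a, b in pairs) / (2.0 * n)
    return (msb - msw) / (msb + msw)

def sample_independent_pairs(candidate_pairs, is_related, rng):
    """Greedily keep pairs whose members are unrelated to every animal already kept.

    candidate_pairs: list of (animal_a, animal_b) sharing relationship k (hypothetical input).
    is_related(x, y): True if the pedigree relationship between x and y is non-zero;
    it must return True when x and y are the same animal.
    """
    shuffled = list(candidate_pairs)
    rng.shuffle(shuffled)
    chosen, members = [], []
    for a, b in shuffled:
        if all(not is_related(a, m) and not is_related(b, m) for m in members):
            chosen.append((a, b))
            members.extend((a, b))
    return chosen
```

Repeating the sampling with different seeds and averaging the resulting correlations would mimic the averaging over sampling sets described in the text.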
Mixed linear model implementing a numerator relationship matrix based on pedigree information
A mixed linear model analysis was used to estimate random polygenic, cage and family effects, and the fixed effect of cage density. The model can be expressed as,
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{W}\mathbf{f} + \mathbf{U}\mathbf{c} + \mathbf{Z}\mathbf{u} + \mathbf{e} \qquad (1)$$

where y is a vector of $N_r$ phenotypic observations, β is a vector of fixed effects including the overall mean and the cage density as covariates, f is a vector of $N_f$ random environmental family effects, c is a vector of $N_c$ random environmental cage effects, u is a vector of N random additive polygenic effects for all animals derived from pedigree information (N = 2296), and e is a vector of $N_r$ residuals. It is assumed that f, c and u are normally distributed with a mean of 0 and a variance of $\sigma_f^2$, $\sigma_c^2$ and $\sigma_u^2$, respectively. X, W, U and Z are incidence matrices for the effects. The variance covariance matrix (V) of phenotypic observations for the model can be written as,

$$\mathbf{V} = \mathbf{W}\mathbf{W}'\sigma_f^2 + \mathbf{U}\mathbf{U}'\sigma_c^2 + \mathbf{Z}\mathbf{A}\mathbf{Z}'\sigma_u^2 + \mathbf{I}\sigma_e^2$$
where A is the numerator relationship matrix based on pedigree information only, and I is an identity matrix. In order to see whether the estimates for genetic and environmental family effects are dependent, a simple comparison was carried out for model 1 by omitting in turn the term u (model 1-u) or f (model 1-f). Variance components and effects were estimated by a residual maximum likelihood (REML) method [24, 25]. The ratio of each variance component over the total phenotypic variance was calculated.
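For readers unfamiliar with the numerator relationship matrix A, the following sketch builds it with the standard tabular (recursive) method. It is an illustration under stated assumptions, not the software used in the study: the function name and the pedigree encoding (a list of parent index pairs with parents listed before offspring and `None` for unknown parents) are hypothetical.

```python
def numerator_relationship_matrix(pedigree):
    """Tabular method for A.

    pedigree[i] = (sire_index, dam_index); parents must appear before their
    offspring, and unknown parents are given as None.
    """
    n = len(pedigree)
    A = [[0.0] * n for _ in range(n)]
    for i, (sire, dam) in enumerate(pedigree):
        for j in range(i):
            # Relationship with an earlier animal is the average of the
            # relationships between that animal and the two parents of i.
            a_js = A[j][sire] if sire is not None else 0.0
            a_jd = A[j][dam] if dam is not None else 0.0
            A[i][j] = A[j][i] = 0.5 * (a_js + a_jd)
        # Diagonal is one plus the inbreeding coefficient of animal i.
        inbreeding = 0.5 * A[sire][dam] if sire is not None and dam is not None else 0.0
        A[i][i] = 1.0 + inbreeding
    return A
```

With two unrelated founders and two of their offspring, the full-sib entry of A is 1/2, which is the relationship class k = 1/2 used in the preliminary analysis.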
Mixed linear model implementing a realized relationship matrix based on genome wide SNP information
When SNP information is available, the realized relationship matrix (G) can be estimated and implemented in the model [16–18]. To estimate G, we used the method introduced by Oliehoek et al. (2006) because it was robust and performed best among the methods tested in their study. The details of the estimation of G are given in Appendix A. The model can be written as,
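$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{W}\mathbf{f} + \mathbf{U}\mathbf{c} + \mathbf{Z}\mathbf{g} + \mathbf{e} \qquad (2)$$

where g is a vector of N random genome-wide effects for all animals. It was assumed that g is normally distributed with mean 0 and variance $\sigma_g^2$. The variance covariance matrix of phenotypic observations for this model is,

$$\mathbf{V} = \mathbf{W}\mathbf{W}'\sigma_f^2 + \mathbf{U}\mathbf{U}'\sigma_c^2 + \mathbf{Z}\mathbf{G}\mathbf{Z}'\sigma_g^2 + \mathbf{I}\sigma_e^2$$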
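To give a concrete idea of what a marker-based relationship matrix looks like, the sketch below computes one from 0/1/2 genotype codes. Note that this is not the Oliehoek et al. estimator described in Appendix A: it swaps in a widely used allele-frequency-centred cross-product estimator purely for illustration, and the function name and input layout are assumptions.

```python
def realized_relationship_matrix(genotypes):
    """Illustrative marker-based relationship matrix (NOT the Appendix A estimator).

    genotypes[animal][snp] holds allele counts 0, 1 or 2.
    G[j][k] = sum_i (x_ij - 2 p_i)(x_ik - 2 p_i) / (2 sum_i p_i (1 - p_i)),
    with p_i the sample allele frequency at SNP i.
    """
    n = len(genotypes)
    m = len(genotypes[0])
    freqs = [sum(row[i] for row in genotypes) / (2.0 * n) for i in range(m)]
    scale = 2.0 * sum(p * (1.0 - p) for p in freqs)
    centred = [[row[i] - 2.0 * freqs[i] for i in range(m)] for row in genotypes]
    return [
        [sum(a * b for a, b in zip(centred[j], centred[k])) / scale for k in range(n)]
        for j in range(n)
    ]
```

Unlike A, a matrix of this kind is dense and differs between members of the same full-sib family, which is the property that model 2 exploits.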
Bayesian approach to model specific SNP effects
Effects of specific quantitative trait loci (QTL) may not be fully captured by model 2, and a Bayesian approach can be used to explicitly search for sets of SNPs that explain additional genetic variance. In the first instance, we model only additive effects of QTL. The model can be written as,
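$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{W}\mathbf{f} + \mathbf{U}\mathbf{c} + \mathbf{Z}\mathbf{g} + \sum_{i=1}^{n_q} \boldsymbol{\Lambda}_i \alpha_i + \mathbf{e} \qquad (3)$$

where $n_q$ is the number of SNP associated with the QTL, $\alpha_i$ is the random additive effect of the ith SNP, which is normally distributed with mean 0 and variance $\sigma_{\alpha_i}^2$, and $\boldsymbol{\Lambda}_i$ is a column vector having coefficients 0, 1 or 2 representing indicator variables of the genotype for each animal at the ith SNP. The variance covariance matrix of phenotypic observations is,

$$\mathbf{V} = \mathbf{W}\mathbf{W}'\sigma_f^2 + \mathbf{U}\mathbf{U}'\sigma_c^2 + \mathbf{Z}\mathbf{G}\mathbf{Z}'\sigma_g^2 + \sum_{i=1}^{n_q} \boldsymbol{\Lambda}_i \boldsymbol{\Lambda}_i' \sigma_{\alpha_i}^2 + \mathbf{I}\sigma_e^2$$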
In addition to additive SNP effects, dominance SNP effects were modeled for SNP with all three genotypes observed and a heterozygosity > 10%. The model can be written as,
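$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{W}\mathbf{f} + \mathbf{U}\mathbf{c} + \mathbf{Z}\mathbf{g} + \sum_{i=1}^{n_q} \boldsymbol{\Lambda}_i \alpha_i + \sum_{i=1}^{n_q} \boldsymbol{\Delta}_i \delta_i + \mathbf{e} \qquad (4)$$

where $\delta_i$ is the random dominance effect of the ith SNP, assuming a normal distribution with mean 0 and variance $\sigma_{\delta_i}^2$, and $\boldsymbol{\Delta}_i$ is a column vector having coefficients equal to 1 for a heterozygous genotype and 0 for a homozygous genotype at the ith SNP. The variance covariance matrix of phenotypic observations is,

$$\mathbf{V} = \mathbf{W}\mathbf{W}'\sigma_f^2 + \mathbf{U}\mathbf{U}'\sigma_c^2 + \mathbf{Z}\mathbf{G}\mathbf{Z}'\sigma_g^2 + \sum_{i=1}^{n_q} \boldsymbol{\Lambda}_i \boldsymbol{\Lambda}_i' \sigma_{\alpha_i}^2 + \sum_{i=1}^{n_q} \boldsymbol{\Delta}_i \boldsymbol{\Delta}_i' \sigma_{\delta_i}^2 + \mathbf{I}\sigma_e^2$$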
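The additive and dominance codings and the eligibility rule for dominance effects can be sketched as below. The function names are hypothetical, and "heterozygosity" is interpreted here as the observed proportion of heterozygous animals, which is an assumption because the text does not define it further.

```python
def additive_codes(genotypes):
    """Additive column for one SNP: allele counts 0, 1 or 2 per animal."""
    return list(genotypes)

def dominance_codes(genotypes):
    """Dominance column for one SNP: 1 for heterozygotes, 0 for homozygotes."""
    return [1 if g == 1 else 0 for g in genotypes]

def eligible_for_dominance(genotypes, min_heterozygosity=0.10):
    """True if all three genotypes occur and observed heterozygosity exceeds the threshold."""
    has_all_three = {0, 1, 2} <= set(genotypes)
    heterozygosity = sum(1 for g in genotypes if g == 1) / len(genotypes)
    return has_all_three and heterozygosity > min_heterozygosity
```

For example, a SNP with genotypes `[0, 1, 2, 1, 0, 0, 0, 0, 0, 0]` has an observed heterozygosity of 0.2 and all three genotype classes, so it would be a candidate for a dominance effect under this reading.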
The polygenic heritability based on G, and the ratio of variance due to family, cage and additive and dominance SNP effects over the total phenotypic variance were estimated using a reversible jump Markov chain Monte Carlo (RJMCMC) and REML.
In the estimation of variance components, solving the mixed model equations (MME) was a heavy computing task because G is very dense. Solving the dense MME and obtaining REML estimates in every MCMC round was therefore almost impossible in models 3 and 4. Because of this obstacle, we used a computationally tractable strategy to estimate variance components. Initially, variance components were estimated using REML from model 2 ($\sigma_f^2$, $\sigma_c^2$, $\sigma_g^2$ and $\sigma_e^2$). In an RJMCMC process (Appendix B), the number of SNP associated with QTL, their positions and effects were sampled, conditional on the estimated variance components $\sigma_f^2$, $\sigma_c^2$, $\sigma_g^2$ and $\sigma_e^2$. The SNP effects were treated as fixed effects so that it was not necessary to update or invert the variance covariance matrix (V) for each set of sampled QTL effects, which made it possible to carry out a large number of RJMCMC rounds. Variance components for family, cage, polygenic and additive and dominance SNP effects were estimated every 1000 rounds using REML, and the estimated variance components were stored to obtain the posterior mean of the estimates. We used a total of 100,000 rounds of MCMC after a burn-in of 10,000 rounds. Although the variance components were updated and stored only 100 times, the estimates reached convergence quickly, probably because of the large number of iterations for the main process.
In order to search efficiently for sets of significant SNP, we first pruned the SNP, excluding closely linked SNP with r² > 0.95 in sliding windows of 50 SNP using PLINK. After pruning, 4194 SNP remained and were used for the Bayesian analysis.
Validation of estimates (predicting unobserved phenotypes)
We predicted phenotypes of individuals (ŷ) with models 1 to 4. In the Bayesian approach (models 3 and 4), averages of ŷ over all RJMCMC rounds were used as predicted phenotypes. In order to quantify how well each model can disentangle genetic effects from environmental effects, we used two strategies to produce estimation and validation sets. First, we randomly selected approximately half of the individuals within each full-sib family, which divided the whole data into two subsets. One set was used as an estimation set, and the other set was used as a validation set. Since some individuals in the estimation and validation sets belonged to the same full-sib family, prediction was carried out within full-sib families. Second, approximately half of the full-sib families were randomly selected within each of the 85 unrelated families. This also divided the whole data into two subsets. In this case, no individual in the estimation and validation sets shared the same full-sib family although they would be related. Therefore, prediction was performed across full-sib families.
In ten replicates, the phenotypes for a validation set (~50% of the population) were predicted from the estimation based on the phenotypes and genotypes for the rest of the population in the estimation set. For each comparison, we correlated the predicted value of an animal in the validation set with its phenotype (which was not used in the estimation phase). We term the correlation between predicted phenotypes and actual phenotypes as the accuracy of prediction.
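The two ways of forming estimation and validation sets, and the accuracy measure, can be sketched as follows. All names and the dictionary-based inputs are assumptions made for illustration, and the sketch omits the model fitting step itself.

```python
import math
import random
from collections import defaultdict

def split_within_fullsib_families(fullsib_family_of, rng):
    """Put about half of the members of every full-sib family in the validation set."""
    by_family = defaultdict(list)
    for animal, family in fullsib_family_of.items():
        by_family[family].append(animal)
    estimation, validation = [], []
    for members in by_family.values():
        members = sorted(members)
        rng.shuffle(members)
        half = len(members) // 2
        validation.extend(members[:half])
        estimation.extend(members[half:])
    return estimation, validation

def split_across_fullsib_families(fullsib_family_of, unrelated_family_of, rng):
    """Put about half of the full-sib families of every unrelated family in the validation set."""
    families_in = defaultdict(set)
    for animal, family in fullsib_family_of.items():
        families_in[unrelated_family_of[animal]].add(family)
    validation_families = set()
    for families in families_in.values():
        families = sorted(families)
        rng.shuffle(families)
        validation_families.update(families[: len(families) // 2])
    estimation = [a for a, f in fullsib_family_of.items() if f not in validation_families]
    validation = [a for a, f in fullsib_family_of.items() if f in validation_families]
    return estimation, validation

def accuracy(predicted, observed):
    """Pearson correlation between predicted and observed phenotypes."""
    n = len(predicted)
    mean_p, mean_o = sum(predicted) / n, sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    return cov / math.sqrt(var_p * var_o)
```

Calling either split function with ten different seeds and averaging `accuracy` over the replicates would correspond to the replicated validation described above.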
Figure 2 shows phenotypic correlations as a function of additive relationship for each trait. For all traits, the correlation among full-sibs (k = 1/2) was much higher than for the other types of relationship. For CC, the correlation increased exponentially. For REP, the correlations for k = 1/16, 1/8 and 1/4 were relatively low and increased little, followed by a much higher correlation for k = 1/2. For FDC, the correlations for k = 1/16, 1/8 and 1/4 were close to zero, again with a much higher value for k = 1/2. For WT, the pattern was similar; the correlations for 1/16, 1/8 and 1/4 were low and not much different from each other, but increased dramatically for k = 1/2. The relatively high correlations for k = 1/2 were probably due to the fact that members within this group (i.e. full-sibs) had common dominance and environmental family effects in addition to common additive genetic effects.
Estimating variance components
Estimated variance components, expressed as proportions of the total phenotypic variance, and model log-likelihoods are compared in Tables 2, 3, 4 and 5. The results for the trait FDC are shown in Table 2. The model without family effects gave a log-likelihood of 1619.24, which was significantly lower than that from the full model 1. A model without polygenic effects gave the same log-likelihood as the full model (1621.3), indicating that no genetic effects were captured by the pedigree information. Indeed, the genetic variance was estimated as zero in the full model 1. This was not the case in model 2, which implemented the realized relationship matrix based on aggregate SNP information. In model 2, the variance due to additive genetic effects increased to 25%, and the variance due to family effects decreased to 7% of the total phenotypic variance. The model log-likelihood increased to 1633.91, which was much higher than that from model 1. This showed that the realized relationship matrix based on SNP information could disentangle the genetic effects which were confounded with environmental family effects in the pedigree-based analysis. When using model 3 to search for specific additive SNP effects, the additive genetic variance increased slightly to 30% of total phenotypic variance, i.e. 18% due to polygenic effects and 12% due to specific SNP. The variances for family and cage effects did not change much compared to model 2. The averaged log-likelihood was 1650.56, and the averaged number of QTL fitted in the models was 3.55 in the RJMCMC process. When using model 4 to search for specific additive and dominance SNP effects, a relatively large variance due to dominance effects was estimated (27% of total phenotypic variance). Model 4 showed the highest value for the average log-likelihood, and the average number of additive and dominance QTL fitted was 10.2. The averaged Akaike information criterion (AIC) for model 4 was dramatically lower than that for model 3, implying that model 4 fitted the data better than model 3.
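As a worked illustration of the nested-model comparison for FDC, the sketch below computes a likelihood ratio statistic from the two log-likelihoods quoted above for model 1 (1621.3) and the model without family effects (1619.24). The text does not state which test was applied, so the reference distribution used here, a 50:50 mixture of a point mass at zero and a chi-square with one degree of freedom (a common choice when a single variance component is tested on the boundary), is an assumption, as is the function name.

```python
import math

def boundary_lrt(loglik_full, loglik_reduced):
    """Likelihood ratio statistic and p-value for dropping one variance component.

    Assumes a 0.5*chi2(0) + 0.5*chi2(1) reference distribution (an assumption,
    not taken from the paper).
    """
    statistic = 2.0 * (loglik_full - loglik_reduced)
    if statistic <= 0.0:
        return statistic, 1.0
    # For one degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_chi2_1df = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, 0.5 * p_chi2_1df

statistic, p_value = boundary_lrt(1621.3, 1619.24)
```

The statistic is 2 × (1621.3 − 1619.24) = 4.12, which under this assumed reference distribution corresponds to a p-value below 0.05, in line with the statement that the model without family effects fitted significantly worse.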
The results for the trait REP are shown in Table 3. A model without either polygenic effects or environmental family effects gave a lower log-likelihood than the full model 1. This indicated that both polygenic and family effects should be fitted in the model. In the full model 1, the variance of family, cage and polygenic effects as percentage of total phenotypic variance was 10%, 11% and 25%, respectively. When using model 2, the additive genetic variance increased to 50% of total phenotypic variance, while family and cage variance was reduced to 6% and 8% of total phenotypic variance, respectively. The log-likelihood with model 2 was substantially higher than that with model 1 (1670.71). This indicated that the model implementing the realized relationship matrix based on aggregate SNP information explained variation in phenotypes better than the model implementing the numerator relationship matrix based on pedigree information (this is also empirically proven in the next section). When using model 3, the estimated variance due to additive genetic effects increased slightly to 54% of total phenotypic variance. Variances for family and cage effects did not change much compared to those of model 2. The average log-likelihood was 1717.3, and the average number of QTL was 5.3 in the RJMCMC process. When using model 4, the estimated dominance variance was 15% of total phenotypic variance. The average log-likelihood was 1730.33 and the average number of additive and dominance QTL was 14.72. The average AIC for model 4 was not much improved, compared to that for model 2 (Table 3).
Table 4 shows the results for the trait WT. On the one hand, the model without polygenic effects gave a log-likelihood of 3382.73 which was significantly lower than that from the full model 1 (3389). On the other hand, the family effects were shown to be negligible in phenotypic variation, i.e. a reduced model excluding family effects gave the same likelihood as the full model. In the full model 1, the family, cage and polygenic variances were estimated as 0%, 17% and 64% of total phenotypic variance, respectively. However, model 2 gave very different estimates, i.e. 14%, 16% and 38% for family, cage and polygenic variances, respectively. The log-likelihood for model 2 was much higher than that for model 1. When using model 3, the family and cage variances decreased slightly to 12% and 14% while the additive genetic variance increased to 48%, e.g. 27% due to polygenic and 21% due to specific SNPs. The values for the average log-likelihood and AIC were improved although they were not substantially higher than those for model 2. In model 4, the family and cage variances decreased to 5% and 6%. The additive genetic variance was 44% which was not very different to that of model 3, and the dominance variance was estimated as 35%. The average log-likelihood and AIC were moderately improved.
The results for the trait CC are shown in Table 5. A model without polygenic effects based on pedigree information gave a significantly lower log-likelihood compared to the full model 1 but omitting family effects gave only a small change. When using model 2, there were only slight changes in the variance components, e.g. the family variance increased to 7% and the polygenic variance decreased slightly to 71% of total phenotypic variance. However, the model log-likelihood was considerably higher than that from the model 1. When using model 3, the estimated variances were similar to those of model 2 although most of the additive genetic variance was captured by specific SNP. In model 4, nearly all the variance was captured by additive and dominant QTL effects and the averaged log-likelihood as well as AIC were far better than in any of the other models.
Correlation between estimated variance components
Table 6 shows sampling correlations between estimated variance components as derived from the average information matrix, i.e. the variance covariance matrix of estimated variance components. Correlations between f and u were very high and negative for REP, WT and CC, ranging from -0.85 to -0.94. Correlations between c and u were moderate and negative for FDC (-0.41). This showed that the additive genetic effects derived from pedigree information were highly confounded with the environmental family or cage effects. However, correlations between f and g were low for all the traits (-0.1 to -0.23), and those between c and g were negligible, indicating that realized relationships based on aggregate SNP information could disentangle genetic effects from environmental effects. For all the traits, the sampling correlations between estimated variances due to genetic and non-genetic effects were close to zero when using models 3 and 4.
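The step from a variance covariance matrix of estimated variance components to the sampling correlations reported in Table 6 is a simple rescaling. The sketch below shows it for an arbitrary covariance matrix; the function name and the numbers in the usage example are made up for illustration and are not values from the study.

```python
import math

def sampling_correlations(covariance):
    """Convert a covariance matrix of estimates into a correlation matrix."""
    n = len(covariance)
    return [
        [covariance[i][j] / math.sqrt(covariance[i][i] * covariance[j][j]) for j in range(n)]
        for i in range(n)
    ]

# Hypothetical 2 x 2 covariance matrix for two estimated variance components.
example = sampling_correlations([[4.0, -3.0], [-3.0, 9.0]])
```

A strongly negative off-diagonal value means that an increase in one estimated component can be compensated by a decrease in the other with little change in the likelihood, which is how confounding shows up in these correlations.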
Validating estimates and prediction of unobserved phenotypes
Accuracies of the prediction of unobserved phenotypes for the various models are shown in Table 7. Prediction was carried out for individuals within full-sib families or across full-sib families. In general, the accuracy was much lower for model 1 than for model 2. For all the traits, the accuracies for model 3 were slightly higher than those for model 2 although the differences in accuracy between models 2 and 3 were not significant. For FDC and CC the accuracies for model 4 were far better than those for model 3 where there was a considerable difference in AIC between models 3 and 4. However, for REP and WT there was no significant difference between the accuracies for models 3 and 4 and AIC values for the models were also not substantially different to each other. Accuracies were highest for CC, which has the largest heritability, and smallest for FDC which has also the lowest heritability.
The accuracies for predicting individuals within full-sib families were higher than those for predicting across full-sib families, which was expected since family information could not be used across the full-sib families. Interestingly, the difference between the accuracies for models 1 and 2 was larger when predicting phenotypes across full-sib families, compared to that when predicting phenotypes within full-sib families. The reduction in accuracy due to lack of family information was larger when using model 1 than when using model 2. This showed that the performance of model 2 was apparently less dependent on environmental family effects.
Deviation from unity of the regression coefficient of true phenotypes on predicted phenotypes is an indication of bias in the estimation compared to the true value. The averaged values of regression coefficients were close to 1 when predicting phenotypes within full-sib families. However, when predicting phenotypes across full-sib families, the values were clearly biased probably because of lack of family information across the full-sib families. In general, models 3 and 4 would give more biased estimates, compared to models 1 or 2 although the difference was small.
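The bias measure described above is the ordinary least squares slope of observed on predicted phenotypes. A minimal sketch, with a hypothetical function name and made-up numbers in the usage example, is:

```python
def regression_slope(observed, predicted):
    """Slope of the regression of observed phenotypes on predicted phenotypes."""
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n
    cov = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(observed, predicted))
    var_pred = sum((p - mean_pred) ** 2 for p in predicted)
    return cov / var_pred

# Made-up values: predictions that are too compressed give a slope above 1.
slope = regression_slope([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
```

A slope of 1 means that a one-unit difference in predicted phenotype corresponds on average to a one-unit difference in observed phenotype; slopes below 1 indicate over-dispersed predictions and slopes above 1 indicate under-dispersed predictions.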
We have shown that a mixed linear model implementing a realized relationship matrix based on aggregate SNP information can efficiently disentangle genetic effects from environmental family and cage effects when the number of causal genes is large and their effects are additive, e.g. REP and WT in this study. When dealing with a trait having a limited number of causal genes with possibly dominance effects, e.g. FDC and CC in this study, a model with a finite number of individual loci can be used to help to disentangle efficiently genetic effects from non-genetic effects. Moreover, the latter model can separate additive and non-additive genetic effects and capture more of the total genetic variance. Therefore, the estimated variance components and resulting solutions from the models based on SNP information are more reliable and accurate, compared to those based on pedigree information only, and they allow a better dissection of the various genetic and non-genetic components of variation.
For REP and WT there was no improvement in accuracies for models 3 or 4, compared to those for model 2, which may be due to the fact that the true model for these traits is probably close to an infinitesimal model like model 2, i.e. a large number of causal genes, each with a small effect. Another possible reason might be that we used a slightly unrealistic prior for the number of QTL in the RJMCMC process. We used a Poisson distribution with a mean of 1 as the prior distribution for the number of QTL (Appendix B). It has been reported previously that the method is robust to different priors for the number of QTL [15, 27, 28]. Higher prior means gave more QTL sampled into the model, but the effect on prediction accuracy was small.
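To show how strongly a Poisson prior with a mean of 1 favours small numbers of QTL, the sketch below evaluates the prior probabilities directly. The function names are hypothetical, and the calculation concerns the prior only, not the posterior obtained in the RJMCMC process.

```python
import math

def poisson_pmf(n, mean=1.0):
    """Prior probability of exactly n QTL under a Poisson distribution."""
    return math.exp(-mean) * mean ** n / math.factorial(n)

def prior_prob_at_least(n, mean=1.0):
    """Prior probability of n or more QTL."""
    return 1.0 - sum(poisson_pmf(k, mean) for k in range(n))

prob_four_or_more = prior_prob_at_least(4)
```

With a mean of 1, the prior probability of four or more QTL is about 0.019, so average numbers of fitted QTL well above this range reflect information in the data rather than the prior.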
Since we analysed a single data set we cannot be sure about all the causal factors and how they are (partially) confounded. However, we have shown that the model likelihood increased (Tables 2 to 5), the sampling correlation between estimated effects for the factors decreased (Table 6), and the accuracy of predicting genetic effects in validation sets increased (Table 7) when using the models based on whole-genome SNP data. These observations strongly suggest that confounding effects between genetic and non-genetic effects are better disentangled when using whole-genome SNP data, compared to traditional approaches based on pedigree information only.
In our study, we have estimated a variance covariance matrix of the variance components using average information from Fisher's scoring and the Hessian matrix. A full Bayesian approach [29–31] may be able to assess the confounding between family, cage and polygenic effects by estimating the posterior correlations between variance components, e.g. BUGS. Our approach differs from a full Bayesian method as we used a (residual) maximum likelihood within the MCMC process to take advantage of a quick convergence and to decrease reducibility problems. Moreover, the realized relationship matrix was simultaneously fitted with specific SNP effects so that larger SNP effects, with or without dominance effects, could be captured and estimated adjusted for polygenic effects. In real practical situations where genetic and environmental effects are often confounded, the proposed approach may be worthwhile to implement and help dissect genetic variation of complex traits.
The better performance of the realized relationship matrix based on SNP information, compared to the numerator relationship matrix based on pedigree, is probably due to the fact that SNP-based analysis can better predict some of the variation within a family. In Figure 3, a validation set for REP was used as an example to show variation in estimated genetic values within families. As shown in Figure 3, individual genetic values estimated from model 1 based on pedigree information are the same for all the members of the same family whereas those from model 2 based on SNP vary within families (Figure 3B). Part of the variation within families could be captured by SNP information, resulting in consistent improvement on the estimation of phenotypes (Table 7). Similar results were observed for other traits.
Because most elements of the realized relationship matrix based on SNP data are non-zero, sparse matrix techniques [25, 33] could be used neither to invert the G matrix nor to solve the mixed model equations. This resulted in a much longer computing time to estimate variance components based on the realized relationship matrix. Therefore, we had to use the computationally tractable approach that was modified from the original approach. However, the estimated variance components for family, cage and polygenic effects were mostly consistent across the MCMC process. Therefore, we did not expect very different results when using the modified version.
In model 3, covariance between SNP was negligible probably because the model had a better fit when less dependent SNP were selected. However, this was not the case with model 4 because additive and dominance effects for a SNP were always fitted together whether they were correlated or not. This would cause a negative covariance between SNP effects, and overestimation of total phenotypic variance. When covariance between SNP is explicitly modelled, better estimates can be obtained although there is a risk of overparameterization in model 4.
In conclusion, the proposed method implementing a realized relationship matrix based on aggregate SNP information is useful to genetically dissect complex traits especially when there are confounding factors between genetic and non-genetic effects. Resulting variance components are less biased and more accurate. A further analysis could be carried out using the proposed Bayesian approach to disentangle additive genetic and dominance effects. This novel strategy may help to understand the architecture of various complex traits.
Meuwissen THE, Hayes BJ, Goddard ME: Prediction of total genetic value using genome-wide dense marker maps. Genetics. 2001, 157: 1819-1829.
Wray NR, Goddard ME, Visscher PM: Prediction of individual genetic risk to disease from genome-wide association studies. Genome Res. 2007, 17: 1520-1528. 10.1101/gr.6665407.
Risch N, Merikangas K: The future of genetic studies of complex human diseases. Science. 1996, 273: 1516-1517. 10.1126/science.273.5281.1516.
Gillian T: Genotype versus phenotype: Human pigmentation. Forensic Science International: Genetics. 2007, 1: 105-110. 10.1016/j.fsigen.2007.01.005.
Lander ES, Schork NJ: Genetic dissection of complex traits. Science. 1994, 265: 2037-2048. 10.1126/science.8091226.
Andersson L, Georges M: Domestic-animals genomics: Deciphering the genetics of complex traits. Nat Rev Genet. 2004, 5: 202-212. 10.1038/nrg1294.
Henderson CR: Applications of linear models in animal breeding. 1984, University of Guelph, Guelph.
Henderson CR: Best linear unbiased estimation and prediction under a selection model. Biometrics. 1975, 31: 423-447. 10.2307/2529430.
Patterson HD, Thompson R: Recovery of interblock information when block sizes are unequal. Biometrika. 1971, 58: 545-554. 10.1093/biomet/58.3.545.
Lange K, Westlake J, Spence MA: Extensions to pedigree analysis. III. Variance components by the scoring method. Ann Hum Genet. 1976, 39: 485-491. 10.1111/j.1469-1809.1976.tb00156.x.
Sellers TA, Weaver TW, Phillips BP, Altmann M, Rich SS: Environmental factors can confound identification of a major gene effect: Results from a segregation analysis of a simulated population of lung cancer families. Genet Epidemiol. 1998, 15: 251-262. 10.1002/(SICI)1098-2272(1998)15:3<251::AID-GEPI4>3.0.CO;2-7.
Goddard ME, Hayes BJ: Mapping genes for complex traits in domestic animals and their use in breeding programmes. Nat Rev Genet. 2009, 10: 381-391. 10.1038/nrg2575.
McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, Ioannidis JPA, Hirschhorn JN: Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet. 2008, 9: 356-369. 10.1038/nrg2344.
Visscher PM: Sizing up human height variation. Nat Genet. 2008, 40: 489-490. 10.1038/ng0508-489.
Lee SH, van der Werf JHJ, Hayes BJ, Goddard ME, Visscher PM: Predicting unobserved phenotypes for complex traits from whole-genome SNP data. PLoS Genet. 2008, 4: e1000231. 10.1371/journal.pgen.1000231.
Visscher PM, Medland SE, Ferreira MAR, Morley KI, Zhu G, Cornes BK, Montgomery GW, Martin NG: Assumption-free estimation of heritability from genome-wide identity-by-descent sharing between full siblings. PLoS Genet. 2006, 2: e41. 10.1371/journal.pgen.0020041.
Lynch M, Ritland K: Estimation of pairwise relatedness with molecular markers. Genetics. 1999, 152: 1753-1766.
Oliehoek PA, Windig JJ, van Arendonk JAM, Bijma P: Estimating relatedness between individuals in general populations with a focus on their use in conservation programs. Genetics. 2006, 173: 483-496. 10.1534/genetics.105.049940.
Visscher PM, Macgregor S, Benyamin B, Zhu G, Gordon S, Medland S, Hill WG, Hottenga J-J, Willemsen G, Boomsma DI, Liu Y-Z, Deng H-W, Montgomery GW, Martin NG: Genome partitioning of genetic variation for height from 11,214 sibling pairs. Am J Hum Genet. 2007, 81: 1104-1110. 10.1086/522934.
Hayes BJ, Visscher PM, Goddard ME: Increased accuracy of artificial selection by using the realized relationship matrix. Genet Res. 2009, 91: 47-60. 10.1017/S0016672308009981.
Valdar W, Solberg LC, Gauguier D, Burnett S, Klenerman P, Cookson WO, Taylor MS, Rawlins JNP, Mott R, Flint J: Genome-wide genetic association of complex traits in heterogeneous stock mice. Nat Genet. 2006, 38: 879-887. 10.1038/ng1840.
Valdar W, Solberg LC, Gauguier D, Cookson WO, Rawlins JNP, Mott R, Flint J: Genetic and environmental effects on complex traits in mice. Genetics. 2006, 174: 959-984. 10.1534/genetics.106.060004.
Falconer DS, Mackay TFC: Introduction to quantitative genetics. 1996, Longman, 4th edition
Gilmour AR, Cullis BR, Welham SJ, Thompson R: ASREML reference manual. 2004, New South Wales, Australia: Orange Agriculture Institute
Gilmour AR, Thompson R, Cullis BR: Average information REML: an efficient algorithm for variance parameter estimation in linear mixed models. Biometrics. 1995, 51: 1440-1450. 10.2307/2533274.
Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ, Sham PC: PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007, 81: 559-575. 10.1086/519795.
Sillanpää MJ, Gasbarra D, Arjas E: Comment on "On the Metropolis-Hastings acceptance probability to add or drop a quantitative trait locus in Markov chain Monte Carlo-based Bayesian analyses". Genetics. 2004, 167: 1037. 10.1534/genetics.103.025320.
Jannink J-L, Fernando RL: On the Metropolis-Hastings acceptance probability to add or drop a quantitative trait locus in Markov chain Monte Carlo-based Bayesian analyses. Genetics. 2004, 166: 641-643. 10.1534/genetics.166.1.641.
Meuwissen TH, Hayes BJ, Goddard ME: Prediction of total genetic value using genome-wide dense marker maps. Genetics. 2001, 157: 1819-1829.
O'Hara RB, Sillanpää MJ: A review of Bayesian variable selection methods: What, how and which. Bayesian Analysis. 2009, 4: 85-118. 10.1214/09-BA403.
Meuwissen T, Solberg T, Shepherd R, Woolliams J: A fast algorithm for BayesB type of prediction of genome-wide estimates of genetic value. Genet Sel Evol. 2009, 41: 2. 10.1186/1297-9686-41-2.
Spiegelhalter DJ: BUGS 0.5 Bayesian inference using Gibbs sampling manual (version II). 1996, Cambridge: MRC, Biostatistics Unit
Duff IS, Erisman AM, Reid JK: Direct methods for sparse matrices. 1989, Oxford: Clarendon Press
Casella G: Empirical Bayes Gibbs sampling. Biostatistics. 2001, 2: 485-500. 10.1093/biostatistics/2.4.485.
Lee SH, Van der Werf JHJ: Simultaneous fine mapping of multiple closely linked quantitative trait loci using combined linkage disequilibrium and linkage with a general pedigree. Genetics. 2006, 173: 2329-2337. 10.1534/genetics.106.057653.
Green P: Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika. 1995, 82: 711-732. 10.1093/biomet/82.4.711.
Sorensen D, Gianola D: Likelihood, Bayesian, and MCMC methods in quantitative genetics. 2002, New York: Springer
Yi N, Xu S: Bayesian mapping of quantitative trait loci under the identity-by-descent-based variance component model. Genetics. 2000, 156: 411-422.
Sillanpää MJ, Arjas E: Bayesian mapping of multiple quantitative trait loci from incomplete inbred line cross data. Genetics. 1998, 148: 1373-1388.
We are grateful for the use of the mouse data that were made publicly available by Jonathan Flint and Richard Mott from the Wellcome Trust Centre for Human Genetics. We thank Brian Kinghorn for the use of Pedigree Viewer. We are grateful for useful comments from the reviewers.
The authors declare that they have no competing interests.
All authors conceived the idea and contributed to the study design and method development. SHL undertook the analysis. All authors drafted the manuscript and approved the final manuscript.
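Following Oliehoek et al., the probability of the genotypes of individuals i and j being identical by state at locus l, written here as S_ij,l, is,
$$S_{ij,l} = f_{ij} + (1 - f_{ij})\, H_l \qquad \text{(A1)}$$
where f_ij is the relatedness coefficient between i and j, and H_l is the homozygosity at locus l in the base population.
Given S_ij,l, f_ij can be derived from (A1). Therefore,
$$f_{ij} = \frac{S_{ij,l} - H_l}{1 - H_l}$$
When multiple marker loci are used,
$$f_{ij} = \frac{1}{L} \sum_{l=1}^{L} \frac{S_{ij,l} - H_l}{1 - H_l}$$
Because the estimates depend on the allele frequency at each locus, a different weight should be given to each locus. Following Oliehoek et al., the weighted relatedness estimates are,
$$f_{ij} = \frac{1}{W} \sum_{l=1}^{L} w_l\, \frac{S_{ij,l} - H_l}{1 - H_l}$$
where w_l is the weight given to locus l and W is the sum of the weights over all L loci.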
The term H_l is not usually known, so it has to be estimated. Oliehoek et al. introduced an equal drift estimator for obtaining H_l, assuming that the increase in coancestry since the base population is equal at all loci.
where hz_l is the expected homozygosity at locus l, and hz_min is the expected homozygosity at the locus having the lowest homozygosity. Expected homozygosity is calculated from estimated allele frequencies.
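The relatedness estimator described above can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: genotypes are coded as 0, 1 or 2 copies of one allele, the identity-by-state similarity is taken as the probability that one allele drawn at random from each individual is the same, the base-population homozygosity H_l is supplied by the caller, and the weights default to 1 because the weighting formula is not given here. The function names are hypothetical.

```python
def ibs_similarity(g_i, g_j):
    """Probability that an allele drawn at random from each of two
    individuals is identical by state (assumed definition of S_ij,l).
    g_i, g_j: allele counts (0, 1 or 2) at a single biallelic locus."""
    p_i, p_j = g_i / 2.0, g_j / 2.0
    return p_i * p_j + (1.0 - p_i) * (1.0 - p_j)


def relatedness(geno_i, geno_j, base_homozygosity, weights=None):
    """Weighted average over loci of (S_ij,l - H_l) / (1 - H_l).

    geno_i, geno_j: genotype codes per locus for individuals i and j.
    base_homozygosity: H_l per locus (assumed known or estimated elsewhere).
    weights: w_l per locus; equal weights are used when omitted.
    """
    if weights is None:
        weights = [1.0] * len(base_homozygosity)
    total = 0.0
    for g_i, g_j, h, w in zip(geno_i, geno_j, base_homozygosity, weights):
        total += w * (ibs_similarity(g_i, g_j) - h) / (1.0 - h)
    return total / sum(weights)


# Two individuals, two loci, H_l = 0.5 at both loci (illustrative values).
same = relatedness([0, 2], [0, 2], [0.5, 0.5])
opposite = relatedness([0, 2], [2, 0], [0.5, 0.5])
```

With equal weights this reduces to the unweighted multi-locus average. Estimates can be negative when two individuals share fewer alleles than expected in the base population.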
Empirical Bayesian approach
For models 3 and 4, the number of significant SNPs (n_q) and an indicator vector of their positions (ρ) are unknown parameters to be estimated. Reversible jump Markov chain Monte Carlo (RJMCMC) is used to obtain the posterior distribution of n_q and ρ. During the MCMC procedure, maximum likelihood estimates are obtained in every round for β, f, c, g, α, δ and the variance components (collectively denoted Θ), given the sampled n_q and ρ. This is an empirical Bayesian approach [34, 35]. The posterior probability of the parameters can be written as,
$$pr(n_q, \rho, \Theta \mid y) = \frac{pr(y \mid n_q, \rho, \Theta)\; pr(n_q, \rho, \Theta)}{\sum pr(y \mid n_q, \rho, \Theta)\; pr(n_q, \rho, \Theta)}$$
where pr(y|n_q, ρ, Θ) is the likelihood of the observed phenotypes given the sampled variables, pr(n_q, ρ, Θ) is the joint prior probability of the variables, and the denominator is summed over the probabilities of all possible parameter states. When the number of parameter states is large, an MCMC method can be an efficient tool to obtain the posterior distribution of the parameters, as used in this study. When the number of SNP in the model varies, the model dimension varies. A Metropolis-Hastings sampler cannot properly infer the correct distribution unless the model dimension is fixed. However, RJMCMC can deal with all possible states across different model dimensions according to the proper acceptance ratio, and give the correct posterior distribution.
In this study, SNP-in and SNP-out steps are used to move the Markov chain across different model dimensions. With a proposal probability, a SNP can be newly added to the model (in) or excluded from the model (out). When adding a SNP (n_q + 1), its position is uniformly sampled from all unoccupied putative positions across the region, and it is included in the model. When deleting a SNP, one of the SNP currently in the model is uniformly sampled and excluded from the model. The parameters for the new model are estimated using ML. The proposal of adding or deleting a SNP is accepted with an acceptance ratio,
The first term on the right-hand side is the posterior density, consisting of the likelihood and the prior, and the second and third terms are the proposal probabilities of adding or deleting a SNP from the model. pr(n_q*|n_q) is the proposal probability of changing the number of SNP in the model from n_q to n_q*, and J is the Jacobian of the transformation from the current model to the proposed one. Because adding or deleting a SNP in this method is an identity transformation, J is 1 [28, 38, 39]. The prior for the number of SNP, pr(n_q), is a Poisson distribution with mean μ_n = 1. This assumes that no prior information is available on the number of SNP, and it is a conservative way of detecting SNP associated with QTL that avoids false positives. A flat uniform prior is used for ρ.
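The SNP-in and SNP-out moves can be sketched as follows. This is an illustrative reading of the description above, not the authors' code. It assumes that the flat prior on ρ is a constant that cancels in the ratio, that add and delete moves are proposed with probability 0.5 each except at the boundaries, and that `loglik` is a caller-supplied function returning the log-likelihood of a model at its ML parameter estimates. All function names are hypothetical.

```python
import math
import random


def log_poisson(n, mu=1.0):
    """Log of the Poisson prior pr(n_q = n) with mean mu."""
    return n * math.log(mu) - mu - math.lgamma(n + 1)


def p_add(n, n_sites):
    """Assumed probability of proposing an add move from a model with n SNP."""
    if n == 0:
        return 1.0          # nothing to delete
    if n == n_sites:
        return 0.0          # nothing left to add
    return 0.5


def log_ratio_add(n, n_sites, loglik_new, loglik_old, mu=1.0):
    """Unclamped log acceptance ratio for moving from n to n + 1 SNP."""
    log_posterior = (loglik_new + log_poisson(n + 1, mu)) \
        - (loglik_old + log_poisson(n, mu))
    # Reverse move: propose a delete, then pick one of n + 1 occupied sites.
    log_reverse = math.log(1.0 - p_add(n + 1, n_sites)) - math.log(n + 1)
    # Forward move: propose an add, then pick one of the unoccupied sites.
    log_forward = math.log(p_add(n, n_sites)) - math.log(n_sites - n)
    return log_posterior + log_reverse - log_forward  # log J = 0 since J = 1


def rj_step(model, n_sites, loglik, rng, mu=1.0):
    """One SNP-in or SNP-out update; model is a frozenset of site indices."""
    n = len(model)
    if rng.random() < p_add(n, n_sites):
        site = rng.choice([s for s in range(n_sites) if s not in model])
        proposal = model | {site}
        log_a = log_ratio_add(n, n_sites, loglik(proposal), loglik(model), mu)
    else:
        site = rng.choice(sorted(model))
        proposal = model - {site}
        # A delete is the reverse of an add from the smaller model.
        log_a = -log_ratio_add(n - 1, n_sites, loglik(model), loglik(proposal), mu)
    if rng.random() < math.exp(min(0.0, log_a)):
        return proposal
    return model
```

Working on the log scale avoids underflow when likelihood values are very small. The frequency with which each site appears in the sampled models then approximates its posterior inclusion probability under these assumptions.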
To reduce the computational burden due to the dense G, the process is slightly modified. The log-likelihood of the observed phenotypes is conditional on the variance-covariance matrix (V), constructed with the variance components already estimated in model 2. The log-likelihood can be obtained as,
Every 1000 rounds, REML estimates for α and δ are obtained given n_q and ρ, which assumes a normal prior. Note that in this case α and δ are random effects, each with its own variance.
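As an illustration of evaluating a log-likelihood conditional on a fixed V, the sketch below computes the standard multivariate normal log-density of a residual vector through a Cholesky factorization. It is an assumption that the expression used in the study has this general form; the exact terms (for example, whether constants are dropped or how fixed and SNP effects enter the residual) are not recoverable here. The function names are hypothetical and the code is written in plain Python for clarity, not speed.

```python
import math


def cholesky(V):
    """Lower-triangular L with L L' = V for a symmetric positive definite V."""
    n = len(V)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = V[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("V is not positive definite")
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return L


def gaussian_loglik(residual, V):
    """Log-density of residual ~ N(0, V).

    residual: phenotypes minus the fitted mean (assumed computed elsewhere).
    V: variance-covariance matrix held fixed, as a list of lists.
    """
    n = len(residual)
    L = cholesky(V)
    # Solve L z = residual by forward substitution, so z'z = r' V^{-1} r.
    z = []
    for i in range(n):
        z.append((residual[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    quad = sum(v * v for v in z)
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + quad)
```

Because V does not change between rounds in this setting, its factorization could be computed once and reused, which is where the saving in computation would come from.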