Using the realized relationship matrix to disentangle confounding factors for the estimation of genetic variance components of complex traits

Background: In the analysis of complex traits, genetic effects can be confounded with non-genetic effects, especially when using full-sib families. Dominance and epistatic effects are typically confounded with additive genetic and non-genetic effects. This confounding may cause the estimated genetic variance components to be inaccurate and biased.
Methods: In this study, we constructed genetic covariance structures from whole-genome marker data, and thus used realized relationship matrices to estimate variance components in a heterogeneous population of ~2200 mice for which four complex traits were investigated. These mice were genotyped for more than 10,000 single nucleotide polymorphisms (SNP) and the variances due to family, cage and genetic effects were estimated by models based on pedigree information only, aggregate SNP information, and model selection for specific SNP effects.
Results and conclusions: We show that the use of genome-wide SNP information can disentangle confounding factors to estimate genetic variances by separating genetic and non-genetic effects. The variance components estimated using realized relationships were more accurate and less biased than those based on pedigree information only. Models that allow the selection of individual SNP in addition to fitting a relationship matrix are more efficient for traits with a significant dominance variance.


Background
Complex traits are important in evolution, human medicine, forensics and artificial selection programs [1][2][3][4]. Most complex traits show a mode of inheritance that may be caused by many functional genes with additive and dominance effects, and possibly epistatic interactions, and environmental effects [5,6].
Traditionally, pedigree information has been used to estimate heritabilities and genetic effects for complex traits [7][8][9][10]. In many family studies, non-genetic factors such as familial or shared environmental effects can be confounded with genetic factors [11]. In particular for full-sibs there is confounding between shared environmental effects, additive genetic effects and non-additive genetic effects.
Recently, it has become feasible to generate individual genotype information on large numbers of single nucleotide polymorphisms (SNP) across the whole genome, and genome-wide association studies have been performed in a number of species [12,13]. It is expected that SNP and causal genes will be in linkage disequilibrium (LD), making it possible to genetically dissect variation in complex traits in a more effective way [14]. Indeed, it has been shown that whole-genome dense SNP analyses can provide extra benefits compared to classical approaches based on pedigree information only [15].
In this study, we propose novel strategies that utilize dense SNP data for the genetic dissection of complex traits. First, we estimate a realized relationship matrix based on aggregate SNP information [16][17][18]. The realized relationship matrix in a classical mixed linear model makes it possible to obtain more accurate and reliable estimates for the narrow sense heritability, compared to traditional pedigree-based analysis [19,20]. Second, we explicitly search for additional additive and dominance effects that may not have been already captured, by using a Bayesian model selection approach. In the process, a stochastic model selection of random SNP effects is carried out nested in a mixed linear model with additive polygenic effects. Additional genetic effects found in this process make it possible to estimate additive genetic and dominance variances with greater precision for some traits which have significant dominance effects. We examine the estimates by using a validation step where unobserved phenotypes in an independent validation set are predicted. We use phenotypic data for four complex traits and genotypic data for ~2200 mice with ~11,000 SNP across the whole genome.

Data
Publicly available data including pedigree, genotypic and phenotypic information on heterogeneous stock mice were used [21] (http://gscan.well.ox.ac.uk/). The total number of animals was 2,296 from 85 unrelated families. The available pedigree spanned four generations. In this complex pedigree, there were 172 full-sib families with an average size of ~11 (SD ~8). The mice were reared in a total of 536 cages, and the number of animals per cage ranged from two to seven. This number was considered as a cage density factor for analyses. Figure 1 describes the family structure for one of the 85 unrelated families, which contains 44 members and five nuclear (full-sib) families. Cage information is displayed below each animal when known and indicates a fair degree of confounding between cages and families. Genotypes were available for 12,112 SNP on most animals in the pedigree, and we used the 11,730 SNP located on the autosomal chromosomes. The sex chromosomes were excluded because modeling them would complicate the analyses without greatly changing the estimates. The phenotypes were already adjusted for environmental fixed effects, e.g. sex, age, year and season [21,22]. However, the effects due to cage, cage density and family were further modeled with and without using information on SNP and additive polygenic effects. Four complex traits were investigated, i.e. coat color (CC) (a score from light to dark), weight at 10 weeks (WT), recovery from ear punctuation (REP), and freezing time during cue (FDC). These traits were chosen for the following reasons: CC has a number of major genes with relatively large effects and a small environmental variance, WT is a typical quantitative trait with the variance probably affected by numerous genes, REP is a quantitative trait with a moderate heritability, and FDC is a quantitative trait with a low heritability.

Preliminary analysis for each trait
The intra-class correlation of phenotypes for groups having relationship k based on pedigree information was estimated (k = 1/16, 1/8, 1/4 and 1/2). For example, the intra-class correlation for the group with relationship k = 1/2 was that for full-sibs. However, for relationship k = 1/16, 1/8, and 1/4, it was difficult to group and classify individuals because of the complicated pedigree structure. In order to estimate intra-class correlations for the group with relationship k, pairs of relationship k were used, but in such a way that there were no relationships between individuals of different pairs, i.e. relationship = k within each pair and relationship = 0 for individuals of different pairs. Because of this restriction, not all pairs of relationship k could be used simultaneously. Therefore, we sampled 10,000 sets of independent pairs for each relationship k for each trait. The number of pairs for relationship k, and the average number of pairs in the 10,000 samples are given in Table 1. The variance between these sampled pairs scaled by the total variance is the intra-class correlation [23] for individuals having a relationship k. Estimated intra-class correlations were averaged over the 10,000 sampling sets. These correlations are, approximately, the summary statistics that are modeled in the variance component analyses.
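The sampling scheme described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the greedy selection of mutually unrelated pairs, and the one-way ANOVA estimator for groups of size two are assumptions made for this example.

```python
import random


def pair_intraclass_correlation(pairs):
    """One-way ANOVA intra-class correlation for groups of size two.

    pairs: list of (phenotype_a, phenotype_b), one tuple per related pair.
    """
    n = len(pairs)
    grand_mean = sum(a + b for a, b in pairs) / (2 * n)
    pair_means = [(a + b) / 2 for a, b in pairs]
    # Between-pair and within-pair mean squares (group size 2).
    ms_between = 2 * sum((m - grand_mean) ** 2 for m in pair_means) / (n - 1)
    ms_within = sum((a - m) ** 2 + (b - m) ** 2
                    for (a, b), m in zip(pairs, pair_means)) / n
    # Between-pair variance component scaled by the total variance.
    var_between = (ms_between - ms_within) / 2
    return var_between / (var_between + ms_within)


def sample_independent_pairs(candidates, kinship, rng):
    """Greedy random draw of pairs so that animals in different pairs
    have a pedigree relationship of zero (hypothetical helper).

    candidates: list of (animal_a, animal_b) index pairs with relationship k.
    kinship: square table with kinship[i][j] = pedigree relationship.
    """
    shuffled = list(candidates)
    rng.shuffle(shuffled)
    chosen, used = [], []
    for a, b in shuffled:
        if all(kinship[a][x] == 0 and kinship[b][x] == 0 for x in used):
            chosen.append((a, b))
            used.extend((a, b))
    return chosen
```

Repeating `sample_independent_pairs` with different random draws and averaging `pair_intraclass_correlation` over the draws would mirror the 10,000 sampling sets described in the text.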

Mixed linear model implementing a numerator relationship matrix based on pedigree information
A mixed linear model analysis was used to estimate random polygenic, cage and family effects, and the fixed effect of cage density. The model (model 1) can be expressed as

$$y = X\beta + Wf + Uc + Zu + e$$

where y is a vector of $N_r$ phenotypic observations, β is a vector of fixed effects including the overall mean and the cage density as covariates, f is a vector of $N_f$ random environmental family effects, c is a vector of $N_c$ random environmental cage effects, u is a vector of N random additive polygenic effects for all animals derived from pedigree information (N = 2296), and e is a vector of $N_r$ residuals. It is assumed that f, c and u are normally distributed with a mean of 0 and variances of $I\sigma_f^2$, $I\sigma_c^2$ and $A\sigma_u^2$, respectively. X, W, U and Z are incidence matrices for the effects. The variance covariance matrix (V) of phenotypic observations for the model can be written as

$$V = WW'\sigma_f^2 + UU'\sigma_c^2 + ZAZ'\sigma_u^2 + I\sigma_e^2$$

where A is the numerator relationship matrix based on pedigree information only, and I is an identity matrix. In order to see if estimates for genetic and environmental family effects are dependent, a simple comparison is carried out for model 1, by omitting in turn the term u (model 1-u) or f (model 1-f). Variance components and effects are estimated by a residual maximum likelihood (REML) method [24,25]. The ratio of each variance component over the total phenotypic variance was calculated.
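As a small illustration of the covariance structure of model 1, the sketch below assembles V from the incidence matrices and variance components, and converts variance components to proportions of the total. It assumes the standard form V = WW'σ²_f + UU'σ²_c + ZAZ'σ²_u + Iσ²_e; the function names are hypothetical and the REML estimation itself is not shown.

```python
import numpy as np


def phenotypic_covariance(W, U, Z, A, var_f, var_c, var_u, var_e):
    """Covariance matrix of the observations under model 1 (assumed form).

    W, U, Z: incidence matrices for family, cage and polygenic effects.
    A: numerator relationship matrix from the pedigree.
    """
    n_records = Z.shape[0]
    return (var_f * W @ W.T
            + var_c * U @ U.T
            + var_u * Z @ A @ Z.T
            + var_e * np.eye(n_records))


def variance_proportions(**components):
    """Ratio of each variance component over the total phenotypic variance."""
    total = sum(components.values())
    return {name: value / total for name, value in components.items()}
```

Replacing `A` by a marker-based matrix gives the corresponding structure for a model that uses realized relationships.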

Mixed linear model implementing a realized relationship matrix based on genome wide SNP information
When SNP information is available, the realized relationship matrix (G) can be estimated and implemented in the model [16][17][18]. To estimate G, we used the method introduced by Oliehoek et al. (2006) since it was robust and performed best among the methods tested in their study. The details of the estimation of G are given in Appendix A. The model (model 2) can be written as

$$y = X\beta + Wf + Uc + Zg + e$$

where g is a vector of N random genome-wide effects for all animals. It was assumed that g is normally distributed with mean 0 and variance $G\sigma_g^2$. The variance covariance matrix of phenotypic observations for this model is

$$V = WW'\sigma_f^2 + UU'\sigma_c^2 + ZGZ'\sigma_g^2 + I\sigma_e^2$$

Variance components and effects were again estimated by REML [24,25].
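To make the idea of a marker-based relationship matrix concrete, here is a minimal sketch. Note that it does not implement the Oliehoek et al. estimator used in this study (whose details are in Appendix A); it uses a different, widely used centered cross-product estimator as a stand-in, and the function name is hypothetical.

```python
import numpy as np


def realized_relationship(genotypes):
    """Marker-based relationship matrix from allele counts (0, 1 or 2).

    Stand-in estimator, NOT the method of Oliehoek et al.:
    G = Zc Zc' / (2 * sum_j p_j (1 - p_j)), with Zc the genotype matrix
    centered by twice the sample allele frequency p_j.
    genotypes: array of shape (n_animals, n_snps) without missing values.
    """
    M = np.asarray(genotypes, dtype=float)
    p = M.mean(axis=0) / 2.0
    keep = (p > 0) & (p < 1)          # monomorphic SNP carry no information
    Zc = M[:, keep] - 2.0 * p[keep]
    scale = 2.0 * np.sum(p[keep] * (1.0 - p[keep]))
    return Zc @ Zc.T / scale
```

Unlike a pedigree-based matrix, the off-diagonal elements of such a matrix differ between pairs of full-sibs, which is what allows within-family variation to be captured.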

Bayesian approach to model specific SNP effects
Effects of specific quantitative trait loci (QTL) may not be fully captured by model 2, and a Bayesian approach can be used to explicitly search for sets of SNP that explain additional genetic variance. In the first instance, we model only additive effects of QTL. The model (model 3) can be written as

$$y = X\beta + Wf + Uc + Zg + \sum_{i=1}^{n_q} \Lambda_i \alpha_i + e$$

where $n_q$ is the number of SNP associated with the QTL, $\alpha_i$ is the random additive effect of the ith SNP, which is normally distributed with mean 0 and variance $\sigma_{\alpha_i}^2$, and $\Lambda_i$ is a column vector having coefficients 0, 1 or 2 representing indicator variables of the genotype for each animal at the ith SNP. The variance covariance matrix of phenotypic observations is

$$V = WW'\sigma_f^2 + UU'\sigma_c^2 + ZGZ'\sigma_g^2 + \sum_{i=1}^{n_q} \Lambda_i \Lambda_i' \sigma_{\alpha_i}^2 + I\sigma_e^2$$

In addition to additive SNP effects, dominance SNP effects are modeled for SNP having three genotypes and a heterozygosity > 10%. The model (model 4) can be written as

$$y = X\beta + Wf + Uc + Zg + \sum_{i=1}^{n_q} \Lambda_i \alpha_i + \sum_{i=1}^{n_q} \Delta_i \delta_i + e$$

where $\delta_i$ is the random dominance effect of the ith SNP, assumed to be normally distributed with mean 0 and variance $\sigma_{\delta_i}^2$, and $\Delta_i$ is a column vector having coefficients equal to 1 for a heterozygous genotype and 0 for a homozygous genotype at the ith SNP. The variance covariance matrix of phenotypic observations is

$$V = WW'\sigma_f^2 + UU'\sigma_c^2 + ZGZ'\sigma_g^2 + \sum_{i=1}^{n_q} \Lambda_i \Lambda_i' \sigma_{\alpha_i}^2 + \sum_{i=1}^{n_q} \Delta_i \Delta_i' \sigma_{\delta_i}^2 + I\sigma_e^2$$

The polygenic heritability based on G, and the ratio of variance due to family, cage and additive and dominance SNP effects over the total phenotypic variance were estimated using a reversible jump Markov chain Monte Carlo (RJMCMC) and REML.
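The genotype coding for additive and dominance SNP effects can be sketched as below. The function name is hypothetical, genotypes are assumed to be complete allele counts (0, 1 or 2), and heterozygosity is taken here as the observed proportion of heterozygous animals, which is one possible reading of the > 10% criterion.

```python
import numpy as np


def additive_and_dominance_codes(genotypes, min_het=0.10):
    """Build additive (0/1/2) and dominance (heterozygote indicator) codes.

    genotypes: array of shape (n_animals, n_snps) with values 0, 1 or 2.
    Returns the additive codes for all SNP, the dominance codes for the
    SNP passing the filter, and the boolean filter itself.
    """
    M = np.asarray(genotypes)
    additive = M.astype(float)               # one column per SNP
    dominance = (M == 1).astype(float)       # 1 = heterozygous, 0 = homozygous
    heterozygosity = dominance.mean(axis=0)  # observed proportion (assumption)
    three_genotypes = np.array(
        [np.unique(M[:, j]).size == 3 for j in range(M.shape[1])]
    )
    fit_dominance = three_genotypes & (heterozygosity > min_het)
    return additive, dominance[:, fit_dominance], fit_dominance
```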
In the estimation of variance components, solving the mixed model equations (MME) was a heavy computing task because G is very dense. Therefore, solving the dense MME and obtaining REML estimates in every MCMC round was almost impossible in models 3 and 4. Instead, variance components were estimated every 1000 rounds using REML, and the estimated variance components were stored to obtain the posterior mean of the estimates. We used a total of 100,000 rounds of MCMC after a burn-in period of 10,000 rounds.
Although the variance components were updated and stored only 100 times, the estimates reached convergence quickly probably because of a large number of iterations for the main process.
In order to efficiently search for sets of significant SNP, we first pruned the SNP by excluding closely linked SNP having r² > 0.95 in sliding windows of 50 SNP using PLINK [26]. After pruning, 4194 SNP remained and were used for the Bayesian analysis.
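The pruning step can be illustrated with a simplified greedy procedure. This is only a sketch of the idea and does not reproduce PLINK's algorithm or options; the function name is hypothetical, and r² is computed here as the squared correlation between allele counts of polymorphic SNP without missing genotypes.

```python
import numpy as np


def prune_ld(genotypes, window=50, r2_max=0.95):
    """Greedy LD pruning: within each window of consecutive SNP, drop the
    later SNP of any pair whose squared genotype correlation exceeds r2_max.

    genotypes: array of shape (n_animals, n_snps), SNP in map order.
    Returns the column indices of the SNP that are kept.
    """
    M = np.asarray(genotypes, dtype=float)
    n_snps = M.shape[1]
    keep = np.ones(n_snps, dtype=bool)
    for i in range(n_snps):
        if not keep[i]:
            continue
        for j in range(i + 1, min(i + window, n_snps)):
            if not keep[j]:
                continue
            r = np.corrcoef(M[:, i], M[:, j])[0, 1]
            if r * r > r2_max:
                keep[j] = False
    return np.flatnonzero(keep)
```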

Validation of estimates (predicting unobserved phenotypes)
We predicted phenotypes of individuals (ŷ) with models 1 to 4. In the Bayesian approach (models 3 and 4), averages of ŷ over all RJMCMC rounds were used as predicted phenotypes. In order to quantify how well each model can disentangle genetic effects from environmental effects, we used two strategies to produce estimation and validation sets. First, we randomly selected approximately half of the individuals within each full-sib family, which divided the whole data into two subsets. One set was used as an estimation set, and the other set was used as a validation set. Since some individuals in the estimation and validation sets belonged to the same full-sib family, prediction was carried out within full-sib families. Second, approximately half of the full-sib families were randomly selected within each of the 85 unrelated families. This also divided the whole data into two subsets. In this case, no individual in the estimation and validation sets shared the same full-sib family although they would be related. Therefore, prediction was performed across full-sib families.
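The two ways of forming estimation and validation sets can be sketched as follows. The function and argument names are hypothetical, and full-sib family labels are assumed to be unique across the unrelated families.

```python
import random
from collections import defaultdict


def split_within_families(full_sib_ids, rng):
    """Validation set = about half of the animals of every full-sib family."""
    members = defaultdict(list)
    for animal, family in enumerate(full_sib_ids):
        members[family].append(animal)
    validation = set()
    for animals in members.values():
        validation.update(rng.sample(animals, len(animals) // 2))
    return validation


def split_across_families(full_sib_ids, unrelated_family_ids, rng):
    """Validation set = all animals of about half of the full-sib families
    within each unrelated family, so no full-sib family is split."""
    sibships = defaultdict(set)
    for family, top_family in zip(full_sib_ids, unrelated_family_ids):
        sibships[top_family].add(family)
    held_out = set()
    for families in sibships.values():
        held_out.update(rng.sample(sorted(families), len(families) // 2))
    return {animal for animal, family in enumerate(full_sib_ids)
            if family in held_out}
```

Both functions return the indices of the validation animals; the remaining animals form the estimation set.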
In ten replicates, the phenotypes for a validation set (~50% of the population) were predicted from the estimation based on the phenotypes and genotypes for the rest of the population in the estimation set. For each comparison, we correlated the predicted value of an animal in the validation set with its phenotype (which was not used in the estimation phase). We term the correlation between predicted phenotypes and actual phenotypes the accuracy of prediction.
Figure 2 shows phenotypic correlations as a function of additive relationship for each trait. For all traits, the correlation among full-sibs (k = 1/2) was much higher than for other types of relationship. For CC, the correlation increased exponentially. For REP, the correlations for k = 1/16, 1/8 and 1/4 were relatively low and there was little increase until a much higher correlation for k = 1/2. For FDC, the correlations for k = 1/16, 1/8 and 1/4 were close to zero with again a much higher value for k = 1/2. For WT, the pattern was similar; the correlations for 1/16, 1/8 and 1/4 were low, and not much different from each other, but increased dramatically with k = 1/2. The relatively high correlations for k = 1/2 were probably due to the fact that members within this group (i.e. full-sibs) had common dominance and environmental family effects in addition to common additive genetic effects.

Estimating variance components
Estimated variance components as proportions of the total phenotypic variance and model log-likelihoods are compared in Tables 2, 3, 4 and 5. The results for the trait FDC are shown in Table 2. The model without family effects gave a log-likelihood value of 1619.24 which was significantly lower than that from the full model 1. A model without polygenic effects gave the same log-likelihood as the full model (1621.3), indicating that no genetic effects are captured by the pedigree information. Indeed, genetic variance was estimated as zero in the full model 1. This was not the case in model 2 which implemented the realized relationship matrix based on aggregate SNP information. In model 2, the variance due to additive genetic effects increased to 25%, and the variance due to family effects decreased to 7% of the total phenotypic variance. The model log-likelihood increased to 1633.91 which was much higher than that from model 1. This showed that the realized relationship matrix based on SNP information could disentangle the genetic effects which were confounded with environmental family effects in the pedigree-based analysis. When using model 3 to search for specific additive SNP effects, the additive genetic variance increased slightly to 30% of total phenotypic variance, i.e. 18% due to polygenic effects and 12% due to specific SNP. The variances for family and cage effects did not change much compared to model 2. The averaged log-likelihood was 1650.56, and the averaged number of QTL fitted in the models was 3.55 in the RJMCMC process. When using model 4 to search for specific additive and dominance SNP effects, a relatively large variance due to dominance effects was estimated (27% of total phenotypic variance). Model 4 showed the highest value for the average log-likelihood, and the average number of additive and dominance QTL fitted was 10.2.
The averaged Akaike information criterion (AIC) for model 4 was dramatically lower than that for model 3, implying that model 4 fitted the data better than model 3.
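As a worked example of comparing nested models by their log-likelihoods, the sketch below applies a likelihood ratio test to the FDC values quoted above (full model 1: 1621.3; model without family effects: 1619.24). The text does not state which test or reference distribution was used; a chi-square distribution with one degree of freedom is assumed here, and because a variance is tested on the boundary of its parameter space this p-value is, if anything, conservative.

```python
import math


def lrt_one_df(loglik_full, loglik_reduced):
    """Likelihood ratio statistic and chi-square (1 df) p-value.

    For one degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    """
    statistic = max(2.0 * (loglik_full - loglik_reduced), 0.0)
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# FDC, full model 1 versus the model without family effects.
stat_f, p_f = lrt_one_df(1621.3, 1619.24)
# FDC, full model 1 versus the model without polygenic effects.
stat_u, p_u = lrt_one_df(1621.3, 1621.3)
```

The first comparison gives a statistic of about 4.12, above the 5% threshold of 3.84, while the second gives a statistic of zero, in line with the statement that the pedigree captured no genetic variance for FDC.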
The results for the trait REP are shown in Table 3. A model without either polygenic effects or environmental family effects gave a lower log-likelihood than the full model 1. This indicated that both polygenic and family effects should be fitted in the model. In the full model 1, the variance of family, cage and polygenic effects as percentage of total phenotypic variance was 10%, 11% and 25%, respectively. When using model 2, the additive genetic variance increased to 50% of total phenotypic variance, while family and cage variance was reduced to 6% and 8% of total phenotypic variance, respectively. The log-likelihood with model 2 was substantially higher than that with model 1 (1670.71). This indicated that the model implementing the realized relationship matrix based on aggregate SNP information explained variation in phenotypes better than the model implementing the numerator relationship matrix based on pedigree information (this is also supported empirically in the next section). When using model 3, the estimated variance due to additive genetic effects increased slightly to 54% of total phenotypic variance. Variances for family and cage effects did not change much compared to those of model 2. The average log-likelihood was 1717.3, and the average number of QTL was 5.3 in the RJMCMC process. When using model 4, the estimated dominance variance was 15% of total phenotypic variance. The average log-likelihood was 1730.33 and the average number of additive and dominance QTL was 14.72. The average AIC for model 4 was not much improved, compared to that for model 2 (Table 3).
Table 4 shows the results for the trait WT. On the one hand, the model without polygenic effects gave a log-likelihood of 3382.73 which was significantly lower than that from the full model 1 (3389). On the other hand, the family effects were shown to be negligible in phenotypic variation, i.e. a reduced model excluding family effects gave the same likelihood as the full model.
In the full model 1, the family, cage and polygenic variances were estimated as 0%, 17% and 64% of total phenotypic variance, respectively. However, model 2 gave very different estimates, i.e. 14%, 16% and 38% for family, cage and polygenic variances, respectively. The log-likelihood for model 2 was much higher than that for model 1. When using model 3, the family and cage variances decreased slightly to 12% and 14% while the additive genetic variance increased to 48%, i.e. 27% due to polygenic effects and 21% due to specific SNP. The values for the average log-likelihood and AIC were improved although they were not substantially better than those for model 2. In model 4, the family and cage variances decreased to 5% and 6%. The additive genetic variance was 44% which was not very different from that of model 3, and the dominance variance was estimated as 35%. The average log-likelihood and AIC were moderately improved.
The results for the trait CC are shown in Table 5. A model without polygenic effects based on pedigree information gave a significantly lower log-likelihood compared to the full model 1 but omitting family effects gave only a small change. When using model 2, there were only slight changes in the variance components, e.g. the family variance increased to 7% and the polygenic variance decreased slightly to 71% of total phenotypic variance. However, the model log-likelihood was considerably higher than that from model 1. When using model 3, the estimated variances were similar to those of model 2 although most of the additive genetic variance was captured by specific SNP. In model 4, nearly all the variance was captured by additive and dominance QTL effects and the averaged log-likelihood as well as AIC were far better than in any of the other models.
Table 6 shows sampling correlations between estimated variance components as derived from the average information matrix, i.e. the variance covariance matrix of estimated variance components.
Correlations between f and u were very high and negative for REP, WT and CC, ranging from -0.85 to -0.94. Correlations between c and u were moderate and negative for FDC (-0.41). This showed that the additive genetic effects derived from pedigree information were highly confounded with the environmental family or cage effects. However, correlations between f and g were low for all the traits (-0.10 to -0.23), and those between c and g were negligible, indicating that realized relationships based on aggregate SNP information could disentangle genetic effects from environmental effects. For all the traits, the sampling correlations between estimated variances due to genetic and non-genetic effects were close to zero when using models 3 and 4.
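Sampling correlations of this kind can be obtained from an information matrix as sketched below. This is a generic illustration with a hypothetical function name; it assumes that the inverse of the average information matrix is used as the sampling covariance matrix of the variance component estimates.

```python
import numpy as np


def sampling_correlations(average_information):
    """Convert an information matrix for the variance components into
    the matrix of sampling correlations between their estimates."""
    covariance = np.linalg.inv(np.asarray(average_information, dtype=float))
    sd = np.sqrt(np.diag(covariance))
    return covariance / np.outer(sd, sd)
```

For example, an information matrix with positive off-diagonal elements, such as [[2, 1], [1, 2]], gives a negative sampling correlation (-0.5): the data cannot easily tell the two components apart, so an overestimate of one tends to come with an underestimate of the other.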

Validating estimates and prediction of unobserved phenotypes
Accuracies of the prediction of unobserved phenotypes for the various models are shown in Table 7. Prediction was carried out for individuals within full-sib families or across full-sib families. In general, the accuracy was much lower for model 1 than for model 2. For all the traits, the accuracies for model 3 were slightly higher than those for model 2 although the differences in accuracy between models 2 and 3 were not significant. For FDC and CC, for which there was a considerable difference in AIC between models 3 and 4, the accuracies for model 4 were far better than those for model 3. However, for REP and WT there was no significant difference between the accuracies for models 3 and 4, and the AIC values for these models were also not substantially different from each other. Accuracies were highest for CC, which has the largest heritability, and smallest for FDC, which has the lowest heritability. The accuracies for predicting individuals within full-sib families were higher than those for predicting across full-sib families, which was expected since family information could not be used across the full-sib families. Interestingly, the difference between the accuracies for models 1 and 2 was larger when predicting phenotypes across full-sib families than when predicting phenotypes within full-sib families. The reduction in accuracy due to lack of family information was larger when using model 1 than when using model 2. This showed that the performance of model 2 was apparently less dependent on environmental family effects.
Deviation from unity of the regression coefficient of true phenotypes on predicted phenotypes is an indication of bias in the estimation compared to the true value. The averaged values of regression coefficients were close to 1 when predicting phenotypes within full-sib families. However, when predicting phenotypes across full-sib families, the values were clearly biased, probably because of the lack of family information across the full-sib families. In general, models 3 and 4 gave slightly more biased estimates than models 1 or 2, although the difference was small.
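The two validation statistics used here, the accuracy (correlation between predicted and actual phenotypes) and the regression coefficient of actual on predicted phenotypes, can be computed as in this minimal sketch; the function name is hypothetical.

```python
import numpy as np


def accuracy_and_slope(observed, predicted):
    """Correlation between observed and predicted phenotypes, and the
    slope of the regression of observed on predicted phenotypes."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    accuracy = np.corrcoef(observed, predicted)[0, 1]
    slope = np.cov(observed, predicted)[0, 1] / np.var(predicted, ddof=1)
    return accuracy, slope
```

A slope below 1 means that the predictions are too dispersed relative to the phenotypes they predict, and a slope above 1 means that they are too compressed.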

Discussion
We have shown that a mixed linear model implementing a realized relationship matrix based on aggregate SNP information can efficiently disentangle genetic effects from environmental family and cage effects when the number of causal genes is large and their effects are additive, e.g. REP and WT in this study. When dealing with a trait having a limited number of causal genes with possibly dominance effects, e.g. FDC and CC in this study, a model with a finite number of individual loci can be used to help to disentangle efficiently genetic effects from non-genetic effects. Moreover, the latter model can separate additive and non-additive genetic effects and capture more of the total genetic variance. Therefore, the estimated variance components and resulting solutions from the models based on SNP information are more reliable and accurate, compared to those based on pedigree information only, and they allow a better dissection of the various genetic and non-genetic components of variation.
Table 3: Proportion of total phenotypic variance due to family (f), cage (c), and polygenic effects based on pedigree (u), realized relationships (g), and specific additive and dominance SNP effects (α and δ) when using model 1, 2, 3 and 4 for REP. c: The averaged log-likelihood during MCMC process (the averaged number of parameters due to additive SNP in the model). d: The averaged log-likelihood during MCMC process (the averaged number of parameters due to additive and dominance SNP in the model).
For REP and WT there was no improvement in accuracies for models 3 or 4, compared to those for model 2, which may be due to the fact that the true model for the traits is probably an infinitesimal model like model 2, i.e. a large number of causal genes, each with a small effect. Another possible reason might be that we used a slightly unrealistic prior for the number of QTL in the RJMCMC process.
We used a Poisson distribution with a mean of 1 as the prior distribution for the number of QTL (Appendix B). It has been reported previously that the method is robust to different priors for the number of QTL [15,27,28]. Higher values gave more QTL sampled into the model, but the effect on prediction accuracy was small [15].
Since we analysed a single data set we cannot be sure about all the causal factors and how they are (partially) confounded. However, we have shown that the model likelihood increased (Tables 2 to 5), the sampling correlation between estimated effects for the factors decreased (Table 6), and the accuracy of predicting genetic effects in validation sets increased (Table 7) when using the models based on whole-genome SNP data. These observations strongly suggest that confounding effects between genetic and non-genetic effects are better disentangled when using whole-genome SNP data, compared to traditional approaches based on pedigree information only.
In our study, we have estimated a variance covariance matrix of the variance components using average information from Fisher's scoring and the Hessian matrix [25]. A full Bayesian approach [29][30][31] may be able to assess the confounding between family, cage and polygenic effects by estimating the posterior correlations between variance components, e.g. using BUGS [32]. Our approach differs from a full Bayesian method as we used a (residual) maximum likelihood within the MCMC process to take advantage of a quick convergence and to decrease reducibility problems. Moreover, the realized relationship matrix was simultaneously fitted with specific SNP effects so that larger SNP effects, with or without dominance effects, could be captured and estimated adjusted for polygenic effects. In real practical situations where genetic and environmental effects are often confounded, the proposed approach may be worthwhile to implement and help dissect genetic variation of complex traits.
Table 4: Proportion of total phenotypic variance due to family (f), cage (c), and polygenic effects based on pedigree (u), realized relationships (g), and specific additive and dominance SNP effects (α and δ) when using model 1, 2, 3 and 4 for WT. c: The averaged log-likelihood during MCMC process (the averaged number of parameters due to additive SNP in the model). d: The average log-likelihood during MCMC process (the averaged number of parameters due to additive and dominance SNP in the model).
Table 5: Proportion of total phenotypic variance due to family (f), cage (c), and polygenic effects based on pedigree (u), realized relationships (g), and specific additive and dominance SNP effects (α and δ) when using model 1, 2, 3 and 4 for CC. c: The averaged log-likelihood during MCMC process (the averaged number of parameters due to additive SNP in the model). d: The average log-likelihood during MCMC process (the averaged number of parameters due to additive and dominance SNP in the model).
The better performance of the realized relationship matrix based on SNP information, compared to the numerator relationship matrix based on pedigree, is probably due to the fact that SNP-based analysis can better predict some of the variation within a family [16]. In Figure 3, a validation set for REP was used as an example to show variation in estimated genetic values within families. As shown in Figure 3, individual genetic values estimated from model 1 based on pedigree information are the same for all the members of the same family whereas those from model 2 based on SNP vary within families (Figure 3B). Part of the variation within families could be captured by SNP information, resulting in consistent improvement in the prediction of phenotypes (Table 7). Similar results were observed for other traits.
Because most elements of the realized relationship matrix based on SNP data are non-zero, sparse matrix techniques [25,33] could be used neither to invert the G matrix nor to solve the mixed model equations. This resulted in a much longer computing time to estimate variance components based on the realized relationship matrix. Therefore, we had to use a computationally tractable approach that was modified from the original approach. However, the estimated variance components for family, cage and polygenic effects were mostly consistent across the MCMC process. Therefore, we did not expect very different results when using the modified version.
In model 3, covariance between SNP was negligible probably because the model had a better fit when less dependent SNP were selected. However, this was not the case with model 4 because additive and dominance effects for a SNP were always fitted together whether they were correlated or not. This would cause a negative covariance between SNP effects, and overestimation of total phenotypic variance. When covariance between SNP is explicitly modelled, better estimates can be obtained although there is a risk of overparameterization in model 4.

Conclusions
In conclusion, the proposed method implementing a realized relationship matrix based on aggregate SNP information is useful to genetically dissect complex traits especially when there are confounding factors between genetic and non-genetic effects. Resulting variance components are less biased and more accurate. A further analysis could be carried out using the proposed Bayesian approach to disentangle additive genetic and dominance effects. This novel strategy may help to understand the architecture of various complex traits.
Table 7: The average of correlations of actual and predicted phenotypes (standard deviations), and regression of the true phenotypes on predicted phenotypes (standard deviations) over 10 replicates when using model 1, 2, 3 and 4 for the traits.