Skip to content

Advertisement

  • Research Article
  • Open Access

Understanding the potential bias of variance components estimators when using genomic models

Genetics Selection Evolution201850:41

https://doi.org/10.1186/s12711-018-0411-0

  • Received: 19 December 2017
  • Accepted: 10 July 2018
  • Published:

Abstract

Background

Genomic models that link phenotypes to dense genotype information are increasingly being used for infering variance parameters in genetics studies. The variance parameters of these models can be inferred using restricted maximum likelihood, which produces consistent, asymptotically normal estimates of variance components under the true model. These properties are not guaranteed to hold when the covariance structure of the data specified by the genomic model differs substantially from the covariance structure specified by the true model, and in this case, the likelihood of the model is said to be misspecified. If the covariance structure specified by the genomic model provides a poor description of that specified by the true model, the likelihood misspecification may lead to incorrect inferences.

Results

This work provides a theoretical analysis of the genomic models based on splitting the misspecified likelihood equations into components, which isolate those that contribute to incorrect inferences, providing an informative measure, defined as \(\varvec{\kappa }\), to compare the covariance structure of the data specified by the genomic and the true models. This comparison of the covariance structures allows us to determine whether or not bias in the variance components estimates is expected to occur.

Conclusions

The theory presented can be used to provide an explanation for the success of a number of recently reported approaches that are suggested to remove sources of bias of heritability estimates. Furthermore, however complex is the quantification of this bias, we can determine that, in genomic models that consider a single genomic component to estimate heritability (assuming SNP effects are all i.i.d.), the bias of the estimator tends to be downward, when it exists.

Background

Genomic models that incorporate dense genotype information are increasingly being used and studied to infer variance parameters [14]. We define a genomic model as any linear mixed model (LMM) that links a phenotype to multiple genotypes without knowledge of those that are associated with the phenotype. We refer to a general set of genotypes as single nucleotide polymorphisms (SNPs) and to the set of genotypes associated with the phenotype as quantitative trait loci (QTL). The variance parameters of the LMM can be inferred using restricted maximum likelihood (REML) [5], which produces consistent, asymptotically normal estimators of variance components, even if normality does not hold and the number of QTL increases dramatically, tending to infinity [6]. These asymptotic properties of the REML estimators are not guaranteed to hold when the likelihood of the genomic model used for inference differs substantially from the likelihood of the true model that conceptually generated the data. In such a situation, the likelihood is said to be misspecified. In a Gaussian setup, given the fixed effects, this will be the case when the covariance structures of the data specified by the genomic and the true models differ.
Fig. 1
Fig. 1

Simulation results for scenarios 1 and 2, consisting of one generation of completely unrelated individuals, with QTL and markers in complete linkage equilibrium (LE), for both \(\hbox {f}_{\mathrm{MAF}_{_{\mathrm{QTL}}}} = \hbox {f}_{\mathrm{MAF}_{_{\mathrm{markers}}}}\) and \(\hbox {f}_{\mathrm{MAF}_{_{\mathrm{QTL}}}} \ne \hbox {f}_{\mathrm{MAF}_{_{\mathrm{markers}}}}\). Simulations were performed with 3000 QTL generating the phenotypes, replicated 500 times. a shows the relationship between \(\lambda _{\mathrm{i}}\) and \(\kappa _{\mathrm{i}}\), for the true model (QTL only) and for both genomic models evaluated (QTL plus markers and markers only); b presents the confidence ellipses for the simulated heritabilities (\(\hbox {h}_{\mathrm{sim}}^{2} = \gamma _{_{\mathrm{sim}}}/(1+\gamma _{_{\mathrm{sim}}})\)), with a simulation parameter \(\hbox {h}^{2}=0.05,0.15,\ldots ,0.95\), and the heritabilities estimated using REML (\(\hbox {h}_{\mathrm{REML}}^{2} = \gamma _{_{\mathrm{REML}}}/(1+\gamma _{_{\mathrm{REML}}})\)), for the true model (QTL only) and for both genomic models evaluated (QTL plus markers and markers only); c presents the confidence ellipses for the simulated heritabilities (\(\hbox {h}_{\mathrm{sim}}^{2} = \gamma _{_{sim}}/(1+\gamma _{_{\mathrm{sim}}})\)), with a simulation parameter \(\hbox {h}^{2}=0.05,0.15,\ldots ,0.95\), and the relative bias of \(\hbox {h}_{\mathrm{REML}}^{2}\) (\(\hbox {RB}(\hbox {h}_{\mathrm{REML}}^{2}) = (\hbox {h}_{\mathrm{REML}}^{2}-\hbox {h}_{\mathrm{sim}}^{2})/\hbox {h}_{\mathrm{sim}}^{2}\))

Fig. 2
Fig. 2

Simulation results for scenarios 3, 4 and 5, consisting of one generation of completely unrelated individuals and two and 10 generations of related individuals, with QTL and markers in LD, for \(\hbox {f}_{\mathrm{MAF}_{_{\mathrm{QTL}}}} = \hbox {f}_{\mathrm{MAF}_{_{\mathrm{markers}}}}\). Simulations were performed with 3000 QTL generating the phenotypes, replicated 500 times. a shows the relationship between \(\lambda _{\mathrm{i}}\) and \(\kappa _{\mathrm{i}}\), for the true model (QTL only) and for both genomic models evaluated (QTL plus markers and markers only); b presents the confidence ellipses for the simulated heritabilities (\(\hbox {h}_{\mathrm{sim}}^{2} = \gamma _{_{\mathrm{sim}}}/(1+\gamma _{_{\mathrm{sim}}})\)), with a simulation parameter \(\hbox {h}^{2}=0.05,0.15,\ldots ,0.95\), and the heritabilities estimated using REML (\(\hbox {h}_{\mathrm{REML}}^{2} = \gamma _{_{\mathrm{REML}}}/(1+\gamma _{_{\mathrm{REML}}})\)), for the true model (QTL only) and for both genomic models evaluated (QTL plus markers and markers only); c presents the confidence ellipses for the simulated heritabilities (\(\hbox {h}_{\mathrm{sim}}^{2} = \gamma _{_{\mathrm{sim}}}/(1+\gamma _{_{\mathrm{sim}}})\)), with a simulation parameter \(\hbox {h}^{2}=0.05,0.15,\ldots ,0.95\), and the relative bias of \(\hbox {h}_{\mathrm{REML}}^{2}\) (\(\hbox {RB}(\hbox {h}_{\mathrm{REML}}^{2}) = (\hbox {h}_{\mathrm{REML}}^{2}-\hbox {h}_{\mathrm{sim}}^{2})/\hbox {h}_{\mathrm{sim}}^{2}\))

Fig. 3
Fig. 3

Simulation results for scenarios 6, 7 and 8, consisting of one generation of completely unrelated individuals and two and 10 generations of related individuals, with QTL and markers in LD, for \(\hbox {f}_{\mathrm{MAF}_{_{\mathrm{QTL}}}} \ne \hbox {f}_{\mathrm{MAF}_{_{\mathrm{markers}}}}\). Simulations were performed with 3000 QTL generating the phenotypes, replicated 500 times. Panel (a) shows the relationship between \(\lambda _{\mathrm{i}}\) and \(\kappa _{\mathrm{i}}\), for the true model (QTL only) and for both genomic models evaluated (QTL plus markers and markers only); panel (b) presents the confidence ellipses for the simulated heritabilities (\(\hbox {h}_{\mathrm{sim}}^{2} = \gamma _{_{\mathrm{sim}}}/(1+\gamma _{_{\mathrm{sim}}})\)), with a simulation parameter \(\hbox {h}^{2}=0.05,0.15,\ldots ,0.95\), and the heritabilities estimated using REML (\(\hbox {h}_{\mathrm{REML}}^{2} = \gamma _{_{\mathrm{REML}}}/(1+\gamma _{_{\mathrm{REML}}})\)), for the true model (QTL only) and for both genomic models evaluated (QTL plus markers and markers only); panel (c) presents the confidence ellipses for the simulated heritabilities (\(\hbox {h}_{\mathrm{sim}}^{2} = \gamma _{_{\mathrm{sim}}}/(1+\gamma _{_{\mathrm{sim}}})\)), with a simulation parameter \(\hbox {h}^{2}=0.05,0.15,\ldots ,0.95\), and the relative bias of \(\hbox {h}_{\mathrm{REML}}^{2}\) (\(\hbox {RB}(\hbox {h}_{\mathrm{REML}}^{2}) = (\hbox {h}_{\mathrm{REML}}^{2}-\hbox {h}_{\mathrm{sim}}^{2})/\hbox {h}_{\mathrm{sim}}^{2}\))

The correct covariance structure (referred to in our work as \(\mathbf{G }_{\mathrm{Q}}\)) requires knowledge of the QTL. Since these are typically unknown, in practice, the genomic model makes use of the available SNP genotypes instead in order to compute a covariance structure (referred to in our work as \(\mathbf{G }\)), leading to misspecification of the likelihood. The patterns of realized relationships at different sets of loci vary across the genome [7]. Because of this, \(\mathbf{G }\) may provide a poor description of \(\mathbf{G }_{\mathrm{Q}}\), and the likelihood misspecification may lead to biased estimators of variance parameters.

REML was first implemented with a genomic model in [1], where the focus of inference was the proportion of the variance of a quantitative trait explained by the LMM, including all genotyped SNPs simultaneously. In more recent years, concerns have been raised about the quality of the inferred variance parameters when genomic models are used without directly addressing the problem of likelihood misspecification. Speed et al. [4] argued that uneven linkage disequilibrium (LD) between SNPs can lead to upward or downward bias of variance parameters estimators. The consequences of using \(\mathbf{G }\) instead of \(\mathbf{G }_{\mathrm{Q}}\) on the likelihood were also investigated by [8]. These authors used the singular value decomposition of \(\mathbf{G }\) and expressed the likelihood function of the genomic model as a function of these decompositions, concluding that the likelihood-based estimators are unreliable because they are sensitive to small perturbations on the eigen-values. This work generated back-and-forth discussions [9, 10].

The problem of misspecification of the likelihood of the genomic model was first raised by [11] and was studied using simulation by [12]. However, in [12] the authors addressed the problem by redefining the variance parameters according to the genomic models. In our work, we compare the variance parameters estimators to the parameters defined by the true model, as previously studied by [13]. The assumptions posed by [13] on the true model, however, differ substantially from those posed by our study, which can lead to different conclusions. Jiang et al. [13] assumed that the number of QTL associated with a phenotype are large enough to be considered infinite, an assumption which we consider rather unrealistic, and therefore our study assumes that the number of QTL is finite, although possibly very large.

This work provides a theoretical analysis based on splitting the likelihood equations into components, isolating those that contribute to incorrect inferences. We describe a true model that associates a phenotype with QTL, and we use its likelihood as a basis for comparison with the likelihood of the genomic models. This theory was used to understand the potential bias of REML estimators of variance components under different scenarios, each with different assumptions on the true model, that are of interest in quantitative genetics studies.
Table 1

Relationship between \(\gamma\) and the REML estimators obtained from data with QTL plus markers (\(\hat{\gamma }_{\mathrm{QM}}\)) and markers only (\(\hat{\gamma }_{\mathrm{M}}\)), assuming \(\hbox {q}\) fixed and finite, \(\hbox {m}\) very large, \(\hbox {m} > \hbox {q}\), and \(\hbox {q}+\hbox {m}>>> \hbox {n}\)

Scenario

Population

MAF

QTL/markers

\({\lim _{\mathrm{n}\rightarrow \infty }}\)

\({\mathbb {E}(\hat{\gamma }_{\mathrm{QM}})}\)

\({\mathbb {E}(\hat{\gamma }_{\mathrm{M}})}\)

1

1 generation\(^{*}\)

\(\hbox {f}_{\mathrm{MAF}_{_{\mathrm{QTL}}}}=\hbox {f}_{\mathrm{MAF}_{_{\mathrm{makers}}}}\)

Complete LE

\(\gamma\)

0

2

1 generation\(^{*}\)

\(\hbox {f}_{\mathrm{MAF}_{_{\mathrm{QTL}}}}\ne \hbox {f}_{\mathrm{MAF}_{\mathrm{makers}}}\)

Complete LE

\(\gamma\)

0

3

1 generation\(^{*}\)

\(\hbox {f}_{\mathrm{MAF}_{_{\mathrm{QTL}}}}=\hbox {f}_{\mathrm{MAF}_{\mathrm{makers}}}\)

LD

\(\gamma\)

\(<<< \mathbb {E}(\hat{\gamma }_{\mathrm{QM}})\)

4

2 generations

\(\hbox {f}_{\mathrm{MAF}_{_{\mathrm{QTL}}}}=\hbox {f}_{\mathrm{MAF}_{\mathrm{makers}}}\)

LD

\(\gamma\)

\(<< \mathbb {E}(\hat{\gamma }_{\mathrm{QM}})\)

5

10 generations

\(\hbox {f}_{\mathrm{MAF}_{_{\mathrm{QTL}}}}=\hbox {f}_{\mathrm{MAF}_{\mathrm{makers}}}\)

LD

\(\gamma\)

\(< \mathbb {E}(\hat{\gamma }_{\mathrm{QM}})^{\dagger }\)

6

1 generation\(^{*}\)

\(\hbox {f}_{\mathrm{MAF}_{_{\mathrm{QTL}}}}\ne \hbox {f}_{\mathrm{MAF}_{\mathrm{makers}}}\)

LD

\(<<< \gamma\)

\(<<< \mathbb {E}(\hat{\gamma }_{\mathrm{QM}})\)

7

2 generations

\(\hbox {f}_{\mathrm{MAF}_{_{\mathrm{QTL}}}}\ne \hbox {f}_{\mathrm{MAF}_{\mathrm{makers}}}\)

LD

\(<< \gamma\)

\(<< \mathbb {E}(\hat{\gamma }_{\mathrm{QM}})\)

8

10 generations

\(\hbox {f}_{\mathrm{MAF}_{_{\mathrm{QTL}}}}\ne \hbox {f}_{\mathrm{MAF}_{\mathrm{makers}}}\)

LD

\(< \gamma ^{\dagger }\)

\(< \mathbb {E}(\hat{\gamma }_{\mathrm{QM}})^{\dagger }\)

*Completely unrelated individuals

\(^{\dagger }\lim _{\mathrm{g}\uparrow }\mathrm{h}^{2}_{\mathrm{M}}=\lim _{\mathrm{g}\uparrow }\mathrm{h}^{2}_{\mathrm{QM}}=\mathrm{h}^{2}\) for a large number \(\hbox {g}\) of generations (strongly related individuals)

Methods

True and genomic models

We start this section by describing a general model that links phenotypes of a complex trait to genotype data:
$$\begin{aligned} \mathbf{y }= & \mathbf{1 }_{\mathrm{n}}\mu + \mathbf{W }\mathbf{b } + \varvec{\varepsilon }, \end{aligned}$$
(1)
where \(\mu\) is the overall mean, \(\mathbf{W }\) is a \(\hbox {n} \times \hbox {s}\) standardized SNP genotypes matrix (with \(\hbox {W}_{\mathrm{ij}} = (\hbox {Z}_{\mathrm{ij}} - 2\theta _{\mathrm{j}})/\sqrt{2\theta _{\mathrm{j}}(1 - \theta _{\mathrm{j}})}\), \(\mathbb {E}(\hbox {W}_{\mathrm{ij}}) = 0\), and \(\mathbb {V}\hbox {ar}(\hbox {W}_{\mathrm{ij}}) = 1\); \(\hbox {Z}_{\mathrm{ij}} \in \left\{ 0,1,2\right\}\) is the count of the minor allele at the \(\hbox {j-th}\) SNP with minor allele frequency (MAF) \(\theta _{\mathrm{j}}\), of the \(\hbox {i-th}\) individual, for all \(\hbox {i}=1,\ldots ,\hbox {n}\) and \(\hbox {j}=1,\ldots ,\hbox {s}\)), \(\mathbf{b } \sim \hbox {N}(\mathbf{0 },\mathbf{I }_{\mathrm{s}}\sigma ^{2}_{\mathrm{b}})\) is an \(\hbox {s}\times 1\) vector of random SNP effects and \(\varvec{\varepsilon } \sim \hbox {N}(\mathbf{I }_{\mathrm{n}}\sigma ^{2}_{\varepsilon _{Q}})\) is an \(\hbox {n} \times 1\) vector of the model’s residuals. In the true model, \(\mathbf{W } \doteq \mathbf{W }_{\mathrm{Q}}\) is a \(\hbox {n} \times \hbox {q}\) (\(\hbox {q} \le \hbox {s}\)) matrix containing only the QTL, and all the elements and parameters that describe the model are sub-indexed with \(\hbox {Q}\). In the genomic model, \(\mathbf{W }\) contains \(\hbox {s} = \hbox {m} + \hbox {q}\) SNPs, with \(\hbox {m}=0,\ldots ,\hbox {s}\) the number of markers (non causative mutations). When \(\hbox {m} = 0\), we are in fact in the case of the true model, and when \(\hbox {m} = \hbox {s}\) the genomic model contains no QTL in the SNP data.

Variance components and REML estimation

The covariance structure specified by the true and genomic models are \(\mathbf{G }_{\mathrm{Q}} = \hbox {q}^{-1}\mathbf{W }_{\mathrm{Q}}\mathbf{W }_{\mathrm{Q}}'\) and \(\mathbf{G } = \hbox {s}^{-1}\mathbf{W }\mathbf{W }'\), respectively, which also define the relationships between individuals at the genotype level [14]. Let \(\sigma ^{2}_{\mathrm{T}} = \hbox {s}\sigma ^{2}_{\mathrm{b}}\) (total variance due to the genotypes), and \(\gamma = \sigma ^{2}_{\mathrm{T}}/\sigma ^{2}_{\varepsilon }\), and define the matrix \(\mathbf{V }(\gamma ) \doteq \mathbf{V } = \gamma \mathbf{G } + \mathbf{I }_{\mathrm{n}}\), we then have that \(\mathbb {V}\hbox {ar}(\mathbf{y } \,\, \mid \,\, \mathbf{W }) = \mathbb {V}\hbox {ar}(\mathbf{W } \mathbf{b }) + \mathbb {V}\hbox {ar}(\varvec{\varepsilon }) = \sigma ^{2}_{\mathrm{b}}\mathbf{W }\mathbf{W }' + \sigma ^{2}_{\varepsilon }\mathbf{I }_{\mathrm{n}} = \sigma ^{2}_{\mathrm{T}}\mathbf{G } + \sigma ^{2}_{\varepsilon }\mathbf{I }_{\mathrm{n}} = \sigma ^{2}_{\varepsilon }\mathbf{V }\).

In genetics studies, the interest lies in estimating the narrow-sense heritability, i. e. the proportion of phenotypic variance explained by the genotypes. Under the true model, the heritability is defined as \(\hbox {h}^{2} = \sigma ^{2}_{\mathrm{T}_{\mathrm{Q}}}/(\sigma ^{2}_{\mathrm{T}_{\mathrm{Q}}} + \sigma ^{2}_ {\varepsilon _{\mathrm{Q}}}) = \gamma _{_{\mathrm{Q}}}/(\gamma _{_{\mathrm{Q}}} + 1)\). Analogously, under the genomic model we have \(\hbox {h}^{2}_{\mathrm{gen}} = \gamma /(\gamma + 1)\). When we fit the true model (QTL only), \(\lim _{\mathrm{n}\rightarrow \infty }\hat{\gamma }_{_{\mathrm{Q}}} = \gamma _{_{\mathrm{Q}}}\) and consequently \(\lim _{\mathrm{n}\rightarrow \infty }\hat{\mathrm{h}}^{2} = \lim _{\mathrm{n}\rightarrow \infty }\hat{\gamma }_{_{\mathrm{Q}}}/(\hat{\gamma }_{_{\mathrm{Q}}} + 1) = \hbox {h}^{2}\), for any number of QTL [5], even if normality does not hold, as demonstrated by [6] using large-sample theory. We also used large-sample theory to evaluate the likelihood of the genomic model, which is misspecified to the likelihood of the model that conceptually generated the phenotypes. Intuitively, if when we fit the genomic model, we obtain \(\lim _{\mathrm{n}\rightarrow \infty }\hat{\gamma } = \gamma _{_{\mathrm{Q}}}\), then \(\lim _{\mathrm{n}\rightarrow \infty }\hat{\hbox {h}}^{2}_{\mathrm{gen}} = \lim _{\mathrm{n}\rightarrow \infty }\hat{\gamma }/(\hat{\gamma } + 1) = \hbox {h}^{2}\). This means that if \(\lim _{\mathrm{n}\rightarrow \infty }\mathbb {E}(\hat{\gamma })=\gamma _{_{\mathrm{Q}}}\), then \(\lim _{\mathrm{n}\rightarrow \infty }\mathbb {E}(\hat{\hbox {h}}^{2}_{\mathrm{gen}})=\hbox {h}^{2}\). Therefore, because REML will yield the estimator \(\hat{\gamma }\), we focus our analysis on \(\mathbb {E}(\hat{\gamma })\).

Define \(\mathbf{P }(\gamma ) \doteq \mathbf{P } = \mathbf{V }^{-1} - \mathbf{V }^{-1}\mathbf{1 }_{\mathrm{n}}\left( \mathbf{1 }_{\mathrm{n}}'\mathbf{V }^{-1}\mathbf{1 }_{\mathrm{n}}\right) ^{-1}\mathbf{1 }_{\mathrm{n}}'\mathbf{V }^{-1}\). The REML log-likelihood of the genomic model is [15]:
$$\begin{aligned} \ell \left( \gamma ,\sigma ^{2}_{\varepsilon } \mid \mathbf{W }\right)\propto & -\log \left( \sigma ^{\mathrm{2n}}_{\varepsilon }\mid \mathbf{V } \mid \right) - \log \left( \sigma ^{-2}_{\varepsilon }\mid \mathbf{1 }_{\mathrm{n}}'\mathbf{V }^{-1}\mathbf{1 }_{\mathrm{n}} \mid \right) -\sigma ^{-2}_{\varepsilon }\mathbf{y }'\mathbf{P }\mathbf{y }, \end{aligned}$$
(2)
and by equating its gradient to zero at the point of maximum, \(\hat{\gamma }\) is the root of the REML equation [16]:
$$\begin{aligned} \frac{\mathbf{y }'\mathbf{P }\mathbf{G }\mathbf{P }\mathbf{y }}{\hbox {tr}\left( \mathbf{P }\mathbf{G }\right) } - \frac{\mathbf{y }'\mathbf{P }^{2}\mathbf{y }}{\hbox {tr}\left( \mathbf{P }\right) } =\, 0. \end{aligned}$$
(3)
Using the eigen-decomposition \(\mathbf{G } = \mathbf{U }\varvec{\Lambda }\mathbf{U }'\), the REML Eq. (3) can be written as a function of \(\gamma\) (see “Appendix 1”):
$$\begin{aligned} g(\gamma )= & \sum _{\mathrm{i}=1}^{\mathrm{n}-1}\sum _{\mathrm{j}=1}^{\mathrm{n}-1}\frac{(\mathbf{y }'\mathbf{U }_{\mathrm{i}})^{2}}{(1 + \gamma \lambda _{\mathrm{i}})^{2}}\left( \frac{\lambda _{\mathrm{i}} - \lambda _{\mathrm{j}}}{1 + \gamma \lambda _{\mathrm{j}}}\right) + \hbox {n}\bar{\mathrm{y}}^{2}\sum _{\mathrm{j}=1}^{\mathrm{n}-1}\frac{\lambda _{\mathrm{j}}}{1 + \gamma \lambda _{\mathrm{j}}}, \end{aligned}$$
(4)
such that \(\hat{\gamma }\) is the root of \(\hbox {g}(\gamma )\). Now, since \((\mathbf{y }'\mathbf{U }_{\mathrm{i}})^{2} \longrightarrow \mu ^{2}(\mathbf{1 }_{\mathrm{n}}'\mathbf{U }_{\mathrm{i}})^{2} + (\mathbf{b }_{\mathrm{Q}}'\mathbf{W }_{\mathrm{Q}}'\mathbf{U }_{\mathrm{i}})^{2}\) for \(\hbox {n}\) sufficiently large (see “Appendix 2”), with \((\mathbf{b }_{\mathrm{Q}}'\mathbf{W }_{\mathrm{Q}}'\mathbf{U }_{\mathrm{i}})^{2} \propto \sum _{\mathrm{k}=1}^{\mathrm{n}}(\mathbf{U }_ {\mathrm{i}}'\mathbf{U }_{\mathrm{Qk}})^{2}\lambda _{\mathrm{Qk}}\), we can write the non-observable REML function at the root as the following (see “Appendix 3”):
$$\begin{aligned} \sum _{\mathrm{i}=1}^{\mathrm{n}-1}\sum _{\mathrm{j}=1}^{\mathrm{n}-1}\frac{\sum _{\mathrm{k}=1}^{\mathrm{n}}(\mathbf{U }_ {\mathrm{i}}'\mathbf{U }_{\mathrm{Qk}})^{2}\lambda _{\mathrm{Qk}}}{(1 + \hat{\gamma }\lambda _{\mathrm{i}})^{2}}\left( \frac{\lambda _{\mathrm{i}} - \lambda _{\mathrm{j}}}{1 + \hat{\gamma }\lambda _{\mathrm{j}}}\right) =\, 0. \end{aligned}$$
(5)
We refer to Eq. (5) as non-observable because it is written as a function of \(\mathbf{U }_{\mathrm{Q}}\) and \(\varvec{\Lambda }_{\mathrm{Q}}\), which cannot be observed directly when only phenotype and genomic data are available, and we have no knowledge about the QTL. The use of such function is purely theoretical as an aid to obtaining deeper understanding of REML mechanisms. In practical implementations, we find the root of Eq. (4) to obtain \(\hat{\gamma }\).

Genomic models for scenarios of interest in quantitative genetics

We evaluated genomic models for SNP data consisting of QTL and markers that can be either uncorrelated or correlated, considering two configurations: (i) QTL plus markers and (ii) markers only. The covariance structure specified by these configurations, as well as their eigen-decomposition are denoted respectively by \(\mathbf{G }_{\mathrm{QM}} = \mathbf{U }_{\mathrm{QM}}\varvec{\Lambda }_{\mathrm{QM}}\mathbf{U }_{\mathrm{QM}}'\) and \(\mathbf{G }_{\mathrm{M}} = \mathbf{U }_{\mathrm{M}}\varvec{\Lambda }_{\mathrm{M}}\mathbf{U }_{\mathrm{M}}'\).

Before we proceed to define our scenarios of interest and evaluate \(\kappa _{\mathrm{i}}\) for each of them, we provide a brief discussion about our assumptions for the true model. In the study by Jiang et al. [13], the authors assumed that both the number of QTL (\(\hbox {q}\)) and the number of markers (\(\hbox {m}\)) increase simultaneously with increasing SNP data density (\(\hbox {q},\hbox {m}\rightarrow \infty\)). Define \(\mathbf{A }\) as the matrix of expected relationships between individuals, such that \(\mathbb {E}(\mathbf{G }_{\mathrm{Q}}) = \mathbb {E}(\mathbf{G }) = \mathbf{A }\) [17]. Then \(\lim _{\mathrm{q},\mathrm{m} \rightarrow \infty }\mathbf{G } = \lim _{\mathrm{q} \rightarrow \infty }\mathbf{G }_{\mathrm{Q}} = \mathbf{A }\), and using the eigen-decompositions \(\mathbf{A } = \mathbf{U }_{\mathrm{A}}\varvec{\Lambda }_{\mathrm{A}}\mathbf{U }_{\mathrm{A}}'\), \(\mathbf{G }_{\mathrm{Q}} = \mathbf{U }_{\mathrm{Q}}\varvec{\Lambda }_{\mathrm{Q}}\mathbf{U }_{\mathrm{Q}}'\) and \(\mathbf{G } = \mathbf{U }\varvec{\Lambda }\mathbf{U }'\), we can state that \(\lim _{\mathrm{q},\mathrm{m} \rightarrow \infty }\mathbf{U } = \lim _{\mathrm{q} \rightarrow \infty }\mathbf{U }_{\mathrm{Q}} = \mathbf{U }_{\mathrm{A}}\) and \(\lim _{\mathrm{q},\mathrm{m} \rightarrow \infty }\varvec{\Lambda } = \lim _{\mathrm{q} \rightarrow \infty }\varvec{\Lambda }_{\mathrm{Q}} = \varvec{\Lambda }_{\mathrm{A}}\). Therefore, \(\lim _{\mathrm{q},\mathrm{m} \rightarrow \infty }\hbox {g}(\gamma ) = \lim _{\mathrm{q} \rightarrow \infty }\hbox {g}_{_{\mathrm{Q}}}(\gamma ) = \hbox {g}_{_{\mathrm{A}}}(\gamma ) = \sum _{\mathrm{i}=1}^{\mathrm{n}-1}\sum _{\mathrm{j}=1}^{\mathrm{n}-1}(\mathbf{y }'\mathbf{U }_{\mathrm{Ai}})^{2}(\lambda _{\mathrm{Ai}} - \lambda _{\mathrm{Aj}})/[(1 + \gamma \lambda _{\mathrm{Ai}})^{2}(1 + \gamma \lambda _{\mathrm{Aj}})] + \hbox {n}\bar{\hbox {y}}^{2}\sum _{\mathrm{j}=1}^{\mathrm{n}-1}\lambda _{\mathrm{Aj}}/(1 + \gamma \lambda _{\mathrm{Aj}})\), meaning that the REML functions of the true and genomic models become equal with increasing SNP data density. Because \(\lim _{\mathrm{n}\rightarrow \infty }\mathbb {E}(\hat{\gamma }_{_{\mathrm{Q}}})=\gamma _{_{\mathrm{Q}}}\) [6], if \(\lim _{\mathrm{q},\mathrm{m} \rightarrow \infty }\hbox {g}(\gamma ) = \lim _{\mathrm{q} \rightarrow \infty }\mathrm{g}_{_{\mathrm{Q}}}(\gamma )\), it is straightforward that \(\lim _{\mathrm{n}\rightarrow \infty }\mathbb {E}(\hat{\gamma })=\gamma _{_{\mathrm{Q}}}\).

We consider that the assumption that both \(\hbox {q}\) and \(\hbox {m}\) increase simultaneously with increasing SNP data density is too strong. Unless we consider the true model to be the infinitesimal model (in which a phenotype is generated by a countable infinite number of QTL, each with very small effect), it is most likely that the number of QTL will be finite (it may still be large, but finite). Thus we assumed a fixed and finite number of QTL for the true model, and therefore \(\mathbf{G }_{\mathrm{Q}} \ne \mathbf{A }\). Our main objective was to evaluate how much of the variability in the phenotypes can be captured by \(\mathbf{G }\), potentially leading to \(\hat{\gamma }\) being biased to \(\gamma _{_{\mathrm{Q}}}\).

We used simulations to support the theoretical analysis of the REML estimators of \(\hbox {h}^{2}=\gamma _{_{\mathrm{Q}}}/(\gamma _{_{\mathrm{Q}}}+1)\) when using genomic models. A preliminary study indicated that 2000 individuals were enough to ensure the asymptotic properties of REML under the true model. The simulations were performed for eight scenarios that differed in population structure (completely unrelated or related individuals) and genetic architecture, in the linkage disequilibrium between QTL and markers, and the MAF of the QTL. We assumed independence between MAF and effect sizes when simulating QTL effects, and phenotypes were simulated using scaled genotypes, with a heritability parameter \(\hbox {h}^{2}=0.05,0.15,\ldots ,0.95\). 20,000 SNPs were simulated and for each scenario we estimated the heritability for 500 replicates that assigned 100 SNPs as QTL, and for 500 replicates that assigned 3000 SNPs as QTL. The algorithms used for the simulations can be found in appendices G to J. For each scenario, the heritability was estimated using the true model (QTL only), and two genomic models, one containing the QTL plus the markers, and one containing the markers only.

Results

Conditions for unbiased or biased REML estimators

The key to evaluating the bias of \(\hat{\gamma }\) to the true parameter \(\gamma _{_{\mathrm{Q}}}\) is the term \(\sum _{\mathrm{k}=1}^{\mathrm{n}}(\mathbf{U }_ {\mathrm{i}}'\mathbf{U }_{\mathrm{Qk}})^{2}\lambda _{\mathrm{Qk}}\) for every \(\hbox {i}= 1,\ldots ,\hbox {n}\), in Eq. (5). This term corresponds to the diagonal of \(\mathbf{U }'\mathbf{U }_{\mathrm{Q}}\varvec{\Lambda }_{\mathrm{Q}}\mathbf{U }_{\mathrm{Q}}'\mathbf{U }\). The off-diagonals of this matrix do not feature in the likelihood of the genomic models, as can be seen in Eq. (5), and thus, are not relevant to the bias of \(\hat{\gamma }\). Nonetheless, we performed a brief analysis of the off-diagonal elements in our simulations, and verified that their values are centered at zero and within the interval (− 0.15,0.15), regardless of the scenario. To simplify notation, we define:
$$\begin{aligned} \sum _{\mathrm{k}=1}^{\mathrm{n}}(\mathbf{U }_ {\mathrm{i}}'\mathbf{U }_{\mathrm{Qk}})^{2}\lambda _{\mathrm{Qk}} = \kappa _{\mathrm{i}}, \quad \forall \quad \hbox {i}=1,\ldots ,\hbox {n}. \end{aligned}$$
(6)
Here, we set out the conditions for asymptotically unbiased or biased \(\hat{\gamma }\), and in the following subsections we give details on how we arrived at these conditions:
  1. (1)

    \(\kappa _{\mathrm{i}} = \lambda _{\mathrm{i}}, \quad \forall \,\, \hbox {i}=1,\ldots ,\hbox {n} \quad \Longrightarrow \quad \lim _{\mathrm{n}\rightarrow \infty }\mathbb {E}(\hat{\gamma }) = \gamma _{_{\mathrm{Q}}}\)    (unbiased)

     
  2. (2)

    \(\kappa _{\mathrm{i}} = \hbox {c}, \quad \forall \,\, \hbox {i}=1,\ldots ,\hbox {n} \quad \Longrightarrow \quad \lim _{\mathrm{n}\rightarrow \infty }\mathbb {E}(\hat{\gamma }) = 0\)    (\(\gamma _{_{\mathrm{Q}}}\) cannot be estimated)

     
  3. (3)

    \(\kappa _{\mathrm{i}}< \lambda _{\mathrm{i}}, {\text{ for most }} \hbox {i}=1,\ldots ,\hbox {n} \, \Longrightarrow \, \lim _{\mathrm{n}\rightarrow \infty }\mathbb {E}(\hat{\gamma }) < \gamma _{_{\mathrm{Q}}}\)    (biased: downwards)

     
  4. (4)

    \(\kappa _{\mathrm{i}}> \lambda _{\mathrm{i}}, {\text{ for most }} \hbox {i}=1,\ldots ,\hbox {n} \, \Longrightarrow \, \lim _{\mathrm{n}\rightarrow \infty }\mathbb {E}(\hat{\gamma }) > \gamma _{_{\mathrm{Q}}}\)    (biased: upwards)

     
Note that \(\hbox {h}^{2}\) and \(\hbox {h}^{2}_{\mathrm{gen}}\) are monotone increasing to \(\gamma _{_{\mathrm{Q}}}\) and \(\gamma\) respectively, meaning that if we have \(0< \gamma _{-}< \gamma _{\mathrm{Q}} < \gamma _{+}\), such that \(\hbox {h}^{2}_{-} = \gamma _{-}/(\gamma _{-}+1)\) and \(\hbox {h}^{2}_{-} = \gamma _{+}/(\gamma _{+}+1)\), then \(\hbox {h}^{2}_{-}< \hbox {h}^{2} < \hbox {h}^{2}_{+}\). Thus, the direction of the bias of \(\hat{\hbox {h}}^{2}_{\mathrm{gen}}\) for \(\hbox {h}^{2}\) is the same as that of the bias of \(\hat{\gamma }\) for \(\gamma _{_{\mathrm{Q}}}\).

Unbiased estimators

We know that \(\lim _{\mathrm{n}\rightarrow \infty }\mathbb {E}(\hat{\gamma }_{_{\mathrm{Q}}})=\gamma _{_{\mathrm{Q}}}\) [6]. This holds for any number \(\hbox {q}\) of QTL, regardless of their MAF and correlation between them. Moreover, in the non-observable REML equation, \(\kappa _{\mathrm{Qi}} = \lambda _{\mathrm{Qi}}\). This means that for any set of eigen-values from any \(\mathbf{G }\), for \(\hbox {n}\) sufficiently large, \(\lim _{\mathrm{n}\rightarrow \infty }\mathbb {E}(\hat{\gamma })=\gamma _{_{\mathrm{Q}}}\) if, and only if, the structure of the non-observable REML equation is as:
$$\begin{aligned} \sum _{\mathrm{i}=1}^{\mathrm{n}-1}\sum _{\mathrm{j}=1}^{\mathrm{n}-1}\frac{\lambda _{\mathrm{i}}}{(1 + \hat{\gamma }\lambda _{\mathrm{i}})^{2}}\left( \frac{\lambda _{\mathrm{i}} - \lambda _{\mathrm{j}}}{1 + \hat{\gamma }\lambda _{\mathrm{j}}}\right) = 0, \end{aligned}$$
(7)
meaning that \(\hat{\gamma }\) is unbiased for \(\gamma _{_{\mathrm{Q}}}\) if, and only if, \(\kappa _{\mathrm{i}} = \lambda _{\mathrm{i}}\), for all \(\hbox {i}=1,\ldots ,\hbox {n}\).

Biased estimators

There are two cases in which \(\hat{\gamma }\) is biased to \(\gamma _{_{\mathrm{Q}}}\): (1) \(\kappa _{\mathrm{i}} = \hbox {c}\) for \(\hbox {i}=1,\ldots ,\hbox {n}-1\), such that \(\hbox {c}\) is a positive constant; (2) \(\kappa _{\mathrm{i}} = \hbox {a}_{\mathrm{i}} \ne \lambda _{\mathrm{i}}\) for \(\hbox {i}=1,\ldots ,\hbox {n}-1\), such that \(\hbox {a}_{\mathrm{i}} > 0\). Note that because the \(\mathbf{G }\) matrix is built with centered and scaled genotypes, its eigen-decomposition has \(\hbox {n}-1\) degrees of freedom, and \(\kappa _{\mathrm{n}} = \lambda _{\mathrm{n}} = 0\) always.

In the first case that \(\hat{\gamma }\) is biased to \(\gamma _{_{\mathrm{Q}}}\), where \(\kappa _{\mathrm{i}} = \hbox {c}\) for \(\hbox {i}=1,\ldots ,\hbox {n}-1\), we have:
$$\begin{aligned} \sum _{\mathrm{i}=1}^{\mathrm{n}-1}\sum _{\mathrm{j}=1}^{\mathrm{n}-1}\frac{\mathrm{c}}{(1 + \hat{\gamma }\lambda _{\mathrm{i}})^{2}}\left( \frac{\lambda _{\mathrm{i}} - \lambda _{\mathrm{j}}}{1 + \hat{\gamma }\lambda _{\mathrm{j}}}\right) = 0. \end{aligned}$$
(8)
Only \(\hat{\gamma } = 0\) guarantees the identity in Eq. (8). Therefore, when \(\kappa _{\mathrm{i}} = \hbox {c}\) for \(\hbox {i}=1,\ldots ,\hbox {n}-1\), no variance from the genomic data can be captured by REML.
We now analyze the second case where \(\kappa _{\mathrm{i}} = \hbox {a}_{\mathrm{i}}\) for \(\hbox {i}=1,\ldots ,\hbox {n}-1\). If the relationship between \(\hbox {a}_{\mathrm{i}}\) and \(\lambda _{\mathrm{i}}\) is linear, i.e. \(\hbox {a}_{\mathrm{i}} = \hbox {b}\lambda _{\mathrm{i}}\), then Eq. (7) ensures \(\lim _{\mathrm{n}\rightarrow \infty }\mathbb {E}(\hat{\gamma })=\gamma _{_{\mathrm{Q}}}\). However, because \(\sum _{\mathrm{i}=1}^{\mathrm{n}}\kappa _{\mathrm{i}} = \sum _{\mathrm{i}=1}^{\mathrm{n}}\lambda _{\mathrm{i}} = \mathrm{n}-1\), \(\mathrm{a}_{\mathrm{i}}\) and \(\lambda _{\mathrm{i}}\) cannot be linearly related, and \(\hat{\gamma }\) will be biased to \(\gamma _{_{\mathrm{Q}}}\). We have now:
$$\begin{aligned} \sum _{\mathrm{i}=1}^{\mathrm{n}-1}\sum _{\mathrm{j}=1}^{\mathrm{n}-1}\frac{\hbox {a}_{\mathrm{i}}}{(1 + \hat{\gamma }\lambda _{\mathrm{i}})^{2}}\left( \frac{\lambda _{\mathrm{i}} - \lambda _{\mathrm{j}}}{1 + \hat{\gamma }\lambda _{\mathrm{j}}}\right) = 0. \end{aligned}$$
(9)
To understand the bias in this case, we will go through some details about \(\hbox {a}_{\mathrm{i}} > 0\). Let \(\hbox {a}_{\mathrm{i}} = \lambda _{\mathrm{i}} + \hbox {b}_{\mathrm{i}}\), with \(\hbox {b}_{\mathrm{i}} \ge -\lambda _{\mathrm{i}}\) and \(\sum _{\mathrm{i}=1}^{\mathrm{n}}\hbox {b}_{\mathrm{i}} = 0\) (because \(\sum _{\mathrm{i}=1}^{\mathrm{n}}\kappa _{\mathrm{i}} = \sum _{\mathrm{i}=1}^{\mathrm{n}}\lambda _{\mathrm{i}}\)). Thus, \(\hat{\gamma }\) satisfies
$$\begin{aligned} \sum _{\mathrm{i}=1}^{\mathrm{n}-1}\sum _{\mathrm{j}=1}^{\mathrm{n}-1}\frac{\lambda _{\mathrm{i}}}{(1 + \hat{\gamma }\lambda _{\mathrm{i}})^{2}}\left( \frac{\lambda _{\mathrm{i}} - \lambda _{\mathrm{j}}}{1 + \hat{\gamma }\lambda _{\mathrm{j}}}\right) + \sum _{\mathrm{i}=1}^{\mathrm{n}-1}\sum _{\mathrm{j}=1}^{\mathrm{n}-1}\frac{\hbox {b}_{\mathrm{i}}}{(1 + \hat{\gamma }\lambda _{\mathrm{i}})^{2}}\left( \frac{\lambda _{\mathrm{i}} - \lambda _{\mathrm{j}}}{1 + \hat{\gamma }\lambda _{\mathrm{j}}}\right) = 0. \end{aligned}$$
(10)
From the unbiased case, we know that \(\sum _{\mathrm{i}=1}^{\mathrm{n}-1}\sum _{\mathrm{j}=1}^{\mathrm{n}-1}\lambda _{\mathrm{i}}(\lambda _{\mathrm{i}} - \lambda _{\mathrm{j}})/[(1 + \hat{\gamma }\lambda _{\mathrm{i}})^{2}(1 + \hat{\gamma }\lambda _{\mathrm{j}})]\mid _{\hat{\gamma } = \gamma _{\mathrm{Q}}} = 0\). This means that if \(\sum _{\mathrm{i}=1}^{\mathrm{n}-1}\sum _{\mathrm{j}=1}^{\mathrm{n}-1}\mathrm{b}_{\mathrm{i}}(\lambda _{\mathrm{i}} - \lambda _{\mathrm{j}})/[(1 + \hat{\gamma }\lambda _{\mathrm{i}})^{2}(1 + \hat{\gamma }\lambda _{\mathrm{j}})] < 0\), (10) will hold only if \(\sum _{\mathrm{i}=1}^{\mathrm{n}-1}\sum _{\mathrm{j}=1}^{\mathrm{n}-1}\lambda _{\mathrm{i}}(\lambda _{\mathrm{i}} - \lambda _{\mathrm{j}})/[(1 + \hat{\gamma }\lambda _{\mathrm{i}})^{2}(1 + \hat{\gamma }\lambda _{\mathrm{j}})] > 0\). Because the latter is monotone decreasing on \(\hat{\gamma }\), only an estimator \(\hat{\gamma } < \gamma _{\mathrm{Q}}\) will result in \(\sum _{\mathrm{i}=1}^{\mathrm{n}-1}\sum _{\mathrm{j}=1}^{\mathrm{n}-1}\frac{\lambda _{\mathrm{i}}}{(1 + \hat{\gamma }\lambda _{\mathrm{i}})^{2}}\left( \frac{\lambda _{\mathrm{i}} - \lambda _{\mathrm{j}}}{1 + \hat{\gamma }\lambda _{\mathrm{j}}}\right)\) > 0. Now, if \(\sum _{\mathrm{i}=1}^{\mathrm{n}-1}\sum _{\mathrm{j}=1}^{\mathrm{n}-1}\mathrm{b}_{i}(\lambda _{\mathrm{i}} - \lambda _{\mathrm{j}})/[(1 + \hat{\gamma }\lambda _{\mathrm{i}})^{2}(1 + \hat{\gamma }\lambda _{\mathrm{j}})] > 0\), then we have analogously that \(\hat{\gamma } > \gamma _{\mathrm{Q}}\).

Genomic models for scenarios of interest in quantitative genetics

QTL uncorrelated to markers

In “Appendix 5”, we show that for genomic models that include the QTL plus markers, in which markers are uncorrelated to the QTL,
$$\begin{aligned} \kappa _{_{\mathrm{QM}}\mathrm{i}} = \lambda _{\mathrm{QMi}} + \frac{\mathrm{m}}{\mathrm{q}}\left( \lambda _{\mathrm{QMi}} - 1 - \delta _{\mathrm{i}}\right) , \quad \forall \quad \mathrm{i}=1,\ldots ,\mathrm{n}-1, {\text{ with }} \mathbb {E}\left( \delta _{\mathrm{i}}\right) = 0. \end{aligned}$$
(11)
Assuming that the number of SNPs is always much larger than the number of individuals (\(\hbox {q}+\hbox {m}>>>\hbox {n}\)), \(\lambda _{\mathrm{QM}1}>\cdots>\lambda _{\mathrm{QM},\mathrm{n}-1}>\lambda _{\mathrm{QMn}}=0\). Since \(\sum _{\mathrm{i}=1}^{\mathrm{n}}\lambda _{\mathrm{QMi}} = \hbox {n}-1\) and because \(\lambda _{\mathrm{QM1}},\ldots ,\lambda _{\mathrm{QMn}}\) follow the Mar\(\breve{c}\)enko-Pastur distribution when \(\hbox {n}\) is sufficiently large [18], we have:
$$\begin{aligned} \left( 1-\sqrt{\frac{\hbox {n}}{\hbox {q}+\hbox {m}}}\right) ^{2}< \lambda _{\mathrm{QM},\mathrm{n}-1}< \cdots< \lambda _{\mathrm{QM}1} < \left( 1+\sqrt{\frac{\hbox {n}}{\hbox {q}+\hbox {m}}}\right) ^{2}. \end{aligned}$$
(12)
Note in Eq. (12) that increasing the number \(\hbox {m}\) of markers will concentrate the eigen-values more strongly around 1. Hence, for \(\hbox {m}\) very large, \(\left( \lambda _{\mathrm{QMi}} - 1\right) \rightarrow 0\) at a faster rate than \(\hbox {m}/\hbox {q}\) increases, and \(\kappa _{_{\mathrm{QM}}\mathrm{i}} \rightarrow \lambda _{\mathrm{QMi}} - (\mathrm{m}/\mathrm{q})\delta _{\mathrm{i}}\). Since \(\mathbb {E}\left( \delta _{\mathrm{i}}\right) = 0\), the ratio \(\hbox {m}/\hbox {q}\) determines only the variance of \(\kappa _{_{\mathrm{QM}}\mathrm{i}}\) around \(\lambda _{\mathrm{QMi}}\). Therefore, a genomic model that contains QTL plus markers that are uncorrelated to the QTL will yield \(\hat{\gamma }_{_{\mathrm{QM}}}\) such that \(\lim _{\mathrm{n}\rightarrow \infty }\mathbb {E}(\hat{\gamma }_{_{\mathrm{QM}}})=\gamma _{_{\mathrm{Q}}}\).
We also show in “Appendix 5” that for genomic models that include markers only,
$$\begin{aligned} \kappa _{_{\mathrm{M}}\mathrm{i}} = 1 + \delta _{\mathrm{i}}, \quad \forall \quad \mathrm{i}=1,\ldots ,\mathrm{n}-1, {\text{ with }} \mathbb {E}\left( \delta _{\mathrm{i}}\right) = 0. \end{aligned}$$
(13)
Therefore, \(\kappa _{\mathrm{i}}\) is a constant, and a genomic model that only contains markers that are uncorrelated to the QTL in the SNP data will always obtain \(\hat{\gamma }_{\mathrm{M}} = 0\), when REML is used.

QTL correlated to markers

In “Appendix 6”, we show that for genomic models that include the QTL plus markers, in which markers are correlated to the QTL,
$$\begin{aligned} \kappa _{_{\mathrm{QM}}\mathrm{i}} = \lambda _{\mathrm{QMi}} + \delta _{\mathrm{i}}, \quad \forall \quad \mathrm{i}=1,\ldots ,\mathrm{n}-1, {\text{ with }} \mathbb {E}\left( \delta _{\mathrm{i}}\right) = \sum _{\mathrm{j}=1}^{\mathrm{n}}\sum _{\mathrm{l}=1}^{\mathrm{n}}(\sigma _{\mathrm{ijl,Qjl}} - \sigma _{\mathrm{ijl,jl}}), \end{aligned}$$
(14)
where \(\sigma _{\mathrm{ijl},\mathrm{Qjl}} = \mathbb {C}\hbox {ov}(\hbox {U}_{\mathrm{QMij}}\hbox {U}_{\mathrm{QMil}},\hbox {G}_{\mathrm{Qjl}})\) and \(\sigma _{\mathrm{ijl},\mathrm{jl}} = \mathbb {C}\hbox {ov}(\hbox {U}_{\mathrm{QMij}}\hbox {U}_{\mathrm{QMil}},\hbox {G}_{\mathrm{QMjl}})\). It is intuitive that \(\sigma _{\mathrm{ijl},\mathrm{jl}} \ge \sigma _{\mathrm{ijl},\mathrm{Qjl}}\), and therefore \(\mathbb {E}(\delta _{\mathrm{i}}) \le 0\). Thus, a genomic model that contains all QTL and markers that are correlated to the QTL will yield \(\hat{\gamma }_{_{\mathrm{QM}}}\) such that \(\lim _{\mathrm{n}\rightarrow \infty }\mathbb {E}(\hat{\gamma }_{_{\mathrm{QM}}})=\gamma _{_{\mathrm{Q}}}\), only when \(\sigma _{\mathrm{ijl},\mathrm{jl}} \approx \sigma _{\mathrm{ijl},\mathrm{Qjl}}\), resulting in \(\mathbb {E}(\delta _{\mathrm{i}}) \approx 0\).

\(\mathbb {E}\left( \delta _{\mathrm{i}}\right)\) depends greatly on the distributions of the minor allele frequencies of the QTL and markers, \(\hbox {f}_{\mathrm{MAF}(\mathrm{QTL})}\) and \(\hbox {f}_{\mathrm{MAF}(\mathrm{markers})}\), respectively. When \(\hbox {f}_{\mathrm{MAF}(\mathrm{QTL})} = \hbox {f}_{\mathrm{MAF}(\mathrm{markers})}\), unless the number of QTL is very small (say \(\hbox {q} \le 10\)), we find that \(\mathbf{G }_{\mathrm{Q}}\) and \(\mathbf{G }_{\mathrm{QM}}\) are very similar. It is intuitively obvious that in this case \(\sigma _{\mathrm{ijl},\mathrm{jl}} \approx \sigma _{\mathrm{ijl},\mathrm{Qjl}}\). Hence, \(\mathbb {E}\left( \delta _{\mathrm{i}}\right) \approx 0\), and consequently \(\lim _{\mathrm{n}\rightarrow \infty }\mathbb {E}(\hat{\gamma }_{_{\mathrm{QM}}})=\gamma _{_{\mathrm{Q}}}\). When \(\hbox {f}_{\mathrm{MAF}(\mathrm{QTL})} \ne \hbox {f}_{\mathrm{MAF}(\mathrm{markers})}\), the larger the number \(\hbox {m}\) of markers, the more different \(\mathbf{G }_{\mathrm{Q}}\) and \(\mathbf{G }_{\mathrm{QM}}\) will be. Moreover, \(\lim _{\mathrm{m}\rightarrow \infty }\mathbf{G }_{\mathrm{QM}} = \lim _{\mathrm{m}\rightarrow \infty }\mathbf{G }_{\mathrm{M}}\). This difference between \(\mathbf{G }_{\mathrm{Q}}\) and \(\mathbf{G }_{\mathrm{QM}}\) will ensure the inequality \(\sigma _{\mathrm{ijl},\mathrm{jl}} \ge \sigma _{\mathrm{ijl},\mathrm{Qjl}}\). Hence, \(\mathbb {E}\left( \delta _{\mathrm{i}}\right) < 0\), and consequently \(\mathbb {E}(\hat{\gamma }_{_{\mathrm{QM}}}) < \gamma _{_{\mathrm{Q}}}\).

We show in “Appendix 4” that \(\mathbf{G }_{\mathrm{Q}} = \mathbf{A } + \varvec{\Delta }_{\mathrm{Q}}\), with the number of QTL \(\hbox {q}\) being fixed, \(\mathbf{G }_{\mathrm{QM}} = \mathbf{A } + \varvec{\Delta }_{\mathrm{QM}}\), and \(\mathbf{G }_{\mathrm{M}} = \mathbf{A } + \varvec{\Delta }_{\mathrm{M}}\). The relationship between individuals increases the speed of convergence of \(\varvec{\Delta }_{\mathrm{QM}} \rightarrow \mathbf{0 }\) (when \(\hbox {f}_{\mathrm{MAF}(\mathrm{QTL})} \ne \hbox {f}_{\mathrm{MAF}(\mathrm{markers})}\)) and of \(\varvec{\Delta }_{\mathrm{M}} \rightarrow \mathbf{0 }\), when \(\hbox {m}\) increases. Therefore, \(\mathbb {V}\hbox {ar}\left( \delta _{\mathrm{QMij}}\right)\) and \(\mathbb {V}\hbox {ar}\left( \delta _{\mathrm{Mij}}\right)\) decreases when the number of generations increases. Finally, we find that \(\mathbf{G }\) and \(\mathbf{G }_{\mathrm{Q}}\) are more similar when the number of generations increases. Thus, the choice of populations with increasing relationship between individuals tends to reduce the bias of heritability estimators (when these are biased). However, when \(\hbox {f}_{\mathrm{MAF}(\mathrm{QTL})} \ne \hbox {f}_{\mathrm{MAF}(\mathrm{markers})}\), stronger relationships are necessary, so the downward-bias becomes less perceptible. The explanation for this is related to the range of the MAF. For any \(\mathbf{G } = \mathbf{A } + \varvec{\Delta }\), \(\mathbb {V}\hbox {ar}\left( \delta _{\mathrm{ij}}\right)\) for SNPs with low MAF is lower than \(\mathbb {V}\hbox {ar}\left( \delta _{\mathrm{ij}}\right)\) for SNPs with high MAF, and populations with more closely related individuals are necessary to compensate for this difference in \(\mathbb {V}\mathrm{ar}\left( \delta _{\mathrm{ij}}\right)\) across different MAF ranges.

We also show in “Appendix 6” that for genomic models that include the markers only,
$$\begin{aligned} \kappa _{_{\mathrm{M}}\mathrm{i}} = 1 + \sum _{\mathrm{j}=1}^{\mathrm{n}}\mathrm{U}_{\mathrm{Mij}}^{2}\delta _{\mathrm{Qjj}} + \sum _{\mathrm{j}=1}^{\mathrm{n}}\sum _{\mathrm{l} \ne \mathrm{j}}\mathrm{U}_{\mathrm{Mij}}\mathrm{U}_{\mathrm{Mil}}\left( \mathrm{a}_{\mathrm{jl}} + \delta _{\mathrm{Qjl}}\right) , \quad \forall \quad \mathrm{i}=1,\ldots ,\mathrm{n}-1. \end{aligned}$$
(15)
Equation (15) is not so straightforward to understand analytically. Assume that we randomly pick just a few markers. These markers will most likely be in low LD with the QTL and \(\kappa _{_{\mathrm{M}}\mathrm{i}} \approx 0\), as shown in the previous section. When the density of marker data increases, we obtain Eq. (15). We show in “Appendix 6” that \(\lim _{\mathrm{m}\rightarrow \infty }\kappa _{_{\mathrm{M}}\mathrm{i}} = \lim _{\mathrm{m}\rightarrow \infty }\kappa _{_{\mathrm{QM}}\mathrm{i}}\). This means that \(\lim _{\mathrm{n}\rightarrow \infty }\mathbb {E}(\hat{\gamma }_{_{\mathrm{M}}}) \le \lim _{\mathrm{n}\rightarrow \infty }\mathbb {E}(\hat{\gamma }_{_{\mathrm{QM}}})\), with equality only when \(\hbox {m}\rightarrow \infty\).

Simulations

Table 1 summarizes what is expected regarding the estimation of the heritability for a set of scenarios that are relevant in quantitative genetics studies, based on the theory detailed in the section Conditions for unbiased REML estimators. The REML estimation of \(\gamma\) was performed on data containing QTL plus markers (\(\hat{\gamma }_{\mathrm{QM}}\)) and markers only (\(\hat{\gamma }_{\mathrm{M}}\)). It is known that \(\hat{\gamma }_{\mathrm{M}} < \hat{\gamma }_{\mathrm{QM}}\), since the markers alone cannot capture more genetic variability than SNP data that contains both QTL and markers [12]. However, for scenarios in which markers are in LD with the QTL \(\lim _{\mathrm{m}\rightarrow \infty }\hat{\gamma }_{\mathrm{M}}=\lim _{\mathrm{m}\rightarrow \infty }\hat{\gamma }_{\mathrm{QM}}\). When individuals are completely unrelated, such convergence is most probably unrealistic even with sequence data (although \(\hat{\gamma }_{\mathrm{M}}\) approaches \(\hat{\gamma }_{\mathrm{QM}}\)). In populations with strongly related individuals \(\hat{\gamma }_{\mathrm{M}} \longrightarrow \hat{\gamma }_{\mathrm{QM}}\) for \(\hbox {m}\) finite and sufficiently large.

Figure 1 presents the simulation results for scenarios 1 and 2, Fig. 2 presents the simulation results for scenarios 3, 4 and 5, and Fig. 3 presents the simulation results for scenarios 6, 7 and 8. All three figures show the results for simulations that assigned 3000 SNPs as QTL. The results for the simulations that assigned 100 SNPs as QTL differed from those presented in Figs. 1, 2 and 3 only by a larger variance around the same means. In all three figures, panel (a) shows the relationship between \(\lambda _{\mathrm{i}}\) and \(\kappa _{\mathrm{i}}\), for the true model (QTL only) and for both genomic models evaluated (QTL plus markers and markers only); the relationship in the simulated data agreed with the theory in the section Genomic models for scenarios of interest in quantitative genetics, for QTL uncorrelated and correlated to markers, respectively. In all three figures, panel (b) presents the confidence ellipses for the simulated heritabilities (\(\hbox {h}_{\mathrm{sim}}^{2} = \gamma _{_{\mathrm{sim}}}/(1+\gamma _{_{\mathrm{sim}}})\)), with a simulation parameter \(\hbox {h}^{2}=0.05,0.15,\ldots ,0.95\), and the heritabilities estimated using REML (\(\hbox {h}_{\mathrm{REML}}^{2} = \gamma _{_{\mathrm{REML}}}/(1+\gamma _{_{\mathrm{REML}}})\)), for the true model (QTL only) and for both genomic models evaluated (QTL plus markers and markers only); \(\hbox {h}_{\mathrm{sim}}^{2}\) was very stable around the simulation parameters, and \(\hbox {h}_{\mathrm{REML}}^{2}\) had confidence intervals that agreed with the results in Table 1. Note that when QTL were correlated with markers, in scenarios 3 to 8, the variability of \(\hbox {h}_{\mathrm{REML}}^{2}\) was smaller than that of \(\hbox {h}_{\mathrm{REML}}^{2}\) when QTL were uncorrelated with markers, in scenarios 1 and 2. In all three figures, panel (c) presents the confidence ellipses for the simulated heritabilities (\(\hbox {h}_{\mathrm{sim}}^{2} = \gamma _{_{\mathrm{sim}}}/(1+\gamma _{_{\mathrm{sim}}})\)), with a simulation parameter \(\hbox {h}^{2}=0.05,0.15,\ldots ,0.95\), and the relative bias of \(\hbox {h}_{\mathrm{REML}}^{2}\) (\(\hbox {RB}(\hbox {h}_{\mathrm{REML}}^{2}) = (\hbox {h}_{\mathrm{REML}}^{2}-\hbox {h}_{\mathrm{sim}}^{2})/\hbox {h}_{\mathrm{sim}}^{2}\)). Note that when QTL were correlated with markers, in scenarios 3 to 8, the variability of \(\hbox {RB}(\hbox {h}_{\mathrm{REML}}^{2})\) was smaller than that of \(\hbox {RB}(\hbox {h}_{\mathrm{REML}}^{2})\) when QTL were uncorrelated with markers, in scenarios 1 and 2. Note as well, that the variability of \(\hbox {RB}(\hbox {h}_{\mathrm{REML}}^{2})\) decreases when \(\hbox {h}^{2}_{\mathrm{sim}}\) increases. For scenarios 6, 7 and 8, in which QTL were correlated with markers and \(\hbox {f}_{\mathrm{MAF}(\mathrm{QTL})}\ne \hbox {f}_{\mathrm{MAF}(\mathrm{makers})}\), we found that increasing the number of generations (thus, increasing the relationship between simulated individuals) decreases the bias of estimation and the variability of \(\hbox {h}^{2}_{\mathrm{REML}}\) and \(\hbox {RB}(\hbox {h}_{\mathrm{REML}}^{2})\).

Discussion

We have performed a theoretical analysis of the likelihood equations of genomic models based on splitting these equations into components in order to isolate and identify those that contribute to incorrect inferences. We have shown that the term in the likelihood equations that is responsible for producing potentially biased heritability estimators (\(\hat{\hbox {h}}^{2}_{\mathrm{gen}}\)) is in fact a measure that evaluates whether \(\mathbf{G }\) provides a proper description of \(\mathbf{G }_{\mathrm{Q}}\) or not.

The key measure to evaluate whether bias will arise in the REML heritability estimators is \(\kappa _{\mathrm{i}} = \sum _{\mathrm{k}=1}^{\mathrm{n}}(\mathbf{U }_ {\mathrm{i}}'\mathbf{U }_{\mathrm{Qk}})^{2}\lambda _{\mathrm{Qk}}\), for every \(\hbox {i}=1,\ldots ,\hbox {n}\), as we have shown in the section Conditions for unbiased REML estimators. Elements \(\kappa _{1},\ldots ,\kappa _{\mathrm{n}}\) correspond to the diagonal of \(\mathbf{U }'\mathbf{U }_{\mathrm{Q}}\varvec{\Lambda }_{\mathrm{Q}}\mathbf{U }_{\mathrm{Q}}'\mathbf{U }\), and the condition for unbiased \(\hat{\hbox {h}}^{2}_{\mathrm{gen}}\) is that the portion of variance explained by the \(\hbox {i-th}\) component of \(\mathbf{G }\) (defined by \(\lambda _{\mathrm{i}}\)) is equivalent to the sum of weighted correlations between its corresponding eigen-vector and the eigen-vectors of \(\mathbf{G }_{\mathrm{Q}}\), i.e. \(\kappa _{\mathrm{i}} = \lambda _{\mathrm{i}}\). This identity is equivalent to saying that \(\varvec{\Lambda }_{\mathrm{Q}}\) and \(\varvec{\Lambda }\) are similar matrices in the general linear group of \(\mathbf{U }_{\mathrm{Q}}'\mathbf{U }\). Evaluation of this similarity of \(\varvec{\Lambda }_{\mathrm{Q}}\) and \(\varvec{\Lambda }\) is much more informative than a direct comparison of the elements of \(\mathbf{G }_{\mathrm{Q}}\) with those of \(\mathbf{G }\), or a comparison of their eigen-values (see Additional file 1, which presents the distribution of \(\lambda _{\mathrm{i}}\) and \(\kappa _{\mathrm{i}}\), Additional files 2, 3 and 4, which present the scatterplots of \(\lambda _{\mathrm{i}}\) versus \((\mathbf y '\mathbf U _{\mathrm{i}})^{2}\), and Additional files 5 and 6, which present respectively a scenario where \(\lambda _{\mathrm{QMi}} \ne \lambda _{\mathrm{Qi}}\) and \(\kappa _{\mathrm{QMi}} = \lambda _{\mathrm{QMi}}\) with \(\mathbb {E}(\hat{\hbox {h}}^{2}_{\mathrm{QM}}) = \hbox {h}^{2}\), and a scenario where \(\lambda _{\mathrm{Mi}} = \lambda _{\mathrm{Qi}}\) and \(\kappa _{\mathrm{Mi}} \ne \lambda _{\mathrm{Mi}}\) with \(\mathbb {E}(\hat{\hbox {h}}^{2}_{\mathrm{M}}) \ne \hbox {h}^{2}\)). Hence, in the scenarios studied here, we can detect when a genomic relationship estimated by the SNPs correctly represents the true genetic relationships by verifying whether \(\varvec{\Lambda }_{\mathrm{Q}}\) and \(\varvec{\Lambda }\) are similar matrices in the general linear group of \(\mathbf{U }_{\mathrm{Q}}'\mathbf{U }\), by comparing \(\kappa _{\mathrm{i}}\) with \(\lambda _{\mathrm{i}}\). This comparison allows us to determine the presence and direction of the bias, as described in the section Conditions for unbiased REML estimators.

Scenarios in which QTL and markers are in LE have been explored theoretically in other studies, with particular emphasis on the effect of the eigen-values of \(\mathbf{G }\) on the likelihood of the (misspecified) genomic model [8, 13]. Although Kumar et al. [8] discussed the relevance of the difference between the eigen-vectors of \(\mathbf{G }\) and \(\mathbf{G }_{\mathrm{Q}}\), they did not relate this difference expressed by the correlation between the eigen-vectors, implicit in \(\kappa _{\mathrm{i}}\), to the portion of variance explained by each \(\lambda _{\mathrm{i}}\) and \(\lambda _{\mathrm{Qk}}\). The performance of genomic models in estimating heritability was assessed mainly by describing the sensitivity of the likelihood to changes in the eigen-values, under the Mar\(\breve{c}\)enko–Pastur distribution [18]. Indeed, the likelihood depends sensitively on all the eigen-values, but evaluating the likelihood given a change in each eigen-value separately is not as informative as evaluating the REML equation given a change in the distribution of all eigen-values simultaneously.

Assuming that the number of individuals (\(\hbox {n}\)) is sufficiently large, and that the numbers of QTL and markers (\(\hbox {q}\) and \(\hbox {m}\)) both increase with increasing density of SNP data such that \(\lim _{\mathrm{n},\mathrm{q},\mathrm{m}\rightarrow \infty }\mathrm{n}/(\mathrm{q}+\mathrm{m})=\hbox {c}_{1} \in (0,1)\) and \(\lim _{\mathrm{q},\mathrm{m}\rightarrow \infty }\mathrm{q}/\mathrm{m}=\mathrm{c}_{2} \in (0,1)\), Jiang et al. [13] used the Mar\(\breve{c}\)enko–Pastur distribution of the eigen-values to evaluate the limiting behavior of the term \(\mathbf{P }\mathbf{G }\mathbf{P }/\hbox {tr}\left( \mathbf{P }\mathbf{G }\right) - \mathbf{P }^{2}/\hbox {tr}\left( \mathbf{P }\right)\) in the REML Eq. (3) of the genomic model, proving that under these assumptions, \(\hat{\hbox {h}}^{2}_{\mathrm{gen}}\) is unbiased and consistent. Although Jiang et al. [13] stated that \(\hat{\hbox {h}}^{2}_{\mathrm{gen}}\) still remains unbiased and consistent when QTL and markers are in LD, we have raised a particular concern about inferences of \(\hbox {h}^{2}\) in the case when MAF(QTL) \(\ne\) MAF(markers), such as when QTL are rare mutations, for which the estimates of heritability are empirically known to be biased [1, 4, 8, 19]. Two remarks about the approach used in [13] must be made at this stage. First, the limiting behavior of \(\mathbf{P }\mathbf{G }\mathbf{P }/\hbox {tr}\left( \mathbf{P }\mathbf{G }\right) - \mathbf{P }^{2}/\hbox {tr}\left( \mathbf{P }\right)\) relies strongly on the Mar\(\breve{c}\)enko–Pastur distribution, which holds only when the SNPs are in complete LE (and thus individuals are unrelated, since family relationships would induce LD). The second remark is that LD between markers and QTL and the distribution of their MAFs may alter correlations between phenotypes and genotypes, implied in \(\mathbf{y }'\mathbf{P }\mathbf{G }\mathbf{P }\mathbf{y }\) and \(\mathbf{y }'\mathbf{P }\mathbf{y }\), and this was not evaluated by Jiang et al. [13], whereas in our study the correlations between phenotypes and genotypes are implied in \(\kappa _{\mathrm{i}}\) (see "Appendix 2 and 3"). Using our approach and the result that \(\lim _{\mathrm{q},\mathrm{m}\rightarrow \infty }\mathbf{G } = \lim _{\mathrm{q}\rightarrow \infty }\mathbf{G }_{\mathrm{Q}} = \mathbf{A }\) [17], we demonstrated in section Genomic models for scenarios of interest in quantitative genetics, that the conclusions from Jiang et al. [13] about \(\hat{\hbox {h}}^{2}_{\mathrm{gen}}\) are mathematically true, even when QTL and markers are in LD. However, in populations of unrelated individuals with QTL as rare mutations, an unrealistically large number (tending to infinity) of QTL would be necessary to ensure \(\lim _{\mathrm{q}\rightarrow \infty }\mathbf{G }_{\mathrm{Q}}=\mathbf{A }\), and when we assume that the number of QTL is finite, \(\lim _{\mathrm{m}\rightarrow \infty }\mathbf{G } \ne \mathbf{G }_{\mathrm{Q}}\).

With the knowledge that the method proposed by Yang et al. [1] may yield a biased \(\hat{\hbox {h}}^{2}_{\mathrm{gen}}\) under some scenarios, several approaches have been proposed for solving the problem of biased estimates, exploring different genetic architectures of the trait and population structure. The theory presented in our study can be adapted to provide an explanation for the success of those approaches, and we offer an overview on how that can be done for five approaches.

First, addressing the different MAF of the SNPs, Speed et al. [4] suggests a weighting of the SNPs by their MAF, which would give the same weighting to terms involving \(\gamma\) in the non-observable REML functions, owing to the change in the definition of the heritability. A G-matrix obtained using the SNPs suitably weighted according to the scenario will improve the relationship between \(\kappa _{\mathrm{i}}\) and \(\lambda _{\mathrm{i}}\), reducing the bias of \(\hat{\hbox {h}}^{2}_{\mathrm{gen}}\). The definition of a suitable weight must be explored further, and the theory provided in this study provides a tool that can be used for such investigations. The genomic model in (1) can be generalized to assume different weights to the SNPs by simply changing the assumption for the distribution of \(\mathbf{b }\) to \(\mathbf{b }\sim \hbox {N}(\mathbf{0 },\mathbf{D }\sigma ^{2}_{\mathrm{b}})\), such that \(\mathbf{D }\) is a diagonal matrix of weights. The G-matrix must then be defined as \(\mathbf{G } = \mathbf{WDW'}/\hbox {tr}(\mathbf{D })\) to ensure that properties indicated in “Appendix 1” and theoretical evaluations in the Results section hold.

Second, and with the same objective of distinguishing SNPs by their MAF, Yang et al. [19] suggested a method that is analogous to that proposed in [1] by fitting the model with several genomic variance components, each of them relative to groups of SNPs with MAF values within the same range. In this approach, assuming the components to be independent, we can obtain a non-observable REML equation for each genomic variance component to be estimated, and the analysis then follows exactly as we have presented here. This approach also generalizes the genomic model in Eq. (1) by assuming \(\mathbf{b }\sim \hbox {N}(\mathbf{0 },\mathbf{D }\sigma ^{2}_{\mathrm{b}})\), as described in the previous paragraph, with the difference that the values on the diagonal of \(\mathbf{D }\) are to be estimated. Indeed the method is capable of removing the bias of \(\hat{\hbox {h}}^{2}_{\mathrm{gen}}\). However, as observed by [19], the increase in the number of variance components may increase the variance of \(\hat{\hbox {h}}^{2}_{\mathrm{gen}}\), especially when independence between the components cannot be ensured, and, depending on the scenario evaluated, the estimates may be less reliable than those obtained by fitting a single genomic variance component.

Edwards et al. [20] suggested a third approach, which fits genomic models by including a variance component for SNPs grouped based on genomic features (i.e. genes and their gene ontology) to the model, which requires the use of prior information about the genomic data. The results of their study showed that a relevant amount of variance was attributed to the significant feature, and \(\hat{\hbox {h}}^{2}_{\mathrm{gen}}\) in this approach can be evaluated with a non-observable REML equation for each component (SNPs grouped based on genomic features and SNPs not grouped based on genomic features), just as we suggest for evaluating \(\hat{\hbox {h}}^{2}_{\mathrm{gen}}\) obtained with the approach proposed by [19]. It is important to note that the genomic feature model works better than a single component when the feature component is enriched for the QTL; otherwise, this model can also lead to problems in the estimation of \(\hbox {h}^{2}\). The advantage of grouping SNPs based on genomic features instead of MAF is that there are fewer variance components, reducing the variance of \(\hat{\hbox {h}}^{2}_{\mathrm{gen}}\). The use of prior genomic information to fit genomic models with mutiple genomic variance components was previously suggested by Speed and Balding [21], who included a dynamic procedure to define a suitable partition of SNPs.

A fourth approach considers the situation where prior genomic feature information is absent and Bayesian mixture models, such as BayesB [22] or BayesR [23], are reasonable solutions for assigning different distributions to groups of SNP effects [20]. Again, non-observable REML equations for each component can be used to evaluate \(\hat{\hbox {h}}^{2}_{\mathrm{gen}}\) based on the assumptions posed by the Bayesian mixture models, and the assumptions can be tuned using the information from our suggested theoretical analysis.

A fifth approach includes related individuals to study populations, which can greatly reduce the bias of \(\hat{\hbox {h}}^{2}_{\mathrm{gen}}\), when it exists. This is because rare QTL induce genetic relationships between individuals. In populations of nominally unrelated individuals the common markers will disguise those induced genetic relationships at the QTL (\(\lim _{\mathrm{m}\rightarrow \infty }\mathbf{G }_{\mathrm{QM}}=\mathbf{A }=\mathbf{I }_{\mathrm{n}}\ne \mathbf{G }_{\mathrm{Q}}\)), drastically reducing the correlations between eigen-vectors \((1/\hbox {n})\mathbf{U }_{\mathrm{i}}'\mathbf{U }_{\mathrm{Qk}}\) and resulting in \(\kappa _{\mathrm{i}} < \lambda _{\mathrm{i}}\). Conversely, in populations of related individuals, assuming no selection, the induced genetic relationships at the QTL will better reflect the kinship matrix (\(\mathbf{G }_{\mathrm{Q}}\approx \mathbf{A }\)), improving the correlation between eigen-vectors \((1/\mathrm{n})\mathbf{U }_{\mathrm{i}}'\mathbf{U }_{\mathrm{Qk}}\) and resulting in less biased \(\hat{\hbox {h}}^{2}_{\mathrm{gen}}\).

A last point to be raised in this discussion, concerns the direction of the bias of \(\hat{\hbox {h}}^{2}_{\mathrm{gen}}\). We show in the section Genomic models for scenarios of interest in quantitative genetics, with our theoretical analysis, that when we consider a single genomic component in the model to estimate heritability (assuming SNP effects are all i.i.d.), when it exists, the bias of the estimator will tend to be downward. An exception is observed when \(\hbox {f}_{\mathrm{MAF}_{_{\mathrm{QTL}}}} \ne \hbox {f}_{\mathrm{MAF}_{_{\mathrm{markers}}}}\) and the number of QTL is smaller than the number of individuals. If genomic models are fitted including the QTL and markers, such that the total number of SNPs in the genomic data is smaller than the number of individuals, the heritability estimator will present an upwards bias. This fact is related to \(\hbox {rank}(\mathbf{G }_{\mathrm{QM}})<\hbox {n}-1\), as eigen-values that are zero are overestimated by \(\kappa _{\mathrm{i}}\). Increasing the number of markers will make \(\hbox {rank}(\mathbf{G }_{\mathrm{QM}})\) approach \(\hbox {n}-1\), forcing only the last eigen-value to zero and the overestimation will no longer be present (see Additional file 7).

When multiple genomic components are considered in the model, overestimation of heritability may be observed even when the total number of SNPs is larger than the number of individuals. When different variance parameters are estimated for each component, the multiple components approach is explicit, and overestimation of heritability will relate to components with a rank lower than \(\hbox {n}-1\). When a single variance parameter is estimated for SNPs associated with pre-determined weights, the multiple component approach is implicit, and overestimation of heritability as observed in [24] may relate to \(\kappa _{\mathrm{i}}\) overestimating the largest \(\lambda _{\mathrm{i}}\). Associating pre-determined weights to the SNPs in the genomic model may inflate correlations \(\mathbf{U }_{\mathrm{i}}'\mathbf{U }_{\mathrm{Qk}}\) for eigen-vectors \(\mathbf{U }_{\mathrm{i}}\) that are associated with the highest eigen-values, resulting in \(\kappa _{\mathrm{i}} > \lambda _{\mathrm{i}}\), while correlations \(\mathbf{U }_{\mathrm{i}}'\mathbf{U }_{\mathrm{Qk}}\) for eigen-vectors \(\mathbf{U }_{\mathrm{i}}\) that are associated with the lowest eigen-values will be deflated, resulting in \(\kappa _{\mathrm{i}} < \lambda _{\mathrm{i}}\) (see Additional file 8).

Conclusions

In a Gaussian setup, the likelihood of a genomic model is misspecified with respect to that of the true model that conceptually generated the data. When used for inferring variance parameters the misspecified likelihood may yield biased estimators of those parameters, and inferences must be interpreted with caution. Misspecification of the likelihood is due to the difference between the covariance structures of the data specified by the misspecified and true models (\(\mathbf{G }\) and \(\mathbf{G }_{\mathrm{Q}}\)), and our study shows that the bias of REML estimators of variance parameters is linked to the relationship between the eigen-values and eigen-vectors of both models, occurring when \(\kappa _{\mathrm{i}} = \sum _{\mathrm{k}=1}^{\mathrm{n}}(\mathbf{U }_ {\mathrm{i}}'\mathbf{U }_{\mathrm{Qk}})^{2}\lambda _{\mathrm{Qk}} \ne \lambda _{\mathrm{i}}\). Moreover, comparison of \(\kappa _{\mathrm{i}}\) with the eigen-value \(\lambda _{\mathrm{i}}\) not only identifies the potential bias of variance components estimators, but is also a very informative method for comparing \(\mathbf{G }\) with \(\mathbf{G }_{\mathrm{Q}}\). The eigen-vectors reflect how each individual contributes to the proportion of variance explained by the components of \(\mathbf{G }\) and \(\mathbf{G }_{\mathrm{Q}}\) (defined by \(\lambda _{\mathrm{i}}\) and \(\lambda _{\mathrm{Qk}}\)), and if the contributions are similar, then \(\kappa _{\mathrm{i}} \approx \lambda _{\mathrm{i}}\), meaning that the covariance structures of the data specified by the genomic and the true models are equivalent. In mathematical terms, \(\kappa _{\mathrm{i}} = \lambda _{\mathrm{i}}\) is the same as stating that \(\varvec{\Lambda }_{\mathrm{Q}}\) and \(\varvec{\Lambda }\) are similar matrices in the general linear group of \(\mathbf{U }_{\mathrm{Q}}'\mathbf{U }\). We have evaluated the similarity of \(\varvec{\Lambda }_{\mathrm{Q}}\) and \(\varvec{\Lambda }\) in a set of scenarios of interest to quantitative genetics studies, identifying those for which inferences must be interpreted with caution. Because of the many factors related to the genetic architecture that influence the similarity of \(\varvec{\Lambda }_{\mathrm{Q}}\) and \(\varvec{\Lambda }\) (LD between QTL and markers, presence and number of QTL in the SNP data, MAF, relationship between individuals) and the lack of information about the QTL, quantifying the bias (when it exists) of the estimators of variance parameters, is not trivial. Although the quantification of this bias is complex, we can determine that in genomic models that consider a single genomic component to estimate heritability (assuming SNP effects are all i.i.d.), the bias of the estimator will tend to be downward, when it exists.

Declarations

Authors' contributions

BCDC conceived the study, performed the calculus, simulations and analysis, and wrote the manuscript. ACS helped to conceive the study, helped with the analysis, and contributed to the manuscript. PS conceived the study, helped with the calculus and the analysis, and contributed to the manuscript. All authors read and approved the final manuscript.

Acknowledgements

The authors thank Daniel Sorensen, who provided fruitful discussions and insights relevant to this work, and Gustavo de los Campos, who contributed with the initial ideas for the simulations. This research was supported and funded by the Center for Genomic Selection in Animals and Plants funded by the Danish Council for Strategic Research (contract number 12-132452).

Competing interests

The authors declare that they have no competing interests.

Consent for publication

Not applicable.

Ethics approval and consent to participate

Not applicable.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Authors’ Affiliations

(1)
Center for Quantitative Genetics and Genomics, Department of Molecular Biology and Genetics, Aarhus University, Blichers Allé 20, Postboks 50, 8830 Tjele, Denmark

References

  1. Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, et al. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 2010;42:565–9.View ArticlePubMedPubMed CentralGoogle Scholar
  2. Yang J, Lee SH, Goddard ME, Visscher PM. GCTA: a tool for genome-wide complex trait analysis. Am J Hum Genet. 2011;88:76–82.View ArticlePubMedPubMed CentralGoogle Scholar
  3. Golan D, Rosset S. Accurate estimation of heritability in genome wide studies using random effects models. Bioinformatics. 2011;27:i317–23.View ArticlePubMedPubMed CentralGoogle Scholar
  4. Speed D, Hemani G, Johnson MR, Balding DJ. Improved heritability estimation from genome-wide SNPs. Am J Hum Genet. 2012;91:1011–21.View ArticlePubMedPubMed CentralGoogle Scholar
  5. Patterson HD, Thompson R. Recovery of inter-block information when block sizes are unequal. Biometrika. 1971;58:545–54.View ArticleGoogle Scholar
  6. Jiang J. REML estimation: asymptotic behavior and related topics. Ann Stat. 1996;24:255–86.View ArticleGoogle Scholar
  7. Hill WG, Weir BS. Variation in actual relationship as a consequence of Mendelian sampling and linkage. Genet Res (Camb.). 2011;93:47–64.View ArticleGoogle Scholar
  8. Kumar SK, Feldman MW, Rehkopf DH, Tuljapurkar S. Limitations of GCTA as a solution to the missing heritability problem. Proc Natl Acad Sci USA. 2016;113:E61-70.Google Scholar
  9. Yang J, Lee H, Wray NR, Goddard ME, Vissher PM. Commentary on: Limitations of GCTA as a solution to the missing heritability problem. bioRXiv eprint. 2016. https://doi.org/10.1101/036574.
  10. Kumar SK, Feldman MW, Rehkopf DH, Tuljapurkar S. Response to commentary on: Limitations of GCTA as a solution to the missing heritability problem. bioRXiv eprint. 2016. https://doi.org/10.1101/039594.
  11. de los Campos G, Sorensen D. A commentary on pitfalls of predicting complex traits from SNPs. Nat Rev Genet. 2013;14:894.View ArticlePubMedPubMed CentralGoogle Scholar
  12. de los Campos G, Sorensen D, Gianola D. Genomic heritability: what is it? PLoS Genet. 2015;11:e1005048.View ArticlePubMed CentralGoogle Scholar
  13. Jiang J, Li C, Paul D, Yang C, Zhao H. High-dimensional genome-wide association study and misspecified mixed model analysis. arXiv eprint. 2014. arXiv:1404.2355.
  14. VanRaden PM. Efficient methods to compute genomic predictions. J Dairy Sci. 2008;91:4414–23.View ArticlePubMedGoogle Scholar
  15. Harville DA. Maximum likelihood approaches to variance component estimation and to related problems. J Am Stat Assoc. 1977;72:320–38.View ArticleGoogle Scholar
  16. Jiang J. Linear and generalized linear mixed models and their applications. New York: Springer; 2007.Google Scholar
  17. Ober U, Ayroles JF, Stone EA, Richards S, Zhu D, Gibbs RA, et al. Using whole-genome sequence data to predict quantitative trait phenotypes in Drosophila melanogaster. PLoS Genet. 2012;8:e1002685.View ArticlePubMedPubMed CentralGoogle Scholar
  18. Mar $$\breve{c}$$ c ˘ enko VA, Pastur LA. Distribution of eigenvalues for some sets of random matrices. Math USSR-Sbornik. 1967;1:457–83.Google Scholar
  19. Yang J, Bakshi A, Zhu Z, Hemani G, Vinkhuyzen AAE, Lee SH, et al. Genetic variance estimation with imputed variants finds negligible missing heritability for human height and body mass index. Nat Genet. 2015;47:1114–20.View ArticlePubMedPubMed CentralGoogle Scholar
  20. Edwards SM, Sørensen IF, Sarup P, Mackay TFC, Sørensen P. Genomic prediction for quantitative traits is improved by mapping variants to gene ontology categories in drosophila melanogaster. Genetics. 2016;203:1871–83.View ArticlePubMedPubMed CentralGoogle Scholar
  21. Speed D, Balding DJ. MultiBLUP: improved SNP-based prediction for complex traits. Genome Res. 2014;24:1550–7.View ArticlePubMedPubMed CentralGoogle Scholar
  22. Meuwissen THE, Solberg TR, Shepherd R, Woolliams JA. A fast algorithm for Bayes B type of prediction of genome-wide estimates of genetic value. Genet Sel Evol. 2009;41:2.View ArticlePubMedPubMed CentralGoogle Scholar
  23. Erbe M, Hayes BJ, Matukumalli LK, Goswami S, Bowman PJ, Reich CM, et al. Improving accuracy of genomic predictions within and between dairy cattle breeds with imputed high-density single nucleotide polymorphism panels. J Dairy Sci. 2012;95:4114–29.View ArticlePubMedGoogle Scholar
  24. Speed D, Cai N, The UCLEB Consortium, Johnson MR, Nejentsev S, Balding DJ. Reevaluation of SNP heritability in complex human traits. Nat Genet. 2017;49:986–92.View ArticlePubMedPubMed CentralGoogle Scholar
  25. Bahadur RR. A representation of the joint distribution of responses to \(\varvec {n}\) dichotomous items. In: Solomon H, editor. Studies in item analysis and prediction. Stanford: Stanford University Press; 1961. p. 158–68.Google Scholar
  26. Emrich LJ, Piedmonte MR. A method for generating high-dimensional multivariate binary variables. Am Stat. 1991;45:302–4.Google Scholar
  27. Lee AJ. Generating random binary deviates having fixed marginal distributions and specified degrees of association. Am Stat. 1993;47:209–15.Google Scholar
  28. Gange SJ. Generating multivariate categorical variates using the iterative proportional fitting algorithm. Am Stat. 1995;49:134–8.Google Scholar
  29. Park CG, Park T, Shin DW. A simple method for generating correlated binary variates. Am Stat. 1996;50:306–10.Google Scholar
  30. Leisch F, Weingessel A, Hornik K. On the generation of correlated artificial binary data. Working Papers SFB “Adaptive Information Systems and Modelling in Economics and Management Science” 13. SFB Adaptive Information Systems and Modelling in Economics and Management Science. WU Vienna University of Economics and Business, Vienna. 1998.Google Scholar
  31. Meyer C. Recursive numerical evaluation of the cumulative bivariate normal distribution. J Stat Softw. 2013;52:1–14.View ArticleGoogle Scholar

Copyright

© The Author(s) 2018

Advertisement