Using haplotypes for the prediction of allelic identity to fine-map QTL: characterization and properties

Jacquin, Laval; Elsen, Jean-Michel; Gilbert, Hélène

doi:10.1186/1297-9686-46-45

Research
Open access
Published: 14 July 2014

Using haplotypes for the prediction of allelic identity to fine-map QTL: characterization and properties

Laval Jacquin^1,2,3,
Jean-Michel Elsen^1,2,3 &
Hélène Gilbert^1,2,3

Genetics Selection Evolution volume 46, Article number: 45 (2014) Cite this article

1309 Accesses
3 Citations
1 Altmetric
Metrics details

Abstract

Background

Numerous methods have been developed over the last decade to predict allelic identity at unobserved loci between pairs of chromosome segments along the genome. These loci are often unobserved positions tested for the presence of quantitative trait loci (QTL). The main objective of this study was to understand from a theoretical standpoint the relation between linkage disequilibrium (LD) and allelic identity prediction when using haplotypes for fine mapping of QTL. In addition, six allelic identity predictors (AIP) were also compared in this study to determine which one performed best in theory and application.

Results

A criterion based on a simple measure of matrix distance was used to study the relation between LD and allelic identity prediction when using haplotypes. The consistency of this criterion with the accuracy of QTL localization, another criterion commonly used to compare AIP, was evaluated on a set of real chromosomes. For this set of chromosomes, the criterion was consistent with the mapping accuracy of a simulated QTL with either low or high effect. As measured by the matrix distance, the best AIP for QTL mapping were those that best captured LD between a tested position and a QTL. Moreover the matrix distance between a tested position and a QTL was shown to decrease for some AIP when LD increased. However, the matrix distance for AIP with continuous predictions in the [0,1] interval was algebraically proven to decrease less rapidly up to a lower bound with increasing LD in the simplest situations, than the discrete predictor based on identity by state between haplotypes (IBS _hap), for which there was no lower bound. The expected LD between haplotypes at a tested position and alleles at a QTL is a quantity that increases naturally when the tested position gets closer to the QTL. This behavior was demonstrated with pig and unrelated human chromosomes.

Conclusions

When the density of markers is high, and therefore LD between adjacent loci can be assumed to be high, the discrete predictor IBS _hap is recommended since it predicts allele identity correctly when taking LD into account.

Background

Numerous methods have been developed to predict allelic identity at an unobserved locus between pairs of chromosome segments. Such predictions are generally carried out by observing allelic similarities between the pairs of chromosome segments that surround this locus [1–3]. It is assumed that chromosome segments that exhibit more similarities have a higher chance of harboring the same allele(s) at this locus. Many of these methods [1–5] use either directly or implicitly the concept of identity-by-descent (IBD), and therefore predict allelic identity based on allelic likeness. Such predictions of allelic identity can be either continuous or discrete in the [0,1] interval. The matrices that contain these predictions for pairs of chromosome segments, at an unobserved locus, can be used in a statistical procedure to detect association between the locus and some phenotypes of interest. For example, these matrices can be interpreted as being proportional to the covariance matrices of the effect of the locus on phenotypes of interest [1, 4, 6] and therefore play a central role in the statistical analysis of the variability. The similarity between chromosome segments can be measured based on the haplotypes of markers carried by the segments. Indeed, it has been shown that haplotype-based methods have a higher potential to detect trait-marker associations than single-marker methods in some cases [7–16]. Different methods for predicting allelic identity, hereafter named Allelic Identity Predictors (AIP), have been proposed and in this study, we have compared some of these methods i.e.: (1) the probability measure described by Meuwissen and Goddard [1] is the conditional probability of being IBD at an unobserved locus for pairs of haplotypes, given the identical-by-state (IBS) status of alleles spanning that position; (2) the similarity score of Li and Jiang [2] calculates the sum of the number of shared alleles and the length of the longest shared substring that spans an unobserved locus for pairs of haplotypes; (3) the probability model of Browning [3] is based on Variable Length Markov Chains (VLMC) and performs chromosome clustering at a given marker, and in this model, chromosomes that belong to a given cluster are considered as potentially harboring the same unobserved allele(s) locally; and (4) the IBS status of all marker alleles between pairs of haplotypes and (5) the IBS status of single markers alleles, which are the simplest AIP.

In some association studies, such as those that use random effect models for example, the only input that differs from one AIP to another is the similarity (covariance) matrix built for the tested location. Thus, investigating the properties of similarity matrices is another strategy when comparing AIP, since this comparison is generally based on the accuracy of quantitative trait locus (QTL) localization (e.g. root mean square error). The main objective of the present study was to understand the relation between linkage disequilibrium (LD) and allelic identity prediction when using haplotypes, by identifying the properties of similarity matrices in the neighborhood of a QTL and at the QTL. This was performed using a simple distance measure between these matrices and the similarity matrix at the QTL based on the observed allelic identity (IBS). This distance measure was expressed analytically in terms of LD coefficients. There has been an increasing interest in taking advantage of LD for fine-mapping of complex disease genes [17–20] and QTL [21–24]. Nevertheless, to the best of our knowledge, no study has yet used analytical methods to compare AIP in relation to LD. Here, we define a new criterion based on the chosen matrix distance measure, which allows discrimination between the six AIP. We evaluated the consistency of this criterion with the mapping accuracy of the six AIP for a QTL simulated according to different LD patterns and populations.

The simulations were based on two population types, a set of human chromosomes and a set of porcine chromosomes, with different LD and density patterns. In each case, the QTL was a hidden SNP that simulated a biallelic QTL, as previously proposed [4, 8, 23, 25]. Hence, the present study was framed around the common idea that there is a favorable allele at the QTL, which affects an observed trait. In this context, the aim of AIP is to predict, at the QTL, whether both chromosomal segments of any pair harbor the same unobserved favorable allele or not, which is the same as predicting the IBS or non-IBS state of the alleles. A new (6) unreferenced AIP, named trained predictor and abbreviated as TP, is also compared in this paper. This new predictor, based on a matrix distance concept similar to the one used to discriminate between the AIP, performs least squares prediction in a global fashion over chromosomes. The purpose of this predictor was to investigate the behavior of an AIP which performs global training over the chromosomes in relation to local patterns of LD.

Methods

Matrix distance comparison

Let $I = {i_{1}, \dots, i_{r}}$ be a set of positions that are tested for the presence of a QTL on 2n phased homologous chromosomes for n diploid individuals. Only one QTL is considered to be in the screened region. In a sliding window approach, each position tested is considered to be the unobserved center of the haplotypes carried by different chromosome segments. Figure 1 shows an example of tested positions for a sliding window of six markers and a QTL located between SNPs 39 and 40.

Let $s_{i, c_{1}, c_{2}}^{P} \in [0, 1]$ be the IBS or IBD prediction of allelic identity, depending on an AIP at a tested position $i \in I$ , for a couple (c₁, c₂) of chromosome segments. Note that $s_{i, c_{1}, c_{2}}^{P}$ is calculated according to the observed similarity between the haplotypes carried by c₁ and c₂. Hence, c₁ and c₂ can harbor different unobserved alleles at i even if these segments carry the same haplotype. We define $M^{P, i} = {(s_{i, c_{1}, c_{2}}^{P})}_{1 \leq c_{1}, c_{2} \leq 2 n}$ as the similarity matrix built from the predictions of allelic identity at locus i for . Matrix $M^{P, i}$ can be used in a statistical procedure to detect association between i and some phenotype of interest.

Let $u_{c_{1}, c_{2}}^{QTL} \in {0, 1}$ be the true allelic identity observed at the QTL (IBS) for a couple (c₁, c₂) of chromosome segments. On the basis of known alleles at the QTL, the similarity $M^{QTL} = {(u_{c_{1}, c_{2}}^{QTL})}_{1 \leq c_{1}, c_{2} \leq 2 n}$ can be built with the real allelic identities. Note that M^QTL is simply a similarity matrix that describes the IBS or non-IBS state of alleles at the QTL.

Let d₁ be a normalized distance measure between $M^{P, i}$ and M^QTL induced by the entrywise 1-norm, which is the sum of the absolute differences between the elements of two matrices or vectors, i.e.

\begin{align} d_{1} (M^{P, i}, M^{QTL}) & = \frac{1}{4 n^{2}} ∥ M^{P, i} - M^{QTL} ∥_{1} \\ = \frac{1}{4 n^{2}} \sum_{c_{1} = 1}^{2 n} \sum_{c_{2} = 1}^{2 n} | s_{i, c_{1}, c_{2}}^{P} - u_{c_{1}, c_{2}}^{QTL} | \end{align}

Note that some AIP have continuous prediction errors $| s_{i, c_{1}, c_{2}}^{P} - u_{c_{1}, c_{2}}^{QTL} |$ in [ 0,1], while for others, prediction errors are limited to the discrete set {0,1}. Measure d₁ is therefore more appropriate than the euclidean metric d₂, for example, because it does not shrink continuous prediction errors in [ 0,1]. Let θ_QTL be the position of the QTL. When a predictor performs well, $d_{1} (M^{P, i}, M^{QTL})$ should be minimum at the tested position closest to θ_QTL. Hence $d_{1} (M^{P, i}, M^{QTL})$ can be used to compare different AIP for a set of tested positions. Note that $d_{1} (M^{P, i}, M^{QTL})$ can also be expressed as [see Additional file 1]:

\begin{align} d_{1} (M^{P, i}, M^{QTL}) = & \sum_{p = 1}^{K} f_{i, h_{p}} \sum_{q = 1}^{K} f_{i, h_{q}} [p_{i, h_{p}, h_{q}}^{QTL} \times (1 - s_{i, h_{p}, h_{q}}^{P}) \\ + (1 - p_{i, h_{p}, h_{q}}^{QTL}) s_{i, h_{p}, h_{q}}^{P}] \end{align}

(1)

where K=2^t is the number of possible observed haplotypes at position i, for a sliding window of t markers. $f_{i, h_{p}}$ and $f_{i, h_{q}}$ are the frequencies of haplotypes h_p and h_q at position i, respectively. Note that some haplotypes among the K possible haplotypes may not be observed in practice. Hence, the corresponding frequencies for these haplotypes will naturally be equal to 0 in expression (1). $p_{i, h_{p}, h_{q}}^{QTL}$ is the proportion of identical alleles shared at the QTL by the pairs of chromosomes that carry h_p and h_q, at position i, and $s_{i, h_{p}, h_{q}}^{P}$ is the prediction of allelic identity at locus i for the predictor and a pair (h_p, h_q) of haplotypes. Expression (1) will be used subsequently to express d₁ as a function of LD coefficients, and to understand the trained predictor defined in this paper.

Measures of AIP evaluated

The AIP evaluated in this study were IBS _m (IBS status of alleles at single markers), IBS _hap (IBS status of all marker alleles between pairs of haplotypes), P(IBD) (IBD probability of Meuwissen and Goddard [1]), Score (similarity score of Li and Jiang [2]), Beagle (cluster-based probability model of Browning [3]) and TP (the trained predictor). Note that the tested positions coincide with marker positions for IBS _m and Beagle. These positions are therefore different from those in Figure 1. The tested positions for IBS _hap, P(IBD), Score and TP are defined as presented in Figure 1.

IBS _m gives an allelic identity prediction of 1 if a pair of chromosome segments carries the same allele at a tested marker and 0 otherwise. With IBS _hap the prediction of allelic identity is equal to 1 if both chromosome segments of a pair carry the same marker alleles for haplotypes that span the tested position i, and 0 otherwise. P(IBD) is an estimation of the conditional probability of being IBD at i for a pair of chromosome segments, given the IBS status of marker alleles of the haplotypes spanning i. This measure of probability is based on a coalescence process and models recombination between markers. The P(IBD) function was applied here with an ancestral effective population size of 100 and 100 generations from the base population, as in Meuwissen and Goddard [1]. Meuwissen and Goddard [23] showed that violations of these assumptions, i.e. that alter the effective population size and the number of generations since the base population, had no effect on the mapping accuracy of their methods [23, 26]. For a pair of haplotypes carried by two chromosome segments, Score is the summation of the number of IBS alleles and the length of the longest common substring of IBS alleles that span i. Score integrates weight functions that decrease the significance of markers based on their genetic distance from i. As proposed in Li and Jiang [2], these functions were chosen to be one minus the distance, in centiMorgan (cM), of each marker from i on the haplotypes within the sliding window (as presented in Figure 1). Beagle clusters chromosomes or haplotypes locally at a tested marker if they have similar probabilities of carrying the same alleles at following adjacent markers. The Beagle probability model was built at each marker by running the Beagle software (Beagle 3.3.2; http://faculty.washington.edu/browning/beagle/beagle.html, Browning [3], Browning and Browning [12]) and fitting all the chromosome markers at one time. The Beagle probability model needs two parameters (scale and shift) to be built. These parameters were first estimated from the data using a cross-validation procedure. However, the mapping results were less accurate than those obtained with the default values for these parameters that were proposed by the authors. According to the authors, the default values have performed well in simulation studies and real data analyses [12, 27]. Hence the default values scale = 4.0 and shift = 0.2 [12] were used.

The trained predictor (TP), built by least squares prediction, is based on the idea that pairs of haplotypes that exhibit the same amount of allelic similarity should have the same probability of harboring identical alleles, regardless of the tested positions they span. Estimates for ${(s_{i, h_{p}, h_{q}}^{TP})}_{(p, q) \in {1, .., K}^{2}}$ can be obtained as follows. Let $J = {j_{1}, \dots, j_{T}}$ be a set of observed SNPs on chromosomes, which are called target SNPs. Each target SNP j is defined as the middle marker of a sliding window of t+1 loci, where t is the number of observed flanking markers used to predict allelic identity at the target SNP. Let $u_{j, c_{1}, c_{2}} \in {0, 1}$ be the real allele identity at j for (c₁, c₂) and let $E^{TP}$ be the mean squared prediction errors over for TP, i.e.

\begin{align} E^{TP} & = \frac{1}{T} \sum_{j = j_{1}}^{j_{T}} [\frac{1}{4 n^{2}} \sum_{c_{1} = 1}^{2 n} \sum_{c_{2} = 1}^{2 n} {(s_{c_{1}, c_{2}}^{TP} - u_{j, c_{1}, c_{2}})}^{2}] \\ = \frac{1}{T} \sum_{j = j_{1}}^{j_{T}} [d_{2} (M^{TP, j}, M^{j})] \\ = \frac{1}{T} \sum_{j = j_{1}}^{j_{T}} [\sum_{p = 1}^{K} f_{j, h_{p}} \sum_{q = 1}^{K} f_{j, h_{q}} [p_{j, h_{p}, h_{q}} {(s_{h_{p}, h_{q}}^{TP} - 1)}^{2} \\ + (1 - p_{j, h_{p}, h_{q}}) {(s_{h_{p}, h_{q}}^{TP} - 0)}^{2}]] \end{align}

Note that the expression of the normalized squared euclidean distance, d₂, in terms of frequencies and proportions is analogous to that of d₁ in (1).

Indeed $f_{j, h_{p}}$ , $f_{j, h_{q}}$ and $p_{j, h_{p}, h_{q}}$ at locus j are defined as in (1). Estimates for ${(s_{i, h_{p}, h_{q}}^{TP})}_{(p, q) \in {1, .., K}^{2}}$ are obtained by differentiating $E^{TP}$ with respect to $s_{h_{p}, h_{q}}^{TP}$ , i.e.

\frac{\partial E^{TP}}{\partial s_{h_{p}, h_{q}}^{TP}} = 0 \Leftrightarrow ŝ_{h_{p}, h_{q}}^{TP} = \frac{\sum_{j = j_{1}}^{j_{T}} f_{j, h_{p}} f_{j, h_{q}} p_{j, h_{p}, h_{q}}}{\sum_{j = j_{1}}^{j_{T}} f_{j, h_{p}} f_{j, h_{q}}}

Note that the second derivative of $E^{TP}$ with respect to $s_{h_{p}, h_{q}}^{TP}$ is positive since it is a sum of frequencies. This implies that $E^{TP}$ reaches a minimum for the set of estimates ${(ŝ_{h_{p}, h_{q}}^{TP})}_{(p, q) \in {1, .., K}^{2}}$ , since $E^{TP}$ is a sum of convex functions of each $s_{h_{p}, h_{q}}^{TP}$ . Hence, TP associates $ŝ_{h_{p}, h_{q}}^{TP}$ to any observed couple (h_p, h_q) at any tested position $i \in I$ . The observed target SNPs ( $j \in J$ ) are used to estimate the predictions of allelic identity for TP and should not be confused with the unobserved tested positions ( $i \in I$ ).

Statistical models, test statistic and relative efficiency

Mixed models

The following mixed models were used to test for the presence of a QTL at a given position $i \in I$ for all AIP:

\{\begin{array}{l} Y = X β + Z_{h} h + Z_{u} u + ε & (H_{1}) \\ Y = X β + Z_{u} u + ε & (H_{0}) \end{array}

where β is a fixed effect, which is the overall mean, and is a vector of n ones. Vector u represents the random polygenic effects due to relationships among individuals, i.e. $u \sim N_{n} (0, A σ_{u}^{2})$ where A is the additive relationship matrix built from the pedigree [28, 29]. Z_h and Z_u are design matrices that link random effects to individuals and ε is the vector of homoscedastic error terms, i.e. $ε \sim N_{n} (0, I_{n} σ_{ε}^{2})$ .

In the model corresponding to (H₁), h is a vector of random effects of haplotypes at position i, i.e. $h \sim N_{κ} (0, H^{P, i} σ_{h}^{2})$ , where κ (κ≤K) corresponds to the number of observed haplotypes, or alleles, at position i. Note that h has the same dimension κ for all AIP except for IBS _m and Beagle. The tested positions coincide with marker positions for these two predictors. At a tested marker i, κ=2 for IBS _m and κ is equal to the number of local clusters for Beagle. Therefore, depending on the predictor , $H^{P, i}$ is a similarity matrix based on either distinct observed haplotypes (e.g. Score) or distinct clusters (e.g. $P =$ Beagle). Note that $H^{P, i}$ and $M^{P, i}$ are equivalent sources of data contingent upon the list of haplotypes, or distinct local clusters, for the chromosome segments at any tested position. Indeed, depending on , one can build $M^{P, i}$ from $H^{P, i}$ in one of the two following ways. (1) $M_{c_{1}, c_{2}}^{P, i} = H_{h (c_{1}), h (c_{2})}^{P, i}$ , where h(c₁) and h(c₂) are the haplotype numbers carried by chromosomes c₁ and c₂, respectively or (2) $M_{c_{1}, c_{2}}^{P, i} = H_{C (c_{1}), C (c_{2})}^{P, i}$ , where $C (c_{1})$ and $C (c_{2})$ are the cluster numbers to which chromosomes c₁ and c₂ belong, respectively.

RLRT statistic

The Expectation-Maximization algorithm was used for the restricted maximum likelihoods of the mixed models [30–33], to estimate the components β, h, u, ε and the variance terms $σ_{h}^{2}, σ_{u}^{2}, σ_{ε}^{2}$ . Let $λ_{i}^{P}$ be the restricted maximum likelihood ratio test (RLRT) of (H₁) versus (H₀) for position i, i.e.

λ_{i}^{P} = - 2 ln (\frac{L_{REML} (H_{0})}{L_{REML}^{P} (H_{1})})

We defined $θ_{m.a.}^{P}$ as the estimated position of a QTL for a predictor , i.e.

θ_{m.a.}^{P} = \underset{i \in I}{argmax} \{{\hat{λ}}_{i}^{P}\}

Relative efficiency

To compare the predictive ability of the different predictors in relation to LD, we defined $θ_{r.e.}^{P}$ as the tested position where $d_{1} (M^{P, i}, M^{QTL})$ is minimized for a predictor , i.e.

θ_{r.e.}^{P} = \underset{i \in I}{argmin} \{d_{1} (M^{P, i}, M^{QTL})\}

Consequently, we defined the relative efficiency of a predictor as follows. Predictor is considered to be more efficient than a predictor $P^{'}$ if

\{\begin{array}{l} | θ_{r.e.}^{P} - θ_{QTL} | < | θ_{r.e.}^{P^{'}} - θ_{QTL} | & (a) \\ d_{1} (M^{P, θ_{r.e.}^{P}}, M^{QTL}) < d_{1} (M^{P^{'}, θ_{r.e.}^{P^{'}}}, M^{QTL}) & (b) \end{array}

where |.| is the absolute value. When $θ_{r.e.}^{P}$ was not unique, the mean of the different argmins was retained as $θ_{r.e.}^{P}$ . Inequality (a) states that the tested position associated with the best prediction, of the allele identity at the QTL, is closer to the QTL for than that for $P^{'}$ . Inequality (b) states that the true allelic identity at the QTL is better predicted by at $θ_{r.e.}^{P}$ than by $P^{'}$ at $θ_{r.e.}^{P^{'}}$ .

Comparison criteria

N simulations (w=1,.., N) were performed to evaluate the mapping accuracy and the relative efficiency of the different AIP in different situations.

Mapping accuracy

The mapping accuracy of the simulated QTL was evaluated for each AIP with the root mean square error (RMSE):

{RMSE}^{m.a.} = \sqrt{\frac{1}{N} \sum_{w = 1}^{N} {(θ_{m.a.}^{P, w} - θ_{QTL})}^{2}}

Relative efficiency

The relative efficiency of each AIP was evaluated by considering the three following quantities:

\{\begin{array}{l} {RMSE}^{r.e.} = \sqrt{\frac{1}{N} \sum_{w = 1}^{N} {(θ_{r.e.}^{P, w} - θ_{QTL})}^{2}} \\ {\hat{E}}^{r.e.} = \frac{1}{N} \sum_{w = 1}^{N} d_{1} (M^{P, θ_{r.e.}^{P, w}}, M^{QTL, w}) \\ {\hat{σ}}^{r.e.} = \sqrt{\frac{1}{N} \sum_{w = 1}^{N} {(d_{1} (M^{P, θ_{r.e.}^{P, w}}, M^{QTL, w}) - {\hat{E}}^{r.e.})}^{2}} \end{array}

where RMSE ^r.e. and ${\hat{E}}^{r.e.}$ measure conditions (a) and (b), defined in the paragraph on relative efficiency, and ${\hat{σ}}^{r.e.}$ measures the standard deviation of the matrix distance at $θ_{r.e.}^{P}$ .

Data for simulation

A sliding window of t=6 markers was chosen for all analyses, except for IBS _m and Beagle. Windows of six and 12 markers were previously shown to be optimal for QTL mapping accuracy [34, 35] with 60K type SNP chips. Hence, all analyses were done using a sliding window of t=6 markers, except for IBS _m and Beagle, to make comparison between the series of results easier. A set of 90 human chromosomes 21 from unrelated Han Chinese individuals from Beijing (HCB), and a set of 235 swine chromosomes 18 from French Large White (FLW) pigs, were used for LD and matrix distance computations. The 90 HCB chromosomes were genotyped for 16 881 SNPs and are available from the HapMap project website (http://hapmap.ncbi.nlm.nih.gov/downloads/phasing/2005-03_phaseI/full/). The FLW chromosomes were genotyped for 1252 SNPs using the Illumina Porcine 60K+SNP iSelect Beadchip [36]. Only 14 976 SNPs on the HCB chromosomes and 969 SNPs on the FLW chromosomes for which the minor allele frequency was greater than 5% were retained for analysis. The LD and matrix distance computations were conducted for the HCB and the FLW chromosomes. The QTL simulations were only conducted for the FLW chromosomes for which a pedigree was available. The marker density varied across the FLW chromosomes based on physical distance in kilobase. One megabase was considered equivalent to 1 cM for conversion in this study.

Variation of LD between tested positions and a QTL

LD between a tested position i and a QTL was measured using the multiallelic measure R of LD as suggested by [37–39]. Let $Δ_{p} = f_{i, h_{p} a_{1}}^{QTL} - f_{i, h_{p}} f_{a_{1}}$ be the LD coefficient between haplotype h_p at position i and allele a₁ at the QTL. $f_{i, h_{p} a_{1}}^{QTL}$ is the frequency of haplotype h_pa₁ defined by the marker haplotype h_p that spans position i and allele a₁ at the QTL. $f_{a_{1}}$ is the frequency of allele a₁ at the QTL and $f_{i, h_{p}}$ is haplotype h_p frequency at i. Note that $- Δ_{p} = f_{i, h_{p} a_{2}}^{QTL} - f_{i, h_{p}} f_{a_{2}}$ . Hence, for a biallelic QTL, R can be expressed as:

\begin{align} R_{i, QTL} & = \frac{\sum_{p = 1}^{K} \sum_{l = 1}^{2} {(f_{i, h_{p} a_{l}}^{QTL} - f_{i, h_{p}} f_{a_{l}})}^{2}}{(1 - \sum_{p = 1}^{K} f_{i, h_{p}}^{QTL 2}) (1 - \sum_{l = 1}^{2} f_{a_{l}}^{2})} \\ = \frac{\sum_{p = 1}^{K} [{(Δ_{p})}^{2} + {(- Δ_{p})}^{2}]}{(1 - \sum_{p = 1}^{K} f_{i, h_{p}}^{QTL 2}) (1 - \sum_{l = 1}^{2} f_{a_{l}}^{2})} \\ = \frac{2 \sum_{p = 1}^{K} Δ_{p}^{2}}{(1 - \sum_{p = 1}^{K} f_{i, h_{p}}^{QTL 2}) (1 - \sum_{l = 1}^{2} f_{a_{l}}^{2})} = \frac{D_{i, QTL}^{2}}{H_{i} . H_{QTL}} \end{align}

where $H_{i} = 1 - \sum_{p = 1}^{K} f_{i, h_{p}}^{QTL 2}$ and $H_{QTL} = 1 - \sum_{l = 1}^{2} f_{a_{l}}^{2}$ are the Hardy-Weinberg heterozygosities at i and the QTL respectively and $D_{i, QTL}^{2} = 2 \sum_{p = 1}^{K} Δ_{p}^{2}$ . R_{i, QTL} and $D_{i, QTL}^{2}$ are expected to increase as the tested position i gets closer to a QTL. The general behaviors of the normalized measure R_{i, QTL} and the non-normalized measure $D_{i, QTL}^{2}$ were described by computing the LD between the haplotypes at successive distinct positions, using a sliding window, and the alleles of a fixed SNP centered over a region of 81 markers on the chromosomes. The fixed SNP was centered over a region of 76 distinct overlapping sliding windows available within the region of 81 markers. The 76 distinct positions associated to the windows played the role of the tested positions of an association study. The fixed SNP played the role of a biallelic QTL. The computation was repeated for all possible regions of 81 successive markers. Since 969 SNPs were retained on the 235 porcine chromosomes, computation was performed for 889 (969-81 + 1 = 889) regions of 81 markers. The same procedure was performed on the HCB chromosomes, thus leading to 14 896 possible regions for this set of chromosomes. The empirical means of the 889 FLW and the 14 896 HCB LD profiles were then computed to describe the expected behaviors of R_{i, QTL} and $D_{i, QTL}^{2}$ . Another major purpose of these computations was to help the analytical comparison of the AIP and the associated matrix distances, which can be expressed as elements of multiallelic LD (see Results section).

Distributions of matrix distance as a function of multiallelic LD

The distributions of the matrix distance for the six compared AIP, as function of local multiallelic LD, were also evaluated on the FLW and HCB chromosomes. The matrix distances for the six AIP were calculated at 966 and 14 973 possible target SNPs for the FLW and HCB chromosomes, respectively. The target SNPs were defined in exactly the same way as used for the trained predictor (TP). The matrix distances calculated at each window that harbors a target SNP for the six AIP were then plotted against the multiallelic measure R of LD between the haplotypes and the target alleles within the window.

QTL simulation on FLW chromosomes

The 235 FLW chromosomes were included in N=200 gene-drop simulations, in a 25-generation pedigree for the FLW breed, using the LDSO software [40]. The pedigree was composed of 1594 founders, 3373 sires and 7100 dams. The gene-drop procedure was used to generate different realistic genealogy structures between the chromosomes. For each gene-drop the 235 FLW chromosomes were uniformly distributed, with replacement, among the 1594 founders of the pedigree. Hence, the measured LD structure for mapping among descendant individuals at the end of each gene-drop was almost the same as on the 235 FLW chromosomes. It must be emphasized that the use of replicates of only 235 chromosomes to populate 1594 diploid founders, followed by 25 generations of recombinations events, means that the number of different haplotypes at a position is much lower than 3188 (2 × 1594). Thus, the results correspond to medium range population sizes. After each gene-drop, only the chromosomes and phenotypes of the n=485 individuals of generation 25 were retained for subsequent analyzes.

Three distant SNPs were chosen as putative QTL, in order to have different LD levels with the six-marker haplotype that surrounds them on the 235 initial FLW chromosomes. Two different QTL effects were simulated for each of these SNPs, thus leading to six different scenarios. The LD between these SNPs and the observed haplotypes that harbored them was measured using the multiallelic measure R of LD. The LD levels around the three SNPs were equal to 0.52, 0.18 and 0.08, and the lengths of the haplotypes harboring them were equal to 0.09 cM, 0.37 cM and 0.75 cM, respectively. Note that these differences in length were due to the different marker densities in the distinct regions that harbor each putative QTL. The length of the region scanned for QTL mapping around each simulated QTL was approximately 3 cM.

The phenotypes in the pedigree were computed as $y_{i} = \frac{1}{2} (p_{i}^{f} + p_{i}^{m}) + ϕ_{i} + g_{i}^{QTL} + δ$ , where $p_{i}^{f}, p_{i}^{m}$ are normal random polygenic effects of the parents with variance 0.5, ϕ_i is a normal random mendelian sampling effect with variance 0.25 and δ is a normal random environmental effect with variance 1. $g_{i}^{QTL}$ is the QTL genotype effect of individual i. QTL genotype effect was first computed as $g_{i}^{QTL} = 2$ or 0 or −2, if the QTL genotype of individual i was a₁a₁ or a₁a₂ or a₂a₂ respectively. In the same way a second set of simulations was carried out with the QTL genotype effect computed as $g_{i}^{QTL} = 0.5$ or 0 or −0.5. Only the gene-drop simulations for which the minor allele frequency at the QTL was greater or equal to 0.1 were retained. Each simulated QTL was verified for Hardy-Weinberg equilibrium during simulations. Hence, under the standard model, where the dominance effect is equal to 0 as in this study, the first simulated QTL effect explained at most 57% of the phenotypic variance for equal frequencies at the QTL. In the same way, the second simulated QTL effect explained at most 8% of the phenotypic variance.

Results

This section gives theoretical and empirical results that show that, compared to others, some AIP exhibit a better behavior for the decrease of their matrix distance, as defined by expression (1), when the multiallelic LD between a tested position and a QTL increases. In summary, the theoretical results show that expression (1) can be written as a function of the multiallelic LD coefficients of R, and that the decreasing behavior of this function depends on the nature of the AIP (see equations (2), (3), (4), (5) and (6) of this section). The empirical results show that R is expected to be highest when the tested position is closest to the QTL (see Figure 2 of this section). The expectation taken for the multiallelic LD was the empirical mean, which was found to converge for distant regions on the chromosomes. These regions can be assumed to be independent, thus showing an expected behavior for the multiallelic LD. The empirical results also show that the tested position that minimizes the matrix distance is highly correlated with the mapping accuracy of the AIP (see sub-section on mapping accuracy and relative efficiency of this section).

Variation of LD between tested positions and a QTL

Figure 2 shows the empirical means of the 889 FLW and the 14 896 HCB LD profiles for R_{i, QTL} and $D_{i, QTL}^{2}$ .

In Figure 2 the values of R_{i, QTL} and $D_{i, QTL}^{2}$ increase, as expected, as the tested position i moves closer to the QTL. This implies that the sum of the $Δ_{p}^{2}$ terms increases on average as position i moves toward the QTL. The highest expected values for R_{i, QTL} and $D_{i, QTL}^{2}$ , in Figure 2 are reached for the tested position closest to the QTL. Note that the range of values for $D_{i, QTL}^{2}$ , in Figure 2 is smaller than that of R_{i, QTL}. This is due to the lack of a normalization factor for $D_{i, QTL}^{2}$ .

Matrix distance as function of multiallelic LD coefficients

Based on expression (1), $d_{1} (M^{P, i}, M^{QTL})$ can be re-written as [see Additional file 1]:

\begin{align} d_{1} (M^{P, i}, M^{QTL}) & = \sum_{p = 1}^{K} \sum_{q = 1}^{K} [[f_{i, h_{p} a_{1}}^{QTL} f_{i, h_{q} a_{1}}^{QTL} \\ + f_{i, h_{p} a_{2}}^{QTL} f_{i, h_{q} a_{2}}^{QTL}] (1 - s_{i, h_{p}, h_{q}}^{P}) \\ + [f_{i, h_{p} a_{2}}^{QTL} f_{i, h_{q} a_{1}}^{QTL} + f_{i, h_{p} a_{1}}^{QTL} f_{i, h_{q} a_{2}}^{QTL}] s_{i, h_{p}, h_{q}}^{P}] \end{align}

(2)

Replacing the 2K frequencies in expression (2) by the (Δ_p)_1≤p≤K LD coefficient terms and the product of marginal frequencies gives [see Additional file 1]:

\begin{align} d_{1} (M^{P, i}, M^{QTL}) & = \sum_{p = 1}^{K} [4 (\sum_{q \neq p}^{K} s_{i, h_{p}, h_{q}}^{P} - s_{i, h_{p}, h_{p}}^{P}) Δ_{p}^{2} \\ + Ψ_{pq}^{P} (Δ_{l \neq p, q}) Δ_{p} + Φ_{pq}^{P} (Δ_{l \neq p, q})] \\ = ξ^{P} (Δ_{1}, .., Δ_{K}), \end{align}

(3)

where $Ψ_{pq}^{P} (Δ_{l \neq p, q})$ and $Φ_{pq}^{P} (Δ_{l \neq p, q})$ are sums and products of marginal frequencies, allelic identity predictions and LD coefficient terms. The general behavior of $ξ^{P} (Δ_{1}, .., Δ_{K})$ , with respect to R_{i, QTL}, is unspecifiable due to its complexity. For instance, the behavior of $ξ^{P} (Δ_{1}, .., Δ_{K})$ cannot be specified for continuous AIP in [0,1]. However for , $ξ^{P} (Δ_{1}, .., Δ_{K})$ reduces to a sum of strictly concave functions of each LD coefficient [see Additional file 1], i.e.

\begin{align} ξ^{IBS_{hap}} (Δ_{1}, .., Δ_{K}) & = \sum_{p = 1}^{K} [- 4 Δ_{p}^{2} + Ψ_{pq}^{IBS_{hap}} Δ_{p} \\ + Φ_{pq}^{IBS_{hap}}] \\ = \sum_{p = 1}^{K} [Q_{p} (Δ_{p})], \end{align}

(4)

where $Ψ_{pq}^{IBS_{hap}}$ and Φ_{p
q} are sums and products of marginal frequencies that do not depend on LD coefficients. Let $Δ_{p}^{*} = \frac{Ψ_{pq}^{IBS_{hap}}}{8}$ be the critical value for each Q_p function. $ξ^{IBS_{hap}} (Δ_{1}, .., Δ_{K})$ will decrease if the squared or absolute deviation of each Δ_p term from its corresponding $Δ_{p}^{*}$ increases [see Additional file 2]. However note that the squared deviations of all Δ_p terms from their corresponding critical values do not need to increase simultaneously for $ξ^{IBS_{hap}} (Δ_{1}, .., Δ_{K})$ to decrease. For example, some Q_p functions corresponding to haplotypes with low frequencies can be negligible in expression (4). Hence, if $\sum_{p = 1}^{K} {(Δ_{p} - Δ_{p}^{*})}^{2}$ increases sufficiently, $ξ^{IBS_{hap}} (Δ_{1}, .., Δ_{K})$ will decrease. It can be shown that $\sum_{p = 1}^{K} {(Δ_{p} - Δ_{p}^{*})}^{2}$ will increase if $\sum_{p = 1}^{K} {Δ_{p}}^{2}$ increases and that these two quantities share almost the same pattern for their expected values [see Additional file 2]. Thus, according to the $D_{i, QTL}^{2}$ profiles in Figure 2, $ξ^{IBS_{hap}} (Δ_{1}, .., Δ_{K})$ is expected to decrease as position i moves toward the QTL position.

An important result for $ξ^{P} (Δ_{1}, .., Δ_{K})$ is obtained when only two haplotypes are observed among the K possible haplotypes. In this case, $ξ^{P} (Δ_{1}, .., Δ_{K})$ reduces to a real function of one LD coefficient [see Additional file 1], i.e.:

\begin{align} ξ^{P} (Δ_{1}) & = [- 4 s_{i, h_{1}, h_{1}}^{P} - 4 s_{i, h_{2}, h_{2}}^{P} + 8 s_{i, h_{1}, h_{2}}^{P}] Δ_{1}^{2} \\ + Ψ^{P} Δ_{1} + Φ^{P}, \end{align}

(5)

where $Ψ^{P}$ and $Φ^{P}$ are terms independent of LD, and the minimum and maximum possible values for Δ₁ are given by $- \frac{1}{4}$ and $\frac{1}{4}$ , respectively. If we have:

\begin{align} ξ^{IBS_{hap}} (Δ_{1}) & = - 8 Δ_{1}^{2} + Ψ^{IBS_{hap}} Δ_{1} + Φ^{IBS_{hap}} \end{align}

(6)

The minimum and maximum possible values for the critical value, $Δ_{1}^{*}$ , of $ξ^{IBS_{hap}}$ are given by $- \frac{1}{4}$ and $\frac{1}{4}$ , respectively, if the tested locus and the QTL are monomorphic [see Additional file 1]. In other words, the critical value of this function will always lie within the range of the LD coefficient when LD exist. In expression (5), the coefficient $[- 4 s_{i, h_{1}, h_{1}}^{P} - 4 s_{i, h_{2}, h_{2}}^{P} + 8 s_{i, h_{1}, h_{2}}^{P}]$ is always greater or equal to −8 for any other predictor than IBS_hap, since $s_{i, h_{1}, h_{2}}^{P} \in [0, 1]$ . For instance, AIP by construction assign positive values to $s_{i, h_{1}, h_{2}}^{P}$ when haplotypes h₁ and h₂ share allele similarity. This property is even truer if h₁ and h₂ are very similar. In such cases, the highest rate of decrease for $ξ^{P}$ , with respect to the absolute deviation of Δ₁ from $Δ_{1}^{*}$ , is thus induced by . Moreover, for such cases, we also have $ξ^{P} (- \frac{1}{4}) = ξ^{P} (\frac{1}{4}) \in [\frac{1}{2} s_{i, h_{1}, h_{2}}^{P}, 1]$ , which expresses a lower bound for $ξ^{P}$ (i.e. $\frac{1}{2} s_{i, h_{1}, h_{2}}^{P}$ , [see Additional file 1]). Finally, $ξ^{P} (- \frac{1}{4}) = ξ^{P} (\frac{1}{4}) = 0$ if and only if . In other words, when LD between the haplotypes and the QTL alleles is complete, the matrix distance is equal to 0 if and only if [see Additional file 1]. The decreasing behavior of $ξ^{P}$ between a tested position and a QTL for a substantial increase of LD is therefore deteriorated for AIP with continuous predictions in [ 0,1]. Hence, this result questions the behavior of AIP with continuous predictions in [ 0,1] in relation to LD, in the general case where K is greater than 2.

Distributions of matrix distance as function of multiallelic LD

Figures 3 and 4 show the distributions of the matrix distance for the six AIP against the local multiallelic LD. Figures 3 and 4 convey only local information for the case where the tested position is closest to the QTL, as opposed to Figure 2. Darker and lighter blue regions in Figures 3 and 4 correspond to higher and lower density of points. The red lines in Figures 3 and 4 correspond to non-parametric LOESS regressions of the matrix distance on the multiallelic LD.

Figures 3 and 4 show a better behavior of IBS _hap and P(IBD) for the decrease of their matrix distance, with lower variability around the LOESS curves compared to the other predictors, when the LD between the haplotypes and the target alleles increases. The distributions of the matrix distance for IBS _hap and P(IBD) in these figures show similar trends on the FLW and HCB chromosomes. This is due to the fact that these two predictors perform similarly in some conditions (see sub-section on mapping accuracy and relative efficiency). However IBS _hap shows a better behavior compared to all other predictors in Figures 3 and 4, for the decrease of its matrix distance with increasing multiallelic LD. The good behavior of IBS _hap for the decrease of its matrix distance in Figures 3 and 4 is totally explained by equation (4), where the sum of the concave polynomials decreases as the multiallelic LD increases. The better behavior of IBS _hap, compared to the other predictors in Figures 3 and 4, is explained by equations (3) and (5), which show that continuous predictions in [0,1] will deteriorate the decrease of the matrix distance with respect to LD.

The matrix distances for Beagle and IBS _m were also plotted against the local multiallelic LD between the haplotypes and the target alleles in Figures 3 and 4, although these two predictors are defined for marker positions only. Indeed, one of the aims of this study was to compare the AIP based on local LD between haplotypes and alleles at a hidden locus. TP, Score, Beagle and IBS _m showed poor relationships for the decrease of their matrix distance with the increasing multiallelic LD. The matrix distance distributions showed high variability for these predictors with respect to R on the FLW and HCB chromosomes. Note that the length of the six marker haplotypes on the HCB chromosomes were equal to 0.01 cM, on average, compared to 0.31 cM on average for those on the FLW chromosomes.

Mapping accuracy and relative efficiency

Table 1 relates the relative efficiency of the six AIP that were compared, and their mapping accuracies, for a QTL simulated under six scenarios on the FLW chromosomes for N=200 simulations. $R_{i^{*}, QTL}$ in Table 1 corresponds to the multiallelic LD at position i^∗, measured between the marker-haplotypes that harbor the simulated QTL and the QTL alleles. Note that the tested position i^∗ does not necessarily coincide with the QTL position. Thus, i^∗ can be defined as the tested position closest to the simulated QTL.

Table 1 Relative efficiencies and mapping accuracies for different QTL effects

Full size table

In Table 1, IBS mQTL refers to the IBS _m predictor applied to the data set containing the causal variants. This situation was examined as a gold standard. As shown in Table 1 and as expected, IBS mQTL provided the best mapping accuracy since the data set used contained the causal variants and both the simulated QTL and the analyzed markers were biallelic. However, it should be noted that the RMSE ^m.a. for IBS mQTL was never equal to 0. This is principally due to the error term in the probabilistic models for hypothesis testing. RMSE ^r.e. for IBS mQTL was also not equal to 0 when LD was highest ( $R_{i^{*}, QTL} = 0.52$ ). This was due to a nearby marker which was in complete LD with the SNP that simulated the QTL (i.e. the biallelic LD was complete). Consequently the argument of the minimum (argmin) for the set of matrix distances was not unique.

In Table 1 both RMSE ^r.e. and RMSE ^m.a. increased globally for all predictors when LD decreased in the vicinity of the QTL. RMSE ^r.e. and RMSE ^m.a. were highly correlated, regardless of the QTL effect. Across all LD levels, the Spearman correlation coefficient between these two quantities was equal to 0.89 (or 0.91) when the QTL effect explained at most 57% (or 8%) of the total variance, respectively (Figure 5).

Each dot in Figure 5 represents RMSE ^m.a. against RMSE ^r.e. for one of the AIP at a particular LD level. In Table 1, the IBS _hap predictor was often the most accurate and efficient AIP when the data was analyzed without the QTL. However, the P(IBD) predictor showed similar mapping and efficiency results to IBS _hap. As defined by [1], the P(IBD) predictor relies on the IBS state of alleles between haplotype markers which suggests that IBS _hap and P(IBD) may perform similarly in some conditions [41]. Indeed, the distribution of IBD probabilities in the vicinity of a simulated QTL was almost bimodal (0 or 1) among the different pairs of chromosome segments for the different sets of simulations, and thus similar to the distribution of the values for IBS _hap between the segments. To illustrate this phenomenon, Figure 6 provides an example of distributions for the values of P(IBD) and IBS _hap, for one gene-drop simulation, between pairs of chromosome segments around the simulated QTL for the moderate LD situation ( $R_{i^{*}, QTL} = 0.18$ ).

IBS _hap and P(IBD) also showed similar patterns at a set of tested positions for the matrix distances $d_{1} (M^{P, i}, M^{QTL})$ . Figure 7 shows an example for the mean and the sample quantiles at 2.5 and 97.5% for $d_{1} (M^{P, i}, M^{QTL})$ at each tested position for the six AIP, from 200 gene-drop simulations with a QTL simulated for the moderate LD situation ( $R_{i^{*}, QTL} = 0.18$ ).

As observed in Figure 7, the minima of the curves for the mean and the sample quantiles at 2.5 and 97.5% of the matrix distance distributions almost coincide with the QTL position for IBS _hap and P(IBD). For these two predictors, the three curves also show a smooth decreasing behavior as the tested position gets closer to the simulated QTL. This behavior shows the ability of IBS _hap and P(IBD) to capture LD structure along the chromosomes with respect to the simulated QTL, for different gene-drop simulations. It is interesting to note that IBS _hap and P(IBD) show similar patterns for the mean and the sample quantiles curves. However, the minimum of each of the three curves in Figure 7 is lower for IBS _hap than for P(IBD). Note that the patterns of the matrix distances for IBS _hap in Figure 7 are explained by equation (4) and Figure 2. That is, the matrix distance will decrease for IBS _hap due to the expected increase of the multiallelic LD, as the tested position moves toward the QTL position. In the same way, the patterns of the matrix distances for P(IBD) in Figure 7 are explained according to Figures 2 and 6. That is, P(IBD) will behave slightly differently from IBS _hap, according to Figures 2 and 6, when taking equations (3) and (5) into account.

As shown in Figure 7, the other predictors cannot capture the LD structure along the chromosomes with respect to the simulated QTL as well as IBS _hap and P(IBD); this is particularly the case for Score and even more for TP. For the latter two predictors, $d_{1} (M^{P, i}, M^{QTL})$ shows little variability and is low on average across the tested positions. This could explain the lack of a clear ranking between the mapping accuracies of TP and Score in Table 1. For Beagle, a good relative efficiency and mapping accuracy was observed for the lowest LD situation ( $R_{i^{*}, QTL} = 0.08$ ) in Table 1, compared to all the other predictors, when the QTL effect was low. Note that AIP that are based on haplotypes and that do not perform haplotype clustering like Beagle, may not be at an advantage for a low LD situation. For example, the matrix distance for IBS _hap, as defined by equation (4), will not decrease if there is little LD between local haplotypes and QTL alleles. Therefore, haplotype clustering is necessary for such situations. Moreover, these AIP will intrinsically provide an excess of degrees of freedom for testing association if the QTL is biallelic, while not compensating for the low LD captured in the matrix distance. Hence, AIP based on haplotype clustering can provide higher mapping accuracy for low LD situations.

Discussion

Matrix distance properties

The present study showed that the QTL mapping accuracy of AIP is highly correlated to the tested position that minimizes the matrix distance defined for comparison. The use of the matrix distance to compare various AIP has many advantages for methodology development and validation. First, it is independent of phenotype simulation processes and statistical tests that are commonly used to compare QTL mapping accuracy of different AIP [4, 8, 23, 25]. Indeed the phenotype simulation process, when based on certain specific assumptions, may favor some AIP over others: for example, IBD-based AIP might be at an advantage if the phenotypes are simulated only according to population history. The statistical test used may also favor some AIP, such as IBS _hap, IBS _m and Beagle, over others due to numerical stability when estimating variance components. As such, solving mixed model equations when covariance matrices are close to singularity due to AIP computation has been reported as an issue, and clustering strategies for haplotypes, which actually modify the properties of the AIP matrices, have been proposed to facilitate computation [42, 43]. The major drawback of the matrix distance approach is related to this advantage: a particularly efficient AIP or a particularly efficient haplotype size, identified from the matrix distance, that can not be used in association studies would be of no value. Another advantage of the matrix distance approach is that computation time is highly reduced compared to association studies, so numerous comparisons can be done. In the present study, the relative efficiency of the AIP was consistent with the results for QTL mapping accuracy, regardless of the QTL effects and LD patterns. Therefore, the concept of relative efficiency was proven useful to compare AIP and avoid time-consuming association studies on simulated data. Combining the relative efficiency with the mapping accuracy of predictors could also be helpful to gain a better understanding of the underlying mechanisms in an association study.

Comparing AIP

The results showed that the most accurate AIP for mapping were those that best captured LD between a tested position and a QTL. This was proposed from algebraic developments in the simplest situations and validated using real data and simulations. The matrix distance can be written for any AIP as a sum of functions of LD coefficients, and more precisely for the IBS _hap predictor as a sum of concave polynomials of LD coefficients. When LD was moderate to high around the QTL, the IBS _hap predictor was the most efficient and accurate matrix for mapping. For a biallelic QTL, the domains of values for which some of these concave polynomials can either decrease or increase with increasing LD was shown in our developments as limited to extreme allele frequencies for the haplotypes and QTL. Additionally, continuous AIP in [0,1] were shown to deteriorate the matrix distance generally when LD between a tested position and the QTL increased. This was observed on two unrelated data sets, which showed that this behavior is not related to the marker density or population history. All LD measures are based on counting occurrences for discrete events at distinct loci to quantify non-random association [37, 38], which thus explains the algebraic and simulation results for discrete and continuous AIP when a relatively high LD is available for detection. The pig example was built using 235 haplotypes and 25 gene-drop generations, a realistic situation with regard to the effective population size. However, the impact of the resulting long-range haplotypic identity, which depends strongly on the population size and mating strategies, on the relative values of the considered AIP should be investigated.

Despite using two contrasting data sets in terms of marker density and population history, P(IBD) always behaved very similarly to IBS _hap. When extending the calculations to longer haplotypes (results not shown), a similar behavior was observed. Yet advantages have been reported for P(IBD) compared to IBS _hap in some situations. For example, Roldan et al.[43] showed better accuracy for P(IBD) compared to IBS _hap, after a clustering step for haplotypes when marker intervals were equal to 0.05 cM between SNPs, but not when they reached 0.25 cM. However in Roldan et al.[43], different statistical models were applied to P(IBD) versus IBS _hap (mixed model versus fixed effects model respectively). Hence, these two AIP were not compared on the same basis. For instance, Boleckova et al.[44] showed that statistical models in which haplotypes were fitted as random effects performed better than those in which they were fitted as fixed effects. When both LD and the QTL effect were low, Beagle showed a relatively good efficiency and mapping accuracy. It was not possible to derive algebraical comparisons between AIP when LD was low, but this, together with earlier studies that point out that continuous advanced methods are more efficient than simple IBS _hap, suggests that some continuous AIP in [0,1] may provide efficiency when LD between markers and a QTL is reduced.

Extending the results to multiallelic QTL

In the present study, we considered a biallelic QTL for algebra and simulations. Yet the algebraic derivation of the matrix distance can be generalized to a multiallelic QTL without difficulty [see Additional file 1]. As suggested by these developments, for a multiallelic QTL, the relationship between continuous predictions of allelic identity at a tested position and the corresponding LD coefficients will tend to be looser than for discrete predictions. In addition, the matrix distance for the IBS _hap predictor can always be written as a sum of concave polynomials of LD coefficients for any degree of allelism at the QTL.

Conclusion

The IBS _hap predictor can always capture multiallelic LD between a tested position and a QTL, regardless of the degree of allelism at the QTL. The IBS _hap predictor also has the advantage of being simple, fast and numerically stable when used in association studies. Therefore, it is suggested that, for studies with a high density of markers and for which LD between markers and the causal variants is likely to be high, the use of the IBS _hap predictor is recommended.

References

Meuwissen THE, Goddard ME: Prediction of identity by descent probabilities from marker haplotypes. Genet Sel Evol. 2001, 33: 605-634. 10.1186/1297-9686-33-6-605.
Article PubMed Central CAS PubMed Google Scholar
Li J, Jiang T: Haplotype-based linkage disequilibrium mapping via direct data mining. Bioinformatics. 2005, 21: 4384-4393. 10.1093/bioinformatics/bti732.
Article CAS PubMed Google Scholar
Browning SR: Multilocus association mapping using variable-length Markov chains. Am J Hum Genet. 2006, 78: 903-913. 10.1086/503876.
Article PubMed Central CAS PubMed Google Scholar
Pong-Wong R, George AW, Woolliams JA, Haley CS: A simple and rapid method for calculating identity-by-descent matrices using multiple markers. Genet Sel Evol. 2001, 33: 453-471. 10.1186/1297-9686-33-5-453.
Article PubMed Central CAS PubMed Google Scholar
Bercovici S, Meek C, Wexler Y, Geiger D: Estimating genome-wide IBD sharing from SNP data via an efficient hidden Markov model of LD with application to gene mapping. Bioinformatics. 2010, 26: i175-i182. 10.1093/bioinformatics/btq204.
Article PubMed Central CAS PubMed Google Scholar
Druet T, Georges M: A hidden Markov model combining linkage and linkage disequilibrium information for haplotype reconstruction and quantitative trait locus fine mapping. Genetics. 2010, 184: 789-798. 10.1534/genetics.109.108431.
Article PubMed Central CAS PubMed Google Scholar
Akey J, Jin L: Xiong M: Haplotypes vs single marker linkage disequilibrium tests: what do we gain?. Eur J Hum Genet. 2001, 9: 291-300. 10.1038/sj.ejhg.5200619.
Article CAS PubMed Google Scholar
Abdallah J, Goffinet B, Cierco-Ayrolles C, Pérez-Enciso M: Linkage disequilibrium fine mapping of quantitative trait loci: a simulation study. Genet Sel Evol. 2003, 35: 513-532. 10.1186/1297-9686-35-6-513.
Article PubMed Central CAS PubMed Google Scholar
Cardon LR, Abecasis GR: Using haplotype blocks to map human complex trait loci. Trends Genet. 2003, 19: 136-140.
Article Google Scholar
Clark AG: The role of haplotypes in candidate gene studies. Genet Epidemiol. 2004, 27: 321-333. 10.1002/gepi.20025.
Article PubMed Google Scholar
Schaid DJ: Evaluating associations of haplotypes with traits. Genet Epidemiol. 2004, 27: 348-364. 10.1002/gepi.20037.
Article PubMed Google Scholar
Browning BL, Browning SR: Efficient multilocus association testing for whole genome association studies using localized haplotype clustering. Genet Epidemiol. 2007, 31: 365-375. 10.1002/gepi.20216.
Article PubMed Google Scholar
Chen Y, Li X, Li J: A novel approach for haplotype-based association analyzis using family data. BMC Bioinformatics. 2010, 11: S45-10.1186/1471-2105-11-S1-S45.
Article PubMed Central PubMed Google Scholar
Lin WY, Yi N, Zhi D, Zhang K, Gao G, Tiwari HK, Liu N: Haplotype-based methods for detecting uncommon causal variants with common SNPs. Genet Epidemiol. 2012, 36: 572-582. 10.1002/gepi.21650.
Article PubMed Central PubMed Google Scholar
Knüppel S, Esparza-Gordillo J, Marenholz I, Holzhütter HG, Bauerfeind A, Ruether A, Weidinger S, Lee YA, Rohde K: Multi-locus stepwise regression: a haplotype based algorithm for finding genetic associations applied to atopic dermatitis. BMC Med Genet. 2012, 13: 8-
Article PubMed Central PubMed Google Scholar
Li M, Wing HW, Art BO: A sparse transmission disequilibrium test for haplotypes based on Bradley-Terry graphs. Hum Hered. 2012, 73: 52-61. 10.1159/000335937.
Article Google Scholar
Risch N, Merikangas K: The future of genetic studies of complex human diseases. Science. 1996, 273: 1516-1517. 10.1126/science.273.5281.1516.
Article CAS PubMed Google Scholar
Terwilliger JD, Weiss KM: Linkage disequilibrium mapping of complex disease: fantasy or reality?. Curr Opin Biotechnol. 1998, 9: 578-594. 10.1016/S0958-1669(98)80135-3.
Article CAS PubMed Google Scholar
Jorde LB: Linkage disequilibrium and the search for complex disease genes. Genome Res. 2000, 10: 1435-1444. 10.1101/gr.144500.
Article CAS PubMed Google Scholar
Weiss KM, Clark AG: Linkage disequilibrium and the mapping of complex human traits. Trends in Genet. 2002, 18: 19-24. 10.1016/S0168-9525(01)02550-1.
Article CAS Google Scholar
Slatkin M: Disequilibrium mapping of a quantitative-trait locus in an expanding population. Am J Hum Genet. 1999, 64: 1764-1772.
Article PubMed Central CAS PubMed Google Scholar
Farnir F, Coppieters W, Arranz JJ, Berzi P, Cambisano N, Grisart B, Karim L, Marcq F, Moreau L, Mni M, Nezer C, Simon P, Vanmanshoven P, Wagenaar D, George M: Extensive genome-wide linkage disequilibrium in cattle. Genome Res. 2000, 10: 220-227. 10.1101/gr.10.2.220.
Article CAS PubMed Google Scholar
Meuwissen THE, Goddard ME: Fine mapping of quantitative trait loci using linkage disequilibria with closely linked marker loci. Genetics. 2000, 155: 421-430.
PubMed Central CAS PubMed Google Scholar
Remington DL, Thornsberry JM, Matsuoka Y, Wilson LM, Whitt SR, Doebley J, Kresovich S, Goodman MM, Buckler ES: Structure of linkage disequilibrium and phenotypic associations in the maize genome. Proc Natl Acad Sci USA. 2001, 98: 11479-11484. 10.1073/pnas.201394398.
Article PubMed Central CAS PubMed Google Scholar
He W, Fernando RL, Dekkers JCM, Gilbert H: A gene frequency model for QTL mapping using Bayesian inference. Genet Sel Evol. 2010, 42: 21-10.1186/1297-9686-42-21.
Article PubMed Central CAS PubMed Google Scholar
Grapes L, Dekkers JCM, Rothschild MF, Fernando RL: Comparing linkage disequilibrium-based methods for fine mapping quantitative trait loci. Genetics. 2004, 166: 1561-1570. 10.1534/genetics.166.3.1561.
Article PubMed Central CAS PubMed Google Scholar
Browning BL, Browning SR: Haplotypic analysis of Wellcome Trust Case Control Consortium data. Human Genet. 2008, 123: 273-280. 10.1007/s00439-008-0472-1.
Article Google Scholar
Henderson CR: Best linear unbiased estimation and prediction under a selection model. Biometrics. 1975, 31: 423-447. 10.2307/2529430.
Article CAS PubMed Google Scholar
Henderson CR: A simple method for computing the inverse of a numerator relationship matrix used in prediction of breeding values. Biometrics. 1976, 32: 69-83. 10.2307/2529339.
Article Google Scholar
Dempster AP, Laird NM, Rubin DB: Maximum likelihood from in-complete data via the EM algorithm. Roy Statist Soc Ser B. 1977, 39: 1-38.
Google Scholar
Patterson HD, Thompson R: Recovery of inter-block information when block sizes are unequal. Biometrika. 1971, 58: 545-554. 10.1093/biomet/58.3.545.
Article Google Scholar
Harville DA: Bayesian inference for variance components using only error contrasts. Biometrika. 1974, 61: 383-385. 10.1093/biomet/61.2.383.
Article Google Scholar
Foulley JL: EM algorithm: theory and application to the mixed model. J Soc Fr Stat. 2002, 143: 57-109.
Google Scholar
Calus MPL, Meuwissen THE, Windig JJ, Knol EF, Schrooten C, Vereijken ALJ, Veerkamp RF: Effects of the number of markers per haplotype and clustering of haplotypes on the accuracy of QTL mapping and prediction of genomic breeding values. Genet Sel Evol. 2009, 41: 11-10.1186/1297-9686-41-11.
Article PubMed Central PubMed Google Scholar
Grapes L, Firat MZ, Dekkers JCM, Rothschild MF, Fernando RL: Optimal haplotype structure for linkage disequilibrium-based fine mapping of quantitative trait loci using identity by descent. Genetics. 2005, 172: 1955-1965. 10.1534/genetics.105.048686.
Article PubMed Google Scholar
Ramos AM, Crooijmans RP, Affara NA, Amaral AJ, Archibald AL, Beever JE, Bendixen C, Churcher C, Clark R, Dehais P, Hansen MS, Hedegaard J, Hu ZL, Kerstens HH, Law AS, Megens HJ, Milan D, Nonneman DJ, Rohrer GA, Rothschild MF, Smith TP, Schnabel RD, Van Tassell CP, Taylor JF, Wiedmann RT, Schook LB, Groenen MA: Design of a high density SNP genotyping assay in the pig using SNPs identified and characterized by next generation sequencing technology. PLoS ONE. 2009, 4: e6524-10.1371/journal.pone.0006524.
Article PubMed Central PubMed Google Scholar
Hedrick PW, Thomson G: A two-locus neutrality test: applications to humans, E. coli and Lodgepole pine. Genetics. 1985, 112: 135-156.
Google Scholar
Hedrick PW: Gametic disequilibrium measures: proceed with caution. Genetics. 1987, 117: 331-341.
PubMed Central CAS PubMed Google Scholar
Maurer HP, Knaak C, Melchinger AE, Ouzunova M, Frisch M: Linkage disequilibrium between SSR markers in six pools of elite lines of an european breeding program for hybrid maize. Maydica. 2006, 51: 269-279.
Google Scholar
Ytournel F, Teyssèdre S, Roldan D, Erbe M, Simianer H, Boichard D, Gilbert H, Druet T, Legarra A: LDSO: A program to simulate pedigrees and molecular information under various evolutionary forces. J Anim Breed Genet. 2012, 129: 417-421. 10.1111/j.1439-0388.2011.00986.x.
Article CAS PubMed Google Scholar
Ytournel F, Gilbert H, Boichard D: Concordance between IBD probabilities and linkage disequilibrium. Proceedings of European Federation of Animal Science Annual Meeting; 26 August 2007; Dublin. 2007, 1248-1248. [http://www.eaap.org/Previous_Annual_Meetings/2007Dublin/Papers/S38_1248_Ytournel.pdf]
Google Scholar
Druet T, Fritz S, Boussaha M, Ben-Jemaa S, Guillaume F, Derbala D, Zelenika D, Lechner D, Charon C, Boichard D, Gut I, Eggen A, Gautier M: Fine mapping of quantitative trait loci affecting female fertility in dairy cattle on BTA03 using a dense single-nucleotide polymorphism map. Genetics. 2008, 178: 2227-2235. 10.1534/genetics.107.085035.
Article PubMed Central CAS PubMed Google Scholar
Roldan DL, Gilbert H, Henshall JM, Legarra A, Elsen JM: Fine-mapping quantitative trait loci with a medium density marker panel: efficiency of population structures and comparison of linkage disequilibrium linkage analysis models. Genet Res Camb. 2012, 94: 223-234. 10.1017/S0016672312000407.
Article PubMed Central CAS PubMed Google Scholar
Boleckova J, Christensen OF, Sørensen P, Sahana G: Strategies for haplotype-based association mapping in a complex pedigreed population. Czech J Anim Sci. 2012, 1: 1-9.
Google Scholar

Download references

Acknowledgements

The study was supported by the French National Research Agency (ANR-09-GENM-002 Rules & Tools Project).

Author information

Authors and Affiliations

INRA, GenPhySE (Génétique, Physiologie et Systèmes d’Elevage), F-31326, Castanet-Tolosan, France
Laval Jacquin, Jean-Michel Elsen & Hélène Gilbert
Université de Toulouse, INP, ENSAT, GenPhySE (Génétique, Physiologie et Systèmes d’Elevage), F-31326, Castanet-Tolosan, France
Laval Jacquin, Jean-Michel Elsen & Hélène Gilbert
Université de Toulouse, INP, ENVT, GenPhySE (Génétique, Physiologie et Systèmes d’Elevage), F-31076, Toulouse, France
Laval Jacquin, Jean-Michel Elsen & Hélène Gilbert

Authors

Laval Jacquin
View author publications
You can also search for this author in PubMed Google Scholar
Jean-Michel Elsen
View author publications
You can also search for this author in PubMed Google Scholar
Hélène Gilbert
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Laval Jacquin.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

LJ derived the analytical results, performed the simulations and wrote the manuscript. LJ, JME and HG were involved in the conception of the study. All authors read and approved the final manuscript.

Electronic supplementary material

12711_2013_2668_MOESM1_ESM.pdf

Additional file 1:Algebraic derivations of formulas in the main text. This file contains all the algebraic derivations for expressions (1) to (6) and the generalization of the matrix distance, as a sum of concave functions of LD coefficients when IBS _hap, for the case of a multiallelic QTL. (PDF 177 KB)

12711_2013_2668_MOESM2_ESM.pdf

Additional file 2:Domains of LD coefficients and boundary conditions for the critical values of each Q_p function. This file contains the domain of values for the multiallelic LD coefficients, the boundary conditions for the critical value of each Q_p function in expression (4) and the relation between the sum of the squared deviations and $D_{i, QTL}^{2}$ . (PDF 285 KB)

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Authors’ original file for figure 5

Authors’ original file for figure 6

Authors’ original file for figure 7

Authors’ original file for figure 8

Authors’ original file for figure 9

Authors’ original file for figure 10

Authors’ original file for figure 11

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Cite this article

Jacquin, L., Elsen, JM. & Gilbert, H. Using haplotypes for the prediction of allelic identity to fine-map QTL: characterization and properties. Genet Sel Evol 46, 45 (2014). https://doi.org/10.1186/1297-9686-46-45

Download citation

Received: 20 November 2013
Accepted: 20 May 2014
Published: 14 July 2014
DOI: https://doi.org/10.1186/1297-9686-46-45

Using haplotypes for the prediction of allelic identity to fine-map QTL: characterization and properties

Abstract

Background

Results

Conclusions

Background

Methods

Matrix distance comparison

Measures of AIP evaluated

Statistical models, test statistic and relative efficiency

Mixed models

RLRT statistic

Relative efficiency

Comparison criteria

Mapping accuracy

Relative efficiency

Data for simulation

Variation of LD between tested positions and a QTL

Distributions of matrix distance as a function of multiallelic LD

QTL simulation on FLW chromosomes

Results

Variation of LD between tested positions and a QTL

Matrix distance as function of multiallelic LD coefficients

Distributions of matrix distance as function of multiallelic LD

Mapping accuracy and relative efficiency

Discussion

Matrix distance properties

Comparing AIP

Extending the results to multiallelic QTL

Conclusion

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Competing interests

Authors’ contributions

Electronic supplementary material

Authors’ original submitted files for images

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Genetics Selection Evolution

Contact us