A comparison of methods for whole-genome QTL mapping using dense markers in four livestock species

With dense genotyping, many choices exist for methods to detect quantitative trait loci (QTL) in livestock populations. However, no across-species study has been conducted on the performance of different methods using real data. We compared three methods that correct for relatedness either implicitly or explicitly: linkage and linkage disequilibrium haplotype-based analysis (LDLA), efficient mixed-model association (EMMA) analysis, and Bayesian whole-genome regression (BayesC). We analyzed one chromosome in each of five datasets (dairy cattle, beef cattle, sheep, horses, and pigs) using real genotypes based on dense single nucleotide polymorphisms and phenotypes. The P values corrected for multiple testing or Bayes factors greater than 150 were considered to be significant. To complete the real data study, we also simulated quantitative trait loci (QTL) for the same datasets based on the real genotypes. Several scenarios were chosen, with different QTL effects and linkage disequilibrium patterns. A pseudo-null statistical distribution was chosen to make the significance thresholds comparable across methods. For the real data, the three methods generally agreed within 1 or 2 cM for the locations of QTL regions and disagreed when no signals were significant (e.g. in pigs). For certain datasets, LDLA had more significant signals than EMMA or BayesC, but they were concentrated around the same peaks. Therefore, the three methods detected approximately the same number of QTL regions. For the simulated data, LDLA was slightly less powerful and accurate than either EMMA or BayesC but this depended strongly on how thresholds were set in the simulations. All three methods performed similarly for real and simulated data. No method was clearly superior across all datasets or for any particular dataset. For computational efficiency and ease of interpretation, EMMA is recommended, but using more than one method is suggested.


Background
Many methods to detect and localize quantitative trait loci (QTL) in humans and animals are reported in the literature. Current methods work on the basis of identity, either by descent or by state, at either single nucleotide polymorphisms (SNPs) or a short series of SNPs (i.e., a haplotype). Current state-of-the-art methods, especially in human genetics, use a consecutive series of single-marker tests (e.g. association analysis), often corrected by stratification or coancestry [1][2][3][4]. The high density of molecular markers ensures that common QTL variants of medium effect will be captured by random linkage disequilibrium of close markers. However, for certain cases, such as causal QTL that are rare variants, simultaneous consideration of consecutive loci (e.g. haplotype association or linkage analysis) can be more powerful or accurate than locus-by-locus analysis (e.g. [5]). In addition, recent research has proposed fitting all markers simultaneously to better model the genetic background and thus improve the analysis [6][7][8]. Sahana et al. [9] used simulation to compare methods, but comparison of QTL mapping methods using real data has not been reported in animal genetics. The purpose of this study was to compare methods for QTL detection and localization based primarily on real datasets for five traits and four species and complemented with some simulations that mimicked the real data.
Three state-of-the-art methods were compared: LDLA (linkage disequilibrium and linkage analysis), a method that considers haplotypes clustered by approximate identical-by-descent probabilities [10]; EMMA (efficient mixed-model association), a regular single-marker association analysis method for a genome-wide association study (GWAS) but with correction for relatedness in the population [3,11,12]; and BayesC, a Bayesian method that fits effects of all SNPs simultaneously [13]. We examined the following features: number of positive signals, agreement of location of positive signals between the methods, shape of signals, and redundancy of test statistics between neighbouring positions. Data-based simulation was used to characterize method properties (power, accuracy, and false discovery rate) under a pseudonull hypothesis that considered presence of a QTL effect.

Methods
Several research projects in the INRA (French National Institute for Agriculture Research) Animal Genetics division have generated a number of datasets that include SNPs and phenotypes. For this study, we considered a variety of distinct experimental designs and species. First, we chose and analyzed chromosomes on which at least one QTL (but not a very large one) had already been detected using the datasets. Then, we considered populations with different family structures: very related (cattle), minimally related but with structure (horses), structured in breeds (pigs), and breed crosses (sheep).

Description of data
Five datasets, which included four species separately, were examined. One chromosome was analyzed for each dataset.
Cattle data in this study came from on-farm recording. For the horses, the procedures involving animals consisted of radiography and collection of blood samples, which are routinely performed in veterinary practice and are considered noninvasive. All horses were sedated in accordance with clinical guidelines. For the sheep study, all animals were kept indoors, handled with care, and managed as a commercial flock following the INRA ethics policy. At the end of the experimental infection, animals were slaughtered at the INRA-Nouzilly abattoir following the EU rules. For the pig study, animals involved were reared and slaughtered in compliance with national regulations applicable to animal research and commercial slaughtering.

Dairy cattle (Bos taurus)
Milk yield (305-day) was examined for a population of 1221 Montbéliarde bulls with a complex pedigree structure that overlapped several generations. Pseudo-performance of bulls was measured by daughter yield deviation [14], which is the average performance of a bull's daughters corrected for other effects such as herd and genetic merit of dams. The bulls had been genotyped with the Illumina Bovine SNP50 BeadChip. Markers were discarded based on low call rate (<95%), lack of known position on the genome, or very high Mendelian inconsistency (more than 20% parentoffspring discordance), resulting in 43 582 SNPs that were used. Chromosome 1 with 2854 markers was analyzed.

Beef cattle (Bos taurus)
Meat tenderness [15] was examined for a population of 936 Blonde d' Aquitaine bulls that had been genotyped with the Illumina Bovine SNP50 BeadChip (43 582 SNPs), filtered as for dairy cattle. Chromosome 7 with 1889 markers was analyzed.

Sheep (Ovis aries)
Resistance to nematode infestation measured as faecal egg count after a first infection at 90 days postpartum (FEC12t) [16] was examined for a population of 1067 meat sheep that resulted from a BlackBelly × Romane backcross [17]. Three generations were included: F1, backcross, and backcross × backcross. Animals had been genotyped with the Illumina OvineSNP50 BeadChip (42 469 SNPs, filtered as in [17]). Chromosome 12 with 1424 markers was analyzed.

Horse (Equus caballus)
Incidence of hock osteochondrosis [18] was examined for a minimally related but family structured population of 627 French Trotter horses; 102 stallions sired the 525 horses that had been scored for hock osteochondrosis. Animals had been genotyped with the Illumina Equi-neSNP50 BeadChip (41 249 SNPs, filtered as in [18]). Chromosome 3 with 2267 markers was analyzed.

Pig (Sus scrofa)
Length of carcass pre-corrected for breed and other environmental factors [19] was examined for a population of 764 pigs (495 Large White, 129 Landrace, and 140 Piétrain) from 327 sires; 656 pigs had been measured for length of carcass. The entire population had been genotyped with the Illumina PorcineSNP60 DNA Analysis kit (46 865 SNPs, edited as in [19]). Chromosome 17 with 1672 markers was studied.
A brief description of the data is in Table 1.

Data analysis
We used three methods to analyse the data: LDLA, EMMA, and BayesC.

LDLA
This method was originally developed by Meuwissen et al. [20] and our implementation was as in [10]. Genotypes were first phased based on family and linkage disequilibrium information [21]. Then, haplotypes were defined based on four consecutive (polymorphic) SNPs. Transmission of these haplotypes was traced along the pedigree; if no recombination occurred, haplotypes transmitted from parents to offspring were considered to be identical and carriers of the same allele at the QTL. Founder haplotypes, which cannot have identity-bydescent ascertained, were clustered based on their resemblance, which was based on alikeness in state according to a simple coalescence model. Finally, a variance component model was fitted to the phenotypes for each locus on the chromosome of interest and maximized by restricted maximum likelihood, either with (alternative hypothesis) or without haplotype effects (null hypothesis). The model included a polygenic effect structured according to a pedigree-based relationship matrix: y = 1μ + Zu + Th + e, where y is the vector of phenotypes, u and h are vectors of polygenic and haplotype effects (with respective incidence matrices Z and T), respectively, and e is a vector of residuals; V ar h ð Þ ¼ Hσ 2 h and V ar u ð Þ ¼ Aσ 2 u are haplotypic and polygenic covariance matrices, with respective variance components σ 2 h and σ 2 u , with matrix H set up according to [10]. A likelihood ratio test comparing the null and alternative hypothesis above was then calculated, and P values were computed according to [22]. The process was repeated for each locus on the chromosome of interest; for each locus, new incidence and covariance matrices T and H were created.

EMMA
We used a simple extension of regular regression-based association analysis. The extension fitted a polygenic background via a relationship matrix to deal with structure, as regularly practiced in animal genetics (e.g. [23]) and recently applied in human genetics (e.g. [3]). In our implementation, we used a whole-genome SNP-based relationship matrix (G) [24] instead of a pedigree-based relationship matrix. At each marker locus on the chromosome of interest, we fitted the model y = 1μ + Zu + ws + e, where u is a vector of polygenic effects as in the LDLA model but with V ar u ð Þ ¼ Gσ 2 u , w is a vector of covariates coded as (0,1,2) for each genotype at the analysed SNP, and s is the substitution effect of the marker. Effects were estimated using blupf90 [25]. A tstatistic was constructed from the estimate of the substitution effect s as t = ŝ/s. e. (ŝ) where s. e. (ŝ) is the standard error of the estimate ŝ.

BayesC
This method [13] is one of a family of Bayesian methods that were originally conceived for prediction of genetic merit and phenotypes and later extended to map gene locations [6,26]. It has the potential to analyze the whole genome simultaneously. The BayesC model included a set of variable indicators, d = {d 1 , …, d n }, which indicates if a SNP is (d i = 1) or is not (d i = 0) in the model. The phenotypes were thus modelled as y = 1μ + ∑w i d i s i + e. Solutions of the model were obtained by Gibbs sampling, which provides marginal a posteriori inference on d in the form of Bayes factors (BF) [27,28]. A key parameter of the model is the a priori number of SNPs, which was fixed at Pr(d i = 1) = 1/1000. Experimentation with 1/100 or 1/ 10000 did not qualitatively change the results. Gibbssampling chains were run for 100 000 iterations using GS3 [29]. All SNPs in the genome were included in the model, although only the SNP estimates for the chromosome of interest were further analyzed. The statistic used was the BF, which corresponds to the increase from prior to posterior probabilities of the SNP being "in" the model [27,30], which in our case can be written as: whered i is the posterior probability of a locus being in the model. For simple models, the BF is numerically identical to the likelihood ratio if a priori probabilities are identical [27,30]. Numerically, the scales of the BF and the likelihood ratio are fairly similar [27]. An advantage of BF is that it has a clear interpretation, contrary to the ad hoc thresholds used for the likelihood ratio (e.g. [31]), which are not comparable from one study to another.

QTL detection
We used chromosome-wise Bonferroni correction to infer rejection thresholds for P values of LDLA and EMMA, and a BF of 150 for BayesC, because this value was suggested as "very strong evidence" [27] and later used for QTL mapping [32].
Potentially, different methods can reveal different QTL in the same dataset. For instance, for two closely positioned QTL on the same chromosome, classical linkage-based analysis will show a large QTL signal around one location, whereas linkage disequilibrium methods may indicate two separate narrow signals. To determine approximately the ability of methods to disentangle closely positioned QTL, we estimated a measure of redundancy for each method along the chromosome, which was based on the equivalent number of independent tests using Geyer's [4] effective sample size. This method estimates the number of independent points in a Markov chain process, which in our case consisted of QTL signals along the genome produced by a method.

Simulations
We simulated data for five scenarios. One hundred replicates were run for each scenario. For each real dataset, we picked at random one or two SNPs as QTL (the same for each replicate within each scenario but not across scenarios) and assigned an effect (the same for each replicate; the effect could be additive or dominant). The effect type, minor allele frequency, and linkage disequilibrium (measured as the correlation between a QTL and the neighbouring marker in strongest linkage disequilibrium with QTL) are in Table 2 for each scenario. The SNP acting as a QTL was then masked. We also simulated a polygenic component based on pedigree relationships, as well as residual deviations, which were added to the QTL genotype effects to simulate phenotypes. Each replicate was analyzed with the three methods (LDLA, EMMA, and BayesC) and for each replicate, the most significant QTL was considered and other significant signals were disregarded. We evaluated three indicators of the quality of QTL mapping: mean squared error in localization (MSE), false discovery rate (FDR), and power. The MSE and power were based on the declared QTL falling within a region around the simulated QTL, and FDR was the number of QTL declared outside the QTL region. Three sizes of QTL regions were examined: ±0.5, ±1, and ±2 cM around the true QTL. The QTL explained 5% of total genetic variance in all scenarios, and the broad-sense heritability of the trait was 0.30.
Statistics for significance decisions (acceptance or rejection of the null hypothesis) generally were not comparable across methods. For example, BF does not have defined type-I and type-II error rates. Therefore, we used the simulations to establish the appropriate rejection thresholds for each statistic (likelihood ratio test for LDLA, t-statistic for EMMA, and BF for BayesC) within each scenario. We examined three thresholds. First, a 5% empirical threshold was computed from the distribution of statistics across all non-QTL positions (those that did not fall in the intervals mentioned above). Because a QTL was included in the simulated data but only non-QTL positions were included in the computation of the statistic, we refer to this as a pseudo-null distribution. Based on 1000 non-QTL positions and 20 simulations, the distribution of the statistic under the pseudo-null hypothesis was described by 20 000 samples. Two more stringent thresholds were also defined. For each of the 100 simulations, the most significant P value in the non-QTL regions was retained; then, the 10% and 20% quantiles (computed following [33]) were set as the two thresholds.
After running all simulations, we had five datasets, five scenarios, 100 replicates for each scenario, three methods, three alternative thresholds, and three alternative QTL region sizes to consider as true positives. The average (across 100 replicates) of each QTL quality indicator (MSE, FDR, or power) was calculated, which resulted in 675 values, one for each combination of the above factors.

Data analysis
Manhattan plots of QTL signals for the three detection methods are shown for each data set in Figures 1, 2, 3, 4 and 5. Table 3 shows the numbers of significant QTL positions detected. These numbers should be regarded with caution because the actual numbers of QTL are not known. Of the three methods, LDLA had the largest number of signals for dairy cattle, horses, and sheep. For beef cattle and pigs, no significant association was found using any of the three detection methods.

Agreement
Because the objective of QTL mapping is not only to detect QTL but to provide locations for further investigation by molecular geneticists, agreement of the location of positive signals between the three detection methods was evaluated. Informal visual agreement between methods is evident from Figures 1, 2, 3, 4 and 5. Additional Tables S1, S2, S3, S4 and S5 [See Additional file 1: Tables S1, S2, S3, S4 and S5] show exact locations and statistics for the two major signals found in each analysis. For example, the dairy cattle data showed a QTL across methods in a window of 2 cM around 65 cM and beef cattle data show agreement between methods of a QTL around 32 and 98 cM, etc. For sheep and beef cattle, the figures show more agreement than the additional tables, because the peaks in agreement may be first or second highest signals in one method but third or fourth highest signals in another. Only the pig dataset did not show clear agreement for detected QTL, but no significant positive signals were found for pigs. We believe that the true signals that were present in these datasets were found by all methods (within a bracket of 1 to 5 cM) and that false signals were random noise and, therefore, without agreement.

Redundancy
Estimates of the numbers of independent tests are in Table 4. Redundancy increased with family structure (i.e. number of half-and full-sibs within family); e.g. in horses, the large number of small families resulted in low redundancy. In general, LDLA tended to be the most redundant (i.e. a small number of independent tests). Redundancy was less for EMMA and BayesC than for LDLA because (1) haplotypes (or haplotype structure) are highly correlated from one position to the next (but SNP effects are   not) and (2) LDLA is partly family (linkage) based (but EMMA and BayesC are not). In addition, BayesC estimators are marginalized over all other loci.

Simulations
Quality indicators for QTL detection methods are in Table 5 as least-squares means (estimated marginal means over a balanced design across all other simulation factors [34]). The LDLA method had not only lower power but also higher MSE and FDR. The EMMA and BayesC methods were clearly superior to LDLA in the simulations, regardless of scenario or simulation factor (data not shown).

Discussion
Overall, the results were similar with the three QTL detection methods. The Manhattan plots had similar profiles and indicated the same QTL locations for significant (or close to significant) points. None of the methods appeared to produce spurious peaks because of population stratification. All three methods accounted for relationships either through pedigree or genome-wide markers. Table 3 suggests that LDLA is more powerful than EMMA or BayesC. However, in Figures 1, 2, 3, 4 and 5, some multiple signals obtained with LDLA corresponded to a single peak obtained with EMMA and BayesC, which suggests that many LDLA signals echo a single QTL. This makes sense because the covariance structure for haplotypes changes little from one position to another (as evident in Table 4); i.e. LDLA uses linkage information. All QTL regions that were detected by LDLA were also detected by EMMA and BayesC; thus, LDLA was not more powerful than EMMA or BayesC.
The performance of the methods differed slightly between the real and the simulated data. For example, LDLA produced more positive signals with real data than EMMA and BayesC. In contrast, for simulated data (Table 5), LDLA was the least powerful method. Several explanations for this contradiction are possible. One is that only one QTL was simulated, whereas many potential QTL exist in the real data. Another explanation is that LDLA produces several correlated signals for a single QTL, as discussed previously. Still another explanation is that the nominal threshold used for the real data was too liberal and, therefore, produced false positives. The threshold to be used for the real data is also debatable (e.g. [35]). Although a reasonable consensus exists on how to calculate a threshold under a strict null hypothesis (no QTL on the chromosome), such a hypothesis is nonsensical because biologically there must be many, possibly small QTL on each chromosome. Therefore, a pseudo-null hypothesis was used for our simulations. As an example of the problems caused by the use of rejection thresholds, the two highest (although non-significant) peaks of EMMA in Figure 1 agree with the significant peaks obtained with LDLA and BayesC. In Figure 2, (non-significant) results from LDLA agree with results from the other methods. Therefore, no one method is clearly the most powerful for a given dataset because the ideal rejection threshold depends on the true genetic architecture of a dataset. Thus, a common sense approach is to use several different methods for QTL detection.
For QTL location, the methods were in agreement for the real data. This was as expected because the chromosomes were suspected to contain QTL. For the pig data, no agreement for QTL location was found among the methods, but no signals were significant; using several methods and obtaining no significant signals indicates the absence of detectable QTL for this dataset. However, for the beef cattle data, no significant QTL were found with LDLA but significant QTL were evident when using EMMA and BayesC.
The results from the BayesC and EMMA methods agreed to a very large extent for the real and simulated data because they both model each SNP as having an effect. The complexity of BayesC and lack of a consensual rejection threshold for BayesC might make EMMA preferable. If BayesC is used, Bayesian testing (i.e. using BF) is recommended, since it produced results that are comparable within and across datasets.
Finally, LDLA results in a more correlated structure of signals across the genome. This was evident as less power and accuracy in QTL mapping with simulated data but was not clear in the real data. Therefore, the larger number of independent tests found for EMMA and BayesC does not necessarily imply greater accuracy.
Some discussion about the scope of this work is also needed. Many studies have compared QTL detection methods based on simulations. In those simulations, the QTL location and action are precisely described. However, very little is known about QTL action in real data and, thus, such simulations tend to provide incomplete results because they do not account well for the range of possible gene actions. Phenomena such as selfregulation, epistasis, and pleiotropy are particularly difficult to simulate. Our study, which was based on real data, attempts to address the question of whether analysis of a dataset with another method will provide a different result. We have given some empirical answers to this question.

Conclusions
All three methods (haplotype-based, association analysis with relationship matrices, and Bayesian analysis) were suitable for QTL detection. Using several methods for the same dataset is recommended. EMMA was easier to use and had slightly higher accuracy and power for both real and simulated data. For QTL with large signals, the methods agreed well. Establishment of significance thresholds is difficult because it is unclear what the null hypothesis should be [35]. The use of nominal P values corrected for multiple-testing for the frequentist methods and of BF for the Bayesian method resulted in acceptable statistical properties and led to similar conclusions across methods and datasets.

Additional file
Additional file 1: Table S1. Dairy cattle: positions of the first and second largest signals for each method. Positions of two largest QTL detection signals. Table S2. Beef cattle: positions of the first and second largest signals for each method. Positions of two largest QTL detection signals. Table S3. Horses: positions of the first and second largest signals for each method. Positions of two largest QTL detection signals. Table S4. Title: Sheep: positions of the first and second largest signals for each method. Positions of two largest QTL detection signals. Table S5. Pigs: positions of the first and second largest signals for each method. Positions of two largest QTL detection signals.