Evaluation of the accuracy of imputed sequence variant genotypes and their utility for causal variant detection in cattle
© The Author(s) 2017
Received: 14 November 2016
Accepted: 14 February 2017
Published: 21 February 2017
The availability of dense genotypes and whole-genome sequence variants from various sources offers the opportunity to compile large datasets consisting of tens of thousands of individuals with genotypes at millions of polymorphic sites that may enhance the power of genomic analyses. The imputation of missing genotypes ensures that all individuals have genotypes for a shared set of variants.
We evaluated the accuracy of imputation from dense genotypes to whole-genome sequence variants in 249 Fleckvieh and 450 Holstein cattle using Minimac and FImpute. The sequence variants of a subset of the animals were reduced to the variants that were included on the Illumina BovineHD genotyping array and subsequently inferred in silico using either within- or multi-breed reference populations. The accuracy of imputation varied considerably across chromosomes and dropped at regions where the bovine genome contains segmental duplications. Depending on the imputation strategy, the correlation between imputed and true genotypes ranged from 0.898 to 0.952. The accuracy of imputation was higher with Minimac than FImpute particularly for variants with a low minor allele frequency. Using a multi-breed reference population increased the accuracy of imputation, particularly when FImpute was used to infer genotypes. When the sequence variants were imputed using Minimac, the true genotypes were more correlated to predicted allele dosages than best-guess genotypes. The computing costs to impute 23,256,743 sequence variants in 6958 animals were ten-fold higher with Minimac than FImpute. Association studies with imputed sequence variants revealed seven quantitative trait loci (QTL) for milk fat percentage. Two causal mutations in the DGAT1 and GHR genes were the most significantly associated variants at two QTL on chromosomes 14 and 20 when Minimac was used to infer genotypes.
The population-based imputation of millions of sequence variants in large cohorts is computationally feasible and provides accurate genotypes. However, the accuracy of imputation is low in regions where the genome contains large segmental duplications or the coverage with array-derived single nucleotide polymorphisms is poor. Using a reference population that includes individuals from many breeds increases the accuracy of imputation particularly at low-frequency variants. Considering allele dosages rather than best-guess genotypes as explanatory variables is advantageous to detect causal mutations in association studies with imputed sequence variants.
Several genotyping arrays that comprise a varying number of single nucleotide polymorphisms (SNPs) are routinely used for genome-wide genotyping in cattle. Cows are usually genotyped using low-density genotyping arrays whereas bulls are mostly genotyped at higher density . Moreover, the sequencing of important ancestors of many cattle breeds yielded genotypes at millions of polymorphic sites . Combining the genotype data from various sources into a single large dataset may enhance the power of genome-wide analyses. The imputation of missing genotypes is necessary to ensure that all individuals have genotypes for a shared set of variants. Genotype imputation may also infer dense genotypes in silico for animals that were genotyped at lower density by using animals that were genotyped at a higher density as a reference population .
Algorithms that infer missing genotypes apply family- (e.g., ) or population-based (e.g., [5–7]) imputation approaches or a combination thereof (e.g., [8–10]). Family-based imputation approaches rely on Mendelian transmission rules in pedigrees to infer missing genotypes. Population-based imputation approaches exploit linkage disequilibrium (LD) between adjacent markers to predict missing genotypes using a probabilistic framework without (explicitly) considering pedigree information . While population-based imputation approaches are accurate, their computing costs are too high to infer genotypes for a large number of animals and markers for routine applications [12, 13]. Methods that apply a combination of family- and population-based imputation approaches exploit shared haplotypes among relatives thereby enabling rapid imputation of genotypes for tens of thousands of individuals and millions of markers in silico [8–10, 14].
The accuracy of imputation from low to higher density depends on the relationships between target and reference animals, genotype density in both panels, allele frequencies of the imputed variants, LD between adjacent markers and algorithms applied to infer missing genotypes (e.g., [12, 15, 16]). These parameters also affect the accuracy of imputation from dense genotypes to sequence variants [2, 17, 18]. However, the accurate imputation of low-frequency variants is critical with sequence data because rare alleles are more frequent in sequence than array-derived variants and the LD between low-frequency and array-derived variants may be low [19, 20].
In cattle, a number of studies with simulated and real sequence data indicated that using imputed sequence variant genotypes may improve genomic predictions and facilitate the detection of causal trait variants because the polymorphisms that underlie phenotypic variation are included in the data [2, 19, 21–23]. However, the benefits of dense marker maps for genome-wide analyses may be compromised when the accuracy of imputation is low [24–26].
In this paper, we evaluate the accuracy of imputing sequence variant genotypes in two cattle breeds using different imputation algorithms and reference populations. We perform association studies between imputed sequence variant genotypes and milk fat percentage in 6958 Fleckvieh bulls and show that the imputation strategy is critical to pinpoint causal mutations.
Animal ethics statement
No ethical approval was required for this study.
We used whole-genome sequence data of 1577 taurine animals that were included in the fifth run of the 1000 bull genomes project . The reads were aligned to the UMD3.1 bovine reference genome using the BWA-MEM algorithm [27, 28]. SNPs, short insertions and deletions were genotyped for all sequenced animals simultaneously using a multi-sample variant calling pipeline that was implemented with the mpileup module of SAMtools  and is described in Daetwyler et al. . The variant calling yielded genotypes at 39,721,987 biallelic sites for 1577 animals. We considered genotypes at 22,737,136 autosomal sequence variants that had a minor allele frequency (MAF) higher than 0.5% to build the genomic relationship matrix among the sequenced animals using the plink (version 1.9) software tool . Principal components of the genomic relationship matrix were calculated using the GCTA (version 1.25.3) software tool . Our analyses focused on the Fleckvieh (FV) and Holstein (HOL) breeds because a large number of animals from both breeds were included in the 1000 bull genomes data. Following the inspection of the top principal components, animals for which breed was uncertain were removed leaving 249 FV and 450 HOL cattle (see Additional file 1: Figure S1).
Design of cross-validation scenarios
The imputation from dense genotypes to full sequence variants was evaluated using 15-fold cross-validation in FV and HOL cattle. We considered only sequence variants that were polymorphic in 249 FV or 450 HOL animals. For each fold, the sequenced animals were divided into reference and validation animals. Forty-nine FV and 100 HOL animals, that were a random subset of the sequenced animals, were considered as validation animals (i.e., the same individual may have been included in several validation sets). All remaining sequenced animals from the 1000 bull genomes project were considered as a multi-breed reference population. The within-breed reference populations consisted of 200 FV or 350 HOL animals. The genotypes of the validation animals were reduced to the variants that were included on the Illumina BovineHD genotyping array (HD) to mimic dense genotypes. The masked genotypes of the validation animals were subsequently inferred in silico using full sequence information from the reference animals. The selection of validation animals and subsequent imputation of genotypes was repeated 15 times for chromosomes 1, 5, 10, 15, 20 and 25.
Evaluation of the accuracy of imputation
The overall accuracy of imputation was the mean correlation between in silico imputed and true (sequenced) genotypes (rIMP,SEQ) across 15 folds and six chromosomes analysed. Specifically, for each fold and chromosome, we calculated the correlation between matrices of imputed and true genotypes. These values were averaged to obtain the overall rIMP,SEQ. In addition, we grouped the imputed sequence variants into 50 classes with regard to their MAF in the reference population. The rIMP,SEQ-value for each MAF class was the mean value across six chromosomes and 15 folds. To detect intra-chromosomal variations in the accuracy of imputation, we calculated rIMP,SEQ-values for successive 1-Mb segments along each chromosome analysed. Differences between imputation scenarios were tested for significance using two-tailed t-tests.
The performance of the FImpute (version 2.2)  and Minimac3 (version 2.0.1, henceforth referred to as Minimac)  software tools was evaluated using default parameter settings. The algorithm implemented in FImpute uses family and population-based information to infer haplotypes and missing genotypes. A pedigree consisting of 47,012 FV animals tracing back ancestry information to animals born in 1925 was considered when genotypes were inferred using FImpute. The pedigree included ancestry information for 138 (out of 250) sequenced FV animals. Pedigree records for sequenced animals from breeds other than FV were also included if available. However, these pedigrees included only first-degree relatives. Genotypes for the imputed sequence variants were coded as 0, 1 and 2 for homozygous, heterozygous and alternative homozygous animals, respectively.
The algorithm implemented in the Minimac software tool uses previously phased genotypes, i.e., it takes haplotypes as input for reference and validation animals. Haplotype phases were estimated separately for the reference and validation animals using the phasing algorithm implemented in the Eagle (version 2.3) software tool . Both Eagle and Minimac do not consider pedigree information to infer haplotypes and missing genotypes. Minimac provides best-guess genotypes (coded as 0, 1 and 2 for homozygous, heterozygous and alternative homozygous animals, respectively) and allele dosages (continuously distributed values ranging from 0 to 2) for the imputed sequence variants. The rIMP,SEQ-values were calculated for best-guess genotypes (Minimac BG) and allele dosages (Minimac DOS).
Imputation of sequence variants in 6958 Fleckvieh animals
We imputed genotypes for 23,256,743 autosomal sequence variants that were polymorphic in 249 sequenced FV animals for 6958 FV bulls that had (partially imputed) array-derived genotypes at 603,662 autosomal variants (see ). Genotypes were imputed with either FImpute or Minimac using either within- or multi-breed reference populations that consisted either of 249 FV animals or 1577 animals from various bovine breeds (see above). Haplotype phases for reference and target animals were estimated using the Eagle software tool (see above). The LD among imputed sequence variants was calculated as the squared correlation between predicted allele dosages.
Accuracy of imputation at two known causal mutations
712 and 902 FV animals that had imputed sequence variant genotypes also had direct genotypes for two mutations in the DGAT1 (rs109234250 and rs109326954, p.A232K) and GHR (rs385640152, p.F279Y) genes that were obtained for previous studies using TaqMan® genotyping assays (Life Technologies) [34–37]. The accuracy of imputation at both sites was the proportion of correctly imputed genotypes.
Association analyses in Fleckvieh cattle
Association tests between 23,256,743 imputed sequence variants and milk fat percentage were carried out in 6958 FV bulls using a variance component-based approach that was implemented in the EMMAX software tool and that accounts for population stratification by fitting a genomic relationship matrix as detailed in [20, 38]. The genomic relationship matrix was built using array-derived genotypes of 603,662 (partially imputed) autosomal SNPs . Daughter yield deviations (DYD) for milk fat percentage were the response variables. The genotypes were coded as 0, 1 and 2 for homozygous, heterozygous and alternative homozygous animals, respectively, when sequence variants were imputed using FImpute. When the sequence variants were imputed using Minimac, we considered both best-guess genotypes and allele dosages as explanatory variables for the association tests. Sequence variants with P values less than 2.1 × 10−9 (5%-Bonferroni-corrected type I error threshold for 23,256,743 tests) were considered as significantly associated.
All computations were performed on the Biosciences Advanced Scientific Computing (BASC) cluster that is located at AgriBio, Centre for AgriBiosciences, Bundoora, VIC 3083. The memory usage and process time required to infer genotypes for 23,256,743 sequence variants in 6958 animals was quantified on 12-core Intel® Xeon® processors rated at 2.93 GHz with 96 GB of random-access memory (RAM).
Number of polymorphic sites in Fleckvieh and Holstein cattle
Chr length (Mb)
SEQ (MAF < 0.1)
700 K (MAF < 0.1)
SEQ (MAF < 0.1)
700 K (MAF < 0.1)
Evaluation of the accuracy of imputation
Cross-validation imputation accuracy in Fleckvieh and Holstein cattle
Within-breed reference populations
Multi-breed reference populations
The number of sequence variants that were not polymorphic after imputation in the validation population varied considerably across the imputation scenarios tested (see Additional file 2: Figure S2). Using Minimac BG, between 29.2 and 37.4% of the imputed sequence variants were invariant in the validation populations depending on the breed and reference population considered. The proportion of invariant sites was lower when FImpute (between 19.8 and 24.2% of the sequence variants) and Minimac DOS (between 3.6 and 6.7% of the sequence variants) were used for imputation. Variants that had a low MAF in the reference population were more often invariant in the validation population than common variants. When the rIMP,SEQ-values were calculated using only sequence variants that were polymorphic in the validation population, Minimac BG provided the most accurate genotypes. However, the accuracy of imputation was also high using Minimac DOS (see Additional file 2: Figure S2c and S2d) which provided genotypes for considerably more sequence variants than Minimac BG and FImpute.
The accuracy of imputation varied considerably across chromosomes. The rIMP,SEQ-values were high for sequence variants located on BTA1, 20 and 25. However, the accuracy of imputation was lower for sequence variants located on BTA5, 10 and 15 (Fig. 1c, d). This pattern was observed regardless of the reference population and imputation software used. To investigate the reason for the variable accuracy of imputation across chromosomes, we calculated rIMP,SEQ-values, HD SNP coverage and sequence variant density for successive 1-Mb windows along the six chromosomes analysed. Multiple (partially extended) segments with high imputation errors were located on BTA5, 10 and 15 (see Additional file 3: Figure S3) at positions where the bovine genome contains large segmental duplications [40, 41]. These segments were often sparsely covered with HD SNPs but contained a large number of polymorphic sequence variants. Such segments with low imputation quality were not detected along BTA1, 20 and 25, which had an overall higher accuracy of imputation.
Pinpointing causal mutations among imputed sequence variants
Accuracy of imputation for two causal variants in Fleckvieh cattle
Minor allele frequency
Proportion of correctly imputed genotypes/rIMP,SEQ
Fine-mapping of five fat percentage QTL with imputed sequence variants
A total of 239 variants located between 91,857,670 and 93,955,207 bp on BTA5 were significantly associated (P < 2.1 × 10−9) with fat percentage [Fig. 4b and Table S1 (see Additional file 5: Table S1)]. Thirteen variants in the first intron of the MGST1 gene (microsomal glutathione S-transferase 1) were in high LD (r2 > 0.95) with each other and had markedly lower P values (P < 6.5 × 10−21) than all other variants. These 13 variants were also associated with milk composition in New Zealand dairy cows . The top variant (rs208248675) was located at 93,945,991 bp. The rs208248675 A-allele had a frequency of 19.4% in FV cattle and decreased fat percentage.
A QTL for fat percentage on BTA6 encompassed the casein gene complex [Fig. 4c and Table S1 (see Additional file 5: Table S1)]. One hundred and sixty-four variants with P < 2.1 × 10−9 were located between 87,084,144 and 87,296,017 bp. The most significantly associated variant (rs109193501 at 87,154,594 bp, P = 7.6 × 10−14) was located in an intron of the CSN1S1 gene (casein alpha s1).
Two hundred and sixty-two variants located within a 50-kb interval (between 103,274,736 and 103,324,728 bp) on BTA11 were associated with fat percentage [Fig. 4d and Table S1 (see Additional file 5: Table S1)]. The top variant (rs381989107 at 103,296,192 bp, P = 1.5 × 10−15) was located 5472 bp upstream of the translation start site of the LGB gene [beta-lactoglobulin, also known as progestagen associated endometrial protein (PAEP)]. Two missense variants (p.G80D and p.V134A) in the LGB gene (rs110066229 at 103,303,475 bp and rs109625649 at 103,304,757 bp) that distinguish the LGB protein variants A and B  were in high LD with rs381989107 and had P values less than 4.5 × 10−12 (see Additional file 5: Table S1). A variant in the promoter of the LGB gene that causes an aberrant LGB expression in Brown Swiss cattle (OMIA 001437-9913)  was not polymorphic in the sequenced FV animals.
A QTL on BTA16 included 14 significantly associated variants located between 481,671 and 3,265,708 bp (see Additional file 5: Table S1). The top association signal (rs380194132, P = 1.6 × 10−10) resulted from an intergenic variant that was located at 3,242,688 bp. Since the genomic region that included associated sequence variants with similar P values extended over almost 3 Mb, the identification of putative candidate gene(s) underlying the QTL was not attempted (see Additional file 6: Figure S5).
On BTA19, a single variant located at 51,380,692 bp was significantly associated (rs208289132, P = 4.1 × 10−10) with fat percentage [Fig. 4e and Table S1 (see Additional file 5: Table S1)]. The associated variant was located 4230 bp upstream of the translation start site of the FASN gene (fatty acid synthase).
Computing resources required
We assessed the computing resources required to impute genotypes at 23,256,743 sequence variants for 6958 target animals using 1577 sequenced animals as a reference. Eagle and Minimac were run on 10 processors per chromosome, whereas FImpute was run on a single processor. FImpute ran out of memory and failed to impute genotypes for BTA12 and 23 likely because large structural variants are located on both chromosomes (see Additional file 7: Figure S6). The process (CPU) time to infer haplotypes and impute sequence variants for 27 chromosomes was 1037, 490 and 146 h, respectively, for Eagle, Minimac and FImpute.
The wall-clock time and RAM usage for FImpute was between 2.9 and 9.6 h and 6.7 and 29.7 GB per chromosome, respectively (see Additional file 8: Figure S7). The estimation of haplotype phases took between 1.6 and 5.9 h using Eagle and it required between 1.8 and 3.4 GB of RAM with between 11,178 and 38,009 SNPs per chromosome for 6958 animals. The imputation of sequence variants using Minimac took between 2.5 and 9.4 h and it required between 6.3 and 21 GB of RAM per chromosome.
The estimation of haplotype phases for 1577 sequenced reference animals using Eagle (multi-threaded on 10 CPUs) took between 4.2 and 15.1 h with between 413,371 and 1,444,299 sequence variants, respectively, and it required between 15.4 and 51.7 GB of RAM per chromosome.
We evaluated the accuracy of imputation from dense genotypes to sequence variants using a population of 1577 sequenced animals which is a four- to eight-fold increase in sample size compared to previous studies in cattle [2, 17, 18]. Our findings are likely to be relevant for many bovine populations because we considered animals from two breeds with different effective population sizes as validation panel and inferred sequence variant genotypes on six chromosomes that reflect a broad spectrum of LD . The accuracy of imputation was the correlation between imputed and sequenced genotypes (rIMP,SEQ), which is a measure that depends less on the allele frequencies of the sequence variants than e.g., the proportion of correctly imputed genotypes . However, the cross-validation procedure that was applied in our study yields reliable results only when the quality of the sequence data is high. Low-fold sequencing coverage and sequencing or assembly errors may result in flawed genotypes . When genotypes at such positions are masked and subsequently imputed, genotype imputation might provide true genotypes that differ from the sequenced genotypes thereby underestimating the accuracy of imputation. Considering that the genotype error rates were low in the sequenced FV and HOL animals [2, 47], it is unlikely that our findings are biased due to flawed genotypes at some sequence variants.
The rIMP,SEQ-values varied considerably across chromosomes and dropped at several positions along the genome. Previous studies that used array-derived genotypes showed that a decline in imputation accuracy may indicate intra- or inter-chromosomal misplacement of SNPs [12, 48, 49]. We detected low rIMP,SEQ-values at regions where the bovine genome contains large segmental duplications [40, 41]. The population-wide imputation of sequence variant genotypes for BTA12 and 23 was not possible using Fimpute. Assessing the accuracy of imputation along these chromosomes using cross-validation within the data of sequenced animals revealed large segments with low imputation quality on both chromosomes including the highly variable major histocompatibility complex . We were able to eventually impute sequence variant genotypes for BTA12 and 23 using FImpute when those segments were excluded from the reference panel. Imputing genotypes within both segments was possible using Minimac. However, the inferred genotypes are flawed and must be treated with caution as our results show (see Additional file 7: Figure S6). While an improved assembly of the reference genome may resolve some of these problems, a better resolution of large structural elements is not possible using short paired-end DNA sequencing . Since flawed genotypes compromise downstream analyses [24, 25], it is advisable to exclude sequence variants within segments of low imputation quality from the target population and retain the array-derived genotypes only.
We assessed the performance of FImpute and Minimac because both methods can impute millions of polymorphic sites in large populations within a reasonable time. Previous studies showed that FImpute is more accurate than other tools with similar computing costs [52, 53]. The population-based genotype imputation algorithm implemented in Minimac is highly accurate and fast because it takes previously phased genotypes as input [11, 54]. We phased the target population using Eagle because it enables timely and accurate haplotype inference . Although the computing costs to impute 23 million sequence variants in 6958 animals were more than ten-fold higher with Eagle and Minimac than FImpute, the wall-clock time was only 1.67-fold higher because Eagle and Minimac enable multi-threading to reduce the process time [32, 54]. The difference in computing costs is greater when haplotypes are not available for the reference animals. Since many variant detection pipelines include phasing and imputation algorithms [2, 47, 55], the computing time required to phase the sequence data was not considered in our study. However, the phasing of the reference panel was necessary because the variant detection pipeline of the 1000 bull genomes project was applied to successive 5-Mb segments rather than whole chromosomes .
The accuracy of imputation was higher with Minimac than FImpute, although Minimac does not consider pedigree information. Most individuals of the 1000 bull genomes population were selected in such a way that they represent a large proportion of the genetic variation of current populations [2, 47, 56]. They are likely less related with each other than a random subset from the population, although some parent-offspring pairs were included. The accuracy of imputation might be higher when reference and target animals are closely related particularly when FImpute is used . However, dense genotypes for reference and target animals, e.g., sequence and HD-derived genotypes, can be used to identify short shared haplotypes among apparently unrelated animals that originate from ancient ancestors possibly predating the separation of breeds [9, 57]. Including reference animals from various breeds increased the accuracy of imputation, particularly in FV cattle. The principal components analysis revealed that animals from many breeds clustered nearby the FV population indicating that they are distantly related whereas the HOL animals formed a cluster that was separated from all other breeds. The benefit of a multi-breed reference population on the accuracy of imputation was less pronounced in HOL cattle. Since the number of reference animals was nearly twice as large in HOL, our results may also corroborate the suggestion that multi-breed reference panels are particularly useful to impute genotypes when the reference population is small [18, 58, 59]. In agreement with previous studies in cattle [2, 18], a multi-breed reference population increased the accuracy of imputation at low MAF variants likely reflecting the shared ancestry of many bovine populations and the limited number of sequenced animals per breed. However, it seems advisable to periodically evaluate different imputation strategies because multi-breed reference populations can also compromise the imputation of sequence variants when the diversity of the reference panel increases .
The population-based imputation of sequence variant genotypes was particularly advantageous at variants with a low MAF corroborating previous findings [12, 53, 60]. The true genotypes were more correlated to predicted allele dosages than best-guess genotypes which is in line with the findings of Brøndum et al. . Moreover, our results show that more sequence variants are polymorphic in the target population when allele dosages are considered instead of best-guess genotypes. An association study between allele dosages and milk fat percentage revealed that two causal variants with a MAF less than 10% in the DGAT1 [34, 35] and GHR  genes were the most significantly associated variants at two QTL on BTA14 and 20, respectively. This finding demonstrates that using predicted allele dosages in association studies with imputed sequence variants can pinpoint causal mutations which agrees with Zheng et al.  and Khatkar et al. . So far, imputed sequence variants did not substantially improve the accuracies of genomic predictions in real data which might partly result from flawed genotypes at low-frequency variants [17, 63]. We show that allele dosages are more accurate than best-guess genotypes particularly at variants with a low MAF because they better reflect imputation uncertainty. However, the analysis of allele dosages for millions of sequence variants in tens of thousands of individuals for genomic predictions is computationally costly and has not been attempted so far [17, 64]. Since the proportion of variants with a low MAF is more than three-fold higher in sequence than HD variants, further research is warranted to investigate if allele dosages may enhance genomic predictions with imputed sequence variants.
Our findings show that the composition of the reference population and the choice of the imputation algorithm are critical to infer accurate genotypes thereby enabling us to pinpoint causal mutations in association studies with imputed sequence variants. When the accuracy of imputation is low, causal variants may be “buried” among other variants in LD (e.g., Fig. 3a, d). In such situations, association testing will not reveal true causal mutations because variants with lower P values are likely to be prioritized as candidate causal variants . However, true causal variants are not necessarily the top variants in association studies even when their genotypes are almost perfectly imputed. Although the imputation accuracy for the causal p.A232K-polymorphism in the DGAT1 gene was greater than 99.5%, several adjacent variants had identical or slightly lower P values, which likely indicates sampling errors  or synthetic associations [66, 67]. Nevertheless, the causal mutations in the DGAT1 and GHR genes offer a convenient approach to evaluate the ability to detect causal mutations with imputed sequence variants in cattle, because both variants segregate in many breeds and phenotypes for fat percentage are readily available.
Our association study with more than 23 million sequence variants detected seven QTL that were significantly associated with milk fat percentage. Applying a less stringent significance threshold  would reveal additional QTL (see Additional file 9: Figure S8). Since we were able to pinpoint the causal mutations at two QTL, it is likely that causal mutations for other QTL are among the most significantly associated variants. However, the QTL identified in our study include several sequence variants in high LD and with similar P values, which renders the identification of causal sites difficult. Association studies in animals from multiple breeds may facilitate differentiation between causal and non-causal sites in LD . Causal sites may be located on ancient haplotypes that still segregate across multiple breeds as is the case for the fat percentage QTL on BTA5. Our association study revealed thirteen candidate causal variants in high LD in an intron of the MGST1 gene that had markedly lower P values than all other variants. It is very likely that one of those variants is the actual causal polymorphism because they were also highly significantly associated with milk composition in another dairy cattle breed . An improved functional annotation of the bovine genome and a fine-mapping strategy that prioritizes sequence variants using functional information may facilitate pinpointing causal mutations at such QTL [22, 69].
Genotypes for millions of sequence variants can be accurately inferred in large bovine cohorts using population-based imputation algorithms. The accuracy of imputation for variants with a low MAF is higher with multi-breed reference populations. The accuracy of imputation is not constant across chromosomes and may drop at regions where the genome contains large segmental duplications. Considering predicted allele dosages rather than best-guess genotypes as explanatory variables is beneficial for association studies with imputed sequence variants. The composition of the reference population and the choice of the approach used to infer genotypes are critical to detect causal mutations in association studies with imputed sequence variant genotypes.
HP, IMM, HDD and MEG designed the experiments, HP analyzed the data, PJB provided support in computing, RF and RE provided genotype and phenotype data, HP wrote the manuscript. All authors read and approved the final manuscript.
We thank Arbeitsgemeinschaft Süddeutscher Rinderzüchter und Besamungsorganisationen e.V. (ASR), Arbeitsgemeinschaft österreichischer Fleckviehzüchter (AGÖF) and Förderverein Bioökonomieforschung e.V. (FBF) for providing genotyping data. We acknowledge the 1000 bull genomes project for providing sequence variant genotypes for 1577 animals.
The authors declare that they have no competing interests.
Availability of data and materials
All relevant data are included in the manuscript and its supporting files.
HP was financially supported by a postdoctoral fellowship (Grant-ID: PA2789/1-1) from the Deutsche Forschungsgemeinschaft (DFG).
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Wiggans GR, Cooper TA, VanRaden PM, Van Tassell CP, Bickhart DM, Sonstegard TS. Increasing the number of single nucleotide polymorphisms used in genomic evaluation of dairy cattle. J Dairy Sci. 2016;99:4504–11.View ArticlePubMedGoogle Scholar
- Daetwyler HD, Capitan A, Pausch H, Stothard P, van Binsbergen R, Brøndum RF, et al. Whole-genome sequencing of 234 bulls facilitates mapping of monogenic and complex traits in cattle. Nat Genet. 2014;46:858–65.View ArticlePubMedGoogle Scholar
- Scheet P, Stephens M. A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am J Hum Genet. 2006;78:629–44.View ArticlePubMedPubMed CentralGoogle Scholar
- Burdick JT, Chen W-M, Abecasis GR, Cheung VG. In silico method for inferring genotypes in pedigrees. Nat Genet. 2006;38:1002–4.View ArticlePubMedPubMed CentralGoogle Scholar
- Browning BL, Browning SR. A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. Am J Hum Genet. 2009;84:210–23.View ArticlePubMedPubMed CentralGoogle Scholar
- Howie BN, Donnelly P, Marchini J. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet. 2009;5:e1000529.View ArticlePubMedPubMed CentralGoogle Scholar
- Howie B, Fuchsberger C, Stephens M, Marchini J, Abecasis GR. Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nat Genet. 2012;44:955–9.View ArticlePubMedPubMed CentralGoogle Scholar
- Hickey JM, Kinghorn BP, Tier B, van der Werf JH, Cleveland MA. A phasing and imputation method for pedigreed populations that results in a single-stage genomic evaluation. Genet Sel Evol. 2012;44:9.View ArticlePubMedPubMed CentralGoogle Scholar
- Sargolzaei M, Chesnais JP, Schenkel FS. A new approach for efficient genotype imputation using information from relatives. BMC Genomics. 2014;15:478.View ArticlePubMedPubMed CentralGoogle Scholar
- VanRaden PM, O’Connell JR, Wiggans GR, Weigel KA. Genomic evaluations with many more genotypes. Genet Sel Evol. 2011;43:10.View ArticlePubMedPubMed CentralGoogle Scholar
- Marchini J, Howie B. Genotype imputation for genome-wide association studies. Nat Rev Genet. 2010;11:499–511.View ArticlePubMedGoogle Scholar
- Pausch H, Aigner B, Emmerling R, Edel C, Götz KU, Fries R. Imputation of high-density genotypes in the Fleckvieh cattle population. Genet Sel Evol. 2013;45:3.View ArticlePubMedPubMed CentralGoogle Scholar
- Browning BL, Browning SR. Genotype imputation with millions of reference samples. Am J Hum Genet. 2016;98:116–26.View ArticlePubMedPubMed CentralGoogle Scholar
- Kong A, Masson G, Frigge ML, Gylfason A, Zusmanovich P, Thorleifsson G, et al. Detection of sharing by descent, long-range phasing and haplotype imputation. Nat Genet. 2008;40:1068–75.View ArticlePubMedPubMed CentralGoogle Scholar
- Druet T, Schrooten C, de Roos APW. Imputation of genotypes from different single nucleotide polymorphism panels in dairy cattle. J Dairy Sci. 2010;93:5443–54.View ArticlePubMedGoogle Scholar
- Zhang Z, Druet T. Marker imputation with low-density marker panels in Dutch Holstein cattle. J Dairy Sci. 2010;93:5487–94.View ArticlePubMedGoogle Scholar
- van Binsbergen R, Bink MC, Calus MP, van Eeuwijk FA, Hayes BJ, Hulsegge I, et al. Accuracy of imputation to whole-genome sequence data in Holstein Friesian cattle. Genet Sel Evol. 2014;46:41.View ArticlePubMedPubMed CentralGoogle Scholar
- Brøndum RF, Guldbrandtsen B, Sahana G, Lund MS, Su G. Strategies for imputation to whole genome sequence using a single or multi-breed reference population in cattle. BMC Genomics. 2014;15:728.View ArticlePubMedPubMed CentralGoogle Scholar
- Qanbari S, Pausch H, Jansen S, Somel M, Strom TM, Fries R, et al. Classic selective sweeps revealed by massive sequencing in cattle. PLoS Genet. 2014;10:e1004148.View ArticlePubMedPubMed CentralGoogle Scholar
- Pausch H, Emmerling R, Schwarzenbacher H, Fries R. A multi-trait meta-analysis with imputed sequence variants reveals twelve QTL for mammary gland morphology in Fleckvieh cattle. Genet Sel Evol. 2016;48:14.View ArticlePubMedPubMed CentralGoogle Scholar
- Meuwissen T, Goddard M. Accurate prediction of genetic values for complex traits by whole-genome resequencing. Genetics. 2010;185:623–31.View ArticlePubMedPubMed CentralGoogle Scholar
- MacLeod IM, Bowman PJ, Vander Jagt CJ, Haile-Mariam M, Kemper KE, Chamberlain AJ, et al. Exploiting biological priors and sequence variants enhances QTL discovery and genomic prediction of complex traits. BMC Genomics. 2016;17:144.View ArticlePubMedPubMed CentralGoogle Scholar
- Littlejohn MD, Tiplady K, Fink TA, Lehnert K, Lopdell T, Johnson T, et al. Sequence-based association analysis reveals an MGST1 eQTL with pleiotropic effects on bovine milk composition. Sci Rep. 2016;6:25376.View ArticlePubMedPubMed CentralGoogle Scholar
- Khatkar MS, Moser G, Hayes BJ, Raadsma HW. Strategies and utility of imputed SNP genotypes for genomic analysis in dairy cattle. BMC Genomics. 2012;13:538.View ArticlePubMedPubMed CentralGoogle Scholar
- Ertl J, Edel C, Emmerling R, Pausch H, Fries R, Götz KU. On the limited increase in validation reliability using high-density genotypes in genomic best linear unbiased prediction: observations from Fleckvieh cattle. J Dairy Sci. 2014;97:487–96.View ArticlePubMedGoogle Scholar
- Howie B, Marchini J, Stephens M. Genotype imputation with thousands of genomes. G3 (Bethesda). 2011;1:457–70.View ArticleGoogle Scholar
- Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. http://arxiv.org/abs/1303.3997. Accessed 4 July 2016.
- Zimin AV, Delcher AL, Florea L, Kelley DR, Schatz MC, Puiu D, et al. A whole-genome assembly of the domestic cow, Bos taurus. Genome Biol. 2009;10:R42.View ArticlePubMedPubMed CentralGoogle Scholar
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25:2078–9.View ArticlePubMedPubMed CentralGoogle Scholar
- Chang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, Lee JJ. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience. 2015;4:7.View ArticlePubMedPubMed CentralGoogle Scholar
- Yang J, Lee SH, Goddard ME, Visscher PM. GCTA: a tool for genome-wide complex trait analysis. Am J Hum Genet. 2011;88:76–82.View ArticlePubMedPubMed CentralGoogle Scholar
- Loh PR, Danecek P, Palamara PF, Fuchsberger C, Reshef YA, Finucane HK, et al. Reference-based phasing using the haplotype reference consortium panel. Nat Genet. 2016;48:1443–8.View ArticlePubMedGoogle Scholar
- Pausch H, Kölle S, Wurmser C, Schwarzenbacher H, Emmerling R, Jansen S, et al. A nonsense mutation in TMEM95 encoding a nondescript transmembrane protein causes idiopathic male subfertility in cattle. PLoS Genet. 2014;10:e1004044.View ArticlePubMedPubMed CentralGoogle Scholar
- Grisart B, Coppieters W, Farnir F, Karim L, Ford C, Berzi P, et al. Positional candidate cloning of a QTL in dairy cattle: identification of a missense mutation in the bovine DGAT1 gene with major effect on milk yield and composition. Genome Res. 2002;12:222–31.View ArticlePubMedGoogle Scholar
- Winter A, Krämer W, Werner FAO, Kollers S, Kata S, Durstewitz G, et al. Association of a lysine-232/alanine polymorphism in a bovine gene encoding acyl-CoA:diacylglycerol acyltransferase (DGAT1) with variation at a quantitative trait locus for milk fat content. Proc Natl Acad Sci USA. 2002;99:9300–5.View ArticlePubMedPubMed CentralGoogle Scholar
- Blott S, Kim JJ, Moisio S, Schmidt-Küntzel A, Cornet A, Berzi P, et al. Molecular dissection of a quantitative trait locus: a phenylalanine-to-tyrosine substitution in the transmembrane domain of the bovine growth hormone receptor is associated with a major effect on milk yield and composition. Genetics. 2003;163:253–66.PubMedPubMed CentralGoogle Scholar
- Pausch H, Wurmser C, Reinhardt F, Emmerling R, Fries R. Short communication: validation of 4 candidate causative trait variants in 2 cattle breeds using targeted sequence imputation. J Dairy Sci. 2015;98:4162–7.View ArticlePubMedGoogle Scholar
- Kang HM, Sul JH, Service SK, Zaitlen NA, Kong S, Freimer NB, et al. Variance component model to account for sample structure in genome-wide association studies. Nat Genet. 2010;42:348–54.View ArticlePubMedPubMed CentralGoogle Scholar
- VanRaden PM. Efficient methods to compute genomic predictions. J Dairy Sci. 2008;91:4414–23.View ArticlePubMedGoogle Scholar
- Liu G, Ventura M, Cellamare A, Chen L, Cheng Z, Zhu B, et al. Analysis of recent segmental duplications in the bovine genome. BMC Genomics. 2009;10:571.View ArticlePubMedPubMed CentralGoogle Scholar
- Bickhart DM, Hou Y, Schroeder SG, Alkan C, Cardone MF, Matukumalli LK, et al. Copy number variation of individual cattle genomes using next-generation sequencing. Genome Res. 2012;22:778–90.View ArticlePubMedPubMed CentralGoogle Scholar
- Ganai NA, Bovenhuis H, van Arendonk JA, Visker MH. Novel polymorphisms in the bovine beta-lactoglobulin gene and their effects on beta-lactoglobulin protein concentration in milk. Anim Genet. 2009;40:127–33.View ArticlePubMedGoogle Scholar
- Braunschweig MH, Leeb T. Aberrant low expression level of bovine beta-lactoglobulin is associated with a C to A transversion in the BLG promoter region. J Dairy Sci. 2006;89:4414–9.View ArticlePubMedGoogle Scholar
- Boitard S, Rodríguez W, Jay F, Mona S, Austerlitz F. Inferring population size history from large samples of genome-wide molecular data—an approximate bayesian computation approach. PLoS Genet. 2016;12:e1005877.View ArticlePubMedPubMed CentralGoogle Scholar
- Hickey JM, Crossa J, Babu R, de los Campos G. Factors affecting the accuracy of genotype imputation in populations from several maize breeding orograms. Crop Sci. 2012;52:654–63.View ArticleGoogle Scholar
- Wall JD, Tang LF, Zerbe B, Kvale MN, Kwok PY, Schaefer C, et al. Estimating genotype error rates from high-coverage next-generation sequence data. Genome Res. 2014;24:1734–9.View ArticlePubMedPubMed CentralGoogle Scholar
- Jansen S, Aigner B, Pausch H, Wysocki M, Eck S, Benet-Pagès A, et al. Assessment of the genomic variation in a cattle population by re-sequencing of key animals at low to medium coverage. BMC Genomics. 2013;14:446.View ArticlePubMedPubMed CentralGoogle Scholar
- Erbe M, Hayes BJ, Matukumalli LK, Goswami S, Bowman PJ, Reich CM, et al. Improving accuracy of genomic predictions within and between dairy cattle breeds with imputed high-density single nucleotide polymorphism panels. J Dairy Sci. 2012;95:4114–29.View ArticlePubMedGoogle Scholar
- Utsunomiya ATH, Santos DJA, Boison SA, Utsunomiya YT, Milanesi M, Bickhart DM, et al. Revealing misassembled segments in the bovine reference genome by high resolution linkage disequilibrium scan. BMC Genomics. 2016;17:705.View ArticlePubMedPubMed CentralGoogle Scholar
- Emam M, Tabatabaei S, Sargolzaei M, Cartwright SL, Schenkel FS, Miglior F, et al. Evaluating the accuracy of imputation in the highly polymorphic MHC region of genome. J Anim Sci. 2016;94:174–5.Google Scholar
- English AC, Salerno WJ, Hampton OA, Gonzaga-Jauregui C, Ambreth S, Ritter DI, et al. Assessing structural variation in a personal genome—towards a human reference diploid genome. BMC Genomics. 2015;16:286.View ArticlePubMedPubMed CentralGoogle Scholar
- Sun C, Wu X-L, Weigel KA, Rosa GJM, Bauck S, Woodward BW, et al. An ensemble-based approach to imputation of moderate-density genotypes for genomic selection with application to Angus cattle. Genet Res (Camb). 2012;94:133–50.View ArticleGoogle Scholar
- Ma P, Brøndum RF, Zhang Q, Lund MS, Su G. Comparison of different methods for imputing genome-wide marker genotypes in Swedish and Finnish Red cattle. J Dairy Sci. 2013;96:4666–77.View ArticlePubMedGoogle Scholar
- Das S, Forer L, Schönherr S, Sidore C, Locke AE, Kwong A, et al. Next-generation genotype imputation service and methods. Nat Genet. 2016;48:1284–7.View ArticlePubMedGoogle Scholar
- DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011;43:491–8.View ArticlePubMedPubMed CentralGoogle Scholar
- Goddard ME, Hayes BJ. Genomic selection based on dense genotypes inferred from sparse genotypes. Proc Assoc Advmt Anim Breed Genet. 2009;18:26–9.Google Scholar
- Schwarzenbacher H, Burgstaller J, Seefried FR, Wurmser C, Hilbe M, Jung S, et al. A missense mutation in TUBD1 is associated with high juvenile mortality in Braunvieh and Fleckvieh cattle. BMC Genomics. 2016;17:400.View ArticlePubMedPubMed CentralGoogle Scholar
- Li H, Sargolzaei M, Schenkel F. Accuracy of whole- genome sequence genotype imputation in cattle breeds. In: Proceedings of the 10th world congress on genetics applied to livestock production: 18–22 August 2014; Vancouver; 2014.Google Scholar
- Bouwman AC, Veerkamp RF. Consequences of splitting whole-genome sequencing effort over multiple breeds on imputation accuracy. BMC Genet. 2014;15:105.View ArticlePubMedPubMed CentralGoogle Scholar
- Frischknecht M, Neuditschko M, Jagannathan V, Drögemüller C, Tetens J, Thaller G, et al. Imputation of sequence level genotypes in the Franches-Montagnes horse breed. Genet Sel Evol. 2014;46:63.View ArticlePubMedPubMed CentralGoogle Scholar
- Zheng J, Li Y, Abecasis GR, Scheet P. A comparison of approaches to account for uncertainty in analysis of imputed genotypes. Genet Epidemiol. 2011;35:102–10.View ArticlePubMedPubMed CentralGoogle Scholar
- Khatkar MS, Thomson PC, Raadsma HW. Utility of imputed SNP genotypes for genome-wide association studies in dairy cattle. Proc Assoc Advmt Anim Breed Genet. 2013;20:554–7.Google Scholar
- Gonzalez-Recio O, Daetwyler HD, MacLeod IM, Pryce JE, Bowman PJ, Hayes BJ, et al. Rare variants in transcript and potential regulatory regions explain a small percentage of the missing heritability of complex traits in cattle. PLoS One. 2015;10:e0143945.View ArticlePubMedPubMed CentralGoogle Scholar
- Wang T, Chen YPP, Bowman PJ, Goddard ME, Hayes BJ. A hybrid expectation maximisation and MCMC sampling algorithm to implement Bayesian mixture model based genomic prediction and QTL mapping. BMC Genomics. 2016;17:744.View ArticlePubMedPubMed CentralGoogle Scholar
- Barendse W. The effect of measurement error of phenotypes on genome wide association studies. BMC Genomics. 2011;12:232.View ArticlePubMedPubMed CentralGoogle Scholar
- Bickel RD, Kopp A, Nuzhdin SV. Composite effects of polymorphisms near multiple regulatory elements create a major-effect QTL. PLoS Genet. 2011;7:e1001275.View ArticlePubMedPubMed CentralGoogle Scholar
- Kühn C, Thaller G, Winter A, Bininda-Emonds ORP, Kaupe B, Erhardt G, et al. Evidence for multiple alleles at the DGAT1 locus better explains a quantitative trait locus with major effect on milk fat content in cattle. Genetics. 2004;167:1873–81.View ArticlePubMedPubMed CentralGoogle Scholar
- Gao X, Becker LC, Becker DM, Starmer JD, Province MA. Avoiding the high Bonferroni penalty in genome-wide association studies. Genet Epidemiol. 2010;34:100–5.PubMedPubMed CentralGoogle Scholar
- The FAANG Consortium, Andersson L, Archibald AL, Bottema CD, Brauning R, Burgess SC, et al. Coordinated international action to accelerate genome-to-phenome with FAANG, the Functional Annotation of Animal Genomes project. Genome Biol. 2015;16:57.View ArticleGoogle Scholar