Accurate sequence variant genotyping in cattle using variation-aware genome graphs

Background Genotyping of sequence variants typically involves, as a first step, the alignment of sequencing reads to a linear reference genome. Because a linear reference genome represents only a small fraction of all the DNA sequence variation within a species, reference allele bias may occur at highly polymorphic or divergent regions of the genome. Graph-based methods facilitate the comparison of sequencing reads to a variation-aware genome graph, which incorporates a collection of non-redundant DNA sequences that segregate within a species. We compared the accuracy and sensitivity of graph-based sequence variant genotyping using the Graphtyper software to two widely-used methods, i.e., GATK and SAMtools, which rely on linear reference genomes using whole-genome sequencing data from 49 Original Braunvieh cattle. Results We discovered 21,140,196, 20,262,913, and 20,668,459 polymorphic sites using GATK, Graphtyper, and SAMtools, respectively. Comparisons between sequence variant genotypes and microarray-derived genotypes showed that Graphtyper outperformed both GATK and SAMtools in terms of genotype concordance, non-reference sensitivity, and non-reference discrepancy. The sequence variant genotypes that were obtained using Graphtyper had the smallest number of Mendelian inconsistencies between sequence-derived single nucleotide polymorphisms and indels in nine sire-son pairs. Genotype phasing and imputation using the Beagle software improved the quality of the sequence variant genotypes for all the tools evaluated, particularly for animals that were sequenced at low coverage. Following imputation, the concordance between sequence- and microarray-derived genotypes was almost identical for the three methods evaluated, i.e., 99.32, 99.46, and 99.24% for GATK, Graphtyper, and SAMtools, respectively. Variant filtration based on commonly used criteria improved genotype concordance slightly but it also decreased sensitivity. Graphtyper required considerably more computing resources than SAMtools but less than GATK. Conclusions Sequence variant genotyping using Graphtyper is accurate, sensitive and computationally feasible in cattle. Graph-based methods enable sequence variant genotyping from variation-aware reference genomes that may incorporate cohort-specific sequence variants, which is not possible with the current implementation of state-of-the-art methods that rely on linear reference genomes. Electronic supplementary material The online version of this article (10.1186/s12711-019-0462-x) contains supplementary material, which is available to authorized users.


Introduction
The sequencing of important ancestors of many cattle breeds revealed millions of sequence variants that are polymorphic in dairy and beef populations [1][2][3][4]. In order to compile an exhaustive catalog of polymorphic sites that segregate in Bos taurus, the 1000 bull genomes consortium was established [5]. The 1000 bull genomes project imputation reference panel facilitates to infer sequence variant genotypes for large cohorts of genotyped animals thus enabling genomic investigations at nucleotide resolution [5][6][7][8].
Sequence variant discovery and genotyping typically involves two steps that are carried out successively [9][10][11][12]: first, raw sequencing data are generated, filtered to remove bases with low sequencing quality, and aligned towards a linear reference genome using, e.g., Bowtie [13] or the Burrows-Wheeler Alignment (BWA) software [14]. Second, the aligned reads are compared to the nucleotide sequence of a reference genome in order to discover and genotype polymorphic sites using, e.g., SAMtools [15] or the Genome Analysis Toolkit (GATK) [16][17][18]. The sensitivity and specificity of sequence variant discovery is higher using multi-sample than single-sample approaches particularly when the sequencing depth is low [19][20][21][22][23]. However, the genotyping of sequence variants from multiple samples simultaneously is a computationally demanding task, particularly when the number of sequenced individuals is high [18]. The multi-sample sequence variant genotyping approach that has been implemented in the SAMtools software has to be restarted for the entire cohort once new samples are added. In contrast, the joint genotyping approach that has been implemented in GATK separates the discovery of polymorphic sites within samples from the estimation of genotype likelihoods across samples. Once new samples are added to an existing cohort, only the latter needs to be performed for the entire cohort, thus enabling computationally efficient parallelization of sequence variant genotyping in a large number of samples.
Ideally, the accuracy and sensitivity of sequence variant discovery is assessed using highconfidence sequence variants as a truth set e.g., Genome in a Bottle, GIAB [24]. For species where such a resource is not available, the accuracy of sequence variant genotyping may be evaluated by comparing sequence variant to microarray-derived genotypes. Due to the ascertainment bias of the microarray-derived SNPs, this type of comparison may overestimate the accuracy of sequence variant discovery [25,26].
Sequence variant genotyping approaches that rely on alignments to a linear reference genome have limitations to variant discovery, because a monoploid reference sequence does not reflect variation within a species. As a result, read alignments may be erroneous particularly at genomic regions that differ substantially between the sequenced individual and the reference sequence, thus introducing reference allele bias, flawed genotypes, and false-positive variant discovery around indels [27][28][29]. Aligning reads to population-or breed-specific reference genomes may overcome most of these problems [30][31][32].
However, considering multiple (population-specific) linear reference genomes with distinct genomic coordinates complicates the biological interpretation and annotation of sequence variant genotypes across populations [33].
Genome graph-based methods consider non-linear reference sequences for variant discovery [33][34][35][36][37]. A variation-aware genome graph may incorporate distinct (populationspecific) reference sequences and known sequence variants. Recently, the Graphtyper software has been developed in order to facilitate sequence variant discovery from a genome graph that has been constructed and iteratively augmented using variation of the sequenced cohort [34]. So far, sequence variant genotyping using variation-aware genome graphs has not been evaluated in cattle.
In this study, we compare sequence variant discovery and genotyping from a variationaware genome graph using Graphtyper to two state-of-the-art methods (GATK, SAMtools) that rely on linear reference genomes in 49 Original Braunvieh cattle. We compare sequence variant to microarray-derived genotypes in order to assess sensitivity and accuracy of sequence variant genotyping for each of the three methods evaluated.

Results
Following quality control (removal of adapter sequences and low-quality bases), we aligned more than 13 billion paired-end reads (2 x 125 and 2 x 150 basepairs) from 49 Original Braunvieh cattle to the UMD3.1 assembly of the bovine genome. On average, 98.44% (91.06-99.59%) of the reads mapped to the reference genome. 4.26% (2.0-10.91%) of the mapped reads were flagged as duplicates and not considered for further analyses. The average sequencing depth per animal was 12.75 and it ranged from 6.00 to 37.78. Although the re-alignment of sequencing reads around indels is not required when sequence variants are genotyped using the latest version of GATK (v 4), it is recommended to improve the genotyping of indels using SAMtools. To ensure an unbiased comparison of the three methods evaluated, we realigned the reads around indels and used the re-aligned BAM files as input for all tools (Figure 1). All sequencing read data were deposited at the European Nucleotide Archive (ENA) (http://www.ebi.ac.uk/ena) under primary accession PRJEB28191.

Sequence variant discovery and genotyping
Polymorphic sites (SNPs, short insertions and deletions) were discovered and genotyped in the 49 animals using either GATK (version 4), Graphtyper (version 1.3) or SAMtools (version 1.8). All software programs were run using well-established recommendations and workflow descriptions for variant discovery (Figure 1, also see Material and Methods). Only autosomal sequence variants were considered to evaluate the accuracy and sensitivity of sequence variant genotyping.
An intersection of 15,901,526 biallelic SNPs was common to all sequence-variant discovery tools evaluated (Figure 2A) i.e., SNPs that were detected by one but not the other tools was highest for GATK and lowest for Graphtyper.
The number of biallelic indels ( Figure 2B) that was common to all tools evaluated was 1,299,467 and 98,931 (13.13%) of those were novel, i.e., they were not present in dbSNP 150. The intersection of the three tools was considerably smaller for indels than SNPs.
Graphtyper had the highest proportion of indels in common with the other tools (74.11%).
SAMtools discovered the highest number (2,704,413) of biallelic indels and most of them (41.26%) were not detected using either GATK or Graphtyper. GATK (21.2 %) and Graphtyper (12.38%) discovered less private indels than SAMtools.   Blue horizontal bars represent the total number of sites discovered for each method. Vertical bars indicate private and common variants detected by the methods evaluated. Intersection of Indels (million)

Sequence variant genotyping using Graphtyper is accurate
Genotype concordance, non-reference sensitivity and non-reference discrepancy were calculated between the microarray-derived genotypes at 732,970 autosomal SNPs from the BovineHD Bead chip and the sequence variant genotypes that were obtained using either GATK, Graphtyper, or SAMtools ( Table 3). All metrics were calculated either before or after applying the algorithm implemented in the Beagle software for haplotype phasing and imputation.
The proportion of missing non-reference sites was 1.90%, 0.56%, and 0.47% using GATK, Graphtyper, and SAMtools, respectively. The genotype concordance between the sequenceand microarray-derived genotypes was higher using Graphtyper (97.72%) than either SAMtools (97.68%) or GATK (95.99%). The genotype concordance was higher for homozygous than heterozygous sites, particularly in animals that have been sequenced at low depth (Additional file 1 Figure S1). In order to take the variable proportions of missing genotypes in the sequence variants into account, we calculated non-reference sensitivity and non-reference discrepancy. The non-reference sensitivity is calculated for all positions where microarray-derived genotypes are either heterozygous or homozygous for the alternate allele, thus reflecting the sensitivity to discover variant sites. The non-reference discrepancy reflects the genotype concordance of all but the concordant reference sites.
Next, we analysed the proportion of opposing homozygous genotypes for SNPs and indels in nine sire-son pairs that were included among the sequenced animals ( Table 4). We observed that SNPs that were discovered using either Graphtyper or SAMtools had almost a similar proportion of genotypes with mendelian inconsistencies (0.33% and 0.36%). The proportion of genotypes with mendelian errors was two times higher using GATK (0.73%).

Table 3 Comparisons between microarray-derived and sequence variant genotypes.
Genotype concordance, non-reference sensitivity and non-reference discrepancy (in percentage) was calculated between the genotypes from the BovineHD SNP Bead chip and sequence -derived genotypes for 49 Original Braunvieh cattle considering either the raw or imputed (imp) sequence variant genotypes before (full) and after (filtered) recommended quality control was applied on the variants. Bold type indicates the best value that was observed for each parameter. Table 4 Proportions of opposing homozygous genotypes observed in nine sire-son pairs.
The ratio (in percentage) was calculated using autosomal sequence variants considering either the raw or imputed (imp) sequence variant genotypes before (full) and after (filtered) variants with low quality were removed from the data. Bold type indicates the best value that was observed for a given parameter. Graphtyper, and SAMtools datasets, respectively. The genotype concordance between sequence-and microarray-derived genotypes was slightly higher in the filtered than raw genotypes, but the non-reference sensitivity was less in the filtered than raw genotypes indicating that the filtering step also removed some true variant sites from the raw data (Table 3 & 4). We used the Beagle software to refine the primary genotype calls and infer missing genotypes in the raw and filtered datasets. Following imputation, the quality of the sequence variant genotypes increased for all tools evaluated particularly in cattle where the sequencing coverage was less than 10-fold (Figure 3). The largest increase in the concordance metrics was observed for the sequence variants obtained using GATK ( Table   3 & 4). Following imputation, the variants identified using Graphtyper had the highest quality in eight out of ten metrics evaluated.

Computing requirements
The multi-sample sequence variant genotyping pipelines that were implemented using either GATK or SAMtools were run separately for each chromosome in single-threading mode. The To genotype 21,140,196 polymorphic sequence variants in 49 animals, the GATK pipeline required a total of 2792 CPU hours (Figure 4).
The Graphtyper pipeline including the construction of the variation graph and the genotyping of sequence variants was run in parallel for 2538 non-overlapping segments of 1 million basepairs as recommended by Eggertson et al. [34]. The peak RAM required by Graphtyper was between 1 and 3 GB per segment. Twelve segments for which Graphtyper either ran out of memory or did not finish within the allocated time were subdivided into smaller segments of 10 Kb and subsequently re-run (Additional file 2 Table S1). The genotyping of 20,262,913 polymorphic sites in 49 animals using our implementation of the Graphtyper pipeline took a total of 1066 CPU hours (Figure 4).

Figure 4 Computing time required to genotype all autosomal sequence variants in 49
Original Braunvieh cattle.
The runtime of GATK and Graphtyper is shown for the different steps (see Figure 1 for more details). Step 1 Step 2 Step 3 Step

Discussion
In this study, we used either GATK, Graphtyper or SAMtools to discover and genotype polymorphic sequence variants in whole genome sequencing data of 49 Original Braunvieh cattle that have been sequenced at between 6 and 38-fold genome coverage. While SAMtools and GATK discover variants from a linear reference genome, Graphtyper locally re-aligns reads to a variation-aware reference graph that incorporates cohort-specific sequence variants [34]. Our graph-based variant discovery pipeline that has been implemented using the Graphtyper software used the existing bovine reference sequence to construct the genome graph. The graph was subsequently augmented with variants that were detected from linear alignments of 49 Original Braunvieh cattle [34]. The use of more sophisticated genome graph-based approaches that have been developed very recently facilitates to map raw sequencing reads directly against a genome graph without the need to first align reads towards a linear reference [36]. While genome graph-based variant discovery has been explored recently in mammalian-sized genomes [29,33,34,37], our work is the first to apply graph-based sequence variant genotyping in cattle.
In order to evaluate graph-based variant discovery in cattle, we compared accuracy and sensitivity of Graphtyper to SAMtools and GATK, i.e., two state-of-the-art methods on linear reference genomes that have been evaluated thoroughly in many species including cattle [2,19]. Our evaluation of the software tools may suffer from ascertainment bias because we relied on SNPs that have been included in the BovineHD SNP array, i.e., they are located predominantly at well accessible regions of the genome [25,26,38]. Thus, the global accuracy and sensitivity of sequence variant discovery might be overestimated in our study.
However, this ascertainment bias is unlikely to affect the relative performance of the methods evaluated. In 49 Original Braunvieh cattle, sequence variant genotyping was more accurate using Graphtyper than either GATK or SAMtools. However, the difference in accuracy was small between the three tools particularly in samples that were sequenced at high coverage. Irrespective of the methods evaluated, we observed heterozygous undercalling in animals that had been sequenced at low coverage, i.e., heterozygous variants were erroneously genotyped as homozygous due to an insufficient number of sequencing reads supporting the heterozygous genotype [9,[39][40][41]. In agreement with previous studies [2,5], Beagle imputation improved genotype concordance and reduced heterozygous undercalling particularly in cattle that had been sequenced at low coverage. After the imputation step, the genotype concordance, non-reference sensitivity, and non-reference discrepancy of the three tools was almost identical, indicating that genotyping sequence variants from samples with medium coverage is possible at high accuracy using either of the three tools evaluated and subsequent Beagle imputation. While this has been concluded previously for SAMtools and GATK [2,19], our findings demonstrate that the genotype likelihoods estimated from the Graphtyper software are also compatible with and benefit from the imputation algorithm implemented in the Beagle software. However, integrating the phasing and imputation of missing genotypes directly in a graph-based variant genotyping approach would simplify sequence variant genotyping from variation-aware graphs [33,42,43].
Using Graphtyper for variant genotyping and Beagle for genotype refinement enabled us to genotype sequence variants in 49 Original Braunvieh cattle at a genotypic concordance of 99.52%, i.e., higher than previously achieved using either GATK or SAMtools to genotype variants in cattle that had been sequenced at similar genome coverage [2-5, 19, 44], indicating that graph-based variant discovery improves sequence variant genotyping. The accuracy and sensitivity of graph-based variant discovery may be higher when known variants are incorporated in the initial construction of the variation graph [34]. Indeed, we observed a slight increase in genotype concordance (Additional file 3 Figure S2) when sequence variants were genotyped from a variation-aware genome-graph that incorporated variants that are listed in bovine dbSNP 150.
Using microarray-derived genotypes as a truth set does not allow for assessing accuracy and sensitivity of indel discovery because variants other than SNPs are currently not routinely genotyped with commercially available microarrays. The proportion of opposing homozygous genotypes between parent-offspring pairs may be a useful diagnostic metric to detect sequencing artefacts or flawed genotypes at indels [45]. Our results show that genotypes at indels are more accurate using Graphtyper than either SAMtools or GATK because Graphtyper produced less opposing homozygous genotypes at indels in nine sireson pairs than the other methods. These findings are in line with Eggertsson et al [34], who showed that the mapping of the sequencing reads against a variation-aware graph could improve read alignment nearby indels, enabling highly accurate sequence variant genotyping also for variants other than SNPs. Recently, Garrison et al. [36] showed that graph-based variant discovery may also mitigate reference allele bias. An assessment of reference allele bias was, however, not possible in our study because the sequencing depth was too low for most samples.
In our study, Graphtyper required less computing time than GATK to genotype sequence variants for 49 cattle. SAMtools required the least computing resources likely because the implemented mpileup algorithm readily produces genotypes from the aligned reads without performing the computationally intensive local re-alignment of the reads. However, with an increasing number of samples, the multi-sample variant genotyping implementation of GATK seems to be more efficient than SAMtools mpileup because variant discovery within samples can be separated from the joint genotyping across samples [18]. A highly parallelized graphbased variant discovery pipeline also offers a computationally efficient and scalable framework for variant discovery in thousands of samples [34]. However, the computing time for graph-based variant genotyping might be high in genomic regions where the nucleotide diversity is high or the assembly is flawed [37,46]. In our study, the algorithm implemented in the Graphtyper software failed to finish within the allocated time for twelve 1 Mb segments including a segment on chromosome 12 that contains a large segmental duplication [47][48][49] possibly because many mis-mapped reads increased graph complexity. Our results show that Graphtyper may also produce genotypes for problematic segments when they are split and processed in smaller bits. Moreover, most of these problems disappeared when we considered the latest assembly of the bovine genome, possibly corroborating that more complete and contiguous genome assemblies may facilitate more reliable genotyping from variation-aware graphs [25,50].

Conclusion
Genome graphs facilitate sequence variant discovery from non-linear reference genomes.
Sequence variant genotyping from a variation-aware graph is possible in cattle using Graphtyper. Sequence variant genotyping at both SNP and indels is more accurate and sensitive using Graphtyper than either SAMtools or GATK. The proportion of mendelian inconsistencies at both SNP and indels is low using Graphtyper indicating that sequence variant genotyping from a variation-aware genome graph facilitates accurate variant discovery at different types of genetic variation. Considering highly-informative variationaware genome graphs that have been constructed from multiple breed-specific de-novo assemblies and high-confidence sequence variants facilitates more accurate, sensitive and unbiased sequence variant genotyping in cattle.

Animal Selection
We selected 49 Original Braunvieh bulls that were either frequently used in artificial insemination or explained a large fraction of the genetic diversity of the breeding population.
Semen straws of the bulls were purchased from an artificial insemination center and DNA was prepared following standard DNA extraction protocols.

Sequence variant discovery
We followed the best practice guidelines recommended for variant discovery and genotyping using GATK (version 4.0.6) [16,17,23]. First, genotype likelihoods were calculated separately for each sequenced animal using GATK HaplotypeCaller [18], resulting in files in gVCF (genomic Variant Call Format) format for each sample [56]. The gVCF files from 49 samples were consolidated using GATK GenomicsDBImport. Subsequently, GATK CombineGVCFs was applied to genotype polymorphic sequence variants for all samples simultaneously. successively. First, sequence variants were identified from the read alignments that were produced using BWA mem (see above). Next, these cohort-specific variants were used to augment the UMD3.1 reference genome and construct the variation-aware genome graph.
Subsequently, the sequencing reads were locally re-aligned against the variation-aware graph. A clean variation graph was produced by removing unobserved haplotypes paths from the raw graph. In the final step, genotypes were identified from the re-aligned reads in the clean graph. The Graphtyper pipeline was run in segments of 1 million base-pairs and whenever the program failed to genotype variants for a particular segment either because it run out of memory or exceeded the allocated runtime of twelve hours, the interval was subdivided into smaller segments (10 kb).
Our implementation of SAMtools mpileup (version 1.8) [57] was run in multi-sample mode to calculate genotype likelihoods from the aligned reads for all samples simultaneously. The parameters -E and -t were used to recalculate (and apply) base alignment quality and produce per-sample genotype annotations, respectively. Next, the estimated genotype likelihoods were converted into genotypes using BCFtools call using the -v and -m flags to output variant sites only and permit that sites may have more than two alternative alleles, respectively.

Sequence variant filtering and genotype refinement
The GATK VariantFiltration module was used to parse and filter the raw VCF files. Quality control on the raw sequencing variants and genotypes was applied according to well

Evaluation of sequence variant genotyping
To ensure consistent variant representation across the different sequence variant genotyping methods evaluated, we applied the vt normalize software (version 0.5) [60].
Normalized variants are parsimonious (i.e., represented by as few nucleotides as possible and left aligned [60]). The number of variants detected and Ti/Tv ratios were calculated using vt peek [60] and BCFtools stats [57]. The intersection of variants that were common to the tools evaluated was calculated and visualized using BCFtools isec [57] and the UpSet R package [61], respectively.
Mendelian inconsistencies were calculated as the proportion of variants showing opposing homozygous genotypes in nine parent-offspring pairs that were included among the 49 sequenced animals. For this comparison, we considered only sites where the genotypes of both sire and son were not missing.
The sequenced animals of this study were also genotyped using the Illumina BovineHD Bead Chip comprising 777,962 SNPs. In order to assess the quality of sequence variant genotyping, the genotypes detected by the different variant calling methods were compared to the microarray-derived variants in terms of genotype concordance, non-reference sensitivity and non-reference discrepancy [23,38], see Additional file 4 Figure S3 for more details on the metrics.

Computing environment and statistical analysis
All computations were performed on the ETH Zurich Leonhard Open Cluster with access to multiple nodes equipped with 18 cores Intel Xeon E5-2697v4 processors (base frequency rated at 2.3 GHz) and 128 GB of random-access memory. Unless otherwise stated, the R (version 3.3.3) software environment [62] was used for statistical analyses and ggplot2 (version 3.0.0) [63] was used for data visualisation.

Declarations Animal ethics statement
Not applicable

Author contributions
Analyzed data: CD HP; Wrote the software pipelines: CD; Generated sequencing data: CW; Wrote the paper: CD HP; Read and approved the final version of the paper: CD CW HP

Additional files
Additional file 1 Figure S1

Additional file 2 Table S1
File format: xlsx Description: Twelve 1 Mb regions where Graphtyper initially failed to genotype sequence variants because the algorithm either ran out of memory or exceeded the allocated runtime (12 hours). Graphtyper eventually produced genotypes for the sequence variants when these regions were re-run in 10 kb segments.