Construction of a large collection of small genome variations in French dairy and beef breeds using whole-genome sequences

Background In recent years, several bovine genome sequencing projects were carried out with the aim of developing genomic tools to improve dairy and beef production efficiency and sustainability. Results In this study, we describe the first French cattle genome variation dataset obtained by sequencing 274 whole genomes representing several major dairy and beef breeds. This dataset contains over 28 million single nucleotide polymorphisms (SNPs) and small insertions and deletions. Comparisons between sequencing results and SNP array genotypes revealed a very high genotype concordance rate, which indicates the good quality of our data. Conclusions To our knowledge, this is the first large-scale catalog of small genomic variations in French dairy and beef cattle. This resource will contribute to the study of gene functions and population structure and also help to improve traits through genotype-guided selection. Electronic supplementary material The online version of this article (doi:10.1186/s12711-016-0268-z) contains supplementary material, which is available to authorized users.


Background
In recent years, advances in high-throughput sequencing technologies have offered the opportunity to partially or completely re-sequence genomes, in a relatively cost-effective manner. The availability of whole-genome sequence (WGS) data for an increasing number of individuals offers new opportunities to study genetic variations at the genomic level with unprecedented accuracy.
In the past few years, several whole-genome sequencing studies have been carried out in different dairy and beef cattle breeds and identified a huge number of single nucleotide polymorphisms (SNPs) and small insertions and deletions (InDels) [1][2][3][4][5]. To date, the Ensembl (http://www.ensembl.org) short variation database contains over 99 million SNPs and InDels identified in several cattle breeds. During the first phase of the 1000 bull genomes project, the genomes of 234 bulls were sequenced, which has enabled the identification of over 28 million reliable SNPs and InDels [5]. Only 13 French bulls were included in this phase.
In this work, we performed a large-scale study to investigate both SNPs and small InDels in whole-genome sequencing data for 274 animals from several major French dairy and beef breeds. The collection of genome variations reported in this study will be useful to study their potential links with the genetic variability of economically important traits. available in our laboratory and were produced as previously described [1].

Whole-genome sequencing and sequence alignment to the reference
The whole genome of 274 animals corresponding to both French dairy and beef breeds (Table 1) were used for 2 × 100 bp paired-end sequencing on an Illumina HiSeq 2000 with a TruSeq SBS v3-HS Kit (Illumina).
Sequence alignments were carried out using the Burrows-Wheeler Alignment tool (BWA-v0.6.1-r104) [6] with the aln option with default parameters for mapping reads to the UMD3.1 bovine reference genome [7]. Potential PCR duplicates, which can adversely affect the variant calls, were removed using the MarkDuplicates tools from the Picard package version 1.4.0 [8]. Only properly paired reads with a mapping quality of at least 30 (−q = 30) were retained. The resulting BAM files were then used for all subsequent analyses.

Identification of small insertions and deletions
Small genomic variations were detected using the Genome Analysis Tool Kit 2.4-9 (GATK) version and GATK-UnifiedGenotyper as SNP caller [9]. Prior to variant discovery, reads were subjected to local realignment, coordinate sorting, quality recalibration, and removal of PCR duplicates. In the GATK analysis, we used a minimum confidence score threshold of Q30 with default parameters. We also used multi-sample variant calling in order to distinguish between a homozygous reference genotype and a missing genotype in the analyzed samples.

Variant annotation
All variants were annotated with the Ensembl variant effect predictor (VEP) pipeline v81 [10] based on the Ensembl version 81 transcript set and using dbSNP build 143. The effect of the amino acid changes was predicted using SIFT [11,12], a sequence homology-based tool that can determine whether an amino acid substitution in a protein is deleterious or tolerant.

Functional characterization of protein-coding genes with LoF variants
A set of 8337 gene products was used for gene ontology (GO) enrichment and functional analyses, using the GO [13] and the KEGG (Kyoto Encyclopedia of Genes and Genomes) [14] database resources. The Cytoscape [15] ClueGO plugin [16] was used to identify the biological functions to which genes contribute. The enrichment of biological terms and groups were set as follows. First, we used the enrichment tests based on the hyper-geometric distribution. Second, we set the statistical significance to 0.05 (p ≤ 0.05), and we used the Benjamini-Hochberg adjustment to correct the p value for the terms and the groups created by ClueGO. Third, we used fusion criteria to reduce the redundancy of related terms that have similar associated genes. Finally, we set the Kappa-statistics score threshold to 0.6.
Gene Ontology (GO) enrichment was also performed using the MouseMine analysis tools available at the MGI international database resources (http://www.mousemine.org/mousemine/begin.do).

Validation of LoF variants by high-throughput genotyping
The efficiency of our calling approach and the relevance of the resulting variants were assessed by genotyping a selected panel containing 304 heterozygous deleterious missense and loss-of-function SNPs for which no homozygous individual for the alternative allele was observed in our population. Genotyping was performed using the already available Illumina BovineLD custom BeadChip [17] and a panel of 172,416 beef and dairy cattle animals ( Table 2).

Whole-genome sequencing and read mapping
Two hundred and seventy four animals corresponding to both French dairy and beef breeds were selected for whole-genome sequencing (Table 1), of which 62 whole-genome sequences were already published [1]; the Illumina short reads are available at the European Nucleotide Archive (ENA) with study accession number PRJEB9343 (http://www.ebi.ac.uk/ena/data/view/ PRJEB9343). Overall, 103 billion raw paired-end reads 100-bases long were generated, which resulted in over Cross-breed 1 ten thousand gigabases of data. On average, 95% (from 56 to 99%) of the paired-end reads were properly aligned on the UMD3.1 bovine reference genome (see Additional file 1), which is in agreement with previous studies [1,18]. The average genome-wide sequence coverage from the mapped reads was 13.8× and ranged from almost 5× to around 36× across the different genomes, with 236 samples sequenced at least at 10-fold average coverage (see Additional file 1).  (11,770) corresponded to tri-allelic SNPs and 32.7% (5723) were also InDels but with multiple alleles in the Ensembl variation 83 database. In addition, we identified 88,289 positions that displayed one type of variant (i.e. SNP or InDel but not both) in our dataset but which overlapped with multiple variant types in the Ensembl variation 83 database. We also identified 517,417 variants for which the alleles differed between our dataset and the Ensembl 83 variation database. These inconsistencies could be partly explained by the use of different variant calling algorithms. Indeed, a previous study in Danish Holstein dairy cattle also reported similar inconsistencies [3]. In that study, genotype accuracy was assessed for 15 variants for which samtools-derived genotypes differed from those predicted by GATK. Their results revealed that GATK provided more accurate genotype calls than samtools.

Evaluation of sequencing genotypes
To evaluate the quality of our sequencing data-derived genotypes, we performed three different analyses. First, we used the ratio of transitions over transversions (Ts/ Tv) as a diagnostic measure to assess the quality of our sequencing data. The average Ts/Tv ratio observed in our whole-genome sequencing data was 2.12 and ranged from 2.05 on BTA6 to 2.35 on BTA25 (Fig. 1). This average rate is within the same range as those observed in other species. For example, in human whole-genome sequence data, the genome-wide Ts/Tv ratio ranged from 2.0 to 2.2 [19,20]. In mouse and pig, similar ratios were reported i.e. about 2.0 [21] and 2.04 ± 0.28 [22], respectively. DePristo et al. [19] indicated that the Ts/Tv ratio should be around 2.1 for whole-genome sequencing and that lower ratios may indicate that the sequencing data includes false positives caused by random sequencing errors. Therefore, the Ts/Tv ratio estimated in our study is indicative of good sequencing data quality.
Second, we measured the call rate by estimating the percentage of samples presenting a known genotype for each variant. On average, 95% of the variants were called in more than 90% of the samples with 13% (3,655,506) of the variants being genotyped in all 274 samples (Fig. 2).
Third, we compared our sequencing data-derived genotypes to SNP array-derived genotypes using the Illumina High-Density (HD) Bovine SNP BeadChip ® which includes 777,962 SNPs [23]. Overall, both genotyping data sources were available for 152 samples. The average genotype concordance rate was around 99.1% and ranged from 91.7 to 99.8% (see Additional file 3). We also observed a dependency of chip genotype concordance on sequencing depth (see Additional file 3; Fig. 3). Lower accuracy rates were found for samples with a low depth of coverage (less than 10×). For 21 samples, the concordance rate was less than 98% but their depth of coverage was higher than 11×. Of these 21 samples, 20 had a concordance rate between 95 and 97% and were considered as acceptable. The observed lower concordance rates could be partly due to lower sequence data quality compared to the rest of our sample set. The low missing rate and high concordance rate observed in our study can be explained by the good average genome-wide sequence coverage of the mapped reads in our data. Indeed, more than 86% of our samples were sequenced at least at an average 10-fold coverage. Another reason is the efficiency of our variation calling strategy.

Functional annotation of small genome variants
Functional annotation of the identified small genome variants was carried out using the Ensembl VEP annotation software [10]. Overall, 66% of the annotated variants were located in intergenic regions and almost 30% were identified within gene intronic sequences ( Table 3). The remaining 4% were located within gene-coding, upstream and downstream regions. Of these, 85,038 variants were located within the 5′ or 3′ untranslated regions (UTR), 171 were located within genes coding for micro RNAs (miRNAs), 96,711 missense mutations were identified within gene coding regions, 358 InDels were predicted to cause inframe insertions and 814 InDels were predicted to cause inframe deletions.
Overall, we identified 2120 variants that affected splice sites. These included 1471 splice donor and 649 splice acceptor site variants. In addition, 1159 variants were predicted to create a premature stop codon and 68 to disrupt a termination codon. Around 2287 InDels were predicted to cause a frameshift in coding sequences which were considered as loss-of-function (LOF) variants and may result in reduced or complete inactivation of protein functions by disrupting either the protein-coding gene itself or genetic regulatory elements. These LOF variant candidates are of particular interest since they might have effects on economically important traits.
Among the annotated deleterious missense and LOF variants, we identified several mutations that were previously reported to be associated with dairy and beef traits in cattle. For example, the amino acid change of phenylalanine to leucine at position 94 (F94L) of the myostatin (MSTN) protein was identified in 31 samples, among which six animals were heterozygous (three Charolaise, two Aubrac and one Rouge-des-Prés) and 25 were homozygous (19 Limousine and six Aubrac) for this locus. We also observed the MSTN pQ204* mutation in eight samples, all of which corresponded to the Charolaise breed and all animals were heterozygous. Both F94L and Q204* substitutions are associated with double muscling phenotypes in Limousine [24] and Charolaise [25] cattle, respectively. The F279Y mutation within the growth hormone receptor (GHR) gene was observed in 35 samples corresponding to 29 dairy and six beef cattle animals (four Blonde d' Aquitaine, one Brown Swiss, one Charolaise, two Montbéliarde, five Normande and 22 Holstein) with the highest frequency observed in the Holstein breed (19 heterozygous and three homozygous individuals for the alternative allele). This SNP is located on BTA20 and has been shown to be associated with milk yield and composition [26,27], feed intake, feed conversion efficiency and body energy traits [28].

Missense and LOF variants for which no homozygous individuals for the alternative allele are observed
Further analysis of the annotated variants revealed the presence of 14,469 missense and LOF variants with a significant biological impact based on SIFT predictions and for which no homozygous animal carrying the alternative allele was observed among the 274 WGS (see Additional file 4). These were subsequently considered as our study panel in the rest of this paper.
The genotype distribution of our study panel revealed that seven frameshift variants were breed-specific (Table 4). Integrated Genome Viewer (IGV) visualization and inspections of BAM files for animals carrying these mutations revealed that four of the seven frameshift mutations were spurious variant calls (results not shown). The three remaining frameshift variants could be visualized and confirmed by IGV and were therefore considered as true variants. First, a five nucleotide insertion (-/CACGT) at position 66,552,044 on BTA1 was identified in two Blonde d' Aquitaine animals. This frameshift mutation was absent in both the Ensembl database and in the most recent 1000 bull genomes project dataset which contains small genomic variations for 1577 animals corresponding to 48 different breeds (Daetwyler HD, personal communication  results in the addition of ten new amino acids followed by a new downstream termination site. The POLQ gene has been identified in several other species and was reported to play a major role in the DNA repair mechanism of double strand breaks (DSB) by alternative endjoining (alt-EJ; also called alternative non homologous end-joining (alt-NHEJ) or microhomology-mediated end joining)) [29][30][31][32][33]. Unlike the classical non homologous end-joining (c-NHEJ) mechanism, alt-EJ depends on resection of DNA ends to find microhomologies, which results in larger deletions and insertions [34,35]. Inhibition of POLQ functions in mice were reported to be associated with chromosome instability phenotypes [36]. In mammalian cells, POLQ promotes the formation of chromosomal translocations and is essential for survival when the homology-directed repair (HDR) mechanism is impaired [31], which suggests that this mutation may cause embryonic lethality in cattle. The two other frameshift mutations were identified in the Charolaise breed. The first one is a GACC insertion at position 149,472 on BTA19 and is located within an olfactory receptor gene coding sequence (ENS-BTAG00000045560). This variant was identified in three samples in our dataset and is also present in the Ensembl database. It leads to a frameshift mutation within the  7tm_1 (PF000001) pfam domain at amino acids 81 and 82 and creates 39 new amino acids followed by a termination site, thus producing a truncated protein, which contains only 26% (82 of 311 amino acids) of the wild type protein. The second frameshift mutation is a four nucleotide (-/AGTT) insertion identified at position 21,913,213 on BTA18. It was identified in two samples in our panel but it is absent in the Ensembl database. It is located within the retinoblastoma-like 2 (RBL2) coding gene region and leads to a frameshift mutation within the RB_B box (PF01857 pfam domain) at amino acids 890-891, thus introducing 26 new amino acids before creating a premature termination site. Thus, a truncated protein representing only 78% (890/1140 amino acids) of the wild type protein is produced. RBL2, also called pRb2/p130, is a member of the retinoblastoma family of tumor suppressors [37] and its expression was reported to be altered in several cancer types [38][39][40]. RBL2 interacts with the E2F4 and E2F5 transcription factors and results in negative regulation of the cell cycle [41]. It is also involved in many other cellular processes, such as regulation of apoptosis and differentiation [37] and control of the length of telomeres [42]. Finally, we identified the p.Q579* mutation within the APAF1 gene (HH1: Holstein Haplotype 1), the p.N290T deleterious missense mutation within the GART gene (HH4), the p.V180F deleterious missense mutation within the SLC35A3 gene (CVM: complex vertebral malformation), the p.Q52* stop-gained variant within the SHBG gene (MH1: Montbéliarde Haplotype1) and the R12* stop-gained variant within the SLC37A2 gene (MH2). All these substitutions are specific to the Holstein (HH1, HH4 and CVM) and Montbéliarde (MH1 and MH2) breeds, respectively and are considered to be strong candidate mutations for embryonic lethal defects [43].

Gene ontology and pathway analysis
In order to gain additional insight into the biological pathways and molecular functions that are affected by these variants, we performed a gene ontology (GO) enrichment and functional analysis using 8337 known Ensembl ID-associated genes retrieved from our variant annotation study (see Additional file 5). Several GO terms were significantly over-represented. For example, the six most enriched GO categories corresponding to biological processes were related to the regulation of GTPase-, Ras-, and Rho-mediated signal transduction. The three most enriched GO categories corresponding to cellular components were related to cytoskeleton and myosin complex and the five most enriched GO categories corresponding to molecular functions were related to ATP binding, adenine nucleotide binding, ATPase activity, motor activity and ribonucleotide binding.

Experimental validation of the study panel by high-throughput genotyping
Previous studies reported a significant rate of false positive calls among deleterious missense and loss-of-function variants [3,44]. Thus, the efficiency of our calling approach and the relevance of the resulting variants were assessed by genotyping a selected panel containing 304 heterozygous deleterious missense and loss-of-function mutations for which no homozygous individual for the alternative allele was observed in our population. They were also selected based on their mapping quality (above 50) and their calling quality (above 30) scores. Genotyping was performed using the already available Illumina BovineLD custom BeadChip [17] and a panel of 172,416 animals corresponding to both beef and dairy cattle breeds (Table 2). Overall, 276 (~91%) SNPs were polymorphic in all genotyped animals and were considered as true variants (see Additional file 6). Among these, 61 SNPs were private and were polymorphic only in one breed. Thus, they were considered as breed-specific variants i.e. two in Brown Swiss, three in Limousine, 12 in Montbéliarde, 27 in Normande, 16 in Holstein and one in Blonde d' Aquitaine. For 51 polymorphic SNPs, we observed only two genotypes. No homozygous individual for the alternative allele was observed among all genotyped samples. For these 51 variants, we determined the expected proportions of homozygous individuals for the alternative allele in each breed and then calculated the significance probability (p value) from the binomial distribution, with event probability equal to zero (which corresponded to the proportion of observed homozygous individuals for the alternative allele), and the number of observations was equal to the number of animals in each breed. For 41 of the 51 variants, there was no significant difference between the expected and the observed proportions (see Additional file 6). However, for the other 10 variants, the expected proportion was significantly different from the observed proportion in at least one breed (see Additional file 6). These corresponded to nine missense deleterious mutations and one LOF variant. This latter one corresponded to the p.Q579* mutation within the APAF1 gene (HH1: Holstein Haplotype 1) which was previously reported as a strong candidate mutation for embryonic lethal defects [43]. As expected, significant differences between the observed and estimated proportions of homozygous individuals for the alternative allele were only observed in the Holstein breed. Two other deleterious missense mutations were also located within CBX3 (chromobox protein homolog 3) and RBBP5 (RB binding protein 5, histone lysine methyltransferase complex subunit) genes which are known to be associated with male germ cell survival and spermatogenesis [45] and sterility [46], respectively.
The 51 SNPs for which only two genotypes were observed were located within 42 known gene coding regions. Thus, these genes were used to carry out gene ontology (GO) and mammalian phenotype ontology (MPO) enrichment analyses using the MouseMine analysis tools (see Additional file 7). The most significant enriched MPO categories were related to abnormal nervous system morphology and phenotype, preweaning lethality, and abnormal brain development (see Additional file 7, sheet1). However, no significant GO category enrichment was obtained (see Additional file 7, sheet 2). It will be very interesting to study the effect of these variants on phenotypes of interest in cattle.

Conclusions
In this study, we performed a pan-genome assessment of small genome variations in cattle using whole-genome sequence data. Analysis of WGS data of 274 animals from both dairy and beef cattle breeds allowed the identification of over 28 millions small variations, among which we identified more than 25 million SNPs and around 3 million small insertions and deletions. To assess the quality of both our sequencing data and calling approach, we analyzed the transition to transversion ratio and the call rate, and we also compared the sequence-derived genotypes with array-derived ones. Results from all these analyses confirmed the efficiency of our sequencing data as well as the good quality of our variant calling procedure. Annotation of these variants revealed several deleterious missense and loss-of-function variants, among which we identified several mutations that were previously reported to be associated with either dairy or beef traits. Genotypic and allelic frequency distributions revealed the presence of more than 14,000 heterozygous candidate deleterious and LOF variants that segregated in the absence of individuals homozygous for the alternative allele in our population. Of these, we genotyped 172,416 animals from dairy and beef breeds with a panel of 304 SNPs, using the already available Illumina BovineLD custom BeadChip. Two hundred and seventy-six of these variants (~91%) were polymorphic in at least one breed and, thus, were considered as true variants. For 51 of the 276 polymorphic variants, we did not observe any homozygous individual for the alternative allele. These 51 variants will be useful to study their link with genetic variability of economically-important traits in cattle.