Deleterious protein-coding variants in diverse cattle breeds of the world

The domestication of wild animals has resulted in a reduction in effective population sizes, which can affect the deleterious mutation load of domesticated breeds. In addition, artificial selection contributes to the accumulation of deleterious mutations because of an increased rate of inbreeding among domesticated animals. Since founder population sizes and artificial selection differ between cattle breeds, their deleterious mutation load can vary. We investigated this question by using whole-genome data from 432 animals belonging to 54 worldwide cattle breeds. Our analysis revealed a negative correlation between genomic heterozygosity and nonsynonymous-to-silent diversity ratio, which suggests a higher proportion of single nucleotide variants (SNVs) affecting proteins in low-diversity breeds. Our results also showed that low-diversity breeds had a larger number of high-frequency (derived allele frequency (DAF) > 0.51) deleterious SNVs than high-diversity breeds. An opposite trend was observed for the low-frequency (DAF ≤ 0.51) deleterious SNVs. Overall, the number of high-frequency deleterious SNVs was larger in the genomes of taurine cattle breeds than of indicine breeds, whereas the number of low-frequency deleterious SNVs was larger in the genomes of indicine cattle than in those of taurine cattle. Furthermore, we observed significant variation in the counts of deleterious SNVs within taurine breeds. The variations in deleterious mutation load between taurine and indicine breeds could be attributed to the population sizes of the wild progenitors before domestication, whereas the variations observed within taurine breeds could be due to differences in inbreeding level, strength of artificial selection, and/or founding population size. Our findings imply that the incidence of genetic diseases can vary between cattle breeds. Supplementary Information The online version contains supplementary material available at 10.1186/s12711-021-00674-7.


Background
Population genetics theories predict that, at low frequencies, deleterious single nucleotide variants (SNVs) can contribute significantly to the heterozygosity of a population [1,2]. In contrast, SNVs are prevented from reaching high frequencies and are eventually eliminated by purifying selection [2]. Domestication of wild plants and animals results in a population bottleneck because only a small subset of the wild population is sampled to form the founder stock [3]. Artificial selection for desired traits and inbreeding also lead to a further reduction in the effective population sizes during breed formation [4]. As a result, domesticated plants and animals are expected to accumulate an excess of deleterious mutations compared to their wild types. A number of previous studies investigated this issue by comparing the deleterious mutational loads of wild and domesticated plants and animals, and the diversity ratio ( ω ) between amino acid changing (nonsynonymous) and silent (synonymous) SNVs was used as a measure of deleterious mutational load [3,[5][6][7][8][9][10][11][12]. A previous study [5] showed that ω was much higher in domesticated pig breeds than in wild pigs and also much higher in commercial white layer chickens than in wild African village chickens (putatively close to the jungle fowl) [5]. Similarly, much higher ω have been reported for domesticated breeds of horse [6], dog [7], rabbit [8], and silkworm [8] compared to their wild relatives, and for cultivated crops such as rice [3,9], soybean [10], cassava [11], and sunflower [12] compared to their wild progenitors.
All the genetic variants that reduce the fitness of an organism are collectively designated as deleterious mutations. While lethal and highly deleterious variants are immediately removed from the populations, the mildly deleterious variants segregate within a population for a short period of time. In this study, mildly harmful variants (i.e. that reduce fitness) are referred to as deleterious mutations. Natural selection prevents such deleterious mutations from reaching high frequencies and eventually eliminates them from the population [2]. Therefore, deleterious mutations are expected to be present at a low frequency, predominantly in the heterozygous state. However, when the size of a population declines, deleterious mutations drift to high frequencies [1]. Furthermore, a reduction in population size or bottleneck will also increase the number of homozygous deleterious mutations, which result in a deviation from the Hardy-Weinberg equilibrium. For instance, human migration out of Africa resulted in a series of bottlenecks in the non-African populations as they were successively subsampled along the migratory routes, which led to a higher proportion of high-frequency and homozygous deleterious SNVs in non-African compared to African populations [13,14]. Since the process of domestication also introduces bottlenecks, a similar pattern is expected in domesticated animals. This was confirmed by a previous study that compared the exomes between dog breeds and wild wolves and found a higher proportion of homozygous deleterious SNVs in dogs than in wolves [7]. Similarly, domesticated yak populations were reported to have a larger number of homozygous deleterious amino acid-changing SNVs than that estimated for wild yaks [15]. High-frequency deleterious variants that cause diseases such as retinal degeneration in European cattle breeds have been attributed to result from the process of domestication and artificial selection [16].
Cattle breeds belong to the subspecies Bos taurus taurus and/or Bos taurus indicus [17]. The level of heterozygosity and the deleterious mutational load in each breed can be determined by the size of their progenitor population [18]. In addition, the differences in the degree and patterns of artificial selection and in the rate of inbreeding can also contribute to the variation in deleterious mutation load among cattle breeds [4]. In this study, we estimated the deleterious mutational load in various cattle breeds and investigated the potential contributions of the above-mentioned factors by analysing whole-genome data from 432 animals belonging to 54 distinct worldwide cattle breeds.

Genome data
Whole-genome data from 108 cows and 314 bulls were obtained from the Bovine Genome Variation Database (BGVD) [19]. These animals belong to 54 breeds, including breeds from Europe (Central and Western Europe), Northeast Asia (Japan and South Korea), Africa (Western Africa and Guinea), Middle East (Iran), South Asia (India, Pakistan, and Sri Lanka) and East Asia (China). The number of individuals in each breed ranged from 1 to 45 [for details, (see Additional file 1: Table S1)]. In this study, indicine × taurus breeds were referred to as indicine due to their similarity in diversity and deleterious mutational load with indicine breeds. To orient the direction of mutations and to find the derived SNVs, we used whole-genome data of American Bison (Bison Bison). For this purpose, the pairwise whole-genome alignment between the cow and American Bison created with the LASTZ program was downloaded from the Ensembl genome data resource (ftp:// ftp. ensem bl. org/ pub/ relea se-102/ maf/ ensem bl-compa ra/ pairw ise_ align ments/ btau_ ars-ucd1.2. v. bbbi_ bison_ umd1.0. lastz_ net. tar. gz). The corresponding nucleotides of the Bison genome was used to determine the orientation of the cattle SNVs. While the Bovine Genome Variation Database is based on the reference Btau 5.0.1 build, it also contains the corresponding chromosomal coordinates for the reference ARS-UCD1.2. Therefore, we used the ARS-UCD1.2 coordinates to detect variants, link annotations, and for genome evolutionary rate profiling (GERP) score information. Furthermore, the two alleles of each SNV in the cow genome were compared with those in the bison genome. For our analysis, we included only the SNVs that had at least one allele matching with that of the bison SNVs, which allowed us to confirm the variants independently of the build that was used to map them.

Functional annotations
To identify amino acid changing (nonsynonymous) SNVs and silent (synonymous) SNVs, the genome annotation file containing the information on functional consequences was obtained from the Ensembl server (ftp:// ftp. ensem bl. org/ pub/ relea se-102/ varia tion/ vcf/ bos_ taurus/ bos_ taurus_ incl_ conse quenc es. vcf. gz). The coding sequences for 15,392 unique reference genes of the cow were downloaded from the GenBank reference genes database. Using the program codeml of the software PAML [20], the exact numbers of synonymous and nonsynonymous positions were calculated. In order to identify the deleterious amino acid-changing SNVs, the GERP score was used [21]. The GERP score for each chromosomal position of the cow was calculated based on a whole-genome alignment of 90 mammalian genomes, and this data was downloaded from the Ensembl server (http:// ftp. ensem bl. org/ pub/ curre nt_ compa ra/ conse rvati on_ scores/ 90_ mamma ls. gerp_ conse rvati on_ score/).

Determination of the deleteriousness of amino acid changing SNVs
A number of studies have used the GERP score to determine the deleteriousness of SNVs [13,14,22,23]. However, Huber et al. [24] and Lawrie et al. [25] have raised concerns about the use of this score [24,25]. Huber et al. [24] investigated the relationship between the GERP scores and the fitness effects on the organism in terms of selection coefficient ( s ) and effective population size ( N e ) and showed that while lower GERP scores (< 0) predict neutral mutations that are not under selective constraints ( N e S> − 1), high GERP scores (> 5.5) accurately predict deleterious mutations that are under high purifying selection ( N e S< < − 1), but GERP scores (0-4.5) with moderate values are ambiguous and are not able to distinguish neutral from deleterious mutations. Furthermore, they showed that selection coefficients of the functional elements in the noncoding regions change significantly over time (functional turnover), and hence the statistical power of predicting deleterious single nucleotide polymorphisms (SNPs) in noncoding regions using the GERP method is low. However, high GERP scores (> 5.5) do have the power to detect deleterious SNPs (those under purifying selection) in zero-fold and two-fold degenerate sites of protein-coding genes [24]. Therefore, based on the results of Huber et al. [24], we used a GERP score threshold higher than 5.5 to identify deleterious nonsynonymous SNVs in the above-mentioned sites of coding genes.

Data analysis
Nucleotide diversity (π) per base was estimated using the following equations [26]: where p i is the allele frequency of SNV i , S is the total number of SNVs in the whole genome or exome, n is the number of chromosomes sampled and L is the number of sites or bases in the genome, at synonymous or nonsynonymous positions. The ratio ( ω ) was estimated as: where π N and π S are nonsynonymous and synonymous nucleotide diversities. Only biallelic SNVs are used in the analysis. To test the significance between mean estimates, we used the Z-test, and to estimate the strength of relationships, we used the Pearson correlation. The nonparametric Spearman correlation also produced similar strengths of correlations.

Results
To examine the pattern of genomic variation, the whole genome nucleotide diversity was estimated for 54 cattle breeds. The X-axis on Fig. 1 shows that taurine breeds have low diversities (0.0004-0.0016) compared to indicine breeds (0.0020-0.0031). These diversity ranges suggest that, within the taurine breeds (red), there is a four-fold variation in the genome diversity, whereas within the indicine breeds (blue) it does not exceed 70%. The nonsynonymous-to-synonymous diversity ratio ( ω ) was estimated for each breed, and the values were plotted against the genomic diversities. This analysis revealed a highly significant negative correlation (Pearson r = − 0.95, P < 0.000001) between the two variables ( Fig. 1), which suggests that the breeds with a low genomic diversity have a high proportion of amino acidchanging SNVs compared to those with a high genomic diversity. Hence, taurine breeds have a much higher ω (0.18-0.22) than the indicine (0.16-0.17) breeds. However, within the taurine breeds, the ω values observed for Fig. 1 Relationship between whole-genome diversity ( π ) and the ratio of nonsynonymous heterozygosity to synonymous heterozygosity ( ω ) estimated for 54 cattle breeds. Colour codes of the data points distinguish subspecies of the breeds, and shapes denote the geographical locations of the breeds. The correlation was highly significant (Pearson's correlation r = − 0.95, P < 0.000001). The best-fitting regression line is shown European and Northeast Asian (Japanese and Korean) taurine breeds (0.20-0.22) were higher than those for African and East Asian taurine breeds (0. 18-0.19). In contrast, ω did not vary much among the indicine breeds.
The ω estimates indirectly revealed the nonsynonymous deleterious mutation loads in various cattle breeds. We identified the deleterious nonsynonymous SNVs by setting a threshold higher than 5.5 for the GERP score and separated them into two groups: those with a DAF lower than 0.51, and those with a DAF higher than 0.51. The number of derived deleterious amino acid changing SNVs per genome was calculated separately for each DAF category and for each breed. This was done by counting the total number of deleterious low-frequency SNVs observed in a breed and dividing it by the number of animals in the breed. The same was calculated for high-frequency deleterious SNVs. These counts per genome were then correlated with the genomic diversity. This analysis revealed contrasting patterns for high and low-frequency SNVs. Figure 2a reveals a significant positive correlation (r = 0.88, P < 0.000001) between the number of low-frequency deleterious SNVs per genome and genomic diversity. The number of low-frequency deleterious nonsynonymous SNVs varied drastically between breeds (from 3.9 to 24.0). Analysis of the high-frequency (DAF > 0.51) nonsynonymous deleterious SNVs showed a significant negative relationship (r = − 0.67, P < 0.000001) between genome diversity and the counts of deleterious SNVs per genome (Fig. 2b). We observed a two-fold difference (from 14.7 to 30.6) in the SNV counts among the breeds. Some of the data points in Fig. 2 represent breeds that comprise only one or two animals, which might influence the results. However, highly significant relationships were also observed when only the breeds with more than two animals were included (P < 0.00002).
To obtain the pattern of the mutational load in breeds from various geographical locations and belonging to different subspecies, we grouped the breeds into six categories, and the average number of deleterious SNVs per genome for each group was calculated (Fig. 3). On average, the taurine breeds have predominantly highfrequency deleterious SNVs, and the indicine breeds have predominantly low-frequency deleterious SNVs. The number of low-frequency deleterious SNVs varied (between 6.3 and 18.5) between breed categories (Fig. 3a). The average number of low-frequency SNVs present in the indicine breeds was significantly larger than that in the taurine breeds (P < 0.00001). Similarly, high-frequency deleterious SNVs also varied between the six groups of breeds (between 17.6 and 25.3) (Fig. 3b). The mean count of high-frequency deleterious SNVs in the taurine breeds was significantly higher than that observed in the indicine breeds (P < 0.00001). Within the taurine breeds, the Northeast Asian and European breeds have a larger number of high-frequency deleterious SNVs but a smaller number of low-frequency deleterious SNVs than the East Asian taurine breeds. Such variations were not observed within the indicine breeds.

Discussion
The whole-genome diversity estimated in this study varied significantly between the cattle breeds analysed and are very similar to previously reported values [27]. The correlation between diversity and ω suggests a higher nonsynonymous mutation load in breeds with a low Fig. 2 Correlation between whole-genome diversity ( π ) and the number of derived deleterious nonsynonymous SNVs per genome. Deleterious SNVs were divided into two groups based on their derived allele frequencies (DAF): a low-frequency deleterious SNVs (≤ 0.51) and b high-frequency deleterious SNVs (> 0.51). The positive (r = − 0.88, P < 0.000001) and negative (r = − 0.67, P < 0.000001) correlations were highly significant and best-fitting regression lines are shown diversity than in those with a high diversity. This result is similar to the correlations observed between diversity and ω estimated for various dog breeds [7] and domestic breeds of rabbit, pig, and chicken [8]. Furthermore, the higher ω observed for many domestic crop varieties and animal breeds compared to their wild relatives also support our results [3, 5-12, 15, 28]. Previous studies showed that the number of homozygous deleterious SNVs was larger in domesticated canines or yaks than in their respective wild relatives [7,15]. Since homozygous SNVs represent high-frequency variants, our results are consistent with those reported by the above-mentioned studies.
Previous studies using high-density (HD) SNP arrays revealed a much higher genetic diversity for taurine than for indicine cattle [29][30][31][32], e.g. in [29] the observed heterozygosities for indicine breeds were lower than 0.21 and those for taurine breeds were higher than 0.3. However, a deeper analysis detected an ascertainment bias that explained this difference since the SNP array chips were predominantly based on taurine breeds [30,33,34]. A number of later studies (including the 1000 bull genomes project) based on whole-genome sequencing data showed that indicine breeds had an almost two-fold higher nucleotide diversity, i.e. ~ 0.003 for indicine versus ~ 0.0015 for taurine breeds [18,27], which are comparable to our estimates i.e. from 0.0019 to 0.0031 for the indicine versus from 0.0004-0.001 for the taurine breeds. Also, two other studies reported a two-fold larger number of SNPs in indicine than in taurine breeds [35,36]. Nevertheless, the very low genomic diversity of some of the taurine breeds analysed in our study could also be due to the increased rate of inbreeding resulting from intense selective breeding.
Studies on mitochondrial genomes reported a similar or lower diversity in indicine mitogenomes compared to those of taurine breeds [37][38][39][40]. However, this could be attributed to the number and types of haplogroups present in each breed. The haplogroups belonging to the taurine mitochondrial lineage are: T (T1-T5), P, Q, R, and E, and those to the indicine lineage are: I1 and I2 [41,42]. Unlike the nuclear genome, a single breed population can contain one or many mitochondrial haplogroups, including those from taurine and indicine lineages. Therefore, the mitochondrial genome diversity depends on the combination of haplogroups. Breeds with both indicine (I) and taurine (T, P, Q, R or E) haplogroups have a high diversity, whereas breeds with only indicine or only taurine haplogroups have a low diversity [38][39][40]. Furthermore, within the taurine lineage, breeds containing multiple haplogroups (e.g. T and P lineages) have a higher diversity than those that have only one haplogroup (e.g. T) [40]. This is because diversity is proportional to the coalescence time (age) of the mitochondrial genomes of the breed [41,42]. Breeds containing multiple haplogroups are old (longer coalescence time) and their diversity is expected to be high. These patterns are evident from many studies ( Table 1 in [40]; Supplementary Table S1 in [39]; and Table 1 in [38]). Therefore, the relationship between breed types (taurine or indicine) and mitochondrial nucleotide diversities is complex and needs to be inferred based on the context of haplogroups. Since the taurine mitochondrial lineage has more (9) haplogroups than the indicine lineage (2), the breeds containing the former lineage tend to have a higher level of diversity than the latter. Due to these uncertainties and confounding factors, the present investigation was restricted to the study of the mutational load of nuclear genomes. Fig. 3 The average number of deleterious derived nonsynonymous SNVs per genome was calculated for each category of cattle breeds that were grouped based on their geographical location and subspecies. a Low-frequency SNVs (DAF ≤ 0.51) and b high-frequency SNVs (DAF > 0.51). Error bars denote the standard error of the mean The deleterious SNVs estimated for groups of breeds showed that, on average, taurine cattle breeds have a higher mutational load than indicine breeds. The number of high-frequency deleterious SNVs also varies significantly within the taurine and within the indicine breeds. For instance, Northeast Asian and European taurine breeds have a larger number of high-frequency deleterious SNVs, and a smaller number of low-frequency deleterious SNVs than the East Asian taurine breeds. Population genetic theories predict that breeds (populations) with small effective population sizes are expected to have more high-frequency derived SNVs due to the influence of genetic drift, which is strong in small populations [2,13]. This expectation also holds true for SNVs with low fitness effects [1]. Therefore, the higher counts of derived deleterious SNVs observed in taurine compared to indicine breeds could be attributed to the difference in the effective population sizes of their progenitors before domestication, as previously suggested [18,43]. In addition, this could also be due to a difference in the size of the bottleneck that occurred during their respective domestication and breed formation [4]. For instance, East Asian taurine breeds have a larger number of deleterious SNVs than East Asian indicine breeds and this difference may reflect a difference in the effective population sizes of the progenitors of taurine and indicine breeds. However, European and Northeast Asian taurine breeds have much more deleterious SNVs than East Asian taurine breeds, which could be the result of highly selective breeding and a severe bottleneck created by a much smaller number of founders used during the formation of European and Northeast Asian taurine breeds.

Conclusions
Our study revealed a higher mutation load and a larger number of high-frequency deleterious SNVs in cattle breeds with a low genomic diversity than in those with a high genomic diversity. These results suggest that diversity, deleterious mutation load, and frequency of deleterious mutations are determined by their effective population sizes as predicted by population genetic theories. While we found higher mutational load in taurine breeds than in indicine breeds, mutational load did vary within the taurine breeds owing to differences in their effective population sizes. These results have implications regarding the health of cattle breeds since the mutations causing genetic diseases and their frequencies are expected to vary between breeds. For instance, the incidence of genetic diseases caused by recessive homozygous variations could potentially be higher in breeds that have small effective population sizes.
Additional file 1: Table S1. Locations and number of samples for each cattle breed.