To determine the best strategy for performing GWAS in African cattle, we compiled two large datasets of genotyping data. The first dataset was a collection of 409 WGS spanning the three lineages (European taurine, African taurine and indicine) from which global cattle breeds derive, and that included 40 genomes each of the NDama, Boran and Holstein–Friesian breeds. The second dataset included 1740 animals genotyped with the Illumina HD array and originating from four African countries (Tanzania, Ghana, Nigeria, and Burkina Faso). For comparison purposes, these data were combined with other available samples that spanned global breeds and genotyped with the same Illumina HD array, for a combined total of 2481 samples. Among the 1740 cattle sampled from the four African countries and genotyped with the Illumina HD array, 602 samples genotyped with the GeneSeek bovine 50 k array were also available.
The commercial arrays tagged indicine cattle poorly
After variant QC, we identified 32.0, 17.8 and 13.3 million SNPs in the Boran, NDama and Holstein–Friesian cohorts, respectively. Although more than twice the number of SNPs segregated among the Boran (indicine) than among the taurine breeds, the percentage of WGS variants tagged by the arrays was consistently higher in the NDama and Holstein–Friesian breeds than in the Boran breed (Fig. 1). This is potentially partly due to the fact that the Boran breed has a slightly lower LD profile compared to the taurine breeds (see Additional file 7: Figure S4). Furthermore, the performance of any given array, in terms of the number of variants that it tags with a high R2, is largely a function of the number of variants it carries. It may be surprising, but the arrays that target indicine breeds did not show a substantial improvement for the Boran compared to the NDama and Holstein–Friesian cattle. This confirms that variants on commercially available arrays do not tag well the variants that segregate in indicine breeds, which has implications for performing GWAS in African breeds.
This comparison reveals more differences between the taurine breeds than between the indicine breeds across the 23 commercial bovine genotyping arrays. For example, the HD array tagged up to 25 and 17% (for Holstein–Friesian and NDama breeds, respectively) of the WGS variants, considering an LD r2 > 0.8, whereas only about 6% for the indicine Boran (Fig. 1). Intriguingly, there were more tagged variants for the taurine breeds than for the Boran breed, even in the indicus-based array such as IND90KH.
Imputation performance in African cattle
Consequently, it seems that a large proportion of the variants specific to African indicine cattle are poorly captured across all of the major genotyping arrays. Thus, we investigated to what extent imputation could help mitigate this issue. For this analysis, we used the WGS data of 289 broadly unrelated individuals [14], belonging to a diverse range of breeds and geographic locations (55 populations, among which 13 were European, 12 African, 28 Asian, and 2 Middle Eastern), as a Reference Panel (i.e., Global Reference Panel). The target sets were created by retaining the WGS genotypes that overlapped with the variants of each of the 23 available bovine genotyping arrays. The Global Reference Panel was also split into African (n = 87), Asian (n = 106) and European (n = 77) subsets, which allowed us to perform imputation with only the subset of variants found on each specific array and to compare the imputed genotypes to those found in the WGS data, and thus to evaluate which array performed better in African cattle.
Table 1 shows the numbers of variants on each of the 23 commercially available arrays and how many were retained from the WGS data, before and after QC (65,107,956 and 10,282,187 variants, respectively). Eighteen of the 23 arrays are well represented in the unfiltered WGS, with more than 80% of variants retained before QC, suggesting that the sites were mostly polymorphic among the 289 samples. The panel with the lowest proportion of variants retained from the unfiltered WGS was the GGPF250 array (61%). After QC, these proportions dropped further, with only 23% variants from the GGPF250 array retained. This likely reflects the allelic frequencies of the variants in the population under study. The GGPF250 array contains many very rare variants. For a variant occurring at a very low allele frequency, the probability of observing it in 289 samples is close to zero, even if it is present in the population.
Arrays with less than 10,000 variants remaining after QC were excluded from further analysis (Table 1) and imputation was performed using Minimac4 (https://genome.sph.umich.edu/wiki/Minimac4). Two related metrics were calculated for each array. ER2 is calculated directly by Minimac4 and measures the imputation accuracy of variants on the array, with each variant being left out one by one when performing imputation, and dosage R2 represents the genotype concordance between imputed and WGS genotypes at the variants that are not on the array.
Figure 2 shows the mean ER2 for all the bovine arrays when using the Global target set (i.e., the complete set of 289 samples as the reference panel). The highest accuracies were observed for the HD and BOS1 arrays (mean ER2 across target sets ranging from 0.55 to 0.96 and from 0.48 to 0.91, respectively), which is consistent with the fact that these two arrays have the largest number of variants. These results were supported by those obtained with the leave-one-out cross-validation (see Additional file 8: Figure S5) and in the comparison of the number of variants per array vs. mean accuracy (either ER2 or dosage R2) per array (Fig. 3).
The scatterplot in Fig. 3 shows that the HD and BOS1 arrays present the highest mean accuracies. Since our results show that the best performing arrays are those with the largest number of variants, the result that the GGPHDV3 array outperforms the GGPF250 array on these metrics (i.e., ER2 and dosage R2) may appear contradictory although they have almost half as many variants. However, as reported in Table 1, the number of variants retained from the WGS data for GGPHDV3 is larger than the number retained for GGPF250 (72,456 vs. 51,803).
A similar trend was observed when considering the dosage R2 between imputed genotypes and those called in the WGS data (Fig. 4). Although the HD and BOS1 arrays were primarily designed by focusing on European taurine cattle breeds, they perform better than lower density arrays that were designed explicitly for indicine breeds (i.e., IND90KH and GGPIND35), which suggests that a large number of variants in the target set is the most critical array feature for imputation. However, the GGPF250 array is a notable outlier. Although GGPF250 is the third largest commercially available bovine array, after the HD and BOS1 arrays, the number of variants retained from the WGS was relatively small, with only 61 and 23% retained when considering the WGS before and after QC, respectively. While the ER2 was reasonably high for this array (~ 0.6–0.7 across MAF thresholds \(\ge\) 0.05), i.e. the variants on the array could be imputed with reasonable accuracy, imputation performance was much poorer for variants that were not on the array (dosage R2 < 0.4) and worse than for arrays with less than half the number of variants.
Effect of breed on imputation performance
As the genotyping arrays were originally designed for different breeds and lineages (i.e., taurine vs. indicine), we investigated whether imputation performance was better for specific populations of animals and whether some arrays might perform better than others when applied to these population subsets.
As shown in Fig. 5, imputation performance of the African and European animals (i.e., African and European target sets) were broadly similar when imputing from the HD array variants to the Global Reference Panel. In contrast, imputation was notably lower among the Asian samples, which likely reflects that Asian cattle originate predominantly from Bos indicus, and are less well represented by the variants on the HD array. For most arrays, the highest accuracies were estimated when imputing European samples, with the mean ER2 or dosage R2 being similar to or higher than for the global set (see Additional file 9: Figures S6–S20). In general, the imputation performance of the Asian samples was poorest across all arrays, including those that were designed to capture indicine variation, such as IND90KH and GGPIND35.
Array performances at functional variants
In GWAS, the ability to genotype or impute functional variants is essential, given their role in driving phenotypes. Thus, we characterised the performance for precisely capturing these variants. First, we investigated the proportions of functional variants annotated by the Ensembl VEP v95 software (LOW, MODERATE, and HIGH predicted functional impact) that were tagged by the genotyping arrays in the WGS data, for both taurine (i.e., Holstein–Friesian and NDama) and indicine (i.e., Boran) breeds. Figure 6 shows the trends for two arrays (GGPF250 and HD).
Although the HD array can tag more variants at an r2 > 0.80 than the other arrays (Fig. 1), it does not perform well for tagging functional variants, especially those predicted to have a moderate or high impact. In spite of its lower density, the GGPF250 array tagged higher proportions of functional variants across all allele frequencies, which is consistent with its design that is aimed at targeting functional variants. However, the performance of an array for tagging functional variants is potentially also partly due to the similarity between the allele frequencies of the two sets of variants. Since the LD between two variants, as measured by r2, is maximised when they have equal allele frequencies, the common variants (moderate to high MAF) on most arrays will do a poor job at tagging the functional variants simply due to the difference in allele frequency. Therefore, the GPPF250 array potentially performs better than the other arrays for tagging functional variants because, besides including a high proportion of variants in genic regions, the allele frequencies of these variants are more closely matched. However, functional variants that segregate among the Boran cattle were still less well captured by this array than those among the taurine breeds. Notably, the GGPF250 array performed best at tagging variants with a MODERATE impact (missense variants) rather than those with a HIGH impact (e.g. stop gain or splice variants).
To assess how well functional variants can be imputed, we calculated the imputation accuracy (i.e., dosage R2) for each of the three classes (LOW, MODERATE and HIGH) of functional variants (both SNPs and indels). The masked analysis was performed considering the four target sets (i.e., Global, African, Asian, and European) and the Global Reference Panel. As shown in Fig. 7a, when considering the Global set of 289 individuals, SNPs with a low to moderate frequency (MAF < 0.15) and of HIGH predicted impact (e.g. stop gained or splice variants) are particularly poorly imputed in general by the HD array. In contrast, the imputation of the SNPs of MODERATE and LOW predicted impact generally follows the trend of all the other SNPs. Therefore, these results suggest that although the HD array is the best performing array according to the results from our masked analysis, it might not be ideal for imputing high impact SNPs, especially if they are rare.
A similar trend was observed for most arrays (see Additional file 10: Figures S21–S35, panels a). However, the GGPF250 and GGPIND35 arrays performed better when imputing HIGH, MODERATE and LOW impact SNPs than when imputing all other SNPs at any frequencies (see Additional file 10: Figures S22 and S33, respectively). This reflects the design of these arrays, with the GGPF250 array including 199,000 predicted functional variants [3], and the GGPIND35 array including 35,090 SNPs that are optimally selected for Bos indicus breeds [24] as well as Bos indicus specific SNPs for parentage testing; GGPIND35 also includes variants known to be causative for particular genetic diseases (https://www.neogen.com/globalassets/pim/assets/original/10000/official_ggp-indicus_brochure.pdf). Similar trends were also observed when using the continent-specific target sets (i.e., African, Asian and European). However, dosage R2 tended to be consistently lower for all target sets (see Additional file 10: Figures S21–S35, panels b–d), which might be explained by the relatively smaller size of these target sets compared to the global one. It should be noted that imputation for high impact variants generally performed less well when considering the European compared to the African target set.
We also examined the imputation performance of indels that are more likely to be functionally important. These were consistently poorly imputed across arrays, irrespective of their frequencies and the target set used [i.e., Global, African, Asian, and European; (see Additional file 11: Figures S36–S51)]. It should be noted that this may partly reflect the difficulty of accurately calling indels in WGS data, rather than necessarily being only a problem with their imputation.
Validation analysis imputing 50 k and HD genotype target sets to WGS
An issue with testing imputation performance by subsetting variants from WGS data is that not all variants for an array can be called in the dataset, which reduces the apparent performance of each array. Thus, we also tested imputation performance using large collections of real array data collected from four African countries (Tanzania, Ghana, Nigeria, and Burkina Faso). This included 602 African samples genotyped on the GeneSeek 50 k array and 1740 on the Illumina HD array. Principal component analysis (PCA) of these samples shows that they represent well cattle from both West and East Africa (see Additional file 12: Figures S52 and S53). The 1740 samples were combined with other available samples genotyped with the Illumina HD array, reaching a total of 2481 samples. The PC plot of PC1 and PC2 for the combined HD genotyped individuals (see Additional file 12: Figure S54) shows a distinct separation between the European taurine, African taurine and indicine samples. PC1 seems to separate individuals by their degree of taurine or indicine lineage, with the Jersey and Holstein breeds to the left of the plot, whereas PC2 separates European and African taurine animals. The 1740 samples collected across the four African countries separate from the former breeds, and form a cluster between the African taurine and the indicine breeds. Using these data, we compared the use of different imputation approaches and reference panels on imputation accuracy. The flowchart for this analysis is in Fig. 8.
We compared two imputation approaches. The 50 k and HD genotype data imputed directly to WGS, as well as a two-step imputation from 50 k to HD then WGS (50 k > HD > WGS), both using the Global Reference Panel and the continent level subsets (i.e., African, Asian and European Reference Panels). Generally, imputation from the 50 k array directly to WGS performed poorly irrespective of the reference panel used (Fig. 9), with mean ER2 lower than 0.4 across all allele frequencies. The global and Africa-specific reference panels performed the best, with little difference in their respective performances. The reference panels of European and Asian cattle breeds also performed poorly when attempting to impute from the Illumina HD array, with mean ER2 values of 0.65 or lower across the allele frequency spectrum. In spite of this, supplementing the African reference panel with these European and Asian panels produced the best imputation performance (mean ER2 ranging from 0.81 to 0.93). This Global reference panel performed better than the Africa specific reference panel alone (mean ER2 ranging from 0.41 to 0.89).
The performance of the two-step imputation from 50 k to WGS (50 k > HD > WGS) was between that of the imputation from 50 k or Illumina HD directly to WGS, with ER2 trends similar to those of the Illumina HD array, for all panels tested. Consequently, larger global reference panels are likely to be more effective for imputation than more targeted representative panels, and this is consistent with previously reported results e.g., [3, 25]. Likewise, a two-step imputation strategy performs better when starting from a low-density array.
Considering the variety of breeds/populations represented in the HD combined data [for a PC plot (see Additional file 12: Figure S54)], another analysis was carried out by splitting the samples to be imputed into six subsets (samples collected across the four African countries, all African, African taurine, European taurine, all indicine, and Asian indicine samples). The sizes of the sets were balanced to 55 individuals each, and imputation accuracies (ER2) are shown in Fig. 10. All subsets show very low ER2 at low MAF, probably due to the comparatively small size of the sets.
While the imputation accuracies obtained when considering the first two African target subsets (i.e., samples collected from the four African countries and all African) were consistent with those obtained for the combined HD data, a slightly different trend was observed when all indicine and the European taurine target subsets were used. All indicine sets (which included Asian and African indicine cattle) performed better than the combined HD data at lower MAF but followed the same trend as the latter at MAF higher than 0.3. In contrast, the European taurine set performed better, with mean ER2 ranging from 0.88 (at MAF ≈ 0.03) to 0.97. We hypothesise that the European taurine subset performs better because, although its sample size is small, the individuals are more similar to each other (given the small effective population size of these breeds), and it is therefore easier to define haplotypes and subsequently impute missing data. Moreover, the Illumina HD array was primarily designed for European cattle breeds.