Meta-analysis for milk fat and protein percentage using imputed sequence variant genotypes in 94,321 cattle from eight cattle breeds

van den Berg, Irene; Xiang, Ruidong; Jenko, Janez; Pausch, Hubert; Boussaha, Mekki; Schrooten, Chris; Tribout, Thierry; Gjuvsland, Arne B.; Boichard, Didier; Nordbø, Øyvind; Sanchez, Marie-Pierre; Goddard, Mike E.

doi:10.1186/s12711-020-00556-4

Research Article
Open access
Published: 07 July 2020

Irene van den Berg¹ (ORCID: orcid.org/0000-0002-9292-8636),
Ruidong Xiang¹,²,
Janez Jenko³,
Hubert Pausch⁴,
Mekki Boussaha⁵,
Chris Schrooten⁶,
Thierry Tribout⁵,
Arne B. Gjuvsland³,
Didier Boichard⁵,
Øyvind Nordbø³,
Marie-Pierre Sanchez⁵ &
Mike E. Goddard¹,²

Genetics Selection Evolution volume 52, Article number: 37 (2020)

Abstract

Background

Sequence-based genome-wide association studies (GWAS) provide high statistical power to identify candidate causal mutations when a large number of individuals with both sequence variant genotypes and phenotypes is available. A meta-analysis combines summary statistics from multiple GWAS and increases the power to detect trait-associated variants without requiring access to data at the individual level of the GWAS mapping cohorts. Because linkage disequilibrium between adjacent markers is conserved only over short distances across breeds, a multi-breed meta-analysis can improve mapping precision.

Results

To maximise the power to identify quantitative trait loci (QTL), we combined the results of nine within-population GWAS that used imputed sequence variant genotypes of 94,321 cattle from eight breeds, to perform a large-scale meta-analysis for fat and protein percentage in cattle. The meta-analysis detected (p ≤ 10⁻⁸) 138 QTL for fat percentage and 176 QTL for protein percentage. This was more than the number of QTL detected in all within-population GWAS together (124 QTL for fat percentage and 104 QTL for protein percentage). Among all the lead variants, 100 QTL for fat percentage and 114 QTL for protein percentage had the same direction of effect in all within-population GWAS. This indicates either persistence of the linkage phase between the causal variant and the lead variant across breeds or that some of the lead variants might indeed be causal or tightly linked with causal variants. The percentage of intergenic variants was substantially lower for significant variants than for non-significant variants, and significant variants had mostly moderate to high minor allele frequencies. Significant variants were also clustered in genes that are known to be relevant for fat and protein percentages in milk.

Conclusions

Our study identified a large number of QTL associated with fat and protein percentage in dairy cattle. We demonstrated that large-scale multi-breed meta-analysis reveals more QTL at the nucleotide resolution than within-population GWAS. Significant variants were more often located in genic regions than non-significant variants and a large part of them was located in potentially regulatory regions.

Background

The identification of causal mutations is important to take full advantage of sequence data to improve the accuracy of genomic prediction [1, 2]. Furthermore, it contributes to a better understanding of the biological mechanisms that underlie variation in quantitative traits and diseases. Since an increasing amount of sequence data is becoming available, it is possible to perform sequence-based genome-wide association studies (GWAS) to identify candidate causal mutations or markers in high linkage disequilibrium (LD) with them. However, due to the stringent thresholds that are necessary to avoid false positive associations, very large cohorts with both sequence variant genotypes and phenotypes are required to identify quantitative trait loci (QTL) with small to moderate effects.

Increasing the statistical power to identify candidate causal variants is possible by combining multiple datasets. However, in practice, sharing data at the individual level is not always possible, which prevents the compilation of large mapping cohorts. An alternative that does not require exchanging data at the individual level is to perform a meta-analysis that uses GWAS summary statistics to approximate a GWAS using the full, combined dataset [3, 4]. For stature, an international collaboration has shown that a large-scale, across-population meta-analysis provides high power to detect trait-associated variants [5]. Recently, several other meta-analyses have been carried out in cattle for multiple traits including fat and protein percentages in milk [6,7,8,9]. To date, the largest meta-analysis for fat and protein percentage included the single nucleotide polymorphism (SNP) chip genotypes of 78,772 cows from three dairy cattle breeds in France [9]. Performing a meta-analysis of summary statistics generated from multiple breeds across multiple countries also facilitates the combination of data from mapping cohorts that do not contain the same set of variants or that were prepared using different imputation or association methods.

In addition to achieving sufficient power to identify trait-associated variants, high precision is important in GWAS in order to prioritize a small number of variants as candidate causal mutations. Because of the long-range LD that is present in most cattle breeds [10], many variants in high LD tag the same QTL, which makes the identification of causal variants a difficult task. Because LD is conserved over shorter distances across breeds than within a breed, a multi-breed GWAS or across-breed meta-analysis can improve mapping precision [6].

Although direct selection on milk composition in dairy cattle has been more limited than selection on milk yield, a correlated response would be expected due to the genetic correlation between milk yield and composition. While all dairy breeds have been selected for milk yield and hence composition, slight differences in selection pressures may have occurred in different breeds. Pausch et al. [7] reported higher F_ST values for QTL than non-QTL, suggesting differences in selection pressures in the breeds used in their study.

To maximise both power and precision of a GWAS for fat and protein percentages in milk, we carried out meta-analyses of the summary statistics of nine within-population GWAS that used imputed sequence variant genotypes of 94,321 individuals representing eight cattle breeds. Besides identifying QTL for each trait, significant and non-significant variants were compared in terms of minor allele frequency (MAF), functional annotations and F_ST.

Methods

As input for the meta-analyses, summary statistics of nine within-population GWAS were used. In total, the GWAS included imputed sequence variant genotypes of 94,321 individuals. The within-population GWAS are summarized in Table 1.

Table 1 Description of GWAS used in the meta-analysis

Phenotypes used for within-population GWAS

Phenotypes were either yield deviations (YD) of cows, i.e. own mean performances adjusted for environmental effects, or daughter yield deviations (DYD) of bulls, i.e. average daughter performance adjusted for environmental effects and for breeding value of the mates, or a combination of those. Two studies simply used estimated breeding values (EBV) of bulls, which were not deregressed. However, considering the high reliability of the traits (0.89 in BSW and 0.95 in HOL [7]), the contribution of information from relatives to the EBV is very small and should not have any major consequences on the GWAS [11].

The GWAS for the Australian dataset were performed across breeds, but separately for bulls (AUSB) and cows (AUSC). The Australian animals and the GWAS model are described in a previous report [12]. Briefly, the AUSB dataset contained 9739 Holstein, 2059 Jersey and 125 Australian Red bulls, and the AUSC dataset consisted of 22,899 Holstein, 6174 Jersey, 424 Australian Red and 2850 crossbred cows. Phenotype data included 6569 CRV bulls (https://www.crv4all-international.com/) with phenotypes derived from their Interbull MACE breeding values (https://interbull.org/ib/interbullactivities), deregressed to the Australian scale, and converted to the scale of the daughter trait deviation. The remaining 5354 bulls and all 32,347 cows were from DataGene (https://datagene.com.au/). The GWAS for the Norwegian population (Norwegian Red cattle, NR) was performed using data on 21,540 and 21,550 bulls and cows, for fat and protein percentage, respectively. All other GWAS were performed within breed and sex. More details on the HOLG, BRAU and FLCK GWAS can be found in [7].

Genotypes used for within-population GWAS

Only variants with a MAF higher than 0.002 or a minor allele count (MAC) higher than 4 and, if available, an imputation r² (as provided by the imputation program) ≥ 0.4 were considered for the within-population GWAS. In total, 25,702,992 (25,702,995) distinct variants were analysed for fat (protein) percentage, with the number of variants per within-population GWAS ranging from 12,985,160 to 17,042,717. In total, 7,520,048 (7,520,050) variants were common to all GWAS for fat (protein) percentage.

Most GWAS populations were imputed using multi-breed reference populations that comprised 1147, 1557 or 2333 individuals from Run 4, 5 or 6 of the 1000 Bull Genomes Project, respectively [13], except for the Norwegian Red population, which was imputed using a within-breed reference population of 378 Norwegian Red bulls. Imputation was done using Minimac3 [14], Minimac4 [14] or FImpute [15].

Within-population GWAS

GWAS were carried out by single SNP regressions using best-guess genotypes and the mixed linear model association (MLMA) analysis as implemented in the GCTA software [16], or using imputed allele dosages and the MLMA approach as implemented in the EMMAX software [17].

Meta-analysis

All 25,702,992 variants that were present in at least one of the within-population GWAS were included in the meta-analysis. The meta-analysis was based on the weighted Z-scores model as implemented in the METAL software [18], which considers the p-value, direction of effect and number of individuals included in each within-population GWAS. Because the scaling of the phenotypes used for the within-population GWAS differed between the populations, we used the weighted Z-scores model, which uses the significance and direction of marker effects as input, rather than alternative models that use allele substitution effects and corresponding standard errors. Van den Berg et al. [6] found that, when combining summary statistics from multiple GWAS that differ in the scaling of the phenotypes, the weighted Z-scores model yielded results that were very similar to those obtained by a full analysis combining all data used for the GWAS. For each variant and each within-population GWAS, Z-scores were computed as:

$$Z_{k} = \varPhi^{ - 1} \left( {1 - \frac{{p_{k} }}{2}} \right) \times \Delta_{k} ,$$

where $Z_{k}$ is the Z-score for GWAS $k$, $p_{k}$ the p-value estimated in GWAS $k$, $\Delta_{k}$ the direction of effect in GWAS $k$, and $\varPhi$ and $\varPhi^{ - 1}$ are the standard normal cumulative distribution function and its inverse, respectively. Subsequently, overall Z-scores were computed as:

$$Z = \frac{{\mathop \sum \nolimits_{k} Z_{k} w_{k} }}{{\sqrt {\mathop \sum \nolimits_{k} w_{k}^{2} } }},$$

where $w_{k}$ is the square root of the number of individuals used in GWAS $k$. An overall p-value was then computed as:

$$p = 2\varPhi \left( { - \left| Z \right|} \right).$$
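The three formulas above can be sketched in a few lines of Python. This is an illustrative sketch only, not the METAL implementation: the function names `study_z` and `meta_analysis` and the input layout (one `(p_value, direction, n_individuals)` tuple per within-population GWAS in which the variant is present) are assumptions made for the example. To stay numerically stable for very small p-values, the sketch evaluates $\varPhi^{-1}(1 - p_{k}/2)$ as $-\varPhi^{-1}(p_{k}/2)$ and $2\varPhi(-|Z|)$ as the complementary error function of $|Z|/\sqrt{2}$, which are mathematically equivalent.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def study_z(p_value, direction):
    """Signed Z-score of one GWAS: Phi^-1(1 - p/2) * direction.

    Computed as -Phi^-1(p/2), which is identical by symmetry but does not
    lose precision when p is tiny (1 - p/2 would round to exactly 1.0).
    """
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p_value must be in (0, 1]")
    if direction not in (-1, 1):
        raise ValueError("direction must be +1 or -1")
    return -_STD_NORMAL.inv_cdf(p_value / 2.0) * direction


def meta_analysis(studies):
    """Weighted Z-score meta-analysis for a single variant.

    `studies` holds one (p_value, direction, n_individuals) tuple per GWAS
    in which the variant was tested (hypothetical input layout).
    Returns the overall Z-score and the two-sided overall p-value.
    """
    numerator = 0.0
    sum_w_squared = 0.0
    for p_value, direction, n_individuals in studies:
        w = math.sqrt(n_individuals)  # weight = sqrt(sample size)
        numerator += study_z(p_value, direction) * w
        sum_w_squared += w * w
    if sum_w_squared == 0.0:
        raise ValueError("at least one study with n_individuals > 0 is required")
    z = numerator / math.sqrt(sum_w_squared)
    # 2 * Phi(-|Z|) equals erfc(|Z| / sqrt(2)); erfc stays accurate in the tail.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p
```

With a single study, the function returns that study's own p-value; with two equally sized studies that report the same p-value but opposite directions of effect, the overall Z-score is zero, which illustrates how inconsistent directions cancel in this model.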

QTL detection

All variants with a p-value lower than 10⁻⁸ were declared significant. To account for multiple testing, the false discovery rate (FDR) was calculated for each within-population GWAS and the meta-analysis as $FDR = \left( {nVariants \times 10^{ - 8} } \right)/nSign$, where $nVariants$ is the number of variants included in the GWAS and $nSign$ is the number of variants with a p-value lower than 10⁻⁸. QTL were selected by first ordering the significant variants based on their p-values, and subsequently selecting the most significant variants first, with at least 1 Mb between adjacent QTL. Variants within 1 Mb of a more significant variant were assumed to be part of the more significant QTL and not selected as additional QTL.
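The FDR formula and the QTL selection rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the `Variant` tuple and the function names are hypothetical, and the sketch reads the rule as "a significant variant is absorbed if it lies less than 1 Mb from an already selected lead variant on the same chromosome", which is one possible interpretation of the description above.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """Hypothetical record for one tested variant."""
    chrom: str
    pos: int      # position in bp
    p: float      # p-value from the GWAS or meta-analysis


def false_discovery_rate(n_variants, n_significant, threshold=1e-8):
    """FDR = (nVariants * threshold) / nSign, as defined in the text."""
    if n_significant <= 0:
        raise ValueError("FDR is undefined when no variant is significant")
    return (n_variants * threshold) / n_significant


def select_qtl(variants, threshold=1e-8, window=1_000_000):
    """Greedy selection of lead variants, most significant first.

    Assumption: a significant variant closer than `window` bp to an already
    selected lead on the same chromosome belongs to that lead's QTL.
    """
    significant = sorted(
        (v for v in variants if v.p < threshold), key=lambda v: v.p
    )
    leads = []
    for v in significant:
        absorbed = any(
            v.chrom == lead.chrom and abs(v.pos - lead.pos) < window
            for lead in leads
        )
        if not absorbed:
            leads.append(v)
    return leads


# Toy input with made-up positions and p-values.
example = [
    Variant("1", 1_000_000, 1e-20),
    Variant("1", 1_500_000, 1e-12),  # within 1 Mb of the first variant
    Variant("1", 3_000_000, 1e-10),
    Variant("2", 1_200_000, 1e-9),
    Variant("1", 5_000_000, 1e-5),   # not significant
]
example_leads = select_qtl(example)
```

In the toy input, the variant at 1.5 Mb on chromosome 1 is absorbed by the more significant variant at 1 Mb, so three lead variants remain.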

COJO

Because LD may be conserved over distances longer than 1 Mb, we performed a conditional and joint analysis (COJO) as implemented in GCTA [19] to test how many of the QTL detected in the meta-analysis appeared to be independent. As a reference sample to estimate the LD structure, we used sequence data of 53 Fleckvieh, 451 Holstein, 90 Jersey, 55 Montbéliarde, 45 Normande and 25 Norwegian Red individuals that were included in Run 6 of the 1000 Bull Genomes Project [13]. We only included the top variants selected as QTL in the COJO analysis and set the window size to 100 Mb.

Validation meta-analysis

To validate the QTL detected in the meta-analysis, we performed a second meta-analysis using data on 34,860 cows not included in the original meta-analysis. These cows originated from two countries, Australia and France, and four breeds, Holstein, Jersey, Montbéliarde and Normande. The French populations used in the validation study are described in more detail by Sanchez et al. [20]. Table 2 summarizes the four within-population GWAS that were used as input for the validation meta-analysis. The validation meta-analysis was performed only for the QTL that were detected in the first meta-analysis and that segregated in at least one of the four validation populations. These within-population GWAS and the validation meta-analysis were performed in the same manner as described above for the original analysis. To validate QTL, we compared the direction of the Z-score and p-value in the meta-analyses. Our previous study showed that the comparison of the direction of the effect of variants across different GWAS results can be more powerful in detecting consistent signals than the sole comparison of p-values between different GWAS [12].

Table 2 Description of the GWAS used in the validation meta-analysis

Minor allele frequencies

To compare the MAF of significant variants with the MAF of all variants, we estimated the allele frequencies of the total population used for the meta-analyses. First, the allele counts at each position were computed using the allele frequency in each population. Then, the allele counts were combined and used to estimate the MAF of each variant in the whole population used for the meta-analyses.
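The pooling step can be illustrated with a short sketch. The function name `pooled_maf` and the input layout are assumptions for the example; the sketch further assumes diploid individuals, that each frequency refers to the same allele in every population, and that populations in which a variant is absent (frequency given as `None`) are simply skipped, which the text does not specify.

```python
def pooled_maf(populations):
    """Pooled minor allele frequency of one variant across populations.

    `populations` holds one (allele_frequency, n_individuals) tuple per
    population; allele_frequency is None where the variant was not present.
    """
    allele_count = 0.0
    total_alleles = 0
    for frequency, n_individuals in populations:
        if frequency is None:
            continue  # assumption: skip populations lacking the variant
        allele_count += 2 * n_individuals * frequency  # diploid allele count
        total_alleles += 2 * n_individuals
    if total_alleles == 0:
        raise ValueError("variant is absent from every population")
    pooled_frequency = allele_count / total_alleles
    # The minor allele is whichever allele is rarer in the pooled population.
    return min(pooled_frequency, 1.0 - pooled_frequency)
```

For example, frequencies of 0.1 in 100 animals and 0.3 in 300 animals give 20 + 180 = 200 copies among 800 alleles, i.e. a pooled MAF of 0.25.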

Functional annotations

Functional annotations were compared between significant and all other variants in order to determine if certain functional categories were enriched for trait-associated variants. Genomic coordinates and functional annotations were obtained according to the UMD3.1 assembly of the bovine genome and Ensembl’s Variant Effect Predictor [21, 22]. We used LiftOver (https://genome.ucsc.edu/cgi-bin/hgLiftOver) to convert the positions of the detected QTL from UMD3.1 to their positions on the new ARS-UCD1.2 genome.

eQTL analysis

Results of the meta-analyses were compared with those of a previous eQTL study [23, 24] to identify potential overlap between QTL and eQTL. The eQTL study contained data of 105 Holstein and 26 Jersey cows. In total, 9,191,239 variants were included in both the meta-analysis and the eQTL study using white blood cells, and 8,587,100 variants were included in both the meta-analysis and the eQTL study using cells collected from milk samples. The cells collected from milk samples included immune cells and mammary gland epithelial cells. The transcriptome of cells collected from milk samples shared a high similarity with that of the mammary gland tissue. A detailed description of the RNA sequence data generation for each tissue is reported in [25]. The association between the variants and gene expression was estimated using a linear model. A variant was declared as an eQTL if its genotype was significantly associated with the expression of a gene located within 1 Mb of the variant with a p-value ≤ 10⁻⁶.

F_ST

To investigate whether significant SNPs are associated with higher F_ST values and to examine the potential presence of different selection pressures between breeds, we calculated F_ST values for 16,626,224 sequence SNPs using allele frequencies in 53 Fleckvieh, 451 Holstein, 90 Jersey, 55 Montbéliarde, 45 Normande and 25 Norwegian Red individuals that were included in Run 6 of the 1000 Bull Genomes Project [13]. F_ST values were computed for all breeds combined according to Weir and Cockerham [26], as implemented in GCTA [16], to measure the divergence between the breeds in the meta-analysis.

DAVID analysis

We used the Database for Annotation, Visualization and Integrated Discovery (DAVID) functional annotation tool [27, 28] to investigate if gene ontology terms were enriched for genes located within the QTL. Genes with variants that were significant in the meta-analysis and located in or near the gene, according to the previously described annotation, were used as input for DAVID. In the DAVID analysis, we included the following terms: COG_ONTOLOGY, UP_KEYWORDS, UP_SEQ_FEATURE, GOTERM_BP_DIRECT, GOTERM_CC_DIRECT, GOTERM_MF_DIRECT, KEGG_PATHWAY and UP_TISSUE.

Results

Number of QTL detected

Figures 1 and 2 show Manhattan plots of the meta-analysis for fat and protein percentage, respectively. Manhattan plots of the within-population GWAS are in Additional file 1: Figure S1. Table 3 compares the number of significant variants and QTL in the within-population GWAS and the meta-analysis. The number of significant variants detected in the within-population GWAS ranged from 2117 for protein percentage in BRAU to 13,955 for fat percentage in AUSB. For protein percentage, the meta-analysis detected more significant variants than all the within-population GWAS combined, while for fat percentage, all the within-population GWAS combined detected more significant variants than the meta-analysis.

Table 3 Number of variants and QTL detected in the GWAS and meta-analysis for fat and protein percentage

For both fat and protein percentage, more QTL were detected in the meta-analysis than with the within-population GWAS. The lists of the QTL detected in the meta-analysis are in Additional file 2: Table S1 and Additional file 3: Table S2.

Several of the QTL detected in the meta-analysis were not significant in any of the within-population GWAS. For example, the meta-analysis detected a QTL for protein percentage located at 7,924,949 bp on chromosome 3, that had a p-value of 9.4 × 10⁻¹⁴ in the meta-analysis. In the within-population GWAS, the highest significance for this variant was for the NR breed with a p-value of 8.4 × 10⁻⁴.

Multiple variants were significant in the within-population GWAS, but not in the meta-analysis. Most of these variants had inconsistencies in direction of effect between populations in the within-population GWAS. For example, a QTL for protein percentage was detected in the NR breed (p-value of 2.7 × 10⁻¹⁰) at 35,509,237 bp on chromosome 25 (see Additional file 4: Figure S2). The alternate allele of the lead variant had a positive effect in the AUSB, AUSC, HOLF, MON, NOR, and FLCK breeds, but a negative effect in the NR, HOLG and BRAU breeds, and was not significant in any GWAS except in the GWAS for NR. The meta-analysis revealed a QTL nearby at 36,527,270 bp that was only included in the GWAS for NR and had a p-value of 3.7 × 10⁻²⁵. Visual inspection of this region on chromosome 25 indicated a possible peak, visible in both the GWAS for NR and the meta-analysis, that encompasses both the QTL at 35.5 Mb and 36.5 Mb, with fewer significant variants associated with the peak in the meta-analysis than in the GWAS (see Additional file 4: Figure S2).

COJO

Out of the 138 and 176 QTL detected for fat and protein percentage, 132 and 159, respectively, were present in the dataset that was used to estimate the LD structure for COJO. The COJO analyses retained 74 QTL for fat percentage and 84 QTL for protein percentage with a p-value ≤ 10⁻⁸. In most cases, the discarded variant was close to another variant that was retained. This implies that the two variants mark only a single QTL and not two independent QTL. Nevertheless, four variants were retained between the start of chromosome 14 and 5 Mb, which implies that there are at least three other QTL for fat percentage in this region in addition to DGAT1. For each QTL, Additional file 2: Table S1 and Additional file 3: Table S2 indicate whether the QTL was retained by COJO or not, and the p-value in the COJO analysis.