Genetic variants in mammary development, prolactin signalling and involution pathways explain considerable variation in bovine milk production and milk composition

Background The maintenance of lactation in mammals is the result of a balance between competing signals from mammary development, prolactin signalling and involution pathways. Dairy cattle are an interesting case study to investigate the effect of polymorphisms that affect the function of genes in these pathways. In dairy cattle, lactation yields and milk composition (for example protein percentage and fat percentage) are routinely recorded, and these vary greatly between individuals. In this study, we test 8058 single nucleotide polymorphisms in or close to genes in these pathways for association with milk production traits and determine the proportion of variance explained by each pathway, using data on 16 812 dairy cattle, including Holstein-Friesian and Jersey bulls and cows. Results Single nucleotide polymorphisms close to genes in the mammary development, prolactin signalling and involution pathways were significantly associated with milk production traits. The involution pathway explained the largest proportion of genetic variation for production traits. The mammary development pathway also explained additional genetic variation for milk volume, fat percentage and protein percentage. Conclusions Genetic variants in the involution pathway explained considerably more genetic variation in milk production traits than expected by chance. Many of the associations for single nucleotide polymorphisms in genes in this pathway have not been detected in conventional genome-wide association studies. The pathway approach used here allowed us to identify some novel candidates for further studies that will be aimed at refining the location of associated genomic regions and identifying polymorphisms contributing to variation in lactation volume and milk composition.

Conclusions: Genetic variants in the involution pathway explained considerably more genetic variation in milk production traits than expected by chance. Many of the associations for single nucleotide polymorphisms in genes in this pathway have not been detected in conventional genome-wide association studies. The pathway approach used here allowed us to identify some novel candidates for further studies that will be aimed at refining the location of associated genomic regions and identifying polymorphisms contributing to variation in lactation volume and milk composition.

Background
There have been many attempts to identify the genes that control milk production and functional traits in dairy cattle since they have high economic value [1,2]. Linkage studies and genome-wide association studies (GWAS) have led to the identification of a handful of causative mutations that affect milk production traits in dairy cattle [3][4][5][6][7]. However, the mutations that underlie most of the genetic variation remain elusive, reflecting the fact that the majority of these mutations are likely to have small effects and, therefore, individually explain a small proportion of the genetic variance [8,9]. New methods are needed to analyse the large quantity of genetic information provided by highdensity SNP (single nucleotide polymorphism) panels in order to identify novel genetic variants that have a functional role in lactation traits.
One potential approach is to first filter genetic variants for association analysis by considering pathways of genes that are likely to be involved in lactation. The advantage of this method is that less stringent significance thresholds can be used than in traditional GWAS, since the level of multiple testing is not as high. This also means that associations of smaller effect can be detected. However, the approach does have the limitation that any mutations that affect the traits outside the selected pathways will be missed, which means that the variation we can identify may be reduced compared with that from wholegenome studies.
For dairy traits, genes that are involved in mammary gland development, prolactin signalling and involution pathways are relevant candidates. Genes in the lactation pathway have been well-described but are largely inferred from mouse studies [10][11][12][13]. Development of the mammary gland (or mammogenesis) involves the formation of the rudimentary mammary structure before puberty and is triggered by secreted signalling proteins and transcription factors that regulate developmental processes, such as the Wnt, notch and hedgehog signalling pathways [12]. When the mammary structure begins to form, genes for growth hormone and proteins involved in basement membrane architecture are expressed. At puberty, the concentration of several hormones increases and stimulates the formation of alveolar buds [14]. Prolactin signalling is vital for lobulo-alveolar development and establishment of lactation but appears less important after teat formation in dairy cattle [15,16]. One hypothesis is that in cattle, prolactin may be more important for immune support at calving [17]. Prolactin interacts with its receptors to trigger paracrine signalling mechanisms through a highly regulated feedback mechanism involving JAK/STAT and map kinase activity, as well as other downstream targets, which in turn regulate proliferation and cell differentiation [14]. In involution, milk producing epithelial cells are removed via cell detachment and apoptosis. Cytokines, interleukins and MMP (matrix metalloproteinases) are involved in complex signal transduction cascades to regulate proliferation and apoptosis in this pathway. The mammary epithelium undergoes several rounds of proliferation, differentiation and apoptosis over up to eight lactations in dairy cattle [18]. These processes are regulated by a number of genes, which represent excellent candidates for harbouring mutations that explain part of the observed variation in milk production traits and thus link genetic variation with the biological mechanisms underlying the phenotype. In this study, we have assembled sets of genes involved in mammary gland development, prolactin and involution biological pathways. Then, we tested SNPs in windows of 200 kb surrounding these genes for association with milk production traits in dairy cattle. Our hypothesis is that genes in these pathways will harbour genetic mutations that explain variation in production traits in dairy cattle, and that our approach will detect more of these associations than a traditional GWAS, since we can test variants at lower significance thresholds because of the smaller number of tests conducted.

Genome-wide association studies
To determine whether SNPs within key lactation pathways were significant for milk production traits, an association analysis was used. We analysed several traits, including fat kg, fat percentage, milk volume, protein kg, and protein percentage [19,20]. A total of 16 812 dairy cattle were genotyped using the Illumina Bovine HD BeadChip, or the BovineSNP50 array [21] and imputed to the higher density [22] (1785 animals were actually genotyped at the higher density). After quality control (as in [22]), the final number of SNPs was 632 003. The genotyped animals included 9015 Holstein cows, 2770 Holstein bulls, 4202 Jersey cows, and 825 Jersey bulls [see Additional file 1: Table S1]. Phenotypes of bulls and cows were constructed as daughter trait deviations (the average of the bull's daughters trait deviations corrected for breed of mate) and trait deviations, respectively (corrected for herd year season and permanent environment effects) [see Additional file 2: Table S2]. The distributions of the number of lactations (for cows) and daughters (for bulls) are in Additional file 3: Figure S1. Records were standardised in both breeds to have a mean of 0 and a standard deviation of 1. In all analyses, phenotypes on bulls were weighted as where n represents the number of records [23] and h 2 is the heritability of the trait (0.33 for milk volume, fat kg and protein kg, and 0.5 for protein percentage and fat percentage, for both breeds [20]). Phenotypes on cows were weighted using the formula [23]: where r 2 is the repeatability (0.56 for all milk production traits) and l is the number of lactations. For the percentage traits, we were not able to fit weights for bulls in the model due to problems with convergence, likely because the heritability for these traits was high. The linear mixed model used to determine the association between individual SNP and each milk production trait: where y is the vector of phenotypes, expressed as the trait deviations for cows and daughter trait averages for bulls, β is the vector of fixed effects, including the overall mean and the effects of breed and sex, X is a design matrix allocating phenotypes to fixed effects, W is the vector of animal genotypes (the number of copies of the second allele at the SNP that the animal carries, coded as 0, 1 or 2), b is the additive effect of the second allele of the SNP, Z is an incidence matrix mapping phenotype to animals, u is the vector of polygenic effects (one for each animal), and e is the vector of random residuals. The polygenic breeding values were fitted as random effects following a normal distribution N 0;Aσ 2 α À Á where A is the expected relationship among individuals constructed from the pedigree (which dates back to the 1940s) and σ 2 a is the polygenic genetic variance. Variance components and fixed effects were estimated for each SNP with ASReml [24].

Analysis of key lactation pathways
Gene sets for analysis were chosen using published reviews of three important developmental stages of the lactating mammary gland. These included the mammary development pathway [12] and the prolactin signalling [14] and involution pathways [25]. We identified 64 genes involved in mammary development, 27 genes involved in prolactin signalling, and 40 genes involved in involution (Tables 1, 2 and 3). The gene families MAP kinase, P13K and frizzled were not included in the pathways since specific genes were not suggested in the reviews and these gene families have a wide range of signalling functions. The genomic location of these genes were determined using UMD3.1 in the NCBI database [26]. The SNPs within the genes of a pathway, or within 100 kb to each side of those genes, were then tested for association with each trait using the model above. The effect of a SNP was determined to be significant at P ≤ 0.05. The GWAS was repeated using other significance thresholds (P < 10 −3 and P < 10 −5 ) but 0.05 had the greatest power to detect enrichment (results not shown). The number of SNPs significant for each pathway was expressed as a proportion of the total number of SNPs in that pathway (PropSig).
To determine if the proportion of significant SNPs observed for each pathway was significantly greater than by chance at an experiment-wise level, distributions under the null hypothesis of no association were constructed with random permutations of the data. A list of 24 617 uniquely annotated bovine genes was created from the Ensembl Biomart database [27,28]. From this, three sets of genes, each with a length equal to the respective pathway tested were selected at random. SNPs were selected from within and 100 kb surrounding these genes to reflect the moderate to high linkage disequilibrium in Holstein cattle [29,30]. Each pathway SNP set was analysed in ASReml using the mixed linear model described above. This procedure was repeated 10 000 times to construct null distributions and the 500th highest proportion of significant SNPs was taken at the experimentwise P < 0.05 threshold. If the observed ratio for a pathway was greater than this value for a particular trait, the pathway was considered significant.
To account for differences in functional clustering of genes in the experimental pathways and in the random control gene sets, we compared the distance between genes on the same chromosome [see Additional file 4: Figure S2]. The experimental and control sets were distributed similarly but, due to the smaller number or paired genes for the experimental pathways, there were fewer gene pairs at long distances across the chromosomes (particularly > 10 Mb).
KEGG annotations were used to determine the gene sets that represented other biological pathways [31,32].
Finally, a variance component analysis was used to determine whether the SNPs within each pathway explained a greater proportion of the genetic variance than an equal number of randomly selected SNPs from the whole genome. The model fitted was where terms were the same as above, and g is a vector of random effects, assumed distributed N 0; where G is a genomic relationship matrix, constructed using the rules of [33]. The genomic relationship matrix was based on the SNPs from each pathway, plus a set of 4000 SNPs randomly selected from the whole genome.  Bold values indicate where >50% of SNPs in a gene region were significant.
The reason for adding the 4000 randomly chosen SNPs was that SNPs in the genes of the pathways are typically clustered by genomic location (i.e. a number of the genes are located in close proximity) [see Additional file 4: Figure S2]. Given the large number of animals in our dataset, this means that a considerable number of animals can have genomic relationships that are equal to or close to 1, i.e. they have inherited the same segments of the genome at all of the locations of the pathway genes. Consequently, the genomic relationship matrix is singular and impossible to invert. Adding 4000 random SNPs removed the singularities and the genomic relationship matrix could be inverted and variance components estimated. However, with the 4000 SNPs included, we could only assess the marginal contribution of adding SNPs in the pathway. Estimates of the variance components σ 2 g and σ 2 e were obtained from the REML analysis with ASREML [24]. The proportion of variance explained by the SNPs in these pathways was compared to that explained by the same number of randomly chosen SNPs within 100 kb of a gene, i.e. the additional SNPs were chosen to be close to genes, plus the set of 4000 randomly chosen SNPs corresponding to each pathway. Five replicates of the randomly chosen sets were performed to obtain standard errors.

Mammary development pathway
The 64 genes identified in the mammary development pathway included 3968 SNPs (Table 1). When the proportion of significant SNPs, at P < 0.05, (PropSig) was compared to the null distributions, the mammary development pathway was significantly associated with protein percentage (PropSig = 0.340, P < 0.01; Table 4 and Additional file 5: Figure S3). The null distributions compared with the experimental results are shown in Additional file 5: Figure S3, Additional file 6: Figure S4 and Additional file 7: Figure S5. The genes that contained the largest proportion of significant SNPs (> 50% significant SNPs) were the following: AREGB, CASB, DKK1, FGF1, FGF10, GHR, PRLR, SOCS2, STAT5A, STAT5B, TGFB1 and WNT10B (Table 1 and Additional file 8:  Table S3 for gene abbreviations).
Four genes in the mammary development pathway were located on BTA20, which contains a well-known QTL for milk production [5]. These genes included FGF10, MSX2, PRLR and GHR. FGF10 is located 1 Mb downstream of GHR, which is the gene often described with, though not necessarily underlying [34], this large QTL. To account for any potential bias associated with over-represented genes, we re-ran the pathway test and control permutations without BTA20. The mammary development pathway still reached significance for protein percentage when this chromosome was removed [see Additional file 9: Figure S6]. KEGG annotations of these 64 genes found 25 genes in pathways associated with cancer and 8 to 14 other genes in signalling pathways, such as JAK-STAT, that are known to be activated during lactation ( Table 5). The PI3K-Akt pathway is involved in mammary development, and mutations in genes of this pathway are found in approximately 70% of breast cancers [35]. There were eight genes involved in Wnt signalling pathways, which are prominent in mammary development and cancers [36].
To determine the extent of pleiotropy for variants in the pathway, we correlated the SNP effect estimates (for the 3968 SNPs in the pathway) for each pair of traits. Milk volume was negatively correlated with fat percentage and protein percentage, while fat percentage and protein percentage were highly positively correlated (Table 6). Fat kg and milk volume were also highly positively correlated with protein kg, as expected.

Prolactin signalling pathway
The prolactin signalling gene set was considerably smaller (27 genes, 1569 SNPs) than the involution and mammary development sets, since it only represents only one signalling The observed proportion of SNPs significant was compared to the mean (μ) of the simulated null distribution (μ(PropSig)), created by permutation testing, to determine if the observed proportion was significant at an experiment-wise level (* denotes significant at P < 0.05 experiment-wise, ** denotes significant at P < 0.01 experiment-wise) (see Methods). The total number of SNPs is presented in bold.
pathway, while mammary development and involution represent the combined effects of several sub-pathways (Table 2). Protein kg, fat kg and fat percentage were significantly associated with the prolactin signalling gene set (Table 4) and [see Additional file 6: Figure S4]. The SOCS2, STAT3, STAT5A, STAT5B, PRLR and CASB genes had more than 50% of SNPs significant for three or more milk production traits ( Table 2).
KEGG annotations for genes in the prolactin pathway showed 12 associations with the JAK-STAT signalling pathway, followed by the PI3K-Akt and insulin signalling pathways (Table 5).

Involution pathway
The involution pathway contained 40 genes and 2521 SNPs ( Table 3). The proportion of associated SNPs was significant at the experiment-wise level for all milk production traits, except fat [see Additional file 7: Figure  S5] and (Table 4). We identified a large ratio of significant SNPs for ATF4, IGFBP4, IRF1, LIFR, OSMR, PTK2, STAT3, STAT5A and STAT5B (Table 3). KEGG analysis showed a trend towards infection-related pathways (Table 5). JAK-STAT, hepatitis B and PI3K signalling pathways were also highly represented. Traits showed moderate to high correlations, which suggested pleiotropy for milk production traits within SNPs in the involution pathway (Table 6).
Three genes in the involution pathway were located on BTA14 and may be biased by associations with the large QTL at the beginning of BTA14 associated with the mutation in DGAT1 [37]. The CEPBD and MYC genes are located more than 13 Mb upstream of this QTL but PTK2 sits 2 Mb upstream from DGAT1, well within the bounds of this very large QTL. When BTA14 was removed from the analysis, the involution pathway remained significant for the traits for which this was tested [see Additional file 9: Figure S6].  There was some overlap in the genes of the three pathways. Genes STAT5A, STAT5B and SOCS3 were common to all three pathways (Figure 1). Prolactin and mammary development pathways showed the largest overlap, which included TNF, SOCS and prolactin genes. KEGG analyses showed that similar pathways were represented in mammary development and involution but infection-related pathways were more prominent due to the abundance of acute phase response genes such as interleukins and STAT genes ( Table 5).

Proportion of variance explained by mutations in pathways
For milk production traits, SNPs in the involution pathway explained 10 to 13% more genetic variation than expected by chance for all traits (Table 7). SNPs in the mammary development pathway explained 7 to 9% more genetic variation than expected by chance for milk, protein percentage and fat percentage. SNPs in the prolactin pathway explained less variation than expected by chance, although results were not significantly different from zero. This could be the result of a combination of two factors, i.e. (1) SNPs within the prolactin signalling pathway do not really explain much variation, and (2) because of the small number of genes in this pathway, the SNPs did not cover all chromosomes (and therefore did not capture variation on those chromosomes), unlike the randomly sampled SNPs. The overall significance of each milk production trait for each pathway tested was very similar, though not identical, to the results from SNP by SNP association testing (perhaps a result of random sampling to construct the null distributions).

Discussion
We used information on mammary development, prolactin signalling and involution pathways to identify candidate gene regions that could be associated with milk production traits. SNPs in genes that are involved in the mammary development pathway were highly associated with protein percentage and explained a considerable proportion of the variance for three milk production traits. The prolactin signalling pathway did not explain  any additional variance in milk production traits, but contained a significant number of associated SNPs for protein kg, protein percentage and fat percentage. SNPs in genes involved in the involution pathway explained the greatest level of variance in milk production traits in our variance component approach. The involution pathway was also significant for all milk production traits except fat in the association testing approach. Mammary development, prolactin signalling and involution pathways contained highly significant genes that have been described in GWAS or are known to be important lactation genes. These include, CASB, SOCS2, GHR, PRLR, LIFR and the STAT genes. In particular, SNPs within STAT5A have a large effect on milk composition and have been validated in vitro [38,39]. Figure 2 shows a GWAS for protein percentage as an example, and displays the relationship between genes studied from these pathways and genome-wide QTL patterns. Most genes are located in regions that could not be identified by a traditional GWAS. SNPs within regions not previously associated with milk production traits, such as AREGB, ATF4, IRF1, DKK1, and TGFB1, which were significant for mammary development, may contain novel mutations that affect milk production traits and may represent key genes from the mammary development pathway that explain some of the variance in these traits in cattle.
The reason why the involution pathway explained the greatest level of variance in milk production traits in our variance component approach, although only half the number of SNPs of the mammary development pathway were available, could be because this pathway includes genes in or close to a previously described QTL with quite large effects on milk production traits (Figure 2), particularly protein percentage [5]. However, when the analysis was ran without the genes on BTA20 (FGF10, MSX2, PRLR and GHR), this pathway was still significant, even for protein percentage. Note that removing the GHR gene from the analysis is questionable because the growth hormone receptor is a vital component of the lactation pathway since it interacts with several relevant substrates during lactation [5]. Similarly, removing the CEBPD, MYC and PTK2 genes on BTA14 (because they were in the region of DGAT1) did not affect the overall significance of the mammary development pathway. The clustered expression of the genes in a pathway, i.e. they are expressed with other secreted milk genes [40], may result in significant associations that are due to nearby, co-expressed genes. The permutation method generated some replicates with similar genome distributions to the experimental data [see Additional file 4: Figure S2], which implies that the clustered expression of genes probably does not greatly affect the results. There is currently no ideal approach to control for the complicated genetic architectures of traits in pathway analyses. While these genetic structures should be accounted for, caution should be taken to avoid losing information from highly relevant genes.
One of the main limitations of our approach is that if a mutation that affects milk production is not in the analysed pathways, it will automatically be excluded. Perhaps even more importantly, our interpretations could be biased if irrelevant genes are included in the pathways. This may have occurred in cases where broad-acting cellular processes are represented in the gene sets. Improved descriptions of pathways would increase the power to identify genomic regions that influence these traits. The pathways used in this study were primarily derived from mouse studies and are relatively poorly described in cattle. For mammary development, the signalling interactions in the placode epithelium are particularly poorly described. For the prolactin signalling pathway, little is known about the downstream signalling of progesterone receptors. For the involution pathway, it is not known how membrane apoptosis is triggered although this would represent a significant contribution to the description of this biological process. Approaches such as microarray and RNAseq technologies using time-course data could help refine this method so that it represents more closely the true biological action. These approaches have successfully identified genes acting at different physiological states in the lactation cycle. Another potential limitation of our study is that the phenotypes were averages of several records across lactation. The same analyses could be performed using just early or late lactation records. Lactation curve parameters have been used in similar modelling experiments and may further refine these numerous SNP associations [41].
Finally, the value of KEGG pathway annotations was questionable. The relevance of these annotations for the target traits is difficult to establish for genes that are involved in broad and numerous biological processes. A further problem is that KEGG annotations are heavily dominated by cancer-related information.

Conclusions
We have successfully used the information from characterised mammary development, prolactin signalling and involution pathways to identify novel SNP associations with milk production traits. The proportion of significant SNPs in or near genes from the mammary development pathway was considerably greater than expected by chance for protein percentage. Of the three pathways studied, the involution pathway was highly associated with milk production traits and explained the highest level of variation above that expected by chance (up to 13% for protein kg). While we have reported many novel candidates useful for further studies, we must point out that pathway-based methods are restricted by the quality of annotations and completeness of pathway information.

Additional files
Additional file 1: Table S1a. Number of phenotypes for production traits. Sample sizes adapted from [42]. Table S1b. The minimum and maximum phenotypes for production traits in dairy cattle. Phenotypes are expressed in standard deviations, with a mean of zero within each breed. Adapted from [42].
Additional file 2: Table S2. Description and measurement of milk production traits. All non-production traits are expressed as a percentage of the standard deviation from the phenotypic mean.
Additional file 3: Figure S1a. Distribution of number of lactations for cows. X-axis is labelled with the mid-point of each bin. Figure S1b. Distribution of number of daughters per bull. X-axis is labelled with the mid-point of each bin.
Additional file 4: Figure S2. Distance between genes in lactation pathways and control permutations. Control plots are scaled to represent 10 000 replicates of randomly selected genes of size equivalent to the experimental pathway.
Additional file 5: Figure S3. Permutation tests for SNP within the mammary development pathway. Associations were created using 800 k SNP data from Holstein and Jersey cattle. Purple bars represent the null hypothesis distribution. SNP sets were randomised from the 200 kb region spanning 67 genes. The vertical red line is the experimental result (e.g. the observed proportion of SNP in that pathway), while the green line is the P ≤ 0.05 significance threshold for the pathway from the permutation test, for a) fat kg, b) milk volume, c) protein kg, d) fat percentage, e) protein percentage.
Additional file 6: Figure S4. Permutation tests for SNP within the prolactin signalling pathway. Purple bars represent the null hypothesis distribution. SNP sets were randomised from the 200 kb region spanning 27 genes. The vertical red line is the experimental result (e.g. the observed proportion of SNP in that pathway), while the green line is the