Geographic distribution of haplotype diversity at the bovine casein locus

The genetic diversity of the casein locus in cattle was studied on the basis of haplotype analysis. Consideration of recently described genetic variants of the casein genes, which to date have not been the subject of diversity studies, allowed the identification of new haplotypes. Genotyping of 30 cattle breeds from four continents revealed a geographically associated distribution of haplotypes, mainly defined by frequencies of alleles at CSN1S1 and CSN3. The genetic diversity within taurine breeds in Europe was found to decrease significantly from the south to the north and from the east to the west. Such geographic patterns of cattle genetic variation at the casein locus may be a result of the domestication process of modern cattle as well as geographically differentiated natural or artificial selection. The comparison of African Bos taurus and Bos indicus breeds allowed the identification of several Bos indicus specific haplotypes (CSN1S1*C-CSN2*A2-CSN3*AI/CSN3*H) that are not found in pure taurine breeds. The occurrence of such haplotypes in southern European breeds also suggests that an introgression of indicine genes into taurine breeds could have contributed to the observed distribution of genetic variation.


INTRODUCTION
The bovine casein locus, mapped on BTA6q31-33 [43], contains four closely linked milk protein genes in the order αs1-casein (CSN1S1), β-casein (CSN2), αs2-casein (CSN1S2), and κ-casein (CSN3). The genes are organised in a cluster of approximately 250 kb [13,41] and share common transcription regulating elements [41]. The locus is considered to influence milk production traits [8,9,17,22,46], and antibacterial activities of derived peptides [29] may also affect the biological fitness of the offspring. Moreover, casein genes harbour a number of variants with suggested effects on traits such as the manufacturing properties of milk [1,35]. Therefore casein genes could be subject to natural and artificial selection [47]. Polymorphisms in the casein genes allow the determination of casein haplotypes, which can be used for studies concerning quantitative traits [14,22,45] or phylogeny [7,28] since they provide more information than the individual genes [21]. Novel casein variants at CSN2 [18] and CSN3 [16,37,38] have been described recently, but up to now it has not been clear how these are linked within the haplotypes.
The population structure of cattle (Bos taurus) reflects its phylogeny. After the domestication during the Neolithic transition in the Near East, human migrants introduced plants and animals from the domestication centre to Europe [2] and also created the genetic basis of the present cattle breeds [3,32,44]. According to the demic expansion model, genetic diversity is expected to be higher at the centre of origin and to decrease with distance [5,42]. The genetic diversity of cattle measured by biochemical or microsatellite markers follows this pattern with allele frequency gradients following the expansion routes [4,30,32]. These studies also suggest a higher genetic diversity of south eastern European breeds compared with those of north western Europe. Additionally, separate domestication and subsequent introgressions of indicine genes into taurine populations in Africa [27] and the Near East [25] produced higher genetic diversity within the hybridisation zones.
The objective of this study was to investigate the diversity of the casein locus in the context of the origin and phylogeny of taurine cattle, including variants that have not previously been the subject of phylogenetic studies.

MATERIALS AND METHODS
Sampling and DNA-extraction
A total of 1396 blood and DNA samples were collected from 30 cattle breeds of taurine and indicine origin (8-77 unrelated animals per breed) (Tab. I). From most breeds, a minimum of 30 animals were analysed. The exceptions were Slovenian-syrmian (8 samples), a population with an effective population size of less than 10 animals [12], Belgian Blue (mixed purpose, 18 samples), and N'Dama (26 samples). European and Anatolian breeds were selected to represent most of the likely genetic variation of European Bos taurus and according to their geographic origin as specified by the longitude (LO) and latitude (LT) of the sampling area (Tab. I). DNA was extracted from leukocytes by standard protocols [33].

Genotyping of casein polymorphisms
The αs1-casein gene was typed for a MaeIII polymorphism in the promoter region (CSN1S1prom) by PCR-RFLP according to the protocol of [20] and for a polymorphism in exon 17 (CSN1S1) by PCR-SSCP, which differentiates CSN1S1*B from CSN1S1*C [19]. Within the αs2-casein gene, the nucleotide exchange differentiating CSN1S2*A and D was analysed by ACRS [36].
The β-casein (CSN2) and κ-casein (CSN3) genes were genotyped by PCR-SSCP, which differentiates alleles that cannot be identified by isoelectric focusing at the protein level. The techniques used differentiate the CSN2 alleles A1, A2, A3, B, C, and I [6], and the CSN3 alleles A, AI, B, C, E, F, G, H, and I [38].

Estimation of allele frequencies and test for Hardy-Weinberg equilibrium
Allele frequencies and deviation from the Hardy-Weinberg equilibrium were estimated using GENEPOP V3.1 software [40]. Deviation from the Hardy-Weinberg equilibrium was analysed using a Markov chain method with 1000 iterations.
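As a rough illustration of what a Hardy-Weinberg test checks, the following Python sketch compares observed genotype counts at a biallelic locus with the counts expected from the allele frequencies. It uses a simple chi-square goodness-of-fit statistic rather than the Markov chain procedure of GENEPOP, and the function name and the example counts are hypothetical.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic for Hardy-Weinberg proportions at a biallelic locus.

    Illustrative only: this is not the Markov chain test used by GENEPOP.
    Returns the frequency of allele A and the chi-square statistic.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p                      # frequency of allele B
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Sum over the three genotype classes, skipping classes with zero expectation
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return p, chi2


# Hypothetical counts, not data from the study
p_balanced, chi2_balanced = hwe_chi_square(25, 50, 25)  # exact HW proportions
p_deficit, chi2_deficit = hwe_chi_square(50, 0, 50)     # no heterozygotes at all
```

A large statistic, as in the second call, signals a departure from Hardy-Weinberg proportions; with one degree of freedom the 5% critical value of the chi-square distribution is about 3.84.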

Effective number of alleles and effective number of haplotypes
For each locus of each breed, the effective number of alleles was calculated using POPGENE V1.31 software [49]. The effective number of haplotypes (Nhap) was calculated by the same software, where the effective number of haplotypes is defined as the reciprocal of the expected homozygosity derived from the haplotype frequencies.
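The definition above, the reciprocal of the expected homozygosity, corresponds to 1 / Σ p_i², where p_i is the frequency of the i-th haplotype (or allele). A minimal Python sketch follows; the function name and the example frequencies are hypothetical, and this is not the POPGENE implementation.

```python
def effective_number(frequencies):
    """Effective number of alleles or haplotypes: 1 / sum(p_i ** 2).

    Frequencies are rescaled to sum to 1, so raw counts are accepted as well.
    """
    total = sum(frequencies)
    if total <= 0:
        raise ValueError("frequencies must sum to a positive value")
    probs = [f / total for f in frequencies]
    expected_homozygosity = sum(p * p for p in probs)
    return 1.0 / expected_homozygosity


# Hypothetical frequencies, not data from the study
even = effective_number([0.25, 0.25, 0.25, 0.25])    # four equally common haplotypes
skewed = effective_number([0.85, 0.05, 0.05, 0.05])  # one dominant haplotype
```

With equal frequencies the effective number equals the actual number of haplotypes; a single dominant haplotype pulls it towards 1, which is why the measure is sensitive to evenness and not only to the count of haplotypes.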

Haplotype frequencies
Haplotype frequencies were estimated under the assumption of allelic association on the basis of all genotype combinations found using EH software [48]. The program uses an iterative maximum-likelihood algorithm and compares haplotype frequencies under the assumption of allelic association (calculated value) with those under the assumption of independence (expected value). In addition it gives χ2 values for this comparison, which were used to calculate P-values for the hypothesis that the calculated values differ from the expected values.
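The idea of iterative maximum-likelihood haplotype estimation can be sketched with an expectation-maximisation loop for the simplest case of two biallelic loci. This is an illustration under stated assumptions, not the EH program: EH handles multiallelic loci and reports likelihoods, whereas the function names, the genotype coding (0, 1 or 2 copies of allele 1) and the example counts below are hypothetical.

```python
HAPLOTYPES = [(0, 0), (0, 1), (1, 0), (1, 1)]     # (allele at locus 1, allele at locus 2)
ALLELE_PAIRS = {0: (0, 0), 1: (0, 1), 2: (1, 1)}  # genotype code -> the two alleles carried


def em_haplotype_frequencies(genotype_counts, max_iter=500, tol=1e-12):
    """Estimate two-locus haplotype frequencies from unphased genotype counts.

    genotype_counts maps (g1, g2) to a number of animals, where g1 and g2
    count the copies of allele 1 (0, 1 or 2) at each of two biallelic loci.
    """
    n_chromosomes = 2 * sum(genotype_counts.values())
    if n_chromosomes == 0:
        raise ValueError("no genotypes supplied")
    freqs = {h: 0.25 for h in HAPLOTYPES}  # start from equal frequencies
    for _ in range(max_iter):
        counts = {h: 0.0 for h in HAPLOTYPES}
        for (g1, g2), n in genotype_counts.items():
            if g1 == 1 and g2 == 1:
                # E-step: a double heterozygote is either 00/11 or 01/10
                cis = freqs[(0, 0)] * freqs[(1, 1)]
                trans = freqs[(0, 1)] * freqs[(1, 0)]
                share = cis / (cis + trans) if cis + trans > 0 else 0.5
                for h in ((0, 0), (1, 1)):
                    counts[h] += n * share
                for h in ((0, 1), (1, 0)):
                    counts[h] += n * (1.0 - share)
            else:
                # Phase is unambiguous when at most one locus is heterozygous
                a, b = ALLELE_PAIRS[g1], ALLELE_PAIRS[g2]
                counts[(a[0], b[0])] += n
                counts[(a[1], b[1])] += n
        # M-step: new frequencies are the expected counts per chromosome
        new_freqs = {h: c / n_chromosomes for h, c in counts.items()}
        change = max(abs(new_freqs[h] - freqs[h]) for h in HAPLOTYPES)
        freqs = new_freqs
        if change < tol:
            break
    return freqs


def independence_expectation(freqs):
    """Haplotype frequencies expected if alleles at the two loci were independent."""
    p1 = freqs[(1, 0)] + freqs[(1, 1)]  # allele 1 at locus 1
    p2 = freqs[(0, 1)] + freqs[(1, 1)]  # allele 1 at locus 2
    return {(a, b): (p1 if a else 1 - p1) * (p2 if b else 1 - p2)
            for a, b in HAPLOTYPES}


# Hypothetical counts, not data from the study
example = {(0, 0): 10, (1, 1): 10, (2, 2): 10}
estimated = em_haplotype_frequencies(example)
expected = independence_expectation(estimated)
```

In this constructed example the double heterozygotes are resolved almost entirely as 00/11, so the estimated frequencies depart strongly from the values expected under independence, which is the kind of contrast the χ2 comparison quantifies.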

Analysis of principal components, correlations and regressions
The analysis of the principal components, correlations and P-values, regressions, and variances of allele frequencies, intra-breed diversity and geographic data were performed using SPSS 8.0.0 Software (SPSS Inc., Chicago, USA). For regression analysis of frequency or diversity data with the geographic origin of the breeds, only European and Anatolian Bos taurus breeds were used.

RESULTS
Allele frequencies at the casein loci and test for Hardy-Weinberg equilibrium
As indicated in Table I, there were great differences in the occurrence and frequencies of the different alleles at the casein loci between breeds.
Twelve out of 150 tests for Hardy-Weinberg equilibrium (for each gene and breed separately) rejected the null hypothesis of Hardy-Weinberg equilibrium at a 5% probability level. Most of these 12 deviations were found at CSN3 in Brahman, Banyo Gudali, Istrian, Piemontese, and Pezzata Rossa. Pezzata Rossa, Piemontese, and Nellore also deviated significantly from Hardy-Weinberg equilibrium when all five loci were pooled together.

Casein haplotype frequencies and linkage disequilibrium
The 19 alleles at the five linked loci were combined in 83 haplotypes. Twenty-one of those were estimated with frequencies over 0.10 in at least one breed (Tab. II). In the 30 breeds analysed, the most frequent haplotype was CSN1S1prom*B-CSN1S1*B-CSN2*A2-CSN1S2*A-CSN3*A with a mean frequency of 0.17, followed by BBA1AA with a mean frequency of 0.15. Neither of these haplotypes was present in Anatolian Black (AB) and Nellore (NE). The related haplotypes BBA1AB and BBA2AB were also widely distributed, being present in 26 and 22 breeds respectively. Various haplotypes were limited to specific breed groups, e.g. BCA2AAI and BCA2AH in Brahman (BH) and Nellore (NE). The latter appears as the predominant haplotype in these breeds, but was also found in Banyo Gudali (GB), Istrian (IS), Polish Red (PR), and Turkish Grey Steppe (TG). Also, BCA2AB occurs at a high frequency only in the hybrid Bos indicus-Bos taurus breeds Anatolian Black (AB) and Santa Gertrudis (SG). The BBCAH, BCCAH, and BBA1AE haplotypes are completely or almost completely breed-specific: the first two occur in the Slovenian-syrmian (SS), and the third is a predominant haplotype in the Ayrshire (AY).
The distribution of the casein haplotypes shows a clear dependence on the geographic origin of the breeds (Tab. II, Fig. 1). The haplotypes BBA2AA and BBA1AA were found predominantly in north western and central (NC) European cattle breeds; haplotypes BBA1AB and BBA2AB are predominant in southern European and African taurine breeds (SE), while in Bos indicus breeds (BI) the haplotypes BCA2AAI, CCA2AAI, BCA2AH, and CCA2AH occur as specific haplotypes or at a high frequency. Such haplotypes were assigned as the basis haplotypes to the corresponding breed groups. In southern Europe many breeds show predominance or a high frequency of further haplotypes which cannot be related to specific breed groups and which may have originated from recent mutations or recombination within haplotypes. In four British breeds (Aberdeen Angus, Ayrshire, Hereford, Jersey) and one African zebu breed (Banyo Gudali), significant (P < 0.05) differences between the calculated and expected haplotype frequencies were observed, and in two further breeds (Charolais, Santa Gertrudis), marginal differences (P < 0.1) were seen.

Variability within breeds
The effective number of haplotypes (Nhap) as a measurement of intra-breed diversity is indicated in Table II. Piemontese (PI) and Turkish Grey Steppe (TG) had the highest Nhap values, while the lowest Nhap was found in the British Friesian (BF).
The effective number of haplotypes (Nhap) was significantly correlated (P = 0.014) with the latitude (LT) of the corresponding sampling area. Regression analysis revealed a fit to the linear equation Nhap = 13.9823 − 0.1700*LT. The correlation between Nhap and the longitude (LO) of breed origin was also found to be significant (P = 0.040), with a linear regression of Nhap = 5.5184 + 0.07464*LO.
Table II. Assignment of haplotypes to breed groups due to predominance or specific occurrence is indicated (Asm*), and the effective number of haplotypes (Nhap), including rare haplotypes, is given.
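To make the two reported regression lines concrete, the short Python sketch below evaluates them at a few coordinates. The coefficients are those given in the text; the function names and the example latitudes and longitudes are hypothetical and do not correspond to particular breeds.

```python
def nhap_from_latitude(lt):
    """Predicted effective number of haplotypes from latitude (degrees north)."""
    return 13.9823 - 0.1700 * lt


def nhap_from_longitude(lo):
    """Predicted effective number of haplotypes from longitude (degrees east)."""
    return 5.5184 + 0.07464 * lo


# Hypothetical coordinates chosen only to show the direction of the trends
for lt in (40.0, 55.0):
    print(f"LT = {lt:4.1f}: predicted Nhap = {nhap_from_latitude(lt):.2f}")
for lo in (0.0, 30.0):
    print(f"LO = {lo:4.1f}: predicted Nhap = {nhap_from_longitude(lo):.2f}")
```

The two equations are separate single-predictor fits, so their predictions should be read as alternative summaries of the same diversity data rather than combined into one value.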

Principal components of haplotype distribution
The first principal component (PC1) accounts for 27.84% of the total variation in haplotype frequencies, and the second component (PC2) accounts for 20.02%.
Within the plot of the first two components in the principal component analysis (PCA) (Fig. 2)  The first principal component (PC1) was found to be dependent on the geographic origin of the samples. A highly significant linear regression (P < 0.00) was found as PC1 = 5.2935 − 0.1179*LT. A logarithmic equation also showed a highly significant fit (P < 0.00) as PC1 = 20.6947 − 5.4486*ln(LT). No significant correlation between PC1 and the longitude of breed origin was found. Further components from the PCA were not correlated with the geographic data.
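The linear and logarithmic fits reported for PC1 can be compared numerically. The following Python sketch uses the coefficients from the text; the function names and the range of latitudes are assumptions made for illustration only.

```python
import math


def pc1_linear(lt):
    """PC1 predicted by the reported linear fit on latitude."""
    return 5.2935 - 0.1179 * lt


def pc1_log(lt):
    """PC1 predicted by the reported logarithmic fit on latitude."""
    return 20.6947 - 5.4486 * math.log(lt)


# Hypothetical latitudes (degrees north), not sampling sites from the study
for lt in (36, 40, 45, 50, 55, 58):
    linear, log_fit = pc1_linear(lt), pc1_log(lt)
    print(f"LT = {lt}: linear = {linear:+.3f}, log = {log_fit:+.3f}, "
          f"difference = {linear - log_fit:+.3f}")
```

Over this range of latitudes the two curves stay close to each other, so both describe the same decline of PC1 towards the north.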

DISCUSSION
The DNA-based genotyping allowed the identification of alleles that have not been included in diversity studies up to now and that cannot be separated by protein phenotyping: CSN3*AI, H, and I cannot be distinguished from CSN3*A, and likewise CSN2*I cannot be separated from CSN2*A2, by electrophoresis of milk samples. The promoter region variants CSN1S1prom*B and C have not been included in previous phylogenetic studies. Up to now CSN3*AI, H, and I have only been described in Bos indicus [38], but in this study they were also found at a lower frequency in taurine breeds. CSN3*H is present in various southern or eastern European breeds, occurs with a relatively high frequency in Turkish cattle breeds and is predominant in Bos indicus breeds. These observations suggest zebu introgressions in southern and eastern European cattle and confirm the results obtained by studies using microsatellites [25] and mitochondrial DNA sequences [10].
Haplotype frequencies could not be enumerated by direct gene counting, because multiple heterozygous individuals cannot be resolved when the haplotypic phase is unknown. Therefore the application of iterative methods is necessary to estimate the distribution of haplotypes behind the recognisable genotype combinations found [48]. This approach may result in a bias, especially for rare haplotypes, owing to the limited sample size; however, it is the only possible approach to estimate haplotype frequencies of unrelated animals. The assumption of Hardy-Weinberg equilibrium for the distribution of haplotypes used by the algorithm in the EH software is problematic in those breeds that were found to deviate from the Hardy-Weinberg equilibrium. This limitation should not affect the final results of the study appreciably because the extent of the deviation was relatively small and restricted to a few breeds.
The observation that casein haplotype frequencies are geographically distributed is in accordance with the findings of former studies based on protein polymorphism [7,23,28]. Mahé et al. [28] described the predominance of a haplotype on the basis of three casein genes, CSN1S1*C-CSN2*A2-CSN3*A, in zebu breeds. However, the electrophoretic methods they used do not allow the discrimination of CSN3*H and CSN3*AI from CSN3*A. Consequently, the occurrence of haplotypes CA2AI and CA2H, which are contained within BCA2AAI, CCA2AAI, BCA2AH, and CCA2AH, is in agreement with these findings and indicates the introgression of Bos indicus in southern and eastern European cattle breeds. These breeds also show an increased gene diversity and haplotypes that apparently originate from recombination events between taurine and indicine haplotypes, e.g. BBA2AH and BBCAH. Similarly, mtDNA analyses [10] and casein haplotype typing [7] indicate the influence of African cattle on the breeds of the Iberian Peninsula, which is confirmed by the predominant appearance of common haplotypes ("southern haplotypes") in African and in southern European cattle. In contrast to the southern breeds, Lien et al. [23] observed BA2A (CSN1S1-CSN2-CSN3) and CA2B widely distributed in northern European breeds. CA2B was found mainly in autochthonous Nordic breeds and BA2A was predominant in highly selected commercial dairy breeds such as Finnish Ayrshire or Holstein-Friesian. Other studies [24,34] have found positive effects of CSN1S1*B on the milk yields of dairy cows, and [45] describe a positive effect of BA2A on milk yield in one of four examined grandsire families in the Finnish Ayrshire. Our results show that geographic patterns of haplotype distributions follow frequency trends of alleles at CSN1S1 (CSN1S1*B: P ≤ 0.010) and CSN3 (CSN3*A: P ≤ 0.00) along the latitude, with the highest frequencies for the yield-related alleles CSN1S1*B and CSN3*A in north western Europe.
The CSN1S1 alleles seem to indicate an effect of artificial selection on the distribution of haplotypes. Most breeds selected for milk yield (e.g. Angler, Holstein-Friesian, Ayrshire) originated in north western Europe and are near fixation for CSN1S1*B. In comparison, central European Highland breeds are mostly dual-purpose breeds. The dual-purpose breeds analysed in this study (German Yellow, Pezzata Rossa) and in other studies [11,15] have a slightly higher frequency of CSN1S1*C. Unselected breeds and those used in extensive production systems analysed in this study originate in southern Europe (Casta Navarra, Fighting Bull, Istrian and Slovenian-syrmian) or Turkey (Anatolian Black and Turkish Grey Steppe) and showed much higher frequencies of CSN1S1*C. The κ-casein gene is suggested to be strongly affected by selection [47], which may explain the observed deviation of this locus from Hardy-Weinberg equilibrium. The observed frequency trend of CSN3 did not seem to be influenced by the way the breed has been selected, indicating that its distribution could be caused by natural rather than by artificial selection.
Medjugorac [31,32] discussed the frequency gradients at milk protein and further biochemical loci, in the context of Neolithic expansion of cattle following domestication in the Near East. Introgression of migrating domesticated cattle into wild aurochs populations could have caused a "diffuse gene gradient" from the domestication centre.
A similar argument could be used to explain the differences in genetic diversity in Europe, with high diversity in south eastern European breeds in comparison with breeds of northern European origin [26,30,44]. This is in agreement with our results of a genetic diversity gradient at the casein genes. It must be mentioned that these gradients are partly caused by an extremely low variability of British Friesian and Aberdeen Angus, which might be breed specific rather than region specific: this could be caused by strong selection pressure on milk production traits in British Friesian and a low effective population size in Aberdeen Angus [12]. If both these breeds are removed from the analysis, correlations of LT and LO with genetic variation show only marginal significance (P = 0.076 for both) and should therefore be assessed with care. It appears possible that the diversity gradients were also formed by differentiated selection or by an introgression of the genetically diverse Bos indicus, increasing the variability in a southern European hybridisation zone [7,10,25]. Finally, surprisingly low linkage disequilibrium (LD) was observed between the casein genes, resulting in few breeds with significant differences between the calculated haplotype frequencies and the haplotype frequencies expected under an assumption of allelic independence. This may be a consequence of the small datasets used, because the applied methods require large amounts of data to prove a significant difference between the calculated and expected haplotype frequencies [48]. However, it can be concluded that despite the close physical linkage of the genes, recombination seems to be relatively frequent within the casein locus. This confirms the observations already described by [19] and [39].
It can be concluded that the geographic origin of breeds is not independent of selection effects on the haplotype distribution and the genetic variability. In addition, the effects of the Neolithic expansion and migration of modern cattle breeds can be found in the correlations between geographic data and the allele frequencies or genetic diversity. An analysis of further breed group-specific markers and a comparative study using neutral markers may elucidate the relative roles of selection and cattle migration events in shaping the genetic diversity and distribution of haplotypes at the bovine casein locus.