Linkage disequilibrium on chromosome 6 in Australian Holstein-Friesian cattle

We analysed linkage disequilibrium (LD) in Australian Holstein-Friesian cattle by genotyping a sample of 45 bulls for 15 closely-spaced microsatellites on two regions of BTA6 reported to carry important QTL for dairy traits. The order and distance of markers were based on the USDA-MARC linkage map. Frequencies of haplotypes were estimated using the E-M approach and a more computationally-intensive Bayesian approach as implemented in PHASE. LD was then estimated using the Hedrick multiallelic extension of Lewontin normalised coefficient D'. Estimates of D' from the two approaches were in close agreement (r = 0.91). The mean estimates of D' for marker pairs with an inter-marker distance of less than 5 cM (n = 13) are 0.57 and 0.51, and for distances more than 20 cM (n = 44) are 0.29 and 0.17, estimated from the E-M and Bayesian approaches, respectively. The Malecot model was fitted for the exponential decline of LD with map distance between markers. The swept radii (the distance at which LD has declined to 1/e (~37%) of its initial value) are 11.6 and 13.7 cM for the above two methods, respectively. The Malecot model was also fitted using map distance in Mb from the bovine integrated map (bovine location database, bLDB) in addition to cM from the MARC map. Overall, the results indicate a high level of LD on chromosome 6 in Australian dairy cattle.


INTRODUCTION
Linkage disequilibrium, or non-random association between alleles at different loci within a population, has recently become important in the context of gene mapping, especially with the availability of high-density maps and reports of extensive LD in various species [5,17].
The potential advantage of LD mapping over conventional linkage mapping is in utilising historic meioses which provide higher mapping resolution [36]. LD analysis has proved powerful for high-resolution mapping of genes for monogenic disorders [5] and complex diseases in humans [5,18]. In dairy cattle, successful efforts have been made towards positional cloning of causal mutations affecting fat percent and protein percent using combined linkage and LD mapping approaches [4,11]. However, this type of analysis is limited to specific types of resource populations and is not practicable for novel traits measured only in groups of animals with no family structure. With increasing numbers of single-nucleotide polymorphism (SNP) markers becoming available, it is now possible to conduct genome-wide LD-mapping in populations with no family structure [16].
Recent simulation studies and reviews have suggested that optimum designs for such LD mapping studies depend mainly on the extent and pattern of LD across the genome in the studied population [1,48]. Marker-marker LD provides a benchmark for evaluating optimal marker density. Hence, a number of studies have attempted to describe the extent of inter-marker LD in various species/populations, and have investigated the relationship between LD and inter-marker map distance. Some of these are described below.
The extent of LD has been studied most thoroughly in humans, where substantial LD extends over physical distances up to 100 kb in several populations [3,16,39], and up to several megabases in isolated populations [3,13,29,38]. The extent of LD in livestock is expected to be higher than in humans, since the forces producing LD (admixture, selection and small effective population sizes) are more extreme in livestock populations [33]. The evidence supporting this expectation is accumulating: high levels of LD have been found to extend for many centimorgans in dairy cattle [10], sheep [28], pigs [34] and horses [43]. These studies used family information to infer the most likely phase of the dams' haplotypes. Tenesa et al. [42] reported some evidence of LD on BTA6 in a sample of unrelated United Kingdom dairy cattle. Sutter et al. [41] used several samples of unrelated animals to estimate LD in a range of dog breeds. Studies in humans, dogs and pigs indicate that the pattern and extent of the LD is often population-specific [19,34,37]. This emphasises that estimates of LD in specific populations are required before carrying out LD-based fine-mapping studies in those populations. Chromosome 6 (BTA6) has been reported to carry important QTL affecting milk production traits. In this study we explored the extent of LD in Australian Holstein-Friesian cattle in the two important regions of chromosome 6 known to carry QTL affecting milk production in dairy cattle. To estimate LD, we used algorithms that do not rely on family information for inferring haplotypes. The animals in the samples are representative of the current Holstein-Friesian population in Australia and thus provide an indication of the extent of LD in this population for chromosome 6.

MATERIALS AND METHODS
A panel of 45 progeny-tested Australian Holstein-Friesian bulls born during 1989-1999 was selected for this study. Using pedigree information obtained from the Australian Dairy Herd Improvement Scheme (ADHIS; http://www.adhis.com.au), these 45 bulls were selected from all the bulls in these cohorts, on the basis of least relationship, as assessed by the coefficient of co-ancestry. To represent the population better, the bulls with more daughters were also preferentially selected. The bulls were considered unrelated for the purpose of this study. The mean kinship between these bulls is 0.009 with first and third quartile for kinship being 0 and 0.012, respectively. The kinships between these bulls were computed using FORTRAN programs in the PEDIG package of D. Boichard (http://dga.jouy.inra.fr/sgqa/diffusions/pedig/pedigE.htm).

Extraction and amplification of DNA
Semen samples for the bulls were obtained from Genetics Australia (Bacchus Marsh, Vic, Australia). Genomic DNA was extracted from straws of frozen semen samples by a salting-out method adapted from Heyen et al. [15]. The quantities of original DNA available were very small. Hence all DNA samples were amplified genome-wide using a GenomiPhi TM DNA Amplification Kit (Amersham Biosciences, Castle Hill, NSW, Australia). Un-amplified and amplified DNA from five of the bulls was genotyped for nine microsatellites and the results were identical. All subsequent genotyping on this panel was done on amplified DNA.

Markers and genotyping
Fifteen BTA6 microsatellites, mainly spanning two regions of interest, were selected for genotyping (Tab. I). These regions had previously been identified in a meta-analysis of published QTL [22], as being of potential interest. The average inter-marker distance is 4.6 cM. All the distances are based on the MARC-USDA linkage map (http://www.marc.usda.gov -accessed 10 January, 2005). Since the genetic distance between close markers may not be precise, hence as an alternative, the distances were also calculated from the bovine Location DataBase (bLDB) (http://medvet.angis.org.au/ldb/; [32]) with distance expressed in kb. The bLDB is an integrated map constructed from all the major public-domain physical and genetic maps of cattle, using the LDB software of Morton et al. [30], which had previously been used for construction of an integrated human map before completion of the human annotated sequence. Genotyping was carried out using a LI-COR 4200 sequencer (LI-COR, Lincoln, USA) with fluorescent dye-labeled primers. Two duplicate DNA samples and one blank sample were included for quality control. Genotypic scoring was done using Gene ImagIR software (http://www.licor.com/bio/GeneIm/ GeneIm1.jsp).

Testing for Hardy-Weinberg Equilibrium
The number of alleles, observed heterozygosity and expected heterozygosity under Hardy-Weinberg equilibrium (HWE) for each microsatellite locus were computed using the web version of Genepop (http://wbiomed.curtin.edu.au/genepop). The P-values of the Fisher exact test for deviations from HWE were computed as per Guo and Thompson [12] using Genepop (Tab. I). The numbers of alleles in the MARC versus our study were compared using loglinear modelling (to cater for low-number counts), taking into account the paired nature of the data, i.e. two observations for each marker. The observed heterozygosity in MARC versus our study were compared by a paired t-test.

Estimating linkage disequilibrium between pairs of markers
A number of measures of LD have been suggested in the literature (Devlin and Risch [8]; Hedrick [14]; Jorde [21]; Lewontin [23]). We estimated LD by using the Hedrick [14] multiallelic extension of the Lewontin [23] normalised coefficient D computed from two-loci haplotype frequencies. For a pair of loci A and B, D was estimated as and where p i• and p • j are frequencies of alleles i and j at loci A and B, respectively; p i j is the frequency of the pairwise haplotype ij; and n A and n B are the total numbers of the alleles at loci A and B, respectively. This statistic has been used to measure the extent of LD in human [3] and livestock populations in a range of studies (e.g. [10,28,34,42]), thereby facilitating the comparison of the results in this study with other studies. D allows LD to be measured with highly polymorphic markers and is less sensitive to variation in marker allele frequencies than other measures of LD [14,46].

Haplotype frequency estimation
The haplotype frequencies were estimated using the GOLD-ldmax procedure (http://www.sph.umich.edu/csg/abecasis/GOLD/; [2]). This procedure implements a maximum likelihood based expectation-maximisation (E-M) algorithm [9]. Haplotype frequencies were also estimated using a Bayesian algorithm as implemented in PHASE 2.0 [40]. These haplotype frequencies were then used to calculate D values as detailed above. Chi-square statistics, to test for independence between alleles at two loci, were also computed for each pair of markers using GOLD. PHASE by default assume a stepwise mutation model for microsatellite markers and is appropriate when the length of each allele is known. However, in our data set the actual length of each microsatellite allele was not known. Hence this condition was relaxed to the parent independent model in which each allele has the same chance to mutate to any of the other alleles. The algorithm used in PHASE has been reported as being more reliable than the E-M approach in inferring haplotypes [35,45].

Malecot model
There was a general trend of decreasing LD with increasing inter-marker distance, and to quantify this relationship for the current data sets, we fitted the Malecot model [6,24] to the data. This takes the exponential-asymptotic form, where D i is the observed LD for the ith pair of syntenic loci distance d i apart, L (0 ≤ L ≤ 1) is the residual association at large distances (i.e. between unlinked loci), M (0 ≤ M ≤ 1) is the proportion of alleles transmitted from founders (and so is 1 if alleles are monophyletic and less than 1 otherwise), κ (> 0) is the exponential decay rate of D with physical distance d, and ε i is the random error deviation for the ith pair of loci, with var(ε i ) = σ 2 . ρ(d) represents the expected value of D at distance d. This declines from ρ(0) = (1 − L)M + L at d = 0 to a level ρ(∞) = L for unlinked loci. Since smaller values of the exponential decay parameter κ are associated with more extensive LD, Morton et al. [31] proposed that 1/κ is a good measure of useful LD for mapping, naming this as the "swept radius". This is the distance over which LD declines to e −1 (∼37%) of its initial value. We have also adopted this measure in the current study. The 95 percent of the confidence interval of the swept radius was computed as 1/(κ ± 1.96 × SE(κ)). The Malecot model was fitted using the nls() function in S-PLUS / R.

RESULTS
Marker names, positions on the MARC and bLDB maps, observed and expected heterozygosity, tests of departure from HWE, and the number of alleles in this data set are shown in Table I. The average distance between markers is 4.6 cM (ranged from 0.3 to 17.4 cM). The average number of alleles in this data set is significantly (P < 0.001) lower (5.2) as compared to MARC (7.9). The average observed heterozygosity in our study (56.3%) was not significantly different from that in the MARC mapping population (60.5%) (P = 0.54). The average observed heterozygosity is lower than the average expected heterozygosity in this study (62.9%) and two markers BM143 and BL1099 show significant deviation from HWE. These two markers were excluded for the further analysis of LD between markers.
D values were estimated between all pairwise combinations of the thirteen markers, based on two different approaches to infer haplotypes. The mean D for all pairs is 0.38 ± 0.019 and 0.27 ± 0.019, with haplotypes inferred by E-M and Bayesian approaches, respectively. The mean estimates of D for marker pairs with an inter-marker distance of less than 5 cM (n = 13) are 0.57 and 0.51, and for distances greater than 20 cM (n = 44) are 0.29 and 0.17, estimated from the E-M and Bayesian approaches, respectively. The individual values of D based on haplotypes inferred by the two approaches were highly correlated (r = 0.91, see Fig. 1). The agreement between the two methods appeared greater for the higher values of D , although in general values of D calculated using GOLD tend to be higher than those calculated using PHASE.
The distribution on the D over distance is presented in Figure 2A-D, in each case with the Malecot model fitted. The general decline in the D over intermarker distance was evident (Fig. 3). The inverse of the parameter κ of the Malecot model can be used to estimate the swept radius, a distance to which LD may be useful for mapping [26]. We estimated the swept radii as 11.6 (95% confidence interval (CI) 6.6 to 46.9) cM and 13.4 (CI 8.9 to 29.6) cM based on haplotypes inferred from the E-M and Bayesian approaches, respectively ( Fig. 2A-B). The Malecot model was also fitted with distance in Mb from the bovine integrated map (bLDB), giving estimates of swept radii of 15.4 (CI 8.9 to 55.7) Mb and 17.9 (CI 12 to 35) Mb, for the two approaches, respectively (Fig. 2C-D). Note that the relatively wide CI were due to the small number of markers and bulls in this study. The parameter estimates from the Malecot model from the two methods and using the two types of map distances are presented in Table II. A steeper decline was observed in the chi-square values when plotted over distance between markers (Fig. 4).

DISCUSSION
In this study we quantify the extent of LD in Australian Holstein-Friesian dairy cattle in the two regions of BTA6 previously known to carry important QTL [22]. Australian Holstein-Friesian cattle are under intense selection for milk production traits and have a small effective population size of ∼95 [25]. Also, there has been a continuous import of germplasm from different countries in addition to an ongoing within-country progeny testing program (http://www.adhis.com.au).
The D values estimated from haplotypes inferred using the E-M and Bayesian approaches were comparable (r = 0.91). However, the Malecot model fitted better (R 2 = 0.63) with the D estimated from Bayesian-inferred haplotypes as compared to E-M-inferred haplotypes (R 2 = 0.43) (Fig. 2 A-B; Tab. II). This conclusion about the fit of the Malecot model was similar when the distances in Mb units from the integrated map bLDB were used: the R 2 was 0.66 and 0.46 for the two methods of inferring haplotypes, respectively ( Fig. 2C-D; Tab. II).
The integrated bLDB map uses as its framework the best available radiationhybrid mapping information, which provides the best physical map available at present in cattle. Genetic maps may be inaccurate over small intervals where one is most interested in precisely quantifying the relationship between LD and distance. Inaccurate estimates of distances in the genetic maps are mainly because of the limited recombination events available in the resource population used for linkage mapping. This fact can be noted in the substantial changes recently in the genetic distances in MARC maps after incorporation of new map information [20]. From the mapping perspective, it is important to know the extent of LD over physical distances, so that marker density on the physical map can be determined. Both the approaches for estimating the frequency of haplotypes used in this study have been reported to be satisfactory when compared with empirically estimated haplotypes [44], the Bayesian approach is somewhat more accurate [35,45] but requires considerably more computation. Consequently, the E-M algorithm has been implemented in a number of analytical tools to survey LD based on population data e.g. [26]. The estimated swept radii from the two approaches are comparable. Based on the estimates of κ in the Malecot model, the results indicate that useful LD extends around 15 cM in Australian Holstein-Friesian dairy cattle (Tab. II). The extent of LD observed in this study in Australian dairy cattle is comparable to that reported for a Dutch Holstein population [10] in which LD between syntenic markers extended over "several tens of centimorgans". However, there were differences in the sizes of the sample and the marker density used in these two studies.
On the one hand, high LD makes the application of LD mapping feasible for genome-wide QTL mapping, but on the other hand it can limit the resolution of   fine mapping and may hinder positional cloning of the actual mutation. However, the results obtained from population-wide LD of markers with economic traits can be readily applicable in improving breeding stock through MAS [7] without the need to find the causative mutation.
In the human genome, the extent of LD is much smaller (5-100 kb) than reported in this study and in the Dutch Holstein population [10]. Recently Hinds et al. [16] reported the analysis of 1.58 million human SNP and suggested that the same power in association mapping of complex traits can be achieved by using around 300 000 SNP. In contrast, in dairy cattle, the far higher levels of LD mean that far fewer markers will be required for similar power.
The standard measure of LD provides a starting point only, and may not fully capture the complexity of the occurrence of LD. By fitting the Malecot model in windows of small segments of the genome with very close markers, it is possible to assess how patterns of LD change not only between chromosomes, but also within chromosomes. Using such approaches, recent studies in humans have suggested that the genome is organised into the blocks of haplotypes with high LD, separated by regions of low LD [16,26]. Recently, metric LD maps scaled in LD units (LDU) have been developed in humans, and have been suggested as a tool for targeted and optimal marker placement, which should lead to more powerful gene mapping and population studies [27,47]. Similar efforts of creating LDU maps to understand the pattern of LD over the cattle genome are required to design optimum experiments for LD-based localisation of genetic variants affecting economic traits in cattle. The advent of high density SNP maps in bovine and high throughput genotyping platforms [17] now provide scope to undertake such studies in dairy cattle.