The efficiency of designs for fine-mapping of quantitative trait loci using combined linkage disequilibrium and linkage

In a simulation study, different designs were compared for efficiency of fine-mapping of QTL. The variance component method for fine-mapping of QTL was used to estimate QTL position and variance components. The design of many families with small size gave a higher mapping resolution than a design with few families of large size. However, the difference is small in half sib designs. The proportion of replicates with the QTL positioned within 3 cM of the true position is 0.71 in the best design, and 0.68 in the worst design applied to 128 animals with a phenotypic record and a QTL explaining 25% of the phenotypic variance. The design of two half sib families each of size 64 was further investigated for a hypothetical population with effective size of 1000 simulated for 6000 generations with a marker density of 0.25 cM and with marker mutation rate 4 × 10-4 per generation. In mapping using bi-allelic markers, 42~55% of replicated simulations could position QTL within 0.75 cM of the true position whereas this was higher for multi allelic markers (48~76%). The accuracy was lowest (48%) when mutation age was 100 generations and increased to 68% and 76% for mutation ages of 200 and 500 generations, respectively, after which it was about 70% for mutation ages of 1000 generations and older. When effective size was linearly decreasing in the last 50 generations, the accuracy was decreased (56 to 70%). We show that half sib designs that have often been used for linkage mapping can have sufficient information for fine-mapping of QTL. It is suggested that the same design with the same animals for linkage mapping should be used for fine-mapping so gene mapping can be cost effective in livestock populations.


INTRODUCTION
In the last decade, numerous QTL for economically important traits in domestic species have been positioned within 30 centimorgan (cM) confidence intervals, using linkage analysis. However, the genomic region of 30 cM still contains too many genes to find causal mutations; e.g. the bovine genome has approximately 30 000∼40 000 genes and the length of the genome is approximately 3000 cM [9]. The exact location and determination of the causal mutation responsible for the observed effect have been reported for only a few QTL; e.g. the double muscling gene [12], the booroola gene [20], the DGAT 1 gene [6].
In many mapping studies, it has now become pertinent to use fine-mapping to decrease the potential genomic region containing QTL to a few cM. Recently, several studies have proposed theory and methods to refine the mapping position of QTL [2,13,14,17]. Among them, a variance component (VC) method using combined LD and linkage [14] has been considered as a promising approach for fine-mapping.
VC methods which fit QTL as random effects can fully account for complex relationships between individuals in outbred populations [5,10]. LD mapping can take into account the historical recombinations, the number of which is far greater than that of pedigree-based linkage studies [21]. On the other hand, linkage is also important because it can give extra information in addition to the LD information especially when there are many relatives. The VC fine mapping method combining LD and linkage has proven to result in a mapping resolution accurate enough to narrow down the QTL confidence interval to a few cM of the genomic region [15].
In mapping studies, design of family structure may be important for accurate mapping resolutions. However, efficiency of different designs for fine-mapping have hardly been reported. For coarse QTL mapping in outbred populations, half sib designs are often used. Such designs contain also information for finemapping as LD information can be used across maternal haplotypes. Besides the design of the experiment, other properties of the population used in the study may be important. For example, the effective size (Ne) has an important effect on the degree of LD. Hayes et al. [7] have also shown that LD patterns are affected by whether the population size has effectively increased (in humans) or effectively decreased (in most livestock) in recent times. Also, the apparent age of the putative favourable QTL mutation may be relevant for the efficiency of LD mapping as it will affect the LD pattern of marker haplotypes surrounding the QTL.
The aim of this study is to investigate the efficiency of various experimental designs for fine-mapping of QTL. Several hypothetical situations with varying effective population size (Ne) and various mutation ages (MA) are used to test the usefulness of existing and proposed designs in livestock for fine scale mapping.

Simulation study
There were two parts to the simulation model. The first part develops the population in a historical sense beyond recorded pedigree. The second part describes the population in the last generations with a family structure and phenotypic data.
The first part of the simulation was designed to generate a variety of populations modeled by varying numbers of effective population size (Ne) and the length of the population history. In each generation, the number of male and female parents are equal, and their alleles were inherited to descendents based on Mendelian segregation using the gene dropping method [11]. Unique numbers were assigned as mutant alleles to QTL in a given generation (depending on mutation age). In the last generation, one of the surviving mutant alleles was randomly chosen and treated as the favourable QTL allele. The marker alleles were mutated at a rate of 4 × 10 −4 per generation as mutation rates have been found in the order of 10 −3 ∼10 −5 [1,3,19]. In the bi-allelic marker model (e.g. single nucleotide polymorphisms), a mutated locus was substituted by the other allele whereas in the multi-allelic marker model (e.g. microsatellites), a new allele was added.
The second part of the simulation model was designed to enable comparison of a variety of family structures with recorded data sets to be modeled by a varying number of sires, dams and offspring. The sires and dams were randomly selected in the last generation (t) of the first part of the simulation. Descendents in generation t + 1 were given a phenotypic record and pedigree was only known for these animals (i.e. animals from generations t were considered unrelated base animals).
Marker genotypes were available for animals from generation t and t + 1 and phases were assumed known. When marker information is available for parents and progeny, the correct linkage phase can often be assigned with a high certainty, using closely linked multiple markers [13]. Pong Wong et al. [16] reported that if more than 10 bi-allelic markers are used, the proportion of individuals having at least one informative marker locus to assign correct phase is more than 90%. If multiple markers (>10) are used in a small region (<10 cM), the assumption of known marker phase is quite reasonable. For a fair comparison between experimental designs, phenotypic value was only available for a fixed number of progeny in generation t + 1. Phenotypic values were simulated as (1) The population mean (µ) was 100, values for u were drawn from N(0, Aσ 2 u ) with σ 2 u = 25, and values for e were from N(0, σ 2 e ) with σ 2 e = 50. For fixing the variance of QTL effect (σ 2 q = 25), the frequency of the favourable QTL allele was estimated among the progeny. The QTL effect (α) was calculated from V q = 2pqα 2 [4], and given to the animal that had a favorable QTL allele.
We only considered frequencies of the favourable QTL allele between 0.1∼0.9 because the QTL effect would become very large with more extreme values. The QTL effect ranged from 7.07 to 11.8 in this situation. The frequency between 0.1∼0.9 may be reasonable for a QTL that was previously detected by linkage mapping [13]. The number of replicates studied was equal to 400 for the family design part of the study, and 200 for studying population properties.

Effect of family structure on efficiency of fine-mapping
Various experimental designs for fine-mapping of QTL were investigated. Mutation occurred at generation 0. An effective population size of 100 was applied for 100 generations in the first part of simulation. At generation 101, full sib and half sib families were generated. The number of families was 64, 32, 16, 8 or 2 with in all cases a number of total progeny of 128 (i.e. 2,4,8,16 and 64 progeny per family). Ten markers were positioned at 1 cM interval. The proportion of replicates positioning the QTL within 3 cM of the true location was determined in each design.

Properties of the population used for LD mapping
In a second part of this study, certain properties in the population used for fine-mapping based on LD will determine the efficiency of the method. Therefore, several populations were simulated varying in effective size and age of the mutation. Initially, a population with effective size of 1000 was simulated for 6000 generations (i.e. t = 6000) with various mutation ages. The mutation occurred at the 2000th, 4000th, 5000th, 5500th, 5800th or the 5900th generation, respectively. The reason of the population history of 6000 generations is because population properties such as haplotype homozygosity or homozygosity of marker genotypes are stable after 2000 generations and a mutation occurs from this time onwards (see discussion). At generation t + 1, two half sib families of size 64 were generated. Ten bi-allelic or multi-allelic markers were positioned at 0.25 cM (or 1 cM) intervals. In each case of mutation age, the proportion of replicated simulations positioning the QTL within three markers (0.75 or 3 cM) of the true location was estimated. A population with linearly decreasing Ne with the various mutation ages was tested. In the linearly decreasing model, Ne = 1000 decreased linearly to Ne = 100 over the last 50 generations.

Mixed linear model
A vector of phenotypic observations simulated from (1) is written as, where y is a vector of N observations on the trait of interest, β is a vector of fixed effects, u is a vector of n random polygenic effects for each animal, q is a vector of n random effects due to QTL and e are residuals. The random effects (u, q and e) are assumed to be normally distributed with mean zero and variance σ 2 u , σ 2 q and σ 2 e . X, Z 1 , and Z 2 are design matrices for the effects in β, u, and q, respectively. From (2), the associated variance covariance matrix of all observations (V) for a given pedigree and marker genotype set is modeled as where A is the numerator relationship matrix based on additive genetic relationships, G is the genotype relationship matrix whose elements are IBD probabilities between individuals at a putative QTL, and R = Iσ 2 e (I is an identity matrix).

Building a genotype relationship matrix (GRM)
Meuwissen and Goddard [13] used the gene dropping method [11] to estimate IBD probabilities between unrelated animals based on similarity of marker haplotypes. Using the IBD probabilities between the unrelated animals, IBD probabilities between animals in the following generations can be recursively estimated from pedigree and observed marker genotypes. Therefore, IBD probabilities between all animals can be estimated based on combined LD and linkage information. Meuwissen and Goddard [14] applied a deterministic prediction method rather than genedropping to estimate IBD probabilities. Although the deterministic prediction is accurate and computationally efficient, it is not flexible for an ongoing marker mutation model (as is the case in our study) because the change of marker allele due to mutation cannot be accounted for in the method. Therefore, we used a genedropping method to be able to accommodate this in the calculation of IBD probabilities. However, there were only small differences in mapping accuracy compared and we used the deterministic method further throughout this study.

GRM and the position of the QTL
There are a number of different GRMs for putative QTL positions across a tested chromosome region. The maximum of the log likelihood and the variance components are estimated with the GRMs for the putative QTL positions. Therefore, each putative QTL position has a maximum value for the log likelihood for model parameters. Comparison of log likelihood values for all positions across the chromosome will give the most likely position.

Restricted maximum likelihood (REML) estimation using an average information (AI) algorithm
By assuming multivariate normality of the data with vector Xb and variance covariance matrix V, the resulting likelihood can be written and a numerical procedure can be used to estimate the parameters (QTL position and variance components). The log of the likelihood for the model in (2) can be written as, where ln is a natural log and |V| is the determinant of V. An efficient algorithm to obtain REML estimates is one that uses the average of the information (AI) from the observed derived Hessian coefficients and the expected derived Fisher's scoring coefficients [8]. The AI algorithm obtains the REML estimate using the following equation: where Θ is a column vector of variance components (σ 2 u , σ 2 q and σ 2 e ), k is kth iteration, ∂L ∂Θ is a column vector of the first derivatives of the log likelihood function with respect to each variance component, and AI is the average information matrix which consists of the average of the Hessian matrix and the Fisher information matrix.

Efficient designs for fine-mapping of QTL
The effect of family structure on accuracy of QTL mapping is illustrated in Figure 1. When the number of full sib families is 32 each with four individuals, the accuracy reached a plateau. The proportion of replicated simulations with the QTL positioned within 3 cM of the true location is 0.7. When the number of families is 2 each with 64 individuals, this proportion is decreased to 0.55. Hence, for combined LD and linkage mapping, many families of small size provide more information than few families of big size. The same result was found when mapping based on LD information only (IBD probabilities were estimated treating all animals unrelated). However, the accuracy is slightly less than with the combined method (Fig. 1), showing that linkage information can help to improve the accuracy. When the results are compared with that based on linkage information only (IBD probabilities between base animals were assumed to be equal to zero), the accuracy and the best design are changed. The accuracy of mapping resolution based on linkage information is highest when the number of families is low and accuracy is much lower when the number of families is high. This is because that if the number of progeny is small, recombination events hardly occur in such a small region (10 cM). It should be noted that the accuracy of 0.3 is no better than randomly positioning the QTL within 3 cM out of 10 cM.
The results show that family structure is important as well as the information (linkage or LD information) that is used. In mapping of QTL, when there are few families each with large size, there is little advantage of LD mapping over linkage mapping (the proportion of positioning QTL within 3 cM in combined LD and linkage mapping is 7% higher than that based on linkage mapping). However, with many families of small size, the advantage of LD mapping over linkage mapping is large.
In the half sib design a large number of families each of small size also give the higher mapping accuracy with combined LD and linkage mapping. However, the difference between using many and using few families is much smaller than in the full sib design. Figure 2 shows that 64 families each with two individuals result in 70% and two families each with 64 individuals result  in 68% of replicates positioning the QTL within 3 cM of the true location. In mapping based on LD information only, the accuracy is slightly reduced, but the pattern of accuracy is the same as in combined mapping. In mapping based on linkage information only, the accuracy is much reduced and the design with few large families provides most information. Two families each with 64 individuals result in 34% of replicates with the QTL positioned within 3 cM of the true location and 64 families each with two individuals result in 30%. As noted in the full sib analysis, an accuracy of less than 0.3 does not have any significant meaning. This lack of information from linkage is also demonstrated by the two linkage curves reaching similar accuracies when the number of families was more than 16. When comparing results between full and half sib designs, there is a different pattern in the combined mapping. With few families of large size, the accuracy in half sib designs is much higher than that in full sib designs whereas the difference is small for many families (e.g. with two families each with 64 individuals, the difference between full sib and half sib designs is 12.8% and with 64 families each with two individuals, the difference is 1.3%). Apparently, with half sib mapping, few families with big size can also give a reasonable mapping accuracy. This is likely due to the fact that in half sib designs, there is substantial LD information in the dam population which can be used. Note that the number of base dams is constant in the different half sib structures.

Effective population size and mutation age
In an analyses based on two half sib families of size 64, and with bi-allelic markers positioned every 0.25 cM, the overall proportion of replicates with the QTL positioned within 0.75 cM of the true location is 42∼55% with constant Ne = 1000 (Fig. 3). When mutation age (MA) is less than 100 generations, the accuracy is lowest (42%). The accuracies are higher when MA is 200 generations and 500 generations (53 and 55%, respectively) and it decreases until MA = 1000 (49%). Beyond a mutation age of 1000 generations, the accuracy is not significantly changed.
With low MA, the chance of common haplotypes carrying different alleles at the QTL is larger, affecting the power of QL detection, and therefore accuracy of positioning. Furthermore, with small MA the time to sufficiently break up chromosomal segments around the QTL is smaller and IBD segments will be longer. The relationship between Ne and the length of a chromosomal region that is IBD can be described as [7,18] LD = 1/(4 * Ne * c + 1), (6) where c is length of the region (Morgan) and Ne is effective population size at the time of mutation. The length of the haplotype that is not broken up by recombination depends on mutation age: c = 1/(2 * MA). LD is defined here as the probability of a region of length c being IBD when two random haplotypes are taken from the population. For example, for Ne = 1000, LD = 0.05 in case of MA = 100, and the length of the IBD region (c) is 0.5 cM while in case of MA = 200, LD is 0.09 and c is 0.25 cM. When mutation age is higher, the degree of LD is higher, and the length of the IBD region is smaller. Therefore, the haplotype having the mutation can be distinguished by smaller chromosome segments as MA increases. However, ongoing marker mutation will disturb haplotype similarity of animals that are IBD. This may explain the lower accuracy for larger values of MA (>1000 generations) (Fig. 3). When multi-allelic markers are positioned every 0.25 cM, overall accuracy is improved compared with using bi-allelic markers (Fig. 3). When only 100 generations passed since the mutation, the accuracy is low (48%). After 200 generations since the mutation, the accuracy is improved (68%) and highest at a mutation age of 500 generations (76%). For the same reason as in the bi-allelic case, the accuracy is slightly lower for higher values of MA (e.g. 72% for MA = 1000; 72% for MA = 2000; 69% for MA = 4000). Compared with mapping using bi-allelic markers, the pattern of accuracy is similar, however, the accuracy under the multi-allelic marker model is much higher. This is likely due to the fact that a high polymorphism under the multi-allelic model can help to distinguish the original haplotypes where mutation occurred from other haplotypes.
When Ne was linearly decreased over the last 50 generations (from 1000 to 100), overall accuracy was lower than with constant Ne (Fig. 3). With decreasing Ne more haplotypes come from recent ancestors and the population has lost more haplotypes that come from more distant ancestors. This situation is improved when MA is older because the degree of LD is higher and the IBD region is smaller. It is noted that the accuracy increases linearly which is different from CONS. This is likely due to the fact that the accuracy was not interrupted by marker mutation because most haplotypes come from recent ancestors. In the case of MA = 100, the accuracy of M LIND somehow increases compared with that in M CONS. With lower MA, a smaller effective size is more advantageous, as the chance of having different alleles at the QTL for the same haplotypes is decreased. However, the accuracy in the case of MA = 100 is lower compared with older mutations (Fig. 3, M LIND). Figure 4 shows standard deviation (SD) of positioning the QTL when multiallelic markers are positioned at 1 cM intervals and 0.25 cM intervals, respectively. Because different marker spacing made it difficult to directly compare the proportion of positioning within three brackets, we calculated SD of positioning the QTL assuming that position error is normally distributed. As shown in Figure 4 the SD of QTL position is much higher with a marker spacing of 1 cM compared with a marker spacing of 0.25 cM across all values of MA. In the case of Ne = 1000, the degree of LD for an IBD region of more than 1 cM is 2.5% (6). This probability is too low to correctly position QTL with a marker spacing of 1 cM. However, the degree of LD for the IBD region of more than 0.25 cM is higher (9%), hence the IBD region is more informative as there will be more phenotypic data available for each haplotype.

DISCUSSION
The present study proposed a design of family structure that is common in livestock populations and could give a reasonable mapping resolution in the joint fine-mapping method using LD and linkage. In general, the accuracy of fine-mapping of QTL depends on sampling haplotypes from a population that has a certain degree of LD between the trait mutation and flanking markers. The sampling error can be reduced by using a large number of base animals (unrelated animals). Because the number of independent base dams is larger in half sib designs, the accuracy in half sib designs is higher than that in full sib designs, especially when the number of families is low. Half sib designs are frequently used for linkage mapping in livestock, and the present study shows that such designs can also have sufficient information for fine-mapping. It is cost effective when the same design used in linkage mapping can also be used for fine-mapping. Of course a further requirement is that the QTL alleles segregate in the dam population used in the half sib design.
We simulated a population with effective size of 1000 for 6000 generations. The reason for 6000 generations of population history is to stabilize the homozygosity in markers and haplotype homozygosity. Figure 5 shows that in the first 2000 generations, homozygosity changes significantly in both cases (bi-allelic and multi-allelic markers). However, after 2000 generations, the homozygosity is stable. Favourable mutations were implemented in this study at generation 2000 or later. After 6000 generations, the average homozygosity was 0.6 in bi-allelic markers and 0.4 in multi-allelic markers, with in the latter case the number of alleles being 5∼15 with constant Ne and 3∼7 with linearly decreasing Ne. These results agree with those of Hayes et al. [7].
When different effective sizes are compared, the accuracy of mapping is not very much affected (Fig. 6). The effective size determines the LD values as described in (6) and LD = 0.17 when Ne = 500. With a marker spacing of 0.25 cM, the mapping accuracy across MA is more accurate with Ne = 1000 than with when Ne = 2000 (Fig. 6). For Ne = 1000, LD is higher and for a given haplotypes there will be more identical IBD haplotypes, giving more information about each of them. Hence power and accuracy of detecting a QTL are increased. However, in the case of Ne = 500, higher LD (0.17) did not give a better result than with Ne = 1000 until MA is around 1500 generations. This is probably because there are fewer different haplotypes with small Ne (and high LD), and similar haplotypes have more chance of carrying different QTL alleles, both causing a decreased accuracy of QTL mapping. With small Ne where haplotypes come from recent ancestors, the accuracy was less interrupted by marker mutation (this situation is similar to M LIND). Therefore, the accuracy is linearly increased, and when MA is more than 1500 generations, the accuracy is higher when Ne = 500 than that with Ne = 1000.
With higher effective size, sampling error of haplotypes increases with the same number of animals in generation t+1 used for fine-mapping, which partly explains the lower accuracy for Ne = 2000. The other reason is the lower LD values for larger Ne. A higher LD (e.g. 0.09) can be obtained with a marker spacing of 0.125 cM from (6). This implies that if marker spacing becomes more dense, the accuracy can be improved for higher effective sizes. However, the relationship between the degree of LD and accuracy has not been empirically shown. Further study is required to determine optimal marker spacing and the number of base animals for a better mapping resolution, given the effective size. In the bi-allelic marker model, the accuracy was lower compared with that in the multi-allelic marker model (Fig. 3). This is probably caused by the fact that in the bi-allelic case, there were relatively many more non-informative markers. This can also be explained by a QTL IBD probability curve. Figure 7 shows a plot of IBD probability against the length of the marker haplotypes. QTL IBD probability for a number of identical flanking marker pairs was estimated by the genedropping method (Ne = 1000 and MA = 200). The slope of the QTL IBD curve in the multi-allelic marker model is steeper than that in the bi-allelic marker model, meaning that there is more information in the multiallelic marker model. The accuracy of mapping was 0.68 for multi-allelic and 0.53 for bi-allelic markers when MA = 200 (Fig. 3).
The QTL allele substitution effect considered in this study was relatively high (0.7-1.2 phenotypic SD) and a high mapping accuracy was achieved with relatively few animals genotyped and phenotyped. Table I shows mapping accuracy for alternative sizes of QTL effect and different data set sizes. When the number of animals with phenotypic values in generation t + 1 is increased, the accuracy also increases significantly. Table I shows that the accuracy is 0.72, 0.85 and 0.94 for CONS; and 0.64, 0.78 and 0.86 for LIND when the size of the data set is 128, 256 and 384, respectively. These results are different from those of Meuwissen and Goddard [13] who reported that with a maker spacing of 0.25 cM, the change of the number of animals did not affect the accuracy. These authors used an effective size of 100 and bi-allelic markers without mutation. In our study we used a bigger effective size and multi-allelic Table I markers, which gives more chance to detect recombination between the QTL and flanking markers. In addition, we used an ongoing marker mutation model with Ne = 1000 for 6000 generations, therefore, the population properties such as haplotype homozygosity or homozygosity in markers can be different from their model. Table I also shows that accuracy is lower for smaller QTL effects, although mapping accuracy is still reasonably high with phenotypic and genotypic data on as few as 384 animals.
Ne and MA will generally be unknown in real life situations. For all cases, we used Ne = 100 and MA = 100 to estimate GRM. When comparing the mapping results obtained with this assumption with the mapping resolution using a GRM based on true population parameters for Ne and MA, the accuracy was not changed. This result agrees with Meuwissen and Goddard [13] who reported that the VC fine-mapping method is robust to assumptions about Ne and MA.
In our simulation, we did not consider artificial selection. In real livestock populations, selection has been carried out for the last several generations (50∼100 generations). The selection effect can influence population LD information and a further study is required to investigate the relationship.

CONCLUSION
In the present study, we showed that the half sib design of few sires mated to a large number of dams could be efficiently used for fine-mapping of QTL. After the population has a certain degree of LD between the trait mutation and flanking markers (around 200 generations since the mutation), QTL can be positioned within 0.75 cM of the true location with 70∼75% of certainty with constant Ne = 1000, and 60∼70% of certainty with decreasing Ne. Under a bi-allelic marker model, mapping resolution was poorer (40∼55%). When the number of animals used for fine-mapping increases, the accuracy will be increased.
It can be suggested that the same design with the same animals used in linkage mapping can be used for fine-mapping of the QTL. This would make the mapping of QTL to narrow genomic regions cost effective.