Measuring genetic distances between breeds: use of some distances in various short term evolution models

Many works demonstrate the benefits of using highly polymorphic markers such as microsatellites in order to measure the genetic diversity between closely related breeds. But it is sometimes difficult to decide which genetic distance should be used. In this paper we review the behaviour of the main distances encountered in the literature in various divergence models. In the first part, we consider that breeds are populations in which the assumption of equilibrium between drift and mutation is verified. In this case some interesting distances can be expressed as a function of divergence time, t, and therefore can be used to construct phylogenies. Distances based on allele size distribution (such as $(\delta\mu)^2$ and derived distances), which specifically take a mutation model of microsatellites, the Stepwise Mutation Model, into account, exhibit large variance and therefore should not be used to accurately infer the phylogeny of closely related breeds. In the last section, we consider that breeds are small populations and that the divergence times between them are too small for the observed diversity to be due to mutations: divergence is mainly due to genetic drift. Expectation and variance of distances were calculated as a function of the Wright-Malécot inbreeding coefficient, F. Computer simulations performed under this divergence model show that the Reynolds distance [57] is the best method for very closely related breeds.


INTRODUCTION
Assuming a species-like evolution pattern (evolution scheme as a dichotomy), the time scale that separates breeds is rather short compared with the hundreds of thousands of years separating species. In order to measure the genetic distances between closely related populations like breeds, it is desirable to use highly polymorphic markers such as microsatellites [3,4,9,15,18,24,37,40,53,59,60,70].
The high number of microsatellites distributed over whole genomes, coupled with their very rapid evolution rates, makes them particularly useful for working out relationships among very closely related populations [14,21,22,62,64,66]. Microsatellite markers are a class of tandem repeat loci exhibiting a high mutation rate. Therefore, a high level of polymorphism can be maintained within relatively small samples. The within breed average heterozygosity is generally higher than 0.5 [37,40,54], with extreme values above 0.8 observed for several loci [33]. For a large proportion of microsatellites, the number of alleles observed across mammalian populations ranges from fewer than 10 to 20 and can be even higher across natural populations of fish [56].
In this paper, we study the behaviour of the genetic distances between two isolated populations, denoted X and Y, diverging from a founder population $P_0$ for a small number of non-overlapping generations (short term evolution models). The founder and derived populations are characterised by their allele frequencies $p_{0,i}$, $p_{X,i}$ and $p_{Y,i}$ (for $i = 1, \ldots, k$) respectively at the $\ell$th locus (the locus index $\ell$, varying from 1 to L, is omitted).
For the sake of simplicity, the formulae of distances presented in the first section of the present paper are given assuming that the true allele frequencies are known. In practice, $p_{X,i}$ and $p_{Y,i}$ are estimated from a limited number of individuals: $x_i = m_{X,i}/m_{X,\bullet}$ and $y_i = m_{Y,i}/m_{Y,\bullet}$, where $m_{X,i}$ (resp. $m_{Y,i}$) is the number of copies of allele i and $m_{X,\bullet}$ (resp. $m_{Y,\bullet}$) the total number of genes in sample X (resp. Y).
In the second section we will review the behaviour of genetic distances under the classical model of evolution of neutral markers assuming combined effects of mutation and genetic drift [28,29,38,41,52].
The negligible effect of mutations over a rather short divergence time allows us to consider, in the third section, the relationship between the expectation and variance of distances and the Wright-Malécot inbreeding coefficient F [39], assuming genetic drift only. In order to guide the choice of distances, we will check their efficiency by computer simulations.

PRESENTATION OF DISTANCES
The apparent diversity of genetic distances may be structured into two or three main groups: the distances based on allele frequency distributions (Euclidean and angular distances) and the distances based on allele size distributions.

Euclidean and related distances
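Denote by $X = (p_{X,1}, \ldots, p_{X,k})$ and $Y = (p_{Y,1}, \ldots, p_{Y,k})$ the vectors of allele frequencies of populations X and Y. The basis of the distances reviewed in this paragraph is a norm $\|X - Y\|$.
Gregorius [26] uses $\|X - Y\|_1$, based on the sum of absolute allele frequency differences, to define the absolute distance $D_G$ (equation (1)).
The norm $\|X - Y\|_2$, the square root of the sum of the squares of allele frequency differences, usually called the Euclidean distance, has been directly used by Gower [25] and Goodman [23]:
$$D_E = \|X - Y\|_2 = \sqrt{\sum_i (p_{X,i} - p_{Y,i})^2} \tag{2}$$
Dividing (2) by $\sqrt{2}$ defines $D_{Rog}$, the Rogers distance [58], and taking the square provides the minimum distance $D_m$ [46]:
$$D_{Rog} = \frac{D_E}{\sqrt{2}}, \qquad D_m = D_{Rog}^2 = \frac{1}{2}\sum_i (p_{X,i} - p_{Y,i})^2 \tag{3}$$
According to the Nei notations [46] of gene identity ($j_X = \sum_i p_{X,i}^2$, or expected homozygosity, and $j_{XY} = \sum_i p_{X,i}\,p_{Y,i}$) and diversity ($d = 1 - j$, or expected heterozygosity), $D_m$ may be rewritten as the between populations gene diversity reduced by the average of the within population gene diversity:
$$D_m = d_{XY} - \frac{d_X + d_Y}{2} = \frac{j_X + j_Y}{2} - j_{XY} \tag{4}$$
Between two populations, $G_{ST}$ [47] is generally expressed with the heterozygosity of the total population $H_T = 1 - \sum_i \bar{p}_i^2$ (with $\bar{p}_i = (p_{X,i} + p_{Y,i})/2$) and the average of the expected heterozygosity within populations $\bar{H}$:
$$G_{ST} = \frac{H_T - \bar{H}}{H_T} \tag{5}$$
It can be rewritten as
$$G_{ST} = \frac{D_m}{2H_T} \tag{6}$$
which is also called the distance of Morton [42].
Other variations of the minimum distance, $\gamma_L$ (equation (7)) and $D_R$, were used by Latter [31,32] and Reynolds [57] respectively, with
$$D_R = \frac{D_m}{1 - j_{XY}} = \frac{\sum_i (p_{X,i} - p_{Y,i})^2}{2\left(1 - \sum_i p_{X,i}\,p_{Y,i}\right)} \tag{8}$$
In parallel, Balakrishnan and Sanghvi [1], and Barker [2] defined the distances given in equations (9) and (10) respectively.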
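As an illustration, the Euclidean family can be sketched in a few lines of Python. This is a minimal sketch assuming the usual textbook forms (Euclidean norm, Rogers distance as the norm divided by the square root of 2, minimum distance as half the sum of squared differences, and a Reynolds-type ratio $D_m/(1 - j_{XY})$); the function names and the example frequencies are hypothetical and chosen only for the demonstration.

```python
import math

def gene_identity(x, y):
    """Gene identity j_XY = sum_i x_i * y_i (gives j_X when y is x)."""
    return sum(a * b for a, b in zip(x, y))

def euclidean(x, y):
    """Euclidean norm of the difference between two frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def rogers(x, y):
    """Rogers distance: Euclidean distance divided by sqrt(2)."""
    return euclidean(x, y) / math.sqrt(2.0)

def minimum(x, y):
    """Minimum distance of Nei: half the sum of squared differences."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(x, y))

def reynolds(x, y):
    """Assumed form D_m / (1 - j_XY); raises ZeroDivisionError when both
    populations are fixed for the same allele."""
    return minimum(x, y) / (1.0 - gene_identity(x, y))

# Hypothetical allele frequencies at one locus with three alleles.
p_x = [0.5, 0.3, 0.2]
p_y = [0.2, 0.3, 0.5]

d_m = minimum(p_x, p_y)
# Same quantity written with gene identities: (j_X + j_Y)/2 - j_XY.
d_m_identity = 0.5 * (gene_identity(p_x, p_x) + gene_identity(p_y, p_y)) \
    - gene_identity(p_x, p_y)
print(d_m, d_m_identity, rogers(p_x, p_y), reynolds(p_x, p_y))
```

The two ways of computing the minimum distance agree, which is the rewriting in terms of gene identities described above.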

Angular distances
These distances are defined on the basis of the cosine of the angle θ between the two vectors X and Y.
Nei [46,47,49] reformulated $\cos\theta$ as the normalised identity $I = j_{XY}/\sqrt{j_X\,j_Y}$ between the two populations and derived its standard genetic distance from the logarithm of $\cos\theta$:
$$D_S = -\ln I = -\ln\frac{j_{XY}}{\sqrt{j_X\,j_Y}} \tag{11}$$
It is noteworthy that $D_m$ is turned into $D_S$ after a logarithm transformation of the gene identities in (4).
With the square roots of allele frequencies, which define vectors of unit norm, the cosine of θ can be rewritten as $\cos\theta_{EC} = \sum_i \sqrt{p_{X,i}\,p_{Y,i}}$. Edwards and Cavalli-Sforza [5,6,12,13] defined $D_c$, the chord distance, and $f_\theta$ respectively as:
$$D_c = \mathrm{Cste}\,\sqrt{1 - \cos\theta_{EC}} \tag{12}$$
$$f_\theta = \frac{4\,(1 - \cos\theta_{EC})}{k - 1} \tag{13}$$
The value of Cste sets the range of the chord distance (when Cste = 1, $D_c$ varies from 0 to 1). Since the number of rare alleles increases with the number of sampled individuals, $f_\theta$ underestimates the expected genetic differentiation that would be obtained with an increased sample size [51]. For this reason, Nei advises using a corrected distance $D_A$ (equal to the square of $D_c$ for Cste = 1):
$$D_A = 1 - \sum_i \sqrt{p_{X,i}\,p_{Y,i}} \tag{14}$$
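The angular distances can be sketched in the same way. The code below assumes the usual forms $D_S = -\ln(j_{XY}/\sqrt{j_X j_Y})$, $D_A = 1 - \sum_i\sqrt{p_{X,i}p_{Y,i}}$, $D_c = \mathrm{Cste}\sqrt{D_A}$ and $f_\theta = 4 D_A/(k-1)$, with k taken as the length of the frequency vector; function names and example values are hypothetical.

```python
import math

def cos_theta_ec(x, y):
    """Cosine of the angle between the square-root frequency vectors."""
    return sum(math.sqrt(a * b) for a, b in zip(x, y))

def nei_standard(x, y):
    """Standard distance D_S; math.log raises ValueError when the two
    populations share no allele (identity equal to zero)."""
    j_xy = sum(a * b for a, b in zip(x, y))
    j_x = sum(a * a for a in x)
    j_y = sum(b * b for b in y)
    return -math.log(j_xy / math.sqrt(j_x * j_y))

def nei_da(x, y):
    """D_A = 1 - cos(theta_EC), clipped at 0 to absorb rounding error."""
    return max(0.0, 1.0 - cos_theta_ec(x, y))

def chord(x, y, cste=1.0):
    """Chord distance; with cste = 1 it ranges from 0 to 1."""
    return cste * math.sqrt(nei_da(x, y))

def f_theta(x, y):
    """f_theta with k assumed to be the number of alleles listed."""
    k = len(x)
    return 4.0 * nei_da(x, y) / (k - 1)

# Hypothetical allele frequencies at one locus with three alleles.
p_x = [0.5, 0.3, 0.2]
p_y = [0.2, 0.3, 0.5]
print(nei_standard(p_x, p_y), chord(p_x, p_y), nei_da(p_x, p_y), f_theta(p_x, p_y))
```

With Cste = 1 the square of the chord distance returns $D_A$, as stated in the text.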

Distances based on allele size distributions
We also consider genetic distances expressed with respect to the moments of allelic size distributions of markers exhibiting length polymorphism.
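Denote by i and j the repeat numbers of alleles i and j respectively. Goldstein [20] derived a distance from the Average Square Difference between populations, $D_1$:
$$D_1 = \sum_i\sum_j p_{X,i}\,p_{Y,j}\,(i - j)^2 = (\mu_X - \mu_Y)^2 + V_X + V_Y \tag{15}$$
with $\mu_X$, $\mu_Y$, $V_X$ and $V_Y$ the means and variances in allelic sizes within populations.
Denote by $\phi_{ij}$ a function of the difference $i - j$ (null when $i = j$ and $> 0$ otherwise). A general distance can then be written as the between population average of $\phi_{ij}$ reduced by the mean of the within population averages:
$$\sum_{i,j} p_{X,i}\,p_{Y,j}\,\phi_{ij} - \frac{1}{2}\Big(\sum_{i,j} p_{X,i}\,p_{X,j}\,\phi_{ij} + \sum_{i,j} p_{Y,i}\,p_{Y,j}\,\phi_{ij}\Big) \tag{16}$$
The within population Average Square Difference $D_{0,X}$ is defined by $\sum_{i,j} p_{X,i}\,p_{X,j}\,(i - j)^2$ (idem for population Y) and is equal to $2V_X$. Then, equation (16) in which $\phi_{ij}$ is set to $(i - j)^2$ may be rewritten as the squared difference between the allele size means $(\mu_X - \mu_Y)^2$, usually called $(\delta\mu)^2$, the distance of Goldstein [21].
The $D_{SW}$ distance of Shriver [62] may be computed with (16) by setting $\phi_{ij}$ equal to $|i - j|$.
Slatkin [63,64] proposes using $D_1$, $D_{0,X}$ and $D_{0,Y}$ in order to extend the $G_{ST}$ calculation to length polymorphism, which defines the statistic $R_{ST}$.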
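The relation between $D_1$, $(\delta\mu)^2$ and $D_{SW}$ can be checked numerically. The sketch below assumes that the general distance is the between population average of a function $\phi$ of the size difference, reduced by the mean of the two within population averages; allele size distributions are represented as hypothetical dictionaries mapping repeat number to frequency.

```python
def mean_size(p):
    """Mean allele size (repeat number) of a population."""
    return sum(size * freq for size, freq in p.items())

def variance_size(p):
    """Variance of allele size within a population."""
    mu = mean_size(p)
    return sum(freq * (size - mu) ** 2 for size, freq in p.items())

def average_phi(p, q, phi):
    """Average of phi(i, j) over pairs of genes drawn from p and q."""
    return sum(fp * fq * phi(i, j) for i, fp in p.items() for j, fq in q.items())

def corrected_distance(p, q, phi):
    """Between population average minus the mean within population average."""
    return average_phi(p, q, phi) - 0.5 * (average_phi(p, p, phi) + average_phi(q, q, phi))

def squared(i, j):
    return (i - j) ** 2

def absolute(i, j):
    return abs(i - j)

# Hypothetical allele size distributions (repeat number -> frequency).
p_x = {10: 0.5, 11: 0.5}
p_y = {12: 0.5, 13: 0.5}

d1 = average_phi(p_x, p_y, squared)                   # Average Square Difference
delta_mu_sq = corrected_distance(p_x, p_y, squared)   # (delta mu)^2
d_sw = corrected_distance(p_x, p_y, absolute)         # stepwise weighted distance
print(d1, delta_mu_sq, d_sw)
```

In this example the squared difference of means recovered from the general form equals $(\mu_X - \mu_Y)^2$, and $D_1$ exceeds it by $V_X + V_Y$.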

Multiple loci
In practice, the estimation of distances is performed using the arithmetic mean over L loci.
Nevertheless, when at least one locus is fixed for the same allele in X and Y, $D_R$ is undefined. So Latter [30] advises using $D_L$, computed as follows (PHYLIP package [17]):
$$D_L = \frac{\sum_\ell D_{m,\ell}}{\sum_\ell \left(1 - j_{XY,\ell}\right)} \tag{18}$$
When at least one locus exhibits no allele shared between populations, the logarithm transformation $\log I$ is undefined ($I = 0$). So Nei advises rather computing $D_S$ with the arithmetic means over loci of gene identities:
$$D_S = -\ln\frac{\bar{j}_{XY}}{\sqrt{\bar{j}_X\,\bar{j}_Y}} \tag{19}$$
It is noteworthy that after removing loci with no shared alleles, taking the arithmetic mean of (11) over loci (which is equivalent to using the geometric means of gene identities, $\big(\prod_{\ell=1}^{L} j_\ell\big)^{1/L}$) gives the maximum distance $D_M$ of Nei [46]. Due to rare alleles within samples, the arithmetic mean of (11) is generally higher than (19).
Unbiased estimates of $D_m$, called $\hat{D}_m$ (and derived distances), of $D_S$, called $\hat{D}_S$ (the expectation of $\hat{D}_S$ is shown in Appendix A), and of distances taking allelic sizes into account are computable with sampled allele frequencies $x_i$ and $y_i$ using an unbiased estimation of the within and between population gene identities [49]. The bias correction of $\chi^2$ given in [19] is also relevant for $\hat{D}_B$. So for the sake of simplicity, the expectations of distances under divergence models were computed assuming that true frequencies were known.
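The difference between averaging a ratio over loci and pooling numerators and denominators can be illustrated as follows. This sketch assumes the per-locus ratio $D_m/(1 - j_{XY})$ and the pooled form with sums over loci in the numerator and denominator (the form used by the PHYLIP package for the Reynolds distance); the two-locus data set is hypothetical.

```python
def minimum(x, y):
    """Minimum distance at one locus."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(x, y))

def identity(x, y):
    """Between population gene identity at one locus."""
    return sum(a * b for a, b in zip(x, y))

def mean_per_locus_ratio(loci):
    """Arithmetic mean over loci of D_m / (1 - j_XY)."""
    return sum(minimum(x, y) / (1.0 - identity(x, y)) for x, y in loci) / len(loci)

def pooled_ratio(loci):
    """Sum of D_m over loci divided by the sum of (1 - j_XY) over loci."""
    numerator = sum(minimum(x, y) for x, y in loci)
    denominator = sum(1.0 - identity(x, y) for x, y in loci)
    return numerator / denominator

# Hypothetical data: (frequencies in X, frequencies in Y) for each locus.
loci = [
    ([1.0, 0.0], [1.0, 0.0]),  # both populations fixed for the same allele
    ([0.6, 0.4], [0.3, 0.7]),
]

try:
    mean_per_locus_ratio(loci)
    per_locus_defined = True
except ZeroDivisionError:
    # The fixed locus gives 0/0, so the per-locus mean is undefined.
    per_locus_defined = False

d_l = pooled_ratio(loci)
print(per_locus_defined, d_l)
```

The fixed locus contributes nothing to either sum of the pooled form, so the distance stays defined as long as at least one locus is polymorphic.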

GENETIC DISTANCES UNDER GENETIC DRIFT AND MUTATION
The standard assumption that both derived populations, as well as the founder population, are in a mutation-drift equilibrium, implies that population divergence is due to the appearance of new mutants within populations. So distances can be used from a phylogenetic point of view, as estimators of divergence time.

Infinite allele mutation model
Due to the large number of variations a gene may theoretically exhibit, the number of possible new mutants is expected to be very large. The most appropriate mutation model for such markers is the infinite allele mutation model, IAM [28,38,65].
In this model, $D_S$ is turned into a linear function of divergence time t and mutation rate β of markers:
$$E(D_S) = 2\beta t$$
Nei [45,46,49] advises using $D_S$ in order to construct phylogenies for closely related as well as for largely diverged populations. In contrast, the IAM expectation of $D_m$, which exhibits a finite maximal value given the founder gene identity $j^{(0)}$ [51], is:
$$E(D_m) = j^{(0)}\left(1 - e^{-2\beta t}\right)$$
Derived distances (equations 5 to 10) as well as $f_\theta$, $D_c$ and $D_A$ are not linear for all t values. Their behaviour (underestimation of divergence when t increases) reduces their ability to resolve the branching pattern between largely diverged populations. But for small divergence ($\beta t \ll 1$) they can be considered as quasi-linear functions of t. In addition $\gamma_L$, being independent of founder allele distributions, has the desirable advantage of being directly linked to the divergence time (expectation close to $2\beta t$ [31]).
Nevertheless, Takezaki and Nei [66] showed by simulations that $D_S$, exhibiting a larger variance than the non-linear distances $D_c$ or $D_A$, provides few correct tree topologies between populations within species.
Divergence is governed by $\beta t$, implying that for a small divergence time, differences between populations measured with gene polymorphisms, given their confirmed low mutability (the mutation rate of the α and β chains of insulin is estimated to be $10^{-7}$/codon/generation [48]), are expected to be small. The values of $D_S$ are generally less than 0.01 or 0.02 between local breeds or subspecies [48]. So from a phylogenetic point of view assuming divergence by mutation, markers with a high mutability should enhance the precision of distance estimations for closely related populations. It was shown by Takezaki and Nei [66], via computer simulations, that markers with microsatellite characteristics give as many correct phylogenies when t = 400 as markers with low mutability when t = 40 000.
Shriver [62], Goldstein [20,21], Slatkin [64] and many others have developed linear statistics assuming infinite numbers of possible allelic scores. As D 1 and R ST depend on the effective founder size, they are sensitive to bottlenecks and are not suited to deriving phylogenies [20,44].
Since under the assumption of an equilibrium between drift and mutation, the variance of allelic size converges [20,41,64], the growth of $D_1$ is only due to the linear growth of the squared difference between the means in (15) [21]:
$$E\left[(\delta\mu)^2\right] = 2\beta t$$
Although there is no explicit formula, Shriver [62] and Takezaki and Nei [66] showed by simulations that $D_{SW}$ increases almost linearly (up to 10 000 generations with β = 0.0003) with a slope different from 2β. It is noteworthy that, assuming alleles can mutate by more than one repeat, a generalised equation can easily be obtained by substituting β with $\bar{w} = \frac{1}{L}\sum_\ell w_\ell$ [74], with $w_\ell = \beta_\ell\,\sigma_\ell^2$, where $\sigma_\ell^2$ is the variance of the change in the number of repeats [64].
Between very closely related populations, Takezaki and Nei [66] showed by simulations that $(\delta\mu)^2$ and $D_{SW}$ provide tree topologies of lower accuracy than non-linear distances ($D_c$ or $D_A$). The very poor results obtained with these statistics, specifically developed for microsatellite evolution applications, are due to their large variance. The coefficient of variation CV of $(\delta\mu)^2$, taking both biases and variance into account, is almost constant (the standard deviation of these distances increases linearly [36,55,74]) and 5 times higher than those of non-linear distances. The CV of $D_{SW}$ increases dramatically when t decreases, with the consequence that these distances are the least appropriate for the estimation of phylogeny between breeds.
When the level of divergence increases, the efficiency of non-linear distances decreases (as predicted by theory) but they remain, however, the best methods to use with highly polymorphic markers [66].
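The contrast between a linear and a saturating expectation can be illustrated numerically. The sketch below assumes the standard infinite allele model expectations $E(D_S) = 2\beta t$ and $E(D_m) = j^{(0)}(1 - e^{-2\beta t})$; the mutation rate and the founder gene identity used here are illustrative values, not results from the simulations discussed above.

```python
import math

def expected_ds_iam(beta, t):
    """Assumed IAM expectation of the standard distance: linear in t."""
    return 2.0 * beta * t

def expected_dm_iam(j0, beta, t):
    """Assumed IAM expectation of the minimum distance: bounded by j0."""
    return j0 * (1.0 - math.exp(-2.0 * beta * t))

beta = 0.0003  # illustrative mutation rate per generation
j0 = 0.3       # hypothetical founder gene identity

for t in (40, 400, 4000, 40000):
    linear = expected_ds_iam(beta, t)
    # Scaled by j0 so that both quantities start with the same slope 2*beta.
    saturating = expected_dm_iam(j0, beta, t) / j0
    print(t, round(linear, 4), round(saturating, 4))
```

For small $\beta t$ the two columns are close, which is the quasi-linear regime; for large t the scaled minimum distance approaches 1 while the linear expectation keeps growing.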

Range constraints for microsatellites
Due to their high mutability, microsatellites are less convenient for the study of largely diverged groups. Takezaki and Nei [66] demonstrate that microsatellites perform better for t = 400 than for t = 4 000. In [3], the tree between four species of primates (human, gorilla, chimpanzee and orang-utan) does not show any structure. The number of possible repeat scores converges to a maximum, denoted by R [3,20], with the consequence that $(\delta\mu)^2$ tends to a maximal value. As a consequence, mutation may be viewed as a "homogenising factor" [44]. Feldman [16] and Pollock [55] propose linear corrections of $(\delta\mu)^2$ and, more recently, Zhivotovsky [74] defines another linear statistic. These distances, introduced in order to improve the estimation of large divergence times, will not be described in more detail. Between closely related populations, they keep the same large variance, suggesting that they are as inappropriate as $D_{SW}$ and $(\delta\mu)^2$.

GENETIC DISTANCES UNDER GENETIC DRIFT
Focusing on the very early stages of evolution of populations allows us to consider that mutations can be neglected. As a consequence, fluctuations of allele frequencies are only due to genetic drift. Within populations, the genetic drift tends to reduce the genetic variability whereas differential loss of genes generates genetic diversity between populations.
In a diversity study of endangered breeds it is desirable to use distances which can be expressed as a function of the loss of the within population diversity. We will introduce the Wright-Malécot inbreeding coefficient in the calculation of the drift expectation and variance of distances according to:
$$E(p_{X,i}) = p_{0,i}, \qquad \mathrm{Var}(p_{X,i}) = F_X\,p_{0,i}\,(1 - p_{0,i})$$
For the sake of simplicity, $\Delta F$, the variation during t generations of the inbreeding coefficient from the founder population, which is equal to $1 - (1 - 1/2N)^t$, will be noted F with a subscript giving the name of the population ($F_X$ and $F_Y$ for populations X and Y respectively) and called the inbreeding coefficient.
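The increase of the inbreeding coefficient, $1 - (1 - 1/2N)^t$, is easy to evaluate. The helper names below are hypothetical; the inverse function simply solves the same expression for t.

```python
import math

def inbreeding_increase(n_effective, generations):
    """Increase of the inbreeding coefficient after t generations of drift
    in a population of constant diploid effective size N."""
    return 1.0 - (1.0 - 1.0 / (2.0 * n_effective)) ** generations

def generations_to_reach(n_effective, target_f):
    """Number of generations needed to reach a given inbreeding increase."""
    return math.log(1.0 - target_f) / math.log(1.0 - 1.0 / (2.0 * n_effective))

# Illustrative values: two populations of different effective sizes.
f_x = inbreeding_increase(100, 5)
f_y = inbreeding_increase(400, 5)
f_bar = 0.5 * (f_x + f_y)  # average inbreeding coefficient
print(f_x, f_y, f_bar)
```

For small t/2N the exact value is close to the approximation t/2N (here 5/200 = 0.025 for N = 100).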
The drift expectation of the minimum distance of Nei,
$$E(D_m) = \bar{F}\,(1 - h_0)$$
depends on $\bar{F} = (F_X + F_Y)/2$, the average inbreeding coefficient (between populations), and on $h_0$, the homozygosity of the founder population. For a small divergence, the drift expectation of $D_S$, calculated with a Taylor expansion in which $F_X^2$, $F_Y^2$ and $F_X F_Y$ can be neglected, also depends on the founder allele frequencies. In parallel, taking the limit of the general solution of the recurrence of $(\delta\mu)^2$ when the mutation rate tends to 0 gives
$$E\left[(\delta\mu)^2\right] = 2\bar{F}\,V_0$$
with $V_0$ the variance of allelic size in the founder population.

Estimation of the average inbreeding coefficient $\bar{F}$
For phylogeny purposes, the authors wish to use distances depending on divergence time only. In the present section, we focus on the distances allowing us to estimate the level of genetic diversity by way of the average inbreeding coefficient $\bar{F}$. In Section 3.3, we will test their accuracy by way of computer simulations.
Distances like $D_m$, $D_S$ or $(\delta\mu)^2$ depend on the founder population parameters, and therefore cannot be directly linked to $\bar{F}$. A strategy to obtain an estimate of the average inbreeding coefficient considering S populations was developed by Wright [72] and Nei [47,51]. The mean and variance of the frequency of allele i between subpopulations are denoted by $\bar{p}_i = \frac{1}{S}\sum_s p_{s,i}$ and $\mathrm{Var}_s(p_{s,i})$ respectively. $F_{ST}$, initially defined for dimorphic loci as the sum of the between population variances of alleles 1 and 2 weighted by $H_T = 2\bar{p}_1\bar{p}_2$, an estimation of the founder heterozygosity $H_0$ [72], was extended to polymorphic loci by Nei [47] as the weighted variance $G_{ST}$ given by:
$$G_{ST} = \frac{\sum_i \mathrm{Var}_s(p_{s,i})}{\sum_i \bar{p}_i\,(1 - \bar{p}_i)}$$
The drift expectations of the numerator and denominator, expressed with respect to the inbreeding coefficient of every sub-population, $F_s$, through $\bar{F} = \frac{1}{S}\sum_s F_s$, are
$$E\Big[\sum_i \mathrm{Var}_s(p_{s,i})\Big] = \Big(1 - \frac{1}{S}\Big)\,\bar{F}\,\sum_i p_{0,i}(1 - p_{0,i}), \qquad E\Big[\sum_i \bar{p}_i(1 - \bar{p}_i)\Big] = \Big(1 - \frac{\bar{F}}{S}\Big)\sum_i p_{0,i}(1 - p_{0,i})$$
with $p_{0,i}$ the allele frequency of the founder population common to the S subpopulations. Assuming, as in Nei and Chakravarty [50], that the ratio of expectations is within the same order as the expectation of the ratio, gives
$$E(G_{ST}) \approx \frac{(1 - 1/S)\,\bar{F}}{1 - \bar{F}/S}$$

Euclidean distances
Considering two populations and taking $2G_{ST}$ gives
$$E(2G_{ST}) \approx \frac{\bar{F}}{1 - \bar{F}/2} \tag{27}$$
Unfortunately, because of the biased estimation of $H_0$ provided by $\sum_i \bar{p}_i(1 - \bar{p}_i)$, the estimation of $\bar{F}$ is positively biased, especially when divergence increases.
This strategy was extended to other distances by Reynolds [57], Balakrishnan and Sanghvi [1] and Barker [2]. Given that $E(1 - j_{XY}) = 1 - h_0$, the estimation of the founder heterozygosity used by $D_R$ is unbiased whatever the level of inbreeding. Dividing each squared allele frequency difference $(p_{X,i} - p_{Y,i})^2$ by $\bar{p}_i(1 - \bar{p}_i)$ and k in Barker's method, and by $\bar{p}_i$ and $(k - 1)$ in Sanghvi's method [19], makes the computation of their expectations rather long and tedious for polymorphic loci. However for dimorphic loci, these distances together with $2G_{ST}$ can be rewritten as
$$\frac{(p_{X,1} - p_{Y,1})^2}{2\,\bar{p}_1(1 - \bar{p}_1)} \tag{29}$$
and have the same expectation as in (27). For polymorphic loci with uniformly distributed founder frequencies $p_{0,i} \approx 1/k$, an approximate calculation (the expectation of a ratio is approximated by the ratio of expectations), giving equations (30) and (31), shows that these distances might be used as estimators of $\bar{F}$.
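The reasoning behind the use of a Reynolds-type ratio as an estimator of the average inbreeding coefficient can be written out in a few lines. The derivation below is a sketch under stated assumptions: the two populations drift independently from the same founder, the drift moments are the standard ones, the distance has the form $D_m/(1 - j_{XY})$, and the expectation of the ratio is approximated by the ratio of expectations. $h_0$ denotes the founder homozygosity.

```latex
% Assumed drift moments for each population (independent lines X and Y):
%   E(p_{X,i}) = p_{0,i},  Var(p_{X,i}) = F_X p_{0,i} (1 - p_{0,i})
\begin{align*}
E(j_X) &= \sum_i E\big(p_{X,i}^2\big)
        = \sum_i \big[p_{0,i}^2 + F_X\,p_{0,i}(1 - p_{0,i})\big]
        = h_0 + F_X\,(1 - h_0), \\
E(j_{XY}) &= \sum_i E(p_{X,i})\,E(p_{Y,i}) = h_0, \\
E(D_m) &= \tfrac{1}{2}\big[E(j_X) + E(j_Y)\big] - E(j_{XY})
        = \bar{F}\,(1 - h_0), \\
E(1 - j_{XY}) &= 1 - h_0, \\
\frac{E(D_m)}{E(1 - j_{XY})} &= \bar{F}.
\end{align*}
```

The founder term $1 - h_0$ cancels in the ratio, which is why such a distance can be linked to $\bar{F}$ without knowing the founder frequencies.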

Angular distances
Neglecting $F_X^2$, $F_Y^2$ and $F_X F_Y$ and assuming uniformly distributed founder frequencies $p_{0,i} \approx 1/k$, the drift expectation of $f_\theta$ calculated with the Taylor expansion is given by equation (33), which can then be rearranged. The distance $f_\theta$, considered as nearly unbiased for small $\bar{F}$, will be biased when the number of alleles and the population divergence increase (for example, when $\bar{F}$ is large, a term depending on $F_X F_Y$, which is equal to $-\frac{1}{16}F_X F_Y (k - 1)$, can no longer be neglected).
In the present work we focused on $f_\theta$ rather than $D_A$, which is no longer directly linked to the inbreeding coefficient (its expectation can be directly deduced from (33) by ignoring the factor $4/(k - 1)$). As a consequence, the chord distances, equal to the square root of $D_A$, were not kept for further analysis.

Variance of unbiased estimates of $D_R$
The variance of $G_{ST}$ was given in Nei and Chakravarty [50]. Foulley and Hill [19] computed the variance of $\chi^2$, assuming a Gaussian distribution of true allele frequencies and equal sample sizes, $m_{X,\bullet} = m_{Y,\bullet} = m$.
In this paper, approximate standard deviations of $\hat{D}_m$ and $D_R$ corrected for sample size were computed under drift divergence assuming $F_X = F_Y$ and $m_{X,\bullet} = m_{Y,\bullet}$ (Appendix B). In order to provide understandable formulas, the approximate standard deviations may easily be rewritten assuming L independent loci, each one exhibiting $k_0$ uniformly distributed founder frequencies ($p_{0,\ell,i} = 1/k_{0,\ell}$ and $k_{0,1} = \cdots = k_{0,\ell} = \cdots = k_{0,L} = k_0$), which gives equation (36). In the following section the validity of the approximate formulae (36) will be checked by way of computer simulations.

Comparison of several estimators of $\bar{F}$
The accuracy of distances estimating $\bar{F}$ was compared by computer simulations performed under pure genetic drift divergence of two isolated populations X and Y.

Simulation procedure
The change in allele frequencies between two generations was simulated as a multinomial sampling scheme according to the Wright-Fisher model of population evolution. Twenty genetically independent loci were considered, a number frequently found in diversity studies [33,37,40].
The frequencies of the founder population of X and Y were generated as follows. An initial simulated population of size N = 500, with allele frequencies $p_{00,i}$ (for $i = 1, \ldots, k$), was submitted 1 000 times to a genetic drift process during five generations. This process generates 1 000 quasi-independent populations used as starting points of simulation runs. Each one of these 1 000 populations, described by its founder frequencies $p_0$, was submitted to a pure genetic drift divergence generating the populations X and Y, which have constant diploid effective sizes equal to N = 100 and N = 400 respectively during 22 non-overlapping generations.
In order to provide estimations of increasing values of $\bar{F}$ (ranging from 0.025 to 0.3), gene samplings ($m_{X,\bullet} = m_{Y,\bullet} = 50$ genes) were computed every five generations from the divergence.
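A drift simulation of this kind can be sketched with the Python standard library. This is a simplified illustration, not the procedure used for the results reported here: it assumes uniform founder frequencies, equal effective sizes in both populations, true frequencies rather than gene samples, and a pooled Reynolds-type ratio as the estimator. All names, the seed and the parameter values are hypothetical.

```python
import random

def drift_one_generation(freqs, n_diploid, rng):
    """Multinomial sampling of 2N genes from the parental allele frequencies."""
    n_genes = 2 * n_diploid
    draws = rng.choices(range(len(freqs)), weights=freqs, k=n_genes)
    counts = [0] * len(freqs)
    for allele in draws:
        counts[allele] += 1
    return [c / n_genes for c in counts]

def drift(freqs, n_diploid, generations, rng):
    """Apply pure genetic drift for a number of non-overlapping generations."""
    for _ in range(generations):
        freqs = drift_one_generation(freqs, n_diploid, rng)
    return freqs

def pooled_ratio(loci_x, loci_y):
    """Sum over loci of D_m divided by the sum over loci of (1 - j_XY)."""
    num = sum(0.5 * sum((a - b) ** 2 for a, b in zip(x, y))
              for x, y in zip(loci_x, loci_y))
    den = sum(1.0 - sum(a * b for a, b in zip(x, y))
              for x, y in zip(loci_x, loci_y))
    return num / den

rng = random.Random(1)          # fixed seed so that the run is reproducible
k, n_loci, n, t = 8, 20, 100, 20
founder = [[1.0 / k] * k for _ in range(n_loci)]

# X and Y drift independently from the same founder frequencies.
pop_x = [drift(p, n, t, rng) for p in founder]
pop_y = [drift(p, n, t, rng) for p in founder]

expected_f = 1.0 - (1.0 - 1.0 / (2.0 * n)) ** t
estimate = pooled_ratio(pop_x, pop_y)
print(expected_f, estimate)
```

A single replicate only gives a noisy estimate; repeating the run over many replicates is what allows the bias and standard error of an estimator to be assessed.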

Results
The performances of the F-estimates were assessed using the following statistics averaged over 1 000 replications: the relative bias $B_r$ (expressed in percent of the true value of $\bar{F}$), the standard error SE and the square root of the mean square error, $\sqrt{MSE} = \sqrt{bias^2 + SE^2}$. They are presented in Figures 1, 2 and 3 respectively.
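These three criteria can be computed from a set of replicate estimates as follows. The sketch assumes that the standard error is the standard deviation of the estimates across replicates; the function name and the example values are hypothetical.

```python
import math

def performance(estimates, true_value):
    """Relative bias (in percent), standard error and root mean square error
    of a set of replicate estimates of a known true value."""
    n = len(estimates)
    mean = sum(estimates) / n
    bias = mean - true_value
    # Standard deviation across replicates, taken here as the standard error.
    se = math.sqrt(sum((e - mean) ** 2 for e in estimates) / (n - 1))
    return {
        "relative_bias_pct": 100.0 * bias / true_value,
        "se": se,
        "rmse": math.sqrt(bias ** 2 + se ** 2),
    }

# Hypothetical replicate estimates of an inbreeding coefficient equal to 0.10.
stats = performance([0.09, 0.11, 0.10, 0.12], 0.10)
print(stats)
```

When the bias is small relative to the standard error, the root mean square error stays close to the standard error, which is the situation discussed for the least biased distances.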

Uniform founder frequencies
Two sets of 1 000 simulations, in which allele frequencies of the initial population were set to $p_{00,i} = 1/k$, were performed with k = 2 and k = 8 alleles. Estimations of $\hat{G}_{ST}$, $\hat{D}_R$, $\hat{D}_B$ and $\chi^2$, corrected for sample sizes, were performed using the arithmetic mean across loci. We also introduce the distance of Latter $\hat{D}_L$ [30], equation (18), and $\hat{f}_\theta$.
Relative bias (Fig. 1): As expected, with two (Fig. 1a) or eight (Fig. 1b) alleles per locus, $\hat{G}_{ST}$ exhibits a positive bias, which increases with the level of divergence (this bias is well predicted by equation (27)). By contrast, $\chi^2$, expected to be unbiased (31), and $\hat{D}_B$, expected to be of the order of magnitude of $G_{ST}$ (30), are negatively biased, as is $\hat{f}_\theta$. In parallel, $\hat{D}_L$ and $\hat{D}_R$ are the least biased distances (constant bias whatever the divergence level) for diallelic or more polymorphic loci. It is noteworthy that estimations given by $\hat{D}_L$ (weighted by estimates of founder heterozygosity computed with all loci) provide lower bias than estimations given by $\hat{D}_R$ (weighted for each locus by an estimate of founder heterozygosity).
The deviation from the expected value of the standard errors of $\hat{D}_B$ and $\chi^2$ (for small and large $\bar{F}$) is certainly due to their large negative biases, which decrease the variance of estimation.
Mean square error (Fig. 3): When the bias is rather small with respect to the standard error, $\sqrt{MSE}$ is expected to be close to the standard error. With two alleles per locus, the methods with the smallest standard error, $\hat{D}_R$ and $\hat{D}_L$, give the smallest $\sqrt{MSE}$ whatever the value of the inbreeding coefficient. With eight alleles per locus and when the level of divergence increases, methods with the smallest biases ($\hat{D}_R$ and $\hat{D}_L$) give the smallest $\sqrt{MSE}$ although they do not exhibit the smallest standard errors. On the basis of an accuracy criterion combining the bias and the standard error of estimations, $\hat{D}_R$ and $\hat{D}_L$ are the most accurate distances whatever the polymorphism of the marker used.
[Figure caption fragment: "... Figure 1. The distance $D^*_R$ (equation (37)) is plotted with dotted lines."]

Microsatellite founder frequencies
One set of 1 000 simulations was performed, in which allele frequencies $p_{00,i}$ in the initial populations were set to microsatellite marker frequencies published in [33]. The number of alleles varied between loci (the mean number of alleles is close to 6). In this case the distances $D_L$ and $D_R$ were still the most accurate methods considering the $\sqrt{MSE}$ criterion (Fig. 4, [34]). On the basis of $\sqrt{MSE}$ we also compared the distance $D_L$ and the distance $D_R$ computed using the arithmetic mean over loci with another estimate of the distance $D_R$, the weighted estimate $D^*_R$ [34] given by equation (37). This formula takes the heterogeneity of the marker polymorphism into account through $n_{XY,\ell}$, the number of alleles present in both the sample of X and the sample of Y at locus $\ell$.
In this case, an expression for the standard error of the weighted Reynolds distance is also available. Using the weighted estimate did not yield a significant gain of accuracy. The $\sqrt{MSE}$ of $D^*_R$ was nearly identical to the $\sqrt{MSE}$ of $D_L$ (Fig. 4).

DISCUSSION AND CONCLUSION
Under the assumption of equilibrium between drift and mutation, the power of different distance estimation methods for constructing phylogenetic trees is well discussed in Takezaki and Nei [66]. Their work points out that the quest for linearity at the cost of variance is not an efficient strategy. Increasing functions of time (not necessarily linear but with a slope large enough to discriminate closely related populations) with small variances provide correct phylogenies with higher levels of confidence than linear distances do. It is clear that with such distances the length of branches is not representative of divergence time. However, this question seems of minor importance compared with that of a correct branching pattern. Perez-Lezaun [54] compared human populations using 20 microsatellite loci on the basis of $D_R$, $R_{ST}$, $D_{SW}$ and $(\delta\mu)^2$. As expected, $D_R$ gives trees with the highest bootstrap values and the best topology with regard to our knowledge of human history.
Goldstein and Pollock [22] argued that the misunderstanding of mutation processes also explains the poor efficiency of these distances. D SW and (δµ) 2 were defined assuming equal probabilities of insertion and deletion of repeats whereas observed microsatellite distributions clearly show evidence in favour of asymmetric mutation processes [27,73]. Taking the mutation process of microsatellites into account should be more efficient when using methods with a small variance such as likelihood based approaches, rather than for distances based on a simple difference between allele size means.
In the second section of the present work we assumed that for very closely related breeds the number of mutations cannot explain the observed genetic variation even when highly mutable DNA sequences are used. For populations of small size, N = 50, and a mutation rate of β = 10 −3 , mutations can be neglected during 200 generations: the difference between the values of inbreeding coefficients computed assuming or neglecting mutation is small, being less than 7 percent of the true value [34].
Genetic drift makes genetic distances computed with allele frequencies strongly dependent on the number of generations since divergence, t, and on the effective sizes of breeds, $N_X$ and $N_Y$ [43]. The values of distances increase with the parameters $1 - (1 - 1/2N_X)^t \approx t/2N_X$ and $1 - (1 - 1/2N_Y)^t \approx t/2N_Y$, which represent the increase of the inbreeding coefficients during t generations. Since $t/2N_X$ can be viewed as the evolution rate in population X, no phylogeny can be inferred from the tree in cases of very closely related breeds exhibiting different effective sizes. Indeed the location on the tree of the most recent common ancestor cannot be exactly determined when evolution rates vary between lineages (e.g. when a bottleneck occurs within a breed). In order to infer the true history of populations, it is necessary to root the tree using an outgroup.
This work points out that, under the drift assumption, the major part of the genetic distances (the Nei distances D m and D S for example) also depends on unknown parameters, the founder frequencies. For example the expected value of the minimum distance of Nei depends on the heterozygosity H 0 of the founder population. With such a distance we cannot separate the effect of the genetic drift occurring in each population and the ancient history of the founder population. So this fact can also disturb the phylogeny reconstruction, mainly when migration or admixture does occur between founder populations.
As in [11], we favoured distances which can be expressed with the increase during t generations of the inbreeding coefficient alone (or equivalently the increase of the kinship coefficient). This parameter is of importance to analyse the genetic diversity of breeds. It allows us to measure the loss of the within population diversity due to the drift process [34]. Eding [11] argues that, in terms of kinship, a generic formula of distance can be written as $d(X, Y) = f_X + f_Y - 2f_{XY} = \Delta f_X + \Delta f_Y$, with $f_X$, $f_Y$ the within breed kinship coefficients, $f_{XY}$ the kinship coefficient between breeds and $\Delta f_X = F_X$, $\Delta f_Y = F_Y$ the increase since divergence of $f_X$ and $f_Y$ respectively. $d(X, Y)/2$ is therefore equal to the average inbreeding coefficient $\bar{F}$. This shows that using the Reynolds distance is equivalent to using a distance giving a measure of the within breed diversity ($f_X$ and $f_Y$) corrected by the between breed diversity ($f_{XY}$).
As a by-product, this suggests an important fact when considering very closely related breeds. Since distances computed with allele frequencies of neutral markers are expressed as a function of the loss of genetic diversity, criteria such as the Weitzman one [67,71], which advises conserving most of the diversity of the whole set by conserving the most distant breeds, are not appropriate in this case [34]. Indeed, if we consider a set involving large populations and a totally inbred breed (F = 1) which has no original allele, the Weitzman approach will suggest conserving the inbred breed.
Although expected values of distances are quasi independent of the sampling process, a part of their standard deviation depends on sample size. From (36) σ/F is proportional to 1/mF, showing that when divergence is low, the accuracy of distances when building trees is sensitive to sample size. It is impossible to get accurate estimations when divergence tends to 0.
By contrast, when the divergence increases, the sample size does not make much difference to the accuracy of distance estimations. Therefore, for intermediate inbreeding values, the accuracy of distance estimations mainly depends on the number and on the degree of polymorphism of the markers used. The variance of distances is inversely proportional to the number of alleles per locus within the founder population. This strongly advocates in favour of the present use of markers such as microsatellites rather than gene polymorphism, which is expected to be less variable within populations.
Nevertheless, distances such as χ 2 or D B are more biased with eight founder alleles than with two founder alleles. For such low polymorphism values, the bias of D B , χ 2 and 2G ST behaves as predicted by equation (27). The dependency of their biases on the value of inbreeding and on the number of founder alleles suggests that these distances are sensitive to rare alleles present within the founder and derived populations (the most frequently eliminated when the level of drift increases and forgotten when sample size is small).
The estimations computed with five loci and eight founder alleles show biases close to those observed with 20 loci (data not shown). For small $\bar{F}$ (between 0.03 and 0.1), the $\sqrt{MSE}$ values are within the order of magnitude of the standard error, making $D_B$ and $\chi^2$ slightly more accurate than the less biased distances $D_L$ and $D_R$, whereas all distances show the same performances when the number of loci is equal to 20. For $\bar{F}$ higher than 0.1, and for a small number of loci as well as for a number of loci close to that observed in the literature [33,40] (more than 20), the conclusions are different. As shown by the growing difference between $\sqrt{MSE}$ and the standard error as $\bar{F}$ increases, the reduction of the accuracy due to bias largely counterbalances the gain in variance due to the number of loci and high polymorphisms when we consider distances such as $D_B$ or $\chi^2$. This suggests that unbiased distances, such as $D_L$ in all cases presented and $D_R$ with high polymorphisms, should be preferred, mainly when the number of markers used is larger than 20.
For $\bar{F}$ higher than 0.3, $D_L$ and $D_R$ should behave considerably better than the other distances, mainly when the polymorphism of markers is high (microsatellites and eight alleles per locus, data not shown).
The weighted estimate of the Reynolds distance (37), which takes the difference between the numbers of alleles observed into account, does not give a significant gain in accuracy. This formula is deduced from the expected standard deviation of the Reynolds distance (36), which depends on $k_{0,\ell}$, the number of alleles within the founder population. When this number is approximately known (for example when a sample of the founder population is available), using the weighted estimate of the Reynolds distance computed between the founder and the derived population X yields an important gain in accuracy [34]. Since founder alleles can be lost because of the genetic drift process, $n_{XY,\ell}$ becomes a poor estimator of $k_{0,\ell}$ as the inbreeding coefficient increases.
To conclude this work, it seems that, among distances estimating $\bar{F}$ when drift is assumed, the Latter and Reynolds distances ($D_L$ and $D_R$) should be preferred whatever the polymorphism of the markers used. It is necessary to keep in mind that, because of the drift process, the obtained trees do not represent true phylogenetic relationships when the effective sizes are different between breeds. Since the distances depend on the increase of the inbreeding coefficient of each breed, $F_X$ and $F_Y$ [11,34], these trees can be viewed as a representation of the loss of the within breed genetic diversity due to the genetic drift process.
However, $F_X$ and $F_Y$ can be separately estimated using a statistic directly derived from the Reynolds distance [69] or using a more accurate method based on a Monte Carlo Markov Chain algorithm [34]. Since all t/2N can be measured in all pairs of breeds by these approaches, new methods that locate the most recent common ancestor on trees, and therefore retrieve the true evolutionary relationships when no outgroup is available, could be proposed.
As in Nei and Chakravarty [50], we do not take into account the second and third terms of the Taylor expansion (in order to compute the second term we need to know the moment of the order 5 of frequency distributions under genetic drift).
With the same approximations as in the previous section, the approximate standard deviation of $\hat{D}_R$ can be simplified to the expression given in equation (36). The validity of this approximate formula has been checked by way of computer simulations in Section 4.3.