On the precision of estimation of genetic distance

Abstract - This article gives a formal proof of a formula for the precision of estimated genetic distances proposed by Barker et al., which can be used in designing experimental sampling programmes. The derivation is given in the general multi-allelic case using the Sanghvi distance. Two sources of sampling are considered: i) among individuals (or gametes) within locus and ii) among loci within populations. Distributional assumptions about gene frequencies are discussed via simulation, especially the normal distribution used by Barker et al. versus the Dirichlet. © Inra/Elsevier, Paris
genetic distance / estimation / precision / Dirichlet


INTRODUCTION
In a report to the FAO, Barker et al. [2] proposed a formula for the standard error of an estimate of the genetic distance (d), intended for use in deciding on sample sizes when designing field programmes. They start from the following expression of the estimator:
$$D = \frac{(\hat p_1 - \hat p_2)^2}{\bar p\,(1 - \bar p)} \tag{1}$$
where $\hat p_1$, $\hat p_2$ are the observed frequencies of a given allele at one locus in populations 1 and 2, respectively ($\bar p$ being an estimate of the average frequency), in which $2n = n_1 + n_2$ individuals are sampled assuming $n_1 = n_2$. Using equation (1) they infer that the standard deviation of D can be expressed as
$$\sigma_D = \sqrt{\frac{2}{kL}}\left(d + \frac{1}{n}\right) \tag{2}$$
where L is the number of loci and k is the number of algebraically independent distance estimates per locus, i.e. assuming k + 1 alleles.
As no proof of this formula was given in the paper, we thought it might be useful to provide a formal detailed derivation which also helps to clarify the assumptions made throughout and the sources of uncertainty taken into account.
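To make the intended use in sample-size planning concrete, here is a minimal sketch. It assumes the standard deviation takes the form SD(D) = sqrt(2/(kL)) (d + 1/n), which is the expression that formula (13) of this article generalizes to L loci. The function names and the design values in the usage note are illustrative and do not come from the FAO report.

```python
import math


def sd_distance(d, n, k, n_loci):
    """SD of the estimated distance, assuming SD(D) = sqrt(2/(k*L)) * (d + 1/n).

    d: true distance, n: individuals sampled per population,
    k: alleles per locus minus one, n_loci: number of loci (L).
    """
    if n <= 0 or k <= 0 or n_loci <= 0:
        raise ValueError("n, k and n_loci must be positive")
    return math.sqrt(2.0 / (k * n_loci)) * (d + 1.0 / n)


def loci_needed(d, n, k, target_cv):
    """Smallest number of loci such that SD(D)/d does not exceed target_cv."""
    ratio = (d + 1.0 / n) / d
    return math.ceil(2.0 * ratio ** 2 / (k * target_cv ** 2))
```

For example, under these assumptions `loci_needed(0.1, 25, 3, 0.5)` asks how many four-allele loci are required, with 25 individuals per population, for the standard deviation to be at most half of a true distance of 0.1.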

THEORY
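We will restrict our attention to the multi-allelic case. Let $y_{1j} = 2n\hat p_{1j}$ and $y_{2j} = 2n\hat p_{2j}$ be the numbers of $A_j$ alleles ($j = 1, \ldots, J$) observed in the n individuals sampled in populations 1 and 2, respectively, with $p_{1j}$, $p_{2j}$ designating the corresponding true allele frequencies. Under $H_0$: ($p_{1j} = p_{2j} = p_j$; $\forall j$) the statistic
$$Z^2 = \sum_{i=1}^{2}\sum_{j=1}^{J}\frac{(y_{ij} - 2n\bar p_j)^2}{2n\bar p_j} \tag{3}$$
where $\bar p_j = (\hat p_{1j} + \hat p_{2j})/2$, has an asymptotic chi-square distribution with $J - 1$ degrees of freedom [7]. Factorizing n, and the expectation $(J - 1)$ of the chi-square, $Z^2$ can be written alternatively as:
$$Z^2 = n(J-1)D, \qquad D = \frac{1}{J-1}\sum_{j=1}^{J}\frac{(\hat p_{1j} - \hat p_{2j})^2}{\bar p_j} \tag{4}$$
where D is the so-called Sanghvi's $G^2$ distance, closely related to the $\theta^2$ of Bhattacharyya [9]. Provided that the variance-covariance matrices of $Y_1 = \{y_{1j}\}$ and of $Y_2 = \{y_{2j}\}$ are close to each other, $Z^2$ in equation (4) can be interpreted as a non-central chi-square with $\nu = J - 1$ degrees of freedom and a non-centrality parameter equal to
$$\lambda = n\sum_{j=1}^{J}\frac{(p_{1j} - p_{2j})^2}{p_j} = n(J-1)\,d$$
with $p_j = (p_{1j} + p_{2j})/2$, and d corresponding to the true distance between the two populations. Since a non-central chi-square has mean $\nu + \lambda$ and variance $2(\nu + 2\lambda)$, the first two moments of D conditional on the true frequencies are
$$E(D \mid p_1, p_2) = d + \frac{1}{n} \tag{5}$$
$$\mathrm{var}(D \mid p_1, p_2) = \frac{2(1 + 2nd)}{n^2(J-1)} \tag{6}$$
The true distance can itself be written as a quadratic form $d = (p_1 - p_2)'Q(p_1 - p_2)$, with $p_1 = \{p_{1j}\}$, $p_2 = \{p_{2j}\}$ and the $(J \times J)$ matrix Q of the quadratic form being $(J-1)^{-1}\,\mathrm{diag}(p_j^{-1})$. The true frequencies in the two populations are taken to be independent with
$$E(p_i) = \pi, \qquad \mathrm{var}(p_i) = F\,C(\pi), \qquad C(\pi) = \mathrm{diag}(\pi) - \pi\pi', \qquad i = 1, 2 \tag{7}$$
where F is the fixation index. Assuming $p \approx \pi$, taking the expectation of d with respect to the distributions of $p_1$ and $p_2$ requires the evaluation of:
$$E(d) = E(\delta)'\,Q\,E(\delta) + \mathrm{tr}[Q\,\mathrm{var}(\delta)], \qquad \delta = p_1 - p_2 \tag{8}$$
As populations 1 and 2 are derived from the same founding population with allele frequency $\pi$, $E(\delta) = 0$. The second term is the trace of $Q[\mathrm{var}(p_1) + \mathrm{var}(p_2)]$. As $C(p)$ is close to $C(\pi)$ if $p \approx \pi$, and since $\mathrm{tr}[\mathrm{diag}(\pi_j^{-1})\,C(\pi)] = \sum_j (1 - \pi_j) = J - 1$, this reduces to
$$E(d) = 2F \tag{9}$$
So far, no assumption about a specific gene frequency distribution was needed since the expectation of a quadratic form depends only on the first two moments. Several assumptions can be made at that stage. For the sake of simplicity, a normal approximation for the distributions of true gene frequencies can be considered, as in Barker et al. [2] and Lewontin and Krakauer [7]. One may also rely on the Dirichlet distribution, which is the natural conjugate of the multinomial.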
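The conditional mean of D can be checked numerically. The sketch below assumes that D is the Sanghvi-type statistic (1/(J-1)) Σ (p̂1j - p̂2j)²/p̄j computed from 2n gametes per population, and that its conditional expectation is close to d + 1/n, as implied by the non-central chi-square interpretation. The allele frequencies, sample size and function names are illustrative choices, not values from the article.

```python
import random


def sanghvi_distance(freq1, freq2):
    """(1/(J-1)) * sum_j (f1j - f2j)^2 / mean_j, using the pairwise mean frequency."""
    total = 0.0
    for a, b in zip(freq1, freq2):
        mean = (a + b) / 2.0
        if mean > 0.0:  # an allele absent from both samples contributes nothing
            total += (a - b) ** 2 / mean
    return total / (len(freq1) - 1)


def sample_frequencies(true_freq, n_individuals, rng):
    """Observed allele frequencies among the 2n gametes of n sampled individuals."""
    n_gametes = 2 * n_individuals
    counts = [0] * len(true_freq)
    for allele in rng.choices(range(len(true_freq)), weights=true_freq, k=n_gametes):
        counts[allele] += 1
    return [c / n_gametes for c in counts]


def mean_estimated_distance(p1, p2, n_individuals, n_rep=5000, seed=1):
    """Monte Carlo average of D over repeated multinomial samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rep):
        f1 = sample_frequencies(p1, n_individuals, rng)
        f2 = sample_frequencies(p2, n_individuals, rng)
        total += sanghvi_distance(f1, f2)
    return total / n_rep
```

With `p1 = [0.5, 0.3, 0.2]`, `p2 = [0.4, 0.3, 0.3]` and 50 individuals per population, the simulated mean can be compared with `sanghvi_distance(p1, p2) + 1/50`; the agreement is only approximate because the chi-square result is asymptotic.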
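The first alternative, the normal approximation, results in
$$\frac{(J-1)\,d}{2F} \sim \chi^2_{J-1} \tag{10}$$
Hence, as in equation (9) and as expected, $E_{p_1,p_2}(d) = 2F$, and
$$\mathrm{var}_{p_1,p_2}(d) = \frac{8F^2}{J-1} \tag{11}$$
Remember that the total variance can be decomposed into $\mathrm{var}(D) = E_{p_1,p_2}[\mathrm{var}(D \mid p_1, p_2)] + \mathrm{var}_{p_1,p_2}[E(D \mid p_1, p_2)]$. The expressions for $E(D \mid p_1, p_2)$ and $\mathrm{var}(D \mid p_1, p_2)$ were given in equations (5) and (6) and correspond to the effects on the first two moments of multinomial sampling of individuals or alleles within the two populations 1 and 2. From equation (5), the second term of the decomposition is simply $\mathrm{var}(d)$. Now
$$E_{p_1,p_2}[\mathrm{var}(D \mid p_1, p_2)] = \frac{2(1 + 4nF)}{n^2(J-1)} \tag{12}$$
Combining these two formulae results in the expression for the unconditional sampling variance of the estimation of the genetic distance:
$$\mathrm{var}(D) = \frac{2}{J-1}\left(2F + \frac{1}{n}\right)^2 \tag{13}$$
the expectation being equal to $E(D) = 2F + 1/n$.

DISCUSSION
Formula (13) is identical to that given by Barker et al. [2] for L = 1 locus and k = J - 1 algebraically independent estimates of the genetic distance.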
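The way the two variance components combine into a perfect square can be verified with exact rational arithmetic. This sketch assumes the within-population term 2(1 + 4nF)/[n²(J-1)] (multinomial sampling averaged over the true frequencies, with E(d) = 2F) and the between-population term 8F²/(J-1) (normal approximation); the function names are illustrative.

```python
from fractions import Fraction


def total_variance(F, n, J):
    """Sum of the two components of var(D) under the stated assumptions."""
    within = Fraction(2) * (1 + 4 * n * F) / (n ** 2 * (J - 1))  # E[var(D | p1, p2)]
    between = Fraction(8) * F ** 2 / (J - 1)                     # var(d), normal case
    return within + between


def closed_form(F, n, J):
    """Compact expression 2 * (2F + 1/n)^2 / (J - 1)."""
    return Fraction(2, J - 1) * (2 * F + Fraction(1, n)) ** 2
```

Passing F as a `Fraction`, for instance `Fraction(1, 10)`, keeps the comparison exact, so `total_variance(F, n, J) == closed_form(F, n, J)` can be tested without rounding tolerance.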
Incidentally, formula (9) for the expectation of d is identical to the one given by Weir [16], Laval [5] and Laval et al. [6], although these last authors considered a different distance measure, namely Reynolds'. This clearly shows the interest in normalizing the squared differences $(p_{1j} - p_{2j})^2$ by the degree of heterozygosity, as in Sanghvi's and Reynolds' distances but not in Rogers'. Takezaki and Nei [15] consider alternative estimators of genetic distance, and show that while the simple estimator D used here is not the best, it is only marginally inferior.
To derive the expectation of d in equation (9) it was assumed that $\bar p \approx \pi$. This implies computing $\bar p$ in D (formula 4) from the whole collection of the I populations involved in the distance study, either as an unweighted mean $\bar p = (\sum_{i=1}^{I}\hat p_i)/I$ or as a weighted mean. In that respect we suggest, for unbalanced designs with $n_i$ individuals sampled in population i, $\bar p = (\sum_{i=1}^{I} a_i\hat p_i)/\sum_{i=1}^{I} a_i$ with weights $a_i$ inversely proportional to $F_i + [(1 - F_i)/n_i]$.
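A weighted mean of this kind is straightforward to compute. The sketch below reads the suggested weights as a_i proportional to 1/[F_i + (1 - F_i)/n_i], with F_i a fixation index for population i; that reading, the function name and the example values are assumptions made for illustration.

```python
def weighted_mean_frequency(freqs, sizes, f_values):
    """Weighted mean of one allele's observed frequencies across populations.

    freqs: observed frequency in each population
    sizes: number of individuals sampled in each population (n_i)
    f_values: assumed fixation index of each population (F_i)
    """
    weights = [1.0 / (f + (1.0 - f) / n) for f, n in zip(f_values, sizes)]
    return sum(w * p for w, p in zip(weights, freqs)) / sum(weights)
```

Under this reading the weights are proportional to sample size when every F_i is 0, and they are all equal (the unweighted mean) when every F_i is 1 or when the design is balanced with a common F.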
Actually, the condition $\bar p \approx \pi$ turns out to be mandatory, as demonstrated by a simulation study based on the Dirichlet distribution. This distribution, and its particular case of the beta for two categories, have been used by population geneticists, mostly in a Bayesian context, to specify prior information about allele frequencies [16]. Under recurrent mutation, migration and drift but without selection, Wright [17] also obtained gene frequencies at a biallelic locus which are beta distributed. Thus, that assumption makes sense as long as selection is absent or weak.
Results based on the Dirichlet distribution in the case of J = 5 alleles show a non-negligible downwards bias increasing with F and disequilibrium among allele frequencies when using the standard formula (figure 1).
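The effect of the denominator can be illustrated with a small simulation of the true distance d (without multinomial sampling). It is a sketch and not a reproduction of figure 1. It assumes Dirichlet parameters α_j = π_j (1 - F)/F, which give var(p_j) = F π_j (1 - π_j); the founder frequencies, the value of F and the function names are illustrative.

```python
import random


def dirichlet(alpha, rng):
    """One Dirichlet draw obtained by normalizing independent gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]


def distance(p1, p2, denom):
    """(1/(J-1)) * sum_j (p1j - p2j)^2 / denom_j, skipping zero denominators."""
    terms = ((a - b) ** 2 / m for a, b, m in zip(p1, p2, denom) if m > 0.0)
    return sum(terms) / (len(p1) - 1)


def simulate_mean_distance(pi, F, n_rep=20000, seed=1):
    """Mean distance with founder frequencies vs. pairwise means as denominators."""
    rng = random.Random(seed)
    alpha = [x * (1.0 - F) / F for x in pi]
    sum_founder = 0.0
    sum_pair = 0.0
    for _ in range(n_rep):
        p1 = dirichlet(alpha, rng)
        p2 = dirichlet(alpha, rng)
        pair_mean = [(a + b) / 2.0 for a, b in zip(p1, p2)]
        sum_founder += distance(p1, p2, pi)
        sum_pair += distance(p1, p2, pair_mean)
    return sum_founder / n_rep, sum_pair / n_rep
```

With J = 5 unbalanced founder frequencies such as `[0.6, 0.1, 0.1, 0.1, 0.1]` and F = 0.5, the first returned mean should stay close to 2F, whereas the second, which uses the mean of the two populations only, falls clearly below it.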
One can guess at the direction of this bias by considering populations taken to fixation: either they are fixed for the same allele or fixed for different alleles.
In the biallelic case, the line is either AA or aa. If it is AA (probability $\pi$) the average distance between this line and another line is $(0 \times \pi) + [(1 - \pi) \times 1/(1/4)]$, i.e. $4(1 - \pi)$. The same reasoning applies given the line is aa, leading to $4\pi$, so that the expectation of the distance is $[\pi \times 4(1 - \pi)] + [(1 - \pi) \times 4\pi] = 8\pi(1 - \pi)$, which falls below the value $2F = 2$ given by equation (9) for $F = 1$ unless $\pi = 1/2$.
Moreover, improving the formula analytically might be a tedious task even for approximations. For instance, using the so-called delta method based on Taylor expansions, one should go beyond the second order expansion to obtain different results and assume specific forms for the third and higher moments of gene frequency distributions. For those interested in further adjustments, one may recommend basing them on the following general formula (derived from equations (11) and (12)):
$$\mathrm{var}(D) = \frac{2[1 + 2nE(d)]}{n^2(J-1)} + [CV_d\,E(d)]^2$$
where $E(d)$ and $CV_d$ are the expectation and coefficient of variation of the true distance, respectively.
Formula (13) also provides a means for combining inter-loci information in the expression of the distance. For K independent loci, a 'natural' estimator of the distance is obtained from $D = (\sum_{k=1}^{K} w_k D_k)/w_+$, where the weight $w_k$ is proportional to the reciprocal of the variance of the distance $D_k$ pertaining to locus k, and with $w_+ = \sum_{k=1}^{K} w_k$. From equation (13), $w_k \propto J_k - 1$, which is equivalent to weighting each locus by its number of alleles minus 1, so that the formula for the pooled distance reduces to
$$D = \frac{\sum_{k=1}^{K}(J_k - 1)\,D_k}{\sum_{k=1}^{K}(J_k - 1)}$$
and its variance to
$$\mathrm{var}(D) = \frac{2\,(2F + 1/n)^2}{\sum_{k=1}^{K}(J_k - 1)}$$
which can be estimated by substituting D for its expectation $2F + 1/n$.
Finally, issues tackled here with respect to sampling of loci and of lines at a given locus are closely related to theories developed for testing selective neutrality [7, 9, 11, 13, 14]. In particular, assumptions made on the distribution of gene frequencies in equation (7) rely on the type (a) structure shown in Robertson ([14], figure 1), i.e. a set of equivalent populations deriving independently from a common base population.
For more complex relationships involving some kind of splitting or fusion, one will have to adjust the mean and variance of the gene frequencies accordingly: see, for example, techniques proposed by Felsenstein [4].
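The pooling of loci described in the discussion can be sketched as follows. The code assumes independent loci, weights equal to the number of alleles minus one, and a per-locus variance of the form 2(2F + 1/n)²/(J_k - 1); the function names and example values are illustrative.

```python
def pooled_distance(locus_distances, allele_counts):
    """Weighted mean of per-locus distances with weights J_k - 1."""
    weights = [j - 1 for j in allele_counts]
    if any(w <= 0 for w in weights):
        raise ValueError("each locus needs at least two alleles")
    weighted_sum = sum(w * d for w, d in zip(weights, locus_distances))
    return weighted_sum / sum(weights)


def pooled_variance(F, n, allele_counts):
    """Variance of the pooled distance, assuming 2 * (2F + 1/n)^2 / sum(J_k - 1)."""
    return 2.0 * (2.0 * F + 1.0 / n) ** 2 / sum(j - 1 for j in allele_counts)
```

For L loci that all have k + 1 alleles, `pooled_variance(F, n, [k + 1] * L)` equals 2(2F + 1/n)²/(kL), which is the square of the standard deviation quoted for Barker et al. with d = 2F.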