Comparison of four statistical methods for detection of a major gene in a progeny test design

In animal breeding, males are commonly progeny tested in order to estimate their breeding value. The data collected for this purpose can also be used to detect a major gene. Four statistical tests for the detection of such a major gene are compared.


INTRODUCTION
In recent years, several genes having major effects on commercial traits have been identified. The dwarf gene in poultry (Merat & Ricard, 1974), the halothane sensitivity gene in pigs (Ollivier, 1980), the Booroola gene in sheep (Piper & Bindon, 1982), and the double muscling gene in cattle (Ménissier, 1982) are notable examples.
These discoveries, as well as improvements in transgenic techniques, have stimulated interest in new techniques for the detection of single genes. Various tests have been described for livestock (Hanset, 1982). Their general principle is that the within-family distribution of the trait depends on the parents' genotypes, and therefore varies from one family to another. These methods involve simple computations but are not powerful. Concurrently, segregation analysis in complex pedigrees was developed in human genetics (Elston & Stewart, 1971), by comparing the likelihoods of the data under different trait transmission models. These methods are much more powerful than the previous ones, but involve much computation. They require numerical simplification to deal with the population structure of farm animals. Additionally, the known properties of the test statistic, a likelihood ratio, are only asymptotic, which raises the question of their validity when applied to samples of limited size.
In livestock improvement it is common to use progeny tests where males are mated to large numbers of females. Concentrating on this simple family structure, the present paper tries to give some elements of a solution to the problems of simplification and validity. Four methods are compared on simulated data.

METHODS
The four methods considered rely upon the same information structure and the same type of test statistics.

Experimental design
The data are simulated according to a hierarchical and balanced family structure: one sample consists of $n$ sire families ($i = 1, \ldots, n$) with $m$ mates per sire ($j = 1, \ldots, m$) and one offspring per dam. Sires and dams are assumed to be unrelated. Only offspring are measured, with one datum $Y_{ij}$ per animal.

Models
The $Y_{ij}$ performances are considered under the two following models. In the general model ($H_1$), a monogenic component is added to the assumed polygenic variation.
When two alleles A and a are segregating at a major locus, three genotypes are possible (AA, Aa, aa), which we shall respectively denote 1, 2, 3. Sires are of genotype $s$ ($s = 1, 2, 3$) with probability $P_s$. Dams transmit to their offspring allele A with probability $q$ and allele a with probability $1 - q$. Conditional on its genotype $t$ ($t = 1, 2, 3$), the $ij$th progeny has the performance $Y_{ij}^{t}$. The following linear model can be formulated:
$$Y_{ij}^{t} = \mu_t + U_i + E_{ij}$$
where $\mu_t$ is the mean value of the performances of genotype $t$ progeny,
$U_i$ is the random effect of sire $i$, assumed to be independent of the genotype $t$ and normally distributed with mean 0 and variance $\sigma_u^2$, and $E_{ij}$ is the residual random effect, assumed to be independent of the genotype $t$ and normally distributed with mean 0 and variance $\sigma_e^2$. $U_i$ and $E_{ij}$ are assumed to be independent.
For production traits of livestock, the proportion of variance explained by polygenic effects has generally been estimated in many populations. Thus, we shall assume that the heritability of the trait, $h^2$, defined from the variance components $\sigma_u^2$ and $\sigma_e^2$, is known a priori.
The null subhypothesis ($H_0$), to be tested against the general model, is fixed by $\mu_1 = \mu_2 = \mu_3 = \mu_0$, so that:
$$Y_{ij} = \mu_0 + U_i + E_{ij}$$
where $\mu_0$ is the general mean of the performances. $U_i$ and $E_{ij}$ have the same definition as under $H_1$.
In matrix notation, for the family of sire $i$, under $H_0$:
$$Y_i = X \mu_0 + Z U_i + E_i$$
where $X$ and $Z$ are two matrices of order $m \times 1$, whose elements all equal 1, and under $H_1$:
$$Y_i^{t_i} = X_i^{t_i} \mu + Z U_i + E_i$$
where $\mu = (\mu_1, \mu_2, \mu_3)'$ and $X_i^{t_i}$ is the $m \times 3$ incidence matrix for the fixed effects of the model, when the realization of the genotypes of the sire $i$ progeny is $t_i$.
The covariance matrix $V_i$ for the performances $Y_i$ of the sire $i$ family is:
$$V_i = Z D Z' + R$$
with $D = \sigma_u^2$ and $R$ the diagonal $m \times m$ matrix $R = \sigma_e^2 I_m$.
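The model above can be illustrated with a small simulation. The following is only a sketch under stated assumptions: the function names and the argument layout are hypothetical, the variance components are passed directly as standard deviations rather than derived from the heritability, and Mendelian segregation is assumed for the sire (a heterozygous sire transmits A with probability 1/2). Setting the three genotype means equal gives a sample under the null hypothesis.

```python
import random

def offspring_genotype_probs(sire_genotype, q):
    """Return [P(AA), P(Aa), P(aa)] for one offspring, given the sire genotype
    (1 = AA, 2 = Aa, 3 = aa) and the probability q that the dam transmits A."""
    # Assumed Mendelian segregation: probability that the sire transmits A.
    s_a = {1: 1.0, 2: 0.5, 3: 0.0}[sire_genotype]
    p_AA = s_a * q
    p_Aa = s_a * (1.0 - q) + (1.0 - s_a) * q
    p_aa = (1.0 - s_a) * (1.0 - q)
    return [p_AA, p_Aa, p_aa]

def simulate_sample(n, m, mu, sire_probs, q, sigma_u, sigma_e, rng):
    """Simulate n sire families with m offspring each (one offspring per dam).

    mu         : means (mu_1, mu_2, mu_3) of genotypes AA, Aa, aa
    sire_probs : probabilities (P_1, P_2, P_3) of the sire genotypes
    q          : probability that a dam transmits allele A
    Returns a list of n lists, each holding m performances Y_ij.
    """
    sample = []
    for _ in range(n):
        s = rng.choices([1, 2, 3], weights=sire_probs)[0]  # sire genotype
        u_i = rng.gauss(0.0, sigma_u)                      # sire effect U_i
        probs = offspring_genotype_probs(s, q)
        family = []
        for _ in range(m):
            t = rng.choices([1, 2, 3], weights=probs)[0]   # offspring genotype
            family.append(mu[t - 1] + u_i + rng.gauss(0.0, sigma_e))
        sample.append(family)
    return sample
```

For instance, `simulate_sample(10, 10, (0.0, 0.0, 0.0), (0.25, 0.5, 0.25), 0.5, 0.2, 1.0, random.Random(1))` draws one null sample of 10 sires with 10 progeny each; the numerical values used here are arbitrary.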

General expression of the likelihood ratio test (LR test)
The test statistic is based on the ratio of the likelihoods under $H_0$ ($M_0$) and under $H_1$ ($M_1$), or an estimate of this ratio. In practice the test statistic considered is $l = -2\log(M_0/M_1)$. With our notation, and given the preceding hypotheses, $M_0$ is:
with ... and $M_1$ is:
The four proposed methods are all based on the two following equalities:
and:
where $\hat{u}_i^{t_i}$ is the mode of the distribution of $U_i$ given $Y_i$ and the genotypes $t_i$. Formula (2) results from the equality of mode and expectation for symmetrical distributions.
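As an illustration of the statistic $l = -2\log(M_0/M_1)$, the sketch below evaluates the log-likelihood under the null hypothesis directly from the model as stated: each family vector is multivariate normal with mean $\mu_0$ and a covariance matrix with $\sigma_u^2 + \sigma_e^2$ on the diagonal and $\sigma_u^2$ elsewhere. This closed form is derived here from that structure and is not claimed to be the paper's own expression for $M_0$; the function names are hypothetical.

```python
import math

def log_lik_h0(sample, mu0, sigma_u2, sigma_e2):
    """log M_0 for a list of sire families, each a list of m performances.

    Uses the compound-symmetry structure V = sigma_e2 * I + sigma_u2 * J:
      det(V)     = sigma_e2**(m - 1) * (sigma_e2 + m * sigma_u2)
      r' V^-1 r  = (sum(r^2) - sigma_u2 * (sum r)^2 / (sigma_e2 + m * sigma_u2)) / sigma_e2
    """
    total = 0.0
    for y in sample:
        m = len(y)
        r = [v - mu0 for v in y]              # residuals from the general mean
        sum_r = sum(r)
        sum_r2 = sum(v * v for v in r)
        lam = sigma_e2 + m * sigma_u2
        log_det = (m - 1) * math.log(sigma_e2) + math.log(lam)
        quad = (sum_r2 - sigma_u2 * sum_r ** 2 / lam) / sigma_e2
        total += -0.5 * (m * math.log(2.0 * math.pi) + log_det + quad)
    return total

def lr_statistic(log_m0, log_m1):
    """l = -2 log(M_0 / M_1), computed from the two log-likelihoods."""
    return -2.0 * (log_m0 - log_m1)
```

Working with log-likelihoods avoids numerical underflow when the product runs over many families.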

Definition and interests of the four proposed methods
The differences between the four methods concern the sire effects.  (Wilks, 1938). However, in the particular context of testing a number of components in a mixture, the regularity conditions are not satisfied since the mixing proportions $p_1$ and $p_2$ have the value zero under $H_0$, which defines the boundary of the parameter space.
Studying mixtures of $m$-normal distributions, Wolfe (1971) suggested that the distribution of the LR test is proportional to a $\chi^2$ distribution with $2d$ degrees of freedom. The proportionality coefficient $c$ should be $c = (n - 1 - m - \frac{1}{2} g_2)/n$, where $n$ represents the sample size and $g_2$ the number of components in the mixture under $H_1$. If these results hold in our case, when the number of sires is very large, $l_{SA}$ should have a $\chi^2$ distribution with 4 degrees of freedom.
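The coefficient quoted from Wolfe (1971) is simple arithmetic, sketched below. The function name is hypothetical, and `m` is read here as the dimension of the normal components (as in "$m$-normal"), which is an interpretation of the text rather than something it states explicitly.

```python
def wolfe_coefficient(n, m, g2):
    """Proportionality coefficient c = (n - 1 - m - g2 / 2) / n.

    n  : sample size
    m  : dimension of the normal components (assumed reading of "m-normal")
    g2 : number of components in the mixture under H_1
    """
    return (n - 1 - m - 0.5 * g2) / n
```

The coefficient tends to 1 as `n` grows, which is consistent with the statement that the unscaled statistic should approach a $\chi^2$ distribution when the number of sires is very large.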
The problem with this method is that it requires heavy computation: a complex function of the $Y_{ij}$ must be integrated $n$ times for each evaluation of $l_{SA}$.

Second and third methods: ME
These methods ("modal estimation" of the sire effect $U_i$) use equation (2). Under $H_0$, the likelihood may be written as follows:
Under $H_1$, the equality (2) leads to
However, the sums over the vectors $t_i$ for each sire make this computation practically impossible as soon as $m$ is larger than a few units ($3^5 = 243$, $3^{10} = 59049$).
Thus, following Elsen et al. (1988), we suggest the approximation
where $\hat{U}_i$ is the mode of the distribution of $U_i$ conditional on $Y_i$, whatever the genotypes $s_i$ and $t_i$ are. The statistic $l_{ME1} = -2\log(M_0^{ME}/M_1^{ME1})$ is no longer an LR test but an approximation lacking the asymptotic properties described above. However, we hope that this statistic, which requires much less computation, will nonetheless retain the power of the first method proposed.
An alternative to this second method is to estimate the likelihoods $M_0^{SA}$ and $M_1^{SA}$ directly by:
where $\hat{U}_i$ is defined as above.
As stated by Höschele (1988), this "approximation will be close to $l_{SA}$ only if the likelihood is very peaked ($m \to \infty$) with most of its probability mass concentrated over a small region about the ML estimates".
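The combinatorial cost mentioned for the exact sum over progeny genotype vectors can be checked directly: with three possible genotypes per offspring, a family of $m$ progeny has $3^m$ genotype vectors. The helper names below are hypothetical and the snippet only illustrates the growth of that count.

```python
from itertools import product

def count_genotype_vectors(m):
    """Number of possible genotype vectors t_i for a family of m progeny."""
    return 3 ** m

def enumerate_genotype_vectors(m):
    """Lazily enumerate every genotype vector (entries in {1, 2, 3})."""
    return product((1, 2, 3), repeat=m)

for m in (5, 10, 20):
    print(m, count_genotype_vectors(m))
```

For 20 progeny per sire the count already exceeds three billion vectors per family, which is why an approximation that avoids the explicit sum is attractive.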

Fourth method: FE
This method ("fixed effect" of the sires) does not use the a priori information contained in the heritability of the trait. The sire effects $u_i$ are assumed to be fixed, and become supplementary parameters which need to be estimated. The likelihood ratio may be written:
with:
and:
This method has the advantage of computational simplicity, while retaining the well-known asymptotic properties of the LR test. However, there may be an important loss of power, due to the loss of information on the polygenic variation.

The comparisons
Three problems were studied:

Distributions of the statistics under $H_0$
We have just mentioned uncertainties concerning the asymptotic distributions ($\chi^2$ with 4 degrees of freedom for $l_{SA}$ and $l_{FE}$ if Wolfe's (1971) approximation is valid, no known property for $l_{ME}$). Furthermore, these distributions are unknown in samples of limited size. In order to estimate these distributions, samples were simulated under $H_0$ (500 samples for SA, 1000 for FE and ME) with different numbers of sires ($n = 5, 10, 20$) and of progeny per sire ($m = 5, 10, 20$). The test statistics $l_{SA}$, $l_{ME1}$, $l_{ME2}$ and $l_{FE}$ were calculated for each sample. The estimated distributions obtained were used to test the convergence to $\chi^2$ distributions. They also helped determine boundaries for critical regions in samples of limited size. We used the Harrell and Davis (1982) method to estimate the 5% and 1% quantiles, and their jackknife variance as defined by Miller (1974). These simulations were based on a heritability of 0.2.

Comparisons of the powers
By using the table of the critical regions thus obtained for each family structure, we have been able to compare the powers of the tests. These powers depend not only on the number and size of the families in the sample but also on the values of the parameters ($\mu$, $\sigma_g$, $p_1$, $p_2$, $q$) which characterize the major gene segregating in the population.
For each of the 9 family structures described above, three $H_1$ hypotheses were considered, each with a simulation of 100 samples. All these populations are assumed to follow the Hardy-Weinberg law. The differences between the three $H_1$ hypotheses lie in the mean effects of the genotypes (expressed in standard deviation units) and the frequency of the allele A.
Case 1: complete dominance, equal allele frequencies
Case 2: additivity, equal allele frequencies
Case 3: complete dominance, recessive allele rare
The power of the tests was measured by the percentage of $H_0$ rejections.
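Two ingredients of this comparison are easy to make concrete: the genotype frequencies implied by the Hardy-Weinberg law for a given frequency of allele A, and the power measured as the percentage of simulated samples in which the null hypothesis is rejected. The sketch below uses hypothetical function names; the critical value is whatever quantile has been obtained from the null simulations for the family structure in question.

```python
def hardy_weinberg(freq_A):
    """Genotype frequencies (AA, Aa, aa) for allele A at frequency freq_A."""
    return (freq_A ** 2,
            2.0 * freq_A * (1.0 - freq_A),
            (1.0 - freq_A) ** 2)

def rejection_percentage(statistics, critical_value):
    """Percentage of simulated test statistics that exceed the critical value."""
    rejected = sum(1 for value in statistics if value > critical_value)
    return 100.0 * rejected / len(statistics)
```

With equal allele frequencies, `hardy_weinberg(0.5)` gives one quarter, one half and one quarter for the three genotypes. With 100 samples per hypothesis, as in the text, the power estimate has a resolution of one percentage point.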

Algorithms and cost of calculations
The methods must also be compared on the basis of how much computation they require. The calculations described above were made using the quadrature and optimization subroutines of the NAG Fortran library. In order to maximize the likelihoods of the sample we used a quasi-Newton algorithm in which the derivatives are estimated by finite differences.
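Estimating derivatives by finite differences can be illustrated with a generic helper. The sketch below uses central differences with a fixed step; which difference scheme and step size the NAG routines actually use is not stated in the text, so both are assumptions here, and the function name is hypothetical.

```python
def finite_difference_gradient(f, x, step=1e-5):
    """Approximate the gradient of f at the point x (a list of floats)
    using central differences with a fixed step (an assumed scheme)."""
    grad = []
    for k in range(len(x)):
        up = list(x)
        down = list(x)
        up[k] += step
        down[k] -= step
        grad.append((f(up) - f(down)) / (2.0 * step))
    return grad
```

Each gradient evaluation costs two function evaluations per parameter, which matters when a single likelihood evaluation already requires numerical integration.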
The same algorithm was used for the four methods, giving results of a similar degree of precision. However, various algorithms can be used to obtain the maximum likelihood estimates of the parameters. In the ME and FE tests, the first derivatives have a simple algebraic form and the maximum likelihood solutions are reached by zeroing the first derivatives (with respect to each of the parameters) of the logarithm of the likelihood. Under $H_1$ the corresponding system of equations can be solved iteratively, but not directly, by using for instance the EM algorithm defined by Dempster et al. (1977): see appendix.
This is the algorithm we used for the ME2 test in order to obtain more extensive information on critical regions: 5, 10, 20 and 40 sires; 5, 10, 20 and 40 progeny per sire; heritability of 0, 0.2 and 0.4.

Comparison of the four methods
Tables I to IV show the main characteristics of the distributions of the 4 test statistics: mean, standard deviation, 5% and 1% empirical quantiles, and percentage of replicates beyond the 5% and 1% quantiles of a $\chi^2_4$. Table V shows their powers. First, we can note that as the number of progeny increases, the means of the distributions of the four test statistics decrease (except $l_{SA}$ between $m = 5$ and $m = 10$ for $n = 5$).
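The percentage of replicates beyond the quantiles of a $\chi^2$ with 4 degrees of freedom can be computed without statistical tables, because the upper tail of a $\chi^2$ with an even number of degrees of freedom has a closed form. The sketch below assumes the standard critical values 9.488 (5%) and 13.277 (1%); the function and constant names are hypothetical.

```python
import math

CHI2_4_UPPER_5 = 9.488    # upper 5% point of chi-square with 4 d.f.
CHI2_4_UPPER_1 = 13.277   # upper 1% point of chi-square with 4 d.f.

def chi2_4_survival(x):
    """P(X > x) for X ~ chi-square with 4 degrees of freedom (closed form)."""
    if x <= 0.0:
        return 1.0
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)

def percent_beyond(statistics, threshold):
    """Percentage of simulated statistics strictly above the threshold."""
    return 100.0 * sum(1 for value in statistics if value > threshold) / len(statistics)
```

If a statistic really followed a $\chi^2_4$ distribution, `percent_beyond(stats, CHI2_4_UPPER_5)` would be close to 5 and `percent_beyond(stats, CHI2_4_UPPER_1)` close to 1, up to simulation error.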
The convergence of the distributions of the $l$ statistics toward a $\chi^2$ with 4 degrees of freedom cannot be confirmed, since all the distributions of $l$ but one (segregation analysis with 5 sires and 5 progeny per sire) are significantly different from a $\chi^2_4$ according to a $\chi^2$ test of fit. Moreover, the scaled statistics $(2E(l)/\mathrm{var}(l))\, l$ are also significantly different from a $\chi^2$. It must be emphasized that the samples studied are far from the conditions of validity of Wolfe's approximation, which requires that $n > 10m$ (Everitt, 1981). The $l_{SA}$ statistic shows a notable stability as the family size varies, whereas the $l_{FE}$ statistic only reaches an asymptote as $m$, the number of progeny per sire, increases. As regards the $l_{ME}$ statistics, the results are totally different.
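The scaling factor $2E(l)/\mathrm{var}(l)$ is a moment-matching device: multiplying $l$ by it yields a variable whose variance is twice its mean, as for a $\chi^2$, and the matched number of degrees of freedom is $2E(l)^2/\mathrm{var}(l)$. A minimal sketch follows, assuming the population form of the variance and a hypothetical function name.

```python
from statistics import mean, pvariance

def moment_scaling(statistics):
    """Return (c, df): the scaling factor c = 2 E(l) / var(l) and the
    chi-square degrees of freedom df = 2 E(l)^2 / var(l) matched by c * l."""
    e = mean(statistics)
    v = pvariance(statistics)   # population variance (an assumed choice)
    return 2.0 * e / v, 2.0 * e * e / v
```

Matching the first two moments does not guarantee that the scaled statistic follows a $\chi^2$ distribution, which is what the goodness-of-fit result reported in the text indicates.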
The mean and standard deviation of the $l_{ME1}$ statistic decrease when the number of sires or of progeny per sire increases. It appeared that the distribution of this $l_{ME1}$ statistic becomes very peaked near zero. It must be noticed that this pattern is close to the asymptotic distribution of the LR test for a mixture of 2 known distributions in unknown proportion, studied by Titterington et al. (1985). These authors found that, under $H_0$ (only one component), the LR test "is 0 with a probability 0.5 and, with the same probability, is distributed as a $\chi^2$ with one degree of freedom". On the other hand, for a given number of progeny, the mean of the $l_{ME2}$ distribution increases with the number of sires. The fewer the progeny, the greater the increase.
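The mixture distribution quoted from Titterington et al. (1985) has a simple upper tail: for $x \geq 0$, the probability of exceeding $x$ is half the tail of a $\chi^2$ with one degree of freedom, which can be written with the complementary error function. The sketch below uses a hypothetical function name and is only meant to show what critical values such a distribution would imply.

```python
import math

def half_chi2_1_tail(x):
    """P(l > x), x >= 0, when l is 0 with probability 0.5 and chi-square
    with one degree of freedom with probability 0.5."""
    # P(chi2_1 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)) for standard normal Z.
    return 0.5 * math.erfc(math.sqrt(x / 2.0))
```

Under this distribution the 5% critical value is about 2.71, far below the 9.49 of a $\chi^2$ with 4 degrees of freedom, which illustrates how strongly a distribution peaked near zero lowers the critical region.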
The calculation of the power (Table V) shows some important facts: very low power of the four statistics for low numbers of sires and/or progeny; clear superiority of the segregation analysis and of the first modal estimation method whatever these numbers, with respectively 90% and 80% power in the best case (though involving only 400 animals); very poor performance of the $l_{FE}$ statistic; and intermediate power for $l_{ME2}$.
Thus knowledge of the heritability is a substantial advantage and gives a reason to prefer the $l_{ME}$ statistics to $l_{FE}$, which requires a similar amount of computation.
The comparison of powers across the $H_1$ hypotheses is also interesting: it is much more difficult to detect an additive major gene (case 2) than a dominant one (case 1), even with the segregation analysis, which is 3 to 4 times less powerful in case 2 than in case 1. In comparison with the case of equal allele frequencies, the third case shows a 50% loss of power: with measurements made on a small population, very few individuals, if any, belong to the high mean distribution.
The computation requirements have been estimated, on an IBM 3083 computer, by the CPU time needed for the evaluation of the statistics under $H_0$. Ten replicates of a sample of 10 sires and 10 progeny per sire used 640 s for the $l_{SA}$ statistic, 142 s for the $l_{FE}$ statistic and 48 s for the $l_{ME}$ statistics. Using the EM algorithm instead of the direct maximization of $l_{ME}$ with the NAG subroutines decreases the time requirement to only 20 s. Thus, the proposed simplified tests $l_{ME}$ are about 30 times as fast as the segregation analysis.

Tables of quantiles
Although theoretical work is still needed in order to describe the asymptotic behaviour of the $l_{SA}$, $l_{ME1}$ and $l_{FE}$ tests, one can use, as a first approach, the quantiles given in our tables for larger populations, since this will produce an overestimation of the type I error. In contrast, some more calculations are needed for the $l_{ME2}$ test.
The 5 and 1% points for this statistic are given in figures 1 to 3 depending on the heritability (0.0, 0.2, 0.4). Each figure gives these points for varying numbers of sires and progeny per sire.
Note that when the heritability is 0, the sire effect is not defined and, thus, the $u_i^{(a+1)}$ terms disappear from the equations given in the appendix. The results of Table III are confirmed: the quantile estimates increase with the number of sires $n$ (for a given number of progeny per sire, $m$) and decrease when the number of progeny per sire increases. Two other results must be noticed:
- given $n$ and $m$, the lower the heritability, the greater the quantiles;
- over the range studied for $m$, the number of progeny per sire, the increase of the quantiles is nearly linear with $n$ (number of sires), allowing some extrapolation to higher values of this number.
Finally, the jackknife standard deviation of the estimated quantile varies, for the 5% case, between 0.23 and 0.89, with a mean value of 0.52 and, for the 1% case, between 0.39 and 1.65 with a mean value of 0.92. These errors could explain the observed deviations of the plotted curves from smoothness.

CONCLUSIONS
Of the four statistical tests studied, the "segregation analysis" method is, as expected, the most powerful. Applied on a large scale, this test requires a great deal of computation. The "modal effect" method requires much less computation than the segregation analysis and shows practically no loss of power for the first version and a limited loss of power (diminishing as soon as the sample size is sufficient) for the second version. Unfortunately, the asymptotic distribution of this last statistic is unknown. The tables of quantiles we obtained by simulation permit the use of this test for typical sample sizes and for various heritability values.