Identification of a major gene in F1 and F2 data when alleles are assumed fixed in the parental lines

A maximum likelihood method is described to identify a major gene using F2, and optionally F1, data of an experimental cross. A model which assumes fixation at the major locus in the parental lines was investigated by simulation. For large data sets (1000 observations) the likelihood ratio test was conservative and yielded a type I error of 3% at a nominal level of 5%. The power of the test reached > 95% for additive and completely dominant effects of 4 and 2 residual SDs respectively. For smaller data sets, power decreased. In this model assuming fixation, polygenic effects may be ignored, but on various other points the model is poorly robust. When F1 data were included, any increase in variance from F1 to F2 biased the parameter estimates and led to putative detection of a major gene.


INTRODUCTION
In animal breeding, crosses are used to combine favourable characteristics into one synthetic line. It is useful to detect a major gene as early as possible in such a line, because selection could then be carried out more efficiently, or repeated backcrosses could be made. Once a major gene has been identified it can also be used for introgression into other lines.
Major genes can be identified using maximum likelihood methods, such as segregation analysis (Elston and Stewart, 1971; Morton and MacLean, 1974). Segregation analysis is a general method and can be applied in populations where alleles segregate in parents. However, when applied to F1, F2 or backcross data assuming fixation of alleles in the parental lines, the genotypes of parents are assumed known and all equal, and the analysis reduces to fitting a mixture distribution without accounting for family structure.
Fitting of mixture distributions has been proposed when pure line and backcross data as well as F1 and F2 data are available, and when parental lines are homozygous for all loci (Elston and Stewart, 1973; Elston, 1984). Statistical properties of this method, however, were not described, and several assumptions may not hold. For example, little is known about the power of this method when only F2 data are available, which is often the case when developing a synthetic line. Furthermore, homozygosity at all loci in parental lines is not tenable in practical animal breeding. Here it is assumed that many alleles of small effect, so-called polygenes, are segregating in the parental lines. Alleles at the major locus are assumed fixed. F1 data could possibly be included, but this is not necessarily more informative because the F1 and F2 generations may have different means and variances due to segregating polygenes.
The aim of this paper is to investigate by simulation some of the statistical properties of fitting mixture distributions, such as the type I error, the power of the likelihood ratio test and the bias of parameter estimates when using only F2 data. To study the properties of the major gene model, polygenic variance is not estimated. The robustness of this model will be checked when polygenic variance is present in the data, and when the major gene is not fixed in the parental lines. The question of whether F1 data can and should be included will also be addressed.

MODELS USED FOR SIMULATION
A base population of F1 individuals was simulated, although the F1 generation may not have had observed records. Consider a single locus A with alleles A1 and A2, where A1 has frequencies f_p and f_m in the paternal and maternal line. Genotype frequencies, values and numeration for F1 individuals are: A1A1 (r = 1) with frequency f_p f_m and value \mu_1; A1A2 (r = 2) with frequency f_p(1 - f_m) + (1 - f_p) f_m and value \mu_2; A2A2 (r = 3) with frequency (1 - f_p)(1 - f_m) and value \mu_3. Genotypes of F1 animals were allocated according to these frequencies using uniform random numbers. For the F2 generation, genotype probabilities were calculated given the parents' genotypes using Mendelian transmission probabilities and assuming random mating and no selection. A random environmental component e_i was simulated and added to the genotypic value. The observation on individual i (F1 or F2) with genotype r is:

y_i = \mu_r + e_i   [1]

with e_i distributed N(0, \sigma_e^2). Polygenic effects are assumed to be normally distributed. For base individuals polygenic values were sampled from N(0, \sigma_g^2), where \sigma_g^2 is the polygenic variance. No records were simulated for F1 individuals when polygenic effects were included. For F2 offspring, phenotypic observations y_{ij} were simulated as:

y_{ij} = \mu_r + \tfrac{1}{2}(a_p + a_m) + \phi_{ij} + e_{ij}   [2]

where \phi_{ij} is the Mendelian sampling term, sampled from N(0, \sigma_g^2/2), a_p and a_m are the paternal and maternal polygenic values, and e_{ij} is distributed N(0, \sigma_e^2). Additionally, data were simulated with no major gene or polygenic effect:

y_i = \mu + e_i   [3]

where e_i is distributed N(0, \sigma_e^2). A balanced family structure was simulated, with an equal number of dams nested within each sire, and an equal number of offspring for each dam. Random variables were generated by the IMSL routines GGUBFS for uniform variables and GGNQF for normal variables (IMSL, 1984).
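As an illustration of the simulation scheme under the fixation assumption, the following sketch (in Python, with illustrative function and parameter names; the original study used IMSL routines) generates F2 phenotypes under model [1], using the (\mu, t, d) parameterisation of the genotype means introduced in the analysis section below:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_f2_model1(n, mu, t, d, sigma_e):
    """Simulate F2 phenotypes under model [1]: y_i = mu_r + e_i.

    With alleles fixed in the parental lines every F1 parent is A1A2,
    so F2 genotypes A1A1, A1A2, A2A2 occur with probabilities 1/4, 1/2, 1/4.
    Genotype means in the (mu, t, d) parameterisation:
    mu_1 = mu - t/2, mu_2 = mu + (d - 1/2)*t, mu_3 = mu + t/2.
    """
    mu_r = np.array([mu - t / 2, mu + (d - 0.5) * t, mu + t / 2])
    genotype = rng.choice(3, size=n, p=[0.25, 0.5, 0.25])   # index 0, 1, 2 for r = 1, 2, 3
    return mu_r[genotype] + rng.normal(0.0, sigma_e, size=n)

# eg an additive effect of 3.5 residual SDs (t = 35, d = 0.5, residual variance 100)
y = simulate_f2_model1(1000, mu=0.0, t=35.0, d=0.5, sigma_e=10.0)
```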

MODELS USED FOR ANALYSIS
The test for the presence of a major gene is based on comparing the likelihoods of a model with and without a major gene. Polygenic effects are not included in the model, and the model without a major gene therefore contains a random environmental effect only. Apart from major gene or no major gene, models can account for only F2 data, or for both F1 and F2 data. This results in a total of 4 models, described below.
Model for F2 data with environment only

For F2 data, with n observations, the model can be written:

y_i = \beta + e_i   [4]

The logarithm of the joint likelihood for all observations, assuming normality and uncorrelated errors, is:

L_1 = -\tfrac{n}{2}\ln(2\pi\sigma_e^2) - \sum_i (y_i - \beta)^2 / (2\sigma_e^2)   [5]

Maximising [5] with respect to \beta and \sigma_e^2 yields as the maximum likelihood (ML) estimate for the mean \hat{\beta} = \sum_i y_i / n, and as the ML estimate for the variance \hat{\sigma}_e^2 = \sum_i (y_i - \hat{\beta})^2 / n.

Model for F1 and F2 data with environment only

Data on F1 and F2 are combined, with n_1 + n_2 = N observations. The observation on animal j from generation i (i = 1, 2) is:

y_{ij} = \beta_i + e_{ij}   [6]

where \beta_i is the mean for generation i. Observations for F1 and F2 are assumed to have equal environmental variance. The joint log-likelihood is given as:

L_1' = -\tfrac{N}{2}\ln(2\pi\sigma_e^2) - \sum_i \sum_j (y_{ij} - \beta_i)^2 / (2\sigma_e^2)   [7]

The ML estimates for \beta_i are simply the observed means for each generation, ie \hat{\beta}_1 = \sum_j y_{1j}/n_1 and \hat{\beta}_2 = \sum_j y_{2j}/n_2. The ML estimate for the variance is \hat{\sigma}_e^2 = \sum_i \sum_j (y_{ij} - \hat{\beta}_i)^2 / N.

Model with major gene and environment for F2 data

When alleles are assumed fixed in the parental lines, all F1 individuals are known to be heterozygous. If no polygenic effects are considered, this means that all F2 individuals have the same expectation, and conditioning on parents is redundant. In the likelihood for such data, summations over the parents' possible genotypes can be omitted and families can be pooled. The model is given as:

y_i = \mu_r + e_i   [8]

and the log-likelihood equals:

L_2 = \sum_{i=1}^{n} \ln\left(\sum_{r=1}^{3} P_r \, f(y_i; \mu_r, \sigma_e^2)\right)   [9]

In [9], G_i is the genotype of individual i and P_r denotes the prior probability that G_i = r, which equals 1/4, 1/2 and 1/4 for r = 1, 2 and 3 (or A1A1, A1A2 and A2A2). The total number of F2 individuals is given as n, and the function f is the normal density:

f(y_i; \mu_r, \sigma_e^2) = (2\pi\sigma_e^2)^{-1/2} \exp\left(-(y_i - \mu_r)^2 / (2\sigma_e^2)\right)   [10]
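A minimal sketch of log-likelihood [9], implemented directly as the finite normal mixture above with mixing proportions 1/4, 1/2 and 1/4 (function and variable names are illustrative, not the authors'):

```python
import numpy as np

def loglik_f2_major_gene(y, mu_r, sigma2_e):
    """Log-likelihood [9]: each F2 record is a 1/4 : 1/2 : 1/4 mixture of three
    normal densities with genotype means mu_1, mu_2, mu_3 and common variance."""
    y = np.asarray(y, dtype=float)[:, None]          # n x 1
    mu_r = np.asarray(mu_r, dtype=float)[None, :]    # 1 x 3
    p_r = np.array([0.25, 0.5, 0.25])                # prior genotype probabilities P_r
    dens = np.exp(-(y - mu_r) ** 2 / (2.0 * sigma2_e)) / np.sqrt(2.0 * np.pi * sigma2_e)
    return float(np.sum(np.log(dens @ p_r)))
```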
Model with major gene and environment for F1 and F2 data

In the F1 generation only one genotype occurs; hence F1 data are distributed around a single mean, with a variance equal to the residual variance in the F2 generation. Due to possible heterosis shown by the polygenes a separate mean is modelled, but the possible heterogeneity in variance caused by polygenes is not accounted for. The model for individual j from generation i with genotype r is:

y_{ij} = \mu_r + \beta_i + e_{ij}   [11]

where \beta_i is a fixed effect for generation i. Model [11] is overparameterised because 3 genotype means and 2 general means are modelled. We chose to set \beta_2 = 0. In that case the mean of F1 individuals, which all have known genotype r = 2, can be written as \mu_{F1} = \mu_2 + \beta_1. The joint log-likelihood for F1 and F2 data, using \mu_{F1}, is:

L_2' = \sum_{j=1}^{n_1} \ln f(y_{1j}; \mu_{F1}, \sigma_e^2) + \sum_{j=1}^{n_2} \ln\left(\sum_{r=1}^{3} P_r \, f(y_{2j}; \mu_r, \sigma_e^2)\right)   [12]

where n_1 and n_2 are the numbers of observations in the F1 and F2 generations. The ML estimate for \mu_{F1} is equal to \hat{\beta}_1 in [6].
ML estimates for \mu_r (r = 1, 2, 3) and \sigma_e^2 in models [8] and [11] cannot be given explicitly. These parameters were estimated by minimising minus the log-likelihoods L_2 in [9] and L_2' in [12], using a quasi-Newton minimisation routine. A reparameterisation was made using the difference between homozygotes, t = \mu_3 - \mu_1, and a relative dominance coefficient, d = (\mu_2 - \mu_1)/t, as in Morton and MacLean (1974).
In our experience, this parameterisation was more appropriate than the parameterisation using the 3 means \mu_1, \mu_2 and \mu_3, because convergence is generally reached faster due to smaller sampling covariances between the estimates. The mean was chosen as the mid-homozygote value: \mu = \tfrac{1}{2}\mu_1 + \tfrac{1}{2}\mu_3.
Parameters t and d are easier to interpret than the 3 means, and therefore results are also presented using these parameters. Parameter t indicates the magnitude of the major gene effect and can be expressed either absolutely or in units of the residual standard deviation. Parameter t was constrained to be positive, which is arbitrary because the likelihood for the parameters \mu, t and d is equal to the likelihood for the parameters \mu, -t and (1 - d). The dominance coefficient d was also constrained, to between 0 and 1, to prevent d from diverging when the estimate of t approaches zero.

The presence of a major gene was tested with the likelihood ratio statistic T, ie twice the difference between the maximised log-likelihoods of the model with a major gene (H1) and the model with environment only (H0). The type I error is the percentage of cases in which H0 is rejected, although it was true; the type II error is the percentage of cases in which H1 is rejected, although it was true. Here, the type II error is not used, but its complement, the power, which is the percentage of cases in which H1 is accepted when H1 is true. The H1 model is simulated by model [1]. Fixation of alleles in parental lines is simulated by taking f_p = 1 and f_m = 0.
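Putting the pieces together, the following sketch fits the (\mu, t, d, \sigma_e^2) parameterisation by quasi-Newton minimisation and compares the likelihood ratio statistic with a \chi^2 critical value with 2 degrees of freedom. Scipy's L-BFGS-B routine and the box constraints stand in for the original quasi-Newton routine and constraints, so these details are assumptions rather than the authors' implementation:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

def fit_major_gene(y):
    """Maximise log-likelihood [9] over (mu, t, d, log sigma2_e)."""
    y = np.asarray(y, dtype=float)

    def negloglik(par):
        mu, t, d, log_s2 = par
        s2 = np.exp(log_s2)                                  # keeps the variance positive
        mu_r = np.array([mu - t / 2, mu + (d - 0.5) * t, mu + t / 2])
        dens = np.exp(-(y[:, None] - mu_r) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)
        return -np.sum(np.log(dens @ np.array([0.25, 0.5, 0.25])))

    start = np.array([y.mean(), y.std(), 0.5, np.log(y.var())])
    bounds = [(None, None), (0.0, None), (0.0, 1.0), (None, None)]   # t >= 0, 0 <= d <= 1
    return minimize(negloglik, start, method="L-BFGS-B", bounds=bounds)

def lr_test(y, alpha=0.05):
    """Likelihood ratio test of H1 (major gene, [9]) against H0 (environment only, [5])."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    l0 = -0.5 * n * (np.log(2 * np.pi * np.var(y)) + 1.0)    # maximised log-likelihood [5]
    l1 = -fit_major_gene(y).fun                              # maximised log-likelihood [9]
    T = 2.0 * (l1 - l0)
    return T, T > chi2.ppf(1.0 - alpha, df=2)
```

With data simulated as in the first sketch, lr_test(y) returns the test statistic T and the decision at a nominal 5% level.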

Type I error
The distribution of T when H0 is true is expected asymptotically to be χ² with 2 degrees of freedom, because the H1 model has 2 parameters more than the H0 model (Wilks, 1938). Since in practice data sets are always of finite size, it is interesting to know whether and when the distribution of T is close enough to the expected asymptotic distribution, so that quantiles from a χ² distribution can be used as critical values. Type I errors were estimated for data sets of 100 up to 2000 observations, simulating 1000 replicates for each size of data set. Three critical values were used, corresponding to nominal levels of 10, 5 and 1%. The nominal level is defined as the expected error rate, based on the asymptotic distribution. Exact binomial probabilities were used to test whether the estimates differed significantly from the nominal level. When the observed number of significant replicates does not differ significantly from the expected number, a χ² distribution is considered suitable to provide critical values.
Also, when the observed number is lower than expected the asymptotic distribution might remain useful. The nominal type I error is in that case an upper bound for the real type I error.
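As an example of the binomial check, a small sketch using scipy (the rejection count of 30 out of 1000 replicates corresponds to the roughly 3% observed at a nominal 5% level reported in the results):

```python
from scipy.stats import binomtest

# eg 30 rejections observed in 1000 replicates at a nominal level of 5%
test = binomtest(30, n=1000, p=0.05)
print(test.pvalue)   # p < 0.01: the observed 3% is significantly below the nominal 5%
```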

Power of the test and estimated parameters
The power is investigated for additive (d = 0.5) and completely dominant (d = 1) effects, with a residual variance of 100, and t varying from 10 to 40, ie from 1 to 4 SDs. The additive genetic variance caused by this locus equals t²/8, when t is expressed absolutely. Narrow-sense heritability therefore varies from 0.11 to 0.67. Each data set contained 1000 observations, and each situation was repeated 100 times. The power of the test for smaller data sets was investigated for one relatively small effect and one relatively large effect.
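As a check on the quoted range, with a residual variance of 100 and a purely additive locus (d = 0.5) the narrow-sense heritability in the F2 is:

```latex
% additive variance of the locus in the F2 and the implied heritability
\[
  \sigma_A^2 = \frac{t^2}{8}, \qquad
  h^2 = \frac{\sigma_A^2}{\sigma_A^2 + \sigma_e^2} = \frac{t^2/8}{t^2/8 + 100}
\]
\[
  t = 10:\ h^2 = \frac{12.5}{112.5} \approx 0.11, \qquad
  t = 40:\ h^2 = \frac{200}{300} \approx 0.67
\]
```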

Robustness
Investigation of the type I error and the power considered situations where either H0 or H1 was true, satisfying all assumptions in the models. The robustness of this test, and the usefulness of the fixation assumption for parameter estimation, was investigated for situations which violate 2 assumptions:
- when there is a covariance between error terms. This was induced by simulating polygenic variance with model [2]. The total variance was held constant at 100, so that the power of the test could not change due to a change in total variance (a small sketch of this set-up is given after this list);
- when fixation of alleles does not hold. The data were simulated by model [1], in which f_p and f_m were not equal to 1 and 0, resulting in segregation of alleles among the F1 parents. Firstly, 3 situations were simulated where the average allele frequency remains 0.5. In that case only the assumption that all F1 parents are heterozygous was violated. Secondly, 3 situations were simulated where the average allele frequency was not 0.5. In that case, the assumption that genotype frequencies in F2 are 1/4, 1/2 and 1/4 was also violated.
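A sketch of the first set-up (model [2] with the balanced sire/dam structure, the polygenic and residual variances summing to a total of 100); function and argument names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate_f2_model2(n_sires, n_dams, n_off, mu_r, var_poly, var_total=100.0):
    """Model [2]: y_ij = mu_r + (a_sire + a_dam)/2 + Mendelian sampling + e_ij,
    with polygenic plus residual variance held at var_total."""
    var_e = var_total - var_poly
    records = []
    for _ in range(n_sires):
        a_sire = rng.normal(0.0, np.sqrt(var_poly))            # sire polygenic value
        for _ in range(n_dams):
            a_dam = rng.normal(0.0, np.sqrt(var_poly))         # dam polygenic value
            geno = rng.choice(3, size=n_off, p=[0.25, 0.5, 0.25])
            mend = rng.normal(0.0, np.sqrt(var_poly / 2), size=n_off)
            e = rng.normal(0.0, np.sqrt(var_e), size=n_off)
            records.append(np.asarray(mu_r)[geno] + 0.5 * (a_sire + a_dam) + mend + e)
    return np.concatenate(records)

# eg 20 sires x 5 dams x 10 offspring, half of the non-major-gene variance polygenic
mu_r = np.array([-17.5, 0.0, 17.5])        # t = 35, d = 0.5, mid-homozygote mean 0
y = simulate_f2_model2(20, 5, 10, mu_r, var_poly=50.0)
```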

Inclusion of F l data
A major gene which starts segregating in the F2 not only renders the distribution non-normal, but also increases the phenotypic variance in the F2 relative to the F1. When F1 data are included, this increase in variance may be taken as supplementary evidence, apart from any non-normality, for the existence of a major gene. Assessing the relative importance of the 2 sources of information is useful so as to judge the robustness of the model including F1 data. The effects of non-normality and of increased F2 variance due to the major gene should therefore be distinguished. This was accomplished by simulating different residual variances in F1 and F2. Four situations were investigated, combining all combinations of non-normality in F2 and increased variance in F2.

RESULTS

Type I error

Estimated type I errors decreased, and more or less stabilised, when the size of the data set exceeded 1000 observations, especially for a nominal level of 10%, for which the estimates were most accurate. For these large data sets, however, the type I errors were too low (P < 0.01), which means that critical values obtained from a χ² distribution provide a too conservative test. For example, application of the χ² 95th percentile to data sets with 1000 observations will not result in the expected type I error of 5%, but rather in a type I error of about 3%.
When no major gene effect was present, a considerable effect could still be found on average. Parameter estimates for the major gene model are given in table III, simulating only a normally distributed error effect with variance 100. The empirical standard deviation of the estimated t-values ranged between 7 (N = 100) and 5 (N = 2000) (not in table). The average estimate for t is therefore biased, and many of the individual estimates would be significantly different from zero if a t-test were applied. The average estimated d is 0.5, which is expected because the simulated distribution was symmetrical.

Power of the test and estimated parameters

The power of the test and the average parameter estimates for increasing major gene effects are given in table IV and figure 1. For small genetic effects (t ≤ 10, ie 1σ), t was overestimated, in particular when d = 1. For additive effects, the estimates differed from the simulated values by < 1% when the power reached near 100%. For d = 1, however, the bias in t was still 10% when the power had reached 100%.
This bias reduced gradually, and was < 1% for a genetic effect of t = 40. In figure 2 the power of the test is depicted for varying sizes of the data set. Two additive effects were chosen, with t = 25 and t = 35. Each point in the figure is an average of 100 replicates. The power increased with increasing numbers of observations. Increasing the number of observations beyond 1000 gave relatively less improvement in power, especially for the smaller effect (t = 25). For a small number of observations this graph is expected to level off at the type I error (nominally 5%), but sampling makes results somewhat erratic.
Robustness when ignoring polygenic variance

Data following model [2] were simulated with d = 0.5 and t = 35 and different proportions of polygenic and residual variance. The data set contained 20 sires with 5 dams each and 10 offspring per dam; each situation was repeated 100 times. Estimated parameters and the resulting power are in table V. Parameter estimates for t and d, and the power of the test, were not affected when part of the variance was polygenic. The total estimated variance was equal to the sum of the simulated variances.
Robustness when ignoring segregation in the parental lines

Data following model [1] were simulated with d = 0.5, t = 35, σ²_e = 100 and various values for f_p and f_m. The genotype probabilities in parents (F1) and offspring (F2) are given in table VI. For the first 3 situations, genotype probabilities in the F2 were 1/4, 1/2 and 1/4, as assumed under the fixation assumption. For the last 3 situations, however, genotype probabilities were different, because the allele frequency was not 0.5 on average. High average allele frequencies were simulated, but because only additive effects are considered, results are equally valid for low allele frequencies. The power remained equal, and parameter estimates were unbiased, as long as genotype probabilities in F2 remained 1/4, 1/2 and 1/4 (table VII).
When the allele frequency did not average 0.5, however, parameter estimates were biased. The power of the test increased, because in this situation the distribution became skewed. The situation with d = 0.5 and t = 35 for data where the gene is fixed in the parental lines (table IV), with a power of 82%, may serve as a reference.

Inclusion of F1 data
Five hundred, or 1000, F1 observations were also simulated, with additive major gene effects (table VIII). With no major gene effect (t = 0, and hence a major gene variance of 0), and with equal variances in F1 and F2 (situation 1), the average estimated t was much smaller than in the model using only F2 data (table III). In the second situation (table VIII) a major gene effect of t = 20 was simulated, which corresponds to the given major gene variance of 50. When using only F2 data, the test had a power of only 12% for detection of an additive effect of t = 20 (table IV). When including F1 data, however, the power was 100% (table VIII). From situations 3 and 4 in table VIII, however, it becomes apparent that when F1 data were included, the major gene was detected only by its effect on the variance, considering a power near the type I error rate as irrelevant. When the variance in F2 was increased by 50%, but in fact no major gene was present, a major gene was found in 100% of the cases. For smaller increases of the variance (10%) major genes were still detected, and the probability of detection increased with the size of the data set (alternative 3* with more F1 observations). A major gene was totally undetectable, on the other hand, when the total variance in F1 was equal to the total variance in F2 (situation 4). This shows that the ability to detect a major gene can even be worsened when F1 data are included: if only F2 data were used, a major gene with a similar effect was detected in 12% of the cases (table IV).

DISCUSSION

Type I error
Nominal levels for type I errors were based on Wilks (1938), who proved asymptotic convergence of the likelihood ratio test statistic to a χ² distribution. Type I errors decreased and stabilised for larger data sets, as expected. The estimated type I errors, however, were significantly too low. It is unlikely that the type I error, after having first decreased, would increase again for data sets even larger than those studied here. It can be concluded, therefore, that type I errors are significantly lower than expected in the asymptotic case, and that for large data sets the likelihood ratio test is conservative. It was investigated whether the constraint used on the dominance coefficient could have caused the too low type I errors. This was not the case: even with no constraint, type I errors of 7.5% and 3.9% were found at nominal levels of 10 and 5%.
For the investigation of power we chose to use the theoretical asymptotic quantiles, although they were shown to give a conservative test. The nominal level for the type I error is then an upper bound, and the experimenter still has a reasonably good idea of the risk of making a type I error. If the actual type I error were above the nominal level, however, the test would be of less use.
A second reason for still using the theoretical asymptotic quantiles is that adapting the test is difficult and of little practical use. A difficulty is, for instance, that estimated quantiles would be subject to sampling error, and the obtained point estimate is therefore only expected to give the correct test on average. Two experimenters investigating the same test would thus find different critical values, and the test applied would depend on the experimenter. Also, in practice such a procedure would be difficult to apply, since the calculated quantile would only hold for the same model and for data sets of similar size and structure.
Power of the test

Using only F2 data, the power of this test was poor for additive effects (dominance coefficient = 0.5). This can be explained by the resulting symmetrical distribution, which is similar to the distribution under H0. In this case, the genetic effect has to be about 4σ to be detectable, which corresponds to a heritability of 0.67 in the F2 generation. When the dominance coefficient is 1, an effect of 2σ was detectable.
These results are based on data sets with 1000 observations, but it was shown that the power decreased dramatically for smaller data sets.
Power increased when F1 data were included in the analysis, and additive effects of 2σ could then be detected. In that case the increase in variance in F2, caused by the major gene, was taken as an important indication of the presence of a major gene. The power to detect a major gene in F2 data may also increase if alleles are not fixed in the parental lines, or alternatively if F3, instead of F2, data are used. This corresponds more closely to the situation in a usual population, where between-family variation will arise. For F3 data, for example, when the pure lines are homozygous, the allele frequency will be 0.5 and the parents will be in Hardy-Weinberg equilibrium. For such a situation, Le Roy (1989) found a power of 25% for an additive effect of 2σ in a data set of 400 observations (20 sires with 20 half-sib offspring each). In figure 2, the power for a data set of similar size can be seen to be only about 10% for an even larger effect of 2.5σ (t = 25). This indicates that an increase in power may be expected when the F3 generation is observed, despite the facts that more parameters have to be estimated and that the parents' genotypes are no longer known.
The power for detection of a major gene is related to the unexplained variance in the model of analysis. The inclusion of fixed and polygenic effects will therefore make the major gene easier to detect, provided that all these effects can be accurately estimated.

Parameter estimates
For the additive effects simulated (d = 0.5), the bias in the average estimated genetic effect was small. For dominant effects (d = 1), however, t was overestimated by 10% when the power for detection of a major gene reached 100%. This overestimate is probably related to the underestimate of d, which resulted from the applied constraint. As mentioned, this constraint was applied to prevent t from going to zero, at which point d tended to go to infinity. When such a constraint was not applied, with, for instance, an effect of t = 10 and d = 1, analyses gave in 100 replicates an average estimated d of 2.93, an average overestimate of about 200%. The average estimate using the constraint was 0.93, showing that better estimates were indeed obtained under the restriction, even when the true value was on the border of the allowed parameter space. In practice, of course, overdominance cannot be excluded, and parameter estimates could be compared with and without this constraint. A small, near-zero estimate for t together with a large estimate for d would suggest a possible overestimation of d.
For very small or absent effects, the ML estimates were considerably biased. In this situation, the asymptotic properties of ML estimates, ie consistency, are far from being attained. In the absence of a major gene, average estimates were presented for increasing sizes of the data set. These showed that the average estimate decreased, and will probably reach the true value only when the number of observations is very much larger. Bias of the ML estimates in finite samples also resulted in significant t-values when no effect was present. This indicates that the presence of a major gene should not be judged from the estimates and their standard errors. The standard errors discussed here were empirical standard errors. In practice such standard errors will have to be obtained using the inverse of an estimated Hessian matrix, or some other quadratic approximation of the likelihood surface near the optimum.
Using the estimated Hessian matrices, we found roughly the same standard errors, although they were not very accurate. In our study the quasi-Newton algorithm was started close to the optimum, and not enough iterations were then carried out to estimate the Hessian matrix accurately.

Robustness of model and test
Inclusion of F1 data results in a poorly robust test when differences in variance between the F1 and F2 arise from causes other than a major gene. An increase in variance from F1 to F2 can result in a putative major gene being detected. An increase in variance of 10%, for instance, gave 25% false detections when 1000 F1 and 1000 F2 observations were combined. Such increases are not unlikely, due to, for instance, polygenes. The major gene test is then merely a test for homogeneity of variance in F1 and F2. The inclusion of F1 data could also worsen the detection of a major gene, when the environmental variance in F2 was lower. Therefore any differences in variance due to causes other than the major gene effect will bias the parameter estimates. Such biases will also remain in a model that allows for segregation. It was shown that the model is robust when polygenic effects were ignored. This can be explained by the fact that the test uses only the non-normality of the distribution as a criterion. It must be noted, however, that when polygenic effects can be accurately estimated, including a polygenic effect in the model will increase the power because it reduces the residual variance.
Another aspect of robustness concerns the assumption of fixed alleles in the parental lines. It was shown that parameter estimates were not biased when alleles segregated, as long as the average frequency in the 2 lines was 0.5. In that case the mixing proportions 1/4, 1/2 and 1/4 assumed in the fitted model are still correct. If the average frequency in the parental lines differed from 0.5, t was underestimated and, because skewness was introduced, estimates for d deviated from 0.5. This second situation is more likely to occur than the situation where the average frequency is exactly 0.5. Because it could be difficult to justify the fixation assumption a priori, application of a more general model that allows for segregation in parental lines might have to be considered.
A final aspect of robustness concerns non-normality of the distribution not due to a major gene. As stated earlier, a mixture distribution is fitted, and the detection of a major gene in F2 data, assuming fixation, relies solely on the non-normality caused by the major gene. This means that in fact only significant non-normality is proven. The method would therefore be poorly robust against non-normality due to any other cause. The robustness might be improved by using data in which alleles segregate in the parents. This is guaranteed in F3 data, but may also arise in F2 data when alleles were not fixed in the parental lines. If segregation in parents is the case, evidence for a major gene no longer lies only in the non-normality of the overall distribution, but also, for instance, in heterogeneous within-family variances. A model that allows for segregation is therefore preferred not only to increase power, but also to improve robustness.