Power and parameter estimation of complex segregation analysis under a finite locus model

The power of segregation analysis and parameter estimation were studied in independent nucleus families for a quantitative trait determined either by a finite number of loci or by a mixed model of inheritance involving a major gene and an infinitesimal polygenic residual. In the finite locus model, ten loci were assumed and their effects followed a geometric distribution. In addition, the possibility of genetic linkage between a major locus and other loci was considered. Two methods of segregation analysis were compared, one using a mixed model of inheritance and the other a model of inheritance with a finite number of loci. The two statistical methods had similar power to detect a major gene and to estimate the corresponding parameters, except in a situation with two major loci having the same effect on the phenotype. In that situation the mixed model had higher power than the finite locus model, but parameter estimates from the mixed model were more biased than those from the finite locus model. Segregation analysis was more powerful in detecting a major gene when the trait was determined by a finite number of loci than under mixed inheritance. A major gene linked to another gene was more difficult to detect than a major gene without genetic linkage. Segregation of two major genes caused estimation bias. The bias increased further with genetic linkage when the parents were not drawn from a population in gametic equilibrium for the two major loci.


INTRODUCTION
Statistical methods used to determine the mode of inheritance of a quantitative trait and to detect major genes rely on phenotypic information. In addition, methods can utilize information on genetic markers, which are now numerous. In both cases, the most common statistical methods to detect a major gene are based on maximum likelihood theory. Maximum-likelihood-based complex segregation analysis was introduced by Elston and Stewart (1971) and Morton and MacLean (1974). Complex segregation analysis combines three factors into a mixed model for analysis of phenotypes for a quantitative trait: a gene that explains a detectable part of the genetic variance (major gene); residual polygenic variance, for which individual gene effects are not of direct interest or are not detectable; and environment. Recently, a finite polygenic mixed model, which explains the polygenic part of inheritance by a finite number of loci, was proposed by Fernando et al (1994) as an alternative formulation for the mixed model. To make the finite polygenic mixed model computationally feasible, it is assumed that the loci which explain the polygenic part of inheritance are unlinked, biallelic and codominant, and have equal gene effects and equal frequencies of favourable alleles (0.5) across loci.
Power of segregation analysis of independent nucleus family data (full-sib families) with the mixed model was investigated by MacLean et al (1975) and Borecki et al (1994) and for half-sib data by Le Roy et al (1989) and Knott et al (1991). In all cases, data were simulated according to the mixed model of inheritance. The general conclusion from these studies was that the best chance to detect a major gene is if it is dominant with moderate to low frequency in the population. By increasing data size (number of families and size of the families), major genes with smaller effects can be detected.
Many aspects that might affect robustness of segregation analysis with the mixed model have also been studied (MacLean et al, 1975; Go et al, 1978; Demenais et al, 1986). The main concern has been false detection of a major gene with skewed data. To overcome this problem, power transformation of the data was proposed (MacLean et al, 1976). The optimal solution for skewed data is to make the transformation simultaneously with estimation of other parameters (MacLean et al, 1984). Removing skewness may, however, lead to reduced power to detect a major gene (Demenais et al, 1986).
Other common assumptions in segregation analysis include homogeneous variance within major genotypes, independence between the major gene and polygenic effects, no genotype by environmental correlation, and no correlation between environment of parent and offspring (MacLean et al, 1975).
One basic assumption of segregation analysis, which has received less attention, is normality of the residual distribution (polygenic + environmental) within a major genotype. This assumption is met if the polygenic part is controlled by an infinite number of genes that each have only a small effect on phenotype, ie, the infinitesimal model (Bulmer, 1980), and if the environmental factor is normally distributed.
However, the infinitesimal model might not be the best model for the distribution of gene effects. A model in which a few genes with a large effect and several genes with small effects control a quantitative trait may be closer to the real nature of the distribution of gene effects. Evidence from Drosophila melanogaster supports this hypothesis (Shrimpton and Robertson, 1988; Mackay et al, 1992). Such a distribution of gene effects can be approximated by a geometric series (Lande and Thompson, 1990).
If gene effects follow a geometric series, the distribution within major genotype may not be normal, as with the infinitesimal model. This violates the assumption of a normally distributed polygenic part of the mixed model commonly used in segregation analysis. Two or more loci with large effects can also lie in a cluster on a chromosome, which would link the major gene to other genes and thus violate the assumption of independent segregation of a major gene and polygenes.
The objective of this paper was to study the effect of violation of the two assumptions of the underlying model in segregation analysis, namely a skewed polygenic distribution and linkage between a major gene and polygenes, on the power of detecting a major gene and on parameter estimation. Behavior of the mixed model of segregation analysis (Morton and MacLean, 1974) was compared to the finite polygenic mixed model. The methods were compared under an independent nucleus family data structure.

MATERIALS AND METHODS
Balanced data on a quantitative trait were simulated for 25 independent full-sib families, each with a sire, a dam, and ten offspring. All parents were assumed to be unrelated and were generated from a population under Hardy-Weinberg and linkage equilibria. Genotypes of parents were generated under a ten-locus model (finite locus model) or under a mixed model (from now on this will be called the mixed-generating model, whenever necessary, to distinguish between models used for generating and for analyzing the data).
Under the finite locus model, the gene with largest effect had a substitution effect of 1.0 (the difference between two homozygotes is twice the substitution effect) and the gene with the second largest effect had a substitution effect of 0.25, 0.5 or 1.0. Gene effects of the eight other loci followed the geometric series 0.25, 0.125, 0.0625, where one locus had an effect of 0.25, three loci an effect of 0.125 and four loci an effect of 0.0625. Gene frequencies were 0.5 for all loci except for the major locus, for which frequency of the dominant allele was either 0.1, 0.5, or 0.9. Two alleles per locus were simulated. The three loci with largest effect were completely dominant and other loci were additive. Genotypes of progeny were generated using either independent segregation of loci or the two loci with the largest effect were linked with a recombination rate of 0.1. In the case of linkage, linkage phase of the parents was either random or all parents were double heterozygotes for the two linked loci (favourable alleles on same chromosome).
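The finite locus scenario described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' simulation program: the function names, the coding of alleles (1 = dominant allele, 0 = recessive allele) and the centring of genotypic values at -a, 0, +a are assumptions made for the example.

```python
import random

N_DOMINANT = 3  # the three loci with largest effect are completely dominant


def locus_effects(second_effect=0.5):
    """Substitution effects of the ten loci; the second may be 0.25, 0.5 or 1.0."""
    return [1.0, second_effect, 0.25] + [0.125] * 3 + [0.0625] * 4


def sample_haplotype(freqs, rng):
    """Draw one haplotype under linkage equilibrium (1 = dominant allele)."""
    return [1 if rng.random() < p else 0 for p in freqs]


def sample_parent(p_major, rng):
    """Unrelated parent: two independent haplotypes (Hardy-Weinberg equilibrium)."""
    freqs = [p_major] + [0.5] * 9
    return (sample_haplotype(freqs, rng), sample_haplotype(freqs, rng))


def make_gamete(parent, rng, recomb=0.5):
    """Loci 0 and 1 recombine at rate `recomb`; all other loci segregate freely."""
    strand = rng.randrange(2)  # which parental haplotype supplies locus 0
    gamete = []
    for locus in range(len(parent[0])):
        if locus == 1:
            if rng.random() < recomb:
                strand = 1 - strand
        elif locus > 1:
            strand = rng.randrange(2)
        gamete.append(parent[strand][locus])
    return gamete


def genotypic_value(individual, effects):
    """Sum of locus values: -a / +a / +a if dominant, -a / 0 / +a if additive."""
    total = 0.0
    for locus, a in enumerate(effects):
        count = individual[0][locus] + individual[1][locus]
        if locus < N_DOMINANT:
            total += a if count >= 1 else -a
        else:
            total += a * (count - 1)
    return total


def simulate_family(p_major, effects, rng, n_offspring=10, recomb=0.5):
    """Genotypic values of sire, dam and their full-sib offspring."""
    sire, dam = sample_parent(p_major, rng), sample_parent(p_major, rng)
    offspring = [(make_gamete(sire, rng, recomb), make_gamete(dam, rng, recomb))
                 for _ in range(n_offspring)]
    return [genotypic_value(ind, effects) for ind in (sire, dam, *offspring)]
```

Calling `simulate_family(0.1, locus_effects(0.5), random.Random(1), recomb=0.1)` would give one family for the linked case with random linkage phase; the case in which all parents are double heterozygotes in coupling phase would need a different parent sampler, which is not shown here.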
For every finite locus scenario, corresponding genotypes were also generated with a mixed model. Under the mixed-generating model, a major gene with a substitution effect of 1.0 was simulated, along with a polygenic part, which was simulated from a normal distribution with 0 mean and genetic variance equal to the total genetic variance (additive + dominance) of the other nine loci in the corresponding finite locus model. The polygenic effect of progeny was generated from a normal distribution with mean equal to the average of polygenic effects of the parents and variance equal to half of the polygenic variance.
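As a rough illustration of how the polygenic part of the mixed-generating model can be matched to the finite locus model, the sketch below computes the total genetic variance (additive + dominance) of the nine non-major loci by enumerating genotypes under Hardy-Weinberg equilibrium, and then draws a progeny polygenic effect around the midparent value. Function names and the genotypic value coding (-a, 0 or +a, +a) are assumptions of this example, not taken from the original programs.

```python
import random


def locus_variance(a, p, dominant):
    """Total genetic variance of one biallelic locus under Hardy-Weinberg equilibrium."""
    q = 1.0 - p
    freqs = [q * q, 2.0 * p * q, p * p]        # 0, 1, 2 copies of the dominant allele
    values = [-a, a if dominant else 0.0, a]   # complete dominance or additivity
    mean = sum(f * v for f, v in zip(freqs, values))
    return sum(f * (v - mean) ** 2 for f, v in zip(freqs, values))


def polygenic_variance(second_effect):
    """Variance of the nine loci other than the major locus (all with frequency 0.5)."""
    dominant = [second_effect, 0.25]           # second and third largest loci
    additive = [0.125] * 3 + [0.0625] * 4      # remaining seven loci
    return (sum(locus_variance(a, 0.5, True) for a in dominant)
            + sum(locus_variance(a, 0.5, False) for a in additive))


def offspring_polygenic_effect(u_sire, u_dam, var_u, rng):
    """Midparent value plus a normal deviate with half the polygenic variance."""
    return rng.gauss(0.5 * (u_sire + u_dam), (0.5 * var_u) ** 0.5)
```

With a second-locus effect of 0.5, `polygenic_variance(0.5)` evaluates to 0.265625 under these assumptions, which would then be the variance of the normal distribution from which parental polygenic effects are drawn.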
Phenotypes were generated for both the finite locus and the mixed-generating model by adding an environmental effect to the genotypic effects. Environmental effects were simulated from a normal distribution with mean 0 and variance corresponding to one minus the broad sense heritability ($H^2$, total genetic variance over phenotypic variance), which was equal to 0.4. A summary of the genetic scenarios that were simulated is given in table I.
Simulated data sets were analyzed by two computer packages. The Pedigree Analysis Package (PAP Rev 4.02, Hasstedt, 1982, 1994) was used to compute the likelihood of the mixed model and SALP (segregation and linkage analysis for pedigrees, Stricker et al, 1994) to compute the likelihood of the finite polygenic mixed model. Only one major locus was fitted in SALP. Mendelian transmission probabilities, equal variances within genotypes and no power transformation were used in PAP. The downhill simplex method was used for maximization in SALP and Gemini (Lalouel, 1979) in PAP. Because Gemini does not allow maximization at boundaries of the parameter space (gene frequency and heritability have boundaries at 0 and 1), the program occasionally stopped. In those cases, the parameter that reached the boundary was fixed close to the boundary (0.0001 or 0.9999 for gene frequency and 0.0001 for heritability) and the other parameters were maximized conditional on that value. Because the major gene was simulated with complete dominance, $\mu_{AA}$ was fixed to be equal to $\mu_{Aa}$ in all maximum likelihood analyses. Input values for simulation were used as starting values for the maximization process. The likelihood ratio test statistic was calculated by comparing a general model to a model with equal means ($\mu_{AA} = \mu_{Aa} = \mu_{aa}$).
Because SALP and PAP use different parameterizations of effects, parameters were converted to two genotypic means ($\mu_{AA}$ and $\mu_{aa}$), gene frequency of the dominant allele ($p$), and polygenic ($\sigma_u^2$) and environmental ($\sigma_e^2$) variances.
Instead of polygenic and environmental variances, PAP estimates heritability ($h^2$) and the phenotypic standard deviation conditional on major genotype; for the finite polygenic mixed model SALP estimates a scaling factor ($= \sqrt{\sigma_u^2 / [q(1-q)k]}$, where $q$ is the allele frequency at polygenic loci, which was fixed at 0.5, and $k$ is twice the number of polygenic loci, which was fixed at ten), and phenotypic variance.
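The conversion to a common parameterization might look like the following sketch. It rests on two assumptions that are not spelled out in the text: that the heritability reported by PAP is the polygenic share of the variance within major genotype, and that the SALP scaling factor equals the square root of the polygenic variance divided by q(1 - q)k. Both function names are hypothetical.

```python
def pap_to_variances(h2, sd_within_genotype):
    """Assumed conversion: split the within-genotype variance by heritability."""
    var_within = sd_within_genotype ** 2
    var_u = h2 * var_within            # polygenic variance
    var_e = (1.0 - h2) * var_within    # environmental variance
    return var_u, var_e


def salp_scale_to_polygenic_variance(scale, k, q=0.5):
    """Assumed relation: scale = sqrt(var_u / (q * (1 - q) * k)), solved for var_u."""
    return scale ** 2 * q * (1.0 - q) * k
```

For instance, under these assumptions `pap_to_variances(0.25, 2.0)` splits a within-genotype variance of 4.0 into a polygenic variance of 1.0 and an environmental variance of 3.0.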
Each simulated major gene scenario (table I) was replicated 50 times. Empirical power of the mixed model of analysis was measured as the proportion of cases in which the likelihood ratio test statistic exceeded the critical value of the $\chi^2$ distribution with 2 df at the 5% significance level.
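The power calculation described above can be illustrated with a short sketch. It assumes that the test statistic is twice the difference in maximized log-likelihoods between the general and the restricted model; function names are illustrative. For 2 df the chi-square survival function is exp(-x/2), so the critical value has a closed form and no statistics library is needed.

```python
import math


def chi2_critical_2df(alpha=0.05):
    """Critical value of the chi-square distribution with 2 df: solve exp(-x/2) = alpha."""
    return -2.0 * math.log(alpha)


def lrt_statistic(loglik_general, loglik_restricted):
    """Likelihood ratio test statistic from two maximized log-likelihoods."""
    return 2.0 * (loglik_general - loglik_restricted)


def empirical_power(lrt_stats, alpha=0.05):
    """Proportion of replicates whose test statistic exceeds the critical value."""
    critical = chi2_critical_2df(alpha)
    return sum(1 for s in lrt_stats if s > critical) / len(lrt_stats)
```

At the 5% level the critical value is about 5.99, so with 50 replicates each replicate that exceeds it contributes 2 percentage points to the empirical power.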
Because the likelihood ratio test statistic is only asymptotically distributed according to the $\chi^2$ distribution (Wilks, 1938), 200 replicates of six data sets without a major gene were generated based on the infinitesimal model and the proportion of test statistics which supported the major gene hypothesis was calculated for both the mixed model and the finite polygenic mixed model. Polygenic and environmental variances of the examples corresponded to sets 2 and 3 (table I) without a major gene. The proportion of false detection is expected to be 5% when a 5% type I error level is used.
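To give a feel for how far an observed false detection rate can deviate from the nominal 5% by chance alone, the sketch below computes the binomial standard error for 200 replicates and an approximate 95% range. The normal approximation and the function names are choices made for this illustration and are not part of the original analysis.

```python
import math


def binomial_se(p, n):
    """Standard error of an observed proportion when the true proportion is p."""
    return math.sqrt(p * (1.0 - p) / n)


def approx_95_range(p, n):
    """Approximate 95% range for the observed proportion (normal approximation)."""
    half_width = 1.96 * binomial_se(p, n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


se = binomial_se(0.05, 200)           # about 0.015, ie, 1.5 percentage points
low, high = approx_95_range(0.05, 200)  # roughly 2% to 8%
```

Under this approximation, observed rates between roughly 2% and 8% are compatible with a true type I error rate of 5% when 200 replicates are used.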
Empirical power of the mixed model was measured as the proportion of cases in which the major gene hypothesis was accepted. Under the mixed-generating model, the power corresponds to the probability of detecting the simulated major gene. This is not the case when data are simulated under the finite locus model; instead of detecting the first locus as a major gene, the power indicates the probability of detecting any of the simulated loci as a major gene.

Power of the likelihood ratio test
When no major gene effect was generated and the likelihood ratio between the mixed model and the polygenic model was compared to the $\chi^2$ table value with two degrees of freedom at the 5% significance level, the proportions of false detection of a major gene were 4, 3 and 6% for the set 2 distribution of gene effects (table I) and 4, 3 and 5% for the set 3 distribution of gene effects, with gene frequencies of 0.1, 0.5 and 0.9, respectively. Using the finite polygenic mixed model and its sub-model, the corresponding values were 4, 3 and 4% for set 2 and 4, 4 and 3% for set 3.
Thus the true power of detecting a major gene for the data structure used here can be somewhat higher for both methods than reported in table II. When data were generated under the mixed model, the highest power was achieved when the frequency of the dominant allele was low and the lowest power with a rare recessive allele (table II). This pattern was consistent across different proportions of genetic variance explained by polygenes (sets 1, 2 and 3). Under the finite locus model, the pattern changed when two major loci had an equal effect on the trait (table II, set 3): the highest power for the mixed model was achieved when one of the genes was almost fixed in the population. For the finite polygenic mixed model, however, the difference between gene frequencies of 0.5 and 0.9 was small (without linkage).
The effect of the proportion of total genetic variance that a major gene explained on the power was very clear under the mixed-generating model; the power was higher if the major gene explained a large proportion of total genetic variance, when compared within the same gene frequency (table II, sets 1, 2 and 3).
The same pattern held when data were generated under the finite locus model: power decreased when the effect of the second largest locus increased (table II, sets 1, 2 and 3). An exception was, again, the case in which two major loci had an equal effect on the trait and frequencies of favourable alleles at the major loci were 0.5 and 0.9 (table II, set 3, p = 0.9). In most cases, higher power of detecting a major gene was achieved when data were generated under the finite locus model than under the mixed model.
Violation of the assumption of independent segregation of the major gene and other genes had a negative effect on the power of the mixed model as well as on the power of the finite polygenic mixed model (table II). Even larger reductions in the power were observed when all parents were double heterozygotes for the two linked loci with largest effects (table II). In this case, not only the assumption of independent segregation of a major gene and polygenes was violated but also the assumption of Hardy-Weinberg equilibrium in the parental population; true probabilities for parents to be homozygotes were zero, not p 2 and (1 -p) 2 , as was assumed in the analysis. The reduction in the power due to violation of Hardy-Weinberg equilibrium was confirmed by a simulation where all parents were heterozygous for the major locus (a finite locus model similar to set 2 with p = 0.5, no linkage). In this case, the power of the mixed model was 28% compared to 58% when the parent population was in Hardy-Weinberg equilibrium (table II, set 2, p = 0.5).
Parameter estimation
Mean estimates of parameters, with their empirical standard deviations based on 50 replicates, and true values are given in tables III and IV. The expected variance components for polygenes given in table III (results for the finite locus model) do not include dominance variance of the second and the third largest loci (smaller loci were additive), because the statistical methods studied here did not take polygenic dominance variance into account. As a result, dominance variance may be partly confounded with estimates of additive genetic variance and partly with estimates of residual variance.
For the first distribution of gene effects (set 1) and the finite locus model, both methods gave similar estimates (table III). In most cases, estimates agreed well with true values, although some discrepancies were found for variance components. The standard deviation of the estimate of the genotypic mean depended on the estimated gene frequency and was larger for low frequencies.
Going from the set 1 distribution of gene effects to set 2, with a larger second locus effect, variation of estimates increased (table III). More bias was also observed. For example, when gene frequency was 0.9, the difference between genotypes was underestimated (by about 0.25) by both methods and gene frequency was underestimated at 0.8.
When two major genes with equal effect were simulated, parameter estimates were biased (table III, set 3). The difference between homozygotes was inflated by as much as 25% in the case of equal gene frequencies (0.5). Gene frequency estimates were also biased; with a simulated gene frequency of 0.1, the average estimate was around 0.15. Estimates were even more biased when the first major gene had a frequency 0.9. In that case, the mixed model gave estimates closer to 0.5 than 0.9 and the finite polygenic mixed model between 0.5 and 0.9. Overestimation of differences between genotypes led to underestimation of polygenic variance, because a larger proportion of total genetic variance was attributed to variance between genotypes.
With linkage between the two loci with largest effect, a significant inflation was observed in all estimates when the linked genes were of equal size (table III, set 3).
When all base population parents were double heterozygotes for the two linked loci of large effect, parameter estimates were highly biased (table III). When the two loci with the largest effect on phenotype had equal effects, the estimate of the difference between the two genotypes was 0.8 units higher than the true difference between the genotypes at one locus. Also in this case, the estimated gene frequency was higher than the expected 0.5 and the estimate of additive genetic variance was almost zero. Bias in estimates of the parameters was larger for the mixed model than for the finite polygenic mixed model.
More consistent estimates over the different genetic scenarios were achieved when data were generated under the mixed model than under the finite locus model (table IV). No important differences were found between the mixed model and the finite polygenic mixed model. The variance of estimates of all parameters increased when the proportion of genetic variance explained by the major gene decreased (going from set 1 to set 3), but average values of estimates were still close to expected values.

DISCUSSION AND CONCLUSIONS
The purpose of this paper was to study the sensitivity of complex segregation analysis to violation of some of the assumptions of the underlying model, in particular a normal distribution of polygenic effects and no linkage between a major gene and polygenes. Similarity in the power of both methods of segregation analysis (the mixed model and the finite polygenic mixed model) was observed, except when data were generated based on the finite locus model with two major genes. Similar results for both methods can be expected because the computer package (SALP) that maximized the likelihood of the finite polygenic mixed model used equal allele frequencies (0.5) and additive gene action for all genes except the major gene, which created an approximately normal genetic distribution within major genotypes.
The finite polygenic mixed model with one major locus is a closer approximation of a mixed model than an oligogenic model, which explains inheritance by a few independent loci and estimates the effect of each locus separately (Elston and Stewart, 1971). Performance of the oligogenic model or a finite polygenic mixed model with several major loci was not studied, but might have been better than the methods studied here when data are generated from a finite number of loci.
Type I error rate was checked only for the mixed-generating model and was around (or below) the expected 5%. The true type I error rate under the finite locus model is unknown. Thus, the power given in table II under the finite locus model is the probability of rejecting a pure polygenic model when the likelihood ratio test statistic is compared to the $\chi^2$ table value with two degrees of freedom.
The nature of polygenic variance (ie, the finite locus model versus the mixed-generating model) had a significant impact on power of major gene detection. In the mixed model, the polygenic component inherited by progeny has an expected value equal to the average of the polygenic values of the parents (or midparent breeding value), which is not valid if any of the genes contributing to the polygenic component are dominant. The discrepancy of progeny from the expected midparent polygenic value increases with an increase in the relative magnitude of dominant loci over all polygenic loci. In addition, with dominance, the genetic variance of offspring conditional on parental polygenotype is not equal to half of the additive genetic variance but also contains dominance variance, which is relatively large compared with additive variance when a large recessive gene with low frequency segregates in the population. These discrepancies from assumptions of the mixed model should have a negative impact on its power in cases where data were simulated under a finite locus model compared with a mixed-generating model. However, no negative effect on the power was observed. Instead, in most cases the power was higher under the finite locus model than under the mixed-generating model (table II). In the case of two loci with major effect (table II, set 3) and to a lesser extent with sets 1 and 2, the methods had a chance to detect either of the major genes, which may explain the higher power under the finite locus model. In contrast, when the same situation was generated using the mixed model, the major gene explained only a small proportion of the total genetic variance and was therefore difficult to detect.
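The midparent argument above can be checked on a single completely dominant locus. The sketch below enumerates the offspring of two heterozygous parents, using an illustrative coding in which genotypes carrying at least one dominant allele have value +a and the recessive homozygote has value -a; the names and the coding are assumptions of this example.

```python
from itertools import product


def value(n_dominant_alleles, a=1.0):
    """Genotypic value at a completely dominant locus."""
    return a if n_dominant_alleles >= 1 else -a


def offspring_distribution(sire, dam):
    """Each parent transmits either allele with probability 1/2."""
    return [(s + d, 0.25) for s, d in product(sire, dam)]


sire = dam = (1, 0)  # both parents heterozygous (1 = dominant allele)
midparent = 0.5 * (value(sum(sire)) + value(sum(dam)))
dist = offspring_distribution(sire, dam)
offspring_mean = sum(p * value(g) for g, p in dist)
offspring_var = sum(p * (value(g) - offspring_mean) ** 2 for g, p in dist)
```

Here the midparent value is +a but the expected offspring value is only 0.5a, because one quarter of the progeny are recessive homozygotes; the within-family variance of 0.75a² likewise arises entirely from segregation at a locus where both parents have the same genotypic value.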
Which of the genes was detected as a major gene under the finite locus model was not investigated, but based on intermediate estimates for gene frequency, it seems that in some families the gene from the first locus was detected as a major gene, and in other families the gene from the second locus (or other loci) was detected.
Linkage between a major gene and polygenes reduced power but did not have a large impact on parameter estimates if the linked genes were not of equal size and if the parents were a random sample from a population in linkage equilibrium. Furthermore, based on one simulation example, violation of the assumption of Hardy-Weinberg equilibrium in the parental generation reduced power substantially. Therefore, it is recommended to test a model that assumes Hardy-Weinberg equilibrium against a model with free genotypic frequencies for the parental generation.
The results given here are restricted to data from independent nucleus families. Based on results by Fernando et al (1994), the finite polygenic mixed model is a closer approximation of the mixed model under an example data set with three generations than PAP if data are generated with a mixed model. How these methods perform under the finite locus model when information from more than two generations is available or when nucleus families are not independent was not studied. Thus, the natural area for future studies is the performance of methods under multigenerational data when data are generated under the finite locus model.
In conclusion, both segregation analysis methods studied here gave similar power to detect a major gene and estimates of parameters under different genetic scenarios. The only distinguishable difference between methods was under the finite locus model when two major genes had equal effect on a trait. In that case, the mixed model (or PAP, when used as a mixed model) was more powerful than the finite polygenic mixed model (or SALP) in rejecting the polygenic model, but the finite polygenic mixed model gave estimates with less bias than the mixed model. The finite locus model did not have a negative effect on the power compared with the mixed generating model. Instead, the power of the methods was often higher under the finite locus model than when data were generated under the mixed model. Segregation of two major genes in a population caused biased estimates. Linkage had a negative effect on the power, but parameter estimates remained unbiased if the parents were a random sample from a large population in linkage equilibrium and if the major gene had a substantially larger effect on the trait than the other genes.