Mapping quantitative trait loci in outcross populations via residual maximum likelihood. I. Methodology

Une methode de maximum de vraisemblance residuelle, appuyee sur un algorithme sans derivation, a ete etablie pour estimer la position et la part de variance d'un locus de caractere quantitatif (QTL) et simultanement les variances polygeniques additives et residuelles. La methode est basee sur un modele mixte lineaire incluant des effets polygeniques et de QTL, supposes a priori suivre une distribution normale. La methode a ete etablie en vue de plans experimentaux de cartographie de QTL chez les animaux, ou des donnees phenotypiques et de marquage sont disponibles sur la generation des descendants, et des marquages connus chez les parents et des ancetres supplementaires. La matrice des coefficients des equations du modele mixte, requise dans l'algorithme sans derivation, a ete deduite d'un modele animal reduit qui relie les performances individuelles des descendants aux effets polygeniques ou de QTL parentaux. La matrice de variance-covariance des effets de QTL et son inverse ont ete calculees conditionnellement a l'information incomplete relative a des ensembles de marqueurs lies. L'inverse est calculee efficacement pour des dispositifs ou chaque descendant a une mere differente et les peres ont de nombreux descendants genotypes permettant de determiner la phase des marqueurs lies avec un haut degre de certitude. Il n'est pas necessaire de connaitre la phase des marqueurs lies chez les ancetres des peres. Le test de la position d'un marqueur au sein d'un groupe de liaison est base sur le rapport de la vraisemblance correspondant a la variance QTL estimee, relative a une variance QTL nulle.

Summary -A residual maximum likelihood method, implemented with a derivative-free algorithm, was derived for estimating position and variance contribution of a single QTL together with additive polygenic and residual variance components. The method is based on a mixed linear model including random polygenic effects and random QTL effects, assumed to be normally distributed a priori. The method was developed for QTL mapping designs in livestock, where phenotypic and marker data are available on a final generation of offspring, and marker data are also available on the parents of the final offspring and on additional ancestors. The coefficient matrix of mixed model equations, required in the derivative-free algorithm, was derived from a reduced animal model linking single records of final offspring to parental polygenic and QTL effects. The variance-covariance matrix of QTL effects and its inverse were computed conditional on incomplete information from multiple linked markers. The inverse is computed efficiently for designs where each final offspring has a different dam and sires of the final generation have many genotyped progeny such that their marker linkage phase can be determined with a high degree of certainty. Linkage phases of ancestors of sires do not need to be known. Testing for a QTL at any position in the marker linkage group is based on the ratio of the likelihood estimating QTL variance to that with QTL variance set to zero. quantitative trait loci / residual maximum likelihood / mapping Résumé -La cartographie de locus de caractère quantitatif dans des populations en ségrégation à l'aide du maximum de vraisemblance résiduelle. I. Méthodologie.
Une méthode de maximum de vraisemblance résiduelle, appuyée sur un algorithme sans dérivation, a été établie pour estimer la position et la part de variance d'un locus de caractère quantitatif (QTL) et simultanément les variances polygéniques additives et résiduelles. La méthode est basée sur un modèle mixte linéaire incluant des effets polygéniques et de QTL, supposés a priori suivre une distribution normale. La méthode a été établie en vue de plans expérimentaux de cartographie de QTL chez les animaux, où des données phénotypiques et de marquage sont disponibles sur la génération des descendants, et des marquages connus chez les parents et des ancêtres supplémentaires. La matrice des coefficients des équations du modèle mixte, requise dans l'algorithme sans dérivation, a été déduite d'un modèle animal réduit qui relie les performances individuelles des descendants aux effets polygéniqués ou de QTL parentaux. La matrice de variance-covariance des effets de QTL et son inverse ont été calculées conditionnellement à l'information incomplète relative à des ensembles de marqueurs liés. L'inverse est calculée e,!cacement pour des dispositifs où chaque descendant a une mère différente et les pères ont de nombreux descendants génotypés permettant de déterminer la phase des marqueurs liés avec un haut degré de certitude. Il n'est pas nécessaire de connaître la phase des marqueurs liés chez les ancêtres des pères. Le test de la position d'un marqueur au sein d'un groupe de liaison est basé sur le rapport de la vraisemblance correspondant à la variance QTL estimée, relative à une variance QTL nulle locus de caractère quantitatif / maximum de vraisemblance résiduelle / cartographie INTRODUCTION Traditional methods for the statistical mapping of quantitative trait loci (QTL) include ANOVA and (multiple) linear regression (eg, Cowan et al, 1990;Weller et al, 1990;Haley et al, 1994;Zeng, 1994), maximum likelihood (ML) interval mapping (eg, Lander and Botstein, 1989;Knott and Haley, 1992), or a combination of ML and multiple regression interval mapping (eg, Zeng, 1994). These methods were developed mainly for line crossing and, hence, cannot fully account for the more complex data structures of outcross populations, eg, data on several families with relationships across families, unknown linkage phases in parents, unknown number of QTL alleles in the population, and varying amounts of data information on different QTLs or in different families. The gene effects near markers selected based on a linkage test tend to be overestimated increasingly with decreasing family size and true effect (Georges et al, 1995). Random treatment of QTL effects would cause shrinkage of estimates toward a prior mean in small families and for (aTLs accounting only for a small portion of genetic variance. Fernando and Grossman (1989) derived best linear unbiased prediction (BLUP) of QTL allelic effects, which are assumed to be normally distributed. For simple designs (eg, (grand)daughter designs with unrelated sires), BLUP reduces to random linear regression (Goddard, 1992). Fernando and Grossman (1989) showed how to obtain BLUP estimates of additive allelic effects (v) at a QTL linked to a single marker and of residual polygenic effects (u), assuming that all individuals in a population are genotyped and that markers are fully informative. Subsequent developments allowed for multiple linked markers with a QTL in each marker bracket (Goddard, 1992), for multiple unlinked markers each associated with a QTL (van Arendonk et al, 1994a), for incomplete marker information (Hoeschele, 1993;van Arendonk et al, 1994a;Wang et al, 1995), and for reductions in the number of equations by using a reduced animal model (RAM) (Cantet and Smith, 1991;Goddard, 1992), by including QTL gene effects only for genotyped animals and their tie ancestors (Hoeschele, 1993), or by estimating the sum of the effects at several unlinked, marked QTL (van Arendonk et al, 1994a). There are two linearly equivalent (Henderson, 1985) animal models incorporating marker information, the first linking an individual's phenotype to both of its marked QTL allelic effects and to its polygenic effect (Fernando and Grossman, 1989), and the other linking phenotypes to the total additive effects and linking total additive effects to QTL effects via the genetic covariance matrix (Hoeschele, 1993).
All methods described above are concerned with the prediction of genetic effects and assume that the dispersion parameters are known. These parameters include the additive polygenic variance, the variance contributed by a QTL, the QTL position, and the residual variance. A first attempt to estimate these parameters by residual maximum likelihood (REML) was undertaken by van Arendonk et al (1994b) using a granddaughter design with unrelated sires and a single marker.
These authors found that for this situation, QTL position and contribution to additive genetic variance were not separately estimable. Grignola et al (1994) showed that for the same type of design these parameters were estimable when performing interval mapping with flanking markers, known linkage phases in the sires and no relationships among sires.
Xu and Atchley (1995) performed interval mapping using maximum likelihood based on a mixed model with random QTL effects, but these authors fitted one additive genetic effect at the QTL rather than two allelic effects for each individual with variance-covariance matrix equal to a matrix of proportions of alleles identicalby-descent (IBD) shared by any two individuals at the QTL and assumed that this matrix was known. These authors applied their analysis to unrelated full-sib pairs. In this paper, we (i) apply the theory of Wang et al (1995) for a single marker to compute the variance-covariance matrix among QTL effects conditional on incomplete information from multiple linked markers, (ii) use this covariance matrix in the estimation of position and variance contribution of a single QTL along with polygenic and residual variances and in testing for QTL presence in a marker linkage group via REML with a derivative-free algorithm, and (iii) include all known relationships between the parents (sires) of the final offspring in the analysis.

Mixed linear model
The animal model including polygenic and QTL effects of Fernando and Grossman (1989) is: where y is an N x 1 vector of phenotypes, [3 is a vector of fixed effects, X is a design/covariate matrix relating 13 to y, u is an n x 1 vector of residual additive (polygenic) effects, Z is an incidence matrix relating records in y to animals, v is a 2n x 1 vector of QTL allelic effects, T is an incidence matrix relating each animal to its two QTL alleles, e is a vector of residuals, A is the additive genetic relationship matrix, Q u is the polygenic variance, Qufl is the variance-covariance matrix of the QTL allelic effects conditional on marker information, ufl is half the additive genetic variance explained by the QTL (also referred to as the QTL allelic variance), R is a known diagonal matrix, and Q e is the residual variance.
Matrix Q depends on one unknown parameter, the map position of the QTL relative to the origin of the marker linkage group (d Q ). For notational convenience, this dependency is suppressed in model [1] and below. Parameters related to the marker map (marker distances and allele frequencies) are assumed to be known.
The model is parameterized in terms of the unknown parameters heritability (h 2 = 0&dquo;;/0&dquo;;) with U2 being additive genetic and Q2 phenotypic variance, fraction of the additive genetic variance explained by the Q!L allelic variance or half of the additive variance due to the QTL (v 2 = cr!/<7!), residual variance U2 and QTL location d Q .
Let there be phenotypes only on nonparents or final offspring which have single records. Furthermore, recurrence equations linking u and v effects of nonparents (n) to those of parents (p) are where the matrix W consists of rows with zero, one or two elements equal to 0.5 for none, one or two parents known, respectively, and each row of the matrix F contains up to four nonzero coefficients explained below. With single records, Z = I, where I is an identity matrix. Then, model [1] can be rewritten as The reduced animal model is obtained from [3] by combining the last three terms into the residual. Mixed model equations (MME) for the RAM (Cantet and Smith, 1991) can be formed based on the RAM directly or by first forming MME based on [3] and subsequently absorbing the equations in e and m. The resulting MME for the RAM are It can be easily verified that matrix D v is diagonal even if A v is not, ie, TA V T / is always diagonal. Inbreeding and unknown parental origin of marker alleles can give rise to some nonzero offdiagonal elements in A, (Hoeschele, 1993;Wang et al, 1995). With D v diagonal, matrix D uv is also diagonal, hence, the MME are easily computed.

REML analysis
The REML analysis was performed by maximizing the likelihood of error contrasts (LEC) (Patterson and Thompson, 1971) with respect to the parameters h 2 ,V 2 ,0&dquo;;, and d Q . The LEC was obtained under the assumption of a joint multivariate normal distribution of y, u, and v. For the full animal model (AM), the logarithm of the LEC (LLEC) can be expressed as (Meyer, 1989): where 0 is the vector of parameters, G is the variance-covariance matrix of the random effects (here, u and v), NF = rank(X), NR = dimension(G), and C is the coefficient matrix of the MME for the AM (model [1] (Graser et al, 1987). The terms y'Py and log ICI are computed as in Meyer (1989) via Gaussian elimination applied to the augmented MME, and log I G is obtained as -log !G-1! with G-1 computed directly. In the following, it is shown how to compute the AM likelihood in [5] when working with MME for the RAM.
When equation [5] is applied directly to the RAM, the result is where all parts different from the AM LLEC are subscripted RAM. Let G be the variance-covariance matrix of the genetic effects (u, v) of the parents and of the Mendelian sampling effects for u and v of the nonparents or finals. Let G be partitioned accordingly. Then where A is blockdiagonal with blocks of size < 2. Similarly, partition the coefficient matrix of the MME for model [3], C, according to all other (1: (3, up, vp) and Mendelian sampling effects (2: m, e). Then, where C 22 is diagonal or blockdiagonal with blocks of size x 2. Hence, the RAM LLEC can easily be modified to yield the AM LLEC, or LLEC RAM oc LLEC RAM -0.5 log JAI -0.5 log !22! + 0.5 (NR -NR RAM ) log(,9,!,!) [9] where NR is total number of random genetic effects while NR RAM is number of genetic effects pertaining to parents.
The analysis was conducted in the form of interval mapping as in Xu and Atchley (1995), where d Q was fixed at a number of successive positions (every centimorgan) along the chromosome, and at each position the likelihood was maximized with respect to h 2 , V 2 and U2 Calculation of Qp' and 0&dquo; These matrices were computed by applying the theory presented in Wang et al (1995) to marker information consisting of multiple linked rather than single markers. At a given QTL position, different markers were allowed to flank the QTL in different families due to some parents being homozygous at the closest flanking markers.

Notation
Let Q! denote QTL allele k (k = 1, 2) in individual i and v! the additive effect of this allele. Let -denote IBD, let « stand for 'inherited from', let Gobs represent the marker information observed on the pedigree, and let MT denote a possible marker haplotype ( !1) of individual i at the closest pair of marker loci bracketing the QTL for which the parent of i is heterozygous. Furthermore, let M be a set of complete multi-locus marker genotypes for the entire pedigree. Finally, p denotes parent (p = s, d), s sire, d dam, and Lp denotes the linkage phase of the alleles at the narrowest marker bracket for which parent p is heterozygous.

Variance-covariance matrix of v effects Qp
In the presence of missing marker data and/or unknown linkage phases for parents, where M k is a particular realization of M from the probability distribution of M given Gobs, and S is sample size. Note that !11! yields the exact variance-covariance matrix if sample size S is large. Samples from this distribution can be obtained by Gibbs sampling, which was implemented using blocking of the genotypes of parents and final offspring (finals) as in Janss et al (1995). For a half-sib design (daughter or granddaughter design) with large family sizes (eg, 50-100) and no relationships among final offspring (daughters or sons) through dams, the linkage phases of the parents of final offspring are 'known', as always or most frequently (near 100%) the correct phase is sampled. Then, the inverse of the variance-covariance matrix of the QTL effects can be computed exactly (up to Monte-Carlo error due to use of !11!) as follows. Equation [11] is employed to compute the submatrix pertaining to QTL effects of parents of finals and ancestors using marker information on the entire pedigree including final offspring. This submatrix is then inverted, and contributions of final offspring, computed with known parental linkage phases, are added into the inverse. Note that in the RAM in [4], offspring contributions appear in the leastsquares part of the MME rather than in the inverse variance-covariance matrix of the QTL effects.

Recurrence equations for v effects
Recurrence equations for the v effects of the finals were required to compute the elements of F and A, in !4!. The general recurrence equation for a QTL effect is where [X O J The most likely linkage phase is assumed to be the true phase for the parents of final offspring. This assumption reduces the joint probability of parental linkage phase and offspring haplotype in [13] to the probability of the marker haplotype of an offspring. This probability is computed using the parental phase and the marker genotypes of an offspring at all linked markers. Alternatively, [13] could be used when parental linkage phases are not known by computing the joint probability of parent linkage phase and offspring haplotype for each interval as a frequency count across all Gibbs cycles after burn-in, using information from the entire marker linkage group and from all relatives in the pedigree. However, this approach would only be an approximation to calculating the variance-covariance matrix and its inverse based on [10] for the entire pedigree including the final offspring.
In [13], the Pr(Q!' = Q!MJ&dquo; 4= p , Lp) are t ll = (1 -rL)(1 -rR)/(l -rM) and t 12 = r L r R/ (1-r M ) if Mm is a nonrecombinant haplotype, or t 21 = (1r L )r R/ r M and t 22 = r L (l -r R )/r M if M m is recombinant, where r M is recombination rate for the marker bracket, r L (r R ) is recombination rate between the QTL and the left (right) marker, and Haldane's no interference map function is employed. Here, we allow for double recombination while Goddard (1992) assumed it to be zero. QTL alleles in final offspring are identified by parental origin, ie, the two QTL alleles in an offspring are distinguished as the allele inherited from the sire (s) and the allele coming from the dam (d). This definition can be employed even if the parental origins of the alleles at the flanking markers are unknown, but it can be used only in the final generation. For illustration, consider a single parent p (here, p = s = sire) with genotype 12/12, linkage phase 1 -1, and the worst case of an offspring with genotype 12/12 (inheritance unknown at both flanking markers).
Then, if the QTL alleles in i are identified by the alleles at the left marker (1,2) whereas if the QTL alleles in i are identified by parental origin, Note that summing the vi and v2 equations yields the same result for both QTL identifications. Note also that the advantage of the identification by parental origin is that only v? is linked to the v effects of the dam (d), instead of linking both v! and v? to the dam effects requiring to include dam effects in the MME.

Hypothesis test
The likelihood under the null hypothesis is evaluated at v 2 = 0. The distribution of the likelihood ratio statistic is not known exactly, regardless of the method used to locate QTL (Churchill and Doerge, 1994). For the null hypothesis postulating the absence of a QTL in a particular interval rather than in the entire genome, Xu and Atchley (1995) found the distribution to be in between two chi-square distributions with degrees of freedom of one and two, respectively. Several factors may influence the distribution of a test statistic for QTL presence, eg, the length of the genome, the marker density, the extent to which marker data are missing, segregation distortion, and the distribution of the phenotypes. Self and Liang (1987) derived analytical results for the asymptotic distribution of the likelihood ratio statistic for cases where the true parameter value may be on the boundary of the parameter space. However, with finite sample sizes and several factors influencing the distribution of the statistic, it is questionable whether their results can be utilized in QTL mapping.
When analyzing real data, the threshold value for significance can be determined empirically using data permutation (Churchill and Doerge, 1994). To obtain the threshold value for a genome-wide search, in the order of 10 000 to 100 000 permutations are necessary. As these computations are unfeasible with the method presented here (see the companion paper by Grignola et al, 1996), one may resort to estimating thresholds for a number of less stringent significance levels and obtain the desired threshold by extrapolation (Uimari et al, 1996b).

CONCLUSIONS
The REML analysis described in this paper may be a useful alternative to other methods for the statistical mapping of QTL. The REML method is generally known to be quite robust to deviations from normality. When applied to QTL mapping, the REML analysis requires fewer parametric assumptions than ML (eg, Weller, 1986) and Bayesian analyses Thaller and Hoeschele, 1996a,b;Uimari et al, 1996a) postulating a biallelic QTL with unknown gene frequency and substitution effect.
While Xu and Atchley (1995) estimate QTL, polygenic and residual variances by ML, we perform REML estimation. While REML should be preferred over ML in the presence of many fixed effects relative to the number of observations (Patterson and Thompson, 1971), a model for the analysis of QTL mapping experiments may only need to include an overall mean. In this case, the difference between the ML and REML analyses is negligible. As the true nature of (aTLs is unknown, it is important to evaluate the performance of this REML analysis and of other methods with data simulated under different genetic models (eg, biallelic and multiallelic QTL models). In a companion paper , we apply the REML analysis to granddaughter designs simulated with different models for the additive variance at the QTL. Hoeschele et al (1996) apply Bayesian analyses based on biallelic and multiallelic QTL models to data simulated under both models.
The REML analysis incorporates an expected variance-covariance matrix of the QTL allelic effects, which is equal to a weighted average of variance-covariance matrices conditional on all possible sets of multi-locus marker genotypes given the observed marker data. Schork (1993) alternatively formulated a likelihood for a mixture distribution which is a weighted average of REML likelihoods conditional on all possible sets of multi-locus marker genotypes given the observed marker data. He pointed out, however, that simulation results indicated that his modification may lead to a loss of power. In both approaches, the one considered in this paper and in equivalent form by Xu and Atchley (1995), and the approach of Schork (1993), probabilities of multi-locus marker genotypes are computed from the observed marker information. However, if markers are linked to QTLs, phenotypes also contain information about marker genotypes, and this information is ignored here (Van Arendonk, personal communication). In this regard, the REML analysis can be viewed as an approximation to the Bayesian analysis based on a multiallelic QTL model with QTL variance and allelic effects having a prior normal distribution . The Bayesian analysis takes into account the joint distribution of the QTL and marker genotypes conditional on the phenotypic information.
We are currently extending our REML analysis to account for multiple linked QTLs. One way of approaching this problem was presented by Xu and Atchley (1995) and consisted of fitting variances associated with next-to-flanking markers.
Disadvantages of this approach are that it is approximate as effects associated with marker alleles identified within founders erode over generations, and that it requires many additional parameters when the marker polymorphism is limited, causing the flanking and next-to-flanking markers to differ among families.
Finally, we plan to extend the REML analysis to other designs (eg, full-sib designs), where the current computation of the inverse of the variance-covariance matrix becomes approximate due to uncertain linkage phases in parents of final offspring, and other ways of computing this inverse exactly (eg, Van Arendonk et al, 1994a) will be implemented.