A sampling algorithm for segregation analysis

Methods for detecting Quantitative Trait Loci (QTL) without markers have generally used iterative peeling algorithms for determining genotype probabilities. These algorithms have considerable shortcomings in complex pedigrees. A Monte Carlo Markov chain (MCMC) method which samples the pedigree of the whole population jointly is described. Simultaneous sampling of the pedigree was achieved by sampling descent graphs using the Metropolis-Hastings algorithm. A descent graph describes the inheritance state of each allele and provides pedigrees guaranteed to be consistent with Mendelian sampling. Sampling descent graphs overcomes most, if not all, of the limitations incurred by iterative peeling algorithms. The algorithm was able to find the QTL in most of the simulated populations. However, when the QTL was not modeled or found then its effect was ascribed to the polygenic component. No QTL were detected when they were not simulated.


INTRODUCTION
Considerable research has been directed at locating quantitative trait loci (QTL) among continuously distributed traits in the genome of various species using designed experiments and marker information. There is also keen interest in searching for possible QTL in complex pedigree structures without markers in domesticated livestock. This is called segregation analysis, in which both the effect of an allele at a postulated QTL and the probabilities that individuals in the population carry 0, 1 or 2 copies of the desirable allele are of interest.
Methods for segregation analysis have been suggested by a number of authors, e.g. [11,13,15,20]. Models that include a QTL, segregating according to Mendelian rules, and a polygenic component which incorporates all other genetic effects are commonly used in these analyses. A typical mixed inheritance model for the analysis of a single trait is y = Xb + Za + ZWm + e (1) where, y, b, a, and e are vectors of observations, fixed effects, breeding values and residuals, X and Z are incidence matrices assigning observations to effects, W is an unknown (N individuals × N genotypes ) matrix describing the populations' genotypes and m is a (N genotypes × 1) vector of QTL effects related to the available genotypes. The matrix W contains one 1 in each row, corresponding to the individual's genotype, and zeroes elsewhere. Variances of the random effects a and e in this model are assumed to be Aσ 2 a , where A is the numerator relationship matrix, and Iσ 2 e . With such a model Hoeschele [11] used Maximum-A-Posteriori Estimation to evaluate sires for gene effects and polygenic breeding values. Requirements for this method were evidence of a major locus segregating in the population and also the number of major genotypes at that locus. The method was extended to the analysis of categorical data and multiple traits.
Kinghorn et al. [15] developed FINDGENE as a method for searching for a postulated QTL. From the mixed inheritance model (1), the model y = Xb + Za + ZTm + e was developed, where T contains the animals' genotypic probabilities. Each row of T contains (real) probabilities that sum to one. The heritability (h 2 ) of the trait is required as an input for FINDGENE.
With this algorithm, genotypic probabilities (T) are determined using the method of Fernando et al. [3] with the data adjusted for the current estimate (denotedˆ) of the other effects in the model (y-Xb-Zâ), a penetrance function and the pedigree. To estimate all other effects, the data adjusted for the current estimate of both the major gene effect and genotypic probabilities (y-Tm) are used. A series of iterations between estimating genotypic probabilities and the effect of the QTL, and estimating all the other effects are required to obtain approximate solutions for all effects in the model.
Van Arendonk et al. [24] introduced an iterative method for determining genotypic probabilities by extending the "peeling"method of Elston and Stewart [1] for use in large complex pedigrees with many loops. Corrected formulae for this process were published by Janss et al. [13], in a paper which introduced the term "iterative peeling", and Fernando et al. [3]. In iterative peeling the probability that an individual has each of the possible genotypes is determined as a function of the existing genotypic probabilities of the individual's neighbourhood set -its parents, progeny and mates.
Janss et al. [13] used the Gibbs Sampler to determine the effect of a postulated QTL and the probabilities that individuals in the population had each of the possible genotypes. The Gibbs Sampler [6] is a Monte Carlo Markov Chain (MCMC) method in which parameters are repeatedly sampled individually and randomly from their conditional posterior distributions, given the current state of all other parameters. These samples form a Markov Chain. Sampling continues until the distribution of each parameter converges to its true distribution. The mean or mode of the samples taken after convergence of the Markov Chain are used as an estimate of each parameter. Janss et al. [13] -using a Gibbs Sampler -first estimated the set of each individual's genotypic probabilities (t i ) from the current genotypic probabilities of its parents, progeny and mates, the data and all other parameters; then a particular genotype (w i ) was sampled for that individual. The frequency (q) of one of the two alleles in the base population was also sampled. Sorensen [20] describes an implementation of the Gibbs Sampler which improves the method of Janss et al. [13] by simultaneously sampling b and a effects by first permuting y appropriately.
In both these algorithms, genotypes are sampled individually using iterative peeling. Unfortunately sampling individual genotypic probabilities as a function of the parents, progeny and mates has a number of well known problems [3]. One of these problems relates to the difficulty of sampling the parameter space or "stickiness" [12]. Loops commonly occur in the pedigree of the population as a result of matings between relatives. When determining each individual's genotypic probabilities iterative peeling treats information coming from parents, mates and progeny independently. When there are loops in the pedigree such information is not independent; nevertheless, in iterative peeling algorithms such as these, it is treated as if it were independent. These problems can lead to high autocorrelations between successive samples and may prevent convergence to the true distribution. It could also bias estimates of the other parameters. Methods, such as simulated tempering [7], have been developed to help overcome these problems. While these are useful, they can be very computationally expensive and may require running a number of Gibbs samplers simultaneously. Blocking (iterating over a group of closely related animals such as a family of half-sibs) has also been proposed as a method to expedite mixing [17].
The Gibbs sampler is a specific case of MCMC methods. More general MCMC methods allow groups of parameters to be sampled jointly and in different ways to the Gibbs Sampler. With the Metropolis-Hastings algorithm [8,19] new parameters are sampled, but not always accepted, to be part of the Markov Chain. A candidate sample is accepted if an increase in the likelihood would result, or if the ratio of the likelihood of the candidate to the current sample is greater than the test number drawn randomly from a uniform distribution; otherwise the candidate sample is rejected and the current sample repeated in the chain. Detailed descriptions of the Metropolis-Hastings algorithms can be found in many statistical texts, e.g. [5,21].
One condition for an MCMC algorithm is that any point in the parameter space can be reached from any other point: the Markov Chain must be irreducible. When more than two alleles are involved the Markov chain generated based on an iterative peeling algorithm may be reducible, and this is also a concern when more than one locus is modeled [20].
This paper describes an MCMC method for finding the effect of a postulated QTL with two alleles among quantitative variation, and the probabilities for all individuals in the population having 0, 1 or 2 copies of the gene, without any genotypic information. No parameters were assumed to be known, and hence all were estimated. The method was validated using simulated data from a random mating population. The benefits of this method over previously published methods are discussed and some obvious applications and extensions identified.

ALGORITHM
Sorensen [20] describes in detail an implementation of the Gibbs sampler for the mixed inheritance model (1) where v i and S i are hyperparameters. 5. W: The probability distribution for a genotype configuration for the whole population is: non-founders j P(w j |w mother j w father j ).
(S 9.8.1) Alleles are assumed to be randomly sampled from the parents genotypes according to Mendelian rules. For founders, genotypes are assumed to be randomly sampled from the available genotypes given the frequency of the alleles in the base population (q and 1 − q) and the assumption of Hardy-Weinberg equilibrium. 6. q: a beta distribution with parameters e and f is assumed for the allele frequencies (q and 1 − q) in the base population,

Posterior distributions
The full posterior distribution of the parameters is given by: The conditional posterior distributions for each of the parameters from which samples are drawn are: where * −i means the matrix (vector) * without the i-th row (element) andb i satisfies where λ = σ 2 e /σ 2 a andâ i satisfies where n a is the rank of A.
where n e is the order of y and e = y − Xb − Zâ − ZŴm. 6. the probability that an individual has each of the possible genotypes is computed by evaluating the expression where individual i has n i mates with n ij offspring with mate j, for each possible value of w i and normalising the result so that the probabilities sum to one. If the individual is a founder then the term P(w i |w m i , w f i ) is replaced by the term P(w i ) which is a function of the gene frequency in the base population. A genotype (w i ) is sampled using a number drawn randomly from a uniform distribution. 7. q|W founders , y ∝ Be(q|n a 1 + f , n a 2 + e), where n a 1 and n a 2 are the numbers of each type of allele in the founders.
The Gibbs Sampler proceeds by sampling parameters from these conditional distributions until sufficient samples have been drawn for the process to have converged to the true distribution. As already noted, sampling genotypic probabilities individually can restrict mixing of W, and that of other parameters, and could prevent convergence to the true distribution. To avoid these problems, the new algorithm proposes using descent graphs and the Metropolis Hastings algorithm to sample W.
A descent graph is a structure which uses ordered pairs of zeroes and ones to describe the inheritance state of all alleles in the population. Among descendants the first (second) element of each pair describes the source of the paternal (maternal) allele. A zero (one) indicates that the grandpaternal (grandmaternal) allele was inherited. The elements in the descent graph for a population are numbered from 1 to 2N individuals , with the i-th pair corresponding to the ith individual. The binary nature of a descent graph lends itself to rapid sampling on (binary) computers, e.g. [4]. When there are only two alleles, the descent graph also can be used for including the founder population (those with unknown parents), but for them zero and one represent the different alleles. A simple pedigree and descent graph with resultant genotypes is shown in Table I. For example, the (0 1) pair in the descent graph for individual 4 indicates that it inherited the sire's paternal allele (allele 1) and the its dam's maternal allele (allele 4). A more detailed description of descent graphs can be found in Lange [16]. Thompson [22] first suggested that sampling descent graphs would be more efficient than sampling genotype probabilities given the relative sizes of the parameter spaces. Although the parameter space for a descent graph may be very big, it is finite, whereas that of the genotypic probabilities is infinite. The problem of calculating genotypic probabilities for ungenotyped individuals when some genotypic data are available on others, as is the case for genetic disorders, is more complicated and is addressed in a separate paper [10].
When all genotypes are unknown any randomly drawn descent graph will provide a set of genotypes for the population consistent with the pedigree. The matrix of genotypic probabilities for the population (T) can be calculated as the mean of these sets. Besides providing a consistent pedigree a descent graph also provides the frequency of alleles in the base population (q).
W can be sampled from the full conditional posterior p(W|b, a, q, σ 2 e , σ 2 a , y) using the Metropolis-Hastings algorithm. In this implementation of the Metropolis-Hastings algorithm a descent graph is drawn randomly to form the initial sample. This descent graph is permuted by randomly choosing some individual bits and changing their state from 0 to 1 or vice-versa. This process can cause significant change to the pedigree of the population for chosen individuals which have a heterozygous parent and many descendants. Table II illustrates the changes to the genotypes after permuting the paternal bits in the descent graph for two individuals in the example population. However, if the individual has no descendants, it will be the only one to change its pedigree and this will occur only if the parent is heterozygous.
For descendants the probability of changing state from one grandparental gamete to another is even. For base individuals the probability of changing state is a function of the current estimate of q with individual gametes more likely to adopt the more frequent gamete. There are many different ways that this sampling procedure could be implemented. These questions relate to the number of individual bits to exchange in each step and to the number of Metropolis-Hastings steps to make before returning to sample other parameters. It must be possible in any draw for sufficient bits to be exchangeable so that every other possible pedigree for the population in the distribution is obtainable with some non-zero probability. A number of Metropolis-Hastings steps are made to allow the pedigree to mix sufficiently between sampling of the other parameters. This number depends upon the size of the population being analysed.
The likelihood of each sampled descent graph is determined directly from a function of the errors L = 1 σ e √ 2π e −e e/(2σ 2 e ) .
With e = y − Xb − Zû − ZŴm determined for each descent graph, the ratio of the likelihoods of the candidate and current samples is computed and tested against a random uniform number (L candidate /L current > r) before the candidate accepts or rejects the candidate graph.
To compute the likelihood efficiently, a residual for each individual and possible genotype, and its contribution to a likelihood, is calculated before sampling begins. These values are used to compute the change in the likelihood for each newly modified descent graph. Permuting the descent graphs of the progeny of homozygous individuals causes no change to their (or their descendants') genotypes or the likelihood. Such changes are ignored when calculating the change in the likelihood. The likelihood of the whole pedigree, given the current parameters, is calculated after the final Metropolis-Hastings step in each genotype sampling step.
The frequency of the allele in the base population, to be used in step 6 of the Gibbs sampler, is computed as a by-product of the genotype sampling process. Thus the only difference to the sampling procedure described by Sorensen is to step 5, which is how the genotypes are sampled.

SIMULATION
A number of populations with the following structure were generated: A base population of 10 sires and 100 dams was generated. Sires were mated randomly to dams with no limit on sire usage. All dams had ten offspring which had an equal chance of being male or female. Five sires and 50 dams were randomly replaced in each mating cycle. Replacement individuals were all chosen randomly from the most recent cohort. Six cohorts were simulated for a total of 6 110 individuals. Three different models were used to simulate records for the six cohorts. All models contained a residual term and a fixed effect of the cohort. The first model contained a polygenic effect for each individual, the second contained an additive monogenic effect for each individual of the cohort, Model (ii) and the third model contained both polygenic and monogenic effects for each individual.
Breeding values (a) were drawn from the distribution N(0, Aσ 2 a ). Two alleles (A and a) at the locus were simulated. The effect (m) of having one copy of allele A was set at √ 6, model (ii) or √ 3, model (iii). Two frequencies of allele A in the base population were simulated: 0.5 and 0.2. Base animal alleles had that chance of being allele A. This implied that the base population was in Hardy-Weinberg equilibrium. The variance due to the QTL was then 2q(1 − q)m 2 [2]. When no monogenic effect was modeled the polygenic variance was set to 3 but when both genetic effects were modeled it was set to 1.5. This was the same as the variance due to the monogenic effect when the frequency of A in the base population was 0.5.
Descendants' alleles were randomly selected from their parents' alleles. Records were generated by adding the appropriate genetic effects -breeding values and QTL effects -for the model and residuals drawn from the normal distribution N(0,3).
The same three models used to simulate the populations were also used to analyse each of them. Twelve populations were generated using each model. After 2 000 burn-in Gibbs samples the next 2 000 Gibbs samples were used to provide estimates of all parameters in each analysis. A variety of starting values for the variances and QTL effects were chosen for the different populations. In each Gibbs sample, 12 220 Metropolis-Hastings steps were taken, with a random number -the minimum of 12 220 and 1 + 2/r, where r is a random uniform variate -of nodes in the descent graph perturbed in each step. The effect of having each of the four possible genotypes -AA, Aa, aA and aawere estimated separately. Table III summarises the results of analysing populations, simulated with the three different models and two different gene frequencies, with a model (i) containing only a polygenic genetic effect. Regardless of the source of the genetic variation, mono-or polygenic, the genetic and error variances were consistently well estimated with this model.

RESULTS
The corresponding results from analyses with a model (ii) containing only a single monogenic effect are shown in Table IV. With this model the error variance tended to be overestimated in all populations. In all cases both the effect of the QTL and variance due to it were underestimated. When the QTL was the only simulated genetic effect, the highest estimate for both the variance due to the single locus and its effect were obtained, and the error variance was the least overestimated.
The results from the analyses with the complete model (iii) are shown in Table V. With both genetic effects in the analytical model, no effect of a QTL was found when it was not simulated and, the error variance was consistently well estimated. A small variance due to the polygenic effect was found in all replicates even when it was not simulated. There was a considerable difference between the magnitude of these effects in each replicate and a corresponding underestimate in the variance due to the QTL. Figure 1 illustrates the distribution of the samples of genetic variance from each replicate of the population simulated with a base frequency of 0.5 and analysed with both genetic effects -model (iii). It is clear that any variance due to the QTL was missed in replicate D, was ambiguous in replicate B, found in the other ten replicates but often underestimated. As shown however in Figure 2, the effect of the allele was estimated correctly in 8 of the twelve replicates. When the Table III. Estimates of error and polygenic variance from the polygenic model -Model (i) -in simulated populations: mean and sampling variances of twelve replicates. Variances over the replicates are shown in parenthesis. Residual and total genetic variances of 3 were simulated for all models. The variance due to the genetic effects was shared equally between the two genetic effects in the combined simulation. frequency of the base allele was simulated as 0.2 the QTL was found in eleven of the twelve replicates. The analysis of each replicate took less than one hour using a Pentium (350 MHz) computer.

DISCUSSION
Sampling descent graphs with the Metropolis-Hastings Algorithm has many advantages over sampling with an algorithm based on iterative peeling. It is quick, always provides pedigrees that are consistent with Mendelian sampling, can easily handle more than two alleles and allows ready mixing of the genotypes in the pedigree. Since this method of sampling meets the criteria for the Metropolis-Hastings algorithm, it is irreducible. With more than two alleles additional storage is needed for the founder individuals. It is obvious that this form of sampling genotypes will not experience the problems of stickiness that can occur with sampling genotypes using iterative peeling, nor do loops in the   Since MCMC methods are computationally expensive the search for an optimal sampling process could be fruitful. It is reassuring to see that when variation is generated by an additve QTL, its effect is fully recovered with a polygenic model. This means that a QTL with a moderate effect will be contained in the estimated breeding values generated with the Best Linear Unbiased Prediction (Henderson, 1973). It is difficult for the monogenic model to model the genetic variation completely. In the simulated data sets no QTL were found that were not simulated, but in a few cases a simulated QTL was overlooked and in two cases it was unclear whether it was a real effect or not. It is likely that additional depth in the pedigree (more generations) and larger litter sizes should enhance the detection of QTL. Table V. Estimates of error, polygenic and QTL variances, and the effect (m) of the QTL in simulated populations from analytical Model (iii): mean and sampling variance of twelve replicates. Variances over the replicates are shown in parenthesis. Residual and total genetic variances of 3 were simulated for all models. The variance due to the genetic effects was shared equally between the two genetic effects in the combined simulation.

Error
Polygenic Conversely, smaller litter sizes and fewer generations of data will make it more difficult to identify QTL with this method. It is likely that smaller QTL can be detected in larger data sets. This method of sampling is also the logical one for use with finite locus models, e.g. [14,23]. The method has been used to find putative QTL for a number of traits in cattle, sheep and swine including one for disease resistance [18] and has been adapted to incorporate marker information [9].