A rapid conditional enumeration haplotyping method in pedigrees

Haplotyping in pedigrees provides valuable information for genetic studies (e.g., linkage analysis and association study). In order to identify a set of haplotype configurations with the highest likelihoods for a large pedigree with a large number of linked loci, in our previous work, we proposed a conditional enumeration haplotyping method which sets a threshold for the conditional probabilities of the possible ordered genotypes at every unordered individual-marker to delete some ordered genotypes with low conditional probabilities and then eliminate some haplotype configurations with low likelihoods. In this article we present a rapid haplotyping algorithm based on a modification of our previous method by setting an additional threshold for the ratio of the conditional probability of a haplotype configuration to the largest conditional probability of all haplotype configurations in order to eliminate those configurations with relatively low conditional probabilities. The new algorithm is much more efficient than our previous method and the widely used software SimWalk2.


INTRODUCTION
Haplotyping in a pedigree involves the consideration of the Space of All Consistent Haplotype Configurations (SACHC) for the pedigree based on all observed data (genotype data and pedigree structure). For a larger pedigree with a larger number of linked loci, the size of SACHC is too large for an exact method to be feasible. Most configurations in SACHC typically have very 26 G. Gao, I. Hoeschele small conditional probabilities, so that only a relatively small subset of configurations with high conditional probabilities (or likelihood) is relevant [4]. Identifying a subset of configurations with the highest likelihoods and estimating their conditional probabilities in SACHC is an important computational step for genetic studies such as the calculation of haplotype frequencies and the estimation of identity-by-descent matrices. Likelihood-based sampling methods are often employed to infer the most likely haplotype configuration or a set of configurations with the highest likelihoods for a large pedigree with a large number of loci (e.g., [7,10]). These methods are flexible but can have high CPU time requirements and may converge very slowly. Some rule-based algorithms (e.g., [1,6,8]) can be applied to large pedigrees, but these algorithms often assume zero recombinants or are more appropriate for pedigree data with a small expected number of recombinations [3], such as high density marker data in a short chromosomal region.
In our previous work [4], we proposed a conditional enumeration method based on computations of conditional probabilities and likelihood, and on setting a threshold λ (λ < 1) for the conditional probabilities of the possible ordered genotypes at every unordered individual-marker. It is often efficient to identify a set of configurations with the approximately highest likelihoods in SACHC. However, the computing time of this method can increase substantially, when (1) threshold λ is set very close to 1, (2) the pedigree contains a high proportion of homozygous genotypes and is less informative, or (3) inter-marker distances is large (say 5 cM) and the pedigree contains a large number of recombinations which can increase the haplotype uncertainty of the individuals. In this study, we describe a rapid haplotyping algorithm based on a modification of the conditional enumeration method. The modified enumeration method is more efficient than the original method for large pedigrees with large numbers of loci. We compare the modified method by simulation in large pedigrees with the original method and with a sampling method implemented in the software SimWalk2 [10,11], which is widely used for haplotyping in large pedigrees. SimWalk2 identifies a single haplotype configuration that is often nearly optimal.

METHODS
In this study, we assume linkage equilibrium between markers in the founders of the pedigree and we also assume that all individuals in a pedigree have been genotyped for all markers without genotype errors. We use the same notation as in our previous work [4]. The combination of a specific individual and a specific marker locus is termed an individual-marker. The genotype of some individual-markers in non-founders can be ordered by their parents' genotypes. The observed data after this partial reconstruction are denoted by D. Let U denote all the remaining heterozygous individual-markers in a pedigree, each with an unordered genotype. Assume that the size of U is t. To reconstruct a haplotype configuration for the entire pedigree, one needs to assign an ordered genotype for each individual-marker in U.
Let {M 1 , M 2 , . . ., M t } be a specific ordering of the individual-markers in U. Let m i denote an ordered genotype assigned to individual-marker M i , then a set of assignments {m 1 , m 2 , . . ., m t } is a haplotype configuration for U. The joint probability of this configuration conditional on the observed data (D) is [4] Pr(m 1 , Probability p i is equal to one of the conditional probabilities p s i and p l i , so that p i p l i . Under the assumption of linkage equilibrium between markers in the founders, probabilities p i , p s i and p l i can be calculated by an approximation method using only the informative flanking markers of the individual under consideration and its parents and offspring [4]. In our previous conditional enumeration haplotyping method (see [4] for details), we set a threshold λ for the conditional probabilities of ordered genotypes at every individual-marker, and assigned (one or two) ordered genotypes to each individual-marker in U sequentially by using an optimal (marker) search process. After the first i−1 individual-markers {M 1 , M 2 , . . ., M i−1 } have been assigned ordered genotypes, for each set of assignments {m 1 , m 2 , . . . , m i−1 } to these i − 1 individual-markers, we temporarily treat each of the remaining individual-markers (not including the first i − 1 individual-markers) in U as M i , and calculate the corresponding conditional probability p l i for each of these M i . We find the individual-marker with the highest conditional probability p l i among all the remaining individual-markers in U, and assign this 28 G. Gao, I. Hoeschele individual-marker to M i . This procedure is called an optimal (marker) search process. At the individual-marker M i , if p l i λ, we delete the ordered genotype m s i , otherwise, both ordered genotypes, m l i and m s i are retained. After all individual-markers in U have been processed by this algorithm, we can obtain a subset of haplotype configurations with approximately the highest likelihoods. When setting λ = 0.5, the conditional enumeration haplotyping method becomes a conditional probability haplotyping method [4] which is very fast and identifies a single haplotype configuration by assigning a single ordered genotype m l i to each individual-marker M i , and the optimal (marker) search process generates an optimal reconstruction order [4], Here, we propose a more efficient modified conditional enumeration haplotyping method by setting an additional threshold α for the conditional probabilities of haplotype configurations for U to eliminate some configurations with low conditional probabilities.
For the haplotype configuration {m 1 , m 2 , . . ., m t }, let q i denote the ratio of conditional probability p i to the larger conditional probability p l i at individualmarker M i , i.e., q i = p i /p l i and q i 1. We define the important quantity Q i as the product of q 1 , q 2 , . . . , q i (Q i = i k=1 q k ). For any integer i t, we have Let T denote the largest conditional probability of all haplotype configurations for U (T is unknown), and let R denote the ratio of the conditional probability of the haplotype configuration {m 1 , m 2 , . . ., m t } to T , i.e., R = Pr(m 1 , m 2 , . . ., m t | D) / T and R > 0. If R is very small (e.g., R < 0.001, then the conditional probability Pr(m 1 , m 2 , . . . , m t | D) is very small relative to the largest conditional probability T , and the configuration {m 1 , m 2 , . . ., m t } can be ignored when our purpose is to identify a set of configurations with the highest likelihoods. We describe an approximation method to estimate the upper bound of R.

Note that probability Pr
Hence we obtain R Q t r. For any i t, since Q i Q t , we have or r i 1 (this inequality was confirmed in our data simulation). Even though for some individual-marker M i inequality (3) may not hold, since both probabilities in inequality (3) are greater than 0.5, the two probabilities should be very close to each other. Thus from the definition r = t i=1 r i , we obtain r 1 approximately, and from inequality (2), for any i t, we have Given a small threshold 10 α (10 α < 1; e.g., α = −3), for haplotype configuration {m 1 , m 2 , . . ., m t }, if we can find an integer i ( t), such that Q i 10 α , then R will be very small and the configuration is ignorable and can be deleted when haplotyping in the pedigree. Since Q i is calculated from the conditional probabilities of the first i assigned individual-markers in U, M 1 , M 2 , . . . , M i , by utilizing only these conditional probabilities (with no need for calculating the conditional probabilities at the remaining individual-markers, M i+1, . . . , M t ) we can infer whether the corresponding configuration can be deleted from SACHC. This elimination of configurations produces considerable saving in the computing time required for haplotyping. Based on this principle for haplotype configuration elimination, we now modify our previous conditional enumeration haplotyping method. The new algorithm employs two user-determined threshold parameters: threshold λ for the conditional probabilities of ordered genotypes at every individual-marker (λ 0.5) [4] and threshold 10 α for the ratio of the conditional probability of a haplotype configuration to T (α < 0 and 10 α (1 − λ)/λ, see the Appendix).
Suppose that ordered genotypes have been assigned to the first i − 1 individual-markers, for each set of assignments {m 1 , m 2 , . . . , m i−1 } to these i − 1 individual-markers, we find the individual-marker M i with the highest conditional probability p l i among all the remaining individual-markers in U. And then we assign ordered genotypes to individual-marker M i as follows (i = 1, 2, . . . , t): 1. When p l i λ, assign m l i to individual-marker M i . 2. When p l i < λ, if assigning m s i to individual-marker M i produces Q i 10 α , then we only assign m l i to individual-marker M i , otherwise we retain both ordered genotypes, m l i and m s i , for individual-marker M i . After all individual-markers in U have been processed with this algorithm, we will have obtained a set of haplotype configurations SACHC* (⊆ SACHC) for the pedigree. The elements (configurations) of SACHC* can be ranked by their likelihoods, and SACHC* will always contain a subset of configurations which have approximately the highest likelihoods among all configurations in SACHC of the pedigree. This subset of configurations with approximately the highest likelihoods can be obtained by eliminating configurations with lower likelihoods in SACHC*, as desired. The likelihood of a configuration can be calculated with the method described in [11] by adopting Haldane's model of recombination.
The number of haplotype configurations retained in SACHC*, the accuracy and the computing time of the modified conditional enumeration method can all be controlled with the chosen values for thresholds λ and α, and increase with increasing absolute values of λ and α. When λ approaches 1 and α approaches −∞ (10 α approaches 0), the modified conditional enumeration haplotyping method approaches an exhaustive enumeration method (exact method). The exhaustive enumeration method is computationally expensive or infeasible for large pedigrees or large numbers of loci.
In the modified method, we calculate the conditional probabilities for individual-markers in U by an approximation method [4], and we use inequality (4) which is only approximately true. Therefore, to guarantee the accuracy of the method, one should choose high absolute values for threshold parameters λ and α subject to maintaining an acceptable computing time. We recommend that the value of λ be set larger than 0.65, and that α (α < 0) be set according to the average distance (d) between adjacent markers, with a decrease in the absolute value of α for an increase in d. For example, if d 2 cM, we can set α −1.0; if d 5 cM we can set α as large as −0.3 (10 −0.3 ≈ 0.5).

SIMULATION STUDIES AND RESULTS
To evaluate the performance of the modified method (abbreviated below as the "modified method"), we compared this method with our original conditional enumeration haplotyping method ("original method") and the widely used software SimWalk2 by analyzing three simulated pedigrees with different inter-marker distances (results from additional simulation studies evaluating our original method and comparing it to SimWalk2 can be found in [4]). The three simulated pedigrees had 163, 450 and 198 members with 18, 30 and 18 founders over 5, 8 and 6 generations, and a single linkage group consisting of 10, 10 and 20 bi-allelic markers with allele frequency of 0.5 and inter-marker distance of 10 cM, 5cM and 1.5 cM, respectively. Each father had two spouses, and each full sib family had three children. Table I presents the haplotyping results from the analyses of the three pedigrees with the modified and the original conditional enumeration haplotyping methods. For the same λ value, when setting a sufficiently small value for α, the modified method identified a set of top haplotype configurations with the sum of likelihood ratios nearly identical to that of the set of corresponding top configurations identified by the original method (top configurations are those configurations with the estimated highest likelihoods, and a likelihood ratio is the ratio of the likelihood of a top configuration to that of the true configuration). However, the modified method uses much less computing time. The computing time of the original method can become unacceptably long. For example, in the analysis of the 198-member pedigree, when setting λ > 0.973, 32 G. Gao, I. Hoeschele Table I. Comparison of the modified ("Modified") and the original conditional enumeration haplotyping method ("Original") based on analyses of three simulated pedigrees. the computing time (not listed in Tab. I) is much more than 53 h; in this case, the modified method (with λ = 0.99 or 0.995) identified a set of haplotype configurations quickly (in less than 11 min) whose sum of likelihood ratios was much higher than that from the original method (with λ = 0.973).
We note that in the analysis of the 198-member pedigree using the original method, when setting λ 0.970, the computing time is very short ( 0:07:41, see also Tab. II), but when setting λ 0.973, the computing time increases substantially. The reason is that at many individual-markers in U, the larger conditional probabilities of the ordered genotypes are less than 0.973 but greater than 0.970. When setting λ = 0.973, two ordered genotypes are retained for each of these individual-markers, and the computing time increases exponentially with the number of these individual-markers. However when setting λ 0.970, we only keep one ordered genotype for each of these individual-markers. We also note that the original and modified methods were run with many different values for thresholds λ and α. In Tables I and II below we only present  the results for some representative values of the thresholds.  Table II presents results on the comparison of the modified method with the original method and SimWalk2 (2.83), based on analyses of the 163-and 198-member pedigrees. Table II shows that the modified method can identify a set of haplotype configurations with much higher log-likelihood and in much shorter time when compared to SimWalk2 which identifies a single configuration. For the 198-member pedigree with denser markers, the modified method identified 33 configurations with the same log-likelihood of −281.575 in about 10 min, while SimWalk2 identified a single configuration with the log-likelihood of −369.891 in about 160 h.

DISCUSSION
The modified conditional enumeration haplotyping method is an efficient algorithm for large pedigrees and large numbers of loci, in particular for the case of tightly linked markers, where the existing sampling methods are always computationally intensive.
For a large pedigree with high proportion of uninformative markers, we can control the computing time more effectively by setting a (user-determined) control parameter (n c ) for the maximum number of retained haplotype configurations (the maximum size of SACHC*, e.g., n c = 10 000). After the first i − 1 unordered individual-markers M 1 , M 2 , . . . , M i−1 in U have been assigned ordered genotypes, if the total number of retained haplotype configurations exceeds n c , the algorithm will adjust the values for thresholds λ and α so that only a single ordered genotype (the one with larger conditional probability p l i at M i ) is retained for each of the remaining unordered individual-markers in U. This step can reduce the computing time dramatically. We note that the enumeration haplotyping methods use an optimal (marker) search process and assign ordered genotypes at each step to the individual-marker which has the most information in the corresponding individual and its parents and offspring among all remaining individual-markers in U.
In this contribution, we have assumed linkage equilibrium between markers and that all individuals in a pedigree have been genotyped for all markers. We have work in progress extending our methods to pedigrees with missing marker data while accounting for founder allele frequencies and marker-marker linkage disequilibrium among high-density single nucleotide polymorphism (SNP) markers in the founders of a pedigree. The extension of the haplotyping method to deal with missing data also involves developing an efficient genotype elimination algorithm for large pedigrees with large numbers of loops for which the existing methods may not work well or be computationally infeasible (e.g., [2,5,9]; O'Connell 2006, personal communications). We will report on this extension in a later communication.
The modified haplotyping method described above was implemented in a C/C++ program, which is available upon request from the first author for academic research.