Efficient computation of the inverse of gametic relationship matrix for a marked QTL

Best linear unbiased prediction of genetic merits for a marked quantitative trait locus (QTL) using mixed model methodology includes the inverse of conditional gametic relationship matrix (G-1) for a marked QTL. When accounting for inbreeding, the conditional gametic relationships between two parents of individuals for a marked QTL are necessary to build G-1 directly. Up to now, the tabular method and its adaptations have been used to compute these relationships. In the present paper, an indirect method was implemented at the gametic level to compute these few relationships. Simulation results showed that the indirect method can perform faster with significantly less storage requirements than adaptation of the tabular method. The efficiency of the indirect method was mainly due to the use of the sparseness of G-1. The indirect method can also be applied to construct an approximate G-1 for populations with incomplete marker data, providing approximate probabilities of descent for QTL alleles for individuals with incomplete marker data.


INTRODUCTION
For best linear unbiased prediction of genetic merits, Fernando and Grossman [3] proposed a mixed linear model to account for covariance of Recently, a fast indirect method was developed for computing vectors involving the numerator relationship matrix in the polygenic model [2]. It was applied for computing relationship statistics and inbreeding coefficients [2,10]. The objective of this paper was to examine the potential, in terms of computational efficiency, of this indirect method when relationships are to be calculated at the gametic level in large populations in order to obtain G −1 . To focus on this question, complete marker data with either known or unknown parental origins of haplotypes were assumed; however, possible application of the indirect method for populations with incomplete marker data will be discussed.

Theoretical considerations
For simplicity, we first describe how to compute the inverse of G directly and what requirements have to be fulfilled. Let G denote the conditional gametic relationship matrix, G = {G i j }, where G i j is a 2×2 matrix that includes four gametic relationships between individuals i and j. Thus, G ii contains 1s on the diagonal and the inbreeding coefficient, conditional on markers, of individual i on the off-diagonals. Note that individuals are ordered by birth date and numbered consecutively from 1 to n, where n is the number of individuals. Assuming complete marker information, two identified parents for non-founders and nonzero recombination rates between marker(s) and QTL, G can be decomposed, similar to decomposing the numerator relationship matrix [5] as: where L = {L i j } is a lower block triangular matrix which can be recursively formed from pedigree and marker data, and D = diag{D ii } is a block diagonal matrix proportional to the conditional Mendelian sampling covariance matrix, or M = Dσ 2 v , where σ 2 v is the genetic variance contributed by the marked QTL [1,15]. Matrix D ii can be obtained as: where B i is a 2×4 matrix containing the conditional probabilities that QTL alleles in individual i descended from parental alleles (PDQ). Details of computation of PDQ for a QTL linked to a single marker and to flanking markers have been given by Wang et al. [15] and Liu et al. [6], respectively. Off-diagonal terms of G ii are expressed as tr(T G s i d i ). Matrix T is a 2×2 matrix with term t mn representing the probability that the pair of QTL 1 and 2 in individual i corresponds to QTL m of its sire and QTL n of its dam. For calculating T, see Wang et al. [15]. Matrix B i can be partitioned as contain PDQ from paternal and maternal origins, respectively. Furthermore, G −1 using (1), is expressed as: where L −1 is a lower block triangular matrix which contains −B s i and −B d i at positions linking individual i to its sire and dam and 1s on the diagonal. Therefore, L −1 can be directly and quickly obtained without computing L.
Calculating matrix D requires knowing the conditional gametic relationships between the parents of all individuals. Note that G −1 can also be calculated by using the partitioned matrix theory [15].
From (2), it is clear that for non-inbred populations, D and consequently G −1 can be calculated simply since there are no gametic relationships between any individual's parents. However, for inbred populations, these relationships must be determined to compute D and this could be computationally costly if the conventional tabular method is used.
Abdel-Azim and Freeman [1] implemented Tier's algorithm [12], or adaptation of the tabular method, at the gametic level to compute D quickly. The idea behind the Tier algorithm is that only a small part of G is necessary to compute the conditional gametic relationships between two parents of individuals. The required elements of G are first flagged and then calculated [1,12]. Although the required subset is a small part of G, this subset might require a large amount of memory to be stored in large populations.
Colleau [2] presented an approach to multiply the numerator relationship matrix (A) by a vector, say x, regarding a planned mating design (i.e., y = Ax). This method is called the "indirect" method, because it computes a group of elements of A simultaneously. With the indirect method, solutions are obtained from considering A −1 instead of A. It is assumed that inbreeding coefficients of the ancestors are known. Colleau [2] also showed that inbreeding coefficients themselves can be obtained by the indirect method efficiently. Inbreeding coefficients for paternal sibs, either half or full, having the same pseudo-generation number were obtained simultaneously from relationships between the sire and relevant mates. The pseudo-generation number or equivalently longest ancestral path number was set to 0 for founders and subsequently incremented by 1 after considering the maximal value found for parents [2,9]. Recently, the algorithm for computing inbreeding coefficients by the indirect method was modified in order to facilitate the implementation of this method and to speed up the computation of the coefficients, especially with overlapping generations [10]. Now consider implementation of the indirect method at the gametic level to compute D for a marked QTL. It is assumed here that PDQ have already been calculated, and thus the only requirement to compute D is the conditional gametic relationships between the parents of every individual. The conditional gametic relationships between an individual and other individuals (Y) can be expressed as: where X is a n×2 matrix consisting of a 2×2 identity matrix at the position of the individual of concern and null otherwise. Substituting (1) into (3) leads to: Denoting L X by Z, we have: and As mentioned earlier, L −1 is a lower block triangular matrix with special structure that contains only −B s i and −B d i at positions where individual i is linked to its sire and dam and 1s on the diagonal, and therefore (4) and subsequently (5) can be solved for Z and Y, respectively, using Gaussian elimination. The elimination for (4) can be simply done by tracing up the pedigree that is to update X upwards as: Note that X is now overwritten by DZ. Thus, the elimination for (5) can be simply done by tracing down the pedigree -that is to update X downwards as: This finally provides solutions for Y. It is assumed that D ii is known for the individual of concern and its ancestors. Therefore, conditional gametic relationships between a sire (or a dam) and its mates can be derived by this method in two simple steps, or tracing up and tracing down the pedigree. To compute all conditional gametic relationships of concern, the indirect method should be run as many times as the number of sires or dams. In practice, the number of sires is much less than that of dams.

Algorithm and processing strategies
The modified algorithm of the indirect method to compute inbreeding coefficients [10] can be applied by some changes to compute D, as follows: (1) compute PDQ for all individuals; (2) sort identification of individuals by pseudo-generation numbers of their sires, and then by identification of their sires in ascending order; (3) for j = 1 to max (pseudo-generation numbers): (3-1) extract the pedigree of sires with pseudo-generation number j and their mates and recode them sequentially starting from 1. Thus, for the extracted pedigree, we have Y * = G * X * ; (3-2) calculate D ii for the sires and their ancestors if not already calculated, and then form D * ; (3-3) process the indirect method sire after sire: (3-3-1) set X * to a 2×2 identity matrix at the position of the sire considered and null otherwise; (3-3-2) solve (L * −1 ) Z * = X * for Z * by tracing up the extracted pedigree; (4) finally, calculate D ii for individuals that have not had their D ii calculated yet.
Matrices Z * and D * Z * are obtained simultaneously. Since elements of vector Z * excluding those for the sire considered and his ancestors are certainly null, D * Z * can be obtained only knowing D * ii for the sire and his ancestors [10]. Note that, as shown in the previous section, updating X * upward and subsequently downward, using information from the extracted pedigree, give D * Z * and Y * , respectively.
Theoretically, the indirect method should be solved by tracing a pedigree containing only the sire of concern, his mates and their ancestors. However, extracting and recoding the pedigree for every sire (plus his mates) separately is time-consuming. Therefore, with our previous algorithm for computing inbreeding coefficients [10], a reduced pedigree containing all sires and dams is extracted once. With this pedigree, processing the indirect method for each sire might include some excessive computations that do not contribute to the final results. In the case of the polygenic model, it seems that these excessive computations are less costly than extracting and recoding many pedigrees. However, inheritance of QTL alleles from parents to a child involves eight possible allelic pathways. Therefore, for the same problem, solving (4) and (5) requires much more computations than solving corresponding equations for the polygenic model and consequently more time for the computations possibly in excess. Hence, in order to lower excessive computations with the gametic model, a few small pedigrees are built by extracting the pedigree for each group of sires with the same pseudo-generation number and their mates. It should be pointed out that since the pseudo-generation number represents a chronological order [9], the indirect method can be processed group after group sequentially. With such a sequential process, D ii for the sires of the current group and their ancestors will be available when needed. Excessive computations can be further reduced by using a Boolean vector to flag non-null block matrices of X as they are calculated.

Simulation
A simulation study was conducted to compare the computational efficiency of the indirect method [2] and adaptation of the tabular method [1,12] (denoted as IM and AT, respectively). A situation where a QTL located in the middle of a 10 cM interval flanked by two markers was considered. The QTL contributed 20% of the additive genetic variation of the trait with heritability of 0.3. To create a QTL with the given additive genetic variation, the average effects of the two QTL alleles were set to 6.7 and -6.7, and the remaining additive and environmental variances to 360 and 1050, respectively. Considering three factors with two levels each, eight populations with complete marker information were generated for 10, 20 and 30 years of reproduction. The levels of the three factors were discrete or overlapping generations, 50 or 300 breeding sires and 2 or 8 litter size. All eight breeding populations consisted of 1500 dams. In the base generation, two alleles at equal frequencies for the QTL and the maximum number of alleles for the two markers (i.e., maximum polymorphic markers) were simulated, and all founders were assumed to be unrelated, non-inbred and unselected. For the setting of overlapping generations, the oldest 50% and 30% of sires and dams, respectively, were replaced each year with selected progeny from the previous year. Parents were selected based on individual phenotypic values and mated at random.
The efficient sparse storage scheme described by Tier [12] was applied to store the required upper triangular subset of G. Although this scheme requires a little more memory than that described by Abdel-Azim and Freeman [1], it is much faster, because it avoids repeated searching through linked list and retrieves elements on the lower diagonal efficiently. Each entry of the sparse matrix is a 2×2 matrix representing G i j . To avoid redundant computations, PDQ are obtained once and saved in memory with both algorithms. The PDQ for flanking markers and the number of discrete-generation equivalents were calculated according to Liu et al. [6] and Meuwissen and Luo [7], respectively. The probabilities of linkage phases (cis or trans) were calculated using information of parents and progeny first. Then the PDQ of each individual were calculated with respect to parental origin of haplotypes, gametic frequencies, probabilities of linkage phases of itself, its sire and its dam. In all eight populations, less than 5% of parental origin of haplotypes in generation 30 was unknown. The computational cost for calculating PDQ, time required for allocating memory, and memory used to store pedigree and marker information were assessed. The algorithm described by Wang et al. [15] was applied to compose G −1 after calculating D. All the algorithms were programmed in C++ language. Variable types of float (4 bytes), int (4 bytes), short int (2 bytes) and bool (1 byte) were declared and computations were carried out on a PC Pentium 4, 3.2 GHz processor under Windows XP (1 Gb of RAM).

RESULTS AND DISCUSSION
The performance of the two algorithms, in terms of computing time and storage requirements, to compute matrix D is presented in Tables I and II, respectively. The results clearly indicate that IM could be run in less time and with less memory requirement compared to AT. The computing time of IM relative to AT ranged from 2 to 50%. With discrete generations, both AT and IM performed slower than with overlapping generations, which may be due to the larger number of ancestors and the larger equivalent number of discretegenerations. As expected, the computing time of both methods was increased substantially with increasing number of sires. A smaller number of sires indicates a smaller number of genetic origins in the population, and therefore the number of required elements of G was far less with 50 sires compared with 300 sires. The effect of the number of sires on IM was due to the fact that this method essentially solves (4) and (5) as many times as the number of sires. However, the relative efficiency of AT decreased more quickly with the increased number of sires. The results (not shown) from a study of the effect of the number of dams indicate that IM was significantly less sensitive to the number of dams compared to AT.
Since litter size increased from 2 to 8, the total number of individuals increased 4 times. In spite of that, AT became relatively faster with larger litter size. Similar findings were reported for polygenic models [7,10]. With larger litter size, selection pressure is increased because the same number of breeding individuals is chosen among a higher number of candidates. Hence, the number of ancestors of the selected individuals tends to decrease, contributing to decreasing the number of required elements of G. In contrast, IM consumed slightly more time with larger litter size except for the discrete situation with 300 sires. This could be because increased numbers of individuals increases the overhead for other tasks.
Increasing the number of generations increased the computing time for both methods. More precisely, the rate of increase was higher for IM than for AT. While IM ran fast compared to AT in all situations considered, AT may become competitive with IM in terms of computing time when both litter size and the number of generations are large, but may still require much more memory than IM. (1) Computed for individuals born in the last year with overlapping generations. (2) Hard disk memory was used. In general, the memory requirement of AT showed the same trend as was seen for computing time and was larger than that of IM in all settings examined. In the case of discrete generations with 300 sires, two progeny per dam and 30 years of reproduction, AT performed worst. This bad performance of AT could be linked to a larger number of ancestors with a more dispersed pedigree, which led to an increase in the number of required elements of G.
In this case, the memory requirement for AT was 1032.9 Mb or 98 times that required by IM, and computing time was exceptionally larger due to the use of hard disk space. In contrast to computing time, storage requirements of IM did not significantly differ between overlapping and discrete generations and between 50 and 300 sires. However, the memory required by IM was greater with the larger litter size and more generations, due to increased total number of individuals in these cases. As a result, the storage requirement of IM was almost a linear function of the total number of individuals, whereas that of AT could be a function of many factors such as the number of ancestors, litter size, the number of generations, and selection strategy. Populations that require a large amount of memory with AT can be divided into smaller populations based on the related group [12]. Nevertheless, such divided populations may always have some part of their pedigrees in common, and consequently repeated computation of the same elements of G would often be required, which would result in a marked increase in computing time. Table III shows the number of calculated elements of G −1 and memory required to store them as a lower triangular sparse matrix. The total CPU time consumed for creating the triangular inverse was less than 0.2 seconds in all scenarios.
Finally, a problem lies in calculating G −1 for populations with incomplete marker data. A solution for this situation is given by Wang et al. [15], which is not feasible for large populations. Given missing marker information, approximate methods to compute G and its inverse for large populations in which PDQ are first approximated and then G is constructed recursively were presented [13,15]. The approximated PDQ can also be used by AT or IM to compute D and consequently G −1 . Although these approximate methods do not adequately utilize the existing marker information for approximating PDQ [13], the use of a more exact method such as the Monte Carlo Markov chain might improve the accuracy of the approximations. Recently, given incomplete marker data, a method for constructing a gametic covariance matrix for an equivalent mixed model was developed [11]. With this method, however, genotype probabilities for individuals with missing marker information should be calculated, and two random effects for each possible marker genotype should be considered.
In conclusion, we show that it is feasible to apply the indirect method at the gametic level for computing conditional gametic relationships between two parents of all individuals in inbred populations which are necessary for calculating matrix D and finally G −1 . Simulation results indicate that the indirect method can compute D faster with significantly less storage requirements compared to the adaptation of the tabular method. The efficiency of the indirect method relies on the use of the sparseness of G −1 which can be exploited quickly in two simple steps.