Identification of gametes and treatment of linear dependencies in the gametic QTL-relationship matrix and its inverse

The estimation of gametic effects via marker-assisted BLUP requires the inverse of the conditional gametic relationship matrix G. Both gametes of each animal can either be identified (distinguished) by markers or by parental origin. By example, it was shown that the conditional gametic relationship matrix is not unique but depends on the mode of gamete identification. The sum of both gametic effects of each animal – and therefore its estimated breeding value – remains however unaffected. A previously known algorithm for setting up the inverse of G was generalized in order to eliminate the dependencies between columns and rows of G. In the presence of dependencies the rank of G also depends on the mode of gamete identification. A unique transformation of estimates of QTL genotypic effects into QTL gametic effects was proven to be impossible. The properties of both modes of gamete identification in the fields of application are discussed.


INTRODUCTION
Fernando and Grossman [2] described how to incorporate genetic markers linked to quantitative trait loci (QTL) into best linear unbiased prediction (BLUP) for genetic evaluation. For this, the inverse of the conditional gametic relationship matrix G is needed. This matrix mirrors the (co-)variances between QTL allele effects of all animals for a marked QTL (MQTL).
For offspring of so-called informative matings the paternal or maternal origin of gametes can be identified by one or several markers in the surroundings of the QTL. The QTL-allele on the paternal (maternal) gamete can then be taken as the first (second) MQTL-allele effect of such an individual. Below this is termed "gamete identification by parental origin".
An alternative mode of gamete identification has been employed by Wang et al. [21] and Abdel-Azim and Freeman [1]: for an individual with a heterozygous (1, 2) marker genotype, the gamete with the first (1, in alphanumerical order) marker allele is taken to carry the first and the gamete with the other (2) allele, the second MQTL allele effect. This is denoted as "gamete identification by markers".
Both modes of gamete identification have been used before in publications dealing with the computation of G and its inverse from pedigrees and marker data. Until now -to the authors' knowledge -the consequences of changing the mode of gamete identification in a marker assisted BLUP (MA-BLUP) model have, however, not been investigated.
Abdel-Azim and Freeman [1] -based on the results of [2] and [21] -developed a numerically efficient algorithm for the computation of G and its inverse. This algorithm has been tailored for situations where G has full row and column rank and the number of MQTL effects is twice the number of animals in the pedigree. However, under certain circumstances, linear dependencies may occur between gametic MQTL effects and G may therefore be rank-deficient. This could e.g. arise from a microsatellite located within an intron (zero recombination rate) of that gene, which is responsible for the QTL or if double recombinants are ignored for a QTL between two flanking markers [10]. This article first demonstrates by example that G is not unique but depends on the mode of gamete identification, and as do the MA-BLUP estimates of gametic MQTL effects. Then a generalization of the Abdel-Azim and Freeman algorithm [1] is developed, allowing for the elimination of linear dependencies in G and its inverse.

MODEL, NOTATION, DEFINITIONS, ASSUMPTIONS
Let us consider the following mixed linear model (gametic effects model) where y (m×1) denotes the vector of m phenotypic records for n animals, f (n f ×1) is the vector of fixed effects, u (n×1) is the vector of random polygenic effects and v (2n×1) is the vector of the random gametic effects (v 1 1 , v 2 1 , . . . , v 1 i , v 2 i , . . . , v 1 n , v 2 n ) of a marked quantitative trait locus (MQTL) that is linked to a single polymorphic marker locus (ML). Linkage equilibrium between ML and MQTL is assumed. Observed marker genotypes are denoted by M. X (m×n f ) , Z (m×n) are known incidence matrices and T (n×2n) = I n ⊗ [ 1 1 ], where ⊗ stands for the Kronecker product. Subscripts in parentheses of the vectors and matrices denote their dimensions. Expectations of u, v and e and covariances between them are assumed to be 0. Furthermore, let Cov(u) = σ 2 u V, Cov(v) = σ 2 v G, Cov(e) = σ 2 e R, with the (n × n)-dimensional numerator relationship matrix V, the (m × m)-dimensional residual covariance matrix R and the (2n × 2n)-dimensional conditional gametic relationship matrix G and the variance components σ 2 u , σ 2 v and σ 2 e of the polygenic effects, the effects of the MQTL and the residual effects.
Let α 1 i α 2 i , i = 1, ..., n denote the two MQTL alleles of individual i having the additive effects v i = (v 1 i , v 2 i ) , and P(α k i ⇐ α t j |M) defines the probability that the kth allele, k = 1, 2, of individual i descends from the tth allele α t j , t = 1, 2, of parent j given the observed marker genotypes M, and, r is the recombination rate between the maker locus and the MQTL. In the following paragraphs let us assume that individuals are ordered such that parents precede their progeny (ordered pedigree).

COMPUTING G AND ITS INVERSE
Abdel-Azim's and Freeman's example [1] is used to demonstrate that G and its inverse are not unique but depend on the mode of gamete identification. With the assumptions made above and a recombination rate r > 0, gamete identification by markers is considered first.

Gametes are identified by markers
Let s and d denote paternal and maternal parents of animal i. The eight probabilities that the MQTL alleles (α 1 i , α 2 i ) of animal i descended from any of the parents' four MQTL alleles, paternal (α 1 s , α 2 s ) and maternal (α 1 d , α 2 d ), for given observed marker genotypes M, can be written as a matrix Q i as defined by Wang et al. [21]: It must be defined what is the first and what is is the second MQTL allele in (2a): in heterozygotes (1,2 at the marker) the first MQTL allele is on the gamete with the first marker allele (1) and the second MQTL allele is on the gamete with the second marker allele (2), as already described in the introduction.
In homozygotes, the MQTL alleles can not be distinguished. The Q i for the base animals, i.e. animals having no parents in the pedigree, are not defined. Non-base animals have Q i s with first and the second row sums equal to one as well as the sum of the elements of the sire block (first two columns of Q i ) and the sum of the elements of the dam block (last two columns of Q i ).
The Q i matrices are of key importance, because once these Q i s have been computed for all individuals in an ordered pedigree, the tabular method [21] can be applied for the construction of G and G −1 -no matter what method has been used for the computation of Q i s before: where f i is the conditional probability that 2 homologous alleles at the MQTL in individual i are identical by decent, given observed marker genotypes M (conditional inbreeding coefficient of individual i for the MQTL, given M), which can be calculated according to formula (11) in [21], and )-dimensional matrix constructed by setting the (2s-1)th and (2s)th column equal to the first and second column of Q i and the (2d-2)th and (2d)th column equal to the third and fourth column of Q i , all other elements of A i are zero, where s and d are the numbers of the sire and the dam of individual i in the ordered pedigree. Abdel-Azim and Freeman [1] gave an algorithm for the decomposition of G by G = BDB , where B is a lower triangular matrix and D is a block diagonal matrix with (2 × 2)-matrices D i from (4) in the ith block. B can be recursively computed as where I 2 is an identity matrix and A i is the same matrix as in (3) and (4). The inverse of G can be calculated as with  Table I. Example pedigree, marker genotypes from [1] and Q * i (bold numbers) from (2b), in Q i notation (2a).

Animal Sire Dam
Marker [1] proposed efficient computational techniques using this decomposition and a sparse storage scheme for G −1 . The example of Abdel-Azim and Freeman (see Tab. I in [1]) can be used to demonstrate G ( Fig. 1 in [1]) and G −1 (p. 162 in [1]) for complete marker data, linkage equilibrium and a recombination rate of 0.10 under gamete identification by markers.

Gametes are identified by parental origin of the marker alleles
When the gametes α 1 i , α 2 i are identified by the parental origin of the marker alleles, the first MQTL allele of animal i is defined as its paternal (α 1 i = def α s i ) and the second as its maternal allele (α 2 i = def α d i ). Consequently (2a) becomes and with the fact that the row sums of Q i are equal to 1 , i.e. only two parameters P(α s i ⇐ α s s |M) and P(α d i ⇐ α s d |M) are to be calculated and therefore Q i reduces to Q * 1 i and Q * 2 i are known as transition probabilities in QTL analysis. In contrast to gamete identification by markers (2a), the gametes of base animals cannot be uniquely identified and the paternal or maternal origin of the marker alleles of all base animals remains uncertain when (2b) is applied. With a probability of 0.5 the first marker allele may be of paternal or maternal origin, and the second, too. This fact creates differences in the Q i matrices and, as a consequence, differences in G and its inverse if gamete identification by parental origin is used. The same is true for heterozygous offspring of uninformative matings. For illustration, let us consider animal 5 in Table I in [1] and Table I of this paper. Animal 5 has a marker genotype A 1 A 1 and is offspring of animal 3 (sire, A 1 A 2 ) and animal 4 (dam, A 1 A 2 ). It is evident that animal 5 has inherited A 1 from both parents. With definition (2a), this is the first allele of the sire and the first of the dam, but because of the homozygosity, each of the A 1 in animal 5, A 1 can be the first or the second marker allele. Thus under (2a), Q 5 must be determined as where the first matrix of the product is the matrix with the probabilities of descent for the marker alleles and the second is the matrix with the recombination rate r = 0.1 in both formulas for Q 5 . Now we use definition (2b), and the fact that the sire of 5 is base animal 3. Hence in individual 3 A 1 can be maternal or paternal with probability 0.5. The dam of animal 5 is no base animal. So it is clear that A 1 is the paternal allele of the dam, and or in (2b) notation Q * 5 = 0.5 0.9 . The complete set of Q * i s (2b) in their Q i notation (2a) for Table I data in [1] for gamete identification by parental origin can be found in Table I. With Q i -notation of the Q * i the algorithm of [21] and [1] can also be applied for computing the conditional gametic relationship matrix (non-zero elements of this matrix see (E 1) and its inverse (non-zero elements of the inverse see (E 2)). Figure 1 in [1] and (E 1) or the matrix at page 162 in [1] and (E 2), there are some differences in G and G −1 . The G-matrix [1] is of full rank and has 128 non-zero elements, G in (E 1) is of full rank, too, but it only has 106 non-zero elements. The numbers of non-zeros in the corresponding inverses are 74 (p. 162 in [1]) versus 58 (E 2).
With the w = Tv, model (1) can be written as MQTL genotypic effects of model y = Xf + Zu + Zw + e, with (n × 1)-vector w of genotypic effects at the MQTL of the n animals, to the same conditional genotypic relationship matrix [19] (non-zero elements in (E 3)) for both different conditional gametic relationship matrices Figure 1 in [1] and (E 1). As a consequence the resulting genotypic effects w are independent of the variant of G and the same is true for polygenic effects and the total breeding values of all animals.

LINEAR DEPENDENCIES IN G AND RULES FOR ELIMINATING THEM
As already mentioned, the recombination rate r between MQTL and the marker may be zero for certain applications. Therefore we re-examine the example from Table I in [1] using gamete identification by markers, but now with a recombination rate of r = 0. The corresponding Q i s can be found in Table II.
With the Abdel-Azim and Freeman algorithm [1] the G-matrix can be calculated, but it has dependent rows and columns (e.g. identical rows/columns 8, 12 and 14, see (E 4)).

(E 4)
The computation of G −1 fails because of the dependencies in G. These dependencies are indicated by det(D i ) = 0 for individuals i = 5, 6, 7, and consequently, D −1 i in (4) or (6) does not exist for these individuals. The dependencies in G are caused by the configuration of Q i s. Problem-creating Q i -matrices in the example are Q 5 , Q 6 and Q 7 in Table II. Q 6 and Q 7 imply Table II. Example pedigree, marker genotypes from [1] and Q i (recombination rate: of (E 6), (E 7):  (1). Q 5 contains the information that animal 5 has received the sire's first MQTL-allele and the dam's first MQTL-allele, but it is not known which of these alleles is the first and which is the second in animal 5. Therefore Q 5 can be written as the average Q 5 = 0.5 · 1 0 0 0 0 0 1 0 + 0 0 1 0 1 0 0 0 , and the corresponding effects . Hence the number of gametic effects in model (1) can be reduced to a smaller set of different effects without dependencies in a corresponding 'condensed' gametic relationship matrix G * . How the configuration of the Q i s can be used in a 'condensing' algorithm for the gametic effects and the computing of the 'condensed' gametic relationship matrix G * and its inverse is outlined in detail in the following section.
Let v * denote the n * -dimensional vector of the n * remaining components of v and let L be a(2n × n * )-dimensional matrix with row sums equal to 1 in such a manner that v = Lv * . Therewith, model (1) can be written as The determination of the n * remaining components of v is part of the condensing algorithm. It is assumed that the Q i matrices (2a) have already been computed for all animals and the pedigree is ordered such that parents precede their progeny.
Let 4)vector of the column sums of Q i . SQ 1 i = 1 for example means that animal i has received the first MQTL-allele of its sire and therefore SQ 2 i = 0. If there is a one in the first or second row of the first column of Q i the place of this allele in i is the number of that row containing the one.
Define N = ((N i,j )), i = 1, ..., n; j = 1, 2 a (n × 2)-dimensional integer matrix with the indices of the remaining gametic effects v * of n animals and N i = (N i,1 ; N i,2 ) the ith row of N and let n b be the number of base animals at the top of the pedigree which are considered to be unrelated and non inbred, and n max The algorithm consists of four parts: the generation of the index matrix N, the determination of matrix L, the calculation of the condensed gametic relationship matrix G * , and finally, the computation of its inverse. It is independent of the mode of gamete identification and can be used with Q i definition (2a) as well as with Q i definition (2b).

First part of the algorithm: Generation of the index matrix N
For i ≤ n b (base animals): For i > n b (non base animals) and k, j = 1, 2: where N s(i),k is the index of the kth MQTL-allele (k = 1, 2) of the sire s(i) and N d(i), j is the index of the jth MQTL-allele ( j = 1, 2) of the dam d(i) of   Table III.
Second part of the algorithm: Determination of the incidence matrix L For each animal i (i = 1, ..., n) there are two rows in L. Let L 2i−1,t denote the elements of the first row and L 2i,t (t = 1, ..., n * ) those of the second. The following algorithm determines the non zero elements of L: and ; else for t = N i,2 where 1 Q i means that no element of Q i equals one.

Third part of the algorithm: Calculation of the condensed gametic relationship matrix G *
The condensed gametic relationship matrix G * with full rank n * can be computed by the use of the following generalization of the tabular method of [21]: G * 1 = G 1 and for i = 2, ..., n i is a (2×n max i−1 )-dimensional matrix with columns N s(i),1 , N s(i),2 , N d(i),1 , N d(i),2 being identically with the first, second, third, fourth column of Q i in this order and zero elements otherwise.

Fourth part of the algorithm: Computation of the inverse of the condensed gametic relationship matrix G *
The inverse G * −1 of the condensed gametic relationship matrix G * can be determined by generalization of the tabular method of [21] in an analogous way: G * −1 1 = G −1 1 and for i = 2, ..., n

if one row and column is added and
if two rows and columns are added. Calculation of d * i and D * i can be simplified by using d * We continue with animal 7 for illustration of (10). With G * −1 6 we only have to compute d * 7 to get G * −1 7 = G * −1 (see (E 7) for non-zero elements of the inverse G * −1 ) with A * 1 7 = 0 0 0 0 0.5 0 0.5 0 0 from above.
With w = Tv, v = Lv * , the relation Q G = 0.5 · TGT = 0.5 · T(LG * L )T between conditional genotypic relationship matrix Q G and conditional gametic relationship matrices G and G * , it is easy to verify that G * from (E 6) and G from (E 4) result in the same conditional genotypic relationship matrix Again, we consider the situation that the gametes α 1 i , α 2 i are identified by parental origin for Table II data and calculate Q * i according to (2b). The Q * i s for the non-base animals are Q * 4 = 0.5 0.5 , Q * 5 = 0.5 1.0 , Q * 6 = 0.5 0.0 and Q * 7 = 0.5 0.0 . Applying the condensing algorithm for this data, the condensed gametic relationship matrix G * (see (E 9) for non-zero elements) and its inverse G * −1 (see (E 10) for non-zero elements) ) can be calculated recursively. In contrast to (E 5) (10 remaining effects) there are now 11 effects left. The differences in the condensed gametic relationship matrices ((E 6) vs. (E 9)) and their inverses ((E 7) vs. (E 10)) are evident. G * in (E 6) is of rank 10 and has 34 non-zero elements, G * in (E 9) is of rank 11 and has 43 non-zero elements. The corresponding inverses are of rank 10 with 32 non-zeros (E 7) versus rank 11 with 39 non-zero elements (E 10). But again the matrices of (E 6) and (E 9) result in the identical genotypic relationship matrix (E 8), which can easily be verified. This means that the number and size of estimates of gametic MQTL effects depend on the mode of gamete identification, but the sum of these effects remains unaffected for each animal.

DISCUSSION
Gamete identification by parental origin and gamete identification by markers have already been used earlier in the literature without a clear distinction [1,2,14,21]. In this article it was shown for the first time, at least to the authors' knowledge, that both identification methods result in different conditional gametic relationship matrices and different estimates of gametic effects in a MA-BLUP model. It could, however, be demonstrated that the MA-BLUP breeding value -the estimated sum of QTL-gamete effects and the polygenic effect -remain the same irrespective of the method of gamete identification.
A practical advantage of identification by parental origin is that both the sire-block and the dam-block of the Q i -matrices (2a) can each be represented by a single number, namely the probability that the paternal allele of the sire and the dam have been transmitted to the descendant, respectively. The reason for this is that each block (sire and dam) of the Q i -matrices has only two non-zero entries which sum up to one, if the Q i -matrices reflect gamete identification by parental origin. The alternative mode of gamete identification needs three probabilities for each block, because the number of non-zero entries per block (again summing up to one) is four when gamete identification is done by marker. It may be worthwhile to store Q i -matrices for all animals additionally to marker raw data as an essence of marker information and intermediate result in computing the G matrix and its inverse and also for other purposes as e.g. the computation of measures of marker information content. Though six numbers per animal will not be prohibitive to store even with tens of thousands of animals, identification by parental origin needs only two and is therefore easier to administer.
The genotypic relationship matrix Q G in the MQTL genotypic effect model may either be determined by deterministic methods [1,2,9,13,21] or by Markov chain Monte Carlo (MCMC) [5,12,[16][17][18]20] and their advantages and disadvantages have been investigated by several authors [11,13]. MCMC has been implemented in the LOKI program [6] in order to compute Q G by MCMC, but currently it cannot be used to compute G for a MQTL allelic effects model. Though not primarily designed for this purpose SimWalk2 [15] can be employed to achieve this goal: it reports MCMC estimates of all 15 detailed identity state probabilities [4] between any pair of animals in the pedigree. Q i matrices can be derived from the SimWalk2 output implicitly using gamete identification by parental origin. This follows from the definition of identity states in SimWalk2 software [15].
The MA-BLUP breeding value of each animal in (1) equals the sum of the MQTL genotypic effect and the polygenic effect. Thus it would be sufficient to use a MQTL genotypic effects model and to save one equation per genotyped animal. Since the MQTL genotypic effect is the sum of the first and the second MQTL allele effects, one positive and one negative gametic effect may give rise to the same genotypic effect as two gametic effects of average size. It would be interesting for breeders to know how an MQTL genotypic effect is composed and therefore it is desirable to have estimates of the available gametic effects. A conclusion of the considerations above is, that a certain mode of gamete identification has to be chosen. It may be more natural to animal breeders to think in pedigrees and parental origin rather than in marker haplotypes and therefore gamete identification by parental origin may be preferred for this purpose.
It has been proposed by [11] to estimate MQTL genotypic effects w and to transform these into allelic effects v. If such a transformation exists, it would have to be specific for a certain definition of v, i.e. it would depend on the mode of gamete identification. The conclusion of [11] that v = 0.5 · GT (Q G (n×n) ) −1 w follows from Tv = w = TGT (TGT ) −1 w = 0.5·TGT (Q G (n×n) ) −1 w is however not possible, because a left-inverse, say T ( where t 1,1 cannot take the values 1 and 0 at the same time, and consequently, T (left)− (2n×n) T (n×2n) never equals the identity matrix. Both modes of gamete identification fail for some animals. The gametes of homozygous individuals (both base and non-base) cannot be distinguished by markers. On the contrary, gamete identification by parental origin fails in all founders and in all non-founders from non-informative matings, where from marker analysis it cannot be deduced whether the marker was inherited from the dam or from the sire. As the number of markers is increased, the probability for a mating to be non-informative for all markers as well as the probability for an individual to be homozygous at all marker loci becomes smaller and smaller. Multiple markers will therefore help to identify nearly all gametes unequivocally. This is, however, not true for founder animals, when gamete identification is done by parental origin. This problem can, however, be resolved by identifying founder gametes by markers and by arbitrarily denoting one of both gametes of each founder animal as maternal and the other as paternal. By applying this rule to the example pedigree [1] it can be shown that a third variant of a conditional gametic relationship matrix results and differs from the two others demonstrated above (data not shown), but again transforms to the same Q G -matrix and, consequently, leads to the same MA-BLUP breeding values.
Gamete identification by marker may be of interest in applications where the intention is to test a certain polymorphism for linkage disequilibrium with a QTL. With two alleles and in using this polymorphism for gamete identification, all gametes with allele 1 are treated as the first and all gametes with allele 2 as the second gamete of heterozygous animals. The expectation is that, if the polymorphism is in strong linkage disequilibrium with the QTL, differences between the first and the second gametic effect of heterozygotes will exhibit the same sign and roughly the same size, provided there is a sufficient accuracy of both gametic estimates.
A reduction of the size of the conditional gametic relationship matrix has already been proposed by [10]: parents and offspring sharing the same marker haplotype were treated as sharing the same QTL allele, by assuming a zero probability of double recombinations. In these cases the same gametic QTL effect was assigned to the parent and the offspring. When the offspring has received a recombinant marker haplotype a new gametic QTL effect was defined as the average of both gametes of the parent animal. Treating these two gametes as parents of the new gamete allows to set up a pedigree of gametic effects and to compute the conditional gametic relationship matrix and its inverse simply by applying the Henderson rules [7,8]. The condensing algorithm will, of course, lead to identical results, if desired: sire-and dam-blocks of progeny with non-recombinant parental haplotypes have to carry zeros and ones only, and in the recombinant case the corresponding sire-and dam-blocks are assembled by fifty-percent transition probabilities as in the case without markers. The Meuwissen and Goddard proposal [10] therefore combines a special case of the condensing algorithm with gamete identification by parental origin. Assuming the QTL in the middle of a certain marker interval and rounding the QTL-transition probabilities to either one or zero if they are closer to these values as a predefined threshold, e.g. 0.02, will give the same results as in [10] for those animals which are informative at the markers flanking this particular interval. For the other animals information from markers more remote and asymmetrically distributed around the QTL's assumed home interval may be available. In these cases the probability of double recombinations may be too high to be neglected and, furthermore, the assumption of an equal transmission probability of the first and second parental allele may be unrealistic in the light of the markers transmitted. These animals can easily be combined with the former group by maintaining the original transition probabilities in the Q imatrices without rounding and then applying the condensing algorithm to all pedigree members in the same way, no matter of previous rounding or not.
In conclusion, the condensing algorithm is a generalization of the Abdel-Azim and Freeman algorithm [1] for computing the conditional gametic relationship matrix and its inverse. Although suggested before by other authors, computing this inverse cannot be avoided if estimates of gametic QTL effects are desired.
The condensing algorithm can be applied to different modes of gamete identification, situations with and without markers including X-linkage, clones and haplodiploid pedigrees. Treatment of haplotypes according to the proposals of [10] are covered as a special case and can be combined with exact treatment of more remote marker information.