The covariance between relatives conditional on genetic markers

The development of molecular genotyping techniques makes it possible to analyze quantitative traits on the basis of individual loci. With marker information, the classical theory of estimating the genetic covariance between relatives can be reformulated to improve the accuracy of estimation. In this study, an algorithm was derived for computing the conditional covariance between relatives given genetic markers. Procedures for calculating the conditional relationship coefficients for additive, dominance, additive by additive, additive by dominance, dominance by additive and dominance by dominance effects were developed. The relationship coefficients were computed based on conditional QTL allelic transmission probabilities, which were inferred from the marker allelic transmission probabilities. An example data set with pedigree and linked markers was used to demonstrate the methods developed. Although this study dealt with two QTLs coupled with linked markers, the same principle can be readily extended to the situation of multiple QTL. The treatment of missing marker information and unknown linkage phase between markers for calculating the covariance between relatives was discussed.


INTRODUCTION
Quantifying the resemblance between relatives is a fundamental issue in quantitative genetics. It is needed for estimating genetic parameters, predicting breeding values, planning mating schemes, QTL mapping and marker assisted genetic evaluation. The study of the correlation between relatives can be traced back to the beginning of the last century [29,36]. Kempthorne [22] summarized the work on this topic up to Malecot's study [27]. Fisher [12] first studied the two-locus epistatic deviations and their effects on the covariance between relatives such as parents and descendants, fullsibs, uncles and cousins. Cockerham [6,7] partitioned the two-locus epistatic variance into additive by additive, additive by dominance and dominance by dominance. Kempthorne [21,22] applied the analysis of factorial experiments to partition the genetic variance and studied the covariance between relatives in random mating populations [21,23], inbred populations [24] and a simple autotetraploid population [25]. Plum [31] formulated a recursive method for calculating the relationship and inbreeding coefficients. Cockerham [8] and Weir et al. [37] analyzed the influence of linkage on the covariance between relatives. The theory and computational algorithms for the correlation between relatives were well established in the early development of quantitative genetics.
The resemblance between relatives is attributed to gene transmission from the parents to the descendants so that the relatives share identical genes by descent with certain probabilities. Since the gene transmission between generations is not observable, the transmission probability of an allele is generally taken to be 0.5. Actually, the transmission of an allele from a parent to offspring follows an all-or-none pattern. With information from molecular markers, it becomes possible to track the transmission of a linked gene more precisely than by using pedigree data alone.
There have been several studies on the conditional covariance between relatives. Fernando and Grossman [11] developed a method for calculating the gametic covariance conditional on a single linked marker, assuming completely informative markers. Van Arendonk et al. [3] designed a computing procedure for the gametic relationship matrix given a single linked marker, which is valid when the parental origin of the offspring's alleles is known. Goddard [16] derived the conditional gametic covariance due to allelic effects in terms of genetic effects without using the concept of identity probabilities, where parental origins of marker alleles and linkage phases among markers are assumed to be known. However, the parental origin of the offspring's alleles is often unknown in real data analysis. Wang et al. [35] extended Fernando and Grossman's [11] method to accommodate situations where the parental origin of marker alleles can not be determined unequivocably. However, the method used to account for this biological uncertainty has been developed only for a single marker linked to a QTL. In QTL mapping for human populations, Fulker and Cardon [13,14] used a regression approach to approximate the IBD of QTL from the IBD of flanking markers. Their development is based on the method of Haseman and Elston [18] which considers the expected IBD of a locus as a linear function of the IBD of another linked locus. Kruglyak and Lander [26] developed a hidden Markov model to estimate the IBD states of a putative QTL using the probability distribution of the marker IBDs. This approach is more accurate than Fulker and Cardon's approximation [13,14], but is more complicated to compute. Xu and Gessler [38] made a compromise between the two methods and proposed an approximate hidden Markov model to improve the computing speed at the expense of estimation accuracy. Almasy and Blangero [2] improved Fulker and Cardon's method [13,14] in regard to the sib-pair approach of QTL mapping and developed a general framework of multipoint identity by descent. Pong-Wong et al. [32] combined the method of Haseman and Elston [18] for estimating identity by descent between sibs often used in human genetics and the method of Wang et al. [35] for general pedigree to derive a simple method for calculating the gametic identity-by-descent matrix of QTLs. Meuwissen and Goddard [28] developed a method of predicting gametic identity probability from marker haplotypes by a simplified coalescence process, assuming that the number of generations since the base population and effective population size are known. These studies on conditional identity measures of relatives have generally focused on the identity by descent due to allelic effects. The theory of conditional covariance due to non-additive effects has been little studied. Aside from the covariance due to allelic effects, the quantification of the conditional covariance components due to additive and non-additive effects is also frequently required to refine the statistical model for marker assisted analysis of quantitative traits.
This study aimed to develop a general theory for constructing the conditional covariance between relatives in the presence of additive, dominance and epistatic effects and to update the classical theory when both pedigree and marker data are available. The development relaxed the assumptions of previous studies and applied both single and flanking marker inferences with known or unknown parental origins of offspring's haplotypes.

Genetic covariance components
If there are q loci controlling a quantitative trait, the classical formula for computing the covariance between genotypic values (g) of individuals i and j [21,22] is: (1) under the assumption of no inbreeding and linkage equilibrium among loci. When there is only one locus (q = 1), formula (1) reduces to Cov(g i , g j ) = r ij σ 2 A + u ij σ 2 D . In this notation, σ 2 A 2 D 1 stands for σ 2 AAD while σ 2 A 1 D 2 stands for σ 2 ADD . Traditionally, the coefficients r ij and u ij are assumed to be identical for the q loci because the allelic transmission at each individual locus can not be traced. Considering only two loci, say m and n, the genetic covariance due to these two QTL loci can be written as: where σ 2 A m and σ 2 A n are the additive variances of loci m and n, and σ 2 D m and σ 2 D n are dominance variances at the two loci. The epistatic variances for additive by additive, additive by dominance, dominance by additive and dominance by dominance between loci m and n are σ 2 A m A n , σ 2 A m D n , σ 2 D m A n and σ 2 D m D n , respectively. Information on the markers linked to QTL affecting a trait can be used to refine the covariance among relatives. Conditional on the marker information M, the coefficients r ij and u ij at different loci may vary from locus to locus. Therefore, formula (2) needs to be rewritten as: where r m ij , r n ij , u m ij and u n ij are the additive and dominance relationship coefficients between individuals i and j at loci m and n, and r m ij r n ij , r m ij u n ij , u m ij r n ij and u m ij u n ij are the relationship coefficients of epistatic interactions between loci m and n. The relationship coefficients r l ij and u l ij (l = m, n) depend on the conditional probability of QTL allelic identities between individuals i and j: This development refines the estimation of genetic covariance and its additive and non-additive components by using marker information that provides locus specific knowledge of QTL allelic transmissions. Therefore, tracing allelic transmission and assessing the conditional probability of QTL allelic identity between relatives are two fundamental issues in this study.

Conditional probability of QTL allelic identity by descent
For every pair of individuals i and j in a population, there are four possible QTL allelic identities: The probabilities of these identities can be inferred conditional on the marker information. Let matrix P ij contain the probabilities of the four QTL allelic identities between individuals i and j: The additive and dominance relationship coefficients between individuals i and j can be obtained from the four elements (p 11 , p 12 p 21 and p 22 ) of P ij according to formulae (4) and (5): Similarly, the QTL allelic identity matrices between i's parents s and d and j's parents s and d in the parental generation ( Fig. 1) can be defined as P s s , P s d , P d s and P d d . For a descendant i, there are eight possible ways to inherit the QTL alleles of the parents s and d. The conditional probabilities of QTL allelic transmission from parents to descendant i can be summarized in matrix T i : where the t's are all (2×1) column vectors. Similarly, QTL allelic transmission probabilities from parents s and d to descendant j can be defined in matrix T j . The QTL allelic identity probabilities between individuals i and j, i.e. P ij , can be calculated as: where Substitution of P sj and P dj into formula (6) leads to Formula (7) corresponds to Falconer's [10] "basic rule"for calculating coancestry whereas formula (6) relates to the "supplementary rule". Computationally, formula (6) is more efficient than formula (7). Both (6) and (7) indicate that the QTL allelic identity probabilities in a population can be tabulated recursively from ancestors to descendants. The same principle applies to the derivation of QTL allelic identity probabilities of individual i with itself. Letting j = i, s = s and d = d in formula (7), and replacing the marginal probabilities with conditional probabilities in T j of formula (7) because the allelic transmission from parent to the first allele of offspring i is not independent of that to the second allele, the QTL allelic identity probabilities of individual i with itself (P ii ) can be derived from formula (7) and take the following form: where P sd and t's are as defined above and 1 = (1 1). Matrix P ii is always symmetric. When the parental origins of the two QTL alleles are known (e.g. Q 1 i is from the father and Q 2 i from mother), formula (8) simplifies to P ii = 1 t 1 P sd t 4 t 4 P ds t 1 1 · (9) In this situation, there is no dependence between the two events of allelic transmission. Therefore, formulae (6) and (7) can be directly applied to assess the QTL identity probabilities of an individual i with itself when parental origins of offspring's alleles are known. This explains why formula (8) of Van Arendonk et al. [3] works in the same way as the method of Wang et al. [35] when parental origins are known.

QTL allelic transmission probabilities
The parental origin of QTL alleles is usually unknown because the QTL allelic transmission is not directly observable. Therefore, the eight transmission probabilities of QTL alleles from parents s and d to descendant i (T i ) have to be assessed based on marker alleles transmitted from parents s and d to the offspring i and genetic distances between QTL and markers. When two flanking markers are available, the transmission probability from QTL allele k p (k p = 1, 2) of parent p (p = s, d) to allele k i (k i = 1, 2) of descendant i can be formulated as: is the conditional probability given in the 5th column of Table I when k p = 1 and in the 6th column when k p = 2.
Matrix T i can now be expressed in terms of marker allelic transmission probabilities, S i , and recombination rates between QTL and markers and between flanking markers: Note that M 1 i and N 1 i always stem from the same parent, so do M 2 i and N 2 i . When only a single linked marker is available, the situation simplifies to the case of Wang et al. [35]. Formula (10) is identical to formula (5) of Wang et al. [35] if their B matrix is transposed.

Marker haplotype transmission probability
Although marker genotypes can be observed through genotyping techniques, the parental origin of a descendant's haplotype is often uncertain. For example, if a descendant and its parents all have genotype A 1 A 2 at a single marker, there is no way to ascertain which parent the descendant's haplotypes come from. Furthermore, when a parent is homozygous, it is impossible to determine which parental gamete a descendant's haplotype comes from. In this development, we trace all possible paths from parental gametes to a descendant's marker haplotype. Because the inference is always conditional on marker information, the notation for conditioning on marker information (|M) will be dropped hereafter for ease of presentation.
The assessment of the marker haplotype transmission involves three steps. First, the transmission probabilities of each path from parental gametes to a descendant's haplotype needs to be quantified. For this, we need to infer which parent a descendant's haplotype comes from (parental origin), and which parental gamete type the descendant's haplotype originates from given the parental origin (gametic frequency). The probability of each transmission path is a probabilistic product of the parental origin and the gametic frequency given parental origin, following the Law of Compound Probability [5]. There are four mutually exclusive paths for each descendant's haplotype in a single marker case and eight in a flanking marker case. Second, we need to determine the probabilities of each descendant's haplotype given the transmission path from a parental gamete to the descendant's haplotype. This can be done by comparing the descendant's haplotype with the parental gametic type. Third, our purpose is to determine the probabilities of each transmission path from parental gametes to a descendant's haplotype given that the descendant's haplotype is observed. This requires calculating the reverse probability of each path given the observed haplotype of the descendant using the Bayes Theorem [5].
Consider the single marker case first. A marker haplotype M k i i of descendant i may come from the first or second paternal allele ( They are four mutually exclusive events. The transmission probability of each path above is a product of the probability of the parental origin of marker haplotype, = s, d), and the gametic frequency of parents given the parental origin: is unity when M k i i and M k p p are identical, and is zero when they are not, this formula can be rewritten as: where is an indicator function, which is equal to one if M k i i ≡ M k p p , and zero otherwise.
In the case of flanking markers, there are eight mutually exclusive marker transmission paths for each haplotype M k i i N k i i (k i = 1, 2) of descendant i: Here, M k i i and N k i i always stem from the same parent. The probability of each transmission path from a parental gamete to a descendant's haplotype is: where k p = 1, 2 and k p = 1, 2 index the pth parental alleles at marker locus M and N, respectively. The paternal and maternal gametic frequencies given parental origins are ( where is an indicator function. To clarify the description above, the stepwise calculation of the transmission probabilities of each marker haplotype is given for an example pedigree in which descendant A 1 B 2 A 2 B 2 results from the father A 1 B 1 A 2 B 2 and the mother A 1 B 2 A 1 B 2 . The elements of matrix S i , i.e. the probabilities for each marker allelic transmission path from parental gametes to the descendant's haplotypes given the descendant's haplotypes are observed (M 1 Table II,  When the linkage phase can not be ascertained in the flanking marker situation, the marker transmission probability can be estimated by its expectation. For example, if only one individual's phase in the related individuals (including sire s, dam d and descendant i) is unknown, the expectation of S i of descendant i, E(S i ) = S c i p c + S r i p r , can be used to calculate the QTL transmission probabilities (matrix T i ), where S c i and S r i are estimated from coupling and repulsion phases, respectively. The probabilities of the coupling phase (p c ) and repulsion phase (p r ) will be inferred based on the pedigree and marker information of parents, mates and offspring. In general, the expectation of S i can be estimated as:

COMPUTATIONAL PROCEDURE
Computationally, it is convenient to arrange the conditional QTL allelic identity probabilities between relatives in a gametic relationship matrix, G. If  Prob. of parental origin Prob. of parental origin stands for the recombination rate between markers.
n individuals in a population are coded successively, G will consist of 2 × 2 submatrices as follows: where the (2 × 2) P ij submatrix can be calculated as The submatrix P ij between individuals i and j can be used in turn to compute the P submatrix of their descendants. Thus, the gametic relationship matrix G can be calculated recursively, beginning with the uppermost left submatrix of G for founders of the population. The recursive calculation proceeds by processing individuals in the order of increasing birthdate. Once matrix G is calculated, the additive and dominance relationship coefficients r ij and u ij can be calculated from the (2 × 2) submatrix of G since matrix P ij is identical to the submatrix of matrix G: Letting R = {r ij } be the additive relationship matrix and U = {u ij } be the dominance relationship matrix at a given QTL locus, then the additive by additive relationship matrix is R m # R n , the additive by dominance relationship matrix R m # U n , the dominance by additive relationship matrix U m # R n , and the dominance by dominance relationship matrix U m # U n , where the symbol # is the Hadamard product between two matrices of the same size.

NUMERICAL EXAMPLE
A pedigree data set is given (Tab. III) to demonstrate the method described. We assumed that marker locus 1 is linked to QTL 1 with a recombination rate of 0.1. In another linkage group, marker loci 2 and 3 bracket QTL 2. The recombination rate between marker locus 2 and QTL 2 is taken as 0.1 and between QTL 2 and marker locus 3 as 0.2. The parents of individuals 1 and 2 are unknown. These two individuals were assumed to be unrelated and non-inbred. Therefore, the uppermost-left 4 × 4 submatrix of matrix G for these two individuals is an identity matrix for both loci. The matrices of the conditional marker transmission probabilities from parents to offspring are listed in Table IV for both single marker and flanking marker cases. For individual 5, the linkage phase between markers 2 and 3 was uncertain. Given the genotype of individual 5, the probabilities of two possible linkage phases     Once the conditional QTL allelic transmission probabilities and QTL allelic identity probabilities of founders were computed, the construction of G proceeded by computing the submatrices P ij and P ii . The pedigree data were sorted in ascending order by birth dates and the animal ID was coded sequentially. We calculated the submatrix P ij for each possible pair of individuals and submatrix P ii for each individual. The pedigree data, submatrices P ij and P ii , and elements of G are all indexed by the successive identification number of the individuals, such that G can be constructed step by step, recursively. The resulting matrix G is listed in Table VI.   The conditional additive relationship matrix (Tab. VII) and dominance relationship matrix (Tab. VIII) are calculated from G. Subsequently, the epistatic relationship matrices R 1 #R 2 , R 1 #U 2 , R 2 #U 1 and U 1 #U 2 can be computed.
The G matrix can also be calculated recursively by extending it by two rows and two columns for each new individual using matrix T i , in a manner similar to the construction of numerator relationship matrix [1,19,35]. The principle and computing formulae of this study apply to this method. The only difference is that matrix T i needs to be filled by place-holding zeros to adjust the positions of the non-zero elements to match the positions of parental QTL identity probabilities in the preceding upper left submatrix of matrix G. Taking the first QTL of individual 5 as an example, matrix T 5 needs to be filled with

Differences between classical and marker-based methods
Molecular genotyping techniques have a profound impact on the traditional theories of quantitative genetics. They make it possible to analyze a quantitative trait on the basis of individual loci. As a consequence, some theories of quantitative genetics have to be modified to account for marker information. In this study, the classical formulae for the covariance between relatives [7,21,22] were reformulated to allow for locus-specific information about the resemblance between relatives provided by genetic markers. In classical theory, identity by descent is assumed to be equal among loci. Therefore, σ 2 A and σ 2 D are summed over all loci affecting a quantitative trait and the epistatic variances are also summed over all possible combinations of the same kind of interactions. In the new formulation, the identities by descent are generally not equal among loci.
The coancestry, which is half of the additive relationship coefficient r ij in the case of no inbreeding [9], needs to be modified as well to incorporate marker information. The coancestry f ij between two individuals, i and j, depends on the relationship of i's parents s and d with j's parents s and d [10]: This equation assumes that the transmission probability of parental alleles to descendants is 0.5 as in the absence of marker information. When marker data are available to track QTL transmission, the formula is no longer optimal. Conditional on markers, the coancestry between individuals i and j needs to be defined as: Classically, the identity by descent between relatives depends only on pedigree and reflects the average relatedness between individuals in a population. In this study, the identity measures depend not only on pedigree data, but also on linked marker genotypes and genetic distances between markers and QTLs. They would vary from trait to trait and reflect the actual genetic resemblance between relatives. The conditional identity by descent provides a more accurate measure of genetic resemblance regarding a specific trait, and results in a more accurate analysis of quantitative traits, such as the estimation of genetic parameters. Furthermore, the classical formula for the covariance between relatives was derived based on the assumption that no linkage exists between quantitative trait loci. When the loci are linked, the joint probability of identical genes among loci is not a product of marginal probabilities. Therefore, the classical method results in a biased estimation if linkage between loci exists and epistatic effects are not negligible [8,37]. On the contrary to the classical method, the assumption of no linkage between QTLs is not required for the marker-based method as long as the markers are available for tracking QTLs. Consider an individual i with genotype because the marker information M removes the influence of Q 1 . That is, with marker information, the joint probability of identical QTL alleles among loci can be expressed as the product of marginal probabilities.

Incomplete marker information
In data analysis, incomplete marker information, such as missing marker genotypes and unknown linkage phases, presents a tedious problem which might increase computation considerably. A solution to this problem suggested in the literature is to replace the gametic relationship matrix G by its expectation conditional on observed markers M obs [15,35]: where G ω is a gametic relationship matrix conditional on a single phase-known marker configuration ω for the pedigree from a set of all possible marker configurations (Ω), and Pr(ω|M obs ) is the probability of the complete marker configuration ω conditional on observed markers. MCMC algorithms are often used in exploring the possible configurations and their probabilities conditional on observed data [33,34]. George et al. [15] gave a detailed review in this regard.
Calculating the expectation of matrix G for a pedigree with substantial missing marker data usually increases computing time considerably due to the potentially large number of configurations [15]. This computing burden is from the repeated calculation of G ω for each complete marker configuration because the expectation of G is taken at the last stage of the procedure, after G ω 's are obtained. Therefore, we previously suggested taking the expectation of marker transmission probability (S i ) for the issue of unknown linkage phase. Since the expectation of S i rather than G is taken in the early stages of tracing marker origins, the repeated calculations of matrix G ω can be avoided. A similar idea can also be used to tackle the problem of missing markers. For example, the marker genotype of an individual i's mother, d, is missing. Based on the marker genotypes of d's parents, mates and descendants, it is possible to calculate the probability distribution of d's marker genotype: P(M 1 d M 2 d ≡ k) = p k , where p k = 1. An S k i for each possible genotype k can be calculated to get the expectation of S i , E(S i ) = k S k i p k .
Then S i can be replaced with E(S i ) in calculating matrix G. Since the expectation is taken at the stage of tracing markers, the G matrix is calculated only once.
Recently, Perez-Enciso et al. [30] developed a MCMC method for identityby-descent probabilities of a chromosome region conditional on DNA markers. This procedure appears promising for the situations of incomplete marker information and could be used for computing the gametic relationship matrix at a QTL by some minor modifications.

Possible benefits of the present study
There have been several studies on the conditional covariance between relatives for marker assisted genetic evaluation within the framework of mixed model equations [11,16,35]. In comparison with these studies, the present study differs as follows: (1) This study presents a general framework for the conditional covariance between relatives including additive effects, dominance effects and epistatic interactions whereas the previous studies generally focused on the covariance components due to allelic effects. This study extends the classical covariance between relatives to cover the situation where both marker information and pedigree are available.
(2) In real data analysis, a solution for unknown parental origins of the offspring's marker alleles is unavoidable. Among the previous studies on gametic relationship, only Wang et al. [35] considered the problem of unknown parental origins and quantified the transmission probabilities by tracing back to both paternal and maternal origins for each allele of an offspring. Their algorithm was developed only for a single marker situation. This study developed a clear, stepwise procedure based on Bayes theorem for calculating marker transmission probabilities, which is applicable to both single and flanking marker situations with known or unknown parental origins. Probability tree analysis [20] shows that the quantification of the marker transmission probability requires assessing the reverse probability of a transmission path for a marker haplotype given the descendant's observed marker haplotype.
(3) This study provides a systematic procedure for constructing the additive and nonadditive relationship matrices conditional on marker information besides the allelic relationship, and makes it possible to model additive and nonadditive effects in the framework of random models. Therefore, it provides the opportunity to improve the models for marker assisted genetic evaluation and QTL mapping by including other QTL effects aside from QTL allelic effects. In marker assisted selection, modeling QTL additive effects, instead of QTL allelic effects [11], will reduce the number of equations and decrease the problem of overparameterization. Also, including nonadditive effects will refine marker assisted genetic evaluation by separating the nonadditive QTL effects from additive effects, thereby improving the estimates of additive effects. Modeling the nonadditive effects may be especially meaningful for refining QTL mapping of livestock using random models. There have been attempts to extend Fernando and Grossman's [11] model to QTL mapping [4,17] which were generally based on modeling QTL alleles. The feasibility of modeling the additive and nonadditive effects in QTL mapping will certainly improve the accuracy of QTL mapping. This is especially true for QTL mapping based on full-sib designs.