An efficient variance component approach implementing an average information REML suitable for combined LD and linkage mapping with a general complex pedigree

Variance component (VC) approaches based on restricted maximum likelihood (REML) have been used as an attractive method for positioning of quantitative trait loci (QTL). Linkage disequilibrium (LD) information can be easily implemented in the covariance structure among QTL effects (e.g. genotype relationship matrix) and mapping resolution appears to be high. Because of the use of LD information, the covariance structure becomes much richer and denser compared to the use of linkage information alone. This makes an average information (AI) REML algorithm based on mixed model equations and sparse matrix techniques less useful. In addition, (near-) singularity problems often occur with high marker densities, which is common in fine-mapping, causing numerical problems in AIREML based on mixed model equations. The present study investigates the direct use of the variance covariance matrix of all observations in AIREML for LD mapping with a general complex pedigree. The method presented is more efficient than the usual approach based on mixed model equations and robust to numerical problems caused by near-singularity due to closely linked markers. It is also feasible to fit multiple QTL simultaneously in the proposed method whereas this would drastically increase computing time when using mixed model equation-based methods.


INTRODUCTION
Variance component (VC) approaches have been widely used to detect the existence of variation associated with quantitative trait loci (QTL) [1,3,10,13,15,38]. The idea behind the approaches is to obtain identity by descent (IBD) coefficients between relatives for the QTL, based on marker and pedigree data, and maximize the likelihood of phenotypic data given these IBD coefficients at each putative QTL position. QTL position can be estimated with maximum likelihood (ML) or restricted maximum likelihood (REML) at the location with the highest likelihood value across the chromosome. This idea has been extended to a fine-mapping method using linkage disequilibrium (LD) generated from closely linked markers [27,28]. In the fine-mapping method, IBD coefficients between unrelated founders in a recorded pedigree are estimated based on haplotype similarity using the genedropping method [26] or the coalescence method. This allows utilizing unknown relationships beyond the recorded pedigree as well as known relationships, possibly allowing to estimate QTL position within a smaller region, e.g. within a few cM [30].
Data sets used for fine-mapping are typically not very large (a few hundred genotyped animals with records) because many marker loci with denser spacing are required. However, there is much richer information in the (co) variance structure (more non-zero elements in the variance covariance matrix). Such a complex and dense (co) variance structure makes a sparse matrix technique less useful. Therefore, the usual and efficient REML algorithm based on the mixed model equations (MME) using sparse matrix techniques (e.g. [12,19]) may be less suitable for fine-mapping of QTL. Moreover, the complicated and subtle process to maximize computational efficiency in the MME-based REML, i.e. by avoiding the inverse of the variance covariance matrix of the observations, is prone to have numerical problems since the matrix of covariance coefficients between haplotypes based on closely linked markers can reach near-singularity. Direct use of the variance covariance matrix of the observations (V) and its inverse would be robust to such problems because the V is an aggregate of (co) variance matrices of all components, which is more guaranteed to be positive definite. The dimension of the V is usually smaller than that of the MME, especially when there are many ancestors without records. Another advantage in using the V is to easily fit multiple QTL simultaneously since the dimension of the matrix is not changed. By contrast, the dimension of the MME rapidly increases with the number of QTL fitted.
The aim of this study was to investigate the efficiency of a REML algorithm with the direct use of the V compared to the usual REML algorithm based on MME when the combined LD and linkage (LDL) mapping is used with a general complex pedigree.

Mixed linear model
The genetic model that is used in the VC approach with combined LD and linkage mapping is relatively general. A vector of phenotypic observations is written as a linear function of the effects of QTL, a polygenic term representing the sum of other unidentified genetic effects, fixed effects and residuals. The model can be expressed as, where y is a vector of N observations on the trait of interest, β is a vector of fixed effects, u is a vector of n random polygenic effects for each animal, NQ is the number of QTL, q i is a vector of n random effects due to ith putative QTL and e are residuals. The random effects (u, q i and e) are assumed to be normally distributed with mean zero and variance Aσ 2 u , G i σ 2 q i and R, where A is the numerator relationship matrix based on additive genetic relationships based on recorded pedigree, G i is the genotype relationship matrix whose elements are IBD probabilities at ith QTL, and R is the covariance matrix among residual effects, assumed diagonal in this study. X and Z are incidence matrices for the effects β and u and q i , respectively. It is assumed that there is no correlation between u and q i , u and e, and q i and e, and no correlation among q 1 ∼ q NQ . The associated variance covariance matrix of all observations (V) given the observed pedigree and marker genotypes is modeled as The mixed model equations (MME) pertaining to (1) are

Finding inheritance states and haplotype construction
IBD coefficients are estimated based on the pattern of possible inheritance states and associated haplotype configurations [17,34,36]. To derive such patterns, two kinds of approaches can be used. One is to find an optimal haplotype configuration with the highest likelihood given observed data using maximum likelihood approaches. IBD coefficients are then estimated based on the most likely haplotypes. The other is to apply a Markov chain Monte Carlo (MCMC) algorithm to surface all possible inheritance states and haplotype configurations. IBD probabilities are estimated every MCMC cycle. Averaged IBD probabilities over all cycles would provide estimates based on the posterior distribution given the observed data [24].

LD information
IBD coefficients are based on similarity of haplotypes unrelated through known pedigree. These patterns of similarity can be derived by using a genedropping method [26] or the coalescence method introduced by Meuwissen and Goddard [27,28]. An assumption of a mutation age of 100 generations and a past effective size of 100 can be used to estimate IBD coefficients since they are usually unknown. Results have been shown to be robust against such assumptions by exploring a range of values for effective size and mutation age as well as for populations that have a decreasing effective size [22,23,27]. Therefore, IBD coefficients between unrelated founder haplotypes in recorded pedigrees can be approximated and these are non-zero values.

Combined LD and linkage information
Using the IBD probabilities between unrelated haplotypes, IBD probabilities between related haplotypes in the following generations can be recursively estimated using known pedigree information [32,39,40]. Therefore, IBD probabilities between all haplotypes can be estimated based on joint information from LD and linkage. Note that all resulting coefficients have a non-zero value.

Genotype relationship matrix
The number of variance covariance coefficients between all haplotypes at each position is 2n × 2n (n is the number of animals). From a computational perspective, it is useful to transform the coefficients between all haplotypes to a covariance between individual genotypes without loss of information [32]. Therefore, the number of IBD coefficients are reduced to n × n. The following equation is used [37], Where G is the genotype relationship matrix, H is the haplotype (gametic) relationship matrix, I n is an identity matrix with rank n, ⊗ is the Kronecker product of two matrices and [1, 1] is a 1 by 2 vector.

G and the position of the QTL
For a given data set, the similarity of marker haplotypes changes with the position of the QTL. Therefore, a different G exists for each putative QTL position on a tested chromosomal region. For each G, the maximum value of the log likelihood and the variance components are estimated for the corresponding position on a chromosome. Therefore, each position has a maximum value for the log likelihood for model parameters at the given QTL position. The most likely QTL position can be estimated as the position with the highest maximum likelihood value across all positions.

Calculation of log likelihood
By assuming multivariate normality of the data with mean vector Xb and variance covariance matrix V, the resulting likelihood can be written and some numerical procedure can be used to estimate the variance components. The log likelihood for the model in (1) can be obtained using the following equation [16].
where ln is a natural log, and | | refers to the determinant of the associated matrices. Note that X is a full rank incidence matrix. The P matrix is defined as, Alternatively, avoiding the inverse of the V matrix, the following equation can be used [31], + ln |A| + ln |G| + n lnσ 2 u + n lnσ 2 q + y Py (6) where N r , N β , N u and N q is the number of records, fixed effects, polygenic effects and QTL effects and C is the coefficient matrix in the MME.

Average information REML
As a method to obtain REML estimates, the average information algorithm (AIREML [11,19]) is used. The average information coefficients are the average of the observed and expected information matrices from the Newton-Raphson and Fisher scoring method (see App. A). The equation for the iterative algorithm is, where Θ is a column vector of variance components (σ 2 e , σ 2 u and σ 2 q ), and k is the iteration round. AI is the average of the observed and expected information matrices which consists of the second derivatives of the log likelihood function with respect to the variance components, which can be written as, where A * = ZAZ and G * = ZGZ .
∂L ∂Θ is a column vector with first derivatives of the log likelihood function with respect to each variance component, which is, Py.
The first derivatives for each of the three VC of the QTL model are, In general cases, the use of the MME (3) is very efficient to obtain AI coefficients and the first derivatives because one can avoid inverting matrix V (see App. B and C). However, in fine-mapping, the QTL part in the MME becomes dense, substantially reducing the computational efficiency of this procedure.

Comparison of the direct and MME method
The AI matrix (8) and the first derivatives (9) can be obtained in two different ways. One is to directly compute the inverse of the V matrix, and then the P matrix (direct method). The other is to solve MME equations and use matrix products (MME-based methods, see App. B). In the direct method, the main part of the computations is to invert V, needed for matrix P for each REML iteration round (6). The dimension of the V matrix is N r ·N r and matrix inversion requires order (N r ) 3 computations. However, the partial inverse of the coefficient matrix corresponding to polygenic and QTL effects is necessary in the MME-based methods. The dimension of the coefficient matrix corresponding to polygenic and QTL effects is (N u + N q ) . (N u + N q ) and partial inversion requires computations slightly more than order (N u ) 3 + (N q ) 3 . In a fine-mapping model where G is a dense matrix, the direct method is expected to be more computationally efficient unless the number of records is far greater than the number of animals. Note that the number of animals is often much larger than the number of records with genotypic data. For example, in a pedigree spanning several generations, often only the last one or two generations would be phenotyped and genotyped for mapping studies. Another advantage in the direct method is to easily fit multiple QTL simultaneously. This is because the dimension of V (2) is not changed by increasing the number of QTL. In contrast, the dimension in the MME (3) rapidly increases with the number of QTL,

Simulation study
One hundred generations of a population with an effective size of 100 was simulated for 11 markers and a QTL. In each generation, the number of male and female parents was 50 and their alleles were transmitted to descendents based on Mendelian segregation using the genedropping method [26]. Parents were randomly mated with a total of 2 offspring for each of 50 mating pairs. For the QTL, one of the base alleles surviving with a frequency of more than 0.1 and less than 0.9 was randomly chosen and treated as favourable with effect α compared to other QTL alleles in generation 100. The number of base alleles in each marker locus was 4 and starting allele frequencies were all at 0.25. The marker alleles were mutated at a rate of 4×10 −4 per generation [5,7,41]. Therefore, this historical population would have an equilibrium distribution of alleles in all marker loci and LD between the QTL and closely flanking markers. Note that pedigree information is available only from generation 101 onwards.
The computational efficiency of the direct method and the MME-based method was investigated. Analyses were carried out for a pedigree spanning two generations (generation 100 ∼ 101), 5 generations (100 ∼ 104) and 10 generations (100 ∼ 109). By default, random selection and mating was carried out, resulting in many marriage loops and inbreeding loops, therefore a complex pedigree. Note that even with a pedigree of two generations, complex marriage loops heavily affect the covariance structure (i.e. reducing sparsity of the inverse of the covariance matrix). For a simple pedigree with each parent having one mate only, marriage and inbreeding loops do not exist. We generated 50 unrelated full sib families each with two progeny and compared with a complex pedigree. For a fair comparison between results, marker genotypes and phenotypic values were only available for animals in the last two generations. Phenotypic values were simulated as y i = µ + α i + u i + e i . The mean of population (µ) was 100, values for u i were drawn from N(0, Aσ 2 u ) with σ 2 u = 25, and values for e i were from N(0, σ 2 e ) with σ 2 e = 50. The favorable QTL allele had an additive value of 7 (α 0 = 0 and α 1 = 7), therefore, QTL variance ranged from 8.8 to 24.5 with V QT L = 2p(1 − p)α 2 [8]. To evaluate the effects of marker densities on computational stability and efficiency, eleven markers were positioned at 10, 1 or 0.1 cM intervals. Table I shows the computing time averaged over 10 replicates per REML iteration for each method with a general complex pedigree spanning 2, 5 or 10 generations. The results show that the computational effort per iteration round of the direct method is lower in all cases than that of the MME method. When the number of animals and the number of records are the same (e.g. N r = N u = N q = 200), the computing time of the direct method is around 1.5 ∼ 2 times lower than that of the MME method. With a pedigree spanning 5 generations where N r = 200 and N u = N q = 500, the direct method is about 50 times faster than the MME method. With a pedigree of 10 generations with N r = 200 and N u = N q = 1000, the direct method is around 270 times faster than the MME method. As expected, the direct method performed at a similar computational speed regardless of the number of non-zero elements in the MME matrix while the computing time of the MME method rapidly increases with a larger number of non-zero elements.

Computational efficiency for the MME and direct method
With a general complex pedigree, the inverse of G is very dense even in linkage mapping (e.g. 76% of non-zero elements in a pedigree spanning two generations). Therefore, a sparse matrix technique is not useful. With a more complex pedigree, the proportion of non-zero elements increases (95% and 97% for a pedigree of 5 and 10 generations). This may explain a smaller difference of computing time between LDL mapping and linkage mapping with a more complex pedigree. Table II shows the number of non-zero elements and computing time both for a complex pedigree spanning two generations and a simple pedigree (of 50 unrelated full sib families). Although the number of animals are the same (200), the number of non-zero elements in the inverse of G is much smaller for a simple pedigree than for a complex pedigree when using linkage mapping. In this case, a sparse matrix technique is very useful and the computing time of the MME method is lower than that of the direct method (e.g. for linkage mapping with 50 full sib families).  Table III shows the proportion of replicates having numerical problems to obtain the log likelihood and variance components when LDL mapping is carried out. Note that if G is non-positive definite, a bending algorithm is used to ensure positive definiteness for G [35]. Thus, positive definite G is used for both methods. When marker density is low (> 10 cM), the log likelihood and parameters are estimable in all replicates for both the direct and MME method. When marker density is higher (1 cM or 0.1 cM), the MME method often faces numerical problems, i.e. ln |C| in (6) cannot be obtained, therefore, ln L (6) cannot be estimated. This is probably due to the fact that very high marker density increases the likelihood of the coefficient matrix to be singular. The G matrix is an explicit part of the MME, and even though this matrix is bent to become positive definite, the Gaussian elimination procedure could still face very small pivotal values and there is a good chance of negative values for determinants. These problems rarely or never occur for the direct method implementing the V that is the sum of all covariance matrices.

Comparison with a standard VC software 'ASReml'
Table IV compares the computing time per iteration for the direct method with that for ASReml [12]. In LDL mapping, the direct method is much more efficient than ASReml especially when using a complex pedigree spanning 10 generations. In linkage mapping alone, the direct method performs better with a complex pedigree, however, with a simple pedigree, ASReml is more efficient. It is noted that the performance of ASReml is similar to that of our MME method (Tabs. I and II). This is because both programs use the extended MME to obtain AI coefficients and the inverse of the coefficient matrix with a sparse matrix technique (see App. B). Optimal ordering of sparse structures and utilization of the reordering matrix [6] may be more efficiently optimised in ASReml than in our MME method, making ASReml perform better than Table IV. Comparison of computing time (s) averaged over replicates to estimate variance components between the direct method and 'ASReml'. the MME method for a simple pedigree. However, in general, both programs perform similarly in that the computational effort of the MME-based procedure is high with LDL mapping and complex pedigrees whereas it is much lower with a simple pedigree structure in linkage mapping.

DISCUSSION
This study presented a REML procedure suitable for fine-mapping of QTL with a complex pedigree. Because the coefficient matrix is dense with LDL mapping and a complex pedigree, MME-based methods are less computationally efficient and sparse matrix techniques are not very useful. Besides, with closely linked multi markers in fine-mapping (e.g. marker spacing <∼ 1 cM), MME-based methods could face numerical problems because of dependency or near-singularity in the MME. These problems were not observed in the direct method.
It is common that genotypic and phenotypic observations are available only on relatively few animals from the last few generations in an entire pedigree (genotypes and phenotypes for ancestors are missing). In such situations when the number of animals in the pedigree is similar or greater than the number of available observations, the direct method is computationally efficient since the V describes the variance covariance structure between observations, taking into account all the ancestral relationships.
When multiple QTL are involved in phenotypes of a trait, it is useful to analyze a number of QTL simultaneously [18,20,29,42,43]. It is straightforward to simultaneously include a number of QTL positions in the model. The computing time for a multiple QTL analysis is not much different from that of a single QTL analysis if the direct method is used; the dimension of the V is not changed and the computing time for inverting the V is the same in the multiple or single QTL model. However, in MME-based methods, the computing time rapidly increases with a larger number of QTL positions fitted since the dimension of MME linearly increases with additional QTL positions included in the model. For example, with a complex pedigree of two generations, computing time per iteration is 0.11, 0.12 and 0.13 (s) for the direct method and 1.6, 4.7 and 19 (s) for ASReml when fitting 2, 3 and 5 multiple QTL simultaneously using LDL mapping. These results were expected given (2) and (3).
Since estimates were based on the same likelihood equation (7), theoretically, AI coefficients and first derivatives should have the same value in each method if starting value and convergence criteria are the same. In this study with a complex pedigree of two generations, the average number of iterations over replicates was 4.4 (±0.27) for the direct method and 6.4 (±0.27) for ASReml. This difference was probably due to the fact that the first and second derivatives were estimated from two different coefficient matrices (e.g. V matrix and MME) and the computational procedure was different. The small number of iterations for the direct method is probably due to the fact that the procedure is more stable, therefore quicker to reach convergence than the MME-based method (see Sect. 3.2). However, the estimated variance components and maximum likelihood for the direct and ASReml agreed well when the procedure was completed without numerical problems (results not shown).
One could consider using IBD coefficients between haplotypes rather than between animals, which can make the inverse of the matrix (H) more sparse. However, this would be advantageous only if the proportion of zero elements is much higher in H than in G. This is not the case in LDL mapping in which IBD coefficients between all haplotypes (of base animals and descendents as well) are non-zero elements. The inverse of the H matrix is a part of MME. The MME must be inverted every iteration. Unless the proportion of zero elements is very large in the MME, inverting MME is computationally heavy whether partial matrix theory with a recursive method (e.g. [9,12]) is used or not. This is because all non-zero elements must be involved in inverting MME. As a matter of fact, the H matrix has order 2N q × 2N q which makes MME bigger, e.g. the dimension of MME with the H matrix is ∼ (N u + 2N q ) × (N u + 2N q ). However, the dimension of MME with the G matrix is ∼ (N u + N q ) × (N u + N q ). In linkage mapping with a relatively simple pedigree, the proportion of zero elements is much higher in H than G. This results in a very sparse structure in the inverse of H and the MME. In this case, zero elements can be skipped in the computation (using sparse matrix techniques), therefore a considerable proportion of elements does not have to be involved in inverting MME (saving computational cost). However, in LDL mapping, H and H −1 do not have any zero elements. Therefore, more than 2N q × N q elements must be involved in inverting MME. Note that the number of elements reduces to N q × N q when using G, which is therefore computationally more efficient.
In the MME method, we used the package AMD version 1.1 [2] for an optimal ordering of sparse matrices. We found this procedure very useful and dramatically reducing the computing time if and only if there was a considerable proportion of zero elements in the MME (e.g. linkage mapping with simple pedigree). However, when LDL mapping was used, the computing time was not changed. Note that the optimal ordering had been already done before the iteration started, the time for the optimal ordering was not included in the computing time per iteration.

CONCLUSION
The direct method is generally suitable for fine-mapping of QTL with closely linked markers and a complex pedigree where genotypes and observations are available for the last few generations. Efficient algorithms also make use of proper statistical testing techniques such as permutation testing [4] in fine-mapping more feasible. Only when linkage mapping is applied with a simple pedigree and sparse marker spacing, will the MME method have a similar efficiency. In addition, the direct method has the potential to easily accommodate multiple QTL because the dimension of V is not affected by the number of QTL.
all components (i = u, q and e). The working variates for our QTL model are calculated as, whereû andq are solutions from the MME, andê = y − Xb − Zû − Zq.
The matrix P is a projection matrix transforming the observation vector (y) into residuals (e.g.ê = Py). Furthermore, PW e is the vector of residuals obtained when W e is taken as the observation vector instead of y. In this manner, y PPPy can be obtained as the scalar product of the scaled residual vector (W e ) and the new residual vector (PW e ) obtained when W e is taken as the observation vector, i.e. y PPPy = W e PW e . All elements in the AI matrix can be obtained in the same manner.

Use of Gaussian elimination on extended MME
Gilmour et al. [12] used an extended MME with Gaussian elimination to estimate the elements of the AI matrix. We construct an M matrix including the QTL term as an extension of the MME in (3).
After performing Gaussian elimination, the first row and first column, M (1,1), equals y Py [14]. If y is replaced by W e from (A3c), then M (1,1) after Gaussian elimination equals y PPPy that is one of the elements in the AI matrix. If y is replaced by W q from (A3b), then M (1,1) after Gaussian elimination equals y PG * PG * Py that is another element in the AI matrix. All elements in the AI matrix can be calculated in the same manner.