Solving animal model equations through an approximate incomplete Cholesky decomposition

Summary - A general strategy is described for the design of more efficient algorithms to solve the large linear systems Bs = r arising in (individual) animal model evaluations. This strategy, like Gauss-Seidel iteration, belongs to the family of &dquo;splitting methods&dquo; based on the decomposition B = B* + (B - B*), but, in contrast to other methods, it tries to take maximum advantage of the known sparsity structure of the mixed model coefficient matrix: B* is chosen to be an approximate incomplete Cholesky factor of B. The resulting procedure requires the solution of 2 triangular systems at each iteration and 2 readings of the data and pedigree file. This approach was applied to an animal model evaluation on 15 type traits and milking ease score from the French Holstein Association with 955 288 animals and 4 fixed effects, including group effect for animals with unknown parents. Its convergence was compared with a standard iterative procedure.

Summary -A general strategy is described for the design of more efficient algorithms to solve the large linear systems Bs = r arising in (individual) animal model evaluations.
This strategy, like Gauss-Seidel iteration, belongs to the family of &dquo;splitting methods&dquo; based on the decomposition B = B * + (B -B * ), but, in contrast to other methods, it tries to take maximum advantage of the known sparsity structure of the mixed model coefficient matrix: B * is chosen to be an approximate incomplete Cholesky factor of B. The resulting procedure requires the solution of 2 triangular systems at each iteration and 2 readings of the data and pedigree file. This approach was applied to an animal model evaluation on 15 type traits and milking ease score from the French Holstein Association with 955 288 animals and 4 fixed effects, including group effect for animals with unknown parents. Its convergence was compared with a standard iterative procedure. genetic evaluation / animal model / computing algorithm / type trait Résumé -Résolution des équations du modèle animal à l'aide d'une décomposition de Cholesky incomplète et approchée. Une stratégie générale est décrite pour l'obtention d'algorithmes plus efficaces dans le but de résoudre les grands systèmes linéaires Bs = r, caractéristiques des évaluations de type «modèle animal». Cette stratégie, comme l'itération de Gauss-Seidel, appartient à la famille des « méthodes d'éclatement» basées sur la décomposition B = B * + (B-B * ), mais contrairement aux autres méthodes, elle tente de mettre à profit autant que possible la structure creuse (connue) de la matrice des coefficients des équations du modèle mi!te: B * est prise égale à une approximation de la décomposition de Cholesky incomplète de B. La procédure résultante nécessite la résolution de 2 systèmes triangulaires à chaque itération et 2 lectures du fichier de données et de généalogie. Cette approche a été appliquée à une évaluation de type « modèle animal», sur 15 caractères de morphologie et une note de facilité de traite provenant de l'Unité pour la Promotion de la race Prim'Holstein, concernant 955288 animaux et 4 e f fets f i ±>es, y côinpris un effet groupe

INTRODUCTION
In most developed countries and for all major domestic species, joint genetic evaluations of up to a few millions of animals is routinely computed using Best Linear Unbiased Prediction (BLUP). When an individual animal model is used, this supposes the solution of a linear system whose size is larger than the total number of animals to evaluate. Such a task, repeated a few times a year and for several traits of economic importance can be a real challenge, even on the most advanced computers. To reduce this computational burden, algebraic manipulations of the equations have been proposed with 2 main aims: 1), a decrease of the system size by absorption of some effects (eg the permanent environment effects) by use of an equivalent model (eg the reduced animal model of Quaas and Pollak, 1980) or by use of particular transformations (eg the canonical transformation which reduces multitrait evaluations to sets of single trait analyses (Thompson, 1977;Arnason, 1982Arnason, , 1986; 2), an increase in the sparsity of the coefficient matrix (eg Quaas' transformation which makes possible the estimation of group effects for unknown parents only, with no more difficulties than if they were parents (Quaas and Pollak, 1981;Quaas, 1988) or the QP transformation which, when traits are sequentially available, makes the coefficient matrix in a multiple trait analysis much sparser, (Pollak, 1984). Unfortunately, such tools are not always applicable either because they are restricted to special data structure or because they are not always computationally advantageous. Slow convergence is not a real problem for moderate-size research applications, for which general purpose programs are available (eg Groeneved et al, 1990). But it can make routine evaluations prohibitive in the case of large-scale applications. In any case, with or without algebraic manipulation, the linear system is virtually always too large to be solved directly and an iterative solution has to be performed. The algorithms chosen have been in most cases very simple and initially designed for the solution of general problems, without any particular attention to the type of problems animal breeders are dealing with. The most frequently used ones are the Gauss-Seidel, Successive Overrelaxation and Jacobi iterations (Golub and Van Loan, 1983). Unfortunately, for animal models, these can be extremely slow to converge (requiring up to several hundreds of iterations) especially when groups of unknown parents (Quaas, 1988) are added, or when more than one fixed effect is considered (Ducrocq et al, 1990). An important breakthrough in the search for an efficient solution of animal models was the discovery of the possibility to &dquo;iterate on data&dquo;, a strategy proposed by Schaeffer and Kennedy (1986) which avoids the actual computation and storage of the whole coefficient matrix. This can be considered as one of the first uses of the known nonzero structure of the mixed model equations in designing more efficient algorithms.
Indeed, a careful look at this structure gave birth to new approaches for a fast solution of some parts of these equations, in a block iterative context: Poivey (1986) showed that by considering in the inverse A-1 of the relationship matrix, only the diagonal and the terms relating an animal to its parent of a given sex and by correcting accordingly the right-hand side for the other terms of A-1 using solutions from the previous iteration, the resulting system has a very simple and sparse Cholesky decomposition and can be solved directly. Likewise, Ducrocq et al (1990) proposed for the solution of the additive genetic value part of the mixed model equations the use of 2 decompositions of the type described by Poivey (1986)considering first the relationship between an animal and its dam and second between an animal and its sire. They also proposed to absorb the equations for additive genetic values into the equations for fixed effects after correction of the right-hand side for all off-diagonal elements of A-1 (&dquo;pseudo-absorption&dquo;). Convergence was improved, but not as much as might be desired for huge routine applications. The main drawback to such an approach is the rather tedious programming. This paper presents a more general procedure for the design of new algorithms taking maximum advantage of the known sparsity structure of the mixed model equations. An application to a large animal model evaluation is described and its performance is compared with a standard iterative procedure.

MATERIAL AND METHODS
Principles for a new algorithm be the linear system to be solved. If B is very large, system [1] can be solved directly only if B-1 or C, the Cholesky factor of B(C = CC', C lower triangular) is sparse and easy to obtain. If this is not the case, consider B * , a matrix &dquo;close to&dquo; B and whose inverse or Cholesky factor is sparse and easily computed and write [1] as: Then, the following functional iterative procedure can be implemented. At iteration (k + 1), solve: Expressions [3] is very general and is the base of a family of iterative algorithms known as Splitting Methods (Coleman, 1984). If B * is simply a diagonal matrix whose diagonal elements are those of B, [3] is the Jacobi iteration. If B * is the lower triangular part of B, including the diagonal, [3] is the Gauss-Seidel iteration (Golub and Van Loan, 1983).
If the starting value for s is taken to be s(°) = 0, expression [3] leads to: ie, the right-hand side in [4] is updated at each iteration. An even more general iterative algorithm is obtained by mimicking the method of successive overrelaxation (Golub and Van Loan, 1983) where cv is a relaxation parameter, which corresponds to the splitting of B: The next paragraph will illustrate how B * can be chosen in a particular situation. Consider the following animal model: where: y is a vector of observations; b is a vector of fixed effects; a o is an n a -vector of additive genetic values for all animals with and without records; e is a vector of normally distributed residuals with E(e) = 0; X and Z o are incidence matrices.
Assume E(a o ) = Qg where g is an ng-vector of group effects, defined only for the animals with unknown parents. Q is a matrix relating animals in a o with groups (Westell, 1984;Robinson, 1986;Quaas, 1988).
The mixed model equations used to compute Best Linear Unbiased Estimates of b and g and BLUP's of a o can be written (Quaas, 1988): If model [6] includes only one fixed effect (which, for clarity, will be called a herd-year effect), X'X is diagonal with hth diagonal element equal to the number of observations in herd-year h. There is at most one nonzero element per row j of Z'X (or column of X'Z). This nonzero element is in column h and is always equal to 1 when animal j has its record in herd-year h. Z'Z is diagonal with diagonal element equal to 1 for rows corresponding to animals with one record and 0 for animals without record and groups. Finally, if equations are ordered in such a way that progeny precede parents, and such that parents precede groups in a, A * is of the form A * = Ln-1 L' (Quaas, 1988) where L is a (n a + ng) x n d matrix with 3 non zero elements per column. If j s and j d represent the indices of the sire (or the sire's group) and the dam (or the dam's group) of animal j, column j has a 1 in row j (= the jth diagonal element of L) and a -0.5 in rows j s and jd ! n-1 is a (n a x n a ) diagonal matrix with jth element equal to 8 j = 4/(m j + 2) where !rt! is the number of known parents of j(rri! = 0, 1 or 2).
Given this rather simple structure for B, the coefficient matrix in [81, the following choice of B * in [5] is suggested: take B * = T * T * ', where T * is the incomplete Cholesky factor of B, ie, the matrix obtained by setting t ij to zero in the Cholesky factorization of B each time the corresponding element b2! in B is zero (Golub and Van Loan, 1983, p 376). Equivalently, B * = TDT' where T = {t ij } is lower triangular with unit diagonal elements and D is diagonal with positive diagonal elements d!, and T and D are computed using the algorithm sketched in figure 1. The TDT' factorization has an important advantage over the standard Cholesky factorization: it does not require computation of square roots. ' A few remarks need to be made at this stage. First, it is known that in the general case, the incomplete Cholesky factorization of a positive definite matrix is not always possible (negative numbers appear in D ie, the computation of diagonal elements in T * requires square roots of negative numbers). It will be shown that this is never the case here.
Secoild, the coefficient matrix in [8] can be rewritten as: and it clearly appears that column h corresponding to herd-year effect h has n h + 1 nonzero elements: a 1 on the diagonal, and 1/n h on each of the n h rows corresponding to animals with records in herd-year h. Hence, the incomplete TDT' factorization can be applied to the lower right part of the product in !9!, ie, on QPQ' = (Z'Z + oA* ) -Z'X(X'X) -1 X'Z, which is also the lower right part of [8] after absorption of the fixed effect equations.
Third, a strict application of the algorithm in figure 1 would lead to nonzero elements relating mates, as in A * . Given the particular structure of A * = LA-'L', we would like to have these elements being 0 too, such that the lower right part of T has the same nonzero structure as L. However, L is a (n a +ng) x n a matrix whereas the lower right part of T is supposed to be a (n a + n . ) x (n a + ng) square matrix. We will assume (or choose) that the lower right part of T corresponding to groups is dense. Consequently, TDT' is only an approximation of the true incomplete Cholesky decomposition.

Algorithm
The computation of T and D does not require the coefficient matrix B to be explicitly set up. For animal j, b jj in B is equal to: with x i = 1 -1 if j has its record in herd-year hand Xj = 0 otherwise. nh Given the structure imposed on T, the only elements that are nonzero in column j are t a j, a = j s or a = j d where j s and j d are the indices of the sire (or sire group) and the dam (or dam group) or j. Henderson's rules (Henderson, 1976) of construction of A * imply: -Another consequence of the chosen structure for T is that the product t.z&dquo;,t!&dquo;,, in figure 1 is always 0 except when m is the herd-year effect where i and j have their records or m is a progeny of i and j. Since t ij is computed only for i = a = j s and i = a = j d , t im t jm is nonzero only: 1), when j and its parent have their record in the same herd-year; and 2), when j is mated to its own sire or dam. Both events are sufficiently rare to be ignored (as B * is not exact, anyway) and then: The fact that j, or j d may or may not correspond to a group is irrelevant here. It is also essential to notice that t aj need not be stored, as it is easily obtained from d j .
Replacing tjm by [12] in figure 1 and using (10!, we get: For columns corresponding to groups, we have, from the structure of A * : and ti&dquo;,,t!&dquo;! in figure 1 is different from 0 each time m is a progeny of groups i and J.
Therefore, just before undertaking the dense factorization of G where G is the current part corresponding to groups, we have: (minor adaptations for terms corresponding to groups have to be made when the unknown sire and the unknown dam are in the same group). Now, by noting that: and that d j = z j + û8 j > û8 j for animals without progeny (all assumed to have a record), it follows by recurrence that C 1d P J > 0 for all p. Then cp > 0. B dp )  [20] can be condensed in such a way that only 2 readings of the date file are necessary at each iteration. Indeed, algebraic manipulation of these equations leads to the following requirements: The general sketch of the resulting algorithms is given in Appendix 1.

Dealing with several fixed effects
In many instances, the routine animal model evaluations involve more than one fixed effect. To adapt the algorithms to this frequent situation, one can distinguish between on one hand, a vector of fixed effects b with many levels (like the herdyear-(season) effects or the contemporary group effects in many applications) and on the other hand, a vector f of other &dquo;environmental effects&dquo; with fewer levels.

Model [6] becomes:
where X and F are the incidence matrices corresponding to effects in b and f respectively. The resulting system of mixed model equations can be written: A block iterative procedure can be implemented involving at each iteration first the solution of: using Gauss-Seidel iteration and then, the solution of: The block iterative procedure described in Dealing with several fixed effects was implemented, using a relaxation parameter cv = 0.9 in all analyses. To compare this procedure with a standard iterative algorithm, the mixed model equations were also solved using a program along the lines of , where the equations for fixed effects were solved using Gauss-Seidel iterations and the equations for additive genetic values were solved via second-order Jacobi iteration (see  for details). Group solutions were adjusted to average 0 at the end of each iteration, as proposed by Wiggans et al (1988). Indeed, this constraint had very little effect on convergence rate, as the average group effects solutions tended to a value very close to 0 anyway. Several relaxations parameters were used for the second-order Jacobi step. The same convergence criteria were computed in all cases and intermediate solutions were compared to &dquo;quasi-final&dquo; results.

RESULTS
Figures 3 and 4 illustrate for one of the traits -&dquo;rump width&dquo; (o-p = 1.4, h 2 = 0.25) -the evolution of 2 convergence criteria: the norm of the change in solution between 2 consecutive iterations divided by the norm of the current solution vector (both considering elements in a only) and the maximum absolute change between 2 iterations. Rump width was considered as a trait representative of the average convergence rate of the procedure based on the incomplete Cholesky decomposition (hereafter referred to as ICD) for the 16 analyses: the value of the standardized norm of the change between 2 iterations obtained for rump width after 40 iterations was reached after 25 iterations for one trait, 33 to 40 iterations for 10 traits and between 41 and 46 iterations for 5 other traits. For rump width, 200 iterations with ICD and 300 iterations with the standard procedure (GS-J) were carried out and compared to intermediate solutions. The results are summarized in table I. Figure 5 shows the distribution of the changes by class of variation between 2 iterations for ICD.
Figures 3-5 and table I clearly show that convergence was much faster with ICD. Whatever the value of the relaxation parameter used, the evolution of the convergence criteria in GS-J tends to be very slow after a rather fast decline during the first iterations. This phenomenon was mentioned by  and seems to worsen because of the size of the data set and the existence of 3 fixed effects in the model. The fact that ICD does not exhibit this undesirable characteristic may prove its robustness to not so well conditioned problems. For practical purposes, 3 exact figures may be considered satisfactory for a proper ranking of all animals. Starting from 0 and with no acceleration procedure implemented (in contrast with eg Ducrocq et al, 1990), this requirement for all animals was reached for ICD after about 40 to 45 iterations. Even faster convergence may be achieved when starting from solutions of a previous evaluation or by optimizing the value of the relaxation parameter used. Figures 3 and 4

and table I
show that the same convergence requirements are reached in about 150 iterations with GS-J. But the cost of each iteration is not the same: in the case of ICD, genetic value effect by reading a data file sorted by HRC, as proposed in Wiggans et al (1988). This reduces to 2 the number of copies of the data file read at each iteration. The fact that the first 2 fixed effects are not defined within HRC effects or vice-versa prevents any further reduction, as in the case described by Wiggans et al (1988). In conclusion, ICD appears to be at least 3 to 4 times faster than a standard procedure in the particular situation studied here.
For ICD, the main computing requirements in terms of core storage can be divided into 2 independent parts: 1), for the computation of d 7l ( This can be further reduced by noting that the previous vector of solutions for a is required only to compute convergence criteria and that current solutions for a need not to be stored for nonparent animals.

DISCUSSION AND CONCLUSIONS
It has been shown that in the context defined here the approximate incomplete Cholesky factor of the coefficient matrix always exists. However, this does not guarantee that B * will be very close to B. In fact, when a large number of generations are considered, the terms from the true Cholesky factor of B which are ignored in the incomplete factorization may become large and one may wonder whether the proposed algorithm converges at all. Indeed, in the application presented here where pedigrees were traced up to 12 generation back in some cases, the algorithm diverges when no relaxation parameter is used: convergence seemed faster at first but about 300 estimates of genetic value (0.03%) did not stabilize and led to divergence first, of groups of unknown parents and then, of all animals' genetic values. In contrast, for a research run with a smaller data set and tracing pedigrees back for only 4 generations, the same standardized convergence criteria as in the example described here were obtained after nearly 3 times fewer iterations. This illustrates that the discrepancy between B and B * increases with the number of generations considered. A modification of the algorithm in order to make B * closer to B while keeping a similar sparsity structure should be investigated. For example, mate contributions to the inverse of the relationship matrix could be no longer ignored. However, this may not be necessary as it is shown in Appendix 2 that there always exists a relaxation parameter w such that convergence is guaranteed. Quite unfortunately, as for successive over-relaxation and second order Jacobi procedures, the optimal w is data dependent.
The model presented in [6] and which led to the form of the incomplete Cholesky factor described in figure 2 is rather restrictive. In particular, it assumes the existence of only one fixed effect. The section Dealing with several fixed effects showed a way to treat models with more than one fixed effect. For more complex situations, a simple procedure would be to apply the incomplete Cholesky factorization only to the block (Z'Z + aA * ). The remaining part of the system can be solved using a more standard procedure. But then, convergence is likely to be somewhat slowed down. A trade-off is to increase the amount of information available to estimate fixed effects at each iteration by &dquo;pseudo-absorption&dquo; of genetic value equations, as described in Ducrocq et al (1990). Whatever the alternative chosen, it seems quite apparent that faster and more efficient algorithms from a computational viewpoint need to take larger advantage of the known sparsity structure of the mixed model equations than simpler, general-purpose algorithms. Of course, the final practical choice between competing algorithms should take into account other considerations such as programming costs and computer availability. Then, rather specialized algorithms such as the one proposed here may be attractive only for huge routine evaluations based on animal models.

APPENDIX 1
Practical algorithm sketch for the solution of the system B * s = r In the following section, indices related to fixed and random effects (eg, b and a in r b and r a ) will be dropped for clarity. The indices j, j s , j d and h will refer to animal j, its sire (or sire's group), its dam (or dam's group) and the herd-year in which it was recorded. n = {n h } will be the vector of number of records in each herd-year . Define s = ! . Index g will refer to all groups of unknown parents. a Then, the whole algorithm for the solution de B * s = r, including steps [21] to (25!, can be detailed as follows: 1) Initialize r, n, s to zero 2) Read the data file 9 For each j with record: y j , add (wy j ) to r h and r j add 1 to n h 3) Compute v h = r h/ n h for h = 1, ... r h = r hwn h s h 4) Initialize v j as v j = r! for j = 1, ...
Read the data file with progeny preceding parents For each j: 6) Read the data file with parents preceding progeny For each j: 8) Go to (3) until convergence. Note that at each iteration, the computation of Sh , h = 1, ... is completed only after reaching the end of the second data file. Therefore, the update of r in [25] -which requires s h -would make necessary a third reading of the data file. This can be avoided by transferring this part of the updating (-wZ'Xb in [25] to the beginning of the next iteration. This explains the term (v j -w Sh ) in (4).

APPENDIX 2
Proof that the splitting method based on an incomplete Cholesky decomposition of the coefficient matrix always converges The following section shows that there always exist a value of the relaxation parameter w such that the general algorithm proposed in [5] converges for all starting values. For splitting methods, a necessary and sufficient condition for this is -with the notation used in [3] and [5] -that p (I &mdash; wB * -'B) < 1 where p(M) is the spectral radius of M, ie, the largest absolute value of the eigenvalues of M (Golub and Van Loan, 1983;Coleman, 1984). Indeed, if A i is an eigenvalue of B * -1 B, then A i is also an eigenvalue of T * -1 BT * -T where B * = T * T * '. Obviously, this matrix is nonnegative definite, so all its eigenvalues are nonnegative. Let À 1 be the largest one. The eigenvalues of the matrix I &mdash; wB * -1 B are all of the form (I -wA i ) and if one chooses 0 < w < & m d a s h ; then p (I &mdash; wB*-1B) ! 1. Equality holds Al when B is singular, which is indeed the case when groups of unknown parents and another fixed effect are considered simultaneously. A constraint on solutions for groups (or the other fixed effects) can be added to avoid this problem. But is was found that convergence is faster when no constraint is included. As for Gauss-Seidel iteration, it seems that there is a built-in constraint in the iterative system preventing problems from occurring.