Computing inbreeding coefficients quickly

An algorithm for computing inbreeding coefficients (F) for all animals of large populations is described. The technique relies on the subset of the numerator relationship matrix (A) required to compute its diagonal, which contains F. A simple example illustrates that this subset is a very small part of A. inbreeding coefficients / fast algorithm Résumé-Calcul rapide des coefficients de consanguinité. On présente un algorithme pour calculer rapidement les coefficients de consanguinité de tous les anirrcaux, dans une grande population. La méthode fait seulement intervenir le sous-ensemble de la matrice des coefficients de parenté qui est nécessaire au calcul de sa diagonale, qui contient les coefficients de consanguinité. Un exemple simple illustre que ce sous-ensemble n'est qu'une petite partie de la matrice des coefficients de parenté. L'algorithme est précisé sous la forme d'un codage symbolique. coefficients de consanguinité / calcul rapide INTRODUCTION


INTRODUCTION
Wright's (1922) coefficient of inbreeding (F) describes the probability that 2 alleles at any locus are identical by descent. Inbred offspring result from mating two animals which have one (or more) common ancestors. Breeders of livestock perceive inbreeding as being deleterious and consequently try to avoid mating close relatives.
The relationship between prospective mates can be computed to determine the degree of inbreeding of the resulting offspring from such a mating.
For inbred populations, inbreeding coefficients are required to compute A-1 directly using Henderson's (1975) rules. Henderson (1976) showed the relationship between the inbreeding coefficients of an animal's parents and the &dquo;contributions&dquo; * AGBU is a joint venture between the University of New England and NSW Department of Agriculture and Fisheries.
that each animal's pedigree makes to !4 !. For the ith animal with parents j and k, the following contributions are added to A-1 : where d ii = 1.0 -0.25(a!! + a kk ).
Two algorithms are commonly used to compute inbreeding coefficients. Originally they were calculated using the path coefficient method (Wright, 1922). While this method is easy to use for computing F for animals which have few ancestors and are only slightly inbred, it is a very complex method for animals with many common ancestors. A simpler approach is to generate the numerator relationship matrix (A) using the tabular method as attributed to Lush by Emik andTerrill (1949, cited by Hudson et al, 1982). Inbreeding coefficients can be computed from the diagonal elements of A : F i = a ii -1. This paper describes an efficient implementation of the tabular method.

COMPUTING INBREEDING COEFFICIENTS USING THE TABULAR METHOD
To compute A by the tabular method animals in the population must be numbered from 1 to N so that parents precede their progeny. The pedigrees of all animals must be stored in memory. A zero in the pedigree denotes an unknown parent. The upper triangle of A can be computed on a row by row basis in consecutive order working from first to last. Diagonal elements are computed using the formula: The remainder of the row (to the right of the diagonal) can be computed by applying the formula: where p and q are the parents of the jth animal and i < j. If either parent is unknown (p or q = 0) then a oi = 0. When a row is complete the corresponding column in the lower triangle can be completed by symmetry. Because A is symmetric it is only necessary to store the upper (or lower) triangle. The storage required to compute A is proportional to the square of the numbers of animals and so limits the size of the population for which A can be computed.
Using the technique of Hudson et al (1982) memory is only required to store the nonzero elements of A and it can be computed for larger populations. When inbreeding coefficients only are required then it is not necessary to compute A completely, nor all its non-zero elements.

THE RECURSIVE PEDIGREE METHOD
This is an adaptation of the tabular method and depends upon computing the subset of A required to compute its diagonal.
As each animal's pedigree is read, Equation 1 is used to identify the element which describes the relationship between its parents. If the animal's inbreeding coefficient is already known and no new ancestral information is available it is stored. If not, then the element (apq) is placed in the subset (flagged) to be computed. Equation 2 can now be applied to identify those elements required to compute elements already flagged. This requires searching the matrix (upper triangle) from bottom to top and from right to left. As each identified element is found the two elements required to apply equation 2 can also be flagged. Because these two elements lie to the left of the flagged element -animals are numbered so that parents precede their progeny -searching A in this manner results in all the elements required to compute the diagonal of A being identified. The flagged subset can now be computed by applying equations 1 and 2 starting with the first row, and proceeding sequentially to the last. Computation of each row of A can be considered in 3 parts. Firstly, elements on the left of the diagonal have already been computedthey were on the right of the diagonal in earlier rows.
Secondly, the off-diagonal element (a jk ) required to compute the diagonal element (a ii ) has already been computed (j, k < i) and a ii can be computed by equation 1. Lastly, the flagged elements on the right of the diagonal can be computed using equation 2. Table III illustrates the recursive pedigree method for the sample population.
Firstly, off diagonal elements from (1) are identified and flagged. Because animals 1, 2 and 3 have at least 1 unknown parent they are not inbred and 1 is stored in the corresponding diagonal element. The diagonal elements of animals 4, 5, 6 and 7 are represented by the letters P, Q, R and S respectively. The offdiagonal elements required to compute them are represented by the same letter in lowercase. Other elements identified by t, u, v and w are required to compute p, q, r and s.
As the rows are processed, elements that must be computed before the current element can be computed are identified and flagged. a S6 (s) is the first flagged element to be found. It requires that a 35 (t) and a 45 (u) are known. Table IV illustrates the identification and flagging of new elements as they were found using the search procedure described above.

COMPUTATIONAL CONSIDERATIONS
To take advantage of the saving in space offered by this method it is necessary to use a sparse storage technique to store the subset of A. In this case, elements in the subset to the right of the diagonal in each row of A can be stored as a linked list.
Elements in a linked list are not stored in memory in any particular order but are linked together in some sequence by pointers. There is a pointer to the location in memory of the first element in each row. Each element in the list has an associated pointer to the location in memory of the next element in the sequence. The pointer associated with the last element in each row (list) is 0. To find any element in such a list it is necessary to &dquo;follow&dquo; the pointers until the required element is found. As new elements are stored in a list the pointers are adjusted to maintain the integrity of the sequence (for a detailed explanation of linked lists see Knuth, 1968).
To expedite the searching phase, the elements in each row should be linked in reverse column order (from highest to lowest, fig la). After the subset has been identified the pointers can be adjusted so that the rows are linked in column order (fig lb). For this application it is desirable to have a pointer (called RECENT in the Appendix) to the most recently added element as well as a pointer (ROW) to the first element in each list. As only the upper triangle is stored, elements on the left of the diagonal appear in the column above the diagonal. These elements must be added to the lists holding the rows above the diagonal. Because these new elements will generally be adjacent to the previous addition to that row using this pointer (RECENT) can avoid repeated searching through the lists. Similarly, repeated searching for the same new elements (arising from animals having the same parent) within a row and searching the row to the right of the current element should also be avoided.
Diagonal elements can be stored in a separate vector (DIAG). This can be used to point to the physical location in the linked list of the element required to compute equation 1 until the diagonal element is determined. Because the theoretical maximum for a diagonal element is 2.0, the first 2 elements in the linked list must be reserved. Thus a value greater than 2 in the diagonal vector is a pointer, one less than 3 is a diagonal element. As each diagonal element is determined it can be stored in this vector. Subsequently, inbreeding coefficients can be derived from this vector.
An efficient method for finding elements on the left of the diagonal for any row is required. One way of doing this is to adjust the pointers so that after the elements in each row have been computed the elements on the right of the diagonal are relinked on a column basis in reverse row order (fig Ic). This requires a vector (COLUMN) which points to the most recently added element of each column.

SIMULATION STUDY
For this example a population with the following characteristics was simulated. Starting with a base population of 20 sires and 500 dams, 40 years of progeny were generated. The adult population was held constant at 20 sires and 500 dams. Sires were mated randomly to the dams, all of which had 1 calf. After each year the oldest half of the sires were replaced and the oldest quarter of the dams were culled. Yearlings (offspring from the previous &dquo;year&dquo;) were randomly selected to replace the culled animals. As a result, some animals had up to 20 generations of ancestors.
Three sets of inbreeding coefficients were computed for the animals, viz after each group of 10 years assuming that no inbreeding coefficients had been calculated on this population before; -for each decade assuming that inbreeding coefficients had been calculated in the previous year ie only inbreeding coefficients for the latest group of calves were unknown; -for parents only when no inbreeding coefficients were known. A detailed algorithm used to compute inbreeding is shown in the Appendix. These computations were carried out on a GOULD NP1 computer and the results from this are shown in table V.
For the population of 20 520 animals, the subset of A included 516 435 elements out of a total of 421070 400 (0.123%). The population size which required computation of the largest proportion of A (0.146%) was 10 520. When inbreeding coefficients were known for all but the most recent group of calves, or were only required for parents, a significantly smaller proportion of A was required.

DISCUSSION
The results in table V illustrate how very small a subset of A is required to compute its diagonal for the simulated population. Although the subset is small as a proportion of A, more than 6 Mbytes of memory were required to store the largest subset (516 435 elements). Table Vb illustrates the computing resources required when inbreeding coefficients are known for all but the most recent group of calves. As this technique can make use of previously computed inbreeding coefficients, problems that are too large for a computer can be divided into a series of smaller problems. To avoid repeated computation of the same elements of A, subsets should not be chosen on a chronological basis but rather on a related group, herd or family basis. The technique could be readily adapted to compute any subset of A that is of interest.

ACKNOWLEDGMENTS
The financial support of the Australian Meat and Livestock Corporation is gratefully acknowledged, as are helpful comments and encouragement .from colleagues and referees.  Subroutine RESERVE reserves space in the linked lists for the elements in the subset while maintaining the integrity of the lists. A check to ensure that there is sufficient space to complete the analysis could be included. RESERVE requires access to the global variables LLSIZE, LASTEL, LARGE, ROW, RECENT, NEXT, JAY and AIJ.