Marker assisted estimation of breeding values when marker information is missing on many animals

Two methods are presented that use information from a large population of commercial animals, which have not been genotyped for genetic markers, to calculate marker assisted estimates of breeding value (MA-EBV) for nucleus animals, where the commercial animals are descendants of the marker genotyped nucleus animals. The first method reduced the number of mixed model equations per commercial animal to one, instead of one plus twice the number of marked quantitative trait loci in conventional MA-EBV equations. Without this reduction, the time taken to solve the mixed model equations including markers could be very large especially if the number of commercial animals and the number of markers is large. The solutions of the reduced set of equations were exact and did not require more iterations than the conventional set of equations. A second method was developed for the situation where the records of the commercial animals were not directly available to the nucleus breeding programme but conventional non-MA-EBVs and their accuracies were available for nucleus animals from a large scale (e.g. national) breeding value evaluation, which uses nucleus and commercial information. Using these non-MA- EBV, the MA-EBV of the nucleus animals were approximated. In an example, the approximated MA-EBV were very close to the exact MA-EBV. © Inra/Elsevier, Paris


INTRODUCTION
Fernando and Grossman [3] presented a method to calculate the best linear unbiased predicted-estimates of breeding values (BLUP-EBV) using the information that DNA markers are linked to a quantitative trait locus ((aTL). Goddard [4] extended the method to the use of flanking marker information.
Although, these methods are relatively easy to use, the number of equations rapidly becomes large when there are many animals. Even with only one marked QTL, there are three equations per animal: two estimating both gametic effects at the QTL and one for the polygenic effect (the joint effect of the background genes). Every extra marked QTL increases the number of equations per animal by two. Moreover, when the flanking markers are close to the QTL, the probabilities of double cross-overs become small and the equations close to singular, and thus difficult to solve [13]. Meuwissen and Goddard [8] avoided these singularity problems by assuming a negligible probability of double recombinations within the flanking markers.
As genetic markers become more frequently used in comnrercial breeding programmes, the situation will commonly arise where only a small fraction of the animals have been genotyped. The phenotypes of non-genotyped animals may, however, be vital to the calculation of the effects of marked QTL as, for instance, in a granddaughter design where only bulls are genotyped but only cows are phenotyped. Calculation of two QTL effects for each marker for many non-genotyped animals is wasteful and may inhibit the implementation of marker assisted selection. Hoeschele [7] greatly reduced the number of equations in very general population structures, but this method is complicated and therefore difficult to apply in practice, mainly because it eliminates as many equations as possible. A more simple breeding structure such as a genotyped nucleus and non-genotyped commercial population structure can greatly simplify the elimination of equations. In some situations the organisation controlling the nucleus breeding programme may not have access to the records on commercial animals but may still need to include this information in the calculation of marker assisted EBVs (MA-EBVs) on nucleus animals. The aim of this paper is to present a method that reduces the number of marker assisted breeding value estimation equations in a population where the nucleus animals are marker genotyped and the commercial animals are not genotyped. The reduction mainly eliminates the equations of non-genotyped animals. Furthermore, an approximate method of calculating MA-EBVs on nucleus animals is presented, which uses only the conventional non-MA-EBVs of nucleus animals from a national genetic evaluation to represent the data from commercial animals.

Reducing the number of equations
The population was split into nucleus and commercial animals. Here, the definition of a commercial animal is: an animal that is not marker genotyped and has no descendants that are genotyped. The nucleus animals are all marker genotyped animals plus their ancestors. The method will still work if a commercial animal is erroneously considered as a nucleus animal, although the number of equations will not be reduced for such an animal. The method will fail, however, if a nucleus animal is erroneously considered as a commercial animal. For simplicity we ignored fixed effect equations, but including them is straightforward. Similarly, we assumed here only one marked QTL, since the inclusion of more marked QTL is straightforward. Partitioning the population into nucleus and commercial animals, the model can be written as: where y i (y 2 ) is the vector of phenotypic records of nucleus (commercial) animals; a l (a 2 ) is the vector of polygenic effects of nucleus (commercial) animals; q l is the vector of marked QTL effects of the nucleus animals; q 2 (q 3 ) is the vector of paternally (maternally) derived QTL effects of the commercial animals; e l (e 2 ) is the vector of environmental effects of nucleus (commercial) animals; Z l is the incidence matrix of polygenic effects of nucleus animals; Z 2 is the incidence matrix of QTL effects of the nucleus animals; and Z 3 is the incidence matrix of polygenic effects of the commercial animals. Note that Z 3 is also used as the incidence matrix of the paternally and of the maternally derived QTL effects of the commercial animals, because these effects have the same incidence matrix as the polygenic effects of the commercial animals. The Z 2 matrix can differ substantially from Z l when the inheritance of QTL effects is traced from parent to offspring by the markers [8]. In order to solve the BLUP equations, we need the inverses of the (co)variance matrix of [a' a'] and of [q' q' q'], which are obtained using the methods of Quaas [10,11! and Fernando and Grossman !3!, respectively.
In order to reduce the number of equations of the commercial animals, the 'reduced animal model' approach of Quaas and Pollak [12] was adopted. This approach was also used by Cantet and Smith [2] and Bink et al. [1] to absorb the equations of non-parents in a model with QTL and polygenic effects. We re-write equation (1) as: where U2 ! a z + q z + q 3 . For the mixed model equations that follow from equations (2), we need the inverse of the (co)variance matrix of [a' q! 1 U/ 2 1. Following Quaas [10, 11!, we will assume that the animals within the nucleus and within the commercial are sorted from old to young. Next, we write every element of [at 1 qf 1 uf 2 in terms of its 'parental' elements plus an independent deviation from the 'parental' elements, where 'parental' elements denote the ai, q l or U2 elements of the parents of the current animal: where P is an indicator matrix of the parents of a i , such that P ij = 0.5 if animal j is a parent of animal i, and otherwise P ij = 0; Q2! = B2! if QTL, is with probability O ij a direct copy of QTL,, where QTL J was one of the two 'parental' QTL alleles of QTL,, with 'parental' denoting that QTL j was involved in the Mendelian sampling process that resulted in QTL I , and for all other i and j: Qij = 0; Rij = 0.5 if nucleus animal j is a parent of commercial animal i, and otherwise Ri! = 0; Si! = 0.5 if one of the two QTL of commercial animal i is a direct copy of the nucleus gamete j with a probability of 0.5 (the probability is always 0.5 because commercial animals are not marker genotyped), otherwise S j = 0; T ij = 0.5 when commercial animal j is a parent of i, otherwise T ij = 0.
The elements of E l , E2 and E3 are all independent, unless the markers are not completely informative, i.e. it is not always possible to trace which marker is inherited from the sire and which from the dam. In the latter case, the elements of E2 may be correlated and the method of Wang et al. [14] can be used to set up (the inverse of) the (co)variance matrix of the QTL effects of the nucleus animals. The calculation of the (co)variance matrix of the QTL effects of the nucleus animals becomes even more complex when ancestors of nucleus animals have missing marker genotypes; however, for this situation, Wang et al. provide an approximate method to set up the (co)variance matrix of QTL effects. We will ignore these complications of obtaining the inverse of the (co)variance matrix of the QTL effects of the nucleus animals here, because the method that is used to obtain the inverse of this (co)variance matrix does not affect the setting up of the inverse of the (co)variance matrix of the u 2 equations. This is because the situation of uninformative marker information and ungenotyped ancestors of genotyped animals did not occur within the group of commercial animals, since none of the commercial animals were genotyped.
Let the variance of the polygenic effects be denoted by Q a and the variance of the QTL effect of one gamete be denoted by o, q, 2 then their variances are: where D l is a diagonal matrix with D iii equal to Q a, 0.75 Q a or 0.5 Q a when no, one or both parents are known of nucleus animal i, respectively; D 2 is a diagonal with D 2ii equal to a) or 2Bi!(1 -0g )a) when gamete i is a founder gamete or is derived from gamete j with probability Bi! !3!, respectively; and D 3 is a diagonal with D 3ii equal to Q u, 0.75 Q u or 0.5u!, when no, one or both parents of commercial animal i are known, respectively, where 0 ' = a2 + 2 Q q.
Next we solve equation (3) for v' = [a' q' u'] to obtain: Taking variances on both sides yields, Finally the inverse of Var(v) is G-1 which is obtained as: Similar to Quaas (10, 11!, the following rules can be found to set up G-1 . 1) For the polygenic effects of the nucleus animals part of G-1 : follow Quaas' rules (multiply by I/or2 to account for the different variances in different parts of G-1).
2) For the QTL effects of the nucleus animals part of G-1 : follow the rules of Fernando and Grossman [3] (multiply by 1/ Q9 ). If there are no equations for the QTL alleles a i , a 2 , a 3 and/or a 4 the additions to their positions are cancelled. When all alleles a i , a 2 , a 3 and/or a 4 have no equations, the additions simplify to the original rule of Quaas [10,11].
As can be seen from the above additions, the commercial animals add the same values as in Quaas' rules to the elements of their parents, but if the parents are nucleus animals these values are added to their polygenic and QTL effects.
After setting up the G-1 matrix, we can set up and solve Henderson's [6] mixed model equations: and Q e is the environmental variance. These equations will yield exact solutions of the estimates of polygenic (a l ) and QTL effects (q l ) of the nucleus animals, and of the sum of the polygenic and QTL effects of the commercial animals (u 2 ) (unless approximations have to be applied for setting up the (co)variance matrix of the QTL effects of the nucleus animals owing to missing marker genotypes of ancestors of nucleus animals). A small example of the calculation of the G-1 matrix is given in Appendix A.

The use of conventional EBV to predict MA-EBV
In the case of cattle breeding schemes especially, the commercial animals may not be owned by the breeding organisation and this organisation may not have access to the phenotypic information of the commercial animals. However, BLUP breeding value estimates and their accuracies may be available from a national breeding value evaluation. We would like to use this information to improve the accuracy of the marker assisted breeding value estimates in the nucleus. This problem is similar to that of incorporating AI sire evaluations into intraherd breeding value predictions by Henderson (5!, and our approach will therefore also be similar to that of Henderson. The first step is to absorb the commercial animal equations into the nucleus equations, which will reveal which information from the commercial animals is needed. The full mixed model equations are [writing out equations (8) and (5)] see (8bis) in the following page.
Absorption of the commercial animal equations (u z ) yields equation (9) Note also that the additions R'BR and R'b are the same as those in the MA-EBV equation (9). Hence, if we obtain approximations for R'BR and R'b in equation (10) we can approximate equation (9). We know the EBV and their accuracies, r i , which result from equation (10). Let the matrix C = (M + R'BR)-1 , then the diagonal elements of C are: where A = (7 e 2/(72 U. Now it is assumed that R'BR can be approximated by a diagonal matrix A, i.e. we find a diagonal matrix A such that: where only the diagonal elements of C are known. The diagonal elements of A, !ii, yield the effective number of records that should be added to a nucleus animal i, such that the accuracy of its EBV is equal to the accuracy when the commercial animals were included. A similar effective number of records was derived by Henderson !5!, but in his situation the animals within the herd did not contribute significantly to the EBV of the sire. Here, we used the following iteration scheme to disentangle the information that came from the nucleus animals, which is represented by the matrix M, and the information that comes from the commercial animals, which is represented by the matrix A. Newton's iteration algorithm was used to calculate the diagonal matrix Given the approximated mixed model coefficient matrix of the nucleus animals after absorption of the commercial animals, M + A, an approximation of the right hand side of equation (10), is obtained from: where ARHS is an approximation of the term R'b in equation (10). Since, EBV and Ziy l are known, ARHS can be calculated from the above equation.
Next we will calculate the absorbed coefficient matrix of the marker assisted mixed model equation (9), and their right hand side. From the previous section we concluded that we could approximate R'BR I by D ii , where R i is the ith column of R. The vector R i indicates which commercial animal is an offspring of nucleus animal i by containing a 1/2 if the commercial animal is an offspring of i or a 0 otherwise. If a l is one of the QTL alleles of nucleus animal i, the a l th column of S, S al , contains a 1/2 if the commercial animal is an offspring of animal i. If every nucleus animal has two unique QTL alleles, as in the model of Fernando and Grossman !3!, it follows that R i = S al = S a2 , with a l and a 2 denoting the QTL alleles of animal i. Hence: and, similarly, where a x denotes a l or a 2 . Thus, the addition D.ti to the diagonal of the polygenic equation of the nucleus animal i should also be added to the offdiagonal of the polygenic equation i and QTL allele equation a l and a 2 ; to the diagonal of both QTL equations a l and a 2 ; and to the off-diagonal elements of a l and a 2 . And the term ARHS I should be added to the right hand side of the equation of animal i, and of the QTL equations a l and a 2 . In conclusion, to account for the information of commercial animals, for every nucleus animal i we add to the coefficient matrix of the MA equations of the nucleus animals that ignores information of commercial animals: where a l and a 2 denote the equations for the QTL effects of animal i; and we add to the right hand side of these nucleus equations for every nucleus animal i: Thus, the additions (11) and (12) result in an approximation of the marker assisted nucleus equations (9) using only the EBV and accuracies to account for the information of commercial animals.
The equality of R i to S a£ requires that the QTL allele a!. is only present in one animal i. However, in the model of Meuwissen and Goddard !8!, QTL alleles might be traced from parent to offspring with certainty, because flanking markers were used and double recombinations were ignored. In this model different animals may carry the same QTL allele a!, and S ax = Ei, 7 Ax Ri, where the summation is over all animals i that carry QTL allele a x . This complication of S ax being the sum of several R i terms does not affect the additions in equations (11) and (12) which are due to terms that are linear in S ax , because the correct additions are still performed as all the animals contributing to S ax are evaluated. However, the additions to the QTL allele * QTL allele block of equation (11), are due to second order terms of S ax , which implies that more off diagonal terms of the absorption matrix B have to be added. We will ignore these extra off diagonal terms of B, which are due to the second order terms of Sa,!, and perform the additions as described in equation (12), which adds another level of approximation to this method.
In the above, the fixed effect structure of the nucleus animal data was ignored, but can be accounted for by absorbing the fixed effect equations into the equations of the nucleus animals, i.e. the matrix M would be the conventional mixed model coefficient matrix after absorption of fixed effects. Alternatively, if absorption of fixed effects is computationally too demanding, the following steps can be undertaken to account for fixed effects: step 1: approximate O.L i as in the forementioned Newton algorithm, except that (M + ![1']) -1 is replaced by the animal * animal block of: where X is the design matrix of the fixed effect structure of the nucleus data; step 2: if the fixed effect solutions are not available from the national breeding value evaluation, solve for the fixed effect solutions, /3, using: step 3: calculate ARHS using: The above methods that account for fixed effects assume that different fixed effects are estimated in the nucleus than in the commercial animals, which will be the case in most situations. A brief example of the use of non-MA-EBV in the estimation of MA-EBV of nucleus animals is given in Appendix A.

Simulation
A data set was simulated to test whether the reduced number of marker assisted mixed model equations indeed yielded the same solutions as the original full set of equations, to compare the number of iterations needed to solve the reduced set of equations and the original equations (which might be more diagonally dominant), and to test the approximate absorption of all equations for commercial animals. The data set resulted from five generations of simulation of a nucleus and a female commercial population, where the nucleus animals are selected on conventional BLUP-EBV, and the unselected commercial females are mated to the selected sires of the nucleus. The parameters of the simulated data set are presented in table I. MA-EBVs were calculated in a manner similar to that of Meuwissen and Goddard [8] in which it is assumed that if markers cannot trace the inheritance of QTL alleles from parent to offspring, then the QTL allele inherited is treated as equally likely to be either of the two alleles in the parent (the possibility that a segregation analysis of the marker data might improve this prediction, was ignored). The probability that the markers could not trace the inheritance of the QTL was assumed to be 0.1, which occurred in 158 instances in the nucleus. In these instances a new QTL effect was postulated and estimated. Including the 400 founder QTL effects (= 2 * 200 founder nucleus animals), the number of QTL effects of nucleus animals was 558. The commercial animals were not marker genotyped and so no QTL effects could be traced. If no equations were eliminated, this would result in 10 000 (= 2 * 5 000 commercial animals) commercial QTL effects. The equations were solved by Gauss-Seidel iteration. The convergence criterion was: where SS is the sum of squares of the deviations of the left hand side from the right hand side of the equations; SST is the sum of squares of the solutions. The number of iterations needed to reach this convergence criterion is a (imperfect) measure of how easily the equations could be solved. This measure is not perfect because the solution vectors of both methods are not the same, and SS may be small while the solution vector is still far from the exact solution. Table II shows the results of the EBV calculations. Without any reduction in the number of equations, the total number of equations is: 16 558 (6 000 animal and 10 558 QTL effects). When the QTL equations of the commercial animals were eliminated, the number of equations was reduced by 10 000. In practice, this figure will often be much larger, because the commercial population will be much larger than in the simulation. Furthermore, the number of iterations that was needed to reach the convergence criterion, was smaller with the reduced set of equations. This suggests that the reduced set of equations was not any harder to solve than the original large set of equations. The solutions to both sets of equations were virtually identical (result not shown), except that the reduced set did not yield estimates of individual QTL effects of commercial animals. If the estimates of the QTL effects of the commercial animals were required, they could be obtained by back solving (see Appendix B). Table III shows the results of the approximate method using conventional EBV of nucleus animals to estimate the MA-EBV of the nucleus animals. When we want to predict the total breeding value of the animals, u i , the use of conventional EBV of nucleus animals, instead of the phenotypes of commercial animals, leads to very accurate predictions of MA-EBV in the simulated data set. The predictions of the individual QTL effects, q i , were also accurate, i.e. the correlation and regression is close to 1. In this approximate method the number of equations was only 1 558.  (table 11).

RESULTS
An alternative to this method is the use of a reduced animal model [1, 2!, which would eliminate equations for commercial animals only if they are not parents. Thus the method presented eliminates more equations than the reduced animal model approach but less than Hoeschele (7!. The importance of using the data on commercial animals when calculating MA-EBVs for the nucleus animals is illustrated by the case of a granddaughter design where all phenotypic data come from the commercial granddaughters. The method proposed could be used in a national genetic evaluation where the number of additional equations would be proportional to the size of the nucleus.
Extension of the method to more marked QTL is straightforward. Let the parental QTL alleles of the first marked QTL be denoted by a i , az (a 3 , a 4 ), and those of the second marked QTL by b l , b 2 (b 3 , b 4 ), where the elements between the brackets are needed when a second nucleus parent is known. Now extra rows and columns for b l , b 2 (b 3 , b 4 ) are augmented to the additions in equations (6) and (7), where the values in the augmented rows and columns are the same as those in the rows and columns of a i , a 2 (a 3 , a 4 ). Also, the off diagonal elements between aj and bh are the same as those between a l and a 2 for all j and k. Hence, with n marked QTL, the number of equations of commercial animals reduces from 2n, + 1 to just one. In many marker assisted selection schemes, there may also be many nucleus animals that are not genotyped, namely the old ancestors of the nucleus that were born before marker genotyping started. Some cryo-conserved semen of old sires may still be available for genotyping, but many old ancestors will remain non-marker genotyped. In situations, where the old ancestors result in a computationally unmanageable number of equations, the approach of Hoeschele [7] eliminates all QTL equations of non-genotyped ancestors that are not on a genetic pathway between two marker genotyped animals. Although more difficult to apply, this method can result in a substantial reduction in the number of QTL equations of non-genotyped ancestors. In the present simulated data set, all nucleus animals were genotyped and a further reduction in the number of equations was not obtained by using the approach of Hoeschele (7!. The rules presented for setting up the G-1 matrix, did not account for the inbreeding of the animals. If inbreeding is not negligible, Quaas' [10, 11! rules can be used to account for inbreeding in the nucleus animal * nucleus animal part of the G-1 matrix, which results in reducing the elements of the D 1 matrix by a fraction equal to the average inbreeding coefficient of the parents. Also, Wang's [14] method accounts for inbreeding, where the inbreeding coefficient is calculated at the QTL given the marker information. In the equations of the commercial animals, the average inbreeding coefficients can be accounted for by reducing the o,2 term in additions (6) and (7) by a fraction equal to the average inbreeding coefficients of the parents. The latter will be slightly biased because the average inbreeding coefficient of the parents will be different at the QTL. This bias can be corrected by using a weighted average inbreeding coefficient, where the conventional inbreeding coefficient averaged over the parents, the inbreeding at the QTL of the sire, and that at the QTL of the dam, are weighed in proportion to their variances, i.e. or 2 ,a2and QQ , respectively.

Use of conventional EBV to predict MA-EBV
A method was developed that uses the information of conventional EBV of nucleus animals and their accuracies instead of the data on commercial animals to increase the accuracy of MA-EBV of nucleus animals. In the simulated data set, the approximate MA-EBV, which used conventional EBVs, were very close to the original MA-EBV based on the full set of equations which used the phenotypic records from the commercial animals. The method was also tested in a granddaughter design !15!, where a genotyped grand sire has genotyped sons which have conventional EBVs based on daughter records. In this situation, the prediction of the QTL effects of the sons and the grand sires was identical to when the original records of the daughters had been used (result not shown). This method could be used by a breeding organisation controlling the nucleus breeding programme without including any information on commercial animals. The method can be compared to the use of daughter yield deviations to represent data on the (commercial) daughters of a nucleus bull. However, this method uses all commercial descendents of a nuclear animal (via its conventional EBV) and avoids double counting of the information from descendents in the nucleus. The method could be extended for QTL detection studies based on REML estimation of the variance due to a QTL, bracketed by the markers. The approximate method to incorporate EBVs from the commercial animals into the nucleus MA-EBV is similar to the use of foreign EBV in the national evaluation of a country. Except that the foreign EBV calculation does (almost) not use local information, such that the foreign EBV yield independent extra information. Hence, the accuracy of the foreign EBV can be directly converted into an effective number of records (or daughters) that is added to the diagonals of the coefficient matrix, and into deregressed proofs, i.e. extra records (or daughters) are invented that contain the information of the foreign EBV.
Here, the situation was more complicated because the conventional EBV of the nucleus animal already contained the information of the commercial animals, which made it more difficult to determine the extra information.
The calculation of the MA-EBV using conventional EBV of commercial animals relies strongly on the accuracy of the conventional EBV. In the present simulation study, the accuracies were calculated by inversion of the conventional mixed model matrix, i.e. exact accuracies were used. In the case of a national evaluation of EBV, the number of equations is too large for direct inversion and the accuracies have been approximated. These approximations are often good (e.g. !9!). However, poor approximations of the accuracies of the conventional EBV will probably reduce the accuracies of the MA-EBV substantially. In any case, the method presented here seems to make as much use as possible of the conventional EBV of national evaluations to improve the accuracies of the MA-EBV of the nucleus animals.