Multibreed analysis by splitting the breeding values

An equivalent model for multibreed variance covariance estimation is presented. It considers the additive case including or not the segregation variances. The model is based on splitting the additive genetic values in several independent parts depending on their genetic origin. For each part, it expresses the covariance between relatives as a partial numerator relationship matrix times the corresponding variance component. Estimation of fixed effects, random effects or variance components provided by the model are as simple as any model including several random factors. We present a small example describing the mixed model equations for genetic evaluations and two simulated examples to illustrate the Bayesian variance component estimation.


INTRODUCTION
The additive genetic covariance between relatives in multibreed cases was described by Elzo [6]. Nevertheless, this definition ignored the segregation terms, which explain the difference in the additive variance of the F1 and F2 groups, as was pointed out by Lo et al. [17]. These authors describe the rules to obtain G, the covariance matrix of the additive values, including both the pure breed contributions and the segregation deviations and where P is the number of breeds involved in the founder generation, f i p is the proportion of genes of the animal i coming from breed p, S and D are the sire and the dam of animal i, S and D are the sire and the dam of animal j, σ 2 p is the additive variance component of breed p and σ 2 pp is the segregation variance between breeds p and p . These rules are closely related to the tabular method to obtain inbreeding coefficients [9]. Lo et al. [17] also showed details about the calculation of G −1 , closely derived from the conventional rules to obtain the inverse of the numerator relationship matrix [12,19].
After this particular definition of G, the analysis of multibreed populations is fully standard. The model y = Xb + Za + e is defined as where b are the fixed effects including the breed effects and other systematic effects, X is the incidence matrix relating fixed effects and records, Z is the incidence matrix for random effects and σ 2 e is the residual variance. It must be noticed that the columns of X corresponding to breed effects consist of the genetic contributions of each breed to each phenotype. The missing data of the model, that is, the additive genetic values, are normally distributed as a ∼ N (0, G) after discarding the breed effects. Genetic evaluations for this model are also standard, that is, fixed and random effects can be predicted with the mixed model equations [13] as described in [3].
where the additive covariance matrix G can not be expressed as the numerator relationship matrix times the additive variance components as in the conventional animal model because both the pure breed and the segregation components are mixtured into G.
For that reason, genetic variance components for the model defined in formulas (3) and (4) are difficult to estimate. For instance, restricted maximum likelihood estimates for multibreed models became more complicated than in a conventional animal model; the details can be found in Elzo [7] and Birchmeier et al. [2]. On the contrary, Bayesian inferences applied on this model resulted in non standard full conditionals for the variance components, making a Metropolis-Hastings implementation involving the repeated evaluation of the determinants of G necessary [4].
In this paper we will describe an equivalent model to (3) and (4) where G will be splitted into several parts allowing a simple analysis of the variance components. As an example, the Bayesian inference will be described for this case in a simpler way than that described in the literature.
In Section 2, the equivalent model for multiple breed cases will be proposed. In Section 3 we will illustrate the implementation of the genetic evaluation in a small example and we will describe both the Bayesian and frequentist inference on variance components in Section 4.

THE MODEL
Formula (1) shows that the variability of a genetic value g ii can be splitted into several independent sources, some of them related with the pure breed variabilities and others related with segregation terms. The matrix G can then be partitioned into several pieces just considering only one source of variability and assigning 0 to the others. Lacy et al. [15,16] used this procedure to analyze the partial inbreeding of the animals due to each founder of the population. We will use it here to split the source of each genetic covariance g i j , i.e., each element of G.
For instance, the partial G related with the breed p, G (p) , can be obtained by considering every genetic component but σ 2 p null. Then, the rules in equations (1) and (2) become and g (p) iD .
The partial G due to the segregation between breeds p and p , G (pp ) , can be obtained after the same argument over σ 2 pp . and It can be noticed that G (p) or G (pp ) depend each on a single variance component and they can be expressed as G (p) = A p σ 2 p or G (pp ) = A pp σ 2 pp , being both A p and A pp the partial numerator relationship matrices due to pure breeds or segregation terms respectively. Although not explicitly described as partial numerator relationship matrices, splitting the inbreeding of the animals into several independent parts was described yet in [15,16]. Following this idea, Rodrigañez et al. [20] calculated the partial inbreeding coefficients by founder origin by using a modification of the tabular method. It is verifiable that Replacing (10) into (4) we have the variance of the model defined as which can be analyzed by including the corresponding missing data a p and a pp for each component as which is just a conventional animal model with several random factors. For instance, considering a case with two pure breeds, the model becomes y = Xb + Za 1 + Za 2 + Za 12 + e, being a 1 , a 2 and a 12 the breeding values splitted by origin. Assuming the variance components to be known, the Henderson mixed model equations [13] will be where α x = σ 2 e /σ 2 x . Equations (5) and (13) provide equivalent results, being verifiable that a = a 1 + a 2 + a 12 .
The inverses of the partial numerator relationship matrices involved in (13) can be easily obtained from where P is a matrix relating progeny to parents and D x are the pivots recursively obtained by the conventional algorithm [19]. Only D − is specific for pure breeds or segregations and they will be described in the next section.
Note that formula (13) uses generalized inverses to take into account that some animals have null contributions from some breeds or segregations.
In these cases, the inverse of the whole A −1 x does not exist because of the existence of null rows. From here on, we will use A − x as the inverse of the corresponding nonzero part of A but keeping the pattern of null rows as in A. We will use D − x and a x accordingly. We could not demonstrate the equivalence of both models in cases including null contributions, but their equivalence will be illustrated by using a small numerical example in Section 3.
In fact, the conventional strategies to set up the mixed model equations can be applied as if all contributions were present, but the equations corresponding to null contributions have to be discarded in order to solve the linear system in equation (13).

Calculation of the diagonal elements of the partial numerator relationship matrices
First, considering that in the new model G can be factored as in Aσ 2 for each partial covariance matrix, recursive formulas to compute the tabular method (6), (7), (8) and (9) can be written respectively as where A i j is the element at the ith row and the jth column of the corresponding partial A. Second, by defining c i = f i p for the pure breed contributions and for segregation effects, the tabular method can be applied in both cases by using the same algorithm The Quaas procedure [19] can be easily reformulated to include the contributions of each genetic origin. In fact, following his procedure (Sect. 3.1.a of the referred paper) for the complete numerator relationship matrix where L is the Choleski factor of A and v and u are working vectors (details can be found in [19]), the partial relationship matrices A − for pure breeds or segregation effects can be obtained only by setting a slight modification Other parts of the conventional algorithm to compute the diagonal elements of A − do not suffer any modification, but skipping the animals with null contributions for some particular partial numerator relationship matrices. It is verifiable that (14) is a particular case of (15) for single breed cases with all c i = 1.

NUMERICAL EXAMPLE TO ILLUSTRATE THE GENETIC EVALUATION FOR THE MULTIPLE BREED MODEL
In Table I, an example is presented including a herd effect with two levels (h) and 11 animals. The four base animals were classified into two breeds (g) and the true variance components are σ 2 e = 4, σ 2 1 = 1, σ 2 2 = 2 and σ 2 12 = 0.25. We implemented the Cantet and Fernando [3] method for the animal model y = X 1 h + X 2 g + a + e as presented in equation (5), where b = (h g ). In this case, the number of equations is 15. The incidence matrix for the breed effect, X 2 , includes the contribution of genes of each breed for each phenotype. We discarded the equation for the first herd to allow the estimability of the system. The relevant part of this formulation is the matrix G, which has been calculated following the rules of Lo et al. [17] at equations (1) and (2).
The results for the fixed and random effects are presented in column 1 of Table II.
We also implemented the mixed model equations as presented in formula (13), which consisted of 37 equations. The relevant parts are A 1 , A 2 and A 12 . In order to allow an easy verification of formula (10), that is, G = A 1 σ 2 1 + A 2 σ 2 2 + A 12 σ 2 12 , we present them including the corresponding null rows and multiplied by its variance component.
The A − x required to set up the mixed model equations are really the inverses formed by the nonzero rows of each partial numerator relationship matrix. The results of this model are presented in Table II in columns 2 to 4. It is easily verifiable that the sum of the partial predictors for each animal equals the complete predictor provided by the Cantet and Fernando's [3] formulation at column 1. The null partial results for some animals correspond to the discarded equations because of their null contributions.

ESTIMATION OF VARIANCE COMPONENTS
Model (12) can be understood as a conventional animal model with several random factors with different covariance matrices. Variance component estimations on such kinds of models have been described in the literature under both the restricted maximum likelihood or the Bayesian inference. The main difference between the model presented here and other models involving, for instance, permanent environmental and litter effects, is the particular nature of the partial numerator relationship matrices A 1 , A 2 and A 12 .
In this section we will use two simulated data sets and we will estimate the variance components via the Gibbs sampler. We will also outline the estimation of the variance components via the expectation maximization algorithm.

Data simulation
Two crossed populations were simulated in order to analyze the variance component estimation. In both cases we simulated 4000 base animals in two pure breeds and four non-overlapping generations of random mating. The total number of animals was 20 000 in both cases. Additive genetic values were obtained after genetic simulation with 100 loci and AA = 2, Aa = 0 and aa = −2, we simulated a fixed effect with 20 levels assigned at random and the continuous residual deviates were obtained from a normal distribution with σ 2 e = 400. All animals were recorded and the pedigree was considered to be completely known.
In case I, all genetic frequencies were 0.5 and 0.7 for breeds 1 and 2 respectively, resulting in a segregation variance of σ 2 12 = 8. In case II, all genetic frequencies were 0.1 and 0.8, respectively, resulting in a segregation variance of σ 2 12 = 98. True values of both the variance components and the estimable differences between genetic means are presented in Table III. Both data sets were simulated with the same structure. They have the same fixed levels and residual deviates and they only differ in the additive genetic values. The number of animals with nonzero contributions was 16 826 for breed 1, 16 716 for breed 2 and 10 958 for the segregation.

Bayesian analysis
We analyzed both data sets by using the Bayesian inference on model (12) which consists of a conventional mixed model with several random effects. Considering flat unbounded priors for both the variance components and the fixed levels to simplify the implementation, the joint posterior distribution is p a 1 , a 2 , a 12 where e = y − Xb − Za 1 − Za 2 − Za 12 , n is the number of records and q 1 , q 2 and q 12 are the number of animals with nonzero contributions of breed 1, 2 and segregation, respectively. Full conditional distributions from 16 are the well known normal distributions for the levels of both b and a and inverted chi-square distributions for variance components. Details of these conditionals can be found in [21] and are not presented in this paper because they are fully standard.
To evaluate the posterior density in (16) it suffices to compute for each quadratic form φ 2 i /d i , where φ represents the Mendelian sampling terms. Note that in this case, it only makes sense to consider the animals with nonzero contributions.
The Bayesian inference for both cases was implemented by using a Gibbs sampler. Just the special structure of the numerator relationship matrices and the use of different amounts of animals in each part of equation (16) differed in this analysis in relation to the conventional mixed models.
The burn-in lengths provided by the coupling argument [10] were 13 428 and 2426 in cases I and II, respectively, to reach a difference between chains of 10 −4 . After burn-in, posterior distributions were obtained by averaging 50 000 cycles of a single chain. The effective chain lengths after burn-in were 274.56 and 131.08 for cases I and II, respectively. Table III shows the true simulated values and the posterior expectations and variances for both cases. It must be noticed that the accuracy of the segregation variance estimate is worse than the additive genetic variances of the pure breeds.

Restricted maximum likelihood
Restricted maximum likelihood [18] estimates of the variance components for multibreed analysis is also a particular case of a mixed linear model with several random factors. Then, the expectation-maximization algorithm [5] can be implemented by using the iterative approach described in [11]. This algorithm requires for each iteration the fixed and random solutions of the linear system provided in formula (13). The genetic variance components can be calculated at each iteration witĥ where i represents pure breeds and segregations,σ 2 i is the variance component corresponding to pure breeds or segregation effects, q i is the number of animals with nonzero contributions from this particular source of variability (number of nonzero c i as described above) and C ii is the corresponding random by random block of the generalized inverse of the coefficient matrix in formula (13). The animals with null contributions have to be discarded both in a i A − i a i and A − i C ii . The residual variance can be obtained at each round of iteration after the well known formulaσ whereσ 2 e is the residual variance estimate at the current round of iteration, n is the number of records, r (X) is the column rank of X and i goes throughout the pure breeds and segregation effects. Note that we considered null the elements of a i corresponding to animals with a null contribution from breed or segregation i.
We did not implement this algorithm because it requires a high computation demand. It requires the inversion of the coefficient matrix in (13) at each round of iteration, whose rank is greater than in a conventional single breed analysis.

DISCUSSION
The model proposed in this paper is equivalent to the current model of choice [2][3][4]17] for genetic evaluations in crossbred populations, including the segregation components. When discarding the segregation terms, the proposed model can also be used as an equivalent model to that proposed by Elzo's group [6][7][8]. It will provide equivalent results because it assumes the same distribution of records: the expectation is the same in both models and considering (10), equations (11) vs. (4) provide the same variance. They only differ in the definition of the missing data, that is, the complete additive values in the conventional model were replaced by the partial additive values. As far as these values will be integrated out, both models will provide equivalent results, as we illustrated in a small example assuming the variances to be known (Sect. 3).
Our proposal requires a linear mixed model very much larger than the classical one. For instance, examples including two founder breeds require a linear system around three times larger, depending on the number of equations discarded because of the null contributions. Furthermore, the resulting linear system in formula (13) is not expected to be more sparse than the linear system in formula (5), because the sparsity of both approaches depends mainly on the pedigree structure, that is, on (I − P).
Although its computational demand is expected to be higher, the proposed model is a more natural approach in the sense that it defines a set of missing data for each source of variability, as linear models in animal breeding usually do. For that reason, it provides very simple formulas for the variance component estimation. When using the Gibbs sampler, the variance components will have the well known scaled inverted chi-square full conditional distributions, avoiding the use of the non standard forms described in Cardoso and Tempelman [4]. Although not implemented in this paper, restricted maximum likelihood estimates of the variance components via the EM algorithm (details in [2]) will also be simpler when based on the quadratic forms of the partial additive values.
The model can also be used to analyze models including genetic groups with different genetic variances, as Alfonso and Estany [1] analyzed. Nevertheless, the required number of equations will grow dramatically even for models with a small number of genetic groups.
Even for single breed analysis, splitting the breeding values of the animals into several independent parts by genetic origin can also be useful. For instance, it can be used to evaluate the impact of some group of founders in the breeding values of their descendants, for instance, multiple ovulation embryo transfer nucleus or artificial insemination centers. Further research should be focused to develop models to evaluate the impact of some non founder animals in their descendants.
Although not described in this paper, the extension to multiple trait models is expected to have the same difficulties as the conventional models including several random factors.

IMPLICATIONS
Variance component estimates in multiple breed populations have been implemented by using particular and specific computer programs. General purpose packages to estimate variance components, such as VCE5 [14], can not be used currently for multiple breed analysis, but they can handle models with multiple random factors. These packages will only require an extra coding effort to calculate both the genetic contributions and the partial relationship matrices outlined in algorithm (15), which can be easily implemented from the code they actually include to calculate the diagonal elements of the numerator relationship matrix (algorithm (14)).