Data transformation for rank reduction in multi-trait MACE model for international bull comparison

Since many countries use multiple lactation random regression test day models in national evaluations for milk production traits, a random regression multiple across-country evaluation (MACE) model permitting a variable number of correlated traits per country should be used in international dairy evaluations. In order to reduce the number of within country traits for international comparison, three different MACE models were implemented based on German daughter yield deviation data and compared to the random regression MACE. The multiple lactation MACE model analysed daughter yield deviations on a lactation basis reducing the rank from nine random regression coefficients to three lactations. The lactation breeding values were very accurate for old bulls, but not for the youngest bulls with daughters with short lactations. The other two models applied principal component analysis as the dimension reduction technique: one based on eigenvalues of a genetic correlation matrix and the other on eigenvalues of a combined lactation matrix. The first one showed that German data can be transformed from nine traits to five eigenfunctions without losing much accuracy in any of the estimated random regression coefficients. The second one allowed performing rank reductions to three eigenfunctions without having the problem of young bulls with daughters with short lactations.


INTRODUCTION
The multiple across country evaluation (MACE) [17] methodology is currently used for international dairy bull comparisons. Estimated breeding values (EBV) from each country are deregressed to obtain a value analogous to daughter yield deviations (DYD) for bulls that have daughters with records. Despite the fact that only a single EBV per bull is permitted for each country in international genetic evaluation, the current MACE has a large number of equations, since each evaluated sire will, conceptually, have a breeding value for all traits, i.e. for all countries, although it might have daughters in only one. Traditionally, the corresponding genetic covariance matrices have been considered unstructured, i.e. for k countries there were k(k + 1)/2 distinct (co)variance components to be estimated. For example, when the current Interbull Holstein evaluation run for production traits includes 26 countries, the (co)variance matrix of genetic effects involves 325 correlations. By and large, restrictions on estimates are imposed only to ensure that estimates are within the parameter space, i.e. that all variances and conditional variances are positive, that all correlation estimates are in the range of -1 to +1, and that all partial correlations are consistent with each other [13]. In statistical terms, this is equivalent to the requirement that the estimated covariance matrix is positive semidefinite, i.e. that none of its eigenvalues is negative.
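The parameter counts quoted above follow from simple combinatorics. The short sketch below reproduces them for the 26-country example; the function names are illustrative and not part of any Interbull software.

```python
def n_covariance_components(k):
    """Distinct variances plus covariances in an unstructured k x k matrix."""
    return k * (k + 1) // 2


def n_correlations(k):
    """Distinct off-diagonal correlations among k traits (here: countries)."""
    return k * (k - 1) // 2


k = 26
print(n_covariance_components(k))  # 351 = 26 variances + 325 covariances
print(n_correlations(k))           # 325
```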
In contrast, other areas of statistics have long since assumed and estimated structured covariance matrices. A well-designed structural model for genetic covariances uses information, external to the data, to explain genetic covariability in terms of few parameters, leading to more precise estimates of genetic correlations between countries [16]. For example, in international dairy sire evaluations, traits are currently defined according to country borders. However, similarity in production systems between herds in different countries depends not only on geographical proximity but also on climatic conditions, on management practices, and on the genetic composition of the cow population. If information is available about these variables, it can be used to explain the genetic covariance structure between countries. Rekaya et al. [16], Minéry et al. [14] and Leclerc et al. [6] have proposed structural models in order to reduce the number of parameters to be estimated across countries.
Principal component (PC) analysis, on the other hand, is widely used as a dimension reduction technique, but so far has had only limited applications in quantitative genetic analyses. PC analysis requires estimating the eigenvectors and eigenvalues of covariance matrices [3]. Eigenvectors define independent linear functions of the variables considered, the so-called principal components, that successively explain the maximum amount of variation, measured by the corresponding eigenvalues. This implies that, for a given number of components considered, PC approximate the multivariate data most accurately. It follows that PC with variances (eigenvalues) close to zero contribute virtually no information to the analysis that is not already contained in the PC with larger eigenvalues. Hence, these components can be ignored, resulting in equivalent analyses involving fewer variables, i.e. traits or countries, and often reduced sampling variation. Reducing the dimension of analyses by considering fewer variables can considerably decrease computational requirements. Different authors have proposed reducing the number of parameters to be estimated across countries using PC analysis and the closely related factor analysis [5, 10-13].
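As a minimal illustration of how eigenvalues guide the choice of rank, the sketch below counts how many principal components are needed to reach a given share of the total variance. The eigenvalues and the 95% threshold are made up for the example and are not taken from any table in this paper.

```python
def components_needed(eigenvalues, min_share):
    """Smallest number of leading components whose eigenvalues sum to at
    least `min_share` of the total variance (the sum of all eigenvalues)."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running / total >= min_share:
            return count
    return len(ordered)


# Hypothetical eigenvalues of a four-trait correlation matrix (they sum to 4).
example = [2.6, 1.1, 0.25, 0.05]
print(components_needed(example, 0.95))  # 3: the last component adds little
```

The threshold is a design choice: a stricter value keeps more components and therefore a larger equation system.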
Since more and more countries have upgraded their national genetic evaluation system to a multiple trait model or a multiple lactation random regression test day model (RRTDM) [7], differences between models for national and international evaluations have become increasingly evident. In order to optimise genetic evaluation models for both national and international evaluations, Sullivan and Wilton [18] and Liu et al. [9] proposed a multiple trait MACE (MT-MACE) model for international bull comparison. This model extended the current single trait MACE (ST-MACE) allowing a variable number of correlated traits for countries using a multiple trait model in national genetic evaluation. Although the MT-MACE model can better utilise the information derived from the RRTDM in national genetic evaluation than the ST-MACE model, the huge size of the MT-MACE system can be a limiting factor for international genetic evaluation involving all dairy populations. In order to reduce the size of the MT-MACE system, it is even more necessary than for the ST-MACE to apply rank reduction techniques to find intermediate MACE models that can be a reasonable compromise between feasibility and accuracy.
In parallel to international bull comparison, France and Germany strengthened their collaboration at the end of 2005 in a joint project. One of the main goals of this project was to perform joint French-German bull and cow evaluations using pre-corrected records (i.e. yield deviations) in a two-step approach [1] following the multiple trait MACE model proposed by Liu et al. [9]. Since Germany uses a multiple lactation random regression test day model in national evaluation, a random regression MACE model (RR-MACE) was implemented and proved feasible for the parameter estimation and the joint bull evaluation of milk production traits from France and Germany [19,20]. In the near future, the RR-MACE model will be applied to the joint bull and cow evaluation for milk production traits from France and Germany. In this case, the number of traits per country can be a limiting factor because of the much larger equation system (i.e. millions of equations).
The aim of this paper was to explore different methods to reduce the rank of the German RRTDM in order to make MT-MACE applicable for joint French-German evaluation and/or international genetic evaluation involving a higher number of countries. After the analysis of a multiple lactation MACE model (ML-MACE), two rank reduction alternatives based on principal component analysis were performed to investigate their accuracy and suitability.

Data
After the February 2006 German national genetic evaluations were run, DYD of bulls were obtained as the average of daughter performance adjusted for fixed effects and non-genetic random effects of daughters and for genetic effects of the bull's mates [8]. There were 14 887 Black and White Holstein bulls with DYD available from Germany.
Full pedigree information of bulls with sire and dam relationship was used for MACE evaluation. There were 67 541 animals in the pedigree file and 32 genetic groups for unknown parents. Genetic groups were defined according to the breed, country of origin, selection path (son to sire, son to dam, daughter to sire and daughter to dam) and birth year of the animal. Small phantom groups were merged automatically given a predefined minimum number of animals per group. The following rules were applied to combine or merge small groups: selection paths were merged based on the sex of the parent (son to sire with daughter to sire, and son to dam with daughter to dam), countries were merged accordingly (North America, Western Europe, and the rest), minor breeds within Holstein breed were merged.

The MT-MACE model
For a country using a multi-trait model in national genetic evaluation, the following statistical model was applied to the DYD of bull i:

$$q_i = f + u_i + \varepsilon_i$$

where $q_i$ is the vector of DYD of the i-th bull, $f$ is a vector of general means for all traits, $u_i$ is the vector of additive genetic effects of bull i, and $\varepsilon_i$ is a vector of residual effects. The (co)variance matrices of the random effects are defined by $G$, the genetic (co)variance matrix, and $\Psi_i$, a multi-trait equivalent daughter contribution (MTEDC) matrix associated with the DYD vector $q_i$. Four sub-models of the MT-MACE model that differed with respect to the data analysed were considered in this investigation. The subindex i will be omitted from the formulae to simplify the notation.

The random regression MACE model
The random regression MACE model (RR-MACE) analysed $q_i$ on a random regression coefficient (RRC) basis. Ignoring pedigree information, the mixed model equation (MME) for estimating the breeding value $\hat{u}_{RRC}$ for bull i is:

$$(\Psi_{RRC} + a^{ii} G^{-1})\,\hat{u}_{RRC} = \Delta_{RRC}$$

where $a^{ii}$ is the diagonal element of the inverse of the relationship matrix for bull i. The matrix $\Psi_{RRC}$ and the vector $\Delta_{RRC}$ were generated in national evaluation using the MTEDC procedure [8]. The equation system was solved using a pre-conditioned conjugate gradient algorithm (PCG) with an iteration-on-data technique [9]. The convergence criterion, defined as the logarithm of the sum of squares of differences in solutions between two consecutive rounds of iteration divided by the sum of squares of solutions in the last round, was set to -10.
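The convergence criterion described above can be written as a small function. This is a sketch only: the text does not state the base of the logarithm, so base 10 is assumed here, and the function name and example vectors are hypothetical.

```python
import math


def convergence_criterion(previous, current):
    """log10 of (sum of squared changes in solutions between two rounds)
    divided by (sum of squared solutions in the last round).
    Base 10 is an assumption; the source only says 'logarithm'."""
    numerator = sum((c - p) ** 2 for p, c in zip(previous, current))
    denominator = sum(c ** 2 for c in current)
    if numerator == 0.0:
        return float("-inf")  # solutions did not change at all
    return math.log10(numerator / denominator)


threshold = -10.0
early_round = convergence_criterion([1.0, 2.0], [1.0, 2.1])
late_round = convergence_criterion([1.0, 2.0], [1.0 + 1e-6, 2.0])
print(early_round <= threshold)  # False: keep iterating
print(late_round <= threshold)   # True: iteration can stop
```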
The RRTDM in Germany modelled the additive genetic effects of an animal as a normalised orthogonal third-order Legendre polynomial for each of the first three lactations [7]. Thus, in order to model the additive genetic effects as closely as possible to the national evaluation, the RR-MACE evaluation estimated nine breeding values for each animal in the pedigree file. These breeding values on an RRC basis can easily be summarised to a combined lactation basis [7].
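To make the step from nine random regression coefficients to lactation yields concrete, the sketch below builds one possible conversion matrix by summing normalised Legendre polynomials over the days of a lactation. It assumes three coefficients per lactation (polynomial degrees 0 to 2) and a hypothetical range of days 5 to 305 in milk; the actual German definitions are given in [7] and may differ.

```python
import math


def normalised_legendre(x):
    """Normalised Legendre polynomials of degree 0, 1 and 2 at x in [-1, 1]."""
    return [
        math.sqrt(1 / 2),
        math.sqrt(3 / 2) * x,
        math.sqrt(5 / 2) * 0.5 * (3 * x * x - 1),
    ]


def lactation_weights(first_day=5, last_day=305):
    """Sum each polynomial over all days of the lactation.
    The day range is an assumption made for illustration."""
    weights = [0.0, 0.0, 0.0]
    for day in range(first_day, last_day + 1):
        x = 2 * (day - first_day) / (last_day - first_day) - 1  # map to [-1, 1]
        for j, value in enumerate(normalised_legendre(x)):
            weights[j] += value
    return weights


def lactation_matrix(n_lactations=3):
    """Block-diagonal matrix (n_lactations x 3*n_lactations) that turns
    random regression coefficients into lactation totals."""
    weights = lactation_weights()
    n_coef = len(weights)
    matrix = []
    for lactation in range(n_lactations):
        row = [0.0] * (n_lactations * n_coef)
        start = lactation * n_coef
        row[start:start + n_coef] = weights
        matrix.append(row)
    return matrix


for row in lactation_matrix():
    print([round(value, 3) for value in row])
```

Under these assumptions the weight of the linear term cancels by symmetry and the constant term dominates each row.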

The multiple lactation MACE
In order to reduce the number of traits per bull, the ML-MACE model was designed for evaluating DYD on a 305-day lactation basis. Each bull with data in Germany then had one DYD for each of the three lactations. In order to conduct the evaluation with the ML-MACE model, the components of the equation system $(\Psi_L + a^{ii} G_L^{-1})\,\hat{u}_L = \Delta_L$ need to be converted from the RRC basis prior to evaluation, using the matrix $L$ that converts the information from RRC to a 305-day lactation basis. Note that the EDC matrix $\Psi_{RRC}$ and the matrix product $L \Psi_{RRC}^{-1} L'$ do not have a regular inverse when bull i has daughters with some missing lactations. In this case, only a full rank submatrix corresponding to lactations with data needs to be inverted.
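The conversions to the lactation basis can be sketched as follows. This is a reconstruction under stated assumptions, not a quotation of the original formulae: it assumes that DYD on a lactation basis are the linear function $Lq$ of DYD on an RRC basis, that $\Psi_{RRC}^{-1}$ acts as the residual (co)variance of $q$, and that $\Delta_{RRC}$ is the right-hand side of the RRC system. The form of $\Psi_L$ is the one suggested by the remark above on inverting $L\Psi_{RRC}^{-1}L'$.

```latex
\begin{align*}
G_L      &= L\,G\,L' \\
\Psi_L   &= \left( L\,\Psi_{RRC}^{-1}\,L' \right)^{-1} \\
\Delta_L &= \Psi_L\,L\,\Psi_{RRC}^{-1}\,\Delta_{RRC}
\end{align*}
```

For a bull whose daughters lack some lactations, the same expressions would be applied to the rows and columns of the lactations with data only.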

The rank reduced random regression MACE model based on genetic correlations
Because the genetic (co)variance matrix from Germany (Tab. I) has some relatively low eigenvalues, principal component analysis could be applied in order to reduce the number of equations per bull from nine traits to a lower number of eigenfunctions. The German genetic (co)variance matrix $G = S_{rG} U_{rG} D_{rG} U_{rG}' S_{rG}$ is decomposed as a product of the eigenvectors $U_{rG}$ and the eigenvalues $D_{rG}$ of the genetic correlation matrix (Tab. II). The matrix $S_{rG}$ is the diagonal matrix of genetic standard deviations. The data transformation matrix is defined as $T = S_{rG} U_{rG} D_{rG}^{1/2}$. For rank reduction, the smallest eigenvalues in $D_{rG}$ can be set to zero and deleted, and the corresponding columns in $U_{rG}$ removed.
In order to conduct the evaluation with the rank reduced random regression MACE model based on genetic correlations (rG-MACE), the components of the equation system $(\Psi_{rG} + a^{ii} G_{rG}^{-1})\,\hat{u}_{rG} = \Delta_{rG}$ need to be converted with the transformation matrix $T$ prior to evaluation. On the transformed scale, $G_{rG} = I$.
Note that, since no matrix has to be inverted, there was no difficulty in calculating $\Psi_{rG}$ for bulls with missing values in the second and/or third lactation (see Appendix).
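One way to see why no inversion is needed is to derive the conversions from the RRC system. The sketch below assumes that the RRC equation system has the same form as the transformed one, $(\Psi_{RRC} + a^{ii}G^{-1})\,\hat{u}_{RRC} = \Delta_{RRC}$, and that breeding values on the two scales are linked by $\hat{u}_{RRC} = T\hat{u}_{rG}$. The resulting expressions for $\Psi_{rG}$ and $\Delta_{rG}$ are inferred from these assumptions rather than quoted. With the full rank $T$ (all nine eigenfunctions), $G = TT'$ and:

```latex
\begin{align*}
(\Psi_{RRC} + a^{ii} G^{-1})\,T\,\hat{u}_{rG} &= \Delta_{RRC} \\
T'(\Psi_{RRC} + a^{ii} G^{-1})\,T\,\hat{u}_{rG} &= T'\Delta_{RRC} \\
\bigl(\underbrace{T'\Psi_{RRC}\,T}_{\Psi_{rG}}
      + a^{ii}\,\underbrace{T'(TT')^{-1}T}_{G_{rG}^{-1} = I}\bigr)\,\hat{u}_{rG}
      &= \underbrace{T'\Delta_{RRC}}_{\Delta_{rG}}
\end{align*}
```

When columns of $T$ are dropped for rank reduction, the same products would presumably be formed with the reduced $T$, which is where the approximation enters.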
Once the breeding values $\hat{u}_{rG}$ were estimated, they were back transformed to the RRC basis as $\hat{u} = T\hat{u}_{rG}$ (10). These back transformed EBV are approximate solutions of the RRC breeding values.
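The construction of $T$ and the back transformation can be illustrated with a toy two-trait case, for which the eigenvalues and eigenvectors of the correlation matrix are known in closed form. The standard deviations, the correlation and all function names below are hypothetical and chosen only to keep the example small; the German analysis involves nine traits.

```python
import math


def transformation_matrix(sd, r, rank=2):
    """Return T = S U D^(1/2) for two traits with genetic correlation 0 < r < 1.

    The correlation matrix [[1, r], [r, 1]] has eigenvalues 1 + r and 1 - r
    with eigenvectors (1, 1)/sqrt(2) and (1, -1)/sqrt(2), so no numerical
    eigen-solver is needed for this toy case.
    """
    eigenvalues = [1 + r, 1 - r]          # diagonal of D, largest first
    c = 1 / math.sqrt(2)
    eigenvectors = [[c, c], [c, -c]]      # U: rows = traits, columns = components
    return [
        [sd[i] * eigenvectors[i][j] * math.sqrt(eigenvalues[j]) for j in range(rank)]
        for i in range(2)
    ]


def times_own_transpose(t):
    """Return T T', the genetic (co)variance matrix implied by T."""
    return [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in t] for row_i in t]


def back_transform(t, u_reduced):
    """Return u = T u_reduced, i.e. breeding values on the original trait scale."""
    return [sum(t_ij * u_j for t_ij, u_j in zip(row, u_reduced)) for row in t]


sd = [2.0, 3.0]  # hypothetical genetic standard deviations
r = 0.9          # hypothetical genetic correlation

full = transformation_matrix(sd, r, rank=2)
reduced = transformation_matrix(sd, r, rank=1)

print(times_own_transpose(full))     # reproduces G = S R S up to rounding
print(times_own_transpose(reduced))  # rank-1 approximation of G
print(back_transform(reduced, [1.0]))
```

Dropping the second column of $T$ removes the component with eigenvalue $1 - r$, so the approximation improves as the correlation approaches one.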

The rank reduced random regression MACE model based on combined lactation weights
The previous rank reduction based on the genetic correlation matrix gave the same relative importance to all random regression coefficients. However, the German combined lactation EBV depend mainly on the first coefficient of each lactation [7]. In order to give more weight to the coefficients that influence lactation production the most, the genetic (co)variance matrix was decomposed as $G = S_{CW}^{-1} U_{CW} D_{CW} U_{CW}' S_{CW}^{-1}$, i.e. as a product of a matrix $S_{CW}^{-1}$ and the eigenvectors $U_{CW}$ and the eigenvalues $D_{CW}$ of a matrix $C_{CW}$. The matrix $C_{CW} = S_{CW} G S_{CW}$ is the product of the genetic (co)variance matrix and a diagonal matrix $S_{CW}$ that contains the weight of each coefficient for calculating the combined lactation EBV [7]. In order to conduct the evaluation with the rank reduced random regression MACE model based on combined lactation weights (CW-MACE), the transformation matrix was defined as $T = S_{CW}^{-1} U_{CW} D_{CW}^{1/2}$ and used in equations (5) and (6) as for rG-MACE.
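The decomposition can be checked directly from the definitions above. The only assumption in this sketch is that $U_{CW}$ holds orthonormal eigenvectors of the symmetric matrix $C_{CW}$; the subscript $k$ for the number of retained eigenfunctions is introduced here for illustration.

```latex
\begin{align*}
C_{CW} = S_{CW}\,G\,S_{CW} &= U_{CW}\,D_{CW}\,U_{CW}' \\
\Rightarrow\quad G &= S_{CW}^{-1}\,U_{CW}\,D_{CW}\,U_{CW}'\,S_{CW}^{-1} = T\,T',
   \qquad T = S_{CW}^{-1}\,U_{CW}\,D_{CW}^{1/2} \\
G &\approx T_k\,T_k' \quad \text{when only the } k \text{ largest eigenvalues are kept}
\end{align*}
```

Because the eigenvalues are those of the weighted matrix $C_{CW}$, the components kept are those that matter most for the combined lactation EBV rather than those that explain most of the unweighted genetic variation.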

Multiple lactation MACE model
Pearson correlations between the ML-MACE and the RR-MACE were 0.994, 0.995 and 0.995 for the first, second and third lactation EBV, respectively, and 0.995 for the combined lactation EBV (Tab. III). The correlations by birth year were over 0.999 for old bulls, although the values dropped for the most recent birth years (Tab. III). The correlations also dropped for bulls with a low average number of test day records of daughters (results not shown). However, the number of daughters had a very limited impact on correlations. In general, very similar combined lactation EBV were obtained using RR-MACE and ML-MACE, but significant differences existed for the youngest bulls with short lactations of daughters. These differences led to a correlation of 0.84 within the top 300 Holstein bulls (Tab. V).
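The contrast between a high overall correlation and a lower correlation within the top list can be reproduced with a few lines of code. The sketch below is illustrative only: the text does not specify how the top list was selected, so ranking on the reference model is assumed here, and the EBV are invented. A rank correlation could be substituted for the Pearson correlation in the same way.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long lists of EBV."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def top_list_correlation(reference, alternative, n_top):
    """Correlation restricted to the n_top bulls ranked highest by `reference`.
    Selecting on the reference model is an assumption of this sketch."""
    order = sorted(range(len(reference)), key=lambda i: reference[i], reverse=True)
    top = order[:n_top]
    return pearson([reference[i] for i in top], [alternative[i] for i in top])


# Invented EBV for four bulls under a reference and an alternative model.
reference = [10.0, 9.0, 8.0, 1.0]
alternative = [9.5, 9.8, 8.1, 0.5]

print(round(pearson(reference, alternative), 3))                  # all bulls
print(round(top_list_correlation(reference, alternative, 3), 3))  # top 3 only
```

Within a selected group the EBV vary less, so the same absolute differences between models weigh more heavily on the correlation.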

Rank reduction based on genetic correlations
The rank reduction from nine traits to five eigenfunctions based on the genetic correlation matrix was done by discarding the four smallest eigenvalues in Table II. EBV correlations between the rG-MACE of rank 5 and the RR-MACE were over 0.99 for the first and second random regression coefficients (Tab. IV), but slightly lower for the third RRC. On a 305-day lactation basis, EBV correlations were 1.000, 0.997 and 0.991 for the first, second and third lactation, respectively. When the three lactations were summarised to a combined lactation basis, the EBV correlations were over 0.999 independently of the birth year (Tab. III), the number of daughters and the average number of test day records of daughters of the bulls. The ranking correlations within the top 300 Holstein bulls were around 0.98 between the rG-MACE of rank 5 and the RR-MACE (Tab. V). Only official bulls were included in the rankings.
Further rank reduction to three eigenfunctions performed worse. In this case, almost all EBV correlations with RR-MACE were under 0.99, both on an RRC basis (Tab. IV) and on a combined lactation basis (Tab. III).

Rank reduction based on combined lactation weights
Rank reduction to three eigenfunctions performed better when it was based on combined lactation weights. In this case, only three eigenvalues and their associated eigenvectors explain most of the total genetic variance (Tab. II). After discarding the lowest eigenvalues, it was possible to perform CW-MACE with three eigenfunctions that kept EBV correlations with RR-MACE around 0.99 for the first random regression coefficients (Tab. IV), although the EBV correlations for the second and third coefficients were poor. The EBV correlations on a combined lactation basis were 0.995 and were largely independent of the birth year (Tab. III), the number of daughters and the average number of test day records of daughters of the bulls. Thus, using this data transformation, it is possible to reduce the rank to three eigenfunctions without having the short lactation problem encountered with ML-MACE. In spite of the high EBV correlations, the correlation with the RR-MACE within the top 300 Holstein bulls was around 0.762 (Tab. V).

DISCUSSION
Principal component analysis can be applied to countries having a genetic (co)variance matrix close to singular in order to reduce the number of traits. Ignoring the lowest eigenvalues of the genetic correlation matrix, one can perform data transformation to reduce the number of traits to a lower number of eigenfunctions. Generally, the eigenvalues should be obtained from correlation rather than covariance matrices, especially if traits with greatly differing variation are included. In a covariance matrix, functions of traits with high variances have high eigenvalues and are selected first, whereas functions of the other traits with smaller variances may be discarded. Data transformation using eigenvalues of the genetic correlation matrix made it possible to reduce the German dataset from nine traits to five eigenfunctions without losing much accuracy in any of the random regression coefficients. Back transformation of breeding values made it possible to recover the same traits that are presented to breeders, so that these traits can be inspected more closely and are easier for the industry to understand. The impact of rank reduction to five eigenfunctions on the top lists can be considered negligible.
In the near future, the previous rank reduction can be applied in the joint French-German bull and cow evaluation for milk production traits. However, if the number of traits is still a limiting factor, the German dataset can be reduced to three eigenfunctions. In this case, PCA could not be based on genetic correlations because there was a loss of accuracy in all RRC, but it can be applied based on combined lactation weights. This approach concentrated the loss of accuracy on the less important random regression coefficients while keeping the high EBV correlations on a combined lactation basis. This higher rank reduction led to larger differences in the top lists with respect to the RR-MACE and to a lower accuracy for the second and third coefficients, which would not allow selection for lactation persistency. However, it can be a compromise between feasibility and accuracy.
In the future, the multi-trait MACE model can be applicable for international bull comparison involving a higher number of countries. In such a model, the number of traits would be enormous (Germany 9 RRC, Canada, Netherlands and Italy 15 RRC and so on) and rank reduction would be necessary. Rank reductions could be performed within country and principal components could be used as exchange tools across countries. The problem here is that genetic parameters differ considerably across countries. For example, the genetic correlation of second with third lactation was estimated to be 0.97 in Germany but only 0.83 in Canada [15]. This could lead to very different eigenvectors and eigenvalues across countries. If this is a problem, the within country data can be reduced to three lactations. This ML-MACE model performed very well except for the youngest bulls, which can be traced back to the issue of short lactations. Instead of first, second and third lactations, Guo et al. [2] proposed to exchange average yield and a regression on maturity. This uses two traits instead of three to provide nearly the same information and allows lactations higher than three to be easily included, thereby increasing reliability, especially for cows. All these options should still be analysed in order to find a compromise between feasibility and accuracy. The within country rank reductions can be combined with across country ones, e.g. Leclerc et al. [5,6], in order to make MT-MACE applicable for international genetic evaluation involving a higher number of countries.
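To give a sense of scale, the arithmetic below counts the traits and the distinct genetic (co)variance components for only the four countries named above, using the k(k + 1)/2 rule for an unstructured matrix. It is an illustration, not a figure taken from any evaluation run.

```latex
\begin{align*}
k &= 9 + 3 \times 15 = 54 \quad \text{traits per bull} \\
\frac{k(k+1)}{2} &= \frac{54 \times 55}{2} = 1485 \quad \text{distinct (co)variance components}
\end{align*}
```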
Another way to reduce the rank of MT-MACE models is to exchange the same traits across countries as in national evaluations and perform the principal component analysis at the Interbull level. Here the estimation of genetic (co)variance matrix for so many traits and countries prior to rank reduction would be difficult [4].
Other applications of data transformation are in total merit index construction within country using a multiple trait animal model [1]. In Germany, production traits (milk, fat and protein) and somatic cell counts are analysed with RRTDM. In this case, data reduction within and across traits could be performed before the joint analysis of all traits.

CONCLUSIONS
Different sub-models of the multiple trait MACE model were implemented for the evaluation of German DYD for production traits. If computing requirements of the random regression MACE model are a limiting factor for international genetic evaluations involving different dairy populations, then the data transformation for rank reduction to five eigenfunctions based on eigenvalues of the genetic correlation matrix would be a reasonable compromise between feasibility and accuracy while keeping the possibility to select for lactation persistency. Higher rank reductions lead to more changes in the rankings of top bulls and lower accuracies for coefficients related to lactation persistency, but appear satisfactory for international 305-day combined lactation EBV.