Principal component approach in variance component estimation for international sire evaluation
 AnnaMaria Tyrisevä^{1}Email author,
 Karin Meyer^{2},
 W Freddy Fikse^{3},
 Vincent Ducrocq^{4},
 Jette Jakobsen^{5},
 Martin H Lidauer^{1} and
 Esa A Mäntysaari^{1}
DOI: 10.1186/129796864321
© Tyrisevä et al; licensee BioMed Central Ltd. 2011
Received: 1 November 2010
Accepted: 24 May 2011
Published: 24 May 2011
Abstract
Background
The dairy cattle breeding industry is a highly globalized business, which needs internationally comparable and reliable breeding values of sires. The international Bull Evaluation Service, Interbull, was established in 1983 to respond to this need. Currently, Interbull performs multipletrait across country evaluations (MACE) for several traits and breeds in dairy cattle and provides international breeding values to its member countries. Estimating parameters for MACE is challenging since the structure of datasets and conventional use of multipletrait models easily result in overparameterized genetic covariance matrices. The number of parameters to be estimated can be reduced by taking into account only the leading principal components of the traits considered. For MACE, this is readily implemented in a random regression model.
Methods
This article compares two principal component approaches to estimate variance components for MACE using real datasets. The methods tested were a REML approach that directly estimates the genetic principal components (direct PC) and the socalled bottomup REML approach (bottomup PC), in which traits are sequentially added to the analysis and the statistically significant genetic principal components are retained. Furthermore, this article evaluates the utility of the bottomup PC approach to determine the appropriate rank of the (co)variance matrix.
Results
Our study demonstrates the usefulness of both approaches and shows that they can be applied to large multicountry models considering all concerned countries simultaneously. These strategies can thus replace the current practice of estimating the covariance components required through a series of analyses involving selected subsets of traits. Our results support the importance of using the appropriate rank in the genetic (co)variance matrix. Using too low a rank resulted in biased parameter estimates, whereas too high a rank did not result in bias, but increased standard errors of the estimates and notably the computing time.
Conclusions
In terms of estimation's accuracy, both principal component approaches performed equally well and permitted the use of more parsimonious models through random regression MACE. The advantage of the bottomup PC approach is that it does not need any previous knowledge on the rank. However, with a predetermined rank, the direct PC approach needs less computing time than the bottomup PC.
Background
Globalization of dairy cattle breeding requires accurate and comparable international breeding values for dairy bulls. The international Bull Evaluation Service, Interbull, has for years performed international genetic evaluations for dairy cattle for several traits, serving the cattle breeders worldwide. Due to different trait definitions and evaluation models in countries participating in the international genetic evaluation of dairy bulls, biological traits like protein yield are treated as different, but genetically correlated traits across countries [1]. Therefore, each bull will have a breeding value on the base and scale of each participating country. For protein yield in Holstein, this currently leads to 28 breeding values per bull and the number of partipating countries is expected to increase. Such a model is challenging for those responsible for the evaluations and estimation of the corresponding genetic parameters. The size of the (co)variance matrix is large: for 28 traits, the genetic covariance matrix of the classical, unstructured, multipletrait model comprises 406 distinct covariance components. Furthermore, the full rank model becomes overparameterized due to high genetic correlations. In addition, links between populations are determined by the amount of exchange of genetic material among the populations and can vary in strength. These special characteristics have led to a situation, where variance components e.g. for protein yield in Holstein are estimated in subsets of countries, and are then combined to buildup a complete (co)variance matrix [2, 3]. Also, country subsetting is not problemfree since it is often necessary to apply a "bending" procedure in order to obtain a positive definite (co)variance matrix when combining estimates from the analyses of subsets [4]. Even if the complete data could be analyzed simultaneously, variance component estimation would remain a challenge since the usual estimation methods are very slow or unstable, when the (co)variance matrices are illconditioned. Mäntysaari [5] has hypothesized that with the high genetic correlations among countries, estimation of parameters for the full size (co)variance matrix may underestimate the genetic correlations and yield unexpected partial correlations. As an extreme case, this can result in a situation where the bull's daughter performance in one country can effect negatively the bull's EBV in another country. This has been illustrated by van der Beek [6].
Different solutions have been proposed to deal with the problem of overparameterisation. Madsen et al. [7] have introduced a modification of the average information (AI) algorithm that could be applied to estimate heterogeneous residual variance, residual covariance structure and matrices of reduced rank. Rekaya et al. [8] have employed structural models to estimate genetic (co)variances. They modelled genetic, management and environmental similarities to explain the genetic (co)variance structure among countries and to obtain more accurate estimates of genetic correlations. The authors considered the method useful, especially when there was a lack of genetic ties between countries. However, they noted a 15 to 20% increase in computing time compared to the standard multivariate model. Leclerc et al. [9] have approached the structural models in a different way. They selected a subset of wellconnected base countries to build a multidimensional space. The coordinates defined by these countries were used to estimate a distance between base countries and other countries and thus the genetic correlations between them. This decreased the number of parameters to be estimated compared to the unstructured variance component matrix for the multipletrait across country evaluation (MACE) approach [10]. However, when they studied a field dataset, a relatively large number of dimensions was needed to model the genetic correlations appropriately and the estimation process often led to local maxima, decreasing the utility of the approach.
The principal component (PC) approach has also been investigated as a possible solution to deal with the problems of variance component estimation for the international genetic evaluation of dairy bulls. This approach is of special interest because it allows for a dimension reduction. Principal components are independent, linear functions of the original traits. PC are obtained through an eigenvalue decomposition of a covariance or correlation matrix, which yields its eigenvectors and corresponding eigenvalues. Eigenvalues describe the magnitude of the variance that the eigenvectors explain. For highly correlated traits, the first few principal components explain the major part of the variation in the data and those with the smallest contribution on the variance can be excluded without notably altering the accuracy of the estimates, e.g. [11]. Factor analysis (FA) is closely related to the PC approach, but it models part of the variance to be traitspecific. Thus, generally it does not lead to a reduction in rank (assuming all traitspecific variances are nonzero), but benefits from the more parsimonous structure of the (co)variance matrix. Leclerc et al. [12] have studied both PC and FA approaches, but instead of estimating parameters directly from the complete data, they used a subset of welllinked base countries, performed a dimension reduction for the subset and estimated a contribution of the other countries to these PC or factors.
The above studies were motivated by an attempt to reduce the number of parameters in the variance component estimation for MACE, but except for the study of Rekaya et al. [8], they were based on data subsetting. Kirkpatrick and Meyer [13] and Mäntysaari [5] have suggested two different PC approaches meant to use complete datasets. Kirkpatrick and Meyer [13] have introduced a direct PC approach that exploits only leading principal components to model the variation in a multivariate system to improve the precision of the estimation and to reduce the computational burden inherent in the analysis of large and complex datasets. However, the approach was not specifically designed for MACE and has not been tested for such datasets. The bottomup PC approach, introduced by Mäntysaari [5], is based on the random regression (RR) MACE model that enables rank reduction. It adds traits, i.e. countries, sequentially in the analysis and defines a correct rank in each step, until all countries are included and the final rank is determined. The bottomup PC approach was designed to estimate the genetic parameters of large, overparameterized datasets, for which the estimation of the complete, full rank dataset might not be possible. So far it has only been tested on a simulated dataset. This article studies the value of the direct and the bottomup PC approaches to estimate the variance components for MACE using real datasets and evaluates the validity of the bottomup PC approach to determine the appropriate rank of the (co)variance matrix.
Methods
Random regression MACE
where y_{ i } is a n_{ i } vector of national deregressed breeding values for bull i, b is a vector of t country effects, u_{ i } is a vector of t different international breeding values for bull i and ε_{ i } is a n_{ i } vector of residuals. X_{ i } and Z_{ i } are incidence matrices and the variance of the bull's breeding values is Var(u _{ i }) = G. Differences in residual variances, var(ε_{ i } ), were taken into account by carrying out a weighted analysis. Specifically, this involved fitting residual variances at unity and scaling the other terms in the model (1) with weights, w_{ ij } = EDC_{ ij }/g_{ jj }λ_{ j } , where g_{ jj } is the sire variance of the j'th country, with heritabilities provided by each participating country j and EDC_{ ij } is the bull's effective daughter contribution in country j[14]. Contrary to the official MACE evaluations, in this study animals with unknown parentage were not grouped into phantom parent groups.
where ν_{ i }is a vector of t regression coefficients for bull i with var(ν_{ i }) = D.
Estimation of the G matrix with appropriate rank
where the r × r D_{ r } contains the r largest eigenvalues and the t × r matrix V_{ r } the r corresponding eigenvectors [17]. Consequently, t × t matrix G_{ r } has now only r(2t  r + 1)/2 parameters.
Bottomup PC approach
The bottomup PC approach is comprised of a sequence of REML analyses that starts with a subset of traits. New traits/countries are added one by one into the analysis, and after each trait addition step the correct rank of the model is determined. The latter can be inferred based on the size of the smallest eigenvalues of G[5] or of the correlation matrix or by using likelihood based model selection tools such as Akaike's information criterion (AIC) [18], which takes into account both the magnitude of the likelihood and the number of parameters in the model, thus penalizing for overparameterized models. The latter was used in this study. For given starting values in each step, we decomposed G into S and D, estimated D conditional on S and combined S and D to update G. At the beginning of the analysis, starting values provided by Interbull were used and in the subsequent steps, estimates were obtained from the previous steps.
The rationale behind the bottomup algorithm is to select in each step the highest rank, which is still justified by the AIC criteria. Each time a new country/trait, k + 1, is added to the analysis, the variance of the previous traits is already completely described by the r eigenvectors. The genetic variance of the new trait and its covariance with the previous eigenvectors is estimated and if it is considered to provide new information on breeding values, the new breeding value equation and the new rank, r + 1, is kept.
 1.Initial step
 (a)
choose k countries as starting subset
 (b)
use starting values G _{0}, take EDC_{ ij } and λ_{ j } for bull i to model the residual variance by applying weights w_{ ij }
 (c)
estimate k × k matrix for the k starting countries under the full rank model, r = k
 (d)
calculate Akaike's information criterion value AIC_{ r } = 2 log L + 2p, where log L is the maximum log Likelihood and p = r(r + 1)/2 the number of parameters
 (a)
 2.Determination of the correct rank
 (a)
for a given rank decompose
 (b)
derive , where is obtained from by removing the smallest eigenvalue from and the corresponding eigenvector from
 (c)
update the weights using , EDC_{ ij } and ?_{ j }
 (d)
estimate a new with and as covariables by fitting model (5).
 (e)
calculate AIC _{r1}
 (f)select the best model ("rank reduction" step)

after the initial step: while AIC_{r1}< AIC_{ r }, set r = r1 and repeat step 2, otherwise take and and proceed to step 3

after the country addition step: if AIC_{r1}< AIC_{ r }, replace and with and , otherwise take and and proceed to step 3

 (a)
 3.Addition of a new country/trait
 (a)if k < t, k = k + 1 and r = r + 1

add a new row and column of zeros to and , and set the k^{ th }element of to 1 and the r^{ th }diagonal element of to twice the average genetic variance from countries j = 1, k. Two times the mean value was used as a starting value for estimation of the variance of a new country to improve the convergence of iteration.

 (b)
update the weights using , EDC_{ ij } and ?_{ j } (w_{ ij } = EDC_{ ij }/g_{ jj }?_{ j } )
 (c)
estimate a new and backtransform to using Equation (5)
 (d)
calculate AIC_{ r }
 (a)
 4.
repeat steps 2 and 3 until k = t
 5.
Final step: update the weigths and reestimate the parameters
Direct PC approach
Genetic principal components can be estimated directly from the data [13]. The genetic (co)variance matrix is decomposed into matrices of eigenvalues and eigenvectors and only the leading principal components with notable contribution to the total variance are selected to estimate the genetic parameters. The direct estimation method requires a priori knowledge of the number of principal components fitted in the model or it must be estimated.
Defining the correct rank of matrix
where t is the number of traits and r_{ ij,m } is the estimated genetic correlation between traits i and j from an analysis fitting m PC. The genetic correlations from the suboptimal models were contrasted with the estimates from the direct PC rank 20 model (r_{ ij } ,_{20}), which was the optimal rank selected by the bottomup approach.
When the rank of the model is appropriately defined, [19] AIC should be at its minimum and the magnitude of the leading principal components and the sum of the eigenvalues stabilized, indicating that there is no repartitioning of the genetic variance into the residual variance, which is the case if too few principal components are fitted [11]. Further, the improvement of the Log Likelihood beyond the optimal model is expected to be negligible.
Differences between the direct and bottomup PC approaches
The parameterization in the bottomup PC approach differs from the direct PC approach in the matrix that is used for the eigenvalue decomposition. In the bottomup PC approach, the eigenvalue decomposition was done on the correlation matrix, while in the direct PC approach the parameterization was on the (co)variance matrix [13]. For both PC approaches, the heterogeneity in residual variances were taken into account using weights, as outlined above. In the bottomup PC approach, they were updated after each REML run, implying that were fixed, whereas were estimated in the direct PC approach.
Test application
Structure of the datasets for protein yield and somatic cell count (SCC).
Protein yield  SCC  

Country  Code  Number of bulls  Common bulls^{a}  Number of bulls  Common bulls^{a}  
Total  Foreign bulls, % ^{c}  Min^{b}  Max^{b}  Mean  Total  Foreign bulls^{c} , %  Min^{b}  Max^{b}  Mean  
Canada  CAN  7028  33  2  1044  267  7730  34  4  1191  331 
Germany  DEU  16734  23  56  1194  370  18624  25  49  1526  469 
DnkFinSwe^{d}  DFS  8900  13  12  590  248  9459  13  19  731  314 
France  FRA  11127  20  3  568  220  12254  19  7  622  274 
Italy  ITA  6322  20  8  607  253  7254  23  11  777  338 
The Netherlands  NLD  9696  24  26  1194  346  10935  26  37  1526  481 
USA  USA  23380  6  6  1044  410  25281  6  10  1191  507 
Switzerland  CHE  715  37  4  209  118  946  45  9  325  182 
Great Britain  GBR  4361  51  7  873  316  4017  55  12  855  377 
New Zealand  NZL  4253  24  3  560  209  4886  22  6  725  255 
Australia  AUS  4950  26  5  681  216  5404  31  12  895  325 
Belgium  BEL  634  97  12  425  143  665  97  14  466  166 
Ireland  IRL  1260  79  0  354  153  1337  96  3  388  183 
Spain  ESP  1499  48  2  408  203  1720  45  3  455  246 
Czech Republic  CZE  2036  75  12  590  202  2453  75  17  768  279 
Slovenia  SVN  196  55  5  68  32  ^{e}         
Estonia  EST  472  46  2  93  30  556  49  6  117  40 
Israel  ISR  773  11  0  59  27  853  11  1  68  33 
Swiss Red Hol^{f}  CHR  1162  45  3  256  103  1359  42  10  327  147 
French Red Hol^{f}  FRR  145  72  0  73  9  168  71  1  84  15 
Hungary  HUN  1898  46  2  502  192  1638  63  5  573  246 
Poland  POL  5071  16  0  295  118  ^{e}         
South Africa  ZAF  920  48  1  372  148  882  54  3  402  180 
Japan  JPN  3177  67  1  226  97  3562  63  1  272  123 
Latvia  LVA  232  71  6  71  29  ^{e}         
Danish Red Hol^{f}  DNR  ^{e}          232  38  1  83  16 
Total number of bulls  116941  122215 
The total number of records was 116 941 for protein yield and 122 215 for SCC. These represented 103 676 and 100 551 bulls with deregressed breeding values, respectively. The number of bulls with records in protein yield varied from 145 to 23 380 among countries, with a mean of 4 678 bulls per country.
Corresponding values for SCC were 168 to 25 281, with a mean of 5 314 bulls per country. For both biological traits, bulls were used mainly in one country; only 5% of the bulls were used in two countries and 1% in three countries. Further, only 286 bulls (i.e. 0.3%) with records for protein yield and 321 bulls (i.e. 0.3%) with records for SCC were used in more than 10 countries. Breeding policies vary notably among countries in terms of how much countries rely on their own breeding schemes or whether they import most of their breeding animals. USA is an example of a country that has a long tradition of Holstein breeding: only 6% of the bulls were imported bulls for the 2007 protein yield data (Table 1). Conversely, Belgium is an example of a country that leans heavily on import: in the same data, 97% of the Holstein bulls used in Belgium were imported (Table 1). The number of common bulls between countries varied from zero to 1 194 for protein yield, with a mean of 178, and for SCC from one to 1 526, with a mean of 240. Substantial variation existed in the number of common bulls among countries. For both biological traits, French Red Holstein shared the smallest number of common bulls with the other countries and the USA, as a popular trading partner, shared the most.
Bottomup PC runs were performed for both traits. Direct PC runs with ranks 15, 17, 19, 20 and 25 were carried out for protein yield to evaluate the optimal rank using the methods proposed by Meyer and Kirkpatrick [19]. For SCC, however, only the rank suggested by the bottomup PC approach was used in the direct PC analyses.
The sensitivity of the bottomup PC approach to different orders of country addition was tested for a subset of nine countries: France, USA, Czech Republic, Latvia, Poland, NewZealand, Australia, Slovenia and Ireland. These nine countries that were well and loosely linked, represented different hemispheres, and different managing systems and thus constituted a representative sample of all countries involved in the Interbull evaluation. Two different orders were tested. Order1 was the order of introduction of the countries above and order2 was the reverse of order1. For both orders, the analysis started with four countries.
The order of country addition should not affect the estimates, if only nonsignificant eigenvalues are excluded. To test this, we modified the bottomup PC approach. Instead of selecting the best model based on the AIC (steps 2ef, 3d), we determined a rank based on the proportion of explained variance in the transformation step 2a. Therefore, steps 2bd became optional, depending on whether the rank was reduced or not. We tested three scenarios: the modified bottomup approach was required to include 97, 99, or 99.5% of the total variance in the transformation step. For comparison, a full fit direct PC analysis (rank 9) and a basic bottomup analysis were carried out for the subset of nine countries.
The WOMBAT software [21] was used for the direct PC analyses, as well as for the variance component estimation in the bottomup PC approach. The average information REML algorithm was applied for both approaches. Bull pedigrees were based on sire and maternal grand sire information. Genetic correlations estimated by Interbull in their test runs (protein yield: test run preceding August 2007 evaluation, SCC: test run preceding April 2009 evaluation) were used for comparison.
Results and Discussion
Bottomup approach  effect of the order of country addition on the results
The effect of the order of country addition on the estimates of the bottomup PC approach for protein yield.
Differences  

Countries^{a}  Genetic correlations, direct PC 9  Direct PC 9 vs. Bottomup PC rank 8  Bottomup PC order1^{b} vs. order2^{c}  
1  2  rank 8  rank 7  rank 6  
FRA  USA  0.87  0  0  0  0.04 
FRA  CZE  0.58  0  0  0  0.03 
FRA  LVA  0.24  0.02  0  0  0.24 
FRA  POL  0.65  0  0  0  0.02 
FRA  NZL  0.68  0  0  0  0.07 
FRA  AUS  0.76  0  0  0  0.01 
FRA  SVN  0.51  0.01  0.02  0.14  0.17 
FRA  IRL  0.78  0  0  0.01  0 
USA  CZE  0.59  0  0  0  0 
USA  LVA  0.31  0.01  0.01  0.02  0.40 
USA  POL  0.56  0  0  0  0.02 
USA  NZL  0.54  0  0  0  0.02 
USA  AUS  0.65  0  0  0  0.05 
USA  SVN  0.36  0.02  0.03  0.12  0.08 
USA  IRL  0.63  0  0  0.02  0.08 
CZE  LVA  0.09  0.04  0  0.03  0.02 
CZE  POL  0.55  0  0  0  0.05 
CZE  NZL  0.47  0  0.01  0.01  0 
CZE  AUS  0.53  0  0  0  0.06 
CZE  SVN  0.44  0  0.04  0  0.04 
CZE  IRL  0.51  0.01  0  0.02  0.04 
LVA  POL  0.62  0.01  0  0.01  0.28 
LVA  NZL  0.15  0.05  0.02  0.01  0.13 
LVA  AUS  0.51  0.03  0.01  0.01  0.08 
LVA  SVN  0.21  0.07  0.01  0.12  0.16 
LVA  IRL  0.33  0.02  0.02  0.02  0.08 
POL  NZL  0.49  0  0  0  0.06 
POL  AUS  0.70  0  0  0  0.07 
POL  SVN  0.57  0.01  0  0.04  0.06 
POL  IRL  0.68  0  0  0  0.04 
NZL  AUS  0.80  0  0  0  0.01 
NZL  SVN  0.34  0.01  0.03  0.14  0.33 
NZL  IRL  0.81  0.01  0  0.01  0.05 
AUS  SVN  0.42  0.01  0.01  0.14  0.07 
AUS  IRL  0.84  0  0  0.01  0.07 
SVN  IRL  0.74  0.03  0  0.12  0.13 
Mean  0.54  0.002  0.003  0.021  0.022  
Mean_abs^{d}  0.54  0.010  0.006  0.028  0.085  
Max  0.87  0.07  0.04  0.14  0.40 
Correct rank
Selection of the appropriate rank for protein yield under the direct PC approach.
Rank 15  Rank 17  Rank 19  Rank 20  Full fit  

 68  19  0  4  19 
log L^{b}  105  36  2  0  0 
 0.029  0.017  0.004  0  0.001 
No of parameters  271  290  305  311  325 
Sum of eigenvalues  1696  1695  1695  1695  1695 
E1^{d}  1326  1330  1331  1331  1331 
E2  78.9  76.7  76.1  76.1  76.0 
E3  69.8  65.0  60.3  60.1  60.1 
E4  43.6  44.5  47.4  47.2  47.1 
E5  36.6  35.2  33.2  33.0  33.1 
E6  30.9  30.4  28.8  28.6  28.6 
E7  22.3  21.3  21.4  21.3  21.3 
E8  19.7  17.8  17.2  17.3  17.2 
E9  15.0  15.4  16.2  15.9  16.0 
E10  12.9  12.3  12.3  12.3  12.3 
E11  10.6  10.5  10.6  10.6  10.6 
E12  9.8  9.9  8.8  8.5  8.5 
E13  9.2  8.6  8.4  8.3  8.3 
E14  6.3  6.5  6.5  6.7  6.7 
E15  4.3  5.2  5.2  5.2  5.2 
E16  3.9  4.2  4.1  4.1  
E17  2.7  3.2  3.3  3.3  
E18  2.8  2.8  2.8  
E19  1.1  1.3  1.3  
E20  1.1  1.2  
E21  0.0  
E22  0.0  
E23  0.0  
E24  0.0  
E25  0.0 
The bottomup PC run terminated with rank 20 for protein yield, indicating that the approach is able to find the correct rank. Under the bottomup PC, G is obtained by backtransforming it and only the matrix of eigenvalues is directly estimated, thus p = r(r + 1)/ 2, and only 65% of the parameters were sufficient to describe the complete (co)variance matrix for that method. Based on the bottomup results, the appropriate rank was 15 for SCC. Thus, only 44% of the parameters under the bottomup PC were needed to describe the 23 × 23 (co)variance matrix for SCC, whereas the corresponding number for the direct PC rank 15 analysis was 87%.
Our results on the importance of fitting an optimal rank in the principal component analysis are supported by earlier studies by Meyer [22, 11] and Meyer and Kirkpatrick [19]. While studying reduced rank multivariate animal models for beef cattle, Meyer noticed that fitting too few principal components resulted in inaccurate estimates of the genetic parameters [22, 11]. A more recent study of Meyer and Kirkpatrick [19] has listed three sources of bias of reduced rank estimates: spread of sample roots, constraining estimates to the parameter space and picking up the wrong subset of the genetic PC, if too few PC are fitted.
Comparison of genetic correlations
The average estimates of genetic correlations from the direct PC rank 20, direct PC full fit, bottomup PC rank 20 and Interbull analyses for protein yield were very similar, ranging from 0.68 to 0.70 (Figure 1). Based on the first and third quantiles and the median, the distribution of the Interbull estimates was on a somewhat higher level compared to those of the PC approaches. Nevertheless, the Interbull estimates included the lowest value for protein yield, being as low as 0.02 between NewZealand and Latvia. The means of the SCC estimates were much higher, from 0.87 to 0.89 (Figure 2), compared to those for protein yield. In addition, the lowest values were rather high, ranging from 0.61 (Interbull) to 0.65 (bottomup PC). The distributions of the estimates of genetic correlations from the different approaches were very similar for SCC, although those for the Interbull were on a slightly higher level. The plots of genetic correlations also showed that overparameterization of the model for protein yield had virtually no effect on the estimates (Figure 1) since both rank 20 and 25 models resulted in almost identical genetic correlations.
Dissection of the estimates of France, NewZealand SouthAfrica, Slovenia and Latvia: magnitude of the genetic correlations and their standard errors for protein yield.
FRA^{a}  NZL  ZAF  SVN  LVA  

Genetic correlations  
Min  0.40  0.22  0.17  0.23  0.08 
Median  0.80  0.56  0.49  0.51  0.43 
Mean  0.76  0.57  0.49  0.50  0.40 
Max  0.90  0.81  0.69  0.69  0.62 
SEs of genetic correlations  
Min  0.01  0.02  0.04  0.07  0.07 
Median  0.02  0.03  0.05  0.08  0.08 
Mean  0.03  0.05  0.06  0.09  0.09 
Max  0.09  0.15  0.23  0.14  0.18 
On the one hand, the lack of connections between countries hinders the estimation of genetic parameters and this can explain the low genetic correlations for e.g. Slovenia and Latvia (Tables 1 and 4). On the other hand, countries like SouthAfrica and NewZealand were also associated with a lower level of genetic correlations for protein yield (Table 4). However, on average, they have strong links with the other countries (Table 1). Furthermore, the standard errors of the estimates of NewZealand and SouthAfrica with the other countries were relatively small, unlike those of Slovenia and Latvia with the other countries. Based on the study of Jakobsen et al. [1], different trait definitions and national genetic evaluation models, as well as genotype by environment interactions explain the low to moderate genetic correlations in international genetic evaluations of dairy bulls.
One of the main challenges in national genetic evaluation schemes incorporating foreign bulls is to adequately model systematic genetic differences by defining genetic groups. This holds especially for countries with small populations and where information on their daughters is scarce. This might result in proofs for foreign sires which are biased. As these national proofs are the data used in variance component estimation, inadequate genetic grouping at the national level may be one of the factors contributing to low estimates of genetic correlations for protein yield in different countries. Because selection of bulls is predominantly targeting production traits, the impact of illdefined genetic groups on proofs for other, nonproduction traits is expected to be smaller. In addition, imported bulls may be more representative of the population in the country of origin as, for instance, for SCC.
Altogether, using a more parsimonious covariance structure did not resolve the problem of some small genetic correlations for protein yield. Currently, Interbull postprocesses genetic correlations to correspond to the level of some justified expectation. This is done by utilizing information on the new estimates, estimates from the previous run, Interbull's own expectations and for nonHolstein breeds, correlations from Holstein [3]. Until the ultimate reason for the low estimates of genetic correlations has been identified, one alternative to the current postprocessing would be to apply prior expectations under the Bayesian MACE suggested by, e.g., Mark et al. [23]. With insufficient data, the prior expectations would not be overridden and the level of the final estimates would be closer to their biologically justified expectations compared to the current nonpostprocessed estimates or those obtained under the PC approaches. By applying the approach suggested by Mark et al. [23], we might reduce the degree of parsimony which can be attained using the PC approaches, but this may be offset by the prior information utilized and thus reduce mean square errors.
Performance of the PC approaches
The run time of the direct PC analysis for protein yield reached a maximum for the rank 15 model (22 days), decreased with increasing rank, being shortest for the rank 20 model (5 days) and was 17 days for the full fit model. The memory needed for the direct PC rank 20 model for protein yield was 4.1 GB, whereas it was highest i.e. 6.3 GB for the full fit model. Thus, the costs of the overparameterization were a longer run time and a higher RAM memory requirement without any increase in the accuracy of the estimation. Furthermore, the magnitude of standard errors of the estimates increased with the number of parameters to be estimated. This is a consequence of the increased sampling variance when estimating more parameters [see [22, 11]]. Interestingly, fitting too few parameters in the model prolonged the run time. This occurred also for SCC (results not shown) and for the factor analytic models (manuscript in preparation). If the rank of the model is reduced too much, the number of available parameters is not sufficient to describe the (co)variance structure of the model, which in turn, detrimentally affects the convergence rate of the REML analysis.
Run time (d:hr:min) and number of iterates required for analyses of protein yield.
Country addition step  Rank reduction step  Total time  

Countries  Iterates  Time  Rank  Iterates  Time  
Bottomup PC  7  5  0:00:46  7  4  0:00:26  0:01:12 
8  9  0:01:48  8  4  0:00:41  0:02:29  
9  8  0:02:21  9  5  0:01:12  0:03:33  
10  8  0:03:24  10  6  0:02:02  0:05:26  
11  11  0:06:05  11  5  0:02:25  0:08:30  
12  14  0:10:24  11  6  0:03:49  0:14:13  
13  13  0:10:53  12  6  0:03:50  0:14:43  
14  13  0:14:09  13  6  0:05:00  0:19:09  
15  12  0:16:34  14  5  0:05:28  0:22:02  
16  77  6:03:56  15  8  0:11:04  6:15:00  
17  12  1:06:04  16  6  0:10:40  1:16:44  
18  17  2:10:31  16  13  1:03:47  3:14:18  
19  12  1:13:49  17  6  0:13:00  2:02:49  
20  21  3:14:19  17  12  1:07:15  4:21:34  
21  14  1:22:37  18  5  0:13:08  2:11:45  
22  28  5:11:05  19  7  0:22:04  6:09:09  
23  15  3:14:23  19  11  1:15:25  5:04:48  
24  15  3:17:11  20  6  1:24:00  4:17:35  
25  14  4:05:09  20  12  2:04:03  6:09:12  
46:23:11  
Direct PC  25  20  24  5:13:27 
Run time (d:hr:min) and number of iterates required for analyses of SCC.
Country addition step  Rank reduction step  Total time  

Countries  Iterates  Time  Rank  Iterates  Time  
Bottomup PC  7  25  0:02:45  5  3+2+4^{a}  0:00:28  0:03:13 
8  14  0:00:56  6  3  0:00:09  0:01:05  
9  13  0:01:29  7  5  0:00:22  0:01:51  
10  6  0:01:19  8  5  0:00:37  0:01:56  
11  6  0:01:58  9  5  0:00:56  0:02:54  
12  3  0:01:50  9  21  0:05:05  0:06:55  
13  11  0:03:59  10  5  0:01:21  0:05:20  
14  26  0:12:18  11  8  0:02:49  0:15:07  
15  22  0:13:43  11  6  0:03:11  0:16:54  
16  8  0:05:26  11  4  0:02:21  0:07:47  
17  9  0:06:01  11  4  0:02:17  0:08:18  
18  10  0:06:31  12  8  0:03:58  0:10:29  
19  12  0:10:09  12  6  0:04:04  0:14:13  
20  11  0:11:20  13  5  0:03:51  0:15:11  
21  13  0:14:34  14  7  0:06:25  0:20:59  
22  15  1:01:54  14  6  0:07:39  1:09:33  
23  9  0:13:13  15  7  0:07:59  0:21:12  
7:18:57  
Direct PC  23  15  86  7:00:02 
Overall, both approaches tested in this study performed very well and estimates of genetic correlations were similar to the Interbull estimates. Both PC approaches were applied to complete datasets unlike those suggested in the earlier studies [7, 8, 12, 9] and the current Interbull procedure [2, 3]. One advantage of the direct over the bottomup PC approach are the potentially much reduced computational requirements. This applies in particular when an analysis is started with a small subset of countries and countries with a problematic data structure are added in early on. Such problems were encountered for protein yield, resulting in the computational requirements shown in table 5. The current test version of the bottomup PC approach has not been streamlined yet by any means. The analysis of the performance of the bottomup approach (Tables 5 and 6) as well as preliminary tests give evidence that the computation time can be reduced by starting from a higher number of countries. Furthermore, when a new country is added, zero starting covariances between new and old countries could be replaced with covariances calculated from the mean correlation of countries already in the dataset and from the variances of those countries. The advantage of the bottomup approach is that we estimate r elements of r × r D _{ r }matrix in the parameter estimation step and therefore, there is no danger of picking up the wrong subset of principal components.
Conclusions
This study shows that both the direct and the bottomup principal component approaches and the use of models with optimal rank are useful in the variance component estimation for MACE. Furthermore, both approaches can be applied to large datasets and data subsetting is not needed. Based on the results, we emphasize the importance of the selection of the appropriate rank of the (co)variance matrix to obtain good estimates. The bottomup PC approach is capable of determining the appropriate rank for highly overparameterized models and thus leads to a more parsimonous variance structure. However, with a predetermined rank, the direct PC approach needs less computing time than the bottomup PC. The third approach that is considered for variance component estimation for MACE is the direct factor analytic approach that will be presented in an upcoming paper.
Declarations
Acknowledgements
The study was part of the cooperation project of Interbull Centre and MTT Agrifood Research Finland.
Authors’ Affiliations
References
 Jakobsen JH, Dürr JW, Jorjani H, Forabasco A, Loberg A, Philipsson J: Genotype by environment interactions in international genetic evaluations of dairy bulls. Proc 18th Assoc Advmt Anim Breed Genet, AAAGB, Roseworthy, Australia. 2009, 133142.
 Jorjani H, Emanuelson U, Fikse WF: Data subsetting strategies for estimation of acrosscountry genetic correlations. J Dairy Sci. 2005, 88: 12141224. 10.3168/jds.S00220302(05)727880.View ArticlePubMed
 Genetic correlation estimation procedure. [http://www.interbull.org/images/stories/Genetic_correlation_estimation_procedure_2009t2_110110.pdf]
 Jorjani H: Simple method for weighted bending of genetic (co)variance matrices. J Dairy Sci. 2003, 86: 677679. 10.3168/jds.S00220302(03)736467.View ArticlePubMed
 Mäntysaari EA: Multipletrait acrosscountry evaluations using singular (co)variance matrix and random regression model. Interbull Bull. 2004, 32: 7074.
 Beek van der S: Exploring the (Inverse of the) International Genetic Correlation matrix. Interbull Bull. 1999, 22: 1420.
 Madsen P, Jensen J, Mark T: Reduced rank estimation of (co)variance components for international evaluation using AIREML. Interbull Bull. 2000, 25: 4650.
 Rekaya R, Weigel KA, Gianola D: Application of a structural model for genetic covariances in international dairy sire evaluations. J Dairy Sci. 2001, 84: 15251530. 10.3168/jds.S00220302(01)701865.View ArticlePubMed
 Leclerc H, Minéry S, Delaunay I, Druet T, Fikse WF, Ducrocq V: Estimation of genetic correlations among countries in international dairy sire evaluations with structural models. J Dairy Sci. 2006, 89: 17921803. 10.3168/jds.S00220302(06)722482.View ArticlePubMed
 Schaeffer LR: Multiplecountry comparison of dairy sires. J Dairy Sci. 1994, 77: 26712678. 10.3168/jds.S00220302(94)77209X.View ArticlePubMed
 Meyer K: Multivariate analyses of carcass traits for Angus cattle fitting reduced rank and factor analytic models. J Anim Breed Genet. 2007, 124: 5064. 10.1111/j.14390388.2007.00637.x.View ArticlePubMed
 Leclerc H, Fikse WF, Ducrocq V: Principal components and factorial approaches for estimating genetic correlations in international sire evaluation. J Dairy Sci. 2005, 88: 33063315. 10.3168/jds.S00220302(05)730149.View ArticlePubMed
 Kirkpatrick M, Meyer K: Direct estimation of genetic principal components: Simplified analysis of complex phenotypes. Genetics. 2004, 168: 22952306. 10.1534/genetics.104.029181.PubMed CentralView ArticlePubMed
 Fikse WF, Banos G: Weighting factors of sire daughter information in international genetic evaluations. J Dairy Sci. 2001, 84: 17591767. 10.3168/jds.S00220302(01)746115.View ArticlePubMed
 Tarres J, Liu Z, Ducrocq V, Reinhardt F, Reents R: Data transformation for rank reduction in multitrait MACE model for international bull comparison. Genet Sel Evol. 2008, 40: 295308.PubMed CentralPubMed
 van der Werf JHJ, Goddard ME, Meyer K: The use of covariance functions and random regressions for genetic evaluation of milk production based on test day records. J Dairy Sci. 1998, 81: 33003308. 10.3168/jds.S00220302(98)758953.View ArticlePubMed
 Tyrisevä AM, Lidauer MH, Ducrocq V, Back P, Fikse WF, Mäntysaari EA: Principal component approach in describing the across country genetic correlations. Interbull Bull. 2008, 38: 142145.
 Akaike H: Information theory and an extension of the maximum likelihood principle. Second International Symposium in Information Theory. Edited by: Petrov BN, Csaki F. 1973, Akad. Kiado, Budapest, Hungary, 267281.
 Meyer K, Kirkpatrick M: Perils of parsimony: Properties of reducedrank estimates of genetic covariance matrices. Genetics. 2008, 180: 11531166. 10.1534/genetics.108.090159.PubMed CentralView ArticlePubMed
 Jairath L, Dekkers JCM, Schaeffer LR, Liu Z, Burnside EB, Kolstad B: Genetic evaluation for herd life in Canada. J Dairy Sci. 1998, 81: 550562. 10.3168/jds.S00220302(98)756073.View ArticlePubMed
 Meyer K: WOMBAT  A tool for mixed model analyses in quantitative genetics by REML. J Zheijang Univ Sci B. 2007, 8: 815821. 10.1631/jzus.2007.B0815.View Article
 Meyer K: Genetic principal components for live ultrasound scan traits of Angus cattle. Anim Sci. 2005, 81: 337345.View Article
 Mark T, Madsen P, Jensen J, Fikse WF: Prior (co)variances can improve multipletrait acrosscountry evaluations of weakly linked bull populations. J Dairy Sci. 2005, 88: 32903302. 10.3168/jds.S00220302(05)730125.View ArticlePubMed
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.