Measuring connectedness among herds in mixed linear models: From theory to practice in large-sized genetic evaluations

A procedure to measure connectedness among groups in large-sized genetic evaluations is presented. It consists of two steps: (a) computing coefficients of determination (CD) of comparisons among groups of animals; and (b) building sets of connected groups. The CD of comparisons were estimated using a sampling-based method that estimates empirical variances of true and predicted breeding values from a simulated n-sample. A clustering method that may handle a large number of comparisons and build compact clusters of connected groups was developed. An aggregation criterion (Caco) that reflects the level of connectedness of each herd was computed. This procedure was validated using a small beef data set. It was applied to the French genetic evaluation of the beef breed with most records and to the genetic evaluation of goats. Caco was more related to the type of service of sires used in the herds than to herd size. It was very sensitive to the percentage of missing sires. Disconnected herds were reliably identified by low values of Caco. In France, this procedure is the reference method for evaluating connectedness among the herds involved in on-farm genetic evaluation of beef cattle (IBOVAL) since 2002 and for genetic evaluation of goats from 2007 onwards.


INTRODUCTION
The problem of disconnectedness in genetic evaluation is becoming increasingly important in animal breeding. In this context, the best linear unbiased prediction (BLUP) of breeding values allows meaningful comparisons between animals, but only when genetic links exist between the different environments (e.g. [7]). Disconnectedness was originally defined for fixed effects models in terms of non-estimability [2]. Such a definition implies that disconnectedness never occurs for random effects, since their contrasts are always estimable. However, the data design is the same whether the effect is fixed or random. Laloë [14] has defined disconnectedness for random effects in terms of "non-predictability" of contrasts: a contrast is not predictable if its coefficient of determination (CD) is null. Laloë [14] showed the close relationship between the concepts of inestimability and non-predictability. Laloë and Phocas [15] showed that both decrease in accuracy and potential bias in a genetic evaluation are due to the same phenomenon of regression towards the mean. These authors proved that these two effects of disconnectedness were assessed by CD of comparisons of the BLUP of breeding values of animals raised in different environments. Although using a different terminology (the "standardised prediction variance" is equal to 1 -CD), Huisman et al. [10] used the square root of the CD of comparisons as a criterion of connectedness.
Several other methods have been proposed to evaluate connectedness. Foulley et al. [7] proposed a connectedness index (IC) equal to the relative decrease in prediction error variance (PEV) when fixed effects are known. Kennedy and Trus [12] measured the connectedness between two herds as the average PEV of differences in expected breeding value between all pairs of animals in the two herds (av_PEV). Lewis et al. [17] proposed the correlation (r ij ) of breeding value prediction errors as a pairwise connectedness statistic and suggested averaging this statistic for all pairs of animals in different management units to evaluate connectedness between units. Mathur et al. [18] introduced a similar correlation statistic, the connectedness rating (CR ij ), to measure connectedness, based on the error (co)variances of fixed management estimates. Other connectedness measures such as functions of counts of direct links between test station groups have also been suggested [21]. Laloë et al. [16] compared IC and av_PEV to CD. The CD was found to combine data structure and amount of information. It also provides a balance between the decrease of PEV and the loss of genetic variability owing to the genetic relationships between animals. These authors concluded that CD was the best method for judging the precision of a genetic evaluation or optimising corresponding designs, especially when genetic relationships among animals are to be accounted for through a relationship matrix.
Kuehn et al. [13] examined the importance of connectedness and weighed the merits of CD, r ij and CR ij . They compared different connectedness scenarios and found that only CD showed a consistent relationship with bias reduction across all scenarios tested. However, as stated by [13], "the CD is difficult 147 to calculate for routine genetic evaluation due to storage and processing time involved in calculating both the inverse of the coefficient matrix and the (noninverted) relationship matrix". They advocated measuring connectedness by other criterions, highly correlated to CD, but easier to compute.
Another way to circumvent this drawback is to turn to methods of approximated estimation of variance-covariance matrices. Garcia-Cortes et al. [9] and Fouilloux and Laloë [4] have proposed sampling methods that theoretically allow the estimation of entire variance-covariance matrices, and, as a result, the estimation of CD of contrasts among genetic levels of units. Two units should be considered connected as soon as their genetic levels are predicted with sufficient accuracy. The choice of this level is rather arbitrary just like the choice of a level of accuracy for individual genetic values (EBV) to be published. However, links between CD of contrasts among units and both accuracy of contrasts among animals of different units and bias reduction as established by [16] should help to choose such a level. Once the minimal level of CD is chosen, say χ, groups of connected units have to be built, generally through clustering methods [7].
To be applicable in large-sized genetic evaluations, a method for building groups of connected units should meet two requirements: (i) to explicitly build clusters of units in such a way that the CD of the contrast between the genetic levels of two randomly picked units will be higher or equal to χ; (ii) to handle a large number of units, which may reach several thousands. This paper presents a new clustering method for the estimation of connectedness in across herd genetic evaluation, named "Caco". The method is first validated in a small genetic evaluation of the French Bazadais beef cattle breed. Subsequently, the application is demonstrated in two large genetic evaluations. These are the evaluation of 210-day weight in the French Charolais beef cattle breed and the evaluation of protein yield in a multi-breed population of dairy goats.

Theory
Consider the following mixed model: where y is the performance vector, b the fixed effect vector, u the random effect vector, e the residual vector, and X and Z the incidence matrices that associate elements of b and u with those of y. The variance structure for this model is the following: where A is the numerator relationship matrix, and the scalars σ 2 a and σ 2 e are the random effect and the residual variances, respectively. The BLUE (Best Linear Unbiased Estimation) of b, denoted b˚, and the BLUP (Best Linear Unbiased Prediction) of u, denotedû, are the solutions of: The precision of a comparison between the genetic values of animals or groups of animals is assessed by the CD of the corresponding contrast [14]. Typically, a given contrast can be written as a linear combination of the breeding values (c u). Hence, for any linear combination c u, we have: For instance, the CD of the estimated breeding value of a single animal is obtained by using a vector c null except a 1 in the appropriate position corresponding to this breeding value.

Estimation of CD of contrasts
The method presented by Fouilloux and Laloë [4] to estimate CD of estimated breeding values has been applied to a sire model to approximate the CD of contrasts between herds. The procedure is as follows: 1-The animals involved in the simulation are sorted from the oldest to the youngest.
2-The direct genetic value u i of the animal i is calculated according to the status of its sire (j). If j is unknown, u i is generated from N 0, σ 2 a . If j is known, u i is calculated by u i = 0.5 u j + ϕ i where ϕ i is drawn from N 0, 3σ 2 a 4 . 3-Performance of each performance-tested animal (l) is simulated using the generated breeding value of its sire (j). Fixed effects are set to 0. Consequently, y l = s i + ε l , with s j = 0.5 × u j and the residual ε l is drawn from N 0, σ 2 ε where σ 2 ε = 3σ 2 a 4 + σ 2 e .
4-The vectorŝ is obtained by solving the mixed model equations (3) using y. This process repeated n times leads to vectors of genetic values, u (k) k = 1, n and û (k) k = 1, n whereû = 2 ×ŝ. 5-The CD of contrasts of interest are estimated by computing their empirical variances and covariances (quoted with *) and substituting them in (4): The NAG [19] subroutines were used for drawing random numbers. The approximated distributions of vectors were obtained with 1000 replicates. BLUP were estimated using a successive overrelaxation iterative method, ceasing iteration when the following convergence criterion was less than 10 −3

Selecting the set of connected herds: the Caco method
The main practical goal of connectedness studies is to identify sets of connected herds. Two herds are considered connected if the CD of the contrast between their genetic levels is greater than an a priori threshold, say χ. A set of connected herds should then be built in such a way that any pairwise CD of contrasts between herds of the set is greater than χ. This might be achieved through the use of a clustering method, namely the complete linkage method. Complete linkage is a hierarchical agglomerative clustering method that finds small, compact clusters that do not exceed some diameter threshold. However, this method cannot handle very large problems. Here, we are proposing an alternative agglomerative clustering procedure, which is explicitly designed for building compact clusters and is suitable for large-sized data sets.
At the start of the process, each herd begins in a cluster by itself, and each step involves aggregating herds one by one into appropriate clusters: Step Step 2. A similarity index is calculated for each herd outside the cluster {h 1 ,h 2 }. The similarity index of a given herd is equal to its lowest CD with the herd currently in the cluster. The herd with the highest similarity index is added to the cluster. The Caco ("Criterion of Admission to the group of COnnected herds") of this new clustered herd is equal to its similarity index at this step. Supposing, for the sake of simplicity, that this herd is h 3 , then, the new partition is the following: [{h 1 ,h 2 ,h 3 },...,{h n }].
The process stops either when all herds are clustered, or when the CD of comparison between the clustered herds and each of the remaining herds are all lower than the fixed a priori threshold χ. In that latter case, the algorithm is applied to the remaining herds for the building of other possible clusters. Eventually, two herds within the same cluster are ensured to be compared with a CD χ. The choice of χ can be considered in relation to CD, which can be taken as a criterion of accuracy. Laloë and Phocas [15] showed that, for balanced sire designs where connections are established using common sires across units, the CD is a perfect indicator of potential bias arising when comparing individuals in separate units. They proved that, in such a design, the CD of the contrast between genetic levels of two herds is equal to: where n is the number of progeny per sire, η is the proportion of progeny from common sires, and λ the variance ratio defined in equation (3). Therefore, CD depends on three factors, (i) the amount of information through n, the number of progeny per sire; (ii) the quality of the design through η; and (iii) the heritability via the variance ratio λ. It is worth noting that the need for links decreases with the number of progeny per sire, but not with the number of sires per herd. These theoretical results were confirmed by Kuehn et al. [13]. Formula (5) may be used as a rule of thumb when choosing the value of χ.

Validation of the procedure
Validation of this procedure was done with the data used on the official French on-farm beef cattle evaluation, IBOVAL, for the Bazadais breed and the year 2006 [11]. The trait analysed was 210-day weight. The model used was a sire model and included the same fixed effect factors as in the actual IBOVAL evaluation [11].
The data set consisted of 4957 weights and 371 sires. Unknown sires were replaced by one dummy sire in each management unit [6]. Management unit (400 levels), sex (2 levels), parity-age of dam (12 levels) and season (9 levels) were included in the model as fixed effects. The heritability (h 2 ) was equal to 0.23. The connectedness was studied among the 45 herds that had calf performances recorded during the last five years. The empirical variances and covariances needed to estimate the CD of comparison between herds were calculated using their later calves only, because these are the most relevant to current selection.
The threshold χ was chosen to be equal to 0.4. In the IBOVAL context, considering a number of progeny per sire equal to 25 (the minimum progeny number required by IBOVAL to publish bull indexes) which corresponds to a CD of 0.6), formula (5) leads to a rate of Artificial Insemination (AI) use of 44%.
The estimated values of CD of comparison among herds were computed by performing the re-sampling method described in Section 2.2. The limited size of the data set also allowed the computations of the true CD of comparison by the direct inversion of the coefficient matrix.

Application of the procedure
The procedure to assess connectedness was applied to the genetic evaluation of 210-day weight (IBOVAL [11]) in the Charolais beef cattle breed and to the French goat multi-breed genetic evaluation for the protein yield [3]. The Charolais data included approximately 2 600 000 weaning weights from 80 000 bulls. The model used was a sire model (h 2 = 0.26) and included the same fixed effects as in the real IBOVAL evaluation: management unit (75 000 herd-year), parity-age of dam (25 levels), sex (2 levels), season (10 levels) and supplementation level (2 levels, supplementary fed or not). The data included 3576 herds with calf performances recorded during the last five years.
The method was also applied to the genetic evaluation of protein yield (h 2 = 0.30) in French dairy goats. The data included 1 720 000 first lactation records from 89 500 sires, with the following fixed effects: herd-year (56 700 levels), age of the female at the beginning of the lactation (8 levels) and kidding month (6 levels). The connectedness was studied among In both analyses, unknown sires were replaced by one dummy sire in each management unit [6]. All the computations used a RISC 595 supercomputer with a CPU of 133 MHz. Plots, dendrograms and smoothing surfaces were drawn with the R software [20] and the contributed packages Rcmdr [8] and mgcv [22].

Validation of the procedure
The true CD of the 990 contrasts between the Bazadais herds range between 0.001 and 0.703 (Tab. I), with a mean of 0.262 and a standard deviation of 0.145. The approximated CD values range between 0.011 and 0.716, with a mean of 0.294 and a standard deviation of 0.151. The approximated CD values are slightly overestimated compared to the true CD, but the correlation between the estimated and true values is very close to 1 (r = 0.966), as illustrated by Figure 1.
The clustering procedure was applied to the two sets of CD. The maximum and minimum for true Caco and CD are the same, as well as the maximum and minimum for estimated Caco and CD (Tab. I). The mean and standard deviation of the true Caco are 0.297 and 0.209, respectively, while the corresponding values for the estimated Caco are 0.320 and 0.207. As for the CD, the approximated Caco is slightly overestimated. Estimated and true Caco are highly correlated (r = 0.976), as illustrated by Figure 2. Figure 3 highlights the clustering processes applied to the true CD (left hand tree) and the estimated CD (right hand tree). In consultation with French beef cattle breed societies, the threshold level χ to consider a herd as connected was taken as 0.40. This was consistent with a previous measure of connection based on the number of calves born in a herd-year from AI sires. This threshold also connected small herds with a high proportion of AI and herds   with a low use of AI but which exchanged numerous natural service sires. Processes using true and estimated CD were quite comparable and led to similar clusters. In each process, only one cluster was found. Fourteen herds made up the cluster when considering true CD. These 14 herds were again included in the cluster built using estimated CD (right hand tree), but, owing to the slight overestimation shown in Figure 1, a 15 th herd was added to the cluster.

Charolais beef cattle evaluation
The 1000 replicates carried out to estimate the CD of contrasts required less than three hours of CPU time. The cluster analysis was performed and the Caco criterion was calculated in about 13 min. Among the 3576 herds, 2791 had a Caco greater or equal to 0.40 and, therefore, were considered connected to each other. For each herd, the average, minimal and maximal Caco and CD of comparisons with other herds were computed (Tab. II). The major difference between all herds and the connected ones concerned the minimal CD. Its average value increased from 0.05 (whole set) to 0.46 (connected set). As expected, the minimum value of this statistic was null in the whole set and always greater or equal to χ the connected set.
Relationships among the Caco and other parameters describing the herds were investigated (Tab. III). The Caco of a herd was more related to the average (r = 0.94) and maximal (r = 0.77) than to the minimal (r = 0.18) of the CD of contrasts pertaining to the herd. It depended only slightly on the herd size (r = 0.16), while it increased with the number of sires used (r = 0.57). The percentage of unknown sires in a herd tended to decrease Caco (-0.27). Moreover, for 13 herds in which the percentage of unknown sires was 100%, the Caco was equal to zero. Furthermore, Caco increased with sire accuracy (r = 0.75), and for AI link sires used across herds (r = 0.76).  Caco of herds involved in IBOVAL are re-estimated every year including records on calves born in the last five years. The stability of Caco was checked by comparing the results from successive years. Among the 3801 herds involved in the Charolais genetic evaluations of weaning weight in 2005 and 2006, the correlation between the Caco of both years was equal to 0.94. The main variations are easily explained by high modification of the AI rate or of the percentage of unknown sires in the herds concerned.

Goat breed application
The average Caco value was equal to 0.34 with a standard deviation of 0.26. As in the beef breed genetic evaluation, the Caco was strongly related to the percentage of goats with AI sire and to the percentage of unknown sires that reached on average 42% among French goats [3]. All herds with 100% of unknown sires had null Caco. Most of the herds that used AI sires had a high Caco, as highlighted by the correlation between the Caco and the percentage of goats with AI sire (r = 0.72). However, some herds in which the AI rate was low also had a high Caco. That is explained by the use, in those herds, of natural mating sires sired by AI. The correlation between Caco and the percentage of goats with an AI paternal grandsire was indeed equal to 0.76. A two-dimensional smoothing plot (Fig. 4), based on a generalised additive model [22] highlighted these relationships among Caco, the percentage of goats with AI sire (%AIs) and AI paternal grandsire (%Algs).
These results emphasise that the pedigree information was taken into account for measuring the connectedness level among herds.
As in beef breed genetic evaluation, the stability of the Caco estimation was high. Herds (1979)

CONCLUSION
A new two-step process to analyse connectedness among herds is described: the first step involves computing the CD of comparisons between groups of animals using a sampling method, while in the second step, clusters of well connected groups are formed. In applying this method, a decision needs to be made on the threshold χ for the CD to be achieved before a herd is considered to be connected. Such a decision still is and will always be a subjective matter. However, a more informed choice is possible using CD as a criterion of accuracy and potential bias, and by considering the relationships between CD, the amount of information, and quality of design.
Thanks to the simplicity of this method, different models of analysis may be easily adapted. The choice of the best model depends on the size of the analyses and the knowledge of the pedigree. Hence, application of single-or multi-trait analyses using an animal model with or without maternal effects will be possible for small-sized evaluations, while sire or sire-maternal grandsire models can be used for large-sized evaluations, depending on the amount of the unknown sires or grandsires in the pedigree files. For instance, in the French beef breed genetic evaluation, a sire-maternal grandsire model with maternal effects will soon be tested to measure how worthwhile it is to add pedigree information and especially to consider the connection due to the dams. Furthermore, since first lactations represent only 29% of the whole dataset used in the French goat genetic evaluation, the method should be extended to a repeated trait model.
The clustering method was appropriate to condense the relevant information of large matrices of similarities (here, CD of contrasts between genetic levels of herds). It meets the requirement to construct sets of well connected units, and may very quickly handle large problems (4000 herds in less than 15 min). Finally it provides a set of connected herds, which makes it possible to define a pool of animals for which estimated breeding values can be compared.
The procedure efficiently accounts for known pedigree and data structure when measuring connectedness between herds. It matched the expectations of breeder associations based on their knowledge of the level of genetic exchange among herds, and has been well adopted. The procedure has been used as the reference method to determine the connected herds in the regular IBOVAL evaluation of nine French beef cattle breeds since 2002 [5]. It will also be applied to the goat genetic evaluation from 2007 onwards [1].