Using pooled data to estimate variance components and breeding values for traits affected by social interactions

Background: Through social interactions, individuals affect one another's phenotype. In such cases, an individual's phenotype is affected by the direct (genetic) effect of the individual itself and the indirect (genetic) effects of the group mates. Using data on individual phenotypes, direct and indirect genetic (co)variances can be estimated. Together, they compose the total genetic variance that determines a population's potential to respond to selection. However, it can be difficult or expensive to obtain individual phenotypes. Phenotypes on traits such as egg production and feed intake are, therefore, often collected on group level. In this study, we investigated whether direct, indirect and total genetic variances, and breeding values can be estimated from pooled data (pooled by group). In addition, we determined the optimal group composition, i.e. the optimal number of families represented in a group to minimise the standard error of the estimates.

Methods: This study was performed in three steps. First, all research questions were answered by theoretical derivations. Second, a simulation study was conducted to investigate the estimation of variance components and optimal group composition. Third, individual and pooled survival records on 12 944 purebred laying hens were analysed to investigate the estimation of breeding values and response to selection.

Results: Through theoretical derivations and simulations, we showed that the total genetic variance can be estimated from pooled data, but the underlying direct and indirect genetic (co)variances cannot. Moreover, we showed that the most accurate estimates are obtained when group members belong to the same family. Additional theoretical derivations and data analyses on survival records showed that the total genetic variance and breeding values can be estimated from pooled data. Moreover, the correlation between the estimated total breeding values obtained from individual and pooled data was surprisingly close to one. This indicates that, for survival in purebred laying hens, loss in response to selection will be small when using pooled instead of individual data.

Conclusions: Using pooled data, the total genetic variance and breeding values can be estimated, but the underlying genetic components cannot. The most accurate estimates are obtained when group members belong to the same family.

Direct, indirect and total genetic variances can be estimated from individual data. However, it can be difficult or expensive to obtain individual phenotypes on certain traits, e.g. egg production and feed intake. Alternatively, data can be obtained on group level, resulting in pooled records. However, pooling data reduces the number of data points. Moreover, multiple animals influence each data point, increasing the complexity of the data. Although there is an obvious loss of power, previous studies have shown that pooled data can be used to estimate direct genetic variances for traits not affected by social interactions [15][16][17]. However, with social interactions, indirect genetic effects emerge and the complexity of the data increases further. It is unclear whether pooled data are still informative in these situations. Therefore, the main objective of this study was to determine whether pooled data can be used to estimate direct, indirect and total genetic variances, and breeding values for traits affected by social interactions. In addition, optimal group composition was determined, i.e. the optimal number of families represented in a group to minimise the standard error of the estimates.

Methods
This study was performed in three steps. First, all research questions were answered by theoretical derivations. Second, a simulation study was conducted to investigate the estimation of variance components and optimal group composition. Third, individual and pooled survival records on 12 944 purebred laying hens were analysed to investigate the estimation of breeding values and response to selection. Table 1 lists the main symbols and their meaning.

Variance components and breeding value estimation
In this section, we examine whether direct, indirect and total genetic variances, and breeding values can be estimated from pooled data. With social interactions, an individual phenotype consists of the direct genetic ($A_D$) and environmental ($E_D$) effects of the individual itself ($i$), and the indirect genetic ($A_I$) and environmental ($E_I$) effects of its group mates ($j$):

$$P_i = A_{D,i} + E_{D,i} + \sum_{j \neq i}^{n-1} \left( A_{I,j} + E_{I,j} \right), \tag{1}$$

where $n$ is the number of individuals per group [11]. From an animal breeding perspective, the total breeding value ($A_T$) is of interest because it determines total response to selection. An animal's $A_T$ consists of a direct and indirect component:

$$A_{T,i} = A_{D,i} + (n-1) A_{I,i}, \tag{2}$$

where $A_D$ is expressed in the phenotype of the animal itself and $A_I$ is expressed in the phenotype of each group mate.
A pooled record ($P^*$) consists of the individual phenotypes of all group members ($k$):

$$P^* = \sum_{k=1}^{n} P_k. \tag{3}$$

It follows from Equations (1) and (3) that, with social interactions, a pooled record consists of the $A_D$ and $E_D$ of each group member, as well as their $A_I$ and $E_I$ that are expressed $n-1$ times:

$$P^* = \sum_{k=1}^{n} \left[ A_{D,k} + E_{D,k} + (n-1) \left( A_{I,k} + E_{I,k} \right) \right]. \tag{4}$$

Because an animal's $A_D$ and $A_I$ are expressed in the same pooled record, the direct Z-matrix that links pooled phenotypes to $A_D$'s and the indirect Z-matrix that links pooled phenotypes to $A_I$'s are completely confounded (as shown in Appendix A by using a fictive example). It follows from Equations (2) and (4) that, with social interactions, a pooled record contains the total genetic effect of each group member:

$$P^* = \sum_{k=1}^{n} \left[ A_{T,k} + E_{D,k} + (n-1) E_{I,k} \right]. \tag{5}$$

Equation (5) shows strong similarities with:

$$P^* = \sum_{k=1}^{n} \left[ A_{D,k} + E_{D,k} \right], \tag{6}$$

which shows the content of a pooled record when social interactions do not occur. Previous studies have shown that pooled data can be used to estimate direct genetic variances ($\sigma^2_{A_D}$) and direct breeding values for traits that are not affected by social interactions [15][16][17]. Similarly, pooled data can be used to estimate total genetic variances ($\sigma^2_{A_T}$) and total breeding values for traits that are affected by social interactions.
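The step from Equation (4) to Equation (5) can be checked numerically. The following is a minimal sketch in Python with NumPy, assuming arbitrary random values for the genetic and environmental effects of one group of four; the variable names are illustrative and not taken from the study.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4  # group size (assumed for illustration)

# Hypothetical direct/indirect genetic and environmental effects of the n group members
a_d, a_i = rng.normal(size=n), rng.normal(size=n)
e_d, e_i = rng.normal(size=n), rng.normal(size=n)

# Equation (1): own direct effects plus the indirect effects of the n-1 group mates
phenotypes = np.array([
    a_d[i] + e_d[i] + (a_i.sum() - a_i[i]) + (e_i.sum() - e_i[i])
    for i in range(n)
])

# Equation (3): the pooled record is the sum of the individual phenotypes
pooled = phenotypes.sum()

# Equations (2) and (5): total breeding values, and the pooled record written in terms of them
a_t = a_d + (n - 1) * a_i
pooled_from_totals = (a_t + e_d + (n - 1) * e_i).sum()

print(np.isclose(pooled, pooled_from_totals))
```

The two ways of writing the pooled record agree for any set of effects, because each animal's indirect effect appears once in the phenotype of each of its n-1 group mates.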

Optimal group composition
In this section, the standard error (s.e.) of $\hat{\sigma}^2_{A_T}$ is derived for three experimental designs that differ with respect to group composition, i.e. group members belonged to either one, two or $n$ families. The s.e. of an estimate of the genetic variance depends on the between-family variance ($\sigma^2_b$) and within-family variance ($\sigma^2_w$), the relatedness within a family ($r$), the number of families ($N$), and the number of records per family ($m$) [18], as expressed in Equation (7). Analysis of variance was used to derive $\sigma^2_b$ and $\sigma^2_w$ for each design (see Appendix B for derivation).
The s.e. of $\hat{\sigma}^2_{A_T}$ differs between experimental designs because the group composition changes the within-family variance and the number of records per family (Table 2). On the one hand, the within-family variance decreases when the number of families per group decreases, causing a strong decrease in s.e. On the other hand, the number of records per family decreases when the number of families per group decreases, causing a slight increase in s.e. Overall, to obtain the most accurate estimate of $\sigma^2_{A_T}$, group members should belong to the same family. The only exception is when family size ($o$) equals group size ($n$). In this case, there is only one record per family and $\sigma^2_{A_T}$ would not be estimable.
Ideally, group members should be full sibs rather than half sibs, since an increase in relatedness causes a decrease in the s.e. of $\hat{\sigma}^2_{A_T}$.

Simulation
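To validate the theoretical derivations, a simulation study was conducted in R v2.12.2 [19]. A base population of 500 sires and 500 dams was simulated. Each animal in the base population was assigned a direct and indirect breeding value, drawn from

$$N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \sigma^2_{A_D} & \sigma_{A_{DI}} \\ \sigma_{A_{DI}} & \sigma^2_{A_I} \end{bmatrix} \right).$$

The $\sigma^2_{A_D}$ and $\sigma^2_{A_I}$ were set to 1.00, and $\sigma_{A_{DI}}$ was set to −0.50, 0.00 or 0.50. Each sire was randomly mated to a single dam, resulting in 12 offspring per mating for a total of 6000 simulated offspring. For each offspring, direct and indirect breeding values were obtained as:

$$A_{D,i} = \tfrac{1}{2} \left( A_{D,sire} + A_{D,dam} \right) + MS_{D,i}, \qquad A_{I,i} = \tfrac{1}{2} \left( A_{I,sire} + A_{I,dam} \right) + MS_{I,i},$$

where the direct and indirect Mendelian sampling terms ($MS_D$ and $MS_I$) were drawn from

$$N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \tfrac{1}{2} \begin{bmatrix} \sigma^2_{A_D} & \sigma_{A_{DI}} \\ \sigma_{A_{DI}} & \sigma^2_{A_I} \end{bmatrix} \right).$$

Each offspring was also assigned a direct and indirect environmental value, drawn from

$$N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \sigma^2_{E_D} & \sigma_{E_{DI}} \\ \sigma_{E_{DI}} & \sigma^2_{E_I} \end{bmatrix} \right).$$

The $\sigma^2_{E_D}$ and $\sigma^2_{E_I}$ were set to 2.00, and $\sigma_{E_{DI}}$ was set to −1.00, 0.00 or 1.00. Animals were placed in groups of four. Depending on the scenario, group members belonged to one, two or four families. Individual phenotypes were obtained by summing the direct and indirect genetic and environmental components according to Equation (1). Pooled records were obtained by summing individual phenotypes according to Equation (3). Seven scenarios were simulated, which differed in $\sigma_{A_{DI}}$, $\sigma_{E_{DI}}$ or group composition (Table 3). For each scenario, 100 replicates were produced.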
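The simulation design described above can be sketched as follows. This is an illustrative Python/NumPy reimplementation, not the original R code: the function name `simulate_pooled`, the Mendelian sampling variance of half the genetic (co)variance matrix (the usual choice for non-inbred parents), and the way animals are allocated to groups are all assumptions. Only the one-family and four-family compositions are covered.

```python
import numpy as np

def simulate_pooled(n_pairs=500, n_off=12, n=4, cov_a=0.0, cov_e=0.0,
                    same_family=False, seed=0):
    """One replicate: returns breeding values, environments, family ids, groups, pooled records."""
    rng = np.random.default_rng(seed)
    G = np.array([[1.0, cov_a], [cov_a, 1.0]])  # direct/indirect genetic (co)variances
    R = np.array([[2.0, cov_e], [cov_e, 2.0]])  # direct/indirect environmental (co)variances
    zero = np.zeros(2)

    # Base population: one dam per sire
    sires = rng.multivariate_normal(zero, G, size=n_pairs)
    dams = rng.multivariate_normal(zero, G, size=n_pairs)

    # Offspring = parent average + Mendelian sampling term (assumed variance G / 2)
    n_total = n_pairs * n_off
    parent_avg = np.repeat(0.5 * (sires + dams), n_off, axis=0)
    bv = parent_avg + rng.multivariate_normal(zero, 0.5 * G, size=n_total)
    env = rng.multivariate_normal(zero, R, size=n_total)
    family = np.repeat(np.arange(n_pairs), n_off)

    # Assumed allocation: either n sibs per group, or one animal from each of n families
    ids = np.arange(n_total)
    if same_family:
        groups = ids.reshape(-1, n)
    else:
        groups = ids.reshape(-1, n, n_off).transpose(0, 2, 1).reshape(-1, n)

    # Equation (1): own direct effects plus indirect effects of the n-1 group mates
    direct = bv[:, 0] + env[:, 0]
    indirect = (bv[:, 1] + env[:, 1])[groups]
    phenotypes = direct[groups] + indirect.sum(axis=1, keepdims=True) - indirect

    # Equation (3): pooled record per group
    pooled = phenotypes.sum(axis=1)
    return bv, env, family, groups, pooled

bv, env, family, groups, pooled = simulate_pooled(cov_a=-0.5, cov_e=1.0)
print(groups.shape, pooled.shape)
```

With the default settings this produces 1500 pooled records from 6000 offspring. The variance component estimation itself (ASReml in the study) is outside the scope of this sketch.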
Based on the previous section, expectations are that the use of a direct-indirect animal model for pooled data will fail to differentiate between direct and indirect genetic effects, while the use of a traditional animal model for pooled data will yield estimates of $\sigma^2_{A_T}$. To validate these theoretical predictions, both models were run. First, the simulated pooled records were analysed with the following direct-indirect animal model in ASReml v3.0 [20]:

$$\mathbf{y}^* = \boldsymbol{\mu}^* + \mathbf{Z}^*_D \mathbf{a}_D + \mathbf{Z}^*_I \mathbf{a}_I + \mathbf{e}^*,$$

where $\mathbf{y}^*$ is a vector that contains pooled records ($P^*$); $\boldsymbol{\mu}^*$ is a vector that contains the pooled mean; $\mathbf{Z}^*_D$ is an incidence matrix linking the pooled records to $A_D$'s (each pooled record was linked to the $A_D$'s of the four group members); $\mathbf{a}_D$ is a vector that contains $A_D$'s; $\mathbf{Z}^*_I$ is an incidence matrix linking the pooled records to $A_I$'s (each pooled record was linked to the $A_I$'s of the four group members); $\mathbf{a}_I$ is a vector that contains $A_I$'s; and $\mathbf{e}^*$ is a vector that contains residuals. Second, the simulated pooled records were analysed with the following traditional animal model in ASReml v3.0 [20]:

$$\mathbf{y}^* = \boldsymbol{\mu}^* + \mathbf{Z}^* \mathbf{a} + \mathbf{e}^*,$$

where $\mathbf{y}^*$, $\boldsymbol{\mu}^*$ and $\mathbf{e}^*$ are as explained above; $\mathbf{Z}^*$ is an incidence matrix linking the pooled records to $A$'s (each pooled record was linked to the $A$'s of the four group members); and $\mathbf{a}$ is a vector that contains $A$'s. Based on the previous section, expectations are that the most accurate estimate of $\sigma^2_{A_T}$ will be obtained when group members belong to the same family. To validate this theoretical prediction, the predicted s.e. of $\hat{\sigma}^2_{A_T}$ was compared to (i) the standard deviation (s.d.) of 100 estimates of $\sigma^2_{A_T}$ ($\hat{\sigma}^2_{A_T}$'s reported by ASReml) and (ii) the mean of 100 s.e.'s of $\hat{\sigma}^2_{A_T}$ (s.e.'s reported by ASReml) for three group compositions (scenarios 1, 6 and 7 of Table 3).

Table 2 Within-family variance ($\sigma^2_w$) and number of records per family ($m$) for three group compositions

Data analyses
The dataset was part of the pre-existing database of Hendrix Genetics (The Netherlands) and contained routinely collected data for breeding value estimation. Animal Care and Use Committee approval was therefore not required.
To validate the theoretical derivations and to gain insight into response to selection, individual and pooled data on survival in purebred laying hens (Gallus gallus) were analysed. Survival in group-housed laying hens is a well-known example of a trait affected by social interactions, since a bird's chance to survive depends on the feather pecking and cannibalistic behaviour of its group mates. Ellen et al. [5] used individual survival data on three purebred lines to estimate direct and indirect genetic (co)variances. Large and statistically significant indirect genetic effects were found in two out of three purebred lines. In the current study, we used data from the same two lines. Data were provided by the "Institut de Sélection Animale B.V.", the layer breeding division of Hendrix Genetics. Data on 13 192 White Leghorn layers were provided of which 6276 were of line W1 and 6916 were of line WB.
At the age of 17 weeks, the hens were placed in two laying houses. The laying houses consisted of four or five double rows, and each row consisted of three levels. Interaction with neighbours on the back of the cage was possible, but interaction with neighbours on the side was prevented. Four hens of the same purebred line were randomly assigned to each cage. Hens were not beak-trimmed. Further details on housing conditions and management are in Ellen et al. [5].
The individual phenotype was defined as the number of days from the start of the laying period until either death or the end of the experiment, with a maximum of 398 days. The individual phenotypes were summed per cage to obtain pooled records. If one individual phenotype was missing, the entire cage was omitted from the analysis. The final dataset contained records on 6092 W1 and 6852 WB hens.
To obtain the direct, indirect and total genetic parameters for survival time, the individual phenotypes were analysed with the following direct-indirect animal model in ASReml v3.0 [20]:

$$\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{Z}_D \mathbf{a}_D + \mathbf{Z}_I \mathbf{a}_I + \mathbf{V}\,\mathbf{cage} + \mathbf{e},$$

where $\mathbf{y}$ is a vector that contains individual phenotypes; $\mathbf{X}$ is an incidence matrix linking the individual phenotypes to fixed effects; $\mathbf{b}$ is a vector that contains fixed effects, which included an interaction term for each laying house by row by level combination, an effect for the content of the back cage (full/empty) and a covariate for the average number of survival days in the back cage; $\mathbf{Z}_D$ is an incidence matrix linking the individual phenotypes to $A_D$'s; $\mathbf{a}_D$ is a vector that contains $A_D$'s; $\mathbf{Z}_I$ is an incidence matrix linking the individual phenotypes to $A_I$'s; $\mathbf{a}_I$ is a vector that contains $A_I$'s; $\mathbf{V}$ is an incidence matrix linking the individual phenotypes to random cage effects; $\mathbf{cage}$ is a vector that contains random cage effects (to account for the non-genetic covariance among phenotypes of cage members [21]); and $\mathbf{e}$ is a vector that contains residuals. This model yields estimates of $\sigma^2_{A_D}$, $\sigma_{A_{DI}}$ and $\sigma^2_{A_I}$, from which $\hat{\sigma}^2_{A_T}$ can be calculated. Similarly, it yields estimates of $A_D$'s and $A_I$'s, from which $\hat{A}_T$'s can be calculated. To improve a trait, animals should be selected based on their $\hat{A}_T$, since $\sigma^2_{A_T}$ determines a population's potential to respond to selection.
Alternatively, a traditional animal model can be used to analyse individual or pooled data. A traditional animal model on individual data only yields estimates of $\sigma^2_{A_D}$ and $A_D$'s. A traditional model on pooled data is expected to yield estimates of $\sigma^2_{A_T}$ and $A_T$'s, but not of $\sigma^2_{A_D}$ and $A_D$'s. To validate this theoretical prediction, these traditional models were also run. First, the individual phenotypes were analysed with the following traditional (direct) animal model in ASReml v3.0 [20]:

$$\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{Z}_D \mathbf{a}_D + \mathbf{V}\,\mathbf{cage} + \mathbf{e},$$

where $\mathbf{y}$, $\mathbf{X}$, $\mathbf{b}$, $\mathbf{Z}_D$, $\mathbf{a}_D$, $\mathbf{V}$, $\mathbf{cage}$ and $\mathbf{e}$ are as explained above. Second, the pooled records were analysed with the following traditional animal model in ASReml v3.0 [20]:

$$\mathbf{y}^* = \mathbf{X}^* \mathbf{b}^* + \mathbf{Z}^* \mathbf{a} + \mathbf{e}^*,$$

where $\mathbf{y}^*$ is a vector that contains pooled records ($P^*$); $\mathbf{X}^*$ is an incidence matrix linking the pooled records to fixed effects; $\mathbf{b}^*$ is a vector that contains fixed effects (the same fixed effects as mentioned above); $\mathbf{Z}^*$ is an incidence matrix linking the pooled records to $A$'s (each pooled record was linked to the $A$'s of the four group members); $\mathbf{a}$ is a vector that contains $A$'s; and $\mathbf{e}^*$ is a vector that contains residuals. The estimated variance components and breeding values of all three models were compared. In addition, we calculated the loss in response to selection that would occur when applying a traditional model to individual or pooled data instead of a direct-indirect model to individual data. The direct-indirect model applied to individual data yielded estimates of $\sigma^2_{A_T}$ and $A_T$'s. Based on their $\hat{A}_T$, 250 animals were selected and the corresponding response to selection was calculated. Similarly, for the two traditional animal models, 250 animals were selected based on their $\hat{A}_D$ (obtained from individual data) or $\hat{A}$ (obtained from pooled data). Once the top 250 animals were selected, their $\hat{A}_T$ (obtained from individual data) was used to calculate the total response to selection. Then, the loss in total response to selection was calculated.
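The comparison of selection criteria described above can be sketched as follows. This is a hedged illustration in Python with NumPy: the function name `selection_loss` is hypothetical, and it assumes that the response is measured as the mean $\hat{A}_T$ of the selected animals expressed as a deviation from the mean of all candidates, which the text does not state explicitly. The example data are synthetic.

```python
import numpy as np

def selection_loss(criterion, a_t_hat, n_selected=250):
    """Relative loss in total response when ranking on `criterion` instead of `a_t_hat`."""
    criterion = np.asarray(criterion, dtype=float)
    a_t_hat = np.asarray(a_t_hat, dtype=float)

    # Indices of the top animals under each ranking (highest values first)
    top_by_criterion = np.argsort(criterion)[::-1][:n_selected]
    top_by_total = np.argsort(a_t_hat)[::-1][:n_selected]

    # Assumed definition of response: mean A_T-hat of selected animals minus candidate mean
    base = a_t_hat.mean()
    response_alt = a_t_hat[top_by_criterion].mean() - base   # selecting on the alternative criterion
    response_ref = a_t_hat[top_by_total].mean() - base       # selecting on A_T-hat itself
    return 1.0 - response_alt / response_ref

# Synthetic example: a criterion that tracks A_T-hat closely and one that tracks it poorly
rng = np.random.default_rng(0)
a_t_hat = rng.normal(size=6000)
close_criterion = a_t_hat + 0.1 * rng.normal(size=6000)
poor_criterion = a_t_hat + 2.0 * rng.normal(size=6000)
print(selection_loss(close_criterion, a_t_hat), selection_loss(poor_criterion, a_t_hat))
```

Under this definition the loss is zero when the criterion ranks the top animals exactly as $\hat{A}_T$ does, and it grows as the two rankings diverge.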

Simulation
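The direct-indirect animal model on pooled records failed to converge, confirming that direct and indirect (co)variances cannot be estimated from pooled data. The traditional animal model on pooled records yielded estimates of $\sigma^2_A$ and $\sigma^2_{E^*}$. These estimates did not differ significantly from the true $\sigma^2_{A_T}$ and $\sigma^2_{E^*}$ (Table 4), where

$$\sigma^2_{A_T} = \sigma^2_{A_D} + 2(n-1)\sigma_{A_{DI}} + (n-1)^2 \sigma^2_{A_I} \tag{13}$$

(derived by [14]) and

$$\sigma^2_{E^*} = n \left[ \sigma^2_{E_D} + 2(n-1)\sigma_{E_{DI}} + (n-1)^2 \sigma^2_{E_I} \right] \tag{14}$$

(analogous to [17]).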
Based on Equation (7), the s.e. of $\hat{\sigma}^2_{A_T}$ was predicted for three scenarios that differed in group composition, i.e. group members belonged to one, two or four families. The theoretical s.e. of $\hat{\sigma}^2_{A_T}$ was compared to (i) the s.d. of 100 estimates of $\sigma^2_{A_T}$ ($\hat{\sigma}^2_{A_T}$'s reported by ASReml) and (ii) the mean of 100 s.e.'s of $\hat{\sigma}^2_{A_T}$ (s.e.'s reported by ASReml) (Table 5). The theoretical s.e. of $\hat{\sigma}^2_{A_T}$ did not differ significantly from the values obtained by simulation. Moreover, as predicted, the most accurate estimate of $\sigma^2_{A_T}$ was obtained when group members belonged to the same family. In comparison, the s.e. of $\hat{\sigma}^2_{A_T}$ was twice as large when group members belonged to different families. This indicates that group composition is crucial when aiming to obtain accurate estimates.

Table 4 True and estimated $\sigma^2_{A_T}$ and $\sigma^2_{E^*}$ for five scenarios. $\sigma^2_{A_D}$ and $\sigma^2_{A_I}$ were set to 1.00; $\sigma^2_{E_D}$ and $\sigma^2_{E_I}$ were set to 2.00; group members belonged to four different families.

Table 5 Theoretically predicted s.e.($\hat{\sigma}^2_{A_T}$), s.d.($\hat{\sigma}^2_{A_T}$) and mean s.e.($\hat{\sigma}^2_{A_T}$) for three group compositions. The s.d. is based on 100 estimates reported by ASReml; the mean s.e. is based on 100 s.e.'s reported by ASReml. $\sigma^2_{A_D}$ and $\sigma^2_{A_I}$ were set to 1.00; $\sigma_{A_{DI}}$ was set to 0.00; $\sigma^2_{E_D}$ and $\sigma^2_{E_I}$ were set to 2.00; $\sigma_{E_{DI}}$ was set to 0.00.

Data analyses
Table 6 shows the estimated variance components for individual survival data analysed with a direct-indirect animal model, and the estimated variance components for individual and pooled survival data analysed with a traditional animal model. The direct-indirect animal model on individual data yielded estimates of $\sigma^2_{A_D}$, $\sigma_{A_{DI}}$ and $\sigma^2_{A_I}$. Based on these components, $\hat{\sigma}^2_{A_T}$ was calculated (according to Equation (13)). The traditional animal model on individual data yielded estimates of $\sigma^2_{A_D}$. The traditional animal model on pooled data yielded estimates of $\sigma^2_A$ that closely resembled the estimates of $\sigma^2_{A_T}$ from individual data. The direct-indirect animal model on individual data also yielded estimates of $\sigma^2_{Cage}$ and $\sigma^2_E$. As derived by Bergsma et al. [21], $\hat{\sigma}^2_{Cage}$ is an estimate of $2\sigma_{E_{DI}} + (n-2)\sigma^2_{E_I}$. As derived by Bijma [22], $\hat{\sigma}^2_E$ is an estimate of $\sigma^2_{E_D} - 2\sigma_{E_{DI}} + \sigma^2_{E_I}$. As shown in Equation (14), $\sigma^2_{E^*}$ is a function of the same three environmental components. Consequently, the $\hat{\sigma}^2_{Cage}$ and $\hat{\sigma}^2_E$ from the direct-indirect animal model on individual data should sum to the $\hat{\sigma}^2_{E^*}$ from the traditional animal model on pooled data. More precisely:

$$\hat{\sigma}^2_{E^*} = n \hat{\sigma}^2_E + n^2 \hat{\sigma}^2_{Cage}. \tag{15}$$

The expected $\hat{\sigma}^2_{E^*}$, calculated based on the $\hat{\sigma}^2_{Cage}$ and $\hat{\sigma}^2_E$ from the direct-indirect animal model on individual data, and the $\hat{\sigma}^2_{E^*}$ from the traditional animal model on pooled data closely resembled each other. Table 6 does not show heritability estimates. Where the classical heritability ($h^2$) is used to express $\sigma^2_{A_D}$ relative to the phenotypic variance ($\sigma^2_P$), $T^2$ is used to express $\sigma^2_{A_T}$ relative to $\sigma^2_P$ [21]. Comparing values of $T^2$ obtained from individual and pooled data would be misleading because they are not expected to be similar. Unlike for a trait that is not affected by social interactions, $\sigma^2_{P^*}$ cannot simply be divided by the number of group members to obtain $\sigma^2_P$. When group members are unrelated, it follows from Equation (5) that

$$\sigma^2_{P^*} = n \sigma^2_{A_T} + \sigma^2_{E^*}.$$

The non-proportional increase of $\sigma^2_P$ does not enable a meaningful comparison between values of $T^2$ obtained from individual and pooled data.
In conclusion, when group members are unrelated, a traditional animal model on individual data yields estimates of $\sigma^2_{A_D}$, while a traditional animal model on pooled data yields estimates of $\sigma^2_{A_T}$. Moreover, the estimated cage and error variances from a direct-indirect animal model on individual data sum to the pooled error variance from a traditional animal model on pooled data. This result could explain the 'inconsistencies' found by Biscarini et al. [17], who assumed that a traditional animal model on individual and pooled data should yield the same genetic variance. Moreover, Biscarini et al. [17] expected to find a pooled error variance that is four times larger than the individual error variance. For body weight at the age of 19 and 27 weeks, these expectations were met. For body weight at the age of 43 and 51 weeks, however, the genetic variance estimated from pooled data was smaller than expected, while the pooled error variance was larger than expected. Biscarini et al. [17] mention the emergence of competition effects as a possible cause. We would indeed expect to find indirect genetic effects if the individual data on body weight at the age of 43 and 51 weeks were reanalysed with a direct-indirect animal model. Using Equations (13) and (15), the estimated variance components from individual data would then be expected to resemble the estimated variance components from pooled data.
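The statement that the cage and error variances from individual data sum to the pooled error variance can be verified term by term. The derivation below is a sketch that takes as given the expectations of $\hat{\sigma}^2_{Cage}$ and $\hat{\sigma}^2_E$ quoted from [21] and [22], and the form of $\sigma^2_{E^*}$ implied by Equation (5) when the environmental effects of different group members are independent (an assumption of this sketch).

```latex
\begin{aligned}
n\,\sigma^2_E + n^2\,\sigma^2_{Cage}
  &= n\left[\sigma^2_{E_D} - 2\sigma_{E_{DI}} + \sigma^2_{E_I}\right]
   + n^2\left[2\sigma_{E_{DI}} + (n-2)\,\sigma^2_{E_I}\right] \\
  &= n\left[\sigma^2_{E_D} + (2n-2)\,\sigma_{E_{DI}} + (n^2-2n+1)\,\sigma^2_{E_I}\right] \\
  &= n\left[\sigma^2_{E_D} + 2(n-1)\,\sigma_{E_{DI}} + (n-1)^2\,\sigma^2_{E_I}\right]
   = \sigma^2_{E^*}.
\end{aligned}
```

As a worked example with the simulation settings ($n = 4$, $\sigma^2_{E_D} = \sigma^2_{E_I} = 2.00$, $\sigma_{E_{DI}} = 0.00$): $\sigma^2_{Cage} = 4$ and $\sigma^2_E = 4$, so $n\sigma^2_E + n^2\sigma^2_{Cage} = 16 + 64 = 80$, which equals $\sigma^2_{E^*} = 4\,[2 + 0 + 9 \times 2] = 80$.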
The regression coefficients of $\hat{A}_D$'s obtained from individual data on the $\hat{A}$'s obtained from pooled data strongly deviated from one (0.363 ± 0.006 for W1; 0.392 ± 0.010 for WB). The regression coefficients of $\hat{A}_T$'s obtained from individual data on the $\hat{A}$'s obtained from pooled data were close to, and not significantly different from, one (1.004 ± 0.003 for W1; 1.001 ± 0.001 for WB). This indicates that the $\hat{A}$'s obtained from pooled data are unbiased estimates of the $\hat{A}_T$'s obtained from individual data. Table 7 shows Spearman correlation coefficients between $\hat{A}_D$'s and $\hat{A}_T$'s obtained from individual data and the $\hat{A}$'s obtained from pooled data. The Spearman correlation coefficients between the $\hat{A}_T$'s obtained from individual data and the $\hat{A}$'s obtained from pooled data were close to, but significantly different from, one. This indicates only a minor loss in the accuracy of $\hat{A}_T$'s when using pooled instead of individual data, which will be reflected in a minor loss in response to selection when using pooled instead of individual data.
To gain more insight, we calculated the loss in response to selection that occurs when applying a traditional model to individual or pooled data instead of a direct-indirect model to individual data. When applying a traditional model to individual data, the loss in total response to selection was 46.9% for W1 (Figure 1A) and 54.9% for WB (Figure 1C). When applying a traditional model to pooled data, the loss in total response to selection was 3.3% for W1 (Figure 1B) and 0.3% for WB (Figure 1D). In conclusion, the loss in total response to selection will be large when using a traditional animal model on individual data, but will be small when using a traditional animal model on pooled data. However, this outcome may be specific to this dataset. Survival in purebred laying hens was recorded in cages with four unrelated birds. Both direct and indirect genetic effects strongly influenced the trait. Group size, group composition, and the relative impact of direct and indirect genetic effects might influence the loss in total response to selection. For example, for body weight at 19 and 27 weeks of age, indirect genetic effects are expected to be small. In that case, an animal's $A_T$ is mainly expressed in the phenotype of the animal itself. Consequently, we expect that more accurate estimated breeding values can be obtained when using individual instead of pooled data. Biscarini et al. [17] found a correlation of approximately 0.75 between the estimated breeding values based on individual and pooled data, resulting in a large loss in response to selection when using pooled instead of individual data. Thus, pooled data are not always a suitable alternative to individual data, and this requires further research.

Conclusions
Using pooled data, the total genetic variance and breeding values can be estimated, but the underlying direct and indirect genetic (co)variances and breeding values cannot. The most accurate estimates are obtained when group members belong to the same family. While quantifying the direct and indirect genetic effects is interesting from a biological perspective, obtaining the total genetic effect is most important from an animal breeding perspective. When it is too difficult or expensive to obtain individual data, pooled data can be used to improve traits.

Appendix A
This section demonstrates why direct and indirect (co)variances can be estimated from individual data, but cannot be estimated from pooled data.
Consider a situation where four base parents produce six offspring. Animals are kept in groups of two and individual phenotypes are recorded on all six offspring (Table 8).
When analysing individual data with a direct-indirect animal model, the resulting Z-matrices, $\mathbf{Z}_D$ and $\mathbf{Z}_I$, are not identical, indicating that the direct and indirect genetic effects are estimated based on different information sources, enabling the model to distinguish between these two effects.

Figure 1 $\hat{A}_T$'s obtained from individual data plotted against $\hat{A}_D$'s obtained from individual data and $\hat{A}$'s obtained from pooled data on survival in laying hens. A and B for data on W1 hens. C and D for data on WB hens. $\Delta G_1$ represents the total response to selection when selecting animals based on their $\hat{A}_D$ obtained from individual data or $\hat{A}$ obtained from pooled data. $\Delta G_2$ represents the total response to selection when selecting animals based on their $\hat{A}_T$ obtained from individual data.
When analysing pooled data with a direct-indirect animal model, the resulting Z-matrices, $\mathbf{Z}^*_D$ and $\mathbf{Z}^*_I$, are identical, indicating that the direct and indirect genetic effects are estimated based on the same information source, causing complete confounding between direct and indirect genetic effects. The model will not be able to distinguish between these two effects.
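The confounding can be illustrated by building the incidence matrices explicitly. The sketch below uses Python with NumPy and a hypothetical layout (the group allocation of Table 8 is assumed here, not reproduced): animals 0 to 3 are the base parents and animals 4 to 9 are the six offspring, kept in three groups of two.

```python
import numpy as np

n_animals = 10
offspring = np.arange(4, 10)
groups = offspring.reshape(3, 2)  # assumed allocation: (4,5), (6,7), (8,9)

# Individual data: one record per offspring (rows), one column per animal
Z_D = np.zeros((6, n_animals))
Z_I = np.zeros((6, n_animals))
for a, b in groups:
    Z_D[a - 4, a] = 1  # own direct effect
    Z_D[b - 4, b] = 1
    Z_I[a - 4, b] = 1  # indirect effect of the group mate
    Z_I[b - 4, a] = 1

# Pooling: S sums the individual records of each group into one pooled record
S = np.zeros((3, 6))
for g, members in enumerate(groups):
    S[g, members - 4] = 1

Zp_D = S @ Z_D  # links pooled records to direct effects
Zp_I = S @ Z_I  # links pooled records to indirect effects

print(np.array_equal(Z_D, Z_I))    # individual data: matrices differ
print(np.array_equal(Zp_D, Zp_I))  # pooled data: matrices coincide
```

With groups of two, each indirect effect is expressed once, so the pooled matrices are exactly equal. For larger groups built the same way, the pooled indirect matrix is $(n-1)$ times the pooled direct matrix, which leaves the two effects equally inseparable.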