The use of communal rearing of families and DNA pooling in aquaculture genomic selection schemes

Background Traditional family-based aquaculture breeding programs, in which families are kept separately until individual tagging and most traits are measured on the sibs of the candidates, are costly and require a high level of reproductive control. The most widely used alternative is a selection scheme, where families are reared communally and the candidates are selected based on their own individual measurements of the traits under selection. However, in the latter selection schemes, inclusion of new traits depends on the availability of non-invasive techniques to measure the traits on selection candidates. This is a severe limitation of these schemes, especially for disease resistance and fillet quality traits. Methods Here, we present a new selection scheme, which was validated using computer simulations comprising 100 families, among which 1, 10 or 100 were reared communally in groups. Pooling of the DNA from 2000, 20000 or 50000 test individuals with the highest and lowest phenotypes was used to estimate 500, 5000 or 10000 marker effects. One thousand or 2000 out of 20000 candidates were preselected for a growth-like trait. These pre-selected candidates were genotyped, and they were selected on their genome-wide breeding values for a trait that could not be measured on the candidates. Results A high accuracy of selection, i.e. 0.60-0.88 was obtained with 20000-50000 test individuals but it was reduced when only 2000 test individuals were used. This shows the importance of having large numbers of phenotypic records to accurately estimate marker effects. The accuracy of selection decreased with increasing numbers of families per group. Conclusions This new selection scheme combines communal rearing of families, pre-selection of candidates, DNA pooling and genomic selection and makes multi-trait selection possible in aquaculture selection schemes without keeping families separately until individual tagging is possible. The new scheme can also be used for other farmed species, for which the cost of genotyping test individuals may be high, e.g. if trait heritability is low.


Background
Traditional family-based aquaculture breeding programs, in which families are kept separately until individual tagging and most traits are measured on the sibs of the candidates, are costly and require a high level of reproductive control, e.g. through stripping of the parents [1]. Therefore, alternatives to the above traditional family-based breeding programs are often used in aquaculture breeding schemes. The most widely used alternative is a selection scheme, in which families are reared communally and the candidates are selected based on their own individual measurements of the traits under selection. However, in the latter selection schemes, inclusion of additional traits depends on the availability of non-invasive techniques to measure the traits, such as the Torry Fat meter [2] to measure fat content, since family information is not available. This is a severe limitation of these schemes.
In genomic selection schemes [3], large numbers of (SNP) markers can be used instead of pedigree information and thus family-based selection schemes as in [4,5] are not needed. However, in aquaculture breeding there are many thousands of selection candidates and test individuals, which make genotyping costs high even if the genotyping costs per individual are low.
The aim of this paper is to develop a new selection scheme that combines communal rearing of families, pre-selection of candidates, DNA pooling and genomic selection and makes multi-trait selection possible in aquaculture selection schemes without keeping families separately until individual tagging is possible. We compare the effects of different designs on accuracy of selection, genetic gain and rates of inbreeding using computer simulations.

Simulation of the starting population
A population with an effective population size (N e ) of 1000 was simulated for 4000 generations according to the Fisher-Wright population model [6,7]. Five hundred males and 500 females were randomly selected and mated using sampling with replacement. From the last of these 4000 generations, generation zero (G0) of the selection population of the breeding scheme was obtained.

Simulation of the breeding scheme in generations G0-G5
For generations G0-G5, the selection population was simulated as follows. One hundred sires and 100 dams (Nfam = 100) were randomly split into groups with Nfampergroup families per group (Nfampergroup = 1, 10 or 100, the latter resulting in all individuals being in one group). There was also one scheme with Nfam = 50 and Nfampergroup = 10. Each sire was randomly mated to one dam and vice versa, using sampling without replacement. Each mating resulted in a family that was split into one group of (Ncand/Nfam) selection candidates and a second group of (Ntest/Nfam) test individuals (Ncand = 20000 and Ntest = 2000, 20000 or 50000). Hence, family sizes were (Ncand+Ntest)/Nfam offspring with equal numbers of males and females. Every selection candidate family was grouped with (Nfampergroup -1) other randomly chosen families. Similarly, every family of test individuals was grouped with the test individuals from the same (Nfampergroup -1) other families as were the selection candidates, i.e. the same families were grouped together as test individuals and selection candidates. Strictly, it will not be necessary to group the candidate families separately, as classical parentage testing can be done using the same markers used to estimate the effects of the traits.
Two traits were considered: GROWTH, a trait measured on the Ncand selection candidates; and SIB_TRAIT, a trait that is measured on Ntest test individuals (sibs of the candidates), which were sacrificed to record the SIB_TRAIT.
The Ncand selection candidates were mass-selected across all families for their GROWTH phenotype. A total of Npresel candidates passed this preselection step, and (Ncand-Npresel) individuals were culled, Npresel being 1000 or 2000.
The test individuals were recorded for the SIB_TRAIT. Within each group, the 50% highest SIB_TRAIT individuals were sorted into the H-pool and the 50% lowest into the L-pool. DNA of the H-pool was extracted, pooled and genotyped. Similarly the L-pool's DNA was extracted, pooled and genotyped, which resulted in estimates of the within-pool frequencies of the marker alleles. These frequency estimates were assumed to contain no errors here. Marker effects were estimated and used to estimate the genome-wide breeding values (GEBV) for the SIB_TRAIT of the Npresel selection candidates (see Calculation of phenotypic values and true and estimated genome-wide breeding values). Nfam sires and Nfam dams were selected across families and groups from these preselected selection candidates using truncation selection for the SIB_TRAIT GEBV.

Genome
Creation of the genomes of the population was as described in [4]. Briefly, the genome structure of individuals was diploid with 10 chromosomes 100 cM long. The infinite sites mutation model [8] was used to create new bi-allelic SNP, using a mutation rate of 10 -9 per nucleotide and assuming the number of nucleotides per cM to be 1000000. Inheritance of the SNP followed Mendel's law and the Haldane mapping function [9] was used to simulate recombinations. For each trait 50 SNP per chromosome were sampled randomly to be QTL (sampling without replacement from the SNP with minor allele frequency (MAF) >0.05). From the remaining SNP, 1000 with the highest MAF were chosen as genetic markers. This resulted in a total of 10000 markers spread over 1000 cM. Reduced numbers of markers were obtained by selecting every 10 th and 20 th marker, resulting in a number of markers, Nmarkers = 1000 and 500 markers, respectively. The reduced marker sets either reflected a situation where few markers are known or where genotyping costs are reduced by genotyping few markers.
Effects of the QTL alleles were sampled from the gamma distribution with a shape parameter of 0.4 and a scale parameter of 1.66 [10]. There were no pleiotropic QTL effects, and no genetic or environmental correlation between the two traits. The QTL effects were assumed to be either positive or negative with a probability of 0.5, because the gamma distribution only gives positive values. After sampling, these QTL allelic effects were standardized so that the total genetic variance was 1 for each trait.

Calculation of phenotypic values and true and estimated genome-wide breeding values
The true genome-wide breeding value of an individual for t = GROWTH and t = SIB_TRAIT was calculated as: where x ijk is the number of copies that individual i has at the j th QTL position and k th QTL allele, and g jk (t) is the effect of the k th QTL allele at the j th position. The phenotypic value of the individuals for trait t was simulated by adding an error term sampled from a normal distribution to the true breeding value (TBV i (t)): where ε i (t) is an error term for animal i, which was normally distributed N(0, s 2 e ) and s 2 e was adjusted so that the heritability was 0.4 for GROWTH and 0.1 or 0.4 for SIB_TRAIT.
The statistical model used to estimate the marker effects on SIB_TRAIT was the BLUP of marker effects method [3], using the mixed model equations: where a is a vector of the estimated SNP effects; X is a matrix of SNP genotypes, where element X ij equals the standardised genotype of individual i for SNP j, i.e. X ij is -2p j ,/√H, (1-2p j )/√H or 2(1-p j )/√H for genotypes '11', '12', or '22', respectively, where H is heterozygosity (H = 2p j (1-p j )) and p j is allele frequency at locus j; l is the variance ratio of the error variance to the SNP variance, which is the genetic variance divided by the number of SNP in the genome; y i is the phenotype of individual i, which is 1 (0) if i belongs to H (L)-pool. Thus, at this stage the phenotype is assumed binary, either because it is truly binary or because a continuous variable is split into two classes. Each pool (H, L) contains 50% of the individuals.
Since the test individuals are not individually genotyped, X ij is unknown, but X'X is expected to equal the (co)variance matrix of SNP genotypes (X ij ) times the number of individuals (n). Here, the covariance matrix of the SNP genotypes will be estimated from the individually genotyped selection candidates instead of from the test individuals, i.e. element (j,k) of this matrix is calculated by Cov(X ij ,X ik ), where X ij is the standardised genotype of the i-th selection candidate.
Also X'y cannot be calculated because the test individuals are not individually genotyped. X'y is expected to equal the covariance between genotypes (X ij ) and phenotypes times n. The following regression equation will be used to estimate the covariance between the genotypes and phenotypes: where Δx j is the average difference in allele frequency for SNP j between the individuals with 'y = 1' and those with 'y = 0'; b xj on y is the regression of the SNP genotype on the phenotype; and Δy is the difference in phenotype, which is 1. Since the variance of y is 0.25 (50% of the y's are 1), the above regression equation reduces to: and thus Cov(X ij ;y i ) is estimated by 0.25* Δx j where Δx j is recorded by the pooled genotyping of the 'y i = 1' individuals and the 'y i = 0' individuals. In conclusion, X'X is estimated by n*Cov(X ij ,X ik ) and X'y is estimated by n*Cov(X ij ,y i ), which are needed for Equation [1], and Cov(X ij ,X ik ) and Cov(X ij ,y i ) are estimated from the genotypes of the selection candidates and from the pooled genotypes, respectively.
Estimated genome-wide breeding values for the selection candidates for SIB_TRAIT were obtained by summing the effects of the markers times the standardised genotypes times a regression coefficient to transform the GEBV from the binary data scale to the continuous data: where the regression coefficient b = Cov(Σ X ij a j ; TBV i )/var(Σ X ij a j ), TBV i is the true breeding value of individual i. The regression b was calculated here using the the TBV i from the simulation. In practice, another method needs to be devised to estimate b, e.g. by regressing the phenotypes onto the EBV. This will reduce the selection accuracy, and this reduction depends on the available number of records to estimate the regression coefficient b. The regression coefficient b also corrects for the fact that genomic selection EBV may be biased in the sense that their variance is too big relative to that of the TBV [3].
Equation [2] implicitly incorporates the group means into the GEBV by using the estimates of the marker effects. In situations, where we have many continuously recorded phenotypes per group, the group means are expected to be more accurately estimated by the mean of the phenotypes of the individuals within the group. In this case, estimated genome-wide breeding values for the selection candidates for SIB_TRAIT were obtained by summing the effects of the markers within the group and adding a group-mean: where μ p is the mean of the SIB_TRAIT-phenotypes of the individuals in group p to which individual i belongs; μ GEBV is the mean of the Σ X ij a j of all individuals in group p; and b is as in Equation [2].
In Equations [2] and [3], family means are implicitly estimated by the marker effects, as part of the total genetic effect. However, if Nfampergroup = 1, i.e. family means and group means coincide, the family means are estimated by the phenotypic averages of the group in Equation [3].
Selection of the candidates consisted of two steps: one pre-selection step, where selection was for GROWTH and one final selection step, where selection was for the SIB_TRAIT.
The accuracy of selection was calculated as the correlation between true and estimated breeding values among the pre-selected candidates for SIB_TRAIT (acc SIB_TRAIT ). Inbreeding coefficients (F) were calculated based on pedigree, assuming that the G0 individuals were unrelated base parents.

Statistics
Selection schemes were run for generations (G0-G5) and summary statistics for each of the schemes are based on 100 replicated simulations. The breeding schemes were compared for the accuracy of selection of the SIB_TRAIT (acc SIB_TRAIT ), rate of inbreeding per generation (ΔF) and genetic gain of the SIB_TRAIT (ΔG SIB_TRAIT ) and GROWTH (ΔG GROWTH ), expressed in genetic standard deviation units of generation G0 (s a )) in generation G5.

Effect of number of markers, families per group and test individuals
Overall, there was an increase in accuracy of selection of the SIB_TRAIT (acc SIB_TRAIT ) with an increasing number of markers especially when Nmarkers increased from 500 to 5000, but less so when it increased from 5000 to 10000 (Table 1). The acc SIB_TRAIT was lower with an increased number of families per group and the change in acc SIB_TRAIT was larger from Nfampergroup = 1 to Nfampergroup = 10 than from Nfampergroup = 10 to Nfampergroup = 100. With Nfampergroup = 1, the estimation of the family mean coincided with the estimation of the group mean such that the family mean was well estimated. With a higher number of families per group, only marker information was used to calculate family means (instead of phenotypic family means), which reduced acc SIB_TRAIT . This effect was larger with more families in the group.
With a lower number of test individuals, i.e. Ntest = 2000, acc SIB_TRAIT was much lower than with larger numbers of test individuals. With the largest numbers of markers, i.e. Nmarkers = 10000, acc SIB_TRAIT was only 0.664, 0.603 and 0.580, respectively, for Nfampergroup = 1, 10 and 100. The difference in acc SIB_TRAIT between Ntest = 20000 and 50000 was small. With Ntest = 50000 and Nmarkers = 10000, acc SIB_TRAIT was 0.877, 0.850 and 0.845, respectively for Nfampergroup = 1, 10 and 100, and thus depended little on Nfampergroup in this case, which indicates that family means were accurately estimated by the markers with such high numbers of test individuals. The latter scheme was the scheme with the overall highest acc SIB_TRAIT .
The genetic gain for the SIB_TRAIT (ΔG SIB_TRAIT ) corresponded well to the patterns of changes in acc SIB_TRAIT . The genetic gain for GROWTH (ΔG GROWTH ) did not vary much between the schemes, except that ΔG GROWTH was somewhat increased with Nfampergroup = 1 and low marker density.
Overall, rates of inbreeding (ΔF) did not differ much between the schemes except that there was a tendency for a higher ΔF with Nfampergroup = 1 than with 10 or 100. With Nfampergroup = 10 or 100, markers are used to estimate family means, which may result in reduced estimates of between-family differences, and thus relatively more within-family selection. There was also a small tendency for higher ΔF with Nmarkers = 500 than Nmarkers = 10000.

Effect of heritability of SIB_TRAIT
With a lower heritability of the SIB_TRAIT, i.e. 0.1, accuracy of selection was reduced, as expected (Table 2). However, acc SIB_TRAIT was still rather high with a large Ntest. For example, with Nfampergroup = 10 and Ntest = 20000 and 50000, acc SIB_TRAIT was 0.557 and 0.701, respectively, for Nmarkers = 500 only. The effect of heritability on acc SIB_TRAIT was smallest for the scheme with Ntest = 50000.
Overall, genetic gain for the SIB-TRAIT (ΔG SIB_TRAIT ) followed the pattern of changes of the accuracy of selection. The genetic gains for GROWTH (ΔG GROWTH ) were generally higher than in Table 1, which is probably due to the lower selection pressure on the SIB_TRAIT when the heritability is reduced. The reduced selection pressure for the SIB_TRAIT results in smaller allele frequency changes of QTL affecting the SIB_TRAIT and of linked positions in the genome. The reduced frequency changes/genetic drift at linked positions implies that the selection pressure for GROWTH results in more response for GROWTH. Rates of inbreeding (ΔF) were somewhat higher than with a higher heritability of the SIB-TRAIT, i.e. 0.4, but showed a similar pattern across the schemes. The ΔF is not much affected by the heritability of the SIB_TRAIT, because selection for the SIB_TRAIT is not based on phenotypes but on marker genotypes.

Effect of preselection and number of families
There was little difference in accuracy of selection with Npresel = 1000 or 2000 (Table 3). For Nmarkers = 500,   With Nfam = 50 instead of 100, acc SIB_TRAIT increased somewhat due to the larger full-sib family sizes, and was 0.694, 0.825 and 0.837 for Nmarkers = 500, 5000 and 10000.
ΔF was as expected much more increased with Nfam = 50 than with Nfam = 100. For example, with Nmarkers = 5000, ΔF increased from 0.010 to 0.020 when Nfam decreased from 100 to 50 (Table 3). Table 4 shows the results with the same parameters as in Table 1, but where group means of the selection candidates were estimated using genetic markers instead of phenotypic means. The latter may be necessary when common environmental group effects occur meaning that the phenotypic group means are not representative of the genetic mean of the group. In general, Table 4 shows an increasing trend for acc SIB_TRAIT with increasing Nmarkers, especially from Nmarkers = 500 to 5000. It also shows that the acc SIB_TRAIT was much lower with Nfampergroup = 1 than with Nfampergroup = 10 and 100, because the family effect cannot be well estimated by the markers since the group and family means are confounded in case of Nfampergroup = 1.

Group means estimated from genetic markers instead of phenotypes
The ΔF increased when Nfampergroup increased from 1 to 10, but not from 10 to 100, e.g. with Ntest = 20000 and Nmarkers = 5000, ΔF was 0.007 with Nfampergroup = 1 and 0.013 with Nfampergroup = 10. With Nfampergroup = 1, markers cannot estimate the family means, in which case selection is for within-family deviations as estimated by the markers, i.e. within-family selection, which is known to result in low rates of inbreeding.
When comparing Tables 1 and 4, acc SIB_TRAIT depends highly on Nfampergroup. If Nfampergroup = 1, acc SIB_-TRAIT was considerably lower when the family means were estimated by markers rather than by phenotypic values only, e.g. 0.610 (Table 4) compared to 0.838 (Table 1) with Ntest = 20000. If Nfampergroup = 10, acc SIB_TRAIT was only somewhat lower when family means were estimated using markers and if Nfam-pergroup=100, acc SIB_TRAIT was equal for both methods. Hence, markers are increasingly more efficient in estimating family effects with increasing Nfampergroup.

Discussion
Implementation of genomic selection in aquaculture breeding schemes is hampered by the large number of individuals that need to be genotyped [4]. Here, we present a method to apply DNA pooling in genomic selection, which dramatically reduces the genotyping costs of the test-population [11]. The DNA pooling further avoids pedigree recording, as is the case in traditional family-based designs, in the test-population, and the dense SNP genotyping also achieves this in the selection candidate groups. In addition, the low genotyping costs of the DNA pools make it very cost-effective to extend the test group to more traits that can only be measured on sibs of the candidates, i.e. towards highly multitrait breeding schemes. A methodology to estimate SNP effects from DNA pooling data was derived and yielded high selection accuracies, i.e. 0.60-0.85 with a large number of test individuals. This was especially the case if Ntest = 20000 or more for the aquaculture breeding schemes used here, even when multiple families were grouped and genotyping of pooled samples was done. The accuracy of selection decreased with an increasing number of families per group. If Ntest was only 2000, selection accuracy was substantially reduced, showing the importance of having large numbers of phenotypic records to accurately estimate marker effects.
The methodology presented here for DNA pooling in genomic selection will be beneficial to most species, where genomic selection is applied. In most species, the cost of genotyping large numbers of test individuals hampers seriously implementation of genomic selection. Genomic selection is currently mostly used in dairy cattle, where the use of accurately progeny tested bulls reduces the size of the test population. Still, Van Raden et al. [12] have had to genotype 3600 test bulls to obtain a high selection accuracy. Furthermore, the use of genomic selection instead of progeny testing for the selection of bulls implies that there will be no progeny tested bulls available in future dairy cattle schemes. Thus, in the future, the test population will consist of very large  Genetic gain for GROWTH was increased in Table 1 when Nfampergroup = 1 and marker density was low. In this situation, the estimation of the marker effects resembles that of a TDT (Transmission Disequilibrium Test) for quantitative traits, where the effect of the marker is also estimated within families but is expected to be the same across all families, i.e. the markers are picking-up LD but are corrected for family effects (spurious associations). If the marker density is low, the markers will show only low LD with the QTL, and since they are also not picking up family effects, marker effects will be small. The latter results in a relatively low efficiency of the marker-assisted selection part of the selection for SIB_TRAIT and thus in relatively small allele frequency changes of positions linked to the largest SIB_TRAIT QTL. The latter implies that the selection for GROWTH is not hindered by such frequency changes and thus may explain why the selection for GROWTH is relatively efficient when Nfampergroup = 1 and marker density is low.
We also investigated the effect of different correlations between GROWTH and SIB_TRAIT. Here we assumed that every QTL had correlated multi-normally distributed effects for GROWTH and SIB_TRAIT with a correlation of 0.3, 0.0 and -0.3 (since we lacked a Multitrait-Laplacian distribution sampler). With group means estimated as the mean of the phenotypes of the individuals within the group and Nfamperpool = 10, ΔG GROWTH was reduced by 18% and ΔG SIB_TRAIT by 24% when the correlation was -0.3 instead of 0.0. With a correlation of 0.3, ΔG GROWTH increased by 20% and ΔG SIB_TRAIT by 24%. The breeding scheme suggested here relies heavily on the success of genotyping pooled samples. Our method assumed accurately estimated allele frequencies in both the L-and H-pools, but estimation errors on the pool mean frequencies have been reported, e.g. variance of the estimation error, i.e. the so-called technical error was estimated by Craig et al. [13] to be 6.8 × 10 -5 . Macgregor et al. [11] have reported that these errors depends on several parameters, such as density of the SNP chip, pooling strategy and array dependent parameters such as number of beadscores per SNP. Baranski et al. [14] have found a correlation between individual and pooled genotypes of 0.98 for a scheme with 60 families, one animal/family/pool, and three replicates per pool. Each pool consisted of susceptible and resistant groups for infectious salmon anemia of Atlantic salmon, where 15 individuals per family had been individually tested for the disease.
The improved results with large numbers of test fish per pool suggest that the accurate estimation of allele frequencies in the high and low pool are crucial to estimate the marker effects. In case the DNA pooling technique does not achieve such a high accuracy, the DNA pooling can be replicated in order to achieve the required accuracy, i.e. the error variance of the average of the allele frequencies estimates over all 'low' ('high') replicated pools is p(1-p)/N + V t /m, where p is the true allele frequency, N is the total number of individuals in all 'low' ('high') pools, m is the number of replicated DNA poolings, and V t is the technical error due to the pooling technique, which we assumed to equal 0. The V t /m term can be reduced by increasing the number of replicates. Our numbers of individuals of 2,000/2, 20,000/2 and 50,000/2 could be interpreted as an effective numbers of individuals, N e , where Given this equation for N e , combinations of N, m, V t and p can be found that result in error variances similar to those presented in this paper.
Selection accuracy for quantitative traits may be further improved by removing individuals around the population mean from the DNA pools, which will increase the differences in allele frequencies. However, the number of individuals within each of the DNA pools will be reduced, which increases the variability of the allele frequency estimates. The former will improve selection accuracy whilst the latter will reduce it. Thus, further research is needed to investigate the optimal phenotypic selection differential between the two DNA pools.
The genotyping costs of the test individuals have been much reduced by the grouping strategy. However, we still require genotyping of the selection candidates. Due to the preselection step for GROWTH, the number of candidates to be genotyped was reduced from 20000 to 1000 or 2000 in this scheme, which hardly affected the acc SIB_TRAIT . Hence, there will still be a considerable number of individuals to be genotyped. The costs of this genotyping could be reduced by applying a low-density SNP chip to these candidates, as suggested by Habier et al. [15].
The grouping strategy may help to correct for the skewed contribution of parents that often occurs in mass spawning populations, see e.g. [16]. The number of families that should be reared per group to reduce the skewedness of parental contributions needs to be optimised per population.
Phenotyping 20000 animals for the sib trait might be very costly but that will depend on the trait. For instance, if the trait was resistance to a disease challenge, the phenotyping might simply consist in sorting the dead and alive fish.

Conclusions
This new selection scheme combines communal rearing of families, pre-selection of candidates, DNA pooling and genomic selection and makes multi-trait selection possible in aquaculture selection schemes without keeping families separately until individual tagging is possible. The new scheme can also be used for other farmed species, for which the cost of genotyping test individuals may be high, e.g. if trait heritability is low.