Fish rearing and disease challenge
All fish work was conducted in accordance with national and international guidelines. The protocol for this study was approved by the Institutional Animal Care and Use Committee (IACUC; Protocol # 053) of the US Department of Agriculture, Agricultural Research Service, the National Center for Cool and Cold Water Aquaculture. All efforts were made to ensure fish welfare and to minimize suffering.
Details of the fish rearing conditions and the 21-day survival trial following intraperitoneal injection with F. psychrophilum (Fp), the causative agent of BCWD, have been reported elsewhere [3, 24]. Mortalities were removed and recorded daily and fin clipped. Fish that survived to day 21 post-infection were euthanized in 200 mg L−1 of tricaine methanesulfonate, MS 222 (Sigma) for at least 10 min prior to sampling of fin clips. Fin clips from all mortalities and survivors were individually kept in 95% ethanol until DNA was extracted using established protocols [25].
Training and testing data
The training sample included 102 pedigreed full-sib (FS) families from year-class (YC) 2013 of the Troutlodge, Inc., all-female, May-spawning population (Fig. 1). The 102 YC 2013 families represented a nucleus breeding population undergoing selection for growth, and thus had not previously been selected for BCWD resistance. The fish from YC 2013 families were evaluated in the laboratory BCWD challenge in two tanks per family, with an initial stocking of 40 fish per tank (total phenotyped fish n = 7893). The original study design was to sample n = 1500 fish with phenotypes and genotypes from 50 FS families. Of the 50 FS families, 25 were full-sibs of the testing sample and 25 were least related to the testing sample families based on pedigree records. We sampled ~40 fish from the 25 FS families that were closely related to the testing sample and ~20 fish per family from the other less related 25 families. In practice, we sampled n = 1473 fish with phenotypes and genotypes from those 50 families (n = 17 to 40 per family). Thus from the 7893 BCWD evaluated fish, 1473 fish had genotype data.
The testing sample included 930 potential breeders or selection candidates (sires and dams) that were disease naïve fish sampled from 25 families (n = 31 to 44 testing fish per family). The testing fish had family-based EBV for survival days (DAYS) and survival status (STATUS) that were estimated with a pedigree-based BLUP model (described below) using BCWD survival records measured on their siblings and any collateral relatives among the 102 FS families (n = 7893). Each of these testing fish also had predicted GEBV from GS models (also described below).
To assess the accuracy of the GEBV, we generated 138 next-generation YC 2015 FS progeny testing families (PTF) from crosses that involved 193 of the YC 2013 testing fish (Fig. 1). These 138 YC 2015 PTF were phenotyped in 2015 for BCWD survival (n = 9968) to calculate the mean progeny phenotype (MPP) for each PTF.
BCWD resistance phenotypes
Survival DAYS, the number of days to death post-challenge, were recorded for a total of 21 days post-challenge, with survivors being assigned a value of 21 days post-challenge. Each fish also had a binary survival STATUS record. The binary STATUS had two classes: 2 for fish that were alive on day 21 post-challenge and 1 for fish that died during the 21 days post challenge evaluation period. In the GS analysis, we used DAYS and STATUS records from training sample fish to estimate marker effects to then predict GEBV for DAYS and STATUS for each of the testing sample fish.
SNP genotyping platform
Genotyping was performed by a commercial genotyping service provider (Neogen, Inc., Lincoln, NE) using the Rainbow Trout Axiom® 57 K SNP array, as previously described in [26]. Our quality control (QC) bioinformatics pipeline filtered out SNPs with significant distortion from the expected Mendelian segregation in each FS family (Bonferroni adjusted to P < 0.10) and also removed two training fish that did not have genotypes that matched the parents based on the pedigree records. After genotype data QC, a total of 41,868 SNPs were included in the genotyping dataset.
Before training the GS models, all genotyped SNPs were further filtered using QC algorithms that are implemented in the computer program BLUPF90 [27]. The QC retained SNPs with a genotype calling rate higher than 0.90, minor allele frequency higher than 0.05, and departures from Hardy–Weinberg equilibrium less than 0.15 (difference between expected and observed frequency of heterozygotes). Parent-progeny pairs were tested for discrepant homozygous SNPs, those SNPS with a conflict rate of more than 1% were discarded. After this final QC step, 35,636 SNPs remained for the GS analysis.
Estimation of pedigree-based EBV
For the testing fish, we estimated EBV for BCWD resistance phenotypes using a pedigree-based BLUP (P-BLUP) model. Family-based EBV were estimated using BCWD survival records measured on siblings of the testing fish and any collateral relatives. The phenotypic dataset included records from n = 7893 fish from 102 FS families (39 paternal half-sib families, no maternal half-sib families, and 24 families not nested within a half-sib family). The pedigree dataset included 32,279 fish from seven generations.
Based on past genetic analyses for estimating EBV for BCWD resistance in rainbow trout [3, 23], we decided to use an animal model that included a population mean, random animal genetic and random residual effects. The records of the BCWD survival phenotypes DAYS and STATUS were fit into P-BLUP linear and threshold models, respectively, using the computer application BLUPF90 [27]. Family was not included in the model because we have often found that genetic variance is downward-biased when the family effect is included in the animal model [23]. The challenge tank effect was also not included in the model because the fish were too small for individual tagging at the time of disease challenge and hence the fish were challenged and reared in individual family tanks, which confounded tank with family effects. Likewise, body weight was not included in the model because, for this high-throughput disease challenge study using non-tagged fish, pre-challenge body weight data are only available as an average body weight for each challenge tank (i.e., bulk weight of fish divided by number of fish) and are, therefore, confounded with family.
Estimates of the heritability for the binary trait STATUS obtained with the pedigree-based BLUP (and GS models below) on the underlying scale of liability were transformed to the observed scale of survival STATUS using this expression:
$$h_{observed}^{2} = \left( {h_{liability}^{2} i^{2} p} \right)/\left( {1 - p} \right);$$
where \(i\) is the mean deviation of affected individuals from their group mean, and \(p\) is the incidence of mortality [28].
Estimation of GEBV with Bayesian variable selection models
The SNP genotype data from the training fish (YC 2013 families), with their corresponding BCWD phenotypic records, were used to train the prediction model and estimate marker effects using the Bayesian variable selection model BayesB implemented in the software GENSEL [29] as previously described in [23]; an animal model was used that included a population mean, random marker and random error effects. The mixture parameter \(\pi\) was assumed to be known and defined to meet the condition \(k \le n\); where \(n\) is the number of training fish. After testing \(\pi = 0.96, 0.97,\;{\text{and}}\; 0.98\) by performing fivefold cross-validation analyses (results not presented), we decided to use \(\pi = 0.97\) in the final GS analysis with BayesB because it yielded the best accuracy predictions.
The software GENSEL uses a Gibbs sampling approach in the BayesB analysis [30]. In this study, DAYS and STATUS were analyzed using 210,000 Markov chain Monte Carlo (MCMC) iterations, of which the first 10,000 samples were discarded as burn-in. From the remaining 200,000 samples, we saved one from every 40 samples, thus a total of 5000 samples were used in the analysis. We assessed the proper mixing and convergence of the MCMC iterations using the R package CODA [31] to ensure that the MCMC samples were drawn from the full posterior distributions.
Estimation of GEBV with single-step GBLUP models
The SNP genotype data from training fish and pedigree information on all fish included in this GS study were used to estimate GEBV for the testing sample fish (n = 930 full-sibs of training fish that were not disease challenged) using two methods: (1) single-step genomic BLUP (ssGBLUP) [20, 32]; and (2) weighted ssGBLUP (wssGBLUP), as previously described [23]. In wssGBLUP, the weights for each SNP are 1s for the first iteration, which means that all SNPs have the same weight (i.e., standard ssGBLUP). For the next iterations (2nd, 3rd, etc.), the weights are individual SNP variances that are calculated using both the SNP effects estimated in the previous iteration and their corresponding allele frequencies [22]. In contrast to the BayesB model, the ssGLUP and wssGBLUP models included also fish from YC 2013 families in the analysis, which had only BCWD resistance phenotype records (n = 6420) without marker genotype data: full-sibs of training fish (50 FS families) and 52 additional FS families from the same breeding population as the training and testing fish (Fig. 1). The linear and threshold models to estimate GEBV for DAYS and STATUS, respectively, included a population mean, random animal genetic effects, and random error effects and were fitted as previously described in [23] using the software BLUPF90 [27].
Before performing the GS analysis with ssGBLUP and wssGBLUP, we estimated genetic parameters to use as priors in the Bayesian analysis of the binary trait STATUS as previously described in [23]. The MCMC Gibbs sampling scheme included a total of 210,000 iterations; the first 10,000 iterations were discarded as burn-in iterations. Then, from the remaining 200,000 samples, one from every 40 samples was saved for analysis. This Gibbs sampling scheme collected 5000 independent samples for analysis. The proper mixing and convergence of these MCMC iterations were also assessed using the R package CODA [31].
Predictive ability and bias of EBV and GEBV
The predictive ability (PA) of EBV and GEBV, which are both estimates of additive genetic effects, was estimated under the assumption that the correlation of mid-parent EBV or GEBV with the mean progeny performance (MPP) for each PTF is an estimate of the accuracy of the estimated breeding values [23, 33, 34]. We used the mid-parent EBV or GEBV instead of the individual EBV or GEBV of each parent, because the testing fish were mated to each other to generate the 138 PTF, rather than mating each testing fish to a large random sample of fish from a common genetic background, as is often done in GS studies with terrestrial agricultural animals and birds.
Bias of the EBV was estimated as the regression coefficient of MPP on predicted mid-parent EBV \(\left( {\beta_{MPP.EBV} } \right)\). Similarly, bias of the GEBV was estimated as the regression coefficient of MPP on predicted mid-parent GEBV \(\left( {\beta_{MPP. GEBV} } \right)\). A value of 1.0 for the regression of true breeding value, performance phenotype or MPP on predicted EBV or GEBV is theoretically expected for unbiased estimates of BV; and a deviation from 1.0 can be interpreted as prediction bias [35]. Before estimating the regression coefficients, the predicted EBV and GEBV for STATUS, which were estimated on the underlying scale of liability, were transformed to the observed scale. Categorical data analysis performed with the software programs BLUPF90 and GENSEL uses a probit link function; therefore, the EBV and GEBV were transformed to the standard normal cumulative distribution function (CDF) to estimate the probability of survival [36, 37].
Impact of GS study design on accuracy of GEBV
To evaluate the impact of sample size and relatedness between the training and testing fish on the accuracy of GEBV predictions, we used five GS schemes that were developed using the genotype and phenotype records collected in this study, as outlined in Fig. 2 and Table 3. The following study design variables were evaluated: size of the training data (~500, ~1000 or ~1500 fish); number of training families (25 or 50 families); size of the training families (20 or 40 fish per family); and proportion of fish in the training data that were full-sibs (FS) of the testing fish (Table 3). For the latter variable, scheme 1 = 0.66 means that 66% of the fish in the training data were FS of fish in the testing data; scheme 2 = 0.50 means that 50% of the fish in the training data were FS of fish in the testing data; schemes 3–4 = 1.0 means that all fish in the training data were FS of fish in the testing data; and scheme 5 = 0.0 means that none of the fish in the training data were FS of fish in the testing data (i.e., fish in the training and testing data were sampled from different families from the same breeding population). In scheme 1, there were two distinct groups of training families: (1) a set of 25 families with ~40 progeny each (n = 979) that also contributed fish to the testing data; and (2) a set of 25 families with ~20 progeny each (n = 494) that did not contribute fish to the testing data (Fig. 2). In scheme 2, we used both groups again, but reduced the number of fish sampled per family in group (1) to ~20 (n = 497). In scheme 3, we only sampled group (1) (n = 979). In scheme 4, we only sampled group (1) again, but reduced the number of fish sampled per family to ~20 (n = 497). In scheme 5, we only sampled group (2).
We used only GEBV that were estimated with the Bayesian variable selection model BayesB for this evaluation of the impact of the GS study design on the accuracy of predictions because it resulted in the highest accuracy of GEBV in scheme 1, which is the scheme with the largest training sample size. The BayesB model was run using three mixture parameters \(\left( {\pi = 0.97, 0.98,\;{\text{and }}\;0.987} \right)\), which were chosen accordingly based on the training data size of the tested GS scheme (Table 3).