Fertilised spawn were collected in 2012 from the Halda, Jamuna and Padma rivers as part of a program to replace inbred broodstock in Bangladeshi hatcheries [6]. Fish from each river were reared separately by two commercial hatcheries in the case of the Halda and Padma, and by one hatchery in the case of the Jamuna. At 1 year of age, approximately three hundred catla individuals were randomly selected from each river as candidate founders of a breeding population. These fish were fin-clipped and samples were archived in 2015, as part of the routine husbandry of the breeding population. Fin-clips were obtained from fish that were anesthetized with clove oil by removing an approximately 2-mm-wide sample from the extremities of the dorsal fin. Subsequently, fish were placed in tanks for monitoring and released back into ponds once they had satisfactorily recovered from anaesthesia. All fish in the breeding population are managed in accordance with the Guiding Principles of the Animal Care, Welfare and Ethics Policy of the WorldFish Center [7].
For the purpose of the current study, in 2016, archived fin-clip samples were genotyped. Genotyping was conducted using the DArTseq platform [8] according to the laboratory procedures and analytical pipelines outlined in Lind et al. [9], except that the complexity reduction method involved a combination of PstI and SphI enzymes (SphI replacing HpaII used in Lind et al. [9]). Raw DArTseq data are available at https://doi.org/10.7910/DVN/6LEU9O [10].
Quality control procedures were implemented to ensure that only high-quality and informative SNPs, in approximate linkage equilibrium, were retained for analysis. First, SNPs with an observed minor allele frequency (MAF) lower than 0.05 or a rate of missing observations higher than 0.05 were excluded. Second, only one randomly-selected SNP was retained from each unique DNA fragment. Third, as a measure of linkage disequilibrium (LD), pairwise squared Pearson’s correlations (r2) of genotypic allele counts were computed, and then a random SNP from the pair with the highest r2 was excluded iteratively until all pairwise r2 values were lower than 0.2. Finally, SNPs that deviated from Hardy–Weinberg equilibrium (HWE) were filtered out. To achieve this, a preliminary analysis (method outlined below) was undertaken to identify and remove close relatives, and thus reduce the risk of false identification of SNPs with genotyping problems (see [11]). Then, data were converted to a ‘genind’ object using the ‘df2genind’ function (‘adegenet’ package, version 2.1.1 [12]) and the deviation from HWE for each SNP and sampled population was tested using the ‘hw.test’ function (‘pegas’ package, version 0.10 [13]). All the SNPs that significantly deviated from HWE in any sampled population were excluded (classical χ2 test; P < 0.05 after Dunn–Šidák correction).
To construct genomic relationship matrices (G), the method of VanRaden [14] was implemented using the code from Gondro [15] that was modified to replace missing observations in SNP data with the average of the observed allele frequency. The G matrix was constructed separately for individuals from each river. Clustering of genomic relationships using the ‘Ward2’ algorithm was implemented using the ‘hclust’ function [16]. Individuals were reordered according to clustering and heatmaps of genomic relationships between individuals were generated. These heatmaps revealed the presence of full-sibling and half-sibling relationships between the sampled individuals from each of the three river populations. To account for sibship in subsequent analyses and produce a pedigree for genetic analyses of the breeding population, sibship was assigned to individuals. Sibship was initially assigned using the program COLONY (version 2.0.6.4 [17]). For the COLONY analyses: (1) only SNPs with a MAF higher than 0.2 were retained, i.e. 571 from Halda, 569 from Jamuna, 518 from Padma; (2) individuals from different rivers were assumed to be unrelated; and (3) SNPs were assumed to be on separate chromosomes (i.e. unlinked). COLONY inputs were generated by means of the ‘write_colony’ function (‘radiator’ package version 0.0.11 [18]), using the default settings except that ‘update allele frequency’ was set to true [19]. Errors noted in the COLONY inputs generated by ‘write_colony’ were manually corrected.
Comparison of the G matrices with the pedigree-based additive (i.e. numerator) relationship matrices (A) [20], derived from COLONY sibship assignments, revealed that a large number of putatively full sibling relationships in the G matrices were assigned as half-siblings by COLONY, particularly in the case of the Padma river fish. This disparity was attributed to COLONY falsely splitting large full-sibship groups into multiple full-sibship groups [19, 21, 22]. However, by using the COLONY sibship assignments, putatively unrelated individuals were identified. For each river, these individuals were identified by (1) generating the A matrix (‘makeA’ function; ‘nadiv’ package version 2.16.0.0 [23]); (2) listing individuals that were unrelated (aij = 0) to other individuals in A and then removing these individuals from A; (3) appending to the list that was generated in step (2) the individual remaining in A with the lowest average relationship with the other individuals and then removing this individual and its relatives (aij > 0) from A; and (4) iteratively repeating step (3) until no individuals remained in A. Then, allele frequencies using data from the listed individuals only were taken as estimates of allele frequencies in the sampled river populations. These allele frequencies were provided as inputs to regenerate the G for each river. These G matrices were then used to manually assign sibship and dummy parents. This was achieved by (1) visually identifying groups of individuals with genomic relationships of approximately 0.5 and assigning these to full-sibling groups, and (2) visually identifying genomic relationships between these full-sibling groups of approximately 0.25 and defining these as half-siblings.
Data from putatively unrelated individuals only were used for the analysis of the genetic architecture of the river populations [11, 24]. Observed (Hobs) and expected (Hexp) heterozygosities were estimated by SNP and river of origin (‘summary’ function of ‘adegenet’). The significance of pairwise river of origin differences in mean Hexp were estimated (‘Hs.test’ function with n.sim = 999; ‘adegenet’ package) as was the difference between Hobs and Hexp within rivers (paired t-tests). Allelic richness and private allelic richness among rivers were compared visually using the rarefaction method implemented in ADZE [25].
After data conversion (‘tidy_genomic_data’ function; ‘radiator’ package), pairwise overall Wright’s [26] FST values between river populations [27], and bootstrap 95% confidence intervals derived from 2000 iterations, were computed (‘fst_WC84’ function default settings; ‘assigner’ package version 0.5.0 [28]). For the analysis of molecular variance (AMOVA), data were converted to genclone format with river of origin defined as the only stratum (‘as.genclone’ function; ‘poppr’ package version 2.7.1 [29]). Analysis of molecular variance was conducted using the ‘poppr.amova’ function (‘poppr’ package); implementing the default settings except that (1) variances within individuals were not calculated (i.e. within = FALSE), (2) the Hamming distance matrix was computed (i.e. dist = bitwise.dist(x)), and (3) the missing data threshold was set at 10% (i.e. cutoff = 0.1).
To investigate the possibility that a population structure other than that due to river of origin might fit the data better, unsupervised (K-means) clustering (‘find.clusters’ function; ‘adegenet’ package) was implemented using output from principal component analyses (PCA; ‘glPca’ function default settings with 500 principal components retained). In the implementation of the ‘find.clusters’ function, the maximum number of clusters was constrained to 20 and the number of randomly chosen starting centroids to be used in each run of the K-means algorithm was set at 1000. Then, the optimum number of clusters was identified as the level of K with the minimum Bayesian information criterion (BIC) [12].