Weighted single-step genomic best linear unbiased prediction integrating variants selected from sequencing data by association and bioinformatics analyses

Background Sequencing data enable the detection of causal loci or single nucleotide polymorphisms (SNPs) highly linked to causal loci to improve genomic prediction. However, until now, studies on integrating such SNPs using a single-step genomic best linear unbiased prediction (ssGBLUP) model are scarce. We investigated the integration of sequencing SNPs selected by association (1262 SNPs) and bioinformatics (2359 SNPs) analyses into the currently used 54K-SNP chip, using three ssGBLUP models which make different assumptions on the distribution of SNP effects: a basic ssGBLUP model, a so-called featured ssGBLUP (ssFGBLUP) model that considered selected sequencing SNPs as a feature genetic component, and a weighted ssGBLUP (ssWGBLUP) model in which the genomic relationship matrix was weighted by the SNP variances estimated from a Bayesian whole-genome regression model, with every 1, 30, or 100 adjacent SNPs within a chromosome region sharing the same variance. We used data on milk production and female fertility in Danish Jersey. In total, 15,823 genotyped and 528,981‬ non-genotyped females born between 1990 and 2013 were used as reference population and 7415 genotyped females and 33,040 non-genotyped females born between 2014 and 2016 were used as validation population. Results With basic ssGBLUP, integrating SNPs selected from sequencing data improved prediction reliabilities for milk and protein yields, but resulted in limited or no improvement for fat yield and female fertility. Model performances depended on the SNP set used. When using ssWGBLUP with the 54K SNPs, reliabilities for milk and protein yields improved by 0.028 for genotyped animals and by 0.006 for non-genotyped animals compared with ssGBLUP. However, with the SNP set that included SNPs selected from sequencing data, no statistically significant difference in prediction reliability was observed between the three ssGBLUP models. Conclusions In summary, when using 54K SNPs, a ssWGBLUP model with a common weight on the SNPs in a given region is a feasible approach for single-trait genetic evaluation. Integrating relevant SNPs selected from sequencing data into the standard SNP chip can improve the reliability of genomic prediction. Based on such SNP data, a basic ssGBLUP model was suggested since no significant improvement was observed from using alternative models such as ssWGBLUP and ssFGBLUP.


Background
A large number of causal loci or single nucleotide polymorphisms (SNPs) highly linked to causal loci have been discovered from sequencing data through association analyses [1,2] and bioinformatics analyses [3,4]. However, many of these SNPs are not part of the standard SNP chips that are commonly used in routine genomic evaluation, e.g. the Illumina Bovine SNP50 chip. By combining SNPs selected from sequencing data with those on the 54K bovine chip, improved reliabilities of genomic predictions for genotyped animals have been reported in various dairy cattle populations, such as Nordic Holsteins [5], Nordic Red [6], and Danish Jersey [7]. Simulations [8] showed that the accuracy of genomic prediction with a single-step genomic best linear unbiased prediction (ssGBLUP) model increased from 0.49 to 0.53 after integrating all the causal variants (i.e., simulated QTL), which were not included in the basic 60,000-SNP panel.
In US Holsteins, with a ssGBLUP model, the prediction reliability for stature increased from 0.753 to 0.760 after integrating 16,648 SNPs selected from sequencing data by association analyses for 33 traits into the 54K-SNP chip [9].
One of the critical factors that impacts the performance of genomic prediction is the statistical model used. The GBLUP and ssGBLUP models assume that all SNPs have effects, which are drawn from a normal distribution assuming they all have the same variance [10][11][12], whereas Bayesian alphabet models assume that SNP effects are drawn from more or less complex mixture distributions and therefore can better accommodate SNPs with large effects [13][14][15]. Bayesian alphabet models usually outperform or are as good as the GBLUP model. However, the GBLUP model is still used in routine genomic evaluation because of its lower computational cost and simpler parameterization compared to Bayesian alphabet models. One strategy to achieve the strengths of both Bayesian alphabet and GBLUP models is to place emphasis on the most relevant markers that are discovered with Bayesian alphabet models when constructing the G matrix for GBLUP models, therefore, each trait can have its specific G matrix [16]. Many efforts have been put to find the optimal weighting strategy for constructing the weighted G matrix. For example, Su et al. [17] compared different weighting strategies for GBLUP models and suggested the use of the posterior SNP variance from the Bayesian R model [15] as the weighting factor and a common weight shared by a group of 30 adjacent SNPs when using the 54K-SNP chip. Similarly, Zhang et al. [18] compared different weighting strategies for ssGBLUP models using simulated datasets and reported that a common weight for a group of 20 adjacent SNPs achieved the highest reliability among all weighting strategies.
Integrating SNPs selected from sequencing data into the currently used 54K-SNP chip has the potential to include causal loci or SNPs in high linkage disequilibrium with causal loci. Therefore, a model that allows the inclusion of some SNPs with larger effects would be closer to the real distribution of SNP effects. For protein content in French dairy goats, which is a quantitative trait influenced by a major gene (explaining 40% of the genetic variance), the increase in accuracy was up to 6 percentage points from using a weighted ssGBLUP model compared to a basic ssGBLUP model when integrating the genotypes of the major gene into the genotype data of the 54K-SNP chip [19]. For stature in US Holsteins, when integrating 16,648 SNPs selected from sequencing data by association analyses, the prediction reliability decreased slightly from using a weighted ssGBLUP model with weights derived from the squared SNP effects compared to using a basic ssGBLUP model [9]. However, to date, for SNPs preselected from sequencing data by bioinformatics analyses (e.g., functional annotation), no comparison between a basic ssGBLUP and a weighted ssGBLUP model has been reported. In addition, placing emphasis on SNPs selected from sequencing data in genomic prediction models could be achieved by considering SNPs on the standard SNP chip and SNPs selected from sequencing data as two separate genetic components, i.e. a so-called featured GBLUP/ssGBLUP model. Results from a previous study on genotyped Danish Jersey animals showed that such a featured GBLUP model outperformed a basic GBLUP model for milk yield when integrating SNPs selected from sequencing data by association analyses and bioinformatics analyses, but no difference was observed between the two models for fat and protein yields, mastitis, and female fertility index [7]. To date, there is no study on genomic prediction using such a featured ssGBLUP model.
Our primary objective was to investigate the effects of integrating SNPs selected from sequencing data in genomic prediction of Danish Jersey cattle using ssG-BLUP models. In addition, we compared a basic ssGB-LUP model with alternative weighted/featured ssGBLUP models. Furthermore, for the ssWGBLUP model, we ssGBLUP model was suggested since no significant improvement was observed from using alternative models such as ssWGBLUP and ssFGBLUP.
investigated the optimal region size for selecting regionspecific weights.

SNPs selected from sequencing data
In this study, we used two sets of SNPs selected from sequencing data included in the EuroGenomics customized Illumina Bovine low-density chip (customized LD, Illumina, Inc.) [20]. One SNP set included 1443 traitassociated SNPs selected from association analyses based on imputed full-sequencing information in bulls of three major dairy breeds of Denmark-Finland-Sweden (DFS) [6]. The other SNP set included 2618 SNPs selected from association analyses and bioinformatics analyses (within genes with a strong effect predicted from the Variant Effect Predictor (VEP) [21], within regulatory regions of genes, and breakpoints of structural variants) in five major dairy breeds from France (FRA) [20]. Details on the selection of DFS and FRA SNPs are in Boichard et al. [20] and Liu et al. [7]. For both DFS and FRA SNPs, the discovery populations used to preselect SNPs from sequencing data were not used in genomic prediction to avoid a false positive conclusion.

Genotype imputation
In total, 5480 Danish Jersey bulls, 1161 US Jersey bulls, and 48,039 Danish Jersey cows were genotyped with versions 1, 2, or 3 of the Illumina Bovine SNP50 chip (54K, Illumina, Inc.), the standard Bovine low-density Chip (standard LD, Illumina, Inc.) [22], or the EuroGenomics customized LD chip [20]. The number of animals genotyped with each of these SNP panels is in Table 1. Animals genotyped with different versions of SNP chips were imputed to 54K + DFS + FRA SNPs directly [23], using a family-and population-based approach implemented in the FImpute software [24]. The imputation accuracies of the DFS and FRA SNPs were assessed by randomly masking 10% of the animals genotyped with the customized LD chip to the standard LD chip, while the accuracy for the imputation from the standard LD chip to the 54K-SNP chip was assessed by randomly masking 10% of the animals genotyped with the 54K-SNP chip to the standard LD chip. The SNP-wise imputation accuracy was measured as the Pearson correlation between imputed and observed genotypes (coded as 0, 1, or 2) and the proportion of correctly imputed genotypes to all imputed genotypes (i.e., concordance rate). Only autosomal SNPs with a minor allele frequency higher than 0.01 and an imputation accuracy with both correlation and concordance rate higher than 0.8 were retained for genomic prediction. Ultimately, 37,074 SNPs from the 54K-SNP chip, 1262 DFS SNPs, and 2359 FRA SNPs were used for genomic prediction, with 28 SNPs shared by the DFS and FRA SNP sets. Regarding the imputation accuracy of these 40,667 SNPs, correlations between imputed and observed genotypes were 96.4% for the FRA SNPs, 96.6% for the DFS SNPs, and 97.0% for the imputation from the standard LD chip to the 54K SNPs, whereas concordance rates were 98.0% for the FRA SNPs, 98.3% for the DFS SNPs, and 98.4% for the imputation from the standard LD chip to the 54K SNPs.

Phenotype
In order to investigate traits with various genetic architectures, three milk production traits including milk, fat, and protein yields and three female fertility traits including the interval from first to last insemination in heifers (IFLh) and cows (IFLc), and the interval from calving to first insemination (ICF) were analysed. Phenotypic data were provided by Nordic Cattle Genetic Evaluation (Aarhus, Denmark). For milk, fat, and protein yields, corrected 305-day yields [25], which were compiled from the data used in routine test-day evaluations of Nordic dairy cattle [26], were available for the first three lactations. For IFLh, raw field records were available for heifers. For IFLc and ICF, raw field records were available for the first three lactations. Details on trait definitions and data editing for the female fertility traits are in the report on routine genetic evaluation of Nordic Dairy cattle [26].

Model
Breeding values were predicted by using single-trait models, including a pedigree-based BLUP (PBLUP)

PBLUP
In the PBLUP model, the relationship matrix for the mixed model equation [27] was constructed using only pedigree information.
The model for IFLh is: and the model for milk, fat, and protein yields, IFLc, and ICF is: where y is the vector of response variables (corrected 305-day yield for milk production traits, and raw phenotypes for female fertility traits); β is the vector of fixed effects, including herd, year-season of insemination (for IFLh and IFLc), or year-season of calving (for milk, fat, and protein yields, and ICF), calving age expressed in months (for milk production traits) or age at first insemination expressed in months (for female fertility traits), and parity (for all traits except for IFLh); a is the vector of additive genetic effects; pe is the vector of permanent environmental effects; hys is the vector of random effects of herd-year-season of insemination (for IFLh and IFLc) or herd-year-season of calving (for milk, fat, protein, and ICF); e is the vector of random residual effects; and X , Z a , Z pe , and Z hys are incidence matrices relating β , a , pe and hys to y . It is assumed that a ∼ N (0, Aσ 2 a ) , pe ∼ N (0, Iσ 2 pe ) , hys ∼ N (0, Iσ 2 hys ) , and e ∼ N (0, Iσ 2 e ) . A is an additive relationship matrix that considers inbreeding, based on a pedigree (without genetic group) including 9595 males and 656,389 females, which was constructed by tracing the animals with phenotypes three generations back. The estimation of the variance components, and the prediction of breeding values with the PBLUP models were performed using the DMU software [28].

ssGBLUP
In the ssGBLUP model, the relationship matrix for the mixed model equation was constructed using both pedigree (without genetic group) and genomic information. The pedigree used for the ssGBLUP model was exactly the same as that used for the PBLUP model. The effects in the ssGBLUP model are also exactly the same as those in the above PBLUP model, but this model assumes (1) y = Xβ + Z a a + Z hys hys + e, (2) y = Xβ + Z a a + Z pe pe + Z hys hys + e, a ∼ N (0, Hσ 2 a ) , where H is a unified genetic relationship matrix with H −1 constructed as [10,11]: where A is the pedigree-based relationship matrix for both genotyped and non-genotyped animals, G is the genomic relationship matrix adjusted to the same scale as A 22 , and A 22 is a subset of A for genotyped animals. Here, inbreeding was considered in all the relationship matrices. G was defined as: where ω a is the fraction of the genetic variance not captured by SNPs and G a is the adjusted genomic relationship matrix. In this study, ω a was set to 0.2 based on [29]. The G a matrix was adjusted for the differences in scale between the original genomic relationship matrix G m and the pedigree-based relationship matrix A 22 as previously described [30].
where α and β are derived from the following equations: Furthermore, G m was constructed using the following method [12]: where M is a matrix with the elements in column j being 0 − 2p j , 1 − 2p j , or 2 − 2p j for genotypes A 1 A 1 , A 1 A 2 , or A 2 A 2 respectively, p j is the allele frequency of A 2 at locus j computed from all genotyped animals in the model, and D is an identity matrix. Prediction of breeding values with the ssGBLUP models was performed using the DMU software [28], but because of the extensive computational cost, this software cannot estimate variance components directly from the ssGBLUP models. Instead, the variance components estimated from the above PBLUP models that included both genotyped and non-genotyped reference animals (see Additional file 1: Table S1) were used to predict the breeding values with the ssGBLUP models. Thanks to this strategy, the variance components of a trait were the same for analyses using different SNP sets. Four SNP sets were used to construct the relationship matrices: (i) 54K (37,074 SNPs), (ii) 54K + DFS (38,336 SNPs), (iii) 54K + FRA (39,433 SNPs), or (iv) 54K + DFS + FRA (40,667 SNPs).
In the ssFGBLUP model, two H −1 matrices were constructed, i.e. one by combining the 54K-SNP chip and the pedigree data, and one by combining the SNPs selected from the sequencing data and the pedigree data. When constructing the relationship matrices for the ssFGB-LUP models, three SNP sets were used: (i) 54K + DFS, (ii) 54K + FRA, or (iii) 54K + (DFS + FRA). The ssFGBLUP model for IFLh is: and the ssFGBLUP model for milk, fat, and protein yields, IFLc, and ICF is: where y , X , β , Z hys , hys , and e are the same as for the above PBLUP models; a 54K is the vector of additive genetic effects captured by the 54K SNPs and the pedigree; a seq is the vector of additive genetic effects captured by the SNPs selected from sequencing data and the pedigree; and Z a is the incidence matrix relating a 54K and a seq to y . It is assumed that a 54K ∼ N (0, H 54K σ 2 a 54K ) and a seq ∼ N (0, H seq σ 2 a seq ) , with no covariance being constructed between a 54K and a seq . The prediction of breeding values with the ssFGBLUP models was performed using the DMU software [28], but because of the extensive computational cost, this software [28] cannot estimate variance components directly from the ssFGBLUP models. Instead, the total additive genetic variance (the sum of σ 2 a 54K and σ 2 a seq ), σ 2 pe , and σ 2 a e used to predict breeding values with the ssFGBLUP models were obtained from the above PBLUP models (see Additional file 1: Table S1), whereas the proportion of additive genetic variance explained by each genetic component was obtained from the FGBLUP model using genotyped animals in the reference population (see Additional file 1: Table S2).

ssWGBLUP
In the ssWGBLUP model, the genomic relationship matrix was weighted by the SNP variances estimated from a Bayesian whole-genome regression model, which used genotyped animals in the reference population. Here, four SNP sets were used to construct the relationship matrices: (i) 54K, (ii) 54K + DFS, (iii) 54K + FRA, or (iv) 54K + DFS + FRA. The effects of the ssWGBLUP model were exactly the same as those in the above ssGBLUP model, but the j th diagonal element of the D matrix (an identity matrix in ssGBLUP) used to construct the G m matrix is the weight on the j th SNP. In this study, the weight on the j th SNP is its (3) y = Xβ + Z a a 54K + Z a a seq + Z hys hys + e, (4) y = Xβ + Z a a 54K + Z a a seq + Z pe pe + Z hys hys + e, posterior variance (standardized by dividing by the average posterior variance over all SNPs) obtained from a Bayesian model that accounts for region-specific variances (hereafter, BayesN0) [31], where every 1, 30, or 100 adjacent SNPs (BayesN0_bin1, BayesN0_bin30, or BayesN0_bin100) on a chromosome region were assumed to form a region that shared the same variance. In order to investigate the advantage of a Bayesian model that accounts for region-specific variances over a model in which all SNPs share the same variance, a BayesN0 model that considers the whole genome as one region (BayesN0_WG, equivalent to GBLUP) was also used for genomic prediction.
The BayesN0 model used for all traits is: where y is the vector of yield deviations (YD) for genotyped animals in the reference population; 1 is a vector of ones; µ is the overall mean; α i is the vector of SNP effects in the ith genomic region; M i is the genotype matrix in the i th genomic region; and e is the vector of random residual effects. It is assumed that i is the variance of the SNP effect for each SNP within the i th genomic region, v α is the degree of freedom, and S 2 α is the scale parameter. It is assumed that e ∼ N (0, Rσ 2 e ) , where R is a diagonal matrix with elements d jj = (1 − r 2 YD )/r 2 YD to account for heterogeneous residual variances ( σ 2 e ) due to different reliabilities of the YD ( r 2 YD ). In this study, YD, which is defined as the sum of the estimated breeding values (EBV) and estimated residuals, i.e., the phenotype adjusted for all effects in the model other than the additive genetic effect, was derived from the above PBLUP model with phenotypes of animals in both the reference and validation populations using the DMU software [28]. Each of the BayesN0 models was run as a single Markov chain with a total length of 50,000 samples and the first 10,000 samples discarded as burn-in. Ultimately, every 20th sample of the remaining 40,000 samples was saved for the posterior analysis. The analyses with the BayesN0 models were performed using in-house scripts written in Julia language [32]. The prediction of breeding values with the ssWGBLUP models was performed using the DMU software [28]. The variance components used for predicting breeding values with the ssWGBLUP models were obtained from the above PBLUP models (see Additional file 1: Table S1) since the DMU software cannot estimate the variance components directly from the ssWGBLUP models because of the extensive computational cost.

Reliability and bias of prediction
For the prediction of breeding values, 15,823 genotyped and 528,981 non-genotyped females born between 1990 and 2013 were used as the reference population. To avoid close relationships between reference and validation populations, 7415 genotyped females and 33,040 non-genotyped females born between 2014 and 2016 that did not have any sib or daughter born before 2014 were used as the validation population. For each trait, the numbers of genotyped and non-genotyped animals in reference and validation populations are in Table 2. By using the young cows as the validation population, the overlap between the validation population and the discovery population of DFS SNPs was elaborately avoided since the discovery population of DFS SNPs consisted of bulls only. In addition, to avoid using the same information for selecting DFS SNPs and investigating them in prediction, the genotyped bulls were excluded from the reference population in the models used to predict breeding values (the PBLUP and three ssGBLUP models) and in the models used to derive weights (BayesN0 models) for the ssW-GBLUP models. From the perspective of phenotypes, given that all the daughters with phenotypes, either genotyped or non-genotyped, have already been included, adding genotyped bulls will not add information for the PBLUP and three ssGBLUP models. However, in practice, a genetic evaluation based on a ssGBLUP model can include the genotypes of bulls when constructing the relationship matrix in order to benefit their non-genotyped daughters. The number of genotyped animals used in genomic prediction was smaller than that involved in genotype imputation since some young genotyped animals did not have any phenotypic record during the period of data collection. However, including these young genotyped animals in genotype imputation can benefit the imputation of SNPs selected from sequencing data for genotyped animals involved in genomic prediction, since most of the young animals were genotyped with the customized LD chip. The reliability of predictions was measured as the squared correlation between EBV and YD divided by the average reliability of YD in the validation population for genotyped and non-genotyped animals, separately. In the validation population, the average reliabilities of YD were consistent for genotyped and non-genotyped animals, and equal to 0.421 for milk yield, 0.308 for fat yield, 0.353 for protein yield, 0.014 for IFLh, 0.033 for IFLc, and 0.055 for ICF. Reliabilities of YD varied little among the validation animals since they comprised young females born between 2014 and 2016, with most of them (> 90% for milk production traits and > 70% for female fertility traits) having only one lactation record until the end of data collection. The bias of prediction was measured as the regression coefficient of YD on the EBV for genotyped and non-genotyped animals in the validation population separately. When calculating reliability and bias, unweighted methods (e.g. unweighted regression for bias) were used because of the small variation in reliability of YD among validation animals. A nonparametric bootstrapping procedure with 10,000 samples as described in Liu et al. [7] was used to test the difference in genomic prediction between models and between SNP sets. The Bonferroni correction was used to control the false positive rate resulting from multiple tests.

Results
With the BayesN0_WG model (equivalent to GBLUP) (see Additional file 1: Table S3), reliabilities increased significantly after integrating all the SNPs selected from sequencing data compared to the use of the 54K SNP set only, for milk yield (0.045) and protein yield (0.023), but not for fat yield and female fertility traits. However, with the Bayesian models other than the BayesN0_WG model, no significant difference was observed between using the 54K SNP set and a set combining the 54K SNPs and the SNPs selected from sequencing data, except when the BayesN0_bin100 model was used for milk yield. Overall, reliabilities did not differ significantly between the three SNP sets that included SNPs selected from sequencing data, except the reliability obtained by using the 54K + DFS + FRA SNP set, which was 0.006 higher than that of the 54K + FRA SNP set averaged over three milk production traits using the BayesN0_WG model. Overall, reliabilities were the same for all region sizes, except for protein yield, for which a region size of 30 SNPs achieved higher reliability than a region size of 1 or 100 SNPs.

Table 2 Number of genotyped and non-genotyped animals in reference and validation populations
IFLh the interval from first to last insemination in heifers, IFLc the interval from first to last insemination in cows, ICF the interval from calving to first insemination Figures 1 and 2 present the weights (standardized SNP variances) for the 54K and 54K + DFS + FRA SNP sets obtained by using a BayesN0_bin1 model. The average weight was equal to 1 for each analysis due to the standardization process. For milk production traits, the largest weight when using the 54K SNP set (1260 for milk yield, 119 for fat yield, and 977 for protein yield) was greater than that obtained when using the 54K + DFS + FRA SNP set (993 for milk yield, 62 for fat yield, and 380 for protein yield). For female fertility traits, the largest weight when using the 54K SNP set (3 for IFLh, 4 for IFLc, and 6 for ICF) was similar to that when using the 54K + DFS + FRA SNP set (3 for IFLh, 4 for IFLc, and 4 for ICF). Generally, a large proportion of the top SNPs for milk and protein yields were SNPs selected from sequencing data, whereas this was much less clear for fat yield and female fertility traits.
Reliabilities from the PBLUP and various ssGBLUP models for both genotyped and non-genotyped animals are in Tables 3 and 4. For all traits, ssGBLUP models outperformed PBLUP models. With the 54K SNP set, reliabilities improved from 0.156 with the PBLUP model to 0.366 with a basic ssGBLUP model for genotyped animals, and from 0.127 to 0.153 for non-genotyped animals, averaged over all traits. In addition, for genotyped animals, all ssGBLUP models outperformed the corresponding Bayesian models with the same bin size, due to the inclusion of performance information from non-genotyped females. The improvements of reliabilities ranged from 0.006 for protein yield to 0.112 for IFLh, averaged over all SNP sets.
With the ssGBLUP models, the integration of SNPs selected from sequencing data resulted in statistically significant improvements of the prediction reliabilities for milk yield (0.039 for genotyped animals and 0.007 for non-genotyped animals) and protein yield (0.019 for genotyped animals and 0.003 for non-genotyped animals), a small improvement for fat yield (0.004 for genotyped animals and 0.001 for non-genotyped animals), but no improvement for female fertility traits. For milk and protein yields, the different SNP sets including SNPs selected from sequencing data had similar performances in terms of reliability. For example, for genotyped animals, the improvements of reliabilities from integrating SNPs selected from sequencing data ranged When using the 54K SNP sets, different single-step genomic prediction models yielded significant differences in reliabilities for milk and protein yields, but no significant difference for fat yield and female fertility traits. When using the 54K + DFS + FRA SNP set, no significant difference in reliability was observed between all the ssGBLUP Table 3 Reliabilities from a pedigree-based BLUP (PBLUP) model and different ssGBLUP models using different SNP sets for milk production traits ssGBLUP models = a single-step genomic BLUP (ssGBLUP) model, a featured ssGBLUP (ssFGBLUP) model, and weighted ssGBLUP models with a region size of 1 (ssWGBLUP_bin1), 30 (ssWGBLUP_bin30), or 100 SNPs (ssWGBLUP_bin100) SNP sets = SNPs on the 54K-SNP chip (54K), SNPs on the 54K-SNP chip together with sequencing SNPs selected by Denmark-Finland-Sweden (54K + DFS), SNPs on the 54K-SNP chip together with sequencing SNPs selected by France (54K + FRA), and SNPs on the 54K-SNP chip together with both sets of selected sequencing SNPs (54K + DFS + FRA) a,b,c subscript letters to the left are for comparisons among models using the same SNP set, and superscript letters to the right are for comparisons among SNP sets using the same model. Reliabilities with no common letter differ significantly (P < 0.05)   Although the differences were very small between region sizes, a region size of 30 SNPs was generally optimal for the ssWGBLUP models, which was consistent with the predictions using a BayesN0 model with different region sizes. For the ssFGBLUP models, the three SNP sets that included SNPs selected from sequencing data were used and compared to the basic ssGBLUP model, no improvement in prediction reliability was observed except for milk production traits with the 54K + DFS SNP set, but this improvement was not significant. Regression coefficients from the Bayesian models for genotyped animals are in Table S4 (see Additional file 1: Table S4). Compared with the BayesN0_WG model, the BayesN0 models with a region size of 1, 30, or 100 SNPs resulted in less bias (the deviation of regression coefficient from 1) for fat yield, IFLh, and IFLc, more bias for milk yield and ICF, and a similar level of bias for protein yield. Regression coefficients obtained by using different SNP sets were generally similar in all BayesN0 models. Regression coefficients from PBLUP and various ssG-BLUP models are in Tables 5 and 6. For genotyped animals, compared to the PBLUP models, the ssGBLUP models resulted in less bias for IFLh and ICF, more bias for milk, fat, and protein yields, and a similar level of bias for IFLc. For non-genotyped animals, compared to the PBLUP models, the ssGBLUP models resulted in less bias for IFLc, more bias for protein yield and ICF, and a similar level of bias for milk and fat yields, and IFLh. With the ssGBLUP models, for both milk production and female fertility traits, less bias was observed for non-genotyped animals than for genotyped animals. Differences in regression coefficients estimated with the different SNPs sets were small when using the ssGBLUP models. Compared to the basic ssGBLUP model, the alternative models, ssWGBLUP and ssFGBLUP, resulted in a similar level of bias or in a significantly increased bias for all SNP sets.

Discussion
Most previous studies that investigated the impact of integrating SNPs selected from sequencing data on genomic prediction focused on genotyped animals only [5][6][7]33]. In this study, we investigated the effects of integrating such SNPs for both genotyped and non-genotyped animals using ssGBLUP models that make different assumptions on SNP effects. The improvement in prediction reliability from integrating SNPs selected from sequencing data varied according to trait, and the best model also varied according to SNP set and trait.
Integrating SNPs selected from sequencing data improved reliabilities for milk production traits, which have heritabilities of about 0.37 but not for female fertility traits, which have heritabilities lower than 0.05. This is in line with a previous study for genotyped Danish Jersey animals [7], which reported improvements of reliabilities from integrating DFS and FRA SNPs for milk, fat, and protein yields but not for the female fertility index. In Danish Jersey, the selected DFS + FRA SNPs contained more top SNPs for milk production traits than for female fertility traits. For example, among the top 20 SNPs from the 54K + DFS + FRA SNP set for each trait using the BayesN0_bin1 model, nine DFS + FRA SNPs were found for milk yield, four for fat yield, and eight for protein yield, but only one for IFLh, three for IFLc, and one for ICF. Among the three milk production traits, for fat yield there were fewer top SNPs among the SNPs selected from sequencing data and, accordingly, improvement in its reliability from integrating these SNPs was relatively small in Danish Jersey, which was also observed in a previous study [7]. This result could be due to the fact that Jersey cattle are characterized by an extremely high level of fat percentage and have large differences in fat profile compared to other breeds (e.g. Holsteins [34]). Since the SNPs selected from sequencing data were obtained from multiple breeds, they probably have limited effects on fat yield in Danish Jersey, especially when the effect of the DGAT1 (diacylglycerol O-acyltransferase 1) gene is already well-accounted for by the 54K-SNP chip (ARS-BFGL-NGS-4939 is the top SNP for fat yield in the 54K-SNP chip and is located within the DGAT1 gene). For milk production traits, the magnitude of improvement in reliability from integrating SNPs selected from sequencing data was smaller in our study than that reported in [7]. One explanation is that, since the reference population size was larger, the improvement from using an additional source of information, such as integrating SNPs selected from sequencing data, was smaller [7,35]. In addition, compared with a GBLUP model, the use of pedigree information for both non-genotyped animals and genotyped animals ( ω a = 0.2 ) in a ssGBLUP model may dilute the impact of integrating SNPs selected from the sequencing data. Compared to the milk production traits, for female fertility traits, the power of QTL detection was expected to be lower due to much lower heritabilities, and the number of SNPs being selected was smaller because of the relatively lower economic importance [7]. For IFLh, we observed slight decreases in reliabilities for genotyped animals after integrating SNPs selected from sequencing data, which could be due to sampling errors resulting from the extremely low heritability of this trait and the relatively small sample size of the validation population (3749 genotyped females with a reliability of YD as low as 0.014). As expected, by using the multiple t-test, none of the decreases in reliability reached the significance threshold. By using a ssGBLUP model, the impact of integrating SNPs selected from sequencing data could be investigated for non-genotyped animals as well. Similar to that observed for genotyped animals, integrating SNPs selected from sequencing data improved reliabilities for milk and protein yields, although the magnitude of improvement was much smaller than that observed for genotyped animals.

Table 5 Regression coefficients of yield deviation (YD) on prediction from a pedigree-based BLUP (PBLUP) model and different ssGBLUP models using different SNP sets for milk production traits
ssGBLUP models = a single-step genomic BLUP (ssGBLUP) model, a featured ssGBLUP (ssFGBLUP) model, and weighted ssGBLUP models with a region size of 1 (ssWGBLUP_bin1), 30 (ssWGBLUP_bin30), or 100 SNPs (ssWGBLUP_bin100) SNP sets = SNPs on the 54K-SNP chip (54K), SNPs on the 54K-SNP chip together with sequencing SNPs selected by Denmark-Finland-Sweden (54K + DFS), SNPs on the 54K-SNP chip together with sequencing SNPs selected by France (54K + FRA), and SNPs on the 54K-SNP chip together with both sets of selected sequencing SNPs (54K + DFS + FRA) a,b,c subscript letters to the left are for comparisons among models using the same SNP set, and superscript letters to the right are for comparisons among SNP sets using the same model. Regression coefficients with no common letter differ significantly (P < 0.05)

Trait
Population The strategy used to select SNPs from sequencing data can highly impact their role in genomic prediction. The DFS and FRA SNPs were selected by association analyses and bioinformatics analyses [6,20]. The benefits of integrating sequencing SNPs selected by association analyses have been verified in various populations, such as Table 6 Regression coefficients of yield deviation (YD) on prediction from a pedigree-based BLUP (PBLUP) model and different ssGBLUP models using different SNP sets for female fertility traits ssGBLUP models = a single-step genomic BLUP (ssGBLUP) model, a featured ssGBLUP (ssFGBLUP) model, and weighted ssGBLUP models with a region size of 1 (ssWGBLUP_bin1), 30 (ssWGBLUP_bin30), or 100 SNPs (ssWGBLUP_bin100) SNP sets = SNPs in the 54K-SNP chip (54K), SNPs in the 54K-SNP chip together with sequencing SNPs selected by Denmark-Finland-Sweden (54K + DFS), SNPs in the 54K-SNP chip together with sequencing SNPs selected by France (54K + FRA), and SNPs in the 54K-SNP chip together with both sets of selected sequencing SNPs (54K + DFS + FRA) IFLh the interval from first to last insemination in heifers, IFLc the interval from first to last insemination in cows, ICF the interval from calving to first insemination a,b,c subscript letters to the left are for comparisons among models using the same SNP set, and superscript letters to the right are for comparisons among SNP sets using the same model. Regression coefficients with no common letter differ significantly (P < 0.05) Nordic Holsteins [5], Nordic Red [6], and Danish Jersey [7]. However, very few studies have examined the effect of sequencing SNPs selected by bioinformatics analyses. Fang et al. [36] showed that reliabilities of genomic prediction improved by constructing a model including an additional feature genetic component based on SNPs associated with specific biologically significant Gene Ontology (GO) terms [37], compared to the use of a basic GBLUP model, especially in an across-breed prediction. VanRaden et al. [38] compared strategies of selecting sequencing variants using the sequencing data from the 1000 Bull Genomes project and showed that reliabilities increased slightly (by 0.004) when both 481,904 candidate SNPs selected from association analyses and 249,966 insertions-deletions were integrated into the genotype data of the HD chip. In our study, by integrating the DFS and FRA SNPs that were selected based on sequencing data in multiple breeds, the information from the breeds with large reference population sizes allowed us to improve genomic prediction of Danish Jersey, which is a numerically small population. The importance of using SNPs selected from sequencing data in multiple breeds was previously suggested in a study on Holstein-Friesian bulls [33], which showed that the use of sequencing SNPs selected from single-breed association analyses decreased the prediction reliability and increased the bias for protein yield, somatic cell score, and IFL, because it is difficult to select top SNPs when extended linkage disequilibrium exists within a single breed. The performance in terms of genomic prediction models depends on the assumptions made about the effects of SNPs, the genetic architecture of the trait, and the SNP set used. In our study, when we used the 54K SNP set, the ssWGBLUP model outperformed the ssGBLUP model for milk and protein yields but not for fat yield and three female fertility traits. This is in line with a previous study that also used the 54K SNP set on genotyped Danish Jersey animals, which showed that a Bayesian R model [15] outperformed the GBLUP model for milk and protein yields, but not for fat yield, mastitis and the female fertility index [7]. Similarly, in French dairy goats, using a 54K-SNP chip, weighted ssGBLUP models outperformed basic ssGBLUP models for traits with identified QTL but not for the traits without known QTL [39]. A similar conclusion was given by Zhang et al. [16] in a simulation study. Tiezzi and Maltecca [40] investigated weighted G matrices for traits with different genetic architectures in US Holsteins and showed that predictive performances increased and bias decreased for traits with moderate to high heritabilities, such as fat and protein percentages. They also showed that, even for lowly heritable traits such as calving ease, gain in reliability was achieved when using the weighted G matrix. Other studies reported that the benefit on reliability from using models that prioritize, select, or weight SNPs is greater for small datasets [5,7]. Compared with the previous study on genotyped Danish Jersey animals [7] that used the 54K SNP set, for genotyped animals, we observed that the benefit on reliability from weighting SNPs decreased for milk and protein yields when a large number of non-genotyped animals were included.
When SNPs selected from sequencing data were integrated in the SNP set, the reliability reached by the best ssWGBLUP model was slightly higher than that from the basic ssGBLUP model for milk and protein yields, but the difference was not statistically significant. In other words, adding SNPs selected from sequencing data to the 54K-SNP chip increased prediction reliability, but no further improvement was obtained by using the weighting strategy. This result does not agree with that of the simulation study by [8], which showed an increase in accuracy by adding causal variants to the standard SNP chip, and a further increase when QTL and SNPs received weights from GWAS. It is possible that the signals of added sequencing SNPs were stolen by the nearby 54K SNPs [38]. In other words, even when the basic ssGB-LUP model is used, more weight could be automatically given to the QTL regions for which the marker density is increased by adding SNPs selected from sequencing data. Similar results were also observed in previous studies. In the study on genotyped Danish Jersey animals of [7], the superiority of a Bayesian R model [15] over a GBLUP model was found to decrease after integrating SNPs selected from sequencing data. In another study on eggshell strength, feed intake, and laying rate in a commercial brown layer line using a HD chip and sequencing data, Ni et al. [41] reported that although the highest predictive ability was achieved when using sequencing SNPs that are located in or around a gene, reliability did not increase with a ssWGBLUP model compared to a basic ssGBLUP model. A similar conclusion was drawn by Fragomeni et al. [9], who reported that a basic ssGBLUP model slightly outperformed a ssWGBLUP model for stature in US Holsteins when sequencing SNPs selected by association analyses were integrated in the SNP set.
In the ssWGBLUP models, we used the posterior SNP variances estimated from the BayesN0 models to weight SNPs in the construction of the G matrices. Previous studies have investigated the weights generated from various parameters, including SNP variances [16,17,19,38,40,[42][43][44], squared SNP effects [16-18, 41, 45], and −log 10 (P-values) [17,41,45]. Su et al. [17] proved that a GBLUP model with a G matrix in which the SNPs are weighted by standardized posterior SNP variances estimated from a Bayesian model was theoretically equivalent to a Bayesian model. The superiority of using the posterior variance from Bayesian models compared to other parameters was also confirmed by simulations [16] and analyses of real data [17].
In our study, the weights on SNPs within each region were generated from region-specific variances, which were estimated by a BayesN0 model [31] that provides the variance of SNPs within each non-overlapping genomic region directly. We also applied a ssWGBLUP model in which the weight on SNPs within a region was the average variance of each SNP within the region estimated from a Bayesian R model [15]. Estimating SNP variances first and then generating the weight for each region based on SNP variances is a strategy applied in various studies [17,18]. Since the performance of the ssWGBLUP model using weights from a Bayesian R model (results not shown) was similar to that observed from a ssWG-BLUP model using weights from a BayesN0 model, we present only the results from the latter. By comparing the results obtained with different region sizes, we found that reliabilities varied little with region size, except for protein yield for which a 30-SNP region resulted in a slightly higher reliability than a 1-or 100-SNP region when SNPs selected from sequencing data were integrated. Su et al. [17] compared weighted GBLUP models with a region size of 1, 5, 10, 30, 50, 70, 100 and 150 SNPs using the 54K-SNP chip in Nordic Holsteins and suggested that a region size of 30 SNPs was optimal, but differences between different region sizes were less than 1percentage point. Teissier et al. [19] compared weights from a single-marker weighting strategy with a group-marker weighting strategy in which the weight was either the sum or the maximum of single-marker weights among all SNPs within that region, and showed that the groupmarker strategy outperformed the single-marker strategy. The reasoning behind the group-marker strategy was that the weight from the estimated parameter of a single SNP has a large uncertainty compared with the average from a group of SNPs. In addition, many studies applied ssWGBLUP models using an iterative process [18,19,42] in which the solutions of the current ssGBLUP model were used to generate quadratic weights (e.g. SNP variance [19] and squared SNP effects [18]) on SNPs in the next run. According to Wang et al. [42] and Teissier et al. [19], the second iteration was optimal if the unweighted ssGBLUP model was considered as the first iteration, since more iterations would cause overweighting of SNPs with large effects and underweighting of SNPs with small effects. The ssWGBLUP model in our study used the variances estimated from the Bayesian model, which avoids the problem with the iterative procedure. Furthermore, under a framework of a single-trait evaluation, the ssW-GBLUP model can take advantage of Bayesian models (weights could be lagged for three years based on Su et al. [17]) while keeping a similar computational cost to that of the basic ssGBLUP model.
Similar to the ssWGBLUP model, the ssFGBLUP model, which we introduce for the first time here, was allowed to place more emphasis on the SNPs selected from sequencing data than on the SNPs of the 54K-SNP chip. One difference between the ssFGBLUP and ssWGB-LUP models is that ssFGBLUP assumes that all the SNPs within a same genetic component share the same variance, whereas ssWGBLUP allows each SNP or each SNP region to have its own variance and therefore does not prioritize all SNPs selected from sequencing data. In our study, no significant difference was found between the reliabilities from ssFGBLUP and a basic ssGBLUP model. On the one hand, the basic ssGBLUP model automatically places more emphasis on the QTL regions because of the increased marker density in QTL regions due to the addition of SNPs selected from sequencing data, which were more likely to be causal SNPs or highly linked to causal SNPs. On the other hand, in the ssFGBLUP model, the proportions of variances accounted for by the two genetic components (54K SNPs vs. selected sequencing SNPs) were approximately derived from variance components estimated using a FGBLUP model. Such an approximation may be not optimal, because FGBLUP uses G matrices, which involve only genotyped animals whereas ssFGBLUP uses H matrices, which involve both genotyped and non-genotyped animals.
Regarding regression coefficients, with the same model, there was little difference among different SNP sets for both genotyped and non-genotyped animals. This was consistent with the finding from a previous study on stature in US Holsteins, where no difference in regression coefficients was observed between 54K and 54K plus 16,648 selected sequencing SNPs for both basic ssGB-LUP and ssWGBLUP (with weights derived from squared SNP effects) models [9]. In our study, considering both the results from regression coefficients and reliabilities, the integration of selected sequencing SNPs into the standard SNP chip was suggested for predicting breeding values since it could improve the prediction reliability without compromising unbiasedness. With the same SNP set, we found that the regression coefficients for all milk production traits decreased with ssWGBLUP model compared to the basic ssGBLUP model. Given that all milk production traits had regression coefficients lower than 1 for both genotyped and non-genotyped animals, the use of the weighting strategy as used in ssWGBLUP models increased the inflation of EBV. A previous study on stature in US Holsteins [9] reported similar results i.e., that for both 54K and 54K plus 16,648 selected sequencing SNPs, the regression coefficient of ssGBLUP models decreased from 0.88 to 0.79 after weighting SNPs by squared SNP effects. However, when weighting of the SNPs was carried out by a nonlinear method [12] that resembled the Bayesian A method, the change in regression coefficients was limited. In our study, considering both the results from regression coefficients and reliabilities, a basic ssGBLUP model was suggested when integrating SNPs selected from sequencing data since the ssWGBLUP model resulted in little improvement in reliabilities but could lead to more inflation in EBV. For IFLh and ICF, unstable regression coefficients were observed for genotyped animals, even when using a PBLUP model which did not involve any genomic information. The reasons causing the unusual regression coefficients are unclear. One possible reason for this could be the consequence of the low heritability of the trait (lower than 0.05) and a relatively small sample size of the validation population (3749 for IFLh and 2833 for ICF), where a small change in additive genetic variance resulting from a sampling error could lead to a large change on the EBV scale. However, this may not be sufficient to account for these unusual regression coefficients. There may be some underlying mechanisms that are difficult to identify.

Conclusions
In summary, we show that when using the genotype data of the 54K-SNP chip, a ssWGBLUP model with a common weight on the SNPs within a specific region (about 30 SNPs) can be a feasible approach for routine genomic evaluation. Integrating relevant SNPs selected from sequencing data by association and bioinformatics analyses into the standard SNP chip slightly improves genomic prediction reliability. With such a SNP set, we recommend the use of a basic ssGBLUP model since no significant improvement is observed from using alternative models such as ssWGBLUP and ssFGBLUP.
Additional file 1: Table S1. Title: Variance components estimated from the pedigree-based BLUP (PBLUP) model. Description: The data provided variance components estimated from the PBLUP model. Table S2. Title: Variance components1 estimated from a featured genomic BLUP (FGBLUP) model, where the selected sequencing SNPs were considered as a feature component. Description: The data provided variance components estimated from the FGBLUP model. Table S3. Title: Reliabilities from Bayesian models for genotyped animals. Description: The data provided prediction reliabilities estimated from Bayesian whole-genome regression models using genotyped animals in the reference population. Table S4. Title: Regression coefficients of yield deviation (YD) on prediction from Bayesian models for genotyped animals. Description: The data provided regression coefficients of YD on prediction from Bayesian whole-genome regression models for genotyped animals in the reference population.