Weighted single-step genomic BLUP improves accuracy of genomic breeding values for protein content in French dairy goats: a quantitative trait influenced by a major gene

Background In 2017, genomic selection was implemented in French dairy goats using the single-step genomic best linear unbiased prediction (ssGBLUP) method, which assumes that all single nucleotide polymorphisms explain the same fraction of genetic variance. However, ssGBLUP is not suitable for protein content, which is controlled by a major gene, i.e. αs1 casein. This gene explains about 40% of the genetic variation in protein content. In this study, we evaluated the accuracy of genomic prediction using different genomic methods to include the effect of the αs1 casein gene. Methods Genomic evaluation for protein content was performed with data from the official genetic evaluation on 2955 animals genotyped with the Illumina goat SNP50 BeadChip, 7202 animals genotyped at the αs1 casein gene and 6,767,490 phenotyped females. Pedigree-based BLUP was compared with regular unweighted ssGBLUP and with three weighted ssGBLUP methods (WssGBLUP, WssGBLUPMax and WssGBLUPSum), which give weights to SNPs according to their effect on protein content. Two other methods were also used: trait-specific marker-derived relationship matrix (TABLUP) using pre-selected SNPs associated with protein content and gene content based on a multiple-trait genomic model that includes αs1 casein genotypes. We estimated accuracies of predicted genomic estimated breeding values (GEBV) in two populations of goats (Alpine and Saanen). Results Accuracies of GEBV with ssGBLUP improved by + 5 to + 7 percent points over accuracies from the pedigree-based BLUP model. With the WssGBLUP methods, SNPs that are located close to the αs1 casein gene had the biggest weights and contributed substantially to the capture of signals from quantitative trait loci. Improvement in accuracy of genomic predictions using the three weighted ssGBLUP methods delivered up to + 6 percent points of accuracy over ssGBLUP. A similar accuracy was obtained for ssGBLUP and TABLUP considering the 20,000 most important SNPs. Incorporating information on the αs1 casein genotypes based on the gene content method gave similar results as ssGBLUP. Conclusions The three weighted ssGBLUP methods were efficient for detecting SNPs associated with protein content and for a better prediction of genomic breeding values than ssGBLUP. They also combined fast computing, simplicity and required ssGBLUP to be run only twice.


Background
The availability of molecular data has enabled the development and commercial application of genomic selection in various livestock species, such as dairy cattle [1,2], dairy sheep [3,4], meat sheep [5,6] and dairy goats [7][8][9]. Meuwissen et al. [10] proposed genomic prediction of animals based on dense single nucleotide polymorphism (SNP) maps, by deriving the effects of SNPs from a reference population, for which animals are both phenotyped and genotyped. Genomic estimated breeding values (GEBV) of selection candidates (i.e., usually young individuals with genotypes but without phenotypes) can be estimated by summing up the effects of the SNP alleles carried by each animal.
Methods such as genomic best linear unbiased prediction (GBLUP) [11][12][13][14][15], are used to predict GEBV by replacing the pedigree relationship matrix used for pedigree-based BLUP with a realized genomic relationship matrix. The GBLUP method was further improved with single-step GBLUP (ssGBLUP) [12], which uses simultaneously all phenotypic, pedigree and genotypic information, including phenotypic information on nongenotyped individuals. Therefore, in ssGBLUP, the relationship between each pair of animals (genotyped and non-genotyped) is estimated with a relationship matrix that combines pedigree and genotype information. Several studies have reported that the accuracy of genomic prediction obtained with these methods is higher than with genetic evaluation using pedigree-based BLUP [16][17][18]. However, the accuracy obtained from genomic information depends on several parameters including reference population size [19,20], extent of linkage disequilibrium (LD), heritability of the trait [20,21], relationship between training and validation populations [10] and the genetic architecture of the trait, which relates to the relative size of allele substitution effects at quantitative trait loci (QTL) [10,22].
The GBLUP and ssGBLUP methods usually assume that each SNP follows the same distribution [11,12,16,[23][24][25], thus, all SNPs have the same variance and the same weight for SNP variance. However, different genomic evaluation methods have been developed to allow the variance of the effect of SNPs to differ between SNPs. A priori information can be used to modify the distribution of SNP effects. Giving more variance to some SNPs allows these methods to take the presence of major genes or QTL that affect the trait of interest into account. For instance, various Bayesian methods, which estimate the effect of SNPs from animals that are both genotyped and phenotyped, have been proposed [10,[26][27][28]. The main difference between these Bayesian methods lies in the definition of an a priori distribution of the effects of SNPs. SNPs can be attributed to different distribution classes, which explain different parts of the total genetic variance, with one class possibly containing the SNPs that have no effect on the trait. Because animals need to be phenotyped and genotyped to apply Bayesian methods, phenotypes from non-genotyped animals cannot be included. In dairy breeding programs, genotypes are mainly determined on the males whereas phenotypes come from the females. Thus, daughter yield deviations (DYD) or de-regressed proofs are calculated to obtain pseudo-phenotypes for the males. However, multi-step methods may create bias in genomic predictions [29].
Other methods based on the ssGBLUP framework such as weighted ssGBLUP (WssGBLUP) or on the traitspecific marker-derived relationship matrix (TABLUP) have been proposed [30]. WssGBLUP is an extension of ssGBLUP in which weights for SNP variances are used when forming the genomic relationship matrix [12]. Wss-GBLUP can set more weight to SNPs that are in high LD with a causal mutation or associated with QTL with a relatively large effect. These weights are estimated from the variance explained by each SNP as presented by Wang et al. [23]. The weighting of SNP variances was also investigated by Zhang et al. [24] who proposed to use the same weight for SNPs that are within a defined window along the genome. The TABLUP method proposes to construct the genomic relationship matrix based on genotypes from a subset of pre-selected SNPs. Selection of SNPs can be performed after GWAS analysis or based on weights that are estimated with WssGBLUP. The selected SNPs are then equally weighted for the analyses [30]. Furthermore, an alternative to the previous methods is the gene content method proposed by Gengler et al. [31], which is based on a multiple trait model and considers the gene content for specific genotypes as a new trait. This method can combine information from SNPs and genotypes for a causal mutation [31,32]. The number of alleles carried by each animal is considered as a second trait correlated to the quantitative trait. Then, the causal mutation is integrated directly in the ssGBLUP multiple-trait model. Its advantage is that it can be extended to multi-allelic genes and used when genotypes for a causal mutation are missing [33].
In French dairy goats, the first step towards genomic selection for milk production traits, udder type traits and somatic cell score was taken by Carillier-Jacquin et al. [8,9] for French Alpine and Saanen dairy goat breeds. Carillier-Jacquin et al. [8,9] compared ssGBLUP and other methods of genomic evaluation that require several steps (GBLUP or Bayesian methods). GBLUP and Bayesian methods usually use performances based on pseudo-phenotypes (DYD) whereas ssGBLUP is based on female performance. These authors found that ssGB-LUP gave more accurate predictions of the genetic merit of selection candidates than the previous official genetic evaluation that did not use genomic information, or the use of multi-step genomic methods. However, the increase in accuracy due to using genomic information was not expected to be high because the reference population was small.
Currently, the next step in the genomic evaluation of French dairy goats is to investigate better ways to use genotyping information to improve the accuracy of genomic evaluation. One possibility is to take prior knowledge about major genes into account. Several major genes have been identified, such as DGAT1 for fat content [34] and α s1 casein for protein content [35]. For protein content, Carillier-Jacquin et al. [33] reported that the genetic variance explained by the α s1 casein gene reached 38% in the Saanen and 43% in the Alpine breed. The caprine α s1 casein gene has six alleles ( A , B , C , E , F and O ) that have been identified in the French dairy goat population. Allele A is predominant in the Alpine breed, whereas alleles A , E and F are the most frequent in the Saanen breed [33]. Carillier-Jacquin et al. [33] showed that integrating the α s1 casein gene for protein content with the gene content method improved the accuracy of genomic evaluation (+ 8 to 14% for Alpine and Saanen populations) compared with ssGBLUP.
In this study, our aim was to investigate different methods of genomic prediction that estimate and integrate the fact that chromosomal regions are strongly associated with a trait. Protein content in French dairy goats was analyzed by applying WssGBLUP, two alternatives of the WssGBLUP method, the TABLUP method and the gene content method. These methods were compared with pedigree-based BLUP and ssGBLUP based on the accuracies of predicted breeding values.

Animals, phenotypes and genotypes
The dataset used in this study was provided by the French national milk records system and included animals from the two main French dairy goat breeds, Alpine and Saanen. Phenotypes for protein content, pedigree data, genotypes and environmental fixed effects used in the ssGBLUP method were obtained from the official genetic evaluation of January 2016 [36]. Analyses were performed with a multi-breed dataset (Alpine and Saanen animals combined) and in two separate within-breed analyses.
The trait analyzed was protein content (g/kg) with measurements from 6,767,490 lactations and 2,458,453 females recorded between 1980 and 2010. Descriptive statistics (animal and record numbers, minimum, mean, maximum, coefficient of variation) for each breed are in Table 1.
The pedigree consisted of 2,543,789 animals (1,449,991 Alpine and 1,093,798 Saanen). In addition, it was completed with 36 unknown parent groups. Unknown parent groups were defined for each breed and for animals born before 1975, and then for cohorts born in 2-year windows up to 2010.
Animals that were genotyped with the Illumina goat SNP50 BeadChip (50K SNP) [37] were also used in the analysis. Quality control (QC) for a dataset of 3347 genotyped animals (2020 Alpine and 1278 Saanen) and 53,347 SNPs was performed independently for each breed. SNPs with a minor allele frequency (MAF) lower than 1% and a call rate lower than 95% were removed. Hardy-Weinberg equilibrium was also tested and the associated Chi squared statistic was calculated for each SNP. SNPs with a Chi squared statistic higher than 24 were removed. Finally, animals with a SNP call rate lower than 99% were discarded from the analyses. After QC, 2955 (1749 Alpine and 1206 Saanen) animals and 46,849 SNPs remained for further analyses. Some SNPs within the α s1 casein gene were present on the 50 K SNP but since they did not pass QC, they were removed [33].
Genotypes for the α s1 casein gene were available for 3696 Alpine individuals (2154 males and 1542 females), and 3506 Saanen individuals (2049 males and 1457 females) born between 1982 and 2012. The α s1 casein gene is located on caprine chromosome 6 at 82 Mb and is multi-allelic in the French dairy goat population, with six different alleles ( A , B , C , E , F and O ) and 19 genotypes detected among the 21 possibilities ( FO and OO genotypes have never been detected in the French dairy goat population) [33]. Genotypes of animals with one missing allele were removed from the analysis. The estimated effects of the 19 α s1 casein genotypes on protein content were computed and reported previously [33]. Table 2 includes the number of animals (males and females for Alpine and Saanen breeds) used in this study with information on their α s1 casein and/or 50 K SNP genotypes.
Genomic prediction with and without considering information on the α s1 casein genotypes ssGBLUP was implemented in 2017 in the official genetic evaluations for the two main French dairy goats. This method and pedigree-based BLUP were used as the reference method in our study and compared with Wss-GBLUP, two alternatives of the WssGBLUP method, TABLUP and the gene content method. Analyses were performed using the blupf90 software [38].

Single-step GBLUP (ssGBLUP) method
For both multi-breed and within-breed scenarios, the following model was applied: where y is a vector of performances (female phenotypes) for protein content (phenotypes are based on standardized 250-day lactation records). β is a vector of fixed effects including herd within year (32 years from 1980 to 2012) and within parity (1, 2 and ≥ 3) (188,933 levels in total); age at delivery within year and within region (four regions in France depending on goat breeding management) (3224 levels in total); month at delivery within year and region (1448 levels in total); and length of dry period within year and region (1107 levels in total); a fifth fixed effect for breed (two levels) was added for multi-breed analyses. u is a vector of random additive genetic effects assumed to be normally distributed N 0, Hσ 2 u , p is a vector of random permanent environmental effects assumed to be normally distributed N 0, Iσ 2 p , e is a vector of random residuals that is normally distributed N 0, Iσ 2 e . X is the incidence matrix relating phenotypes to the fixed effects ( β). Z is the design matrix allocating phenotypes to breeding values ( u ) and W is the incidence matrix relating phenotypes to permanent environmental effects ( p). (1) Matrix H is the genetic relationship matrix combining SNP information and pedigree data, implemented as in Legarra et al. [12]: where A is a pedigree-based relationship matrix with indices 1 for ungenotyped animals and 2 for genotyped animals, and G is the genomic relationship matrix derived as in Christensen and Lund [11]: where m is the number of SNPs, p i is the estimated allele frequency at locus i and M is a centered matrix of SNP genotypes.
Variance components were estimated by using the restricted maximum likelihood (REML) method in the remlf90 software [38].

Weighted ssGBLUP (WssGBLUP) method
Model 1 was also used for WssGBLUP but G was constructed differently. Solutions of genomic breeding values from ssGBLUP (Model 1) can be decomposed into SNP effects as modeled in Wang et al. [23]: where â is a vector of SNP effects, D is a diagonal matrix of weights (initially diagonal of 1 for the ssGBLUP), M is the centered matrix of SNP genotypes and û g the vector of GEBV from genotyped animals only. Variances of the effect of SNP i were estimated as: where p i is the allele frequency of SNP i . The vector of variances of SNP effects was normalized (the normalization process ensured that the sum of the variances remained constant and was equal to the number of SNPs) and used as weights in matrix D to construct the weighted matrix G ( G * ) as described in Wang et al. [23]: GEBV were estimated again with Model 1 by considering weights for each SNP via the G * matrix included in the H matrix. This process was carried out iteratively with weights estimated at each iteration as described in Wang et al. [23]. Wang et al. [23] have shown that WssGLUP with only very few iterations may be sufficient to reach a maximum accuracy of GEBV and SNP effects. In this study, we analyzed the influence of the number of iterations (1-10) on the accuracy of genomic predictions. As proposed by Zhang et al. [24], other methods can be considered to calculate the weight for SNPs in the D matrix. These methods assign the same weight to several consecutive SNPs within a chromosomal region. Modifications of the WssGBLUP method were considered in this study and the individual weights were computed as follows: (1) the maximum weight of SNPs included in the chromosomal region, or (2) the sum of the weights of the SNPs included in the chromosomal region. These weights were calculated based on the weights estimated with the WssGBLUP. In the end, the vector of the weights was normalized in such a way that the sum of all weights remained constant and equal to the number of SNPs. Chromosomal regions of various lengths were tested: 2, 5, 10, 20, 40, 80, 100, 150, 200 and 250 consecutive SNPs with non-overlapping windows. Hereafter, these methods are named WssGBLUP i where i denotes the method used to calculate the weights (Max or Sum).

Trait-specific marker-derived relationship matrix (TABLUP) method
Only a subset of SNPs that are more or less associated with protein content was selected to build the G matrix. One of our objectives was to investigate how the genetic architecture of protein content could be taken into account in the ssGBLUP method. Thus, TABLUP was applied by selecting a subset of SNPs according to their effect on the trait (estimated from the WssGBLUP method described previously). A total of 5000, 10,000, 15,000, 20,000, 25,000, 30,000, 35,000 or 40,000 SNPs were selected to construct G . Two scenarios were tested in which either the most or the least strongly associated SNPs were selected. GEBV were estimated with Model 1 and the G matrix that was built based on the selected SNPs without weights ( D = I).

Gene content method
The gene content method estimates the GEBV for each animal by taking information on the α s1 casein genotype, genotypes from the 50K SNP and pedigree into account through a multiple-trait model. The model used here was the same as in [33]: where y is a vector of female performances for protein content. Fixed effects ( β ), random effects ( u , p and e ) and incidence matrices X , Z and W are the same as in Model 1. y A , y B , y C , y E , y F , and y O are vectors of gene content for alleles A , B , C , E , F and O . This corresponds to the number of copies carried by each animal (i.e., 0, 1 or 2). For ungenotyped animals, the value was set to missing.
where p i is the frequency of allele i at the α s1 casein locus. Covariances between genetic values ( u ) and genetic effects of gene content ( u A , u B , u C , u E , u F and u O ) were modeled as in Carillier-Jacquin et al. [33]. Variance and covariance parameters from this model were estimated using the restricted maximum likelihood (REML) algorithm implemented in the remlf90 software.

Accuracy of genomic predictions
Genomic evaluations were performed from all phenotypes recorded until January 2010, but we were also interested in the prediction of genotyped animals that constituted our reference population. This reference population was composed of 905 sires born between 1993 and 2012 and genotyped with the 50K SNP chip ( Table 2) and was split into a training population of 554 sires born from 1993 to 2007 (307 Alpine and 247 Saanen) with phenotypes of their daughters recorded until January 2010), and 351 validation sires born from 2008 to 2012 (205 Alpine and 146 Saanen) with no daughters in January 2013 (daughters of these animals were removed from the dataset). Then, GEBV and DYD computed from the official genetic evaluation of January 2016 were compared for the 351 animals in the validation set. DYD were average performance values for the daughters corrected (2) for environmental effects and merit of the dam, and they were weighted by effective daughter contributions as described in VanRaden and Wiggans [39]. Accuracy of genomic predictions was assessed as the Pearson correlation between GEBV estimated with each model and DYD. Pearson correlations obtained with different methods were tested using the Hotelling-Williams test [40].

Results and discussion
The most frequent α s1 casein genotypes are AA for the males and AE for the females in the Alpine breed, and AE for the females and EE for the males in the Saanen breed (present in more than 50% of the animals). Allele C is rather rare (less than 5% of the animals carry this allele) in the two breeds. The largest differences in genotype frequency between Alpine and Saanen populations were observed for genotypes AA (49% in Alpine vs. 7% in Saanen), EE (3% in Alpine vs. 32% in Saanen) and AE (49% in Saanen vs. 30% in Alpine). These results were consistent with the previous work of Carillier-Jacquin et al. [33] in which fewer genotypes were available. Protein content was analyzed knowing that this trait is highly heritable in both Alpine and Saanen populations (0.5) [41].

Estimation of weights for SNPs with the WssGBLUP method
We compared different genomic methods. First, we used WssGBLUP because we wanted to identify the weights given to SNPs with this method, in order to determine if the chromosomal region including the α s1 casein gene was considered in the analyses. WssGBLUP is an iterative method, and 10 iterations were performed for multibreed analyses and within-breed analyses. Accuracy of genomic predictions was evaluated at each iteration (results not shown). The highest accuracies were obtained at the second iteration as reported by Wang et al. [23] and then decreased slightly. Thus, all the results presented for the WssGBLUP multi-breed and within-breed analyses are those obtained for the second iteration (see Fig. 1). The top 50 SNPs (with the biggest weights) were compared between the three analyses and were all located on chromosome 6 i.e. the multi-breed (between 71 and 86 Mb), Alpine (between 64 and 101 Mb) and Saanen analyses (between 71 and 92 Mb), and their weights ranged from 24 to 115 for multi-breed, from 23 to 45 for Alpine and from 30 to 108 for Saanen analyses. Among these SNPs, 16 were common to the three analyses and located between 78 and 82 Mb; 11 SNPs were common to the Saanen and multi-breed analyses and located between 79 and 83 Mb; 16 SNPs were common to the Alpine and multi-breed analyses and located between 77 and 86 Mb; and only one SNP was common to both the Alpine and Saanen analyses and located at 76 Mb.
WssGBLUP can be used not only for genomic prediction but also for QTL detection as in GWAS [23,24]. In French dairy goat data, the chromosomal regions detected with WssGBLUP were on caprine chromosome 6, which includes a well-known region that was previously located and described by Martin et al. [34] in a GWAS study. They performed linkage analyses (LA) and linkage disequilibrium (LD) analyses on 1941 dairy goats distributed in 20 half-sib families using all females and their 20 sire genotypes and detected a large QTL between 82.5 and 82.8 Mb on chromosome 6. In our study, SNPs with the biggest weights for SNP variances were located within this region.
The WssGBLUP method developed by Wang et al. [23] has some limitations. Weights for SNP variances are estimated by using a whole-genome regression, which can result in their unstable prediction due to multi-collinearity between SNPs because of LD between SNPs. In our study, we tested common weights for several SNPs instead of individual weights for SNP variances, using WssGBLUP Max or WssGBLUP Sum . These methods are expected to limit the large variation in prediction of weights for SNP variances by smoothing weights of SNPs that are in the same window. In our study, WssGBLUP Max and WssGBLUP Sum gave higher accuracies of genomic prediction than the classical Wss-GBLUP. With WssGBLUP Max or WssGBLUP Sum , window sizes were used to allocate the same weights to consecutive SNPs. Another approach would be to use the LD between SNPs, which could limit the multi-collinearity between the SNPs used in the genomic evaluation. Since the weight of SNPs is included through the D matrix, this matrix can be replaced by the weights derived from the GWAS approach.
Including the effect of the α S1 casein gene in WssGBLUP or gene content methods  in Saanen) were lower than accuracies with ssGBLUP (0.77 in multi-breed, 0.76 in Alpine and 0.73 in Saanen), gene content (0.76 in multi-breed, 0.76 in Alpine and 0.72 in Saanen) or WssGBLUP (0.79 for multi-breed, 0.78 for Alpine and 0.77 for Saanen). The gene content method did not improve accuracy of genomic predictions for the three populations compared to ssGBLUP (accuracy was 1 percent point lower for gene content in the multi-breed and Saanen analyses and identical in the Alpine analysis). In addition, accuracies with WssGBLUP were significantly higher than with ssG-BLUP for the Saanen population (+ 4 percent points). We did not observe any significant difference between ssGBLUP and WssGBLUP for multi-breed and Alpine populations.
Previously, Carillier-Jacquin et al. [33] used the gene content and ssGBLUP methods to analyze protein content in French dairy goats. Accuracies obtained with ssG-BLUP were higher in our study than in Carillier-Jacquin et al. [33] for the multi-breed (+ 5 percent points) and Alpine (+ 8 percent points) analyses, and slightly lower for the Saanen analysis (− 2 percent points). A similar trend was observed with the gene content method, with + 1 percent point for multi-breed, + 8 percent points for Alpine and − 14 percent points for Saanen in our study compared to Carillier-Jacquin et al. [33]. The main difference between our study and that of Carillier-Jacquin et al. [33] was the number of animals genotyped with the 50 K SNP chip, number of α s1 casein genotypes, and the size and composition of the training and validation sets. In our study, 82 males and 2050 females genotyped with the 50 K SNP chip and 50 females and 878 males genotyped for the α s1 casein gene were added. In Carillier-Jacquin et al. [33], The main difference between the Carillier-Jacquin et al. study and that reported here was the size of the validation population (2 versus 5 years in our study). The slightly improved results that we obtained may be explained by the larger reference population (823 animals in Carillier-Jacquin et al. [33] compared to 905 in our study), a wellknown factor in the literature on genomic selection. For instance, VanRaden et al. [42] report a gain of + 5 percent points between genomic prediction and parent average by adding 1000 animals in the training population. These results were consistent with the higher accuracy obtained in the multi-breed analysis compared to the within-breed analyses, especially if the trait has the same genetic determinism in the two breeds that are combined (which is the case for protein content). Accuracy is expected to improve even more the size of the reference population continues to grow over the years.
Carillier-Jacquin et al. [33] showed that the gene content method was more accurate than ssGBLUP (+ 3 percent points for multi-breed, + 5 percent points for Alpine and + 11 percent points for Saanen). However, in our study, accuracies of genomic prediction were the same for the gene content method and ssGBLUP. The goat α s1 casein gene has six alleles in the two main French dairy goats and genotype frequencies vary considerably with some being rare. Predicting α s1 casein genotypes with the gene content method for non-genotyped animals remains difficult in this case, especially in French dairy goats, for which the number of non-genotyped animals is large compared with that of genotyped animals (only 0.3% of the population is genotyped for the α s1 casein gene), and 40% of females have unknown parents. This may explain why the gene content method did not outperform ssGBLUP.
The genetic architecture of protein content is similar between the Alpine and Saanen breeds. However, the gain in accuracy with the genomic evaluation methods (ssGBLUP, gene content and WssGBLUP) compared to pedigree-based BLUP was greater for the Saanen than the Alpine breed. As discussed by Carillier-Jacquin et al. [9], the greater gain observed for the Saanen breed between pedigree-based BLUP and genomic evaluation may be explained by a higher level of inbreeding (2.3% in Saanen and 1.8% in Alpine), and a higher kinship coefficient between the training and validation sets (2.4% in Saanen and 1.1% in Alpine using genomic data).
For prediction of GEBV, WssGBLUP was more efficient than gene content, which may be due to the construction of the 50K SNP chip. The region around the α s1 casein gene was enriched in SNPs in the 1-Mb region at 82 Mb on chromosome 6 (the region that contains the α s1 casein gene). Overall, 40 SNPs are present within this 1-Mb region, whereas on average only 20 SNPs per Mb are located outside of this region on chromosome 6 or on other chromosomes. Moreover, the Chi squared test between α s1 casein genotypes and each SNP on chromosome 6 revealed a very strong correlation between α s1 casein genotypes and SNPs on the 50K SNP chip in this region (results not shown). Giving more weight to SNPs that are more strongly associated with protein content seems to be more efficient to capture the effect of the α s1 casein gene than using genotype data for this gene. Vallejo et al. [18] investigated the efficiency of WssGBLUP for bacterial cold water disease resistance, for which several QTL are identified. They observed an improvement of 4 percent points with WssGBLUP compared to ssGBLUP. In our study, we observed similar gains with WssGBLUP. Su et al. [43] also observed a superiority of the WssGB-LUP over ssGBLUP in dairy cattle for milk traits.

Use of common weights on consecutive SNPs with WssGBLUP
WssGBLUP was significantly more predictive than other genomic evaluation methods for protein content in the Saanen breed but not in multi-breed or the Alpine breed. Zhang et al. [24] reported that WssGBLUP Max and WssGBLUP Sum increase the accuracy of genomic evaluation more efficiently than WssGBLUP. We evaluated these methods and Tables 3, 4 and 5 show the results on the validation population in the multi-breed, Alpine and Saanen populations, respectively using WssGBLUP and the two modified WssGBLUP methods (Max, Sum) according to the size of SNP windows. If identical results were obtained for different window sizes, they were merged in the same column. For the multi-breed population, accuracies of the analyses with WssGBLUP Max and WssGBLUP Sum were very similar and differed only with non-overlapping SNP windows of 40, 80, 100, 150, 200 and 250 SNPs, the accuracy (0.81) of WssGBLUP Sum being slightly higher than that of WssGBLUP Max (0.80). Otherwise, accuracies were equal to 0.79 with a window size of two SNPs and 0.80 for window sizes of five, 10 and 20 SNPs. Finally, accuracies of WssGBLUP Max and WssGBLUP Sum were slightly higher than that of WssGB-LUP (0.79) and higher than that of ssGBLUP (0.77).
For both within-breed analyses, increasing the window size barely influenced accuracies. In the Alpine within-breed analysis, a maximum accuracy of 0.79 was reached with the WssGBLUP Sum method and a window size of 40  WssGBLUP Max and WssGBLUP Sum slightly improved the accuracy of genomic predictions for protein content in French dairy goats compared to WssGBLUP. Similar results were observed by Zhang et al. [24] with WssGBLUP Max and WssGBLUP Sum compared to WssG-BLUP on simulated data for five QTL. Zhang et al. [27] presented their results for a window size of 20 consecutive SNPs because when they used windows with more than 20 SNPs, accuracies decreased when many QTL affected a trait. This is due to most of the weight being assigned to the windows with large SNP effects and less weight to those with small SNP effects, which may introduce bias in the estimates. For the populations in our study, accuracies varied little with window size. However, 20 consecutive SNPs were not sufficient to reach the highest accuracies and 40 consecutive SNPs were more appropriate. Thus, for a trait that is influenced by few QTL, WssGBLUP Max or WssGBLUP Sum were more

TABLUP method
To validate that ssGBLUP does capture the α s1 casein gene information, we used TABLUP that consists in selecting a subset of SNPs for constructing the G matrix, i.e. we selected the SNPs that were the most or the least strongly associated with protein content. Figure 3 shows the accuracies obtained with ssGBLUP and TABLUP for the multi-breed population according to the number of SNPs conserved (5000 to 40,000 SNPs) to construct the G matrix. Since results for both Alpine and Saanen breeds were similar to those for the multi-breed population, they are not shown. First, for the SNPs that were the most strongly associated with protein content, TABLUP with only 5000 such SNPs led to a high accuracy of genomic prediction (0.74), which is close to that obtained with ssGBLUP (0.77). TABLUP reached the 0.77 accuracy of ssGBLUP with 20,000 such SNPs, which were distributed across the whole genome with on average 42% of the SNPs on each chromosome being retained and 54% on chromosome 6. This indicates that SNPs around the α s1 casein gene have been more selected than the others. Increasing the number of SNPs from 20,000 to 40,000, did not increase the accuracy furthermore. Conversely, for the SNPs that were the least strongly associated with protein content, TABLUP with 5000 such SNPs led to a very low accuracy (0.47) and increasing their number to 40,000 led to an increase in accuracy of 24 percent points (0.47 with 5000 SNPs and 0.71 with 40,000 SNPs) but accuracy remained significantly lower than that obtained by using the whole 50K SNP BeadChip (0.71 against 0.77).
Using different subsets of SNPs and the BayesA model, VanRaden et al. [44] compared accuracies of genomic predictions in Holstein breed cattle for 33 traits. They used 60K and high-density (HD) SNP panels, and added specific SNPs selected from whole-genome sequence data, which were SNPs based on their annotation (located on exons, splicing sites, indels, 2 kb upstream, 1 kb downstream, untranslated regions, SNPs with large effects). They showed that the highest accuracies were Validation correlations for 351 validation males in the multi-breed population using the TABLUP approach. TABLUP consists in selecting a subset of the genotypes from the 50K SNPs according to their association with protein content (either the most strongly associated or the least strongly associated with protein content), which is done by selecting SNPs according to their weights estimated with the WssGBLUP approach