Variable selection models for genomic selection using whole-genome sequence data and singular value decomposition

Background Non-linear Bayesian genomic prediction models such as BayesA/B/C/R involve iteration and mostly Markov chain Monte Carlo (MCMC) algorithms, which are computationally expensive, especially when whole-genome sequence (WGS) data are analyzed. Singular value decomposition (SVD) of the genotype matrix can facilitate genomic prediction in large datasets, and can be used to estimate marker effects and their prediction error variances (PEV) in a computationally efficient manner. Here, we developed, implemented, and evaluated a direct, non-iterative method for the estimation of marker effects for the BayesC genomic prediction model. Methods The BayesC model assumes a priori that markers have normally distributed effects with probability \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ \uppi $$\end{document}π and no effect with probability (1 − \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ \uppi $$\end{document}π). Marker effects and their PEV are estimated by using SVD and the posterior probability of the marker having a non-zero effect is calculated. These posterior probabilities are used to obtain marker-specific effect variances, which are subsequently used to approximate BayesC estimates of marker effects in a linear model. A computer simulation study was conducted to compare alternative genomic prediction methods, where a single reference generation was used to estimate marker effects, which were subsequently used for 10 generations of forward prediction, for which accuracies were evaluated. Results SVD-based posterior probabilities of markers having non-zero effects were generally lower than MCMC-based posterior probabilities, but for some regions the opposite occurred, resulting in clear signals for QTL-rich regions. The accuracies of breeding values estimated using SVD- and MCMC-based BayesC analyses were similar across the 10 generations of forward prediction. For an intermediate number of generations (2 to 5) of forward prediction, accuracies obtained with the BayesC model tended to be slightly higher than accuracies obtained using the best linear unbiased prediction of SNP effects (SNP-BLUP model). When reducing marker density from WGS data to 30 K, SNP-BLUP tended to yield the highest accuracies, at least in the short term. Conclusions Based on SVD of the genotype matrix, we developed a direct method for the calculation of BayesC estimates of marker effects. Although SVD- and MCMC-based marker effects differed slightly, their prediction accuracies were similar. Assuming that the SVD of the marker genotype matrix is already performed for other reasons (e.g. for SNP-BLUP), computation times for the BayesC predictions were comparable to those of SNP-BLUP.


Background
Singular value decomposition (SVD) is commonly used in multi-variate statistics to study the explanatory variables and to decompose the design matrix into independent components [1]. In principal component regression (PCR), only the components with the largest singular values are fitted, i.e. PCR considers the smaller components as noise on the explanatory variables, X, and thus omits them [1]. This reduction in the number of components is essential in the so-called k > N problems, where the number of explanatory variables (k) exceeds the number of records (N), because it is not possible to estimate more than N effects from N records, unless random effects are assumed, in which case more than N effects can be predicted.
In genomic selection (GS), matrix X contains the marker genotypes and the number of marker effects (k) can greatly exceed the number of phenotypic records, especially in the case of whole-genome sequence (WGS) data. In these cases, the k > N problem is tackled mainly by the use of prior information for the marker effects. For instance, the marker effects can be assumed to have a normal distribution, as in the single nucleotide polymorphism best linear unbiased prediction (SNP-BLUP) model, or they can be assumed to come from a mixture of two distributions with one of them having all its probability density at zero [2]. The latter model assumes with some prior probability π that the marker effect comes from a prior distribution (e.g. the t-distribution in BayesB [3] or the normal distribution in BayesC [4]) and with probability (1 − π) that the marker effect has no effect on the trait. These models are called variable selection models because they attempt to select the variables that affect the trait [5]. Especially in sequence data, this makes sense biologically, since the causal variates are expected to be contained in the sequence, among many non-causal variates [6]. For these models, straightforward application of PCR does not seem very sensible because all principle components are assumed to be affected by all variates, i.e. PCR does not reduce the number of genotypes involved in the prediction.
Computationally, variable selection models are mostly implemented by using Markov chain Monte Carlo (MCMC) algorithms [2][3][4], which make them computationally demanding and impractical for large-scale genomic prediction, especially when these involve WGS data. Although some non-MCMC approximations exist, they do not result in quite as accurate predictions as MCMC implementations of these models [7,8]. Here we show that SVD can simplify the BayesC calculations significantly, make them not MCMC-based, and thus make the analysis of large amounts of WGS data possible in practice. We also compared the results of the SVD-based algorithm to those obtained using the MCMC approach. Although the SVD of large amounts of WGS data remains computationally a formidable task, in a companion paper (Ødegård, Indahl, Stranden, Meuwissen: Large-scale genomic prediction using singular value decomposition of the genotype matrix; Accepted in GSE) showed that this task can be performed per chromosome (segment) and thus in parallel.

Applying SVD to the SNP-BLUP model
Generally, we will assume that we are dealing with WGS data. Polymorphisms in the sequence will be called SNPs, although extension to other types of polymorphisms is straightforward, as long as their genotypes can be translated into covariates in a regression model. We will briefly describe the application of SVD to GS, which will also describe our notation. For more details, see [9,10]. In GS, the general regression model is: where y a (N × 1) vector of phenotypes; µ is the overall mean; X is an (N × k) matrix of standardized genotypes (see [11]); b is a (k × 1) vector of random SNP effects with Var(b) = Iσ 2 b (the SNP-BLUP model with equal SNP variances is assumed for now); and e is a vector of model residuals (Var(e) = Iσ 2 e ). Now, the 'economy' version of SVD is applied to the matrix X to yield (assuming k > N): where U is an (N × N) orthonormal matrix of components that describes the family relationships between the animals (U contains the eigenvectors of the genomic relationship matrix G, with properties: U ′ U = UU ′ = I); S is an (N × N) diagonal matrix of singular values; and V is a (k × N) matrix of components that describe the linkage disequilibrium (LD) structure among the SNPs (properties: V ′ V = I). Matrix V contains the eigenvectors of the X ′ X matrix, which contains the LD between the SNPs as the signed square root of their r 2 values [1,12] ("signed" denotes a positive (negative) sign if the correlation between the SNP genotypes is positive (negative)).
Thus, the above regression model can be rewritten as: where s = V ′ b and b = Vs. In the latter model, US can be seen to represent independent components (i.e. linear combinations) of SNP genotypes, while s represents the effects of these components. At this point, it is possible to omit some of the components with small singular values in S, which reflect noise on the estimates of X. The variance of the effects of the components is: When applied to s, Henderson's mixed model equations (MME) [13] become: where S 2 = SU ′ US (since U ′ U = I), X ′ 1 = 0 (a vector of zeros), since the genotypes are standardized, such that the average is 0 for each SNP (i.e. the allele frequencies used are those computed from the data), and b = σ 2 e /σ 2 b . The coefficient matrix of these MME is diagonal, thus (1) computation of the solutions is easy. From these MME, the prediction error variance (PEV) matrix of the components s is: S 2 + I b −1 σ 2 e , which is a diagonal matrix and thus easy to calculate. From ŝ, we obtain the estimates of the SNP effects (i.e. as the mean of the posterior distribution): The PEV of the effect of SNP j are readily obtained as: where V j is the jth row of V, which accounts for simultaneous estimation of all SNP effects.

Application of the BayesC prior
The BayesC prior distribution is a mixture distribution [4]: where σ 2 is approximately the largest variance a SNP effect is expected to have. E.g., the largest SNP effects are expected to have a variance of ~ 0.001* σ 2 g , where σ 2 g is the additive genetic variance. Suitable values for π and/ or σ 2 can be obtained by cross-validation (for appropriate cross-validation schemes see [14]).
Consider estimation of the effect of SNP j, b j . The model is: where x j is the jth column of the genotype matrix X and ∈ is a vector of residuals, which includes the effects of all other SNPs and the environmental effects, e. Thus, Var(∈) = Iσ 2 e + G −j = Rσ 2 e , where G −j is the genetic variance times the genomic relationship matrix based on all SNPs except SNP j, and with R = I + G −j /σ 2 e . Strictly, R depends on SNP j but since the effect of a single SNP on the overall G matrix is expected to be small, we will assume that R is approximately independent of j. Then, the MME for the estimation of the effect of SNP j is: where x j is the jth column of genotype matrix X, and the variance ratio = σ 2 e /σ 2 . Note that these MME are the same as for the SNP-BLUP model, except for the variance ratio, which is b = σ 2 e /σ 2 b for the latter. The PEV of the estimate of the effect of SNP j is: and with prior probability Again, this is the same as for the SNP-BLUP model, except that the variance ratio is b instead of . For the SNP-BLUP model, we can calculate PEV b j using Eq. (3). And, since we assume that b and σ 2 e are known, we can solve for the x ′ j R −1 x j term in Eq. (5), which represents the effective number of records that contribute to the effect estimate for SNP j. By combining x ′ j R −1 x j , the SNP-BLUP estimate b j [from Eq. (2)], and the variance ratio , we can compute the right-hand-side of Eq. (4): The log-likelihood ratio of b j ∼ N 0, σ 2 versus b j = 0 In order to write the likelihood under the model without an effect at SNP j, b j = 0, we write the (co)variance matrix of the records as Var y = Iσ 2 e + G −j = Rσ 2 e . From the multivariate normal density function, the loglikelihood of this model is: For the model with a non-zero SNP effect, i.e. b j ∼ N 0, σ 2 , the variance of the records is:Var (y) = Rσ 2 e + x j x ′ j σ 2 . The inverse of Var y is : And the determinant of Var y is: The log-likelihood of a nonzero effect at SNP j is most conveniently expressed as a deviation from the log-likelihood of the model when b j = 0, i.e. as the loglikelihood where the term (x ′ j R −1 x j + ) is obtained from Eq. (5) and the term [x ′ j R −1 y − 1µ ] from Eq. (4) (see previous Section).

Ratio of posterior probabilities and BayesC estimates of SNP effects
The LLR is combined with the log-prior-ratio, Log(π)− Log(1 − π) into the log-posterior-probability-ratio: The posterior probability of SNP j having a nonzero effect is now: To approximate the BayesC analysis, we remain within the realm of linear models and translate the posterior probabilities into individual variances of SNP effects, D j : The BayesC estimates of SNP effects, b c , can be obtained from a linear model with SNP weights proportional to D j by assuming In the BayesC analysis, the variance of the components s c becomes: And the BayesC estimates ŝ c are obtained from Henderson's MME: Unfortunately, the coefficient matrix of these MME is no longer diagonal. The size of matrix V ′ DV is the number of components squared, thus, as long as the number of components is not too large (e.g. < 10,000), computation of its inverse is reasonably easy. From ŝ c , the SNP effects can be obtained as: A more formal derivation of Eq. (7) is provided in "Appendix". The BayesC estimates of the breeding values of the animals are obtained as USŝ c . Note that the BayesC algorithm described above does not require iteration, which makes it computationally fast.

Analysis of a simulated WGS data
WGS data were simulated to demonstrate the calculations and evaluate their results. The simulated species had 20 chromosomes of 1 Morgan (10 8 bp) each. Simulation of WGS data followed the approach of [6], except that their scaling argument was not applied here, i.e. the computational costs were not scaled down. The historical effective population size was 1000, which also reflects its actual size since simulation of new generations followed Wright's idealized population structure. In order to create LD and mutation-drift equilibrium, the historical population was simulated for 10,000 generations. In the last of the 10,000 generations, population size was increased to 10,000 individuals, which represented the reference population. The per meiosis and per base pair mutation rate was 10 −8 , and mutations followed the infinite sites model. This resulted in 531,836 SNPs with minor allele frequencies (MAF) higher than 0.01 in the reference population, in which SNP effects were estimated. Per chromosome, 200 SNPs with MAF higher than 0.01 were randomly sampled as causative SNPs, i.e. 4000 causative SNPs in total. Genotypes were standardized to the values 2p j 1 − p j j for genotypes '0 0' , '0 1' and '1 1' , respectively, where p j is the frequency of allele 1, and collected in the genotype matrix X.
True genetic values of the animals were obtained as: where t is a (531,836 × 1) vector of SNP effects, which were sampled from a normal distribution for the 4000 causative SNPs and were set to 0 for non-causative SNPs; and α is a scaling factor which was chosen such that the variance of TBV in the reference population was 1. Phenotypes were obtained by adding random noise sampled from the distribution N (0, 1) to TBV, resulting in a heritability of 0.5. To estimate SNP effects, we used the phenotypes on the 10,000 animals in the reference population and their genotype matrix X, and applied SNP-BLUP Eq. (2), our deterministic BayesC method Eq. (7), or a, MCMC based BayesC algorithm [15]. For the BayesC analyses, it was assumed that π = 0.01 and that each causative SNP explained a proportion 0.001 of the total genetic variance. Heritabilities and genetic and environmental variances were assumed known for all analyses. We assumed that the estimates of SNP effects were used in later generations to predict EBV, thus 10 more generations were simulated by applying Wright's idealized population structure. The effective size in these 10 descending generations was reduced to 100 in order to increase genetic drift towards values that are realistic for livestock populations. In these generations, TBV were calculated using Eq. (8) and the correlation between TBV and estimates of the breeding values based on the estimated SNP effects was used as a measure of the accuracy of GS. The results were based on only four replicated simulations because the computational costs of these WGS data simulations and analyses were high, both in terms of computer time and disk space.

SVD of the simulated WGS data
The 'economy' version of SVD was conducted on the standardized genotypes matrices from each of the 20 chromosomes separately (average size 10,000 × 26,592 ), where 'economy' implies that components with singular values equal to zero were not computed. For all chromosomes, the 6000 largest singular values explained more than 95% of the total variance, i.e. Trace S 2 6000 > 0.95 * Trace S 2 , where S 6000 is the diagonal matrix with the 6000 largest singular values. Hence, the 6000 largest singular values were retained for each chromosome and, for each chromosome, we defined T 6000 = U 6000 * S 6000 . Next, an overall SVD was performed for T 120000 = T 6000(1) T 6000 (2) . . . T 6000(20) , where subscript (i) denotes chromosome i: where T 120000 has dimensions 10, 000 × 120, 000, and U 0 , S 0 and V 0 denote the SVD of T 120000 . The overall SVD of all genotypes can be obtained as: where with V 6000(i) denoting the V matrix from the SVD of each chromosome i. Figure 1 compares the posterior probabilities from the SVD-based BayesC analysis and the MCMC-based BayesC analysis. Because there are 4000 QTL, i.e. QTL are regularly distributed along the genome, the QTL positions are not indicated in Fig. 1. Generally, both analyses agreed on where the regions with the highest posterior probability are, but the SVD-based analysis showed much more

Results
pronounced QTL signals than the MCMC analysis. Thus, it appears that the assumption of the linear model involved in the SVD analysis makes it overconfident about some SNP positions. The MCMC analysis implicitly accounts for the mixture distribution of the model, which results in an increase in SNPs with small estimates of effects and a decrease in SNPs with large estimates, which agrees with the results of [8]. Also, the SVD-based posterior probabilities generally seem smaller than those from the MCMC analysis. The sum of the posterior probabilities for the MCMC and SVD-based analyses were 3884 and 67, respectively (result not shown elsewhere). Thus, the sum of the posterior probabilities of the MCMC-based analyses was much closer to the actual number of QTL, i.e. 4000. The profile of the accuracy of GS using WGS data over 10 generations of descendants is in Fig. 2. All three analyses started with about the same accuracy and showed a decline of accuracies as the time between the reference and validation population increased. The accuracies of the SNP-BLUP analysis tended to drop somewhat more during intermediate generations 2 to 5 compared to those of the SVD -BayesC analysis. From generation 5 onwards, the accuracy of all analyses dropped at similar rates. The MCMC-BayesC analysis yielded similar accuracies as the SVD-based analysis, but during the intermediate generations its accuracy was between that of SNP-BLUP and SVD-BayesC. The latter agrees with Fig. 1, where the SNP solutions of MCMC-BayesC are less skewed and thus more like SNP-BLUP solutions. In any case, the SVD-BayesC analysis appeared to be at least as accurate as the MCMC-BayesC analysis. Figure 3 shows the accuracy of GS using ~ 30 k SNPchip data. In this situation, the SNP-BLUP analysis was more accurate, except from generation 8 onwards, for which accuracies were similar across methods. Both BayesC analyses had similar accuracies. When the causative mutations were not included in the genotype data, accurate GS appeared to depend on the prediction of the effects of SNP haplotypes that segregate in the population. The SNP-BLUP method appeared to achieve this better, probably because it uses all genotyped SNPs in the region to define haplotypes. Over generations, the original haplotypes are broken down by recombination and, thus, the accuracies of all methods decreased. Figure 4 shows the estimates of regression coefficients of true on estimated breeding values over time. For unbiased prediction, these regression coefficients should be 1, but all methods showed some bias in the sense that the distribution of the estimated breeding values was too large (regression coefficients less than 1 shrink the GEBV). Although the SNP-BLUP method resulted in the smallest bias, it was also somewhat biased, possibly because the SNP effects were estimated in a dataset with few close relatives (due to the large effective size of the reference population). Thus, the genomic relationship matrix, G, was very similar to the residual covariance matrix, apart from some structure due to small genetic relationships and, thus, the SNP estimates could easily pick-up some covariances due to the randomness of residuals. This effect will be enhanced for the Bayesian variable selection models, which specifically search for SNP genotypes that correlate well with the phenotypes and thus also with residuals. In the case of the SNP-chip data, these biases were even larger due to imperfect LD between the SNPs and the QTL. Table 1 shows the computing times for the alternative GS models. The MCMC BayesC method required almost 2 days and, thus, is impractical, especially when the size of datasets exceeds 10,000 animals and half a million SNPs. SVD was performed by the Lapack library routines (http://www.netlib.org/lapack/), which provides parallel algorithms for this task (10 parallel processors were used here). The SVD was the most time-demanding step in SVD-BayesC, with 8.5 min per chromosome (note that this could be performed for all chromosomes in parallel) and 25 min for the overall SVD. The computing time of the SVD of a matrix with dimension (n × m ) increases proportionally to m * n 2 , where n is the smaller dimension (usually the number of animals) and m the larger dimension (usually the number of SNPs or the number of components involved in the overall SVD). Thus, at a constant number of animals, the chromosome-wise SVD would increase only linearly with the number of SNPs. Since the required number of components per chromosome is expected to increase only marginally as  the number of SNPs increases, computation time of the overall SVD should increase only marginally. Computing times for the overall SVD are expected to increase linearly with the number of chromosomes (assuming the chromosomes are of equal size and the number of components involved in the overall SVD exceeds the number of animals). The memory requirements of the SVD are 8 n 2 + (m + 1)n bytes, assuming double precision calculations, and thus increase in a quadratic manner with the smallest dimension and linearly with the largest dimension of the genotype matrix. Because of the large storage requirements of the results of the SVD of a large matrix, it may be beneficial to store SVD results in a compressed form, although we did not attempt this here.

Discussion
An SVD-based variable selection model was developed that is computationally fast, assuming that the SVD of the genotype matrix X was performed previously. It was shown that SVD of X facilitates the calculation of the PEV of SNP effects, which were used in a BayesC setting to calculate posterior probabilities of a QTL. Although the posterior probabilities were generally lower than those from a MCMC BayesC analysis, the accuracies of prediction were competitive compared to those of an MCMC analysis. The persistency of the accuracies across generations of descendants was tested in a worse-case scenario, i.e. without updating of the reference population. The persistency of accuracy over generations was similar for the SVDand MCMC-based predictions and slightly higher than for the SNP-BLUP analysis, at least during intermediate generations when using WGS data. When 30 K SNP-chip data were used, the MCMC and SVD-BayesC analyses also yielded similar results, but their accuracy of prediction was slightly lower than that of SNP-BLUP. Apparently, the number of SNPs with effects is so large that assuming a priori that all SNPs have effects, as in the SNP-BLUP model, is beneficial, a situation that was also described by Daetwyler et al. [16].
The difference between the posterior probabilities, PP j , from the SVDand the MCMC-based analyses may be due to the fact that the former assumes that the variance at locus j is known without error (resulting in Eq. (3) to predict the PEV of the SNP effect), whereas the MCMC analysis accounts for the uncertainty of the variance at locus j. This appears to result in stronger PP j signals for the SVD-based compared to the MCMCbased analysis in some chromosomal regions. When this is undesirable, higher values of π can be used to spread the genetic effects over more SNPs. Although 4000 QTL were randomly scattered across the genome, the regions with large PP j signals may have a high density of QTL. Nevertheless, since the MCMC analysis may be seen as the 'golden standard' , we consider the stronger QTL signals from SVD-based posterior probabilities to be anti-conservative.
We assumed that the value of π was known and equal to the number of QTL divided by the number of SNPs. This assumes that effectively only one SNP is needed to predict a QTL (although a number of SNPs with reduced PP j might actually pick up the QTL effect), which has been found to result in similar accuracies as using an optimized value of π [17]. However, in real data, the number of QTL is unknown but the optimal value of π can be found by cross-validation [18]. SVD-BayesC is well suited for such cross-validation computations, because of its computational speed and because the SVD of the genotype matrix needs to be performed only once. Note that the choice of π defines the variance of SNP effects σ 2 , based on equation N SNPs πσ 2 = σ 2 g , where N SNPs πσ 2 equals the total genetic variance assumed by the BayesC model, and σ 2 g the (assumed) genetic variance of the trait. The computational speed of the SVD-based BayesC analysis depended heavily on the computation of the SVD. We performed the SVD separately for each chromosome, but different situations may call for different strategies to perform the SVD. For instance, SVD(X) may have been already obtained for other reasons than the BayesC analysis, e.g. to perform large-scale national evaluations by predicting only the components in order to reduce computations (Ødegård, Indahl, Stranden, Meuwissen: Large-scale genomic prediction using singular value decomposition of the genotype matrix; Accepted in GSE). In the case of a real WGS analysis, the number of SNPs is often substantially larger than in our simulation, e.g. due to a very large effective population size in the distant past, which generated many SNPs that are still segregating. If the SVD of a chromosome is too large, it can be performed per chromosome segment instead of per chromosome, which is a straightforward extension of the SVD analysis by chromosome adopted here. In a subsequent study, we intend to perform the SVD on real WGS data.
In situations where the family structure is not strong (as was the case in our simulated data), the per chromosome components are approximately unrelated and the final SVD on the combined components (SVD(T 120000 ) ) can be omitted. In this case, SVD(X) = USV ′ , with U = U (1) U (2) . . . U (l) , S = diag S (i) , and V = diag V (i) , where subscript (i) denotes matrices from the per chromosome SVD of chromosome i; and diag V (i) denotes a block-diagonal matrix with the diagonal blocks containing the V (i) matrices. Since this SVD results only in approximately independent components, Eq. (3) holds only approximately. Whether this approximation is sufficiently accurate can be investigated by checking whether U ′ U ≈ I, and will depend on the family structure of the population. In an ultimate test, one can use Eq. (3) (knowing that it holds only approximately) and check the accuracy of the resulting BayesC analysis by cross-validation. In this analysis, Eqs. (6) and (7) can be used to estimate the SNP-BLUP marker effects [instead of Eq. (1)] in order to account for covariances between the components.

Conclusions
After performing the SVD, the BayesC analysis developed here is computationally fast and comparable to SNP-BLUP calculations, whereas its accuracy is competitive compared to that of MCMC-based BayesC analyses. It may also be noted that the SVD needs to be performed only once across all traits. Thus, when many traits need to be analysed, the computational costs of calculating SVD(X) are relatively small on a per trait basis. The profiles of the accuracies over generations showed that BayesC accuracies were slightly more persistent over an intermediate number of generations/meioses (2-5 generations) than the SNP-BLUP (or, equivalently, GBLUP) accuracies, which enables genomic predictions over longer genetic distances.