# Variable selection models for genomic selection using whole-genome sequence data and singular value decomposition

- Theo H. E. Meuwissen
^{1}Email author, - Ulf G. Indahl
^{1}and - Jørgen Ødegård
^{1, 2}View ORCID ID profile

**Received: **25 January 2017

**Accepted: **12 December 2017

**Published: **27 December 2017

## Abstract

### Background

Non-linear Bayesian genomic prediction models such as BayesA/B/C/R involve iteration and mostly Markov chain Monte Carlo (MCMC) algorithms, which are computationally expensive, especially when whole-genome sequence (WGS) data are analyzed. Singular value decomposition (SVD) of the genotype matrix can facilitate genomic prediction in large datasets, and can be used to estimate marker effects and their prediction error variances (PEV) in a computationally efficient manner. Here, we developed, implemented, and evaluated a direct, non-iterative method for the estimation of marker effects for the BayesC genomic prediction model.

### Methods

The BayesC model assumes a priori that markers have normally distributed effects with probability \( \uppi \) and no effect with probability (1 − \( \uppi \)). Marker effects and their PEV are estimated by using SVD and the posterior probability of the marker having a non-zero effect is calculated. These posterior probabilities are used to obtain marker-specific effect variances, which are subsequently used to approximate BayesC estimates of marker effects in a linear model. A computer simulation study was conducted to compare alternative genomic prediction methods, where a single reference generation was used to estimate marker effects, which were subsequently used for 10 generations of forward prediction, for which accuracies were evaluated.

### Results

SVD-based posterior probabilities of markers having non-zero effects were generally lower than MCMC-based posterior probabilities, but for some regions the opposite occurred, resulting in clear signals for QTL-rich regions. The accuracies of breeding values estimated using SVD- and MCMC-based BayesC analyses were similar across the 10 generations of forward prediction. For an intermediate number of generations (2 to 5) of forward prediction, accuracies obtained with the BayesC model tended to be slightly higher than accuracies obtained using the best linear unbiased prediction of SNP effects (SNP-BLUP model). When reducing marker density from WGS data to 30 K, SNP-BLUP tended to yield the highest accuracies, at least in the short term.

### Conclusions

Based on SVD of the genotype matrix, we developed a direct method for the calculation of BayesC estimates of marker effects. Although SVD- and MCMC-based marker effects differed slightly, their prediction accuracies were similar. Assuming that the SVD of the marker genotype matrix is already performed for other reasons (e.g. for SNP-BLUP), computation times for the BayesC predictions were comparable to those of SNP-BLUP.

## Background

Singular value decomposition (SVD) is commonly used in multi-variate statistics to study the explanatory variables and to decompose the design matrix into independent components [1]. In principal component regression (PCR), only the components with the largest singular values are fitted, i.e. PCR considers the smaller components as noise on the explanatory variables, \( {\mathbf{X}} \), and thus omits them [1]. This reduction in the number of components is essential in the so-called \( k > N \) problems, where the number of explanatory variables (\( k \)) exceeds the number of records (\( N \)), because it is not possible to estimate more than \( N \) effects from \( N \) records, unless random effects are assumed, in which case more than \( N \) effects can be predicted.

In genomic selection (GS), matrix \( {\mathbf{X}} \) contains the marker genotypes and the number of marker effects (\( k \)) can greatly exceed the number of phenotypic records, especially in the case of whole-genome sequence (WGS) data. In these cases, the \( k > N \) problem is tackled mainly by the use of prior information for the marker effects. For instance, the marker effects can be assumed to have a normal distribution, as in the single nucleotide polymorphism best linear unbiased prediction (SNP-BLUP) model, or they can be assumed to come from a mixture of two distributions with one of them having all its probability density at zero [2]. The latter model assumes with some prior probability \( \uppi \) that the marker effect comes from a prior distribution (e.g. the *t*-distribution in BayesB [3] or the normal distribution in BayesC [4]) and with probability (1 − \( \uppi \)) that the marker effect has no effect on the trait. These models are called variable selection models because they attempt to select the variables that affect the trait [5]. Especially in sequence data, this makes sense biologically, since the causal variates are expected to be contained in the sequence, among many non-causal variates [6]. For these models, straightforward application of PCR does not seem very sensible because all principle components are assumed to be affected by all variates, i.e. PCR does not reduce the number of genotypes involved in the prediction.

Computationally, variable selection models are mostly implemented by using Markov chain Monte Carlo (MCMC) algorithms [2–4], which make them computationally demanding and impractical for large-scale genomic prediction, especially when these involve WGS data. Although some non-MCMC approximations exist, they do not result in quite as accurate predictions as MCMC implementations of these models [7, 8]. Here we show that \( {\text{SVD}} \) can simplify the BayesC calculations significantly, make them not MCMC-based, and thus make the analysis of large amounts of WGS data possible in practice. We also compared the results of the \( {\text{SVD}} \)-based algorithm to those obtained using the MCMC approach. Although the \( {\text{SVD}} \) of large amounts of WGS data remains computationally a formidable task, in a companion paper (Ødegård, Indahl, Stranden, Meuwissen: Large-scale genomic prediction using singular value decomposition of the genotype matrix; Accepted in GSE) showed that this task can be performed per chromosome (segment) and thus in parallel.

## Methods

### Applying SVD to the SNP-BLUP model

### Application of the BayesC prior

The BayesC prior distribution is a mixture distribution [4]:

### The log-likelihood ratio of \( b_{j} \sim N\left( {0,\sigma^{2} } \right) \) versus \( b_{j} = 0 \)

### Ratio of posterior probabilities and BayesC estimates of SNP effects

*LLR*is combined with the log-prior-ratio, \( Log(\uppi) - Log(1 -\uppi) \) into the log-posterior-probability-ratio:

### Analysis of a simulated WGS data

WGS data were simulated to demonstrate the calculations and evaluate their results. The simulated species had 20 chromosomes of 1 Morgan (10^{8} bp) each. Simulation of WGS data followed the approach of [6], except that their scaling argument was not applied here, i.e. the computational costs were not scaled down. The historical effective population size was 1000, which also reflects its actual size since simulation of new generations followed Wright’s idealized population structure. In order to create LD and mutation-drift equilibrium, the historical population was simulated for 10,000 generations. In the last of the 10,000 generations, population size was increased to 10,000 individuals, which represented the reference population. The per meiosis and per base pair mutation rate was 10^{−8}, and mutations followed the infinite sites model. This resulted in 531,836 SNPs with minor allele frequencies (MAF) higher than 0.01 in the reference population, in which SNP effects were estimated. Per chromosome, 200 SNPs with MAF higher than 0.01 were randomly sampled as causative SNPs, i.e. 4000 causative SNPs in total. Genotypes were standardized to the values of \( - 2p_{j} /\surd \left( {2p_{j} \left( {1 - p_{j} } \right)} \right) \), \( \left( {1 - 2p_{j} } \right)/\surd \left( {2p_{j} \left( {1 - p_{j} } \right)} \right) \), and \( \left( {2 - 2p_{j} } \right)/\surd \left( {2p_{j} \left( {1 - p_{j} j} \right)} \right) \) for genotypes ‘0 0’, ‘0 1’ and ‘1 1’, respectively, where \( p_{j} \) is the frequency of allele 1, and collected in the genotype matrix \( {\mathbf{X}} \).

We assumed that the estimates of SNP effects were used in later generations to predict EBV, thus 10 more generations were simulated by applying Wright’s idealized population structure. The effective size in these 10 descending generations was reduced to 100 in order to increase genetic drift towards values that are realistic for livestock populations. In these generations, \( {\mathbf{TBV}} \) were calculated using Eq. (8) and the correlation between \( {\mathbf{TBV}} \) and estimates of the breeding values based on the estimated SNP effects was used as a measure of the accuracy of GS. The results were based on only four replicated simulations because the computational costs of these WGS data simulations and analyses were high, both in terms of computer time and disk space.

### SVD of the simulated WGS data

## Results

### Computing times

Computing times with a dual core Xeon(R) CPU E5-2620 v2 (2.10 GHz) machine with 24 processors

Task | Number of times required | Wall-time | Number of processors used | ||
---|---|---|---|---|---|

Days | Hours | Minutes | |||

MCMC_BayesC | 1 | 1 | 19 | 28 | 6 |

SVD (one chromosome) | 20 | 8.5 | 10 | ||

SVD (overall) | 1 | 25.1 | 10 | ||

SVD-based BayesC | 1 | 2.2 | 1 | ||

SVD-based SNPBLUP | 1 | 1.5 | 1 |

## Discussion

An \( {\text{SVD}} \)-based variable selection model was developed that is computationally fast, assuming that the \( {\text{SVD}} \) of the genotype matrix \( {\mathbf{X}} \) was performed previously. It was shown that \( {\text{SVD}} \) of \( {\mathbf{X}} \) facilitates the calculation of the PEV of SNP effects, which were used in a BayesC setting to calculate posterior probabilities of a QTL. Although the posterior probabilities were generally lower than those from a MCMC BayesC analysis, the accuracies of prediction were competitive compared to those of an MCMC analysis. The persistency of the accuracies across generations of descendants was tested in a worse-case scenario, i.e. without updating of the reference population. The persistency of accuracy over generations was similar for the \( {\text{SVD}} \)- and MCMC-based predictions and slightly higher than for the SNP-BLUP analysis, at least during intermediate generations when using WGS data.

When 30 K SNP-chip data were used, the MCMC and \( {\text{SVD}} \)-BayesC analyses also yielded similar results, but their accuracy of prediction was slightly lower than that of SNP-BLUP. Apparently, the number of SNPs with effects is so large that assuming a priori that all SNPs have effects, as in the SNP-BLUP model, is beneficial, a situation that was also described by Daetwyler et al. [16].

The difference between the posterior probabilities, \( PP_{j} \), from the \( {\text{SVD}} \)- and the MCMC-based analyses may be due to the fact that the former assumes that the variance at locus \( j \) is known without error (resulting in Eq. (3) to predict the PEV of the SNP effect), whereas the MCMC analysis accounts for the uncertainty of the variance at locus \( j \). This appears to result in stronger \( PP_{j} \) signals for the \( {\text{SVD}} \)-based compared to the MCMC-based analysis in some chromosomal regions. When this is undesirable, higher values of \( \uppi \) can be used to spread the genetic effects over more SNPs. Although 4000 QTL were randomly scattered across the genome, the regions with large \( PP_{j} \) signals may have a high density of QTL. Nevertheless, since the MCMC analysis may be seen as the ‘golden standard’, we consider the stronger QTL signals from \( {\text{SVD}} \)-based posterior probabilities to be anti-conservative.

We assumed that the value of \( \uppi \) was known and equal to the number of QTL divided by the number of SNPs. This assumes that effectively only one SNP is needed to predict a QTL (although a number of SNPs with reduced \( PP_{j} \) might actually pick up the QTL effect), which has been found to result in similar accuracies as using an optimized value of \( \uppi \) [17]. However, in real data, the number of QTL is unknown but the optimal value of \( \uppi \) can be found by cross-validation [18]. \( {\text{SVD}} \)-BayesC is well suited for such cross-validation computations, because of its computational speed and because the \( {\text{SVD}} \) of the genotype matrix needs to be performed only once. Note that the choice of \( \uppi \) defines the variance of SNP effects \( \sigma^{2} \), based on equation \( N_{SNPs}\uppi\sigma^{2} = \sigma_{g}^{2} \), where \( N_{SNPs}\uppi\sigma^{2} \) equals the total genetic variance assumed by the BayesC model, and \( \sigma_{g}^{2} \) the (assumed) genetic variance of the trait.

The computational speed of the \( {\text{SVD}} \)-based BayesC analysis depended heavily on the computation of the \( {\text{SVD}} \). We performed the \( {\text{SVD}} \) separately for each chromosome, but different situations may call for different strategies to perform the \( {\text{SVD}} \). For instance, \( {\text{SVD}}\left( {\mathbf{X}} \right) \) may have been already obtained for other reasons than the BayesC analysis, e.g. to perform large-scale national evaluations by predicting only the components in order to reduce computations (Ødegård, Indahl, Stranden, Meuwissen: Large-scale genomic prediction using singular value decomposition of the genotype matrix; Accepted in GSE). In the case of a real WGS analysis, the number of SNPs is often substantially larger than in our simulation, e.g. due to a very large effective population size in the distant past, which generated many SNPs that are still segregating. If the SVD of a chromosome is too large, it can be performed per chromosome segment instead of per chromosome, which is a straightforward extension of the \( {\text{SVD}} \) analysis by chromosome adopted here. In a subsequent study, we intend to perform the \( {\text{SVD}} \) on real WGS data.

In situations where the family structure is not strong (as was the case in our simulated data), the per chromosome components are approximately unrelated and the final \( {\text{SVD}} \) on the combined components (\( {\text{SVD}}\left( {{\mathbf{T}}_{120000} } \right) \)) can be omitted. In this case, \( {\text{SVD}}\left( {\mathbf{X}} \right) = {\mathbf{USV}}^\prime \) , with \( {\mathbf{U}} = \left[ {{\mathbf{U}}_{\left( 1 \right)} {\mathbf{U}}_{\left( 2 \right)} \ldots {\mathbf{U}}_{\left( l \right)} } \right] \), \( {\mathbf{S}} = {\text{diag}}\left( {{\mathbf{S}}_{\left( i \right)} } \right) \), and \( {\mathbf{V}} = {\text{diag}}\left( {{\mathbf{V}}_{\left( i \right)} } \right) \), where subscript (\( i \)) denotes matrices from the per chromosome \( {\text{SVD}} \) of chromosome \( i \); and \( {\text{diag}}\left( {{\mathbf{V}}_{\left( i \right)} } \right) \) denotes a block-diagonal matrix with the diagonal blocks containing the \( {\mathbf{V}}_{\left( i \right)} \) matrices. Since this \( {\text{SVD}} \) results only in approximately independent components, Eq. (3) holds only approximately. Whether this approximation is sufficiently accurate can be investigated by checking whether \( {\mathbf{U}}^{{\prime }} {\mathbf{U}} \approx {\mathbf{I}} \), and will depend on the family structure of the population. In an ultimate test, one can use Eq. (3) (knowing that it holds only approximately) and check the accuracy of the resulting BayesC analysis by cross-validation. In this analysis, Eqs. (6) and (7) can be used to estimate the SNP-BLUP marker effects [instead of Eq. (1)] in order to account for covariances between the components.

## Conclusions

After performing the \( {\text{SVD}} \), the BayesC analysis developed here is computationally fast and comparable to SNP-BLUP calculations, whereas its accuracy is competitive compared to that of MCMC-based BayesC analyses. It may also be noted that the \( {\text{SVD}} \) needs to be performed only once across all traits. Thus, when many traits need to be analysed, the computational costs of calculating \( {\text{SVD}}\left( {\mathbf{X}} \right) \) are relatively small on a per trait basis. The profiles of the accuracies over generations showed that BayesC accuracies were slightly more persistent over an intermediate number of generations/meioses (2–5 generations) than the SNP-BLUP (or, equivalently, GBLUP) accuracies, which enables genomic predictions over longer genetic distances.

## Declarations

### Authors’ contributions

THEM performed analyses and wrote manuscript. UI improved computational strategies. JO helped design the analyses, and the computational approach, and helped write the manuscript. All authors read and approved the final manuscript.

### Acknowledgements

We are very grateful for the helpful comments of two anonymous reviewers. Funding from the Norwegian Research Council is gratefully acknowledged.

### Competing interests

The authors declare that they have no competing interests.

### Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Open Access**This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

## Authors’ Affiliations

## References

- Martens H, Naes T. Multivariate calibration. New York: Wiley; 1989.Google Scholar
- Meuwissen T, Hayes B, Goddard M. Prediction of total genetic value using genome-wide dense marker maps. Genetics. 2001;157:1819–29.PubMedPubMed CentralGoogle Scholar
- Yi N, Xu S. Bayesian LASSO for quantitative trait loci mapping. Genetics. 2008;179:1045–55.View ArticlePubMedPubMed CentralGoogle Scholar
- Habier D, Fernando RL, Kizilkaya K, Garrick DJ. Extension of the Bayesian alphabet for genomic selection. BMC Bioinformatics. 2011;12:186.View ArticlePubMedPubMed CentralGoogle Scholar
- Verbyla KL, Hayes BJ, Bowman PJ, Goddard ME. Accuracy of genomic selection using stochastic search variable selection in Australian Holstein Friesian dairy cattle. Genet Res. 2009;91:307–11.View ArticleGoogle Scholar
- Meuwissen T, Goddard M. Accurate prediction of genetic values for complex traits by whole-genome resequencing. Genetics. 2010;185:623–31.View ArticlePubMedPubMed CentralGoogle Scholar
- Meuwissen TH, Solberg TR, Shepherd R, Woolliams JA. A fast algorithm for BayesB type of prediction of genome-wide estimates of genetic value. Genet Sel Evol. 2009;41:2.View ArticlePubMedPubMed CentralGoogle Scholar
- Wang T, Chen PY, Goddard ME, Meuwissen TH, Kemper KE, Hayes BJ. A computationally efficient algorithm for genomic prediction using a Bayesian model. Genet Sel Evol. 2014;47:34.View ArticleGoogle Scholar
- Cuevas J, Perez-Elizalde S, Soberanis V, Perez-Rodriguez P, Gianola D, Crossa J. Bayesian genomic-enabled prediction as an inverse problem. G3 (Bethesda). 2014;4:1991–2001.View ArticleGoogle Scholar
- Perez-Elizalde S, Cuevas J, Perez-Rodriguez P, Crossa J. Selection of the bandwidth parameter in a Bayesian kernel regression model for genomic-enabled prediction. J Agric Biol Environ Stat. 2015;20:512–32.View ArticleGoogle Scholar
- VanRaden P. Efficient methods to compute genomic predictions. J Dairy Sci. 2008;91:4414–23.View ArticlePubMedGoogle Scholar
- Goddard ME, Hayes BJ, Meuwissen TH. Using the genomic relationship matrix to predict the accuracy of genomic selection. J Anim Breed Genet. 2011;128:409–21.View ArticlePubMedGoogle Scholar
- Henderson CR. Applications of linear models in animal breeding. Guelph: University of Guelph; 1984.Google Scholar
- Daetwyler HD, Calus MPL, Pong-Wong R, de los Campos G, Hickey JM. Genomic prediction in animals and plants: simulation of data, validation, reporting, and benchmarking. Genetics. 2013;193:347–65.View ArticlePubMedPubMed CentralGoogle Scholar
- Luan T, Woolliams JA, Lien S, Kent M, Svendsen M, Meuwissen TH. The accuracy of genomic selection in Norwegian Red cattle assessed by cross-validation. Genetics. 2009;183:1119–26.View ArticlePubMedPubMed CentralGoogle Scholar
- Daetwyler HD, Pong-Wong R, Villanueva B, Woolliams JA. The impact of genetic architecture on genome-wide evaluation methods. Genetics. 2010;185:1021–31.View ArticlePubMedPubMed CentralGoogle Scholar
- Yu X, Meuwissen TH. Using the Pareto principle in genome-wide breeding value estimation. Genet Sel Evol. 2011;43:35.View ArticlePubMedPubMed CentralGoogle Scholar
- Legarra A, Robert-Granié C, Manfredi E, Elsen JM. Performance of genomic selection in mice. Genetics. 2008;180:611–8.View ArticlePubMedPubMed CentralGoogle Scholar