- Research Article
- Open Access

# Large-scale genomic prediction using singular value decomposition of the genotype matrix

Jørgen Ødegård^{1}, Ulf Indahl^{2}, Ismo Strandén^{3} and Theo H. E. Meuwissen^{2}

**Received:** 25 January 2017 **Accepted:** 12 January 2018 **Published:** 28 February 2018

## Abstract

### Background

For marker effect models and genomic animal models, computational requirements increase with the number of loci and the number of genotyped individuals, respectively. In the latter case, the inverse genomic relationship matrix (GRM) is typically needed, which is computationally demanding to compute for large datasets. Thus, there is a great need for dimensionality-reduction methods that can analyze massive genomic data. For this purpose, we developed reduced-dimension singular value decomposition (SVD) based models for genomic prediction.

### Methods

Fast SVD is performed by analyzing different chromosomes/genome segments in parallel and/or by restricting SVD to a limited core of genotyped individuals, producing chromosome- or segment-specific principal components (PC). Given a limited effective population size, nearly all the genetic variation can be effectively captured by a limited number of PC. Genomic prediction can then be performed either by PC ridge regression (PCRR) or by genomic animal models using an inverse GRM computed from the chosen PC (PCIG). In the latter case, computation of the inverse GRM will be feasible for any number of genotyped individuals and can be readily produced row- or element-wise.

### Results

Using simulated data, we show that PCRR and PCIG models, using chromosome-wise SVD of a core sample of individuals, are appropriate for genomic prediction in a larger population and result in predicted breeding values that are virtually identical to those from the original full-dimension genomic model (r = 1.000). Compared with other algorithms (e.g. algorithm for proven and young animals, APY), the (chromosome-wise SVD-based) PCRR and PCIG models were more robust to the size of the core sample, giving nearly identical results even down to 500 core individuals. The method was also successfully tested on a large multi-breed dataset.

### Conclusions

SVD can be used for dimensionality reduction of large genomic datasets. After SVD, genomic prediction using dense genomic data and many genotyped individuals can be done in a computationally efficient manner. Using this method, the resulting genomic estimated breeding values were virtually identical to those computed from a full-dimension genomic model.

## Background

In recent years, genomic prediction [1] has revolutionized animal and plant breeding methods. With decreasing genotyping costs, the number of genotyped individuals has increased exponentially over the years, with genomic information up to full sequence available for prediction. Genomic prediction can be performed using two families of genomic models: marker effects models (MEM) (e.g. SNP-best linear unbiased prediction (BLUP), BayesA, BayesB, BayesC, etc.), and animal models that use a genomic relationship matrix (GRM). The latter can be further divided into genomic models that include genotyped animals only (genomic BLUP, i.e. GBLUP) and single-step GBLUP (ssGBLUP) models [2, 3] that combine genotyped and ungenotyped animals. The advantage of genomic animal models is that they fit nicely within the traditional linear model framework, and can essentially be adapted to any kind of linear or generalized linear animal model (single-trait, multi-trait, random regression, etc.).

However, with the increasing number of genotyped individuals and increasing density of genotypes, the computational requirements of genomic prediction models increase accordingly. Hence, MEM analysis of full sequence data, e.g. using Bayesian variable selection models, will be very demanding in terms of computing time. For ssGBLUP [2, 3], the inverse of the GRM is computed prior to analysis, which may be practically impossible when the number of genotyped animals becomes very large (e.g. > 100,000). To address the latter, Misztal et al. [4] proposed the “algorithm for proven and young animals” (APY), which uses a core sample of individuals to compute an approximate inverse of the GRM for all animals. However, in some cases, the total GRM does not have full rank, and thus no inverse. Therefore, Fernando et al. [5] suggested exact methods to obtain ssGBLUP solutions. One of the options that they proposed was to model animal genetic effects as linear combinations of independent factors. In the following section, we propose a related strategy that applies singular value decomposition (SVD) to perform large-scale genomic evaluation, both for MEM and animal genomic models. Thus, our study aims at: (1) using SVD and principal component (PC) ridge regression (PCRR) for genomic prediction as an alternative to MEM, using up to full sequence genomic data, and (2) applying SVD techniques for computation of exact inverses of PC-based GRM, using dimensionality reduction.

## Methods

### Marker effect models

### GBLUP

*N* real numbers: \({\mathbf{z^{\prime}Gz}} = \frac{1}{\rho }{\mathbf{z^{\prime}XX^{\prime}z}} = \frac{1}{\rho }{\mathbf{u^{\prime}u}} \ge 0\) \(\left( {{\mathbf{u}} = {\mathbf{X^{\prime}z}}} \right)\). We defined an approximated GRM: \({\tilde{\mathbf{G}}} = \left( {{\mathbf{G}} + {\mathbf{I}}\theta } \right) = \frac{1}{\rho }\left( {{\mathbf{XX^{\prime}}} + {\mathbf{I}}\rho \theta } \right)\), where \(\theta\) is a small number (e.g. 10^{−3}). The matrix \({\tilde{\mathbf{G}}}\) is positive definite, and thus invertible, as: \({\mathbf{z}^{\prime}\tilde{\mathbf{G}}\mathbf{z}} = \frac{1}{\rho } \cdot {\mathbf{u^{\prime}u}} + \theta \cdot {\mathbf{z^{\prime}z}} > 0\). Adding \(\theta\) to the GRM diagonal elements has a negligible effect on the solutions and may be viewed as fitting a (tiny) fraction of the residual as a part of the additive genetic effects, and thus is essentially equivalent to the original GBLUP model. Although \({\tilde{\mathbf{G}}}^{ - 1}\) exists, computing it by direct “brute-force” inversion will be increasingly challenging, and eventually impossible, as the number of genotyped individuals increases (e.g. for \(N\) > 100,000). Another option is to specify the equation system as [9]:

### Principal component ridge regression (PCRR)

Hence, \({\mathbf{VV}^{\prime}\hat{\mathbf{b}}} = {\hat{\mathbf{b}}}\), although \({\mathbf{VV^{\prime}}} \ne {\mathbf{I}}\).

In this system of equations, there are (at most) \(N\) independent effects to be estimated, rather than \(k\) effects (number of loci), and both \({\mathbf{S}}^{2}\) and **I** are diagonal matrices. Hence, the entire left-hand side of the BLUP equation system is diagonal, with diagonal elements \(\left( {S_{ii}^{2} + \lambda } \right)\). This equation system is extremely easy to solve, even for very large \({\mathbf{y}}\) and many genotypes and animals. The main challenge thus lies in performing SVD of matrix \({\mathbf{X}}\).
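The diagonal structure described above can be illustrated with a minimal Python/NumPy sketch. The data, sizes, random seed, allele frequencies and the value of \(\lambda\) are all hypothetical, and the variable names are chosen for illustration only. The sketch assumes the score matrix is \({\mathbf{T}} = {\mathbf{US}}\) from an economy-size SVD \({\mathbf{X}} = {\mathbf{USV^{\prime}}}\), and compares the element-wise PCRR solution with a full-dimension ridge regression on all \(k\) loci.

```python
import numpy as np

rng = np.random.default_rng(1)
N, k, lam = 6, 40, 1.0            # hypothetical: animals, loci, ridge parameter

# Hypothetical genotypes (allele counts 0/1/2), standardized per locus
p = rng.uniform(0.2, 0.8, size=k)                 # assumed allele frequencies
M = rng.binomial(2, p, size=(N, k)).astype(float)
X = (M - 2 * p) / np.sqrt(2 * p * (1 - p))
y = rng.normal(size=N)            # hypothetical phenotypes, no fixed effects

# Economy-size SVD: X = U S V'
U, s, Vt = np.linalg.svd(X, full_matrices=False)
T = U * s                         # scores T = U S, so that T'T = S^2 (diagonal)

# PCRR: the left-hand side is diagonal, so the system is solved element-wise
s_hat = (T.T @ y) / (s**2 + lam)
g_pcrr = T @ s_hat

# Full-dimension marker model for comparison: (X'X + lam I) b = X'y
b_hat = np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)
g_snp = X @ b_hat

print(np.max(np.abs(g_pcrr - g_snp)))   # difference in genomic values
```

In this sketch only \(N\) effects are estimated instead of \(k\), and the estimated marker effects are unchanged by projection onto the PC space even though \({\mathbf{VV^{\prime}}}\) is not an identity matrix.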

### Performing large-scale SVD analyses on genomic data

Now, \({\hat{\mathbf{s}}} = {\mathbf{V}}_{nq}^{\prime} {\hat{\mathbf{b}}}\) (i.e. \({\mathbf{V}}_{nq}\) has replaced \({\mathbf{V}}_{q}\) from the entire population). Note that \({\mathbf{C}}^{\prime}{\mathbf{C}}\) is not a diagonal matrix. The dimension of this equation system (genomic effects) is the number of chosen components (based on the core sample), \(q\) \(\left( { \le n} \right)\). Hence, given that an SVD can be performed on the \(n \times k\) genomic dataset of the core sample, a direct solution to the (maximum) \(n \times n\) PCRR equation system would be straightforward.

Now: \({\mathbf{X}} \approx {\mathbf{C}}{\mathbf{V}}_{nq}^{\prime} = {\hat{\mathbf{T}}}{\mathbf{V}}_{{\mathbf{C}}}^{\prime} {\mathbf{V}}_{nq}^{\prime} = {\hat{\mathbf{T}}}{\hat{\mathbf{V}}}^{\prime}\).

Matrix \({\mathbf{C}}\) can be used directly in PCRR. Assume that the five individuals have the following phenotypes (assuming no fixed effects): \({\mathbf{y}} = \left[ {\begin{array}{*{20}c} { - \,0.5} \\ { - \,0.5} \\ {0.0} \\ {1.0} \\ { - \,0.7} \\ \end{array} } \right]\) and \(\lambda = 1\). Then, solving the equation system: \(\left[ {{\mathbf{C^{\prime}C}} + \lambda {\mathbf{I}}} \right]{\hat{\mathbf{s}}} = {\mathbf{C}^{\prime}\mathbf{y}}\), yields \({\hat{\mathbf{s}}} = \left[ {\begin{array}{*{20}c} { - \,0.111} \\ { - \,0.497} \\ {0.025} \\ \end{array} } \right]\) and \({\hat{\mathbf{g}}} = {\mathbf{C}}{\hat{\mathbf{s}}} = \left[ {\begin{array}{*{20}c} { - \,0.522} \\ { - \,0.222} \\ {0.222} \\ {0.789} \\ { - \,0.522} \\ \end{array} } \right]\).

Note that \({\mathbf{C}}'{\mathbf{C}}\) is not a diagonal matrix. Alternatively, a second-stage SVD can be performed, giving \({\mathbf{C}} = {\mathbf{U}}_{{\mathbf{C}}} {\mathbf{S}}_{{\mathbf{C}}} {\mathbf{V}}_{{\mathbf{C}}}^{'} = {\hat{\mathbf{T}}\mathbf{V}}_{{\mathbf{C}}}^{\prime}\). Now: \({\hat{\mathbf{T}}} = {\mathbf{U}}_{{\mathbf{C}}} {\mathbf{S}}_{{\mathbf{C}}} = \left[ {\begin{array}{*{20}c} {0.00} & {1.29} & { - \,0.57} & {0.00} \\ {2.00} & {0.00} & {0.00} & {0.00} \\ { - \,2.00} & {0.00} & {0.00} & {0.00} \\ {0.00} & { - \,1.29} & { - \,1.15} & {0.00} \\ {0.00} & {1.29} & { - \,0.57} & {0.00} \\ \end{array} } \right]\).

Here, \({\hat{\mathbf{T}}}^{\prime}{\hat{\mathbf{T}}} = {\mathbf{S}}_{{\mathbf{C}}}^{2}\) (diagonal) and solving the equation system \(\left[ {{\mathbf{S}}_{{\mathbf{C}}}^{2} + \lambda {\mathbf{I}}} \right]{\hat{\mathbf{t}}} = {\hat{\mathbf{T}}}^{\prime} {\mathbf{y}}\) yields \({\hat{\mathbf{t}}} = \left[ {\begin{array}{*{20}c} { - \,0.111} \\ { - \,0.473} \\ { - \,0.154} \\ \end{array} } \right]\) (with signs matching the columns of \({\hat{\mathbf{T}}}\) as printed), and \({\hat{\mathbf{g}}} = {\hat{\mathbf{T}}\hat{\mathbf{t}}} = \left[{\begin{array}{*{20}c} { - \,0.522} \\ { - \,0.222} \\ {0.222} \\ {0.789} \\ { - \,0.522} \\ \end{array} } \right]\), i.e. exactly the same animal solutions as above.
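The worked example can be checked numerically with a short Python/NumPy sketch. It uses the three non-zero columns of \({\hat{\mathbf{T}}}\) as printed (rounded to two decimals), the phenotypes \({\mathbf{y}}\) and \(\lambda = 1\). The variable names are illustrative. Because the printed entries are rounded, the result is expected to agree with \({\hat{\mathbf{g}}}\) only to about two decimals.

```python
import numpy as np

# Non-zero columns of the printed score matrix T-hat (rounded values)
T = np.array([
    [ 0.00,  1.29, -0.57],
    [ 2.00,  0.00,  0.00],
    [-2.00,  0.00,  0.00],
    [ 0.00, -1.29, -1.15],
    [ 0.00,  1.29, -0.57],
])
y = np.array([-0.5, -0.5, 0.0, 1.0, -0.7])
lam = 1.0

# Diagonal of T'T, i.e. the squared singular values (close to 8, 5 and 2)
s2 = np.sum(T**2, axis=0)

# Element-wise solution of [S_C^2 + lam I] t = T'y, then animal solutions
t_hat = (T.T @ y) / (s2 + lam)
g_hat = T @ t_hat

print(np.round(t_hat, 3))
print(np.round(g_hat, 3))
```

The signs of the individual elements of \({\hat{\mathbf{t}}}\) depend on the sign convention of the singular vectors, whereas \({\hat{\mathbf{g}}}\) does not.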

### Performing SVD in parallel on genome segments

As above, the approximated score matrix \({\hat{\mathbf{T}}}\) of \({\mathbf{X}}\) can be computed in three steps: (1) perform chromosome-wise SVD on a core sample of genomic data for each chromosome (same core individuals for all chromosomes); (2) compute chromosome-specific reduced rank \({\mathbf{C}}_{i} = {\mathbf{X}}_{i} {\mathbf{V}}_{inq}\) for all individuals (core and non-core) and concatenate these into \({\mathbf{C}} = \left[ {\begin{array}{*{20}c} {{\mathbf{C}}_{1} } & {{\mathbf{C}}_{2} } & \ldots & {{\mathbf{C}}_{c} } \\ \end{array} } \right]\); and (3) perform SVD of \({\mathbf{C}} = {\mathbf{U}}_{{\mathbf{C}}} {\mathbf{S}}_{{\mathbf{C}}} {\mathbf{V}}_{{\mathbf{C}}}^{\prime}\) and compute the reduced dimension score matrix \({\hat{\mathbf{T}}} = {\mathbf{U}}_{{\mathbf{C}}} {\mathbf{S}}_{{\mathbf{C}}}\) (without further rank reduction).

The model and equation system are then as described above (Eqs. 16 and 17). As above, matrix \({\mathbf{C}}\) can also be used directly in PCRR, although the mixed model coefficient matrix may be dense (but of reduced dimensionality).
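The three steps can be sketched in Python/NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation (their scripts are in Julia). The low-rank simulated genotypes stand in for a population with limited \(N_{e}\). All sizes, the seed, the choice of core animals, the 99% cut-off and the value of \(\lambda\) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)
N, n, c, k_chr, rank = 60, 20, 3, 50, 8   # hypothetical sizes
lam = float(c * k_chr)                    # hypothetical ridge parameter

# Hypothetical low-rank genotype matrices, one per chromosome
X_chr = [rng.normal(size=(N, rank)) @ rng.normal(size=(rank, k_chr)) / np.sqrt(rank)
         for _ in range(c)]
X = np.hstack(X_chr)
b = rng.normal(scale=np.sqrt(1.0 / X.shape[1]), size=X.shape[1])
y = X @ b + rng.normal(size=N)            # hypothetical phenotypes
core = np.arange(n)                       # same core individuals for all chromosomes

C_blocks = []
for Xi in X_chr:
    # (1) SVD of the core sample for this chromosome
    _, s, Vt = np.linalg.svd(Xi[core], full_matrices=False)
    explained = np.cumsum(s**2) / np.sum(s**2)
    q_i = min(int(np.searchsorted(explained, 0.99, side="right")) + 1, len(s))
    # (2) reduced-rank C_i = X_i V_inq for all individuals (core and non-core)
    C_blocks.append(Xi @ Vt[:q_i].T)
C = np.hstack(C_blocks)

# (3) second-stage SVD of C and score matrix T = U_C S_C
U_C, s_C, _ = np.linalg.svd(C, full_matrices=False)
T = U_C * s_C

# PCRR with a diagonal left-hand side
g_hat = T @ ((T.T @ y) / (s_C**2 + lam))

# Same model with the dense system in C, and the full-dimension model in X
g_dense = C @ np.linalg.solve(C.T @ C + lam * np.eye(C.shape[1]), C.T @ y)
g_full = X @ np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
r = np.corrcoef(g_hat, g_full)[0, 1]
print(C.shape[1], round(r, 4))
```

Using \({\mathbf{C}}\) directly gives the same solutions as using \({\hat{\mathbf{T}}}\). The second-stage SVD only makes the coefficient matrix diagonal.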

For each chromosome, the effective number of segregating loci is much smaller than for the whole genome, implying that fewer PC (\(< n\)) will be needed per chromosome than for the whole genome. The total number of chosen PC (at most \(n \times c\), where \(c\) is the number of chromosomes) is \(\sum q_{i}\), where \(q_{i}\) is the number of chosen PC for chromosome \(i\). Still, since SVD of the core sample genomic data is performed chromosome-wise, the final number of chosen PC may potentially exceed the number of animals in the core subpopulation. This implies that genetic variation of the core and non-core subpopulations is assumed to be explained by a limited number of common components (i.e. haplotype blocks), and that the number of components that segregate in the core may be larger than the number of core individuals. In contrast, the APY algorithm assumes that all genetic variation is explained by the additive genetic effects of the core individuals, rather than by the haplotype blocks that segregate among those individuals.

### Principal component based algorithm for inverting the GRM (PCIG)

Single-step genomic analyses are widely used in the analysis of real data. As mentioned earlier, the (original) single-step equation system requires the inverse of the GRM \(\left( {{\mathbf{G}}^{ - 1} } \right)\) to be computed prior to analysis. If inversion is done by “brute force”, large-scale analyses that potentially include millions of genotyped animals will be virtually impossible to perform. However, in the following section we describe how the GRM for such data can be effectively approximated through SVD techniques and how the exact inverse of an approximated GRM can be obtained.

^{−3}) to ensure that \({\tilde{\mathbf{G}}}\) is positive definite and thus can be inverted. Using the Woodbury formula [10], the exact inverse of \({\tilde{\mathbf{G}}}\) is:

Thus, the only explicit inverse needed here is \(\left( {{\mathbf{C^{\prime}C}} + {\mathbf{I}}_{{\mathbf{p}}} \rho \theta } \right)^{ - 1}\), which is of full rank and has dimension \(\sum q_{i}\). For example, \(\sum q_{i} \le\) 10,000 components may be sufficient to describe essentially all genetic variation, even for a large genotyped population if it has limited \(N_{e}\). Under these assumptions, an inverse of GRM can be computed for any number of genotyped individuals.
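As a hedged illustration, the Python/NumPy sketch below assumes that the approximated GRM has the form \({\tilde{\mathbf{G}}} = \frac{1}{\rho }\left( {{\mathbf{CC^{\prime}}} + {\mathbf{I}}\rho \theta } \right)\), i.e. the definition given earlier with \({\mathbf{XX^{\prime}}}\) replaced by \({\mathbf{CC^{\prime}}}\). Under that assumption, the Woodbury identity gives \({\tilde{\mathbf{G}}}^{ - 1} = \frac{1}{\theta }\left[ {{\mathbf{I}} - {\mathbf{C}}\left( {{\mathbf{C^{\prime}C}} + {\mathbf{I}}\rho \theta } \right)^{ - 1} {\mathbf{C^{\prime}}}} \right]\). The matrix \({\mathbf{C}}\), the sizes and the values of \(\rho\) and \(\theta\) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
N, q = 200, 15                  # hypothetical: genotyped animals, chosen PC
rho, theta = 500.0, 1e-3        # hypothetical scaling factor and diagonal constant

C = rng.normal(size=(N, q))     # stand-in for the reduced-dimension matrix C

# Assumed form of the approximated GRM: (C C' + I rho theta) / rho
G_tilde = (C @ C.T + np.eye(N) * rho * theta) / rho

# The only explicit inverse is q x q
M = np.linalg.inv(C.T @ C + np.eye(q) * rho * theta)

# Woodbury-based inverse of the N x N matrix
G_inv = (np.eye(N) - C @ M @ C.T) / theta

# Check against the definition of an inverse
err = np.max(np.abs(G_inv @ G_tilde - np.eye(N)))
print(err)
```

The cost of the explicit inversion depends on \(q\) only, which is why the size of the genotyped population does not limit the approach.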

### QR-based algorithm for inverting GRM (QRIG)

### Weighted genomic relationship matrix

Using this method, a weighted genomic relationship matrix can be used even for single-step animal models.

### Simulation study

^{−8} and mutations followed the infinite sites model [18]. After the initial 10,000 generations, \(N_{e}\) was reduced to 100 over 10 generations to mimic a livestock population. In the last generation, 10,000 animals were generated and their genotypes and phenotypes were used in genetic analysis. The total number of segregating loci in generation 10,000 was 531,836, of which about half (279,504) were still segregating in the last generation (generation 10,010). Per chromosome, 200 SNPs with a minor allele frequency higher than 0.01 were randomly sampled as causative SNPs, i.e. 4000 causative SNPs in total. Genotypes were standardized to \(\frac{{ - 2p_{j} }}{{\sqrt {2p_{j} \left( {1 - p_{j} } \right)} }}\), \(\frac{{1 - 2p_{j} }}{{\sqrt {2p_{j} \left( {1 - p_{j} } \right)} }}\) and \(\frac{{2 - 2p_{j} }}{{\sqrt {2p_{j} \left( {1 - p_{j} } \right)} }}\) for the genotypes ‘0 0’, ‘0 1’ and ‘1 1’, respectively, where \(p_{j}\) is the frequency of the ‘1’ allele, and collected in the genotype matrix \({\mathbf{X}}\). True genetic values of the animals were obtained as:

- (1) Ordinary GBLUP
  $${\mathbf{y}} = {\mathbf{1}}\mu + {\mathbf{Zg}} + {\mathbf{e}},$$
  $${\mathbf{g}} \sim N\left( {{\mathbf{0}},{\mathbf{G}}\sigma_{g}^{2} } \right).$$
- (2) Reduced-rank PCRR (chromosome-wise SVD)
  $${\mathbf{y}} = {\mathbf{1}}\mu + {\hat{\mathbf{T}}\mathbf{s}} + {\mathbf{e}},$$
  $${\mathbf{s}} \sim N\left( {{\mathbf{0}},{\mathbf{I}}\sigma_{m}^{2} } \right),$$
  $${\mathbf{g}} = {\hat{\mathbf{T}}\mathbf{s}}.$$
- (3) GBLUP using reduced-dimension approximations of GRM
  - a. Chromosome-wise SVD (PCIG-C)
  - b. Genome-wide SVD (PCIG-G)
  - c. QR-based (genome-wide)
  - d. APY (genome-wide)

The chromosome-wise SVD (PCRR or PCIG-C) was performed independently for each chromosome based on a core sample of 500, 1000 or 2000 individuals. For each chromosome, the number of components was set such that > 99% of the chromosome-specific genomic variation (in the core) was explained by the chosen PC. These PC were then used to compute \({\hat{\mathbf{T}}}\). For the PCIG-G, an economy-sized SVD was performed across all chromosomes for the core sample (500 to 2000 individuals) and, thus, the final number of components was equal to the core sample size. The QR-based algorithm was based on all genotypes of the core sample, while the APY algorithm was based on genomic relationships of core sample individuals.
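The rule for choosing the number of components can be sketched in Python/NumPy. The sketch assumes that the share of genomic variation explained by a PC is measured by its squared singular value. The function name `n_components` and the example singular values are hypothetical.

```python
import numpy as np

def n_components(singular_values, threshold=0.99):
    """Smallest q such that the first q PC explain more than `threshold`
    of the variation, measured by squared singular values (an assumption)."""
    s2 = np.asarray(singular_values, dtype=float) ** 2
    explained = np.cumsum(s2) / np.sum(s2)
    q = int(np.searchsorted(explained, threshold, side="right")) + 1
    return min(q, len(s2))

# Hypothetical singular values from the SVD of one chromosome in the core sample
s = np.array([10.0, 5.0, 2.0, 1.0, 0.5])
q = n_components(s)
print(q)
```

For the genome-wide variant no such selection is made in the text: all components from the economy-size SVD of the core sample are kept.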

All models and algorithms were compared based on their accuracy of predicting the true breeding values of validation animals that had masked phenotypes. Validation animals were randomly sampled among non-core animals (with a probability of 10%).

Data preparation and statistical analyses were performed using Julia software scripts (http://julialang.org/). All solutions were obtained by solving the mixed model equations directly.

### Real data analysis

The PCIG-C and APY algorithms were also used in a single-step multi-trait genomic evaluation of a real dataset, which comprised data from the Irish beef cattle carcass evaluation and included 8.33 million animals with records on nine traits. The model used was identical (excluding genetic groups) to the standard Irish beef cattle evaluation model [19]. There were 13.35 million animals in the pedigree, of which 163,277 were genotyped. Genotyping was done by using the Illumina Bovine SNP50 Bead Chip (Illumina, San Diego, USA), of which 54,620 SNPs on 29 autosomes were included in the analysis (after quality edits). The population was heterogeneous and included genotypes of animals from 41 breeds. Hence, the dataset was challenging in the sense that a large core sample was needed to capture genetic variation in all breeds. For PCIG-C, the number of components per chromosome was set such that it explained a given percentage (from 90 to 95%) of the chromosome-specific genomic variation and core sample sizes were 30,000 to 50,000. The resulting estimated breeding values (EBV) using the PCIG-C and APY inverse GRM were compared with the original EBV based on direct inversion of \(\left( {{\mathbf{G}} + {\mathbf{I}}\theta } \right)\).

The analysis was conducted using an iterative solver in the MIX99 software (http://www.luke.fi/mix99), using the preconditioned conjugate gradient method and iteration on data. A value of *θ* = 10^{−3} was added to the diagonal elements of the GRM to ensure that the matrix was positive definite.

Two simple Julia scripts are attached, demonstrating (1) how to use SVD methods to compute reduced-dimension approximations of a larger genomic dataset using a core sample (Additional file 1), and (2) how to combine reduced-dimension genomic data from multiple chromosomes in the computation of an inverse approximated genomic relationship matrix (Additional file 2).

## Results

### Simulation study

Differences in accuracy between GBLUP/PCIG-C and the other models were largest at the smallest core sample sizes and highest heritabilities (e.g. for a core sample of 500 and heritability 0.90, accuracy was 0.95 for GBLUP/PCIG-C vs. 0.81 for the other methods). At the smallest core sample size (500), genomic relationships were so crudely described by PCIG-G, APY and QRIG that very little information was gained by changing the heritability from 0.25 to 0.90. At larger core sample sizes, the differences between GBLUP/PCIG-C and the other methods were smaller, but not negligible, even at core sample sizes up to 2000. As a result, PCIG-C was much more robust to core sample size and achieved results comparable to the full-dimension GBLUP, even at the smallest core sizes tested. Using PCIG-C, the average number of PC needed to capture at least 99% of the genomic variation per chromosome was 239, 298 and 340 for, respectively, 500, 1000 and 2000 animals in the core (4770, 5959 and 6795 components across all chromosomes). Hence, for this data structure, the genomic relationships could be effectively approximated with a limited number of chromosome-specific PC, even when estimated from core sample sizes down to 500 individuals.

With respect to computing time, QR decomposition (QRIG) required down to 18% less computing time than SVD (PCIG models) when applied to genomic data on single chromosomes (~ 25 k loci). The relative difference in computation time between QR and SVD was largest at smaller core samples. However, at small core samples, both methods were fast, making the relative difference in computing time less important.

### Multi-breed beef cattle data

For the real data analysis of the multi-breed beef cattle population using PCIG-C, the results were essentially identical for core sample sizes of 30,000 and 50,000, hence only results of the latter are presented. Correlations of EBV from the PCIG-C single-step models with EBV from the original GBLUP (direct inversion) model were high for all traits (0.995 to 0.999, 0.998 to 0.999, and 1.000 when the chosen components explained ≥ 85, ≥ 90 and ≥ 95% of chromosomal genomic variance, respectively). The corresponding numbers of chosen components were 30,208 (≥ 85%), 34,655 (≥ 90%), and 40,140 (≥ 95%). APY based on a core sample size of 50,000 individuals resulted in almost identical ranking of animals based on EBV as the original model (rank correlations ranging from 0.999 to 1.000), while the ranking of animals based on APY with a core sample of 30,000 individuals (corresponding in rank, i.e. number of PC, to PCIG-C at ≥ 85%) had a slightly lower correlation with the ranking from the original model (0.952 to 0.996). For a similar rank (~ 30,000) of the GRM (number of chosen PC in PCIG and number of core animals in APY), PCIG-C needed a smaller number of iterations to converge (1351 to 1385 vs. 1619 to 1756, for PCIG-C and APY, respectively). Computing times could not be compared directly, as these may have been influenced by other jobs running simultaneously on the computer cluster.

## Discussion

When based on the same PC, the PCRR and PCIG algorithms gave identical EBV. However, the PCIG algorithm is more flexible in that it can easily be incorporated into existing single-step genomic animal models. The results of the current study show that all reduced-dimension algorithms (PCRR, PCIG-C, PCIG-G, APY and QRIG) approach the GBLUP solutions when core sample size becomes large. However, the PCRR and PCIG-C algorithms were, by far, the most robust to reductions in core sample size. For the simulated data, the EBV were virtually identical to the EBV obtained with full-dimension GBLUP for all core sample sizes (even down to 500 individuals) and heritabilities, with correlations between EBV ranging from 0.997 to 1.000. For the other methods, accuracy of selection dropped considerably at smaller core sample sizes (500 and 1000), especially with high heritability. In the real data analysis of a multi-breed beef cattle population, core sample size was generally large and differences between methods were thus smaller, but still in favor of PCIG-C compared with APY.

The PCIG algorithm can be used to calculate the exact inverse of an approximated GRM, even for extremely large genomic datasets that potentially contain millions of individuals and loci, using a limited number of PC per chromosome. The SVD-based PCIG-C uses all genetic data from the core sample to identify the more important PC for each chromosome, and the GRM is based on these. The method can be heavily parallelized, since SVD is performed separately for each chromosome. The number of PC needed to describe the relationship structure of a population depends on the effective number of segregating genomic segments in the population, which for large populations of limited \(N_{e}\) is typically much smaller than the actual population size (\(N\)). After SVD, the inverse of GRM (using PCIG) can be computed easily, and potentially row- or element-wise, which gives room for further parallelization. Hence, computing time can be reduced substantially. Using iteration on data, rows of the inverse of GRM can be computed directly during iteration and, thus, the entire inverse GRM does not need to be stored explicitly. In contrast, when performing “brute force” inversion of the entire GRM, memory requirements increase quadratically and the number of computations increases cubically with the number of animals in the population [20]. Compared with PCIG, the QRIG algorithm, which is based on QR decomposition, was slightly faster and has potential for parallelization (by chromosome). However, this model is less well suited for dimensionality reduction below core sample size (e.g. per chromosome) and is more sensitive to the size of the core sample. Thus, we prefer the PCIG-C over the QRIG algorithm.
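The row-wise and matrix-free use of the inverse can be sketched in Python/NumPy. The sketch again assumes \({\tilde{\mathbf{G}}} = \frac{1}{\rho }\left( {{\mathbf{CC^{\prime}}} + {\mathbf{I}}\rho \theta } \right)\) and its Woodbury inverse. The function names `ginv_times` and `ginv_row`, the sizes and the parameter values are hypothetical. The dense \({\tilde{\mathbf{G}}}\) is formed here only to check the result.

```python
import numpy as np

rng = np.random.default_rng(11)
N, q = 300, 12                    # hypothetical: genotyped animals, chosen PC
rho, theta = 400.0, 1e-3          # hypothetical scaling factor and diagonal constant
C = rng.normal(size=(N, q))       # stand-in for the reduced-dimension matrix C
M = np.linalg.inv(C.T @ C + np.eye(q) * rho * theta)   # q x q, computed once

def ginv_times(v):
    """Product of the inverse GRM with a vector, without forming the inverse."""
    return (v - C @ (M @ (C.T @ v))) / theta

def ginv_row(i):
    """Row i of the inverse GRM, computed on demand from row i of C."""
    row = -(C[i] @ M) @ C.T
    row[i] += 1.0
    return row / theta

# Check against the dense matrix (feasible only because N is small here)
G_tilde = (C @ C.T + np.eye(N) * rho * theta) / rho
v = rng.normal(size=N)
e5 = np.zeros(N)
e5[5] = 1.0
print(np.max(np.abs(G_tilde @ ginv_times(v) - v)))
print(np.max(np.abs(ginv_row(5) @ G_tilde - e5)))
```

Only \({\mathbf{C}}\) and the \(q \times q\) matrix need to be held in memory. For example, with \(N = 10^{6}\) and \(q = 10^{4}\) in double precision, a dense \(N \times N\) matrix needs 8 × 10^{12} bytes (8 TB), whereas \({\mathbf{C}}\) needs 8 × 10^{10} bytes (80 GB).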

The PCIG algorithm proposed in this study is related to the APY algorithm [4], since both methods use genomic data in a core sample to approximate the (inverse) GRM of all animals. In APY, the core sample must be sufficiently small such that the inverse of the core GRM can be computed directly, and the remaining elements of the entire inverse of GRM are computed based on the inverse relationships of the core individuals and the relationships between core and non-core individuals. Furthermore, APY assumes that the non-core part of the inverse GRM is diagonal, while PCIG makes no such assumptions. Using PCIG, the GRM is approximated by a limited number of PC and by adding a small number to the diagonal elements, while the inverse of this matrix is computed by exact methods. Hence, given that the GRM can be appropriately approximated using PC estimated from the core sample, the computed inverse of GRM from PCIG will necessarily also be appropriate, which explains why solutions from reduced-rank PCIG-C were nearly identical to those obtained from full-dimension GBLUP in this study, even at the smallest core sample sizes. The genome-wide PCIG-G gave solutions similar to those of APY and (genome-wide) QRIG, which can be explained by the fact that the maximum number of components in genome-wide analysis is limited by the size of the core sample, while the maximum number of components in chromosome-wise PCIG-C is larger (size of core sample × number of chromosomes). For PCIG-G, APY and QRIG this is an especially limiting factor in smaller core samples, as observed with the simulated dataset, e.g. for these genome-wide methods a core size of 500 implies that the GRM is approximated by, at most, 500 “components” (PC or animal effects) while up to 10,000 PC may be used in the PCRR/PCIG-C models.

Genetic analyses based on chromosome-wise SVD of a core sample assume that genetic variation associated with each chromosome can be explained by the chosen chromosome-specific components (i.e. haplotype blocks), and that the same components are present and responsible for genetic variation in the entire population. In contrast, the APY algorithm assumes that all genetic variation in the population is explained by the additive genetic effects of individuals in the core sample, i.e. that breeding values of non-core individuals are merely functions of breeding values of the core individuals. This implies that, if accuracies of core individuals approach unity (e.g. bulls with large daughter groups), accuracy of the entire genotyped population is also assumed to approach unity, even for newly born genotyped individuals, which is not likely to be true. Even if thousands of historical bulls with progeny are included in the core sample, the EBV of a genotyped calf is not expected to be perfect. In PCIG-C, a more realistic approach is taken, since the accuracy of non-core animals depends on the precision of the estimated effects of the underlying PC, rather than on the accuracy of the EBV of core animals. The number of underlying components may exceed the number of core individuals and, thus, a high accuracy of the EBV of core animals does not imply high accuracies for all underlying components. Thus, as the EBV of non-core animals are functions of these components, genotyped newborn animals are not necessarily assumed to be predicted accurately, even if the core animals are accurate.

In real data, population structures may be more complex and stratified. Hence, real data analyses of complex populations may require larger core samples, e.g. as in the real multi-population dataset analyzed here.

The methods used herein only consider simple SNP-BLUP or genomic animal models, where, a priori, genetic variance is evenly distributed across the genome. However, such simplistic models likely do not use the full potential of high-density or sequence data, which may include genotypes of the causative mutations themselves. One alternative is to combine SVD techniques with methods that allow for different weighting of the SNPs in the model (i.e. approximating Bayesian variable selection models). This approach is described and evaluated in a separate study [21].

## Conclusions

We propose SVD-based methods for genomic prediction. Although SVD may be computationally demanding, the analysis can be performed on a reduced core sample of individuals and/or in parallel on different genome segments, making fast computation possible. After SVD, large-scale genomic analysis can be performed either by PC ridge regression (PCRR) or by a genomic animal model (GBLUP), with the GRM and its inverse defined by the chosen PC (PCIG). The principal component-based GRM is not of full rank but can be made invertible by adding a small number to the diagonal of the entire matrix, and its exact inverse can be easily obtained using the Woodbury formula. The inverse of the SVD-based GRM can be computed row- or element-wise, and the entire matrix does not need to be stored explicitly, e.g. when applying iteration on data. Based on simulated data, PCRR/PCIG models based on chromosome-wise SVD of genomic data from a limited core sample resulted in essentially identical solutions for the entire population as the full-dimension GBLUP model (correlations between EBV = 1.000), while other methods (genome-wide SVD, QRIG and APY) were less accurate, especially at smaller core sample sizes.

## Declarations

### Authors’ contributions

JØ performed the statistical analysis and wrote the manuscript. THEM produced the simulated data set, participated in developing the computational approach, and helped write the manuscript. UI improved computational strategies. IS performed statistical analysis of the large-scale real data set. All authors read and approved the final manuscript.

### Acknowledgements

The study was partly funded by The Research Council of Norway through Project No. 255297: “From whole genome sequence to precision breeding”.

### Competing interests

The authors declare that they have no competing interests.

### Ethics approval and consent to participate

Not applicable.

### Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

## Authors’ Affiliations

## References

1. Meuwissen THE, Hayes BJ, Goddard ME. Prediction of total genetic value using genome-wide dense marker maps. Genetics. 2001;157:1819–29.
2. Legarra A, Aguilar I, Misztal I. A relationship matrix including full pedigree and genomic information. J Dairy Sci. 2009;92:4656–63.
3. Christensen OF, Lund MS. Genomic prediction when some animals are not genotyped. Genet Sel Evol. 2010;42:2.
4. Misztal I, Legarra A, Aguilar I. Using recursion to compute the inverse of the genomic relationship matrix. J Dairy Sci. 2014;97:3943–52.
5. Fernando RL, Cheng H, Garrick DJ. An efficient exact method to obtain GBLUP and single-step GBLUP when the genomic relationship matrix is singular. Genet Sel Evol. 2016;48:80.
6. Henderson CR. Best linear unbiased estimation and prediction under a selection model. Biometrics. 1975;31:423–47.
7. VanRaden P. Genomic measures of relationship and inbreeding. Interbull Bull. 2007;37:33–6.
8. VanRaden PM. Efficient estimation of breeding values from dense genomic data. J Dairy Sci. 2007;90:374–5.
9. Henderson CR. Applications of linear models in animal breeding. In: Schaeffer LR, editor. Applications of linear models in animal breeding. Guelph: University of Guelph; 1984. ISBN-10: 0889550301, ISBN-13: 978-0889550308.
10. Woodbury MA. Inverting modified matrices. Memorandum Report 42, Statistical Research Group, Princeton, New Jersey; 1950.
11. Lay DC. Linear algebra and its applications. Reading: Addison-Wesley; 1994.
12. de los Campos G, Gianola D, Rosa GJM, Weigel KA, Crossa J. Semi-parametric genomic-enabled prediction of genetic values using reproducing kernel Hilbert spaces methods. Genet Res (Camb). 2010;92:295–308.
13. Tusell L, Perez-Rodriguez P, Forni S, Wu XL, Gianola D. Genome-enabled methods for predicting litter size in pigs: a comparison. Animal. 2013;7:1739–49.
14. Hastie T, Tibshirani R. Efficient quadratic regularization for expression arrays. Biostatistics. 2004;5:329–40.
15. Meuwissen T, Hayes B, Goddard M. Accelerating improvement of livestock with genomic selection. Annu Rev Anim Biosci. 2013;1:221–37.
16. VanRaden PM. Efficient methods to compute genomic predictions. J Dairy Sci. 2008;91:4414–23.
17. Meuwissen T, Goddard M. Accurate prediction of genetic values for complex traits by whole-genome resequencing. Genetics. 2010;185:623–31.
18. Kimura M. The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations. Genetics. 1969;61:893–903.
19. Evans RD, Kearney JF, McCarthy J, Cromie A, Pabiou T. Beef performance evaluations in a multi-layered and mainly crossbred population. In: Proceedings of the 10th world congress on genetics applied to livestock production: 17–22 August 2014; Vancouver; 2014.
20. Misztal I. Inexpensive computation of the inverse of the genomic relationship matrix in populations with small effective population size. Genetics. 2016;202:401–9.
21. Meuwissen THE, Indahl UG, Ødegård J. Variable selection models for genomic selection using whole-genome sequence data and singular value decomposition. Genet Sel Evol. 2017;49:94.