Predicting complex traits using a diffusion kernel on genetic markers with an application to dairy cattle and wheat data
- Gota Morota^{1}Email author,
- Masanori Koyama^{2},
- Guilherme J M Rosa^{1, 3},
- Kent A Weigel^{4} and
- Daniel Gianola^{1, 3, 4}
https://doi.org/10.1186/1297-9686-45-17
© Morota et al.; licensee BioMed Central Ltd. 2013
Received: 6 December 2012
Accepted: 31 May 2013
Published: 13 June 2013
Abstract
Background
Arguably, genotypes and phenotypes may be linked in functional forms that are not well addressed by the linear additive models that are standard in quantitative genetics. Therefore, developing statistical learning models for predicting phenotypic values from all available molecular information that are capable of capturing complex genetic network architectures is of great importance. Bayesian kernel ridge regression is a non-parametric prediction model proposed for this purpose. Its essence is to create a spatial distance-based relationship matrix called a kernel. Although the set of all single nucleotide polymorphism genotype configurations on which a model is built is finite, past research has mainly used a Gaussian kernel.
Results
We sought to investigate the performance of a diffusion kernel, which was specifically developed to model discrete marker inputs, using Holstein cattle and wheat data. This kernel can be viewed as a discretization of the Gaussian kernel. The predictive ability of the diffusion kernel was similar to that of non-spatial distance-based additive genomic relationship kernels in the Holstein data, but outperformed the latter in the wheat data. However, the difference in performance between the diffusion and Gaussian kernels was negligible.
Conclusions
It is concluded that the ability of a diffusion kernel to capture the total genetic variance is not better than that of a Gaussian kernel, at least for these data. Although the diffusion kernel as a choice of basis function may have potential for use in whole-genome prediction, our results imply that embedding genetic markers into a non-Euclidean metric space has very small impact on prediction. Our results suggest that use of the black box Gaussian kernel is justified, given its connection to the diffusion kernel and its similar predictive performance.
Background
Prediction of yet-to-be observed phenotypes for complex quantitative traits in agricultural species[1, 2] or for disease status in medicine[3] exploits connections between phenotypes, genealogies, and DNA variations potentially representing functional diversity of organisms. Systems biology approaches have uncovered abundant epistasis in model organisms, including the mouse and the rat[4], Drosophila melanogaster[5], and Saccharomyces cerevisiae[6]. In this context, Loewe[7] proposed an evolutionary systems biology framework for arriving at a better understanding of molecular interactions, given that epistatic interactions between mutations are commonly observed. Therefore, it seems reasonable to argue that genotypes and phenotypes may be connected in forms that are not well addressed by the linear additive models that are standard in quantitative genetics. Bayesian regularized parametric linear additive smoothers, e.g.,[8, 9] may not be fully adequate for capturing genetic signals under epistatic scenarios[10, 11]. Furthermore, attempts to account for epistasis by including interactions in a linear model produces a highly parameterized model structure, possibly yielding a poor predictive ability in cross-validation, and which does not scale well if high-order interactions are included in the model.
Genetic risk prediction in medicine relies on using genomic information to predict the chance of contracting a disease, for example, in personalized medicine for preventive treatment and clinical health care. Prediction of genetic risk derived from pre-selected marker variants is mainstream in this domain, as opposed to prediction based on fitting whole-genome markers simultaneously, as done with great success in animal and plant breeding[8–11]. However, the variants detected in this way are typically not useful for genetic risk prediction, because they explain only a small fraction of the total genetic variance as estimated from covariances between relatives, for example using twin and family studies. Moreover, it has been shown that a large number of variants that do not reach genome-wide statistical significance contribute to the total additive genetic variance[12].
Development of statistical models to predict phenotypic outcomes from all available molecular information that are capable of capturing complex genetic network architectures is therefore important. Arguably, a good predictive model should account for most of the genetic variability, as well as reflect the underlying genetic architecture properly. Also, a predictive model should be flexible with respect to type of input data, e.g., high-throughput chip-based genotypes or whole-genome sequences, and mode of gene action.
An appealing alternative is provided by a kernel-based parametric method known as BLUP (Best Linear Unbiased Prediction) of genetic effects, developed in the 1950’s by C. R. Henderson, an animal breeder[13]. BLUP can also be viewed as a regression of a phenotype on a pedigree-based relationship matrix A (when the model is additive), and it has been used for genetic improvement of livestock species for decades. This method was recently extended to incorporate SNP (single nucleotide polymorphisms) by replacing A by a genomic relationship matrix G[14], although there is no impediment to using A and G together[15]. BLUP is suited for handling a massive amount of genetic information, because the computational burden can be proportional to the number of data points rather than the number of predictor variables (e.g., markers), and this is particularly true if a common weight is assigned to a each marker. Recently, kernel-based non-parametric models e.g.,[15–18] have been proposed. A non-parametric treatment can accommodate nonlinear dependencies of phenotypes on predictor variables without explicitly modeling them. This suggests that these procedures can potentially pick up various forms of gene action without posing richly parametrized structures that require making strong distribution and genetic architecture assumptions a priori[10, 15]. For example, Long et al.[16] used a computer simulation and found that the predictive ability of a non-parametric smoother was superior to that of a parametric linear counterpart when non-additive effects were strong. These authors also gave evidence that non-parametric smoothing is competitive to linear smoothing, even when additivity accounts for most of the total genetic variability.
Kernel ridge regression[19], a kernel generalization of standard ridge regression[20], is also a non-parametric smoothing method. Ridge regression has received some attention in quantitative genetics in the context of mixed linear models[10, 15, 21–24], and the non-parametric version is carried out by constructing a spatial distance-based relationship matrix called kernel, as opposed to using additive genomic relationship kernels, A or G, which only embed correlations due to additive genetic effects. The choice of a kernel is equivalent to modeling covariance structure among individuals, and phenotypes are regressed on this kernel to obtain estimates of non-parametric regression coefficients.
A simulation study[18] found that in the presence of non-additive effects, a spatial distance-based kernel can outperform an additive genomic relationship kernel in predictive performance, but this has not been explored thoroughly with real data. Furthermore, while the set of all SNP genotype configurations on which a model is built is finite, past research has employed spatial distance-based kernels with infinite, unbounded domains, such as the Gaussian kernel. Our first objective in this study is to compare a spatial distance kernel with a non-spatial distance kernel. Secondly, we assess the performance of a non-Gaussian spatial distance kernel by deploying kernels on graphs as the choice of a basis function, a procedure that is suitable for discrete input data structures. Instead of encoding SNP data in a continuous Euclidean space, as in the case of the Gaussian kernel, we investigated kernels on a non-Euclidean space. We examined a diffusion kernel proposed by Kondor and Lafferty[25], Smola and Kondor[26] and Lafferty and Lebanon[27], which is a kernel defined for functions on discrete spaces, such as a graph. A brief review on ’kernels on graphs’ is given by[28], and “graph kernels” are discussed in[29]. As shown later, the diffusion kernel can be viewed as a discretization of the Gaussian kernel. We also tested the sensitivity of applying the same bandwidth parameter to autosomes and allosomes in the spatial distance kernels.
This paper investigates the use of several kinds of kernels in a kernel ridge regression framework for genome-assisted prediction of quantitative traits. Two data sets representing dairy cattle and wheat were employed for this purpose. The paper is organized as follows. In the Methods section, we describe the data and introduce basic notions of kernel ridge regression. We then apply the diffusion kernel to strings of dummy variable marker sequences; the motivation of the non-Euclidean metric space is followed by an introduction of the diffusion kernel. In the next section, main results are presented. Finally, in the Discussion section, we address the implication of our results and make concluding remarks.
Methods
Data
Dairy cattle and wheat data were used. The dairy data was provided by the USDA-ARS Animal Improvement Programs Laboratory (Beltsville, MD) and represented 7902 Holstein bulls, each with 43 134 SNPs (minor allele frequency > 0.025) spanning across the whole genome. The target response variable was progeny test predicted transmitting ability (PTA) of productive life (PL). PL is a measure of the observed length of time that a cow stays in the herd, from first calving to culling, and PTA is an estimate of half of the breeding value of a bull, which is a smoothed average assuming additive inheritance. PL is lowly heritable, with heritability estimated at 0.1[30]. The genotype for each of 42 438 SNP loci on autosomes was coded as 0 (homozygous for allele “a”), 1 (heterozygous), or 2 (homozygous for allele “A”), according to the number of copies of the “A” allele. The remaining 696 loci on the X chromosome were coded as either 0 or 2, representing absence or presence of the “A” allele, respectively. Missing genotypes, due to either low call rates for some SNPs or poor DNA quality, were imputed via random sampling of genotypes with probabilities corresponding to observed genotype frequencies at each locus. Note that other more precise methods are available but were not used here.
The wheat data included 599 inbred lines collected by the International Maize and Wheat improvement Center in Mexico (CIMMYT). Each line was genotyped with 1279 Diversity Array Technology (DArt) markers generated by Triticarte Pty. Ltd. These binary markers take the form of presence (1) or absence (-1). The phenotype here was average grain yield of each line in the first out of four environments represented in the data set, scaled to have zero mean and variance one. Missing genotypes were imputed as for the cattle data above. This data set has been also analyzed with support vector regression and neural network methods[17, 31].
Kernel ridge regression
where y_{ i } is a response variable on case i(i=1,2,…,n), x_{ i } is a p×1 vector of genotypes obtained on i, g(x_{ i }) is a genetic effect interpretable as the conditional expectation function g(x_{ i })=E(y_{ i }|x=x_{ i }), and ϵ_{ i } is a residual.
with respect to G, where the first term is the residual sum of squares,$\parallel g{\parallel}_{\mathcal{\mathscr{H}}}^{2}$ is the squared norm of g under a Hilbert space, and λ is a regularization parameter. The representer theorem[32] is used to find the optimal g.
and then maximizing a penalized likelihood. This penalized likelihood is obtained by assuming that p(y|α,σ e 2) and that α follows$N(0,{\mathbf{K}}^{-1}{\sigma}_{\alpha}^{2})$, where${\sigma}_{e}^{2}$ is the variance of the residuals, and${\sigma}_{\alpha}^{2}$ is a variance component.
Next, we review additive genomic relationship kernels and the Gaussian kernel, and then present how one can build a kernel on a graph with discrete inputs. Hereafter, we denote K as the kernel matrix indexed by the observed covariate; and K(i,j) indicates particular elements of K;$\mathcal{K}$ is the infinite dimensional Gaussian kernel, or the 3^{ p }×3^{ p } dimensional kernel for the diffusion kernel.
Additive genomic relationship kernels
Here p_{ j } is the frequency of allele “A” computed from a base population. The denominator of G 1 is a scaling parameter. In practice, the allele frequencies are estimated from the data at hand, but this yields semi-positive definite matrices as discussed by Strandén and Christensen[33].
where z_{.j} is the j th column of Z and p represents the number of SNPs.
Since the Holstein data set led to non-positive G 1 and G 2 matrices because of numerical issues, G_{ i }(i=1,2) was modified to${\mathbf{G}}_{i}^{\ast}=0.95{\mathbf{G}}_{i}+0.05\mathbf{I}$, yielding G^{∗} matrices that provided valid kernels. This may also avoid numerical instability in the eigenvalue decomposition of the kernel as explained later. The wheat data produced semi-positive definite genomic relationship kernel matrices. The kernel G 1 has been applied to several inbred line populations in the past, e.g., Ober et al.[35].
Gaussian kernel
where${d}_{\mathit{\text{ij}}}=\phantom{\rule{0.3em}{0ex}}\sqrt{{({x}_{i1}\phantom{\rule{0.3em}{0ex}}-{x}_{j1})}^{2}\phantom{\rule{0.3em}{0ex}}+\phantom{\rule{0.3em}{0ex}}\cdots \phantom{\rule{0.3em}{0ex}}+\phantom{\rule{0.3em}{0ex}}{({x}_{\mathit{\text{ik}}}-{x}_{\mathit{\text{jk}}})}^{2}\phantom{\rule{0.3em}{0ex}}+\phantom{\rule{0.3em}{0ex}}\cdots \phantom{\rule{0.3em}{0ex}}+\phantom{\rule{0.3em}{0ex}}{({x}_{\mathit{\text{ip}}}-{x}_{\mathit{\text{jp}}})}^{2}}$, and x_{ i k } (i,j=1,…,n,k=1,…,p) is the SNP genotype for individual i at SNP k. A small Euclidean distance between two individuals reflects a strong similarity in state between their genotypes. On the one hand, as θ increases, the kernel evaluation approaches K(x_{ i },x_{ j })=0, producing a “sharp” or “local” kernel. On the other hand, as θ→0, the kernel approaches 1, that is, a situation where the two individuals “match” perfectly, providing a “global” kernel.
Non-Euclidean metric space
Example of diffusion on a graph
α= 0.1 | α = 0.2 | α = 0.2 | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
x = | 0 | 1 | 2 | x = | 0 | 1 | 2 | x = | 0 | 1 | 2 |
k_{1}(0,x) | 0 | 1 | 0 | k_{1}(0,x) | 0 | 1 | 0 | k_{2}(0,x) | 0 | 0 | 1 |
k_{1}(1,x) | 0.1 | 0.8 | 0.1 | k_{1}(1,x) | 0.2 | 0.6 | 0.2 | k_{2}(1,x) | 0 | 0.2 | 0.8 |
k_{1}(2,x) | 0.17 | 0.66 | 0.17 | k_{1}(2,x) | 0.28 | 0.44 | 0.28 | k_{2}(2,x) | 0.04 | 0.28 | 0.68 |
k_{1}(3,x) | 0.219 | 0.562 | 0.219 | k_{1}(3,x) | 0.312 | 0.376 | 0.312 | k_{2}(3,x) | 0.171 | 0.330 | 0.498 |
k_{1}(15,x) | 0.331 | 0.336 | 0.331 | k_{1}(15,x) | 0.333 | 0.333 | 0.333 | k_{2}(15,x) | 0.324 | 0.333 | 0.342 |
As shown in Table1, as t increases the ’influence’ spreads over all genotypes more evenly; also, the larger α is, the faster the diffusion is with respect to time t.
which shows that H(Γ) is a negative semi-definite matrix.
On this graph, the distance from genotype 0 (’aa’) to genotype 2 (’AA’) is the same as that from 0 (’aa’) to 1 (’ A a’). Since genotype ’aa’ has no copies of the ’A’ allele, it may be more reasonable to assume that genotype ’ A a’ is closer to ’AA’, which has two copies of the ’A’ allele. This can be viewed from a mutational perspective as well. Genotype 0 (’aa’) requires two mutations to become genotype 2 (’AA’), while genotype 1 (’ A a’) requires only one mutation. Thus, the graph of interest would be given by (6). The latter is a path graph for SNP data, which will be taken as a minimal basis for our graph. In a path graph, all vertices are on a straight line, as in (6).
Diffusion kernel on a non-Euclidean metric space
which is a discrete diffusion equation (heat equation) on a graph with H=−L(Γ). Note that diffusion kernels always need to be associated with a graph.
A Gaussian kernel is obtained by making this diffusion kernel “space” continuous. The connection between the two kernels is provided in Appendix A.
Diffusion kernel indexed by observed covariates
When a graph Γ is large and asymmetric, the computation of the diffusion kernel$\mathcal{K}(\mathbf{\Gamma})$ can be extremely difficult. For instance, for a SNP grid with 43 134 loci, the dimension of$\mathcal{K}$ is 3^{43134} by 3^{43134}. However, symmetry helps. If a closed form of$\mathcal{K}$ can be arrived at, there is no need to compute$\mathcal{K}(\mathbf{x},{\mathbf{x}}^{\prime})$ for all pairs of genotype sequences x,x^{′}. This is indeed the case for the Gaussian kernel, where the dimension of$\mathcal{K}$ is infinite. With Kondor and Lafferty’s result given in[25], we may obtain the closed form of the diffusion kernel from the sample for our SNP grid.
To this end, we just need to compute${\mathcal{K}}_{\theta}({\mathbf{\Gamma}}_{0})=exp(\theta \mathbf{H})$ with H in (5).
with proportionality constant (e^{−3θ}+3e^{−θ}+2)^{ q }, where q=n_{1}+n_{2}+m_{11}. The last term is a contribution from heterozygosity. We refer to this as SNP grid kernel, specifically developed to model SNP data in this study. A proof of this result is given in Appendix B.
Diffusion kernel for binary genotypes
where d(x,x^{′}) is the Hamming distance, that is, number of coordinates at which x and x^{′} differ[25]. Following Kondor and Lafferty[25], this diffusion kernel for binary markers will be referred to as the hypercube kernel.
Combining SNP grid kernels and hypercube kernels
This result also holds for the Gaussian kernel, but not necessarily for every kernel, e.g., the exponential kernel defined with the Euclidean distance (||x_{ i }−x_{ j }||) does not hold this property.
Bayesian treatment of kernel ridge regression
Once a prior is assigned to${\sigma}_{e}^{2}$ and${\sigma}_{\alpha}^{2}$, a Monte Carlo Markov Chain (MCMC) scheme can be used to infer all unknown parameters, including λ. Scaled inverse chi-square prior distributions were assigned to${\sigma}_{e}^{2}$ and${\sigma}_{\alpha}^{2}$, each with 3 degrees of freedom and a scale parameter equal to 1. Samples from posterior distributions were obtained by the Gibbs sampler in[36], and each of the analyses was based on 100 000 MCMC samples with the first 60 000 samples discarded as burn-in. After burn-in, samples were thinned at a rate of 10, resulting in 4000 mildly correlated samples for posterior inference. Convergence was monitored by inspecting trace plots of the variance parameters. A bandwidth parameter θ yielding high predictive ability is needed as well. However, sampling of the bandwidth parameter in MCMC sampling requires computation of kernels at each iteration, which is very demanding given the number of individuals and SNP considered in our study. For this reason, evaluation of the diffusion kernel was performed over a fixed grid of values of θ. The range of θ considered provided average values of$K(\mathbf{x},{\mathbf{x}}^{{}^{\prime}})$ that were evenly spaced, approximately, between 0.13 to 0.99. Computation of kernels and Gibbs sampling was carried out in Fortran and in R, respectively.
Assessment of predictive ability
where${\widehat{\mathbf{y}}}^{\mathit{\text{test}}}$ is the 2499 × 1 vector of predicted responses of bulls in the testing set; 1 is a 2499×1 vector of ones;${\widehat{\mu}}^{\mathit{\text{train}}}$ is the posterior mean of the intercept estimated from the training set; K^{t e s t⇔t r a i n} is a 2499 × 5403 matrix with elements k(j,i)^{t e s t⇔t r a i n} representing the allelic similarity between bulls in the testing (j=1,…,2499) and training (i=1,…,5403) sets, with the same bandwidth parameter employed in the training set, and${\widehat{\mathit{\alpha}}}^{\mathit{\text{train}}}$ is the vector of posterior means of 5403 non-parametric regression coefficients obtained from the training set. In the equation above, K^{t e s t⇔t r a i n} was either the diffusion or the Gaussian kernel.
In a Bayesian setting, however, one can embed all the above steps in a convenient way. Prior to Gibbs sampling, first we construct a full kernel matrix containing both training and testing data sets. We treat the responses of testing set individuals as unobserved, and these values are predicted via a predictive distribution. This is easy to incorporate in the Gibbs sampling scheme. Pearson’s correlation between the predicted values (mean of the predictive distribution) and the observed PTA, Cor(${\widehat{\mathbf{y}}}^{\mathit{\text{test}}},{\mathbf{y}}^{\mathit{\text{PTA}}})$, and predictive mean-squared error (MSE) defined as$\sum _{i=1}^{2499}{({\u0177}_{i}^{\mathit{\text{test}}}-{y}_{i}^{\mathit{\text{PTA}}})}^{2}/n$ were computed to evaluate the predictive ability of the two kernels. Here,${\u0177}_{i}^{\mathit{\text{test}}}$ is the mean of the predictive distribution of response i in the testing data set, which is the i th element of the${\mathbf{K}}^{\mathit{\text{test}}\leftrightarrow \mathit{\text{train}}}{\widehat{\mathit{\alpha}}}^{\mathit{\text{train}}}$.
Results
Averages of kernel elements and their predictive correlations for the Holstein data
Kernel | θ | k(x_{ i },x_{ i }) | k(x_{ i },x_{ j }) | Cor (ŷ^{ t e s t },y^{ P T A }) | MSE |
---|---|---|---|---|---|
Diffusion | 10 | 0.369 (0.369) | 0.138 (0.134) | 0.727 (0.726) | 215.93 (216.61) |
11 | 0.693 (0.693) | 0.483 (0.477) | 0.745 (0.741) | 204.36 (208.68) | |
11.5 | 0.801 (0.801) | 0.644 (0.639) | 0.739 (0.732) | 207.93 (212.97) | |
12 | 0.874 (0.874) | 0.765 (0.762) | 0.739 (0.728) | 210.54 (215.08) | |
13 | 0.952 (0.952) | 0.907 (0.906) | 0.734 (0.725) | 211.50 (217.61) | |
14 | 0.982 (0.982) | 0.966 (0.965) | 0.729 (0.723) | 214.29 (218.70) | |
Gaussian | 5×10^{−5} | 1 (1) | 0.237 (0.225) | 0.721 (0.702) | 220.675 (233.21) |
2×10^{−5} | 1 (1) | 0.551 (0.542) | 0.736 (0.733) | 213.41 (213.95) | |
1×10^{−5} | 1 (1) | 0.749 (0.742) | 0.742 (0.736) | 210.14 (211.24) | |
5×10^{−6} | 1 (1) | 0.866 (0.861) | 0.736 (0.729) | 210.24 (214.47) | |
3×10^{−6} | 1 (1) | 0.917 (0.914) | 0.734 (0.726) | 211.51 (216.42) | |
1×10^{−6} | 1 (1) | 0.971 (0.971) | 0.729 (0.724) | 214.37 (217.93) | |
G 1 ^{∗} | NA | 0.992 (1.009) | -0.000126 (-0.000128) | 0.729 (0.722) | 214.36 (219.27) |
G 2 ^{∗} | NA | 0.894 (0.909) | -0.000113 (-0.00012) | 0.730 (0.723) | 213.64 (218.31) |
As shown in Table2, the average of off-diagonal elements of the diffusion kernel was smaller than that of diagonal elements. This is because the first two terms of (13) will be different from zero (n_{1},n_{2}>0) for a pair of individuals. Diffusion kernel evaluations between an individual and itself were always larger than kernels evaluated between pairs, that is, diagonal elements had the largest values for each row of K. In the Gaussian kernel, diagonal elements are always equal to 1 and a smaller θ value produces a stronger prior correlation. The first type of additive genomic relationship kernel (G 1^{∗}) had the average diagonal and off-diagonal elements close to 1 and 0, respectively, as expected. Similarly, G 2^{∗} had an average off-diagonal close to 0 but it had smaller average diagonal elements than those of G 1^{∗}.
The right-most columns of Table2 give the evaluation of predictive ability of the kernels measured as the correlation between predicted values and observed PTA, and MSE of prediction, for several different bandwidth parameters (G 1^{∗} and G 2^{∗} do not involve this parameter). The predictive correlation of the diffusion (SNP grid) kernel was best at θ = 11, while with the Gaussian kernel this was achieved at θ = 10^{−5}. Although the averages of diagonal and off-diagonal elements varied substantially with different bandwidth parameters in the diffusion and Gaussian kernels, the influence of this variability on predictive correlations was small. Importantly, no major difference was observed between the diffusion and the Gaussian kernels in terms of predictive performance. Differences among kernels were very minor, probably due to the fact that the response (PTA) is already a smoothed mean based on a large number of daughters of each bull. There was consistency between the correlation and the MSE, in the sense that the value of θ with the highest predictive correlation had the smallest MSE. Predictive performance of G 1^{∗} was only slightly worse than that of the spatial distance kernels with the best bandwidth parameters.
Values in parentheses in Table2 were obtained by combining the SNP grid kernel from autosomes and the hypercube kernel from allosomes by applying the same bandwidth parameter. Incorporation of X-chromosome information reduced the average off-diagonal elements slightly and deteriorated predictive performance to some extent. The average diagonal and off-diagonal elements remained the same in G 1^{∗} and G 2^{∗}, but a minor reduction in their predictive abilities was observed.
Averages of kernel elements and their predictive correlations for the wheat data
Kernel | θ | k(x_{ i },x_{ i }) | k(x_{ i },x_{ j }) | Cor(ŷ^{ t e s t },y^{ t r a i n }) | MSE |
---|---|---|---|---|---|
Diffusion | 3 | 1 | 0.136 | 0.586 | 0.685 |
3.25 | 1 | 0.289 | 0.580 | 0.673 | |
3.5 | 1 | 0.466 | 0.577 | 0.681 | |
4 | 1 | 0.752 | 0.547 | 0.704 | |
5 | 1 | 0.962 | 0.522 | 0.721 | |
Gaussian | 0.005 | 1 | 0.134 | 0.582 | 0.686 |
0.003 | 1 | 0.290 | 0.579 | 0.697 | |
0.002 | 1 | 0.434 | 0.562 | 0.697 | |
0.001 | 1 | 0.655 | 0.558 | 0.703 | |
0.0005 | 1 | 0.809 | 0.556 | 0.673 | |
G 1 | NA | 2 | -0.003 | 0.518 | 0.709 |
G 2 | NA | 2 | -0.003 | 0.521 | 0.708 |
Discussion
Arguably, relationships between phenotypes and genotypes may be non-linear and complex[10, 15, 31]. For this reason, ignoring non-additive effects such as dominance and epistasis in a model may lead to inferior predictive ability of individual phenotypes.
A spatial distance-based kernel non-parametric regression is capable of mapping genotypes to phenotypes in a way that accurately reflects underlying, albeit unknown, relationships. These kernel methods incorporate non-linearity of a predictor set x through a nonlinear transformation of x, subsequently allowing analysis of the response in terms of features ϕ(x) in a linear way. This is particularly useful when the response has a linear relationship with respect to the parameters, but is non-linear on covariates, such as in the case of polynomial regression.
The predictive ability of kernel-based genetic models depends on the choice of a kernel and associated bandwidth parameter(s). If the two data points lie in the real line,$x,{x}^{{}^{\prime}}\in \mathbb{R}$, it seems reasonable to compute their distance in terms of Euclidean distance. However, SNP genotypes, coded as dummy variables, take a discrete form. Therefore, it may be worthwhile to consider a kernel designed to capture the discrete structure of the input variables. The best predictive kernel and its performance may vary depending on the underlying genetic architecture, quantitative trait loci (QTL) numbers and distribution of effects, data set used, and kernel method applied. Here, we investigated the use of ridge regression with a diffusion kernel to assess if this would enhance predictive ability over that of the Gaussian kernel and two additive genomic relationship counterparts. Kondor and Lafferty[25] obtained promising results when the diffusion kernel was compared with several kernels in classification problems with a set of discrete predictors, and this kernel has been used in a microarray-based gene function prediction problem[38]. Ober et al.[18] used the Matérn covariance function, which contains the Gaussian and the exponential kernel as particular cases. Therein, the smoothing parameter controls the actual form of a kernel, and this is directly driven by sample data. Although they obtained a Gaussian form as a choice of the covariance function, the Matérn function is bounded by the Euclidean norm by definition, which may not be suited for discrete genomic data.
A strength of kernels for structured data is their ability to address similarities between two data points$x,{x}^{\prime}\notin \mathbb{R}$[39]. The diffusion kernel defines the distance between two data points on graphs, namely vertices, and projects this information into a more interpretable space. As shown in the context of modeling linkage disequilibrium[40], various graph structures can be used to represent sets of discrete random variables, such as genotypes. Coupled with the representer theorem, the diffusion kernel allows casting underlying graph structures into a regression on the real line under a Hilbert space. The main idea behind this kernel is the matrix exponentiation of the graph Laplacian. The p-dimensional grid graph with vertices representing a vector of genotypes was chosen for the graph structure. Each grid conveys information on similarity in terms of the Manhattan distance. Two vertices x and${\mathbf{x}}^{{}^{\prime}}$ are connected if${x}_{i}={x}_{i}^{{}^{\prime}}$ for all i, except at one coordinate. In the Holstein data, with n=7902 and p = 42 438, it is unlikely that any of two vertices present in our data are connected. However, what grid graphs embrace is how many “steps” separate a vector of genotypes observed in individual i from an observed vector of genotypes in individual j. The diffusion kernel and its associated graph structure are free of parametric structures. They are constructed without posing genetic architecture assumptions a priori. For illustration purposes, we used 0, 1, and 2 for allele coding, but these should be interpreted as mere strings. Kernel computation still remains the same no matter what allele coding method is adopted here. The parametric component in our study is the construction of the path graph in equation (6), but it is not relevant to gene action modes. This allows us to build a flexible non-parametric model without making strong assumptions a priori. This is appealing, because we seldom know the underlying genetic architectures of complex traits. As shown in past studies including animals[31] and plants[41], a non-parametric method stands out when prediction of phenotypes is the primal focus.
Our motivation for applying the diffusion kernel stemmed from the assumption that a non-Euclidean distance may be able to more clearly represent genomic similarities. We carried out a matrix exponentiation of two graph Laplacians created from two path graphs (one for SNP and one for binary markers) for this purpose. This yields a kernel based on the Manhattan distance accounting for the heterozygosity that two individuals share. The two spatial distance kernels resulted in better predictive performance than the two additive genomic relationship kernels in the wheat data. This agrees with the previous simulation study of Ober et al.[18], in which the Gaussian kernel outperformed G 1 in the presence of non-additive effects. Superiority of the spatial distance kernels was less obvious in the Holstein data. This may be due to the phenotype we chose for this study, since the PTA response variable is a smoothed average using linear mixed models. Although the differences were small in cattle, two non-parametric kernels applied in this study outperformed the additive relationship kernels in two of the datasets used. This suggests that unknown cryptic genetic architectures are likely to be intrinsic to complex traits and, hence, kernels that can accommodate such structure yielded better predictions.
As for the difference between the diffusion and the Gaussian kernels in terms of predictive ability, the diffusion kernel had the highest predictive correlation and the lowest MSE with θ=11 in the Holstein data, but the difference with the Gaussian kernel was negligible. The same result was seen in the wheat data. This implies that the Gaussian kernel is robust, even if it incorporates genotypes on the real line such as 1.25 or -12.3. Our objective to properly incorporate genotypes into a kernel had a small impact on predictive ability of yet-to-be observed phenotypes. Although the distance between genotypes is certainly not continuous, additional efforts to discretize the Euclidean distance may not be needed. Another possible reason might be that genotypes do not reside in the Euclidean or non-Euclidean spaces explored here, but in a manifold[27].
Incorporation of X chromosome genotypes for building a kernel led to a smaller average diagonal and off-diagonal elements (to some extent) in spatial distance kernels, but no change was observed in the additive genomic relationship kernels. In both types of spatial distance kernels, however, the predictive correlations were worse than when kernels were constructed purely from autosomes. This suggests that applying specific bandwidth parameters to autosomes and allosomes in the spatial distance kernels might be important. A similar decline in predictive performance was observed in the two additive genomic relationship kernels, which do not involve any bandwidth parameter. Further research is needed to investigate what produces this drop in predictive performance, although if no markers contribute to PL on chromosome X, this would add extra noise.
To our knowledge, this study involves one of the largest data sets employed for spatial kernel-based genome-enabled selection of agricultural species. The challenge here was computation of the diffusion kernel, rather than the Gibbs sampler. Approximately, it took four days to compute one diffusion kernel on a Linux workstation equipped with the Intel(R) Xeon(R) CPU E5450 3.00GHz and 16GB of RAM. The Gaussian kernel required slightly less time for building, but with several candidates over a grid of values of the bandwidth parameter θ, this was an expensive task for both kernels. One useful approach might be that of multiple kernel learning (MKL)[36, 42], which uses a few kernels with different covariance structure in a single RKHS model. Finally, the SNP grid graph and the hypercube graph used in this study are naive graph structures for modeling discrete inputs. Perhaps developing a graph structure that is more suitable for SNP data might increase predictive correlations.
Conclusions
We investigated the performance of a diffusion kernel, which was specifically developed to model discrete marker inputs, using Holstein cattle and wheat data. On the one hand, the predictive ability of the diffusion kernel was similar to that of non-spatial distance-based additive genomic relationship kernels in the Holstein data, due to the fact that the response (PTA) is already a smoothed mean based on a large number of daughters of each bull, but outperformed the latter in the wheat data. On the other hand, only minor difference was observed between the diffusion and the Gaussian kernels in terms of predictive performance. Although the diffusion kernel as a choice of basis function may have potential for use in whole-genome prediction, the results of this study suggest that the simple Gaussian kernel is robust enough, and that the scope for enhancing predictive ability via kernel refinement may be limited.
Appendix A
Connection between a diffusion kernel and a Gaussian kernel
where g_{ θ }(x,x^{′})=G(x−x^{′}) is called a Gaussian kernel with bandwidth θ. Thus, the Gaussian kernel is the ‘space’ continuous analogue of the diffusion kernel as described on the graph. This analogy works exactly the same in higher dimensions.
Appendix B
Proof of equation (13)
so that:
Note that one does not need to count n_{0}.
Declarations
Acknowledgements
The authors thank the associate editor and two anonymous reviewers for their valuable comments. This work was supported by the Wisconsin Agriculture Experiment Station and by a Hatch grant from the United States Department of Agriculture.
Authors’ Affiliations
References
- Zhang Z, Zhang Q, Ding XD: Advances in genomic selection in domestic animals. Chin Sci Bull. 2011, 56: 2655-2663. 10.1007/s11434-011-4632-7.View ArticleGoogle Scholar
- Lorenz AJ, Chao S, Asoro FG, Heffner EL, Hayashi T, Iwata H, Smith KP, Sorrells ME, Jannink JL: Genomic selection in plant breeding: knowledge and prospects. Adv Agron. 2011, 110: 77-123.View ArticleGoogle Scholar
- de los Campos G, Gianola D, Allison DB: Predicting genetic predisposition in humans: the promise of whole-genome markers. Nat Genet Rev. 2010, 11: 880-886. 10.1038/nrg2898.View ArticleGoogle Scholar
- Shao H, Burrage LC, Sinasac DS, Hill AE, Ernest SR, O’Brien W, Courtland HW, Jepsen KJ, Kirby A, Kulbokas EJ, Daly MJ, Broman KW, Lander ES, Nadeau JH: Genetic architecture of complex traits: Large phenotypic effects and pervasive epistasis. Proc Natl Acad Sci. 2008, 105: 19910-19914. 10.1073/pnas.0810388105.PubMed CentralView ArticlePubMedGoogle Scholar
- Mackay TFC, Stone EA, Ayroles JF: The genetics of quantitative traits: challenges and prospects. Nat Rev Genet. 2009, 10: 565-577.View ArticlePubMedGoogle Scholar
- Xu L, Jiang H, Chen H, Gu Z: Genetic architecture of growth traits revealed by global epistatic interactions. Genome Biol Evol. 2011, 3: 909-914. 10.1093/gbe/evr065.PubMed CentralView ArticlePubMedGoogle Scholar
- Loewe L: A framework for evolutionary systems biology. BMC Syst Biol. 2009, 3: 27-10.1186/1752-0509-3-27.PubMed CentralView ArticlePubMedGoogle Scholar
- Meuwissen THE, Hayes BJ, Goddard ME: Prediction of total genetic value using genome-wide dense marker maps. Genetics. 2001, 157: 1819-1829.PubMed CentralPubMedGoogle Scholar
- Habier D, Fernando RL, Kizilkaya K, Garrick DJ: Extension of the Bayesian alphabet for genomic selection. BMC Bioinformatics. 2011, 12: 186-10.1186/1471-2105-12-186.PubMed CentralView ArticlePubMedGoogle Scholar
- Gianola D, Fernando RL, Stella A: Genomic-assisted prediction of genetic value with semiparametric procedures. Genetics. 2006, 173: 1761-1776. 10.1534/genetics.105.049510.PubMed CentralView ArticlePubMedGoogle Scholar
- Gianola D, Hill WG, Manfredi E, Fernando RL, de los Campos G: Additive genetic variability and the Bayesian alphabet. Genetics. 2009, 183: 347-363. 10.1534/genetics.109.103952.PubMed CentralView ArticlePubMedGoogle Scholar
- Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, Madden PA, Heath AC, Martin NG, Montgomery GW, Goddard ME, Visscher PM: Common SNPs explain a large proportion of the heritability for human height. Nat Genet. 2010, 42: 565-569. 10.1038/ng.608.PubMed CentralView ArticlePubMedGoogle Scholar
- Henderson CR: Applications of linear models in animal breeding. 1984, Guelph: University of GuelphGoogle Scholar
- VanRaden PM: Efficient methods to compute genomic predictions. J Dairy Sci. 2008, 91: 4414-4423. 10.3168/jds.2007-0980.View ArticlePubMedGoogle Scholar
- Gianola D, van Kaam JBCHM: Reproducing kernel Hilbert spaces regression methods for genomic assisted prediction of quantitative traits. Genetics. 2008, 178: 2289-2303. 10.1534/genetics.107.084285.PubMed CentralView ArticlePubMedGoogle Scholar
- Long N, Gianola D, Rosa GJ, Weigel KA, Kranis A, González-Recio O: Radial basis function regression methods for predicting quantitative traits using SNP markers. Genet Res. 2010, 92: 209-225. 10.1017/S0016672310000157.View ArticleGoogle Scholar
- Long N, Gianola D, Rosa GJ, Weigel KA: Application of support vector regression to genome-assisted prediction of quantitative traits. Theor Appl Genet. 2011, 123: 1065-1074. 10.1007/s00122-011-1648-y.View ArticlePubMedGoogle Scholar
- Ober U, Erbe M, Long N, Porcu E, Schlather M, Simianer H: Predicting genetic values: a kernel-based best linear unbiased prediction with genomic data. Genetics. 2011, 188: 695-708. 10.1534/genetics.111.128694.PubMed CentralView ArticlePubMedGoogle Scholar
- Saunders C, Gammerman A, Vovk V: Ridge regression learning algorithm in dual variables. Proceedings of the 15th International Conference on Machine Learning. 1998, Madison, Wisconsin: Morgan Kaufmann, 515-521.Google Scholar
- Hoerl AE, Kennard RW: Ridge regression: biased estimation for nonorthogonal problems. Technometrics. 1970, 12: 55-67. 10.1080/00401706.1970.10488634.View ArticleGoogle Scholar
- Gianola D, de los Campos: Inferring genetic values for quantitative traits non-parametrically. Genet Res. 2008, 90: 525-540. 10.1017/S0016672308009890.View ArticleGoogle Scholar
- González-Recio O, Gianola D, Long N, Weigel KA, Rosa GJM, Avendaño S: Nonparametric methods for incorporating genomic information into genetic evaluations: an application to mortality in broilers. Genetics. 2008, 178: 2305-2313. 10.1534/genetics.107.084293.PubMed CentralView ArticlePubMedGoogle Scholar
- González-Recio O, Gianola D, Rosa GJ, Weigel KA, Kranis A: Genome-assisted prediction of a quantitative trait measured in parents and progeny: application to food conversion rate in chickens. Genet Sel Evol. 2009, 41: 3-10.1186/1297-9686-41-3.PubMed CentralView ArticlePubMedGoogle Scholar
- de los Campos G, Gianola D, Rosa GJ: Reproducing kernel Hilbert spaces regression: a general framework for genetic evaluation. J Anim Sci. 2009, 87: 1883-1887. 10.2527/jas.2008-1259.View ArticlePubMedGoogle Scholar
- Kondor IR, Lafferty J: Diffusion kernels on graphs and other discrete input spaces. Proceedings of 19th International Conference on Machine Learning. 2002, Sydney, NSW, Australia: Morgan Kaufmann, 315-322.Google Scholar
- Smola AJ, Kondor R: Kernels and regularization on graphs. COLT/Kernel 2003, LNAI 2777. Edited by: Schölkopf B, Schölkopf B, Warmuth MK. 2003, Heidelberg, Berlin: Springer Verlag, 144-158.Google Scholar
- Lafferty J, Lebanon G: Diffusion kernels on statistical manifolds. J Mach Learn Res. 2005, 6: 129-163.Google Scholar
- Fouss F, Francoisse K, Yen L, Pirotte A, Saerens M: An experimental investigation of graph kernels on collaborative recommendation and semi-supervised classification. Neural Net. 2008, 31: 53-72.View ArticleGoogle Scholar
- Vishwanathan SVN, Schraudolph NN, Kondor IR, Borgwardt KM: Graph kernels. J Mach Learn Res. 2010, 11: 1201-1242.Google Scholar
- Tsuruta S, Misztal I, Lawlor TJ: Changing definition of productive life in US Holsteins: effect on genetic correlations. J Dairy Sci. 2005, 88: 1156-1165. 10.3168/jds.S0022-0302(05)72782-X.View ArticlePubMedGoogle Scholar
- Gianola D, Okut H, Weigel KA, Rosa GJM: Predicting complex quantitative traits with Bayesian neural networks: a case study with Jersey cows and wheat. BMC Genetics. 2011, 12: 87-PubMed CentralView ArticlePubMedGoogle Scholar
- Kimeldorf G, Wahba G: Some results on Tchebycheffian spline functions. J Math Anal Appl. 1971, 33: 82-95. 10.1016/0022-247X(71)90184-3.View ArticleGoogle Scholar
- Strandén I, Christensen OF: Allele coding in genomic evaluation. Genet Sel Evol. 2011, 43: 25-10.1186/1297-9686-43-25.PubMed CentralView ArticlePubMedGoogle Scholar
- Meuwissen TH, Solberg TR, Shepherd R, Woolliams JA: A fast algorithm for BayesB type of prediction of genome-wide estimates of genetic value. Genet Sel Evol. 2009, 41: 2-10.1186/1297-9686-41-2.PubMed CentralView ArticlePubMedGoogle Scholar
- Ober U, Ayroles JF, Stone EA, Richards S, Zhu D, Gibbs RA, Stricker C, Gianola D, Schlather M, Mackay TF, Simianer H: Using whole-genome sequence data to predict quantitative trait phenotypes in Drosophila melanogaster. PLoS Genet. 2012, 8: e1002685-10.1371/journal.pgen.1002685.PubMed CentralView ArticlePubMedGoogle Scholar
- Gianola D, Rosa GJ, Weigel KA, Crossa J, de los Campos G: Semi-parametric genomic-enabled prediction of genetic values using reproducing kernel Hilbert spaces methods. Genet Res. 2010, 92: 295-308. 10.1017/S0016672310000285.View ArticleGoogle Scholar
- Kimeldorf G, Wahba G: A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Ann Math Stat. 1970, 41: 495-502. 10.1214/aoms/1177697089.View ArticleGoogle Scholar
- Vert JP, Kanehisa M: Graph-driven features extraction from microarray data using diffusion kernels and kernel CCA. Proceedings of the Sixteenth Annual Neural Information Processing Systems (2002). Edited by: Becker S, Thrun S, Obermayer K. 2003, British Columbia, Canada: MIT Press, 1425-1432.Google Scholar
- Gärtner T: A survey of kernels for structured data. SIGKDD Explorations. 2002, 5: S268-275.Google Scholar
- Morota G, Valente BD, Rosa GJM, Weigel KA, Gianola D: An assessment of linkage disequilibrium in Holstein cattle using a Bayesian network. J Anim Breed Genet. 2012, 129: 474-487.PubMedGoogle Scholar
- Crossa J, de los Campos G, Pérez P, Gianola D, Burguen̈o J, Araus JL, Makumbi D, Singh RP, Dreisigacker S, Yan J, Arief V, Banziger M, Braun HJ: Prediction of genetic values of quantitative traits in plant breeding using pedigree and molecular markers. Genetics. 2010, 186: 713-724. 10.1534/genetics.110.118521.PubMed CentralView ArticlePubMedGoogle Scholar
- Gönen M, Alpaydın E: Multiple kernel learning algorithms. J Mach Learn Res. 2011, 12: 2211-2268.Google Scholar
- Evans LC: Partial Differential Equations. 2nd Edition. 2010, Providence, Rhode Island: American Mathematical SocietyGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.