Predicting complex traits using a diffusion kernel on genetic markers with an application to dairy cattle and wheat data

Morota, Gota; Koyama, Masanori; M Rosa, Guilherme J; Weigel, Kent A; Gianola, Daniel

doi:10.1186/1297-9686-45-17

Research
Open access
Published: 13 June 2013

Predicting complex traits using a diffusion kernel on genetic markers with an application to dairy cattle and wheat data

Gota Morota¹,
Masanori Koyama²,
Guilherme J M Rosa^1,3,
Kent A Weigel⁴ &
…
Daniel Gianola^1,3,4

Genetics Selection Evolution volume 45, Article number: 17 (2013) Cite this article

6023 Accesses
32 Citations
3 Altmetric
Metrics details

Abstract

Background

Arguably, genotypes and phenotypes may be linked in functional forms that are not well addressed by the linear additive models that are standard in quantitative genetics. Therefore, developing statistical learning models for predicting phenotypic values from all available molecular information that are capable of capturing complex genetic network architectures is of great importance. Bayesian kernel ridge regression is a non-parametric prediction model proposed for this purpose. Its essence is to create a spatial distance-based relationship matrix called a kernel. Although the set of all single nucleotide polymorphism genotype configurations on which a model is built is finite, past research has mainly used a Gaussian kernel.

Results

We sought to investigate the performance of a diffusion kernel, which was specifically developed to model discrete marker inputs, using Holstein cattle and wheat data. This kernel can be viewed as a discretization of the Gaussian kernel. The predictive ability of the diffusion kernel was similar to that of non-spatial distance-based additive genomic relationship kernels in the Holstein data, but outperformed the latter in the wheat data. However, the difference in performance between the diffusion and Gaussian kernels was negligible.

Conclusions

It is concluded that the ability of a diffusion kernel to capture the total genetic variance is not better than that of a Gaussian kernel, at least for these data. Although the diffusion kernel as a choice of basis function may have potential for use in whole-genome prediction, our results imply that embedding genetic markers into a non-Euclidean metric space has very small impact on prediction. Our results suggest that use of the black box Gaussian kernel is justified, given its connection to the diffusion kernel and its similar predictive performance.

Background

Prediction of yet-to-be observed phenotypes for complex quantitative traits in agricultural species[1, 2] or for disease status in medicine[3] exploits connections between phenotypes, genealogies, and DNA variations potentially representing functional diversity of organisms. Systems biology approaches have uncovered abundant epistasis in model organisms, including the mouse and the rat[4], Drosophila melanogaster[5], and Saccharomyces cerevisiae[6]. In this context, Loewe[7] proposed an evolutionary systems biology framework for arriving at a better understanding of molecular interactions, given that epistatic interactions between mutations are commonly observed. Therefore, it seems reasonable to argue that genotypes and phenotypes may be connected in forms that are not well addressed by the linear additive models that are standard in quantitative genetics. Bayesian regularized parametric linear additive smoothers, e.g.,[8, 9] may not be fully adequate for capturing genetic signals under epistatic scenarios[10, 11]. Furthermore, attempts to account for epistasis by including interactions in a linear model produces a highly parameterized model structure, possibly yielding a poor predictive ability in cross-validation, and which does not scale well if high-order interactions are included in the model.

Genetic risk prediction in medicine relies on using genomic information to predict the chance of contracting a disease, for example, in personalized medicine for preventive treatment and clinical health care. Prediction of genetic risk derived from pre-selected marker variants is mainstream in this domain, as opposed to prediction based on fitting whole-genome markers simultaneously, as done with great success in animal and plant breeding[8–11]. However, the variants detected in this way are typically not useful for genetic risk prediction, because they explain only a small fraction of the total genetic variance as estimated from covariances between relatives, for example using twin and family studies. Moreover, it has been shown that a large number of variants that do not reach genome-wide statistical significance contribute to the total additive genetic variance[12].

Development of statistical models to predict phenotypic outcomes from all available molecular information that are capable of capturing complex genetic network architectures is therefore important. Arguably, a good predictive model should account for most of the genetic variability, as well as reflect the underlying genetic architecture properly. Also, a predictive model should be flexible with respect to type of input data, e.g., high-throughput chip-based genotypes or whole-genome sequences, and mode of gene action.

An appealing alternative is provided by a kernel-based parametric method known as BLUP (Best Linear Unbiased Prediction) of genetic effects, developed in the 1950’s by C. R. Henderson, an animal breeder[13]. BLUP can also be viewed as a regression of a phenotype on a pedigree-based relationship matrix A (when the model is additive), and it has been used for genetic improvement of livestock species for decades. This method was recently extended to incorporate SNP (single nucleotide polymorphisms) by replacing A by a genomic relationship matrix G[14], although there is no impediment to using A and G together[15]. BLUP is suited for handling a massive amount of genetic information, because the computational burden can be proportional to the number of data points rather than the number of predictor variables (e.g., markers), and this is particularly true if a common weight is assigned to a each marker. Recently, kernel-based non-parametric models e.g.,[15–18] have been proposed. A non-parametric treatment can accommodate nonlinear dependencies of phenotypes on predictor variables without explicitly modeling them. This suggests that these procedures can potentially pick up various forms of gene action without posing richly parametrized structures that require making strong distribution and genetic architecture assumptions a priori[10, 15]. For example, Long et al.[16] used a computer simulation and found that the predictive ability of a non-parametric smoother was superior to that of a parametric linear counterpart when non-additive effects were strong. These authors also gave evidence that non-parametric smoothing is competitive to linear smoothing, even when additivity accounts for most of the total genetic variability.

Kernel ridge regression[19], a kernel generalization of standard ridge regression[20], is also a non-parametric smoothing method. Ridge regression has received some attention in quantitative genetics in the context of mixed linear models[10, 15, 21–24], and the non-parametric version is carried out by constructing a spatial distance-based relationship matrix called kernel, as opposed to using additive genomic relationship kernels, A or G, which only embed correlations due to additive genetic effects. The choice of a kernel is equivalent to modeling covariance structure among individuals, and phenotypes are regressed on this kernel to obtain estimates of non-parametric regression coefficients.

A simulation study[18] found that in the presence of non-additive effects, a spatial distance-based kernel can outperform an additive genomic relationship kernel in predictive performance, but this has not been explored thoroughly with real data. Furthermore, while the set of all SNP genotype configurations on which a model is built is finite, past research has employed spatial distance-based kernels with infinite, unbounded domains, such as the Gaussian kernel. Our first objective in this study is to compare a spatial distance kernel with a non-spatial distance kernel. Secondly, we assess the performance of a non-Gaussian spatial distance kernel by deploying kernels on graphs as the choice of a basis function, a procedure that is suitable for discrete input data structures. Instead of encoding SNP data in a continuous Euclidean space, as in the case of the Gaussian kernel, we investigated kernels on a non-Euclidean space. We examined a diffusion kernel proposed by Kondor and Lafferty[25], Smola and Kondor[26] and Lafferty and Lebanon[27], which is a kernel defined for functions on discrete spaces, such as a graph. A brief review on ’kernels on graphs’ is given by[28], and “graph kernels” are discussed in[29]. As shown later, the diffusion kernel can be viewed as a discretization of the Gaussian kernel. We also tested the sensitivity of applying the same bandwidth parameter to autosomes and allosomes in the spatial distance kernels.

This paper investigates the use of several kinds of kernels in a kernel ridge regression framework for genome-assisted prediction of quantitative traits. Two data sets representing dairy cattle and wheat were employed for this purpose. The paper is organized as follows. In the Methods section, we describe the data and introduce basic notions of kernel ridge regression. We then apply the diffusion kernel to strings of dummy variable marker sequences; the motivation of the non-Euclidean metric space is followed by an introduction of the diffusion kernel. In the next section, main results are presented. Finally, in the Discussion section, we address the implication of our results and make concluding remarks.

Methods

Data

Dairy cattle and wheat data were used. The dairy data was provided by the USDA-ARS Animal Improvement Programs Laboratory (Beltsville, MD) and represented 7902 Holstein bulls, each with 43 134 SNPs (minor allele frequency > 0.025) spanning across the whole genome. The target response variable was progeny test predicted transmitting ability (PTA) of productive life (PL). PL is a measure of the observed length of time that a cow stays in the herd, from first calving to culling, and PTA is an estimate of half of the breeding value of a bull, which is a smoothed average assuming additive inheritance. PL is lowly heritable, with heritability estimated at 0.1[30]. The genotype for each of 42 438 SNP loci on autosomes was coded as 0 (homozygous for allele “a”), 1 (heterozygous), or 2 (homozygous for allele “A”), according to the number of copies of the “A” allele. The remaining 696 loci on the X chromosome were coded as either 0 or 2, representing absence or presence of the “A” allele, respectively. Missing genotypes, due to either low call rates for some SNPs or poor DNA quality, were imputed via random sampling of genotypes with probabilities corresponding to observed genotype frequencies at each locus. Note that other more precise methods are available but were not used here.

The wheat data included 599 inbred lines collected by the International Maize and Wheat improvement Center in Mexico (CIMMYT). Each line was genotyped with 1279 Diversity Array Technology (DArt) markers generated by Triticarte Pty. Ltd. These binary markers take the form of presence (1) or absence (-1). The phenotype here was average grain yield of each line in the first out of four environments represented in the data set, scaled to have zero mean and variance one. Missing genotypes were imputed as for the cattle data above. This data set has been also analyzed with support vector regression and neural network methods[17, 31].

Kernel ridge regression

Our goal is to predict an unobserved response y, for example PL in $R$ from a vector of genotypes x at a large number of SNP loci; when p SNP are considered, x is in $Z_{3}^{p}$ . To this end, we would like to establish a function $g : Z_{3}^{p} \to R$ mapping sequences of SNP genotypes onto the real line. A general setting is:

y_{i} = g (x_{i}) + ϵ_{i},

where y_i is a response variable on case i(i=1,2,…,n), x_i is a p×1 vector of genotypes obtained on i, g(x_i) is a genetic effect interpretable as the conditional expectation function g(x_i)=E(y_i|x=x_i), and ϵ_i is a residual.

We use kernel ridge regression to infer the unknown function g, and select an appropriate kernel $K$ via a reproducing kernel Hilbert space $ℋ$ (RKHS) of functions on $Z_{3}^{p}$ , and optimize:

∥ y - g ∥^{2} + λ ∥ g ∥_{ℋ}^{2}

(1)

with respect to G, where the first term is the residual sum of squares, $∥ g ∥_{ℋ}^{2}$ is the squared norm of g under a Hilbert space, and λ is a regularization parameter. The representer theorem[32] is used to find the optimal g.

In non-parametric regression, the search space is infinite, but the representer theorem allows confining the search to a specific set of functions. It has been shown[10, 15, 24, 32] that the optimizer will be in the span of the functions indexed by the observed covariates, and that the problem simplifies to optimization of:

ℓ (α | λ) = ∥ y - K α ∥^{2} + λ ∥ K α ∥_{ℋ}^{2},

where K={K(i,j)=K(x_i,x_j)} is a n×n symmetric positive (semi) definite matrix; α is an unknown n×1 vector of non-parametric regression coefficients; and g=K α, is the function that minimizes (1). By properties of a reproducing kernel, $∥ K α ∥_{ℋ}^{2} = α^{'} K α$ , so that the function to be minimized with respect to α is:

\begin{align} ℓ (α | λ) = {(y - K α)}^{'} (y - K α) + λ α^{'} K α . \end{align}

(2)

This is equivalent to writing:

y = K α + ϵ

and then maximizing a penalized likelihood. This penalized likelihood is obtained by assuming that p(y|α,σ e 2) and that α follows $N (0, K^{- 1} σ_{α}^{2})$ , where $σ_{e}^{2}$ is the variance of the residuals, and $σ_{α}^{2}$ is a variance component.

Next, we review additive genomic relationship kernels and the Gaussian kernel, and then present how one can build a kernel on a graph with discrete inputs. Hereafter, we denote K as the kernel matrix indexed by the observed covariate; and K(i,j) indicates particular elements of K; $K$ is the infinite dimensional Gaussian kernel, or the 3^p×3^p dimensional kernel for the diffusion kernel.

Additive genomic relationship kernels

Two types of additive genomic relationship kernels were tested in this study. First, an additive genomic relationship matrix (G 1) was constructed following VanRaden[14] as:

\begin{align} G 1 & = \frac{{ZZ}^{'}}{2 \sum p_{j} (1 - p_{j})}, \end{align}

where Z={Z_{i
j}} is a n×p matrix of centered SNP marker codes, with the entry for i th individual and the j th marker taking the form

\begin{align} Z_{ij} = \{\begin{array}{l} 0 - 2 p_{j} & if homozygous for “a” \\ 1 - 2 p_{j} & if heterozygous \\ 2 - 2 p_{j} & if homozygous for “A” . \end{array} \end{align}

Here p_j is the frequency of allele “A” computed from a base population. The denominator of G 1 is a scaling parameter. In practice, the allele frequencies are estimated from the data at hand, but this yields semi-positive definite matrices as discussed by Strandén and Christensen[33].

A second additive genomic relationship matrix (G 2) was also as in VanRaden[14]

\begin{align} G 2 = \frac{{WW}^{'}}{p}, \end{align}

where W is a matrix of standardized genotypes[34] with its j th column being

\begin{align} w_{.j} & = \frac{z_{.j}}{\sqrt{2 p_{j} (1 - p_{j})}}, \end{align}

where z_.j is the j th column of Z and p represents the number of SNPs.

Since the Holstein data set led to non-positive G 1 and G 2 matrices because of numerical issues, G_i(i=1,2) was modified to $G_{i}^{*} = 0.95 G_{i} + 0.05 I$ , yielding G^∗ matrices that provided valid kernels. This may also avoid numerical instability in the eigenvalue decomposition of the kernel as explained later. The wheat data produced semi-positive definite genomic relationship kernel matrices. The kernel G 1 has been applied to several inbred line populations in the past, e.g., Ober et al.[35].

Gaussian kernel

In a Gaussian kernel, the distance between a pair (i,j) of genotypes is represented as a squared Euclidean norm. Given a positive bandwidth parameter θ, the kernel takes the form

\begin{align} K (x_{i}, x_{j}) & = exp (- θ d_{ij}^{2}) \\ = \prod_{k = 1}^{p} exp (- θ {(x_{ik} - x_{jk})}^{2}), \end{align}

where $d_{ij} = \sqrt{{(x_{i 1} - x_{j 1})}^{2} + \dots + {(x_{ik} - x_{jk})}^{2} + \dots + {(x_{ip} - x_{jp})}^{2}}$ , and x_{i
k} (i,j=1,…,n,k=1,…,p) is the SNP genotype for individual i at SNP k. A small Euclidean distance between two individuals reflects a strong similarity in state between their genotypes. On the one hand, as θ increases, the kernel evaluation approaches K(x_i,x_j)=0, producing a “sharp” or “local” kernel. On the other hand, as θ→0, the kernel approaches 1, that is, a situation where the two individuals “match” perfectly, providing a “global” kernel.

Non-Euclidean metric space

The SNP data on p loci on some individual often come as $x = (x_{1}, x_{2}, \dots, x_{p}) \in Z_{3}^{p}$ , which is clearly a discrete space, as there are 3^p possible configurations of genotypes (not all of which are observable). Before defining the diffusion kernel, consider the meaning of ’diffusion on a graph’. Suppose p=1, and consider a function k_x that measures the spread of ’influence’ of the genotype at this locus over the other possible genotypes by assuming that the ’influence’ diffuses like heat dissipates. Let $k_{\tilde{x}} (0, x) = 1_{x = \tilde{x}} (x)$ , be the indicator function for genotype $\tilde{x}$ on $Z_{3}$ . We call this the time 0 diffusion, since in this case $\tilde{x}$ has absolutely no influence on other genotypes; that is, the influence of $\tilde{x}$ does not diffuse out to its neighbors. Now, define the time t diffusion of the ’influence’ of genotype $\tilde{x}$ on genotype x to be:

\begin{align} k_{\tilde{x}} (t, x) = k_{\tilde{x}} (t - 1, x) + \sum_{| x - x^{'} | = 1} α [k_{\tilde{x}} (t - 1, x^{'}) - k_{\tilde{x}} (t - 1, x)], \end{align}

(3)

where α is a constant rate of diffusion and each summand is the differential gradient of the ’influence’ between genotypes x and x^′. This is illustrated in Table1. As stated above, there is no diffusion at t=0. Subsequently, the time 1 diffusion with α=0.1 when $\tilde{x} = 1$ is computed as:

\begin{array}{l} k_{1} (1, x = 0) = k_{1} (0, x = 0) + α [k_{1} (0, x^{^{'}} = 1) - k_{1} (0, x = 0)] \\ = 0 + 0.1 [1 - 0] \\ = 0.1 \\ k_{1} (1, x = 1) = k_{1} (0, x = 1) + α [k_{1} (0, x^{^{'}} = 0) - k_{1} (0, x = 1)] \\ + α [k_{1} (0, x^{^{'}} = 2) - k_{1} (0, x = 1)] \\ = 1 + 0.1 [0 - 1] + 0.1 [0 - 1] \\ = 0.8 \\ k_{1} (1, x = 2) = k_{1} (0, x = 2) + α [k_{1} (0, x^{x^{'}} = 1) - k_{1} (0, x = 2)] \\ = 0 + 0.1 [1 - 0] \\ = 0.1 \end{array}

Table 1 Example of diffusion on a graph

Full size table

As shown in Table1, as t increases the ’influence’ spreads over all genotypes more evenly; also, the larger α is, the faster the diffusion is with respect to time t.

Writing (3) in vector form, with $k_{\tilde{x}} (t, x) = k_{\tilde{x}} (t)$ , we get:

\begin{align} k_{\tilde{x}} (t) & = k_{\tilde{x}} (t - 1) + α H k_{\tilde{x}} (t - 1) \\ = (I + α H) k_{\tilde{x}} (t - 1) \\ = {(I + α H)}^{t} k_{\tilde{x}} (0), \end{align}

(4)

where I is a 3×3 identity matrix; $k_{\tilde{x}} (0)$ is a constant 3×1 matrix of initial values, and

\begin{align} H = [\begin{array}{l} - 1 & 1 & 0 \\ 1 & - 2 & 1 \\ 0 & 1 & - 1 \end{array}] \end{align}

(5)

with the first, second, and third rows of the H matrix corresponding to k₀,k₁, and k₂ respectively. The negative of this matrix is called the Laplacian of a graph Γ, given by:

\begin{align} 0 - 1 - 2 . \end{align}

(6)

Let Γ be an undirected graph with vertex set V(Γ). In general, the Laplacian of a graph Γ, L(Γ), is a V(Γ) dimensional square matrix given by

\begin{align} L (Γ) & = - H (Γ) \\ = - A (Γ) + Λ, \end{align}

where A is an adjacency matrix and Λ is a diagonal matrix with $Λ_{ii} = \sum_{j = 1}^{n} A_{ij}$ . We can therefore generalize this ’diffusion’ for any graph Γ by using H(Γ)=−L(Γ). Under this definition, given any V(Γ) dimensional vector W,

w^{t} H (Γ) w = - \sum_{i \sim j} {(w_{i} - w_{j})}^{2} \leq 0,

which shows that H(Γ) is a negative semi-definite matrix.

The most naive way of constructing a graph on $Z_{3}^{p}$ is a Hamming graph. For the case p=1, a Hamming graph is simply a complete graph of size 3, and has the form

\begin{align} \begin{array}{l} 0 & - & 1 \\ ∖ & / \\ 2 \end{array} \end{align}

(7)

On this graph, the distance from genotype 0 (’aa’) to genotype 2 (’AA’) is the same as that from 0 (’aa’) to 1 (’ A a’). Since genotype ’aa’ has no copies of the ’A’ allele, it may be more reasonable to assume that genotype ’ A a’ is closer to ’AA’, which has two copies of the ’A’ allele. This can be viewed from a mutational perspective as well. Genotype 0 (’aa’) requires two mutations to become genotype 2 (’AA’), while genotype 1 (’ A a’) requires only one mutation. Thus, the graph of interest would be given by (6). The latter is a path graph for SNP data, which will be taken as a minimal basis for our graph. In a path graph, all vertices are on a straight line, as in (6).

A SNP grid of p loci is a p dimensional grid with vertices in $Z_{3}^{p}$ , with two vertices x and x^′ being adjacent if and only if

\sum_{i = 1}^{p} | x_{i} - x_{i}^{'} | = 1 .

For example, the graph below is the grid for 2 loci derived from the Cartesian graph product of two path graphs as in (6):

\begin{align} \begin{array}{l} 02 & - & 12 & - & 22 \\ | & | & | \\ 01 & - & 11 & - & 21 \\ | & | & | \\ 00 & - & 10 & - & 20 \end{array} \end{align}

(8)

The graph Laplacian for graph (8) is a square matrix of dimension 3²×3²:

\begin{align} L (Γ) & = - H (Γ) \\ = [\begin{array}{l} 2_{00} & - 1 & 0 & - 1 & 0 & 0 & 0 & 0 & 0 \\ - 1 & 3_{01} & - 1 & 0 & - 1 & 0 & 0 & 0 & 0 \\ 0 & - 1 & 2_{02} & 0 & 0 & - 1 & 0 & 0 & 0 \\ - 1 & 0 & 0 & 3_{10} & - 1 & 0 & - 1 & 0 & 0 \\ 0 & - 1 & 0 & - 1 & 4_{11} & - 1 & 0 & - 1 & 0 \\ 0 & 0 & - 1 & 0 & - 1 & 3_{12} & 0 & 0 & - 1 \\ 0 & 0 & 0 & - 1 & 0 & 0 & 2_{20} & - 1 & 0 \\ 0 & 0 & 0 & 0 & - 1 & 0 & - 1 & 3_{21} & - 1 \\ 0 & 0 & 0 & 0 & 0 & - 1 & 0 & - 1 & 2_{22} \end{array}] \end{align}

where the subscripts denote the vertices of graph (8). When there are p loci, the p-dimensional grid graph has 3^p vertices corresponding to sequences of genotypes, such that two vertices are adjacent if and only if just one SNP locus differs by 1. Now, suppose p=3. The Cartesian graph product of (6) and (8) yields a 3 dimensional grid graph with 3³ vertices, as shown in Figure1. The diffusion kernel computes a similarity between two vertices on this graph, and projects this information into a more interpretable space.

Diffusion kernel on a non-Euclidean metric space

Consider now the continuous analog of the diffusion scheme above. This can be done by making ‘time’ or ‘space’ continuous, and ‘time’ will be made continuous first. Let α=θ h (θ>0) and t=1/h. By using a small h, we can achieve a discretization of the ‘diffusion time’ on a much finer scale, and the coefficient matrix is:

\begin{align} {(I + θh H (Γ))}^{1 / h} . \end{align}

(9)

If an infinitesimal scale is considered by taking h→0, (9) converges to:

\begin{align} lim_{h \to 0} {(I + θh H (Γ))}^{1 / h} & = exp (θ H) \\ = \sum_{k = 0}^{\infty} \frac{θ^{k}}{k!} H^{k} = I + θ H + \frac{θ^{2}}{2} H^{2} \\ + \frac{θ^{3}}{3!} H^{3} + \dots + \frac{θ^{n}}{n!} H^{n} + \dots \end{align}

(10)

If a graph Γ with a Laplacian L(Γ) is considered, then exp(−θ L(Γ)) is called the diffusion kernel or heat kernel for graph Γ, where θ is a rate of diffusion[25]. Here, putting $K = exp (θ H)$ and taking the derivative with respect to θ gives:

\begin{align} \frac{d}{dθ} K = H K, \end{align}

(11)

which is a discrete diffusion equation (heat equation) on a graph with H=−L(Γ). Note that diffusion kernels always need to be associated with a graph.

A Gaussian kernel is obtained by making this diffusion kernel “space” continuous. The connection between the two kernels is provided in Appendix A.

Diffusion kernel indexed by observed covariates

When a graph Γ is large and asymmetric, the computation of the diffusion kernel $K (Γ)$ can be extremely difficult. For instance, for a SNP grid with 43 134 loci, the dimension of $K$ is 3⁴³¹³⁴ by 3⁴³¹³⁴. However, symmetry helps. If a closed form of $K$ can be arrived at, there is no need to compute $K (x, x^{'})$ for all pairs of genotype sequences x,x^′. This is indeed the case for the Gaussian kernel, where the dimension of $K$ is infinite. With Kondor and Lafferty’s result given in[25], we may obtain the closed form of the diffusion kernel from the sample for our SNP grid.

First, one needs to consider the Cartesian graph product for the diffusion kernel of a graph. Let $K_{1} (θ)$ and $K_{2} (θ)$ be the kernels for graphs Γ₁ and Γ₂, respectively. The diffusion kernel for Γ=Γ₁□Γ₂ is given by[25]:

\begin{align} K_{1} (θ) \otimes K_{2} (θ), \end{align}

(12)

where □ denotes the Cartesian graph product and ⊗ is the tensor product (infinite dimensional Kronecker product). Consider a graph with one locus, Γ₀, with form 0−1−2. Then, we see that the diffusion kernel of the SNP grid on p loci with bandwidth parameter θ is given by:

K_{θ}^{\otimes p} = \otimes_{i = 1}^{p} K_{θ} (Γ_{0}) .

To this end, we just need to compute $K_{θ} (Γ_{0}) = exp (θ H)$ with H in (5).

With this result, one can create the H matrix for a SNP grid as follows. Let x and x^′ be SNP data for p loci; n_s be the number of loci for which |x_i−x i′|=s, and m₁₁ be the number of loci for which x_i=x i′=1. In other words, n₁ is the number of loci at which two individuals differ by 1, and m₁₁ is the number of loci at which two individuals share heterozygous states. Then:

\begin{align} K_{θ}^{snpgrid} (x, x^{^{'}}) \propto & {(\frac{- 2 e^{- 3 θ} + 2}{e^{- 3 θ} + 3 e^{- θ} + 2})}^{n_{1}} \\ \times & {(\frac{e^{- 3 θ} - 3 e^{- θ} + 2}{e^{- 3 θ} + 3 e^{- θ} + 2})}^{n_{2}} \\ \times & {(\frac{4 e^{- 3 θ} + 2}{e^{- 3 θ} + 3 e^{- θ} + 2})}^{m_{11}}, \end{align}

(13)

with proportionality constant (e^−3θ+3e^−θ+2)^q, where q=n₁+n₂+m₁₁. The last term is a contribution from heterozygosity. We refer to this as SNP grid kernel, specifically developed to model SNP data in this study. A proof of this result is given in Appendix B.

Diffusion kernel for binary genotypes

Another diffusion kernel tailored for binary genotypes is required for the chromosome X of sires or for the wheat inbred lines. In this setting, instead of (6), the path graph for one locus (p=1) is:

0 - 2

and the corresponding graph Laplacian is given by:

\begin{align} L (Γ) & = - H (Γ) \\ = [\begin{array}{l} 1 & - 1 \\ - 1 & 1 \end{array}], \end{align}

(14)

as opposed to (5). For two loci (p=2), the Cartesian product of graphs Γ₁(0−2) and Γ₂(0−2) yields the graph:

\begin{align} \begin{array}{l} 00 & - & 01 \\ | & | \\ 10 & - & 11 \end{array} \end{align}

(15)

where the first digits ∈V(Γ₁) and the second digits ∈V(Γ₂). Then, the associated graph Laplacian is:

\begin{align} L (Γ) & = - H (Γ) \\ = [\begin{array}{l} 2_{00} & - 1 & - 1 & 0 \\ - 1 & 2_{01} & - 1 & 0 \\ - 1 & 0 & 2_{10} & - 1 \\ 0 & - 1 & - 1 & 2_{11} \end{array}], \end{align}

where the subscripts denote the rows and columns of vertices of graph (15). Specifically, we compute K_θ= exp(θ H) with H defined in (14) and perform the tensor product p times. With this, the kernel is given by:

\begin{align} K_{θ}^{hypercube} (x, x^{'}) \propto {(\frac{1 - exp (- 2 θ)}{1 + exp (- 2 θ)})}^{d (x, x^{'})}, \end{align}

(16)

where d(x,x^′) is the Hamming distance, that is, number of coordinates at which x and x^′ differ[25]. Following Kondor and Lafferty[25], this diffusion kernel for binary markers will be referred to as the hypercube kernel.

Combining SNP grid kernels and hypercube kernels

In addition in the Holstein data, we combined the two kernels derived from autosomes and from chromosome X to see the influence of applying the same value of the bandwidth parameter to different types of chromosomes. This is given by:

\begin{align} K^{all} = K^{snpgrid} # K^{hypercube}, \end{align}

(17)

where # is a Hadamard product of matrices. In general, given a set of n individuals, we may partition SNP into several subsets, say x=(x₁,x₂,....,x_r). If K^q is the diffusion kernel corresponding to subset x_q, then the diffusion kernel for all sets can be computed as:

K^{all} = K^{1} # K^{2} # \dots # K^{r} .

This result also holds for the Gaussian kernel, but not necessarily for every kernel, e.g., the exponential kernel defined with the Euclidean distance (||x_i−x_j||) does not hold this property.

Bayesian treatment of kernel ridge regression

Once the choice of the kernel is determined, (2) can be maximized by taking the derivative of ℓ(α) with respect to α to obtain:

\begin{align} \hat{α} = {(K + λ I)}^{- 1} y, \end{align}

where λ is a regularization parameter. Here, implementation of kernel ridge regression was cast in a Bayesian framework with $λ = \frac{σ_{ϵ}^{2}}{σ_{α}^{2}}$ , where $σ_{ϵ}^{2}$ and $σ_{α}^{2}$ are the residual variance and the variance attached to α respectively. Then, note that[36, 37]:

\begin{align} exp (- \frac{1}{2} ℓ (α)) = & exp \{- \frac{1}{2} [{(y - K α)}^{'} (y - K α) + λ α^{'} K α]\} \\ \propto & exp (- \frac{1}{2 σ_{ϵ}^{2}} {(y - K α)}^{'} (y - K α)) \\ \times & exp (- \frac{1}{2 σ_{α}^{2}} α^{'} K α) . \end{align}

This is proportional to $p (α | y, σ_{e}^{2}, σ_{α}^{2}) \propto p (y | α, σ_{e}^{2}) p (α | σ_{α}^{2})$ , that is, the posterior density of α (given $σ_{e}^{2}$ and $σ_{α}^{2}$ ) for the linear model:

\begin{align} y = K α + ϵ, \end{align}

with ϵ∼N(0,I σ e 2) and with prior α∼N(0,K⁻¹σ α 2). Minimizing ℓ(α) will maximize $exp (- \frac{1}{2} ℓ (α))$ , so $\hat{α}$ is the conditional posterior mode of α. One may change the basis K using the eigenvalue decomposition K=ΛΨΛ^′, where Λ is the matrix of eigenvectors of K and Ψ is a diagonal matrix in which diagonals are the eigenvalues, as shown in de los Campos et al.[36], such that, for δ=ΨΛ^′ one gets, in a fully Bayesian model,

\{\begin{array}{l} y = Λ δ + ϵ, \\ p (ϵ, δ, σ_{ϵ}^{2}, σ_{α}^{2}) \propto N (ϵ | 0, I σ_{ϵ}^{2}) N (δ | 0, Ψ σ_{α}^{2}) p (σ_{ϵ}^{2}, σ_{α}^{2}) \end{array}

Once a prior is assigned to $σ_{e}^{2}$ and $σ_{α}^{2}$ , a Monte Carlo Markov Chain (MCMC) scheme can be used to infer all unknown parameters, including λ. Scaled inverse chi-square prior distributions were assigned to $σ_{e}^{2}$ and $σ_{α}^{2}$ , each with 3 degrees of freedom and a scale parameter equal to 1. Samples from posterior distributions were obtained by the Gibbs sampler in[36], and each of the analyses was based on 100 000 MCMC samples with the first 60 000 samples discarded as burn-in. After burn-in, samples were thinned at a rate of 10, resulting in 4000 mildly correlated samples for posterior inference. Convergence was monitored by inspecting trace plots of the variance parameters. A bandwidth parameter θ yielding high predictive ability is needed as well. However, sampling of the bandwidth parameter in MCMC sampling requires computation of kernels at each iteration, which is very demanding given the number of individuals and SNP considered in our study. For this reason, evaluation of the diffusion kernel was performed over a fixed grid of values of θ. The range of θ considered provided average values of $K (x, x^{^{'}})$ that were evenly spaced, approximately, between 0.13 to 0.99. Computation of kernels and Gibbs sampling was carried out in Fortran and in R, respectively.

Assessment of predictive ability

The predictive ability of RKHS models with either a diffusion kernel or a Gaussian kernel was assessed by cross-validation. A subset of 5403 bulls born from 1952 through 2003 was used as the training set for the Holstein data. A testing set of 2499 bulls born from 2004 through 2006 was used to evaluate predictive ability. For the wheat data, a 10 fold cross-validation scheme was applied by assigning 599 lines randomly to one of 10 disjoint subsets. Each set was used for validation in turn, while the other 9 subsets were used to train the model. To illustrate, we estimated α in the Holstein data using the training set y=(y₁,⋯,y₅₄₀₃)^′ and their corresponding SNP genotypes x₁,⋯,x₅₄₀₃, and then predicted responses in the testing set as:

\begin{align} {\hat{y}}^{test} & = 1 {\hat{μ}}^{train} + K^{test \leftrightarrow train} {\hat{α}}^{train}, \end{align}

where ${\hat{y}}^{test}$ is the 2499 × 1 vector of predicted responses of bulls in the testing set; 1 is a 2499×1 vector of ones; ${\hat{μ}}^{train}$ is the posterior mean of the intercept estimated from the training set; K^test⇔train is a 2499 × 5403 matrix with elements k(j,i)^test⇔train representing the allelic similarity between bulls in the testing (j=1,…,2499) and training (i=1,…,5403) sets, with the same bandwidth parameter employed in the training set, and ${\hat{α}}^{train}$ is the vector of posterior means of 5403 non-parametric regression coefficients obtained from the training set. In the equation above, K^test⇔train was either the diffusion or the Gaussian kernel.

In a Bayesian setting, however, one can embed all the above steps in a convenient way. Prior to Gibbs sampling, first we construct a full kernel matrix containing both training and testing data sets. We treat the responses of testing set individuals as unobserved, and these values are predicted via a predictive distribution. This is easy to incorporate in the Gibbs sampling scheme. Pearson’s correlation between the predicted values (mean of the predictive distribution) and the observed PTA, Cor( ${\hat{y}}^{test}, y^{PTA})$ , and predictive mean-squared error (MSE) defined as $\sum_{i = 1}^{2499} {(ŷ_{i}^{test} - y_{i}^{PTA})}^{2} / n$ were computed to evaluate the predictive ability of the two kernels. Here, $ŷ_{i}^{test}$ is the mean of the predictive distribution of response i in the testing data set, which is the i th element of the $K^{test \leftrightarrow train} {\hat{α}}^{train}$ .

Results

To illustrate the effect of the bandwidth parameter (θ) on the SNP grid kernel, Figure2 contains histograms showing how θ controls similarities among individuals based on evaluating the kernel on the SNP data. The larger θ is, the stronger the prior inter-correlation structure. It is important to note that the diagonal elements in our SNP grid kernel matrices are not necessarily equal to one, as opposed to what happens in a Gaussian kernel; here, K is a correlation matrix. Table2 shows the average of diagonal, K(x_i,x_i), and off-diagonal, K(x_i,x_j), elements for diffusion, Gaussian and two additive genomic relationship kernels at varying bandwidth values. The mean values of the diagonal elements of the four diffusion kernels shown in Figure2 (see Table2) were 0.369, 0.693, 0.874, and 0.952 for θ=10,11,12, and 13, respectively. This is because in equation (13), even when x=x^′, so that n₁=n₂=0, m₁₁ (the number of ‘Aa’ genotypes shared by x and x^′) may not be zero. This implies that our diffusion kernel accounts for the degree of heterozygosity in a sample. From the perspective of the kernel as a smoothing function, the diffusion kernel performs smoothing for all elements based on heterozygosity as well as allelic similarity. As explained below, the larger the heterozygosity, the weaker the smoothing, leading to a smaller penalty; this is not the case, however, in the Gaussian kernel. In the kernel computation, each factor in (13) is < 1, and in particular, the factor corresponding to m₁₁ is the largest. Henceforth, if the sample contains few heterozygotes, our K will be large. Consequently, the penalty from the optimizer function f, $| | f_{ℋ} | | = α^{T} K α$ , will tend to be large. This is interpretable as imposing stronger smoothing for samples with low heterozygosity. As for the “correlation” with itself, an individual with low heterozygosity will have diagonal elements close to one, as in the case of a Gaussian kernel. Therefore, in addition to the ‘distance’ between genotypes of two individuals, the diffusion kernel takes into account the extent of heterozygosity, while the Gaussian kernel incorporates only the former. Also, the two kernels differ in their definition of distance. The diffusion kernel on the SNP grid is based on the Manhattan distance, while the Gaussian kernel is defined on the Euclidean distance. The Manhattan distance is the distance between two points measured by the the sum of the absolute differences of their coordinates.

Table 2 Averages of kernel elements and their predictive correlations for the Holstein data

Full size table

As shown in Table2, the average of off-diagonal elements of the diffusion kernel was smaller than that of diagonal elements. This is because the first two terms of (13) will be different from zero (n₁,n₂>0) for a pair of individuals. Diffusion kernel evaluations between an individual and itself were always larger than kernels evaluated between pairs, that is, diagonal elements had the largest values for each row of K. In the Gaussian kernel, diagonal elements are always equal to 1 and a smaller θ value produces a stronger prior correlation. The first type of additive genomic relationship kernel (G 1^∗) had the average diagonal and off-diagonal elements close to 1 and 0, respectively, as expected. Similarly, G 2^∗ had an average off-diagonal close to 0 but it had smaller average diagonal elements than those of G 1^∗.

The right-most columns of Table2 give the evaluation of predictive ability of the kernels measured as the correlation between predicted values and observed PTA, and MSE of prediction, for several different bandwidth parameters (G 1^∗ and G 2^∗ do not involve this parameter). The predictive correlation of the diffusion (SNP grid) kernel was best at θ = 11, while with the Gaussian kernel this was achieved at θ = 10⁻⁵. Although the averages of diagonal and off-diagonal elements varied substantially with different bandwidth parameters in the diffusion and Gaussian kernels, the influence of this variability on predictive correlations was small. Importantly, no major difference was observed between the diffusion and the Gaussian kernels in terms of predictive performance. Differences among kernels were very minor, probably due to the fact that the response (PTA) is already a smoothed mean based on a large number of daughters of each bull. There was consistency between the correlation and the MSE, in the sense that the value of θ with the highest predictive correlation had the smallest MSE. Predictive performance of G 1^∗ was only slightly worse than that of the spatial distance kernels with the best bandwidth parameters.

Values in parentheses in Table2 were obtained by combining the SNP grid kernel from autosomes and the hypercube kernel from allosomes by applying the same bandwidth parameter. Incorporation of X-chromosome information reduced the average off-diagonal elements slightly and deteriorated predictive performance to some extent. The average diagonal and off-diagonal elements remained the same in G 1^∗ and G 2^∗, but a minor reduction in their predictive abilities was observed.

In the wheat data, the superiority of the spatial distance-based kernels over the additive genomic relationship kernels was clear. Table3 indicates that the diffusion and Gaussian kernels had the best predictive correlations (MSE) at 0.586 (0.685) and 0.582 (0.686), respectively, whereas those of G 1 and G 2 were 0.518 (0.709) and 0.521 (0.708). This is likely due to picking up non-additive genetic variation that this wheat data harbors. With binary markers, the diagonal elements of the diffusion kernel are always 1, since in equation (16) the Hamming distance d(x,x^′) is always zero. As with the Holstein data, no apparent difference was observed between the diffusion and the Gaussian kernels.

Table 3 Averages of kernel elements and their predictive correlations for the wheat data

Full size table

Discussion

Arguably, relationships between phenotypes and genotypes may be non-linear and complex[10, 15, 31]. For this reason, ignoring non-additive effects such as dominance and epistasis in a model may lead to inferior predictive ability of individual phenotypes.

A spatial distance-based kernel non-parametric regression is capable of mapping genotypes to phenotypes in a way that accurately reflects underlying, albeit unknown, relationships. These kernel methods incorporate non-linearity of a predictor set x through a nonlinear transformation of x, subsequently allowing analysis of the response in terms of features ϕ(x) in a linear way. This is particularly useful when the response has a linear relationship with respect to the parameters, but is non-linear on covariates, such as in the case of polynomial regression.

The predictive ability of kernel-based genetic models depends on the choice of a kernel and associated bandwidth parameter(s). If the two data points lie in the real line, $x, x^{^{'}} \in R$ , it seems reasonable to compute their distance in terms of Euclidean distance. However, SNP genotypes, coded as dummy variables, take a discrete form. Therefore, it may be worthwhile to consider a kernel designed to capture the discrete structure of the input variables. The best predictive kernel and its performance may vary depending on the underlying genetic architecture, quantitative trait loci (QTL) numbers and distribution of effects, data set used, and kernel method applied. Here, we investigated the use of ridge regression with a diffusion kernel to assess if this would enhance predictive ability over that of the Gaussian kernel and two additive genomic relationship counterparts. Kondor and Lafferty[25] obtained promising results when the diffusion kernel was compared with several kernels in classification problems with a set of discrete predictors, and this kernel has been used in a microarray-based gene function prediction problem[38]. Ober et al.[18] used the Matérn covariance function, which contains the Gaussian and the exponential kernel as particular cases. Therein, the smoothing parameter controls the actual form of a kernel, and this is directly driven by sample data. Although they obtained a Gaussian form as a choice of the covariance function, the Matérn function is bounded by the Euclidean norm by definition, which may not be suited for discrete genomic data.

A strength of kernels for structured data is their ability to address similarities between two data points $x, x^{'} \notin R$ [39]. The diffusion kernel defines the distance between two data points on graphs, namely vertices, and projects this information into a more interpretable space. As shown in the context of modeling linkage disequilibrium[40], various graph structures can be used to represent sets of discrete random variables, such as genotypes. Coupled with the representer theorem, the diffusion kernel allows casting underlying graph structures into a regression on the real line under a Hilbert space. The main idea behind this kernel is the matrix exponentiation of the graph Laplacian. The p-dimensional grid graph with vertices representing a vector of genotypes was chosen for the graph structure. Each grid conveys information on similarity in terms of the Manhattan distance. Two vertices x and $x^{^{'}}$ are connected if $x_{i} = x_{i}^{^{'}}$ for all i, except at one coordinate. In the Holstein data, with n=7902 and p = 42 438, it is unlikely that any of two vertices present in our data are connected. However, what grid graphs embrace is how many “steps” separate a vector of genotypes observed in individual i from an observed vector of genotypes in individual j. The diffusion kernel and its associated graph structure are free of parametric structures. They are constructed without posing genetic architecture assumptions a priori. For illustration purposes, we used 0, 1, and 2 for allele coding, but these should be interpreted as mere strings. Kernel computation still remains the same no matter what allele coding method is adopted here. The parametric component in our study is the construction of the path graph in equation (6), but it is not relevant to gene action modes. This allows us to build a flexible non-parametric model without making strong assumptions a priori. This is appealing, because we seldom know the underlying genetic architectures of complex traits. As shown in past studies including animals[31] and plants[41], a non-parametric method stands out when prediction of phenotypes is the primal focus.

Our motivation for applying the diffusion kernel stemmed from the assumption that a non-Euclidean distance may be able to more clearly represent genomic similarities. We carried out a matrix exponentiation of two graph Laplacians created from two path graphs (one for SNP and one for binary markers) for this purpose. This yields a kernel based on the Manhattan distance accounting for the heterozygosity that two individuals share. The two spatial distance kernels resulted in better predictive performance than the two additive genomic relationship kernels in the wheat data. This agrees with the previous simulation study of Ober et al.[18], in which the Gaussian kernel outperformed G 1 in the presence of non-additive effects. Superiority of the spatial distance kernels was less obvious in the Holstein data. This may be due to the phenotype we chose for this study, since the PTA response variable is a smoothed average using linear mixed models. Although the differences were small in cattle, two non-parametric kernels applied in this study outperformed the additive relationship kernels in two of the datasets used. This suggests that unknown cryptic genetic architectures are likely to be intrinsic to complex traits and, hence, kernels that can accommodate such structure yielded better predictions.

As for the difference between the diffusion and the Gaussian kernels in terms of predictive ability, the diffusion kernel had the highest predictive correlation and the lowest MSE with θ=11 in the Holstein data, but the difference with the Gaussian kernel was negligible. The same result was seen in the wheat data. This implies that the Gaussian kernel is robust, even if it incorporates genotypes on the real line such as 1.25 or -12.3. Our objective to properly incorporate genotypes into a kernel had a small impact on predictive ability of yet-to-be observed phenotypes. Although the distance between genotypes is certainly not continuous, additional efforts to discretize the Euclidean distance may not be needed. Another possible reason might be that genotypes do not reside in the Euclidean or non-Euclidean spaces explored here, but in a manifold[27].

Incorporation of X chromosome genotypes for building a kernel led to a smaller average diagonal and off-diagonal elements (to some extent) in spatial distance kernels, but no change was observed in the additive genomic relationship kernels. In both types of spatial distance kernels, however, the predictive correlations were worse than when kernels were constructed purely from autosomes. This suggests that applying specific bandwidth parameters to autosomes and allosomes in the spatial distance kernels might be important. A similar decline in predictive performance was observed in the two additive genomic relationship kernels, which do not involve any bandwidth parameter. Further research is needed to investigate what produces this drop in predictive performance, although if no markers contribute to PL on chromosome X, this would add extra noise.

To our knowledge, this study involves one of the largest data sets employed for spatial kernel-based genome-enabled selection of agricultural species. The challenge here was computation of the diffusion kernel, rather than the Gibbs sampler. Approximately, it took four days to compute one diffusion kernel on a Linux workstation equipped with the Intel(R) Xeon(R) CPU E5450 3.00GHz and 16GB of RAM. The Gaussian kernel required slightly less time for building, but with several candidates over a grid of values of the bandwidth parameter θ, this was an expensive task for both kernels. One useful approach might be that of multiple kernel learning (MKL)[36, 42], which uses a few kernels with different covariance structure in a single RKHS model. Finally, the SNP grid graph and the hypercube graph used in this study are naive graph structures for modeling discrete inputs. Perhaps developing a graph structure that is more suitable for SNP data might increase predictive correlations.

Conclusions

We investigated the performance of a diffusion kernel, which was specifically developed to model discrete marker inputs, using Holstein cattle and wheat data. On the one hand, the predictive ability of the diffusion kernel was similar to that of non-spatial distance-based additive genomic relationship kernels in the Holstein data, due to the fact that the response (PTA) is already a smoothed mean based on a large number of daughters of each bull, but outperformed the latter in the wheat data. On the other hand, only minor difference was observed between the diffusion and the Gaussian kernels in terms of predictive performance. Although the diffusion kernel as a choice of basis function may have potential for use in whole-genome prediction, the results of this study suggest that the simple Gaussian kernel is robust enough, and that the scope for enhancing predictive ability via kernel refinement may be limited.

Appendix A

Connection between a diffusion kernel and a Gaussian kernel

Intuitively, consider again (4) with a one locus case. In order to make the space continuous, an infinite number of ‘fake’ genotypes between and outside of 0 and 2 are needed. That is, instead of the discrete graph 0−1−2, the interval between 0 and 2, and also outside of it, will be viewed as a ‘continuous’ graph containing genotypes such as 1.23 or −10.5, for example. While the fundamental structure of the graph remains the same, each genotype is connected only to its immediate neighbors, that is, each genotype x is connected to only two genotypes, x+d x and x−d x for some infinitesimal dx. Then, H in (5) becomes an infinite-dimensional matrix, and H(x,x^′) is −2 for x^′=x and 1 for x+d x, x−d x, because each genotype is connected to its neighboring genotypes on both sides. With the vector of genotypes being now infinite-dimensional, x=(−∞,⋯,x−d x,x,x+d x,⋯,∞), define a function f that returns an “influence” of genotypes, f=(f(−∞),⋯,f(x−d x),f(x),f(x+d x),⋯,f(∞)). Approximating dx by h, it can be seen that:

\begin{align} \frac{1}{h^{2}} [H (x, \cdot) \cdot f] = \frac{f (x + h) - 2 f (x) + f (x - h)}{h^{2}} \\ = \frac{\frac{f (x + h) - f (x)}{h} - \frac{f (x) - f (x - h)}{h}}{h} \\ ≅ f^{′′} (x), \end{align}

where $f^{^{″}} (x = x_{0})$ is the second derivative of f evaluated at x₀. Thus, with space continuity, H acts like a second derivative[25]. Using this analogy back in (11), we get:

\frac{d}{dθ} K_{θ} (x) = \frac{d^{2}}{d x^{2}} K_{θ} (x) .

This equation is called the continuous diffusion equation: the first derivative in “time” is equal to the second derivate in “space”. The solution to this partial differential equation (PDE) with a Dirac delta[43] initial condition of concentration on x=0, k₀(x)=1_x=0, is given by:

G_{θ} (x) = \frac{1}{\sqrt{4 πθ}} exp (- \frac{x^{2}}{4 θ}) .

This is a Gaussian density in a one-dimensional space where $σ_{e}^{2} = 2 θ$ is the variance of the distribution. With the initial condition K₀(x)=f(x), the solution to this PDE is:

K_{θ} (x) = \int_{R} f (x^{'}) G_{θ} (x - x^{'}) d x^{'},

where g_θ(x,x^′)=G(x−x^′) is called a Gaussian kernel with bandwidth θ. Thus, the Gaussian kernel is the ‘space’ continuous analogue of the diffusion kernel as described on the graph. This analogy works exactly the same in higher dimensions.

Appendix B

Proof of equation (13)

Consider a graph with one locus, Γ₀; this graph has form 0−1−2. We compute exp(θ H) where exponentiation is defined as the Taylor expansion (10), differing from componentwise exponentiation. For Γ₀, H is given by:

\begin{align} H = [\begin{array}{l} - 1 & 1 & 0 \\ 1 & - 2 & 1 \\ 0 & 1 & - 1 \end{array}] . \end{align}

We make use of the eigendecomposition of matrix H = P D P⁻¹ and take note of the fact that Hⁿ=P DⁿP⁻¹. Plugging this Hⁿ in (10), we obtain exp(θ H)=P exp(θ D)P⁻¹. Here exp(θ D) becomes simple componentwise exponentiation because D is a diagonal matrix of eigenvalues. For this specific matrix,

P = [\begin{array}{l} 1 & 1 & 1 \\ - 2 & 0 & 1 \\ 1 & - 1 & 1 \end{array}], D = {[\begin{array}{l} - 3 & 0 & 0 \\ 0 & - 1 & 0 \\ 0 & 0 & 0 \end{array}]}_{.}

Thus, the kernel for a one-dimensional grid graph is:

\begin{align} K_{θ} & = exp (θ H) \\ = P exp (θ D) P^{- 1} \\ = [\begin{array}{l} 1 & 1 & 1 \\ - 2 & 0 & 1 \\ 1 & - 1 & 1 \end{array}] [\begin{array}{l} e^{- 3 θ} & 0 & 0 \\ 0 & e^{- θ} & 0 \\ 0 & 0 & 1 \end{array}] {[\begin{array}{l} 1 & 1 & 1 \\ - 2 & 0 & 1 \\ 1 & - 1 & 1 \end{array}]}^{- 1} \\ = \frac{1}{6} [\begin{array}{l} e^{- 3 θ} + 3 e^{- θ} + 2 & - 2 e^{- 3 θ} + 2 & e^{- 3 θ} - 3 e^{- θ} + 2 \\ - 2 e^{- 3 θ} + 2 & 4 e^{- 3 θ} + 2 & - 2 e^{- 3 θ} + 2 \\ e^{- 3 θ} - 3 e^{- θ} + 2 & - 2 e^{- 3 θ} + 2 & e^{- 3 θ} + 3 e^{- θ} + 2 \end{array}] . \end{align}

(18)

Taking the exponential of eigenvalues always yields a positive real value, so if H is symmetric, exp(θ H) is positive definite, suggesting that the diffusion kernel is a valid kernel. Expression (18) is symmetric and in particular,

\begin{align} K_{θ} (x, x^{^{'}}) = \{\begin{array}{l} - 2 e^{- 3 θ} + 2 & if | x_{i} - x_{i}^{^{'}} | = 1 \\ e^{- 3 θ} - 3 e^{- θ} + 2 & if | x_{i} - x_{i}^{^{'}} | = 2 \\ e^{- 3 θ} + 3 e^{- θ} + 2 & if x_{i} = x_{i}^{^{'}}, x^{^{'}} \neq 1 \\ 4 e^{- 3 θ} + 2 & if x_{i} = x_{i}^{^{'}} = 1 \end{array} \end{align}

(19)

Computing every entry of $K$ is computationally unfeasible and unnecessary. We only need to compute entries corresponding to the pair of genotypes appearing in the sample. In particular, if k_i(x_i,y_i) is the contribution of the i th locus, then:

K_{θ} (x, x^{'}) = \prod_{i = 1}^{p} k_{i} (x_{i}, x_{i}^{^{'}}),

where $k_{i} (x_{i}, x_{i}^{^{'}})$ is determined by the relationship between x_i and $x_{i}^{^{'}}$ , and can take only one of the four values specified above. Thus we can write $k_{i} (x_{i}, x_{i}^{^{'}})$ as:

\begin{align} (e^{- 3 θ} - 3 e^{- θ} + 2) δ_{| x_{i} - y_{i} | = 2} + (- 2 e^{- 3 θ} + 2) δ_{| x_{i} - y_{i} | = 1} \\ + (e^{- 3 θ} + 3 e^{- θ} + 2) δ_{x_{i} = y_{i} \neq 1} + (4 e^{- 3 θ} + 2) δ_{x_{i} = y_{i} = 1} \end{align}

where δ is the indicator function. Therefore,

\begin{align} K_{θ}^{\otimes p} (x, x^{'}) & \propto \prod_{i = 1}^{p} \{(e^{- 3 θ} - 3 e^{- θ} + 2) δ_{| x_{i} - x_{i}^{'} | = 2} \\ + (- 2 e^{- 3 θ} + 2) δ_{| x_{i} - x_{i}^{'} | = 1} \\ + (e^{- 3 θ} + 3 e^{- θ} + 2) δ_{x_{i} = x_{i}^{'} \neq 1} \\ + (4 e^{- 3 θ} + 2) δ_{x_{i} = x_{i}^{'} = 1} \} \end{align}

This can be simplified by using the fact that:

n_{1} + n_{0} + n_{2} = p,

so that:

\begin{align} K_{θ}^{\otimes p} (x, x^{'}) & = {(- 2 e^{- 3 θ} + 2)}^{n_{1}} {(e^{- 3 θ} - 3 e^{- θ} + 2)}^{n_{2}} {(e^{- 3 θ} + 3 e^{- θ} + 2)}^{n_{0} - m_{11}} {(4 e^{- 3 θ} + 2)}^{m_{11}} \\ = {(- 2 e^{- 3 θ} + 2)}^{n_{1}} {(e^{- 3 θ} - 3 e^{- θ} + 2)}^{n_{2}} {(e^{- 3 θ} + 3 e^{- θ} + 2)}^{n_{0} - m_{11}} {(4 e^{- 3 θ} + 2)}^{m_{11}} \\ \cdot \frac{{(e^{- 3 θ} + 3 e^{- θ} + 2)}^{p}}{{(e^{- 3 θ} + 3 e^{- θ} + 2)}^{p}} \\ \propto \frac{{(- 2 e^{- 3 θ} + 2)}^{n_{1}} {(e^{- 3 θ} - 3 e^{- θ} + 2)}^{n_{2}} {(e^{- 3 θ} + 3 e^{- θ} + 2)}^{n_{0} - m_{11}} {(4 e^{- 3 θ} + 2)}^{m_{11}}}{{(e^{- 3 θ} + 3 e^{- θ} + 2)}^{p}} \\ = \frac{{(- 2 e^{- 3 θ} + 2)}^{n_{1}} {(e^{- 3 θ} - 3 e^{- θ} + 2)}^{n_{2}} {(e^{- 3 θ} + 3 e^{- θ} + 2)}^{n_{0} - m_{11}} {(4 e^{- 3 θ} + 2)}^{m_{11}}}{{(e^{- 3 θ} + 3 e^{- θ} + 2)}^{n_{0} + n_{1} + n_{2}}} \\ = \frac{{(- 2 e^{- 3 θ} + 2)}^{n_{1}} {(e^{- 3 θ} - 3 e^{- θ} + 2)}^{n_{2}} {(4 e^{- 3 θ} + 2)}^{m_{11}}}{{(e^{- 3 θ} + 3 e^{- θ} + 2)}^{n_{1} + n_{2} + m_{11}}} \end{align}

(20)

Note that one does not need to count n₀.

References

Zhang Z, Zhang Q, Ding XD: Advances in genomic selection in domestic animals. Chin Sci Bull. 2011, 56: 2655-2663. 10.1007/s11434-011-4632-7.
Article Google Scholar
Lorenz AJ, Chao S, Asoro FG, Heffner EL, Hayashi T, Iwata H, Smith KP, Sorrells ME, Jannink JL: Genomic selection in plant breeding: knowledge and prospects. Adv Agron. 2011, 110: 77-123.
Article Google Scholar
de los Campos G, Gianola D, Allison DB: Predicting genetic predisposition in humans: the promise of whole-genome markers. Nat Genet Rev. 2010, 11: 880-886. 10.1038/nrg2898.
Article CAS Google Scholar
Shao H, Burrage LC, Sinasac DS, Hill AE, Ernest SR, O’Brien W, Courtland HW, Jepsen KJ, Kirby A, Kulbokas EJ, Daly MJ, Broman KW, Lander ES, Nadeau JH: Genetic architecture of complex traits: Large phenotypic effects and pervasive epistasis. Proc Natl Acad Sci. 2008, 105: 19910-19914. 10.1073/pnas.0810388105.
Article PubMed Central CAS PubMed Google Scholar
Mackay TFC, Stone EA, Ayroles JF: The genetics of quantitative traits: challenges and prospects. Nat Rev Genet. 2009, 10: 565-577.
Article CAS PubMed Google Scholar
Xu L, Jiang H, Chen H, Gu Z: Genetic architecture of growth traits revealed by global epistatic interactions. Genome Biol Evol. 2011, 3: 909-914. 10.1093/gbe/evr065.
Article PubMed Central CAS PubMed Google Scholar
Loewe L: A framework for evolutionary systems biology. BMC Syst Biol. 2009, 3: 27-10.1186/1752-0509-3-27.
Article PubMed Central PubMed Google Scholar
Meuwissen THE, Hayes BJ, Goddard ME: Prediction of total genetic value using genome-wide dense marker maps. Genetics. 2001, 157: 1819-1829.
PubMed Central CAS PubMed Google Scholar
Habier D, Fernando RL, Kizilkaya K, Garrick DJ: Extension of the Bayesian alphabet for genomic selection. BMC Bioinformatics. 2011, 12: 186-10.1186/1471-2105-12-186.
Article PubMed Central PubMed Google Scholar
Gianola D, Fernando RL, Stella A: Genomic-assisted prediction of genetic value with semiparametric procedures. Genetics. 2006, 173: 1761-1776. 10.1534/genetics.105.049510.
Article PubMed Central CAS PubMed Google Scholar
Gianola D, Hill WG, Manfredi E, Fernando RL, de los Campos G: Additive genetic variability and the Bayesian alphabet. Genetics. 2009, 183: 347-363. 10.1534/genetics.109.103952.
Article PubMed Central PubMed Google Scholar
Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, Madden PA, Heath AC, Martin NG, Montgomery GW, Goddard ME, Visscher PM: Common SNPs explain a large proportion of the heritability for human height. Nat Genet. 2010, 42: 565-569. 10.1038/ng.608.
Article PubMed Central CAS PubMed Google Scholar
Henderson CR: Applications of linear models in animal breeding. 1984, Guelph: University of Guelph
Google Scholar
VanRaden PM: Efficient methods to compute genomic predictions. J Dairy Sci. 2008, 91: 4414-4423. 10.3168/jds.2007-0980.
Article CAS PubMed Google Scholar
Gianola D, van Kaam JBCHM: Reproducing kernel Hilbert spaces regression methods for genomic assisted prediction of quantitative traits. Genetics. 2008, 178: 2289-2303. 10.1534/genetics.107.084285.
Article PubMed Central PubMed Google Scholar
Long N, Gianola D, Rosa GJ, Weigel KA, Kranis A, González-Recio O: Radial basis function regression methods for predicting quantitative traits using SNP markers. Genet Res. 2010, 92: 209-225. 10.1017/S0016672310000157.
Article CAS Google Scholar
Long N, Gianola D, Rosa GJ, Weigel KA: Application of support vector regression to genome-assisted prediction of quantitative traits. Theor Appl Genet. 2011, 123: 1065-1074. 10.1007/s00122-011-1648-y.
Article PubMed Google Scholar
Ober U, Erbe M, Long N, Porcu E, Schlather M, Simianer H: Predicting genetic values: a kernel-based best linear unbiased prediction with genomic data. Genetics. 2011, 188: 695-708. 10.1534/genetics.111.128694.
Article PubMed Central PubMed Google Scholar
Saunders C, Gammerman A, Vovk V: Ridge regression learning algorithm in dual variables. Proceedings of the 15th International Conference on Machine Learning. 1998, Madison, Wisconsin: Morgan Kaufmann, 515-521.
Google Scholar
Hoerl AE, Kennard RW: Ridge regression: biased estimation for nonorthogonal problems. Technometrics. 1970, 12: 55-67. 10.1080/00401706.1970.10488634.
Article Google Scholar
Gianola D, de los Campos: Inferring genetic values for quantitative traits non-parametrically. Genet Res. 2008, 90: 525-540. 10.1017/S0016672308009890.
Article CAS Google Scholar
González-Recio O, Gianola D, Long N, Weigel KA, Rosa GJM, Avendaño S: Nonparametric methods for incorporating genomic information into genetic evaluations: an application to mortality in broilers. Genetics. 2008, 178: 2305-2313. 10.1534/genetics.107.084293.
Article PubMed Central PubMed Google Scholar
González-Recio O, Gianola D, Rosa GJ, Weigel KA, Kranis A: Genome-assisted prediction of a quantitative trait measured in parents and progeny: application to food conversion rate in chickens. Genet Sel Evol. 2009, 41: 3-10.1186/1297-9686-41-3.
Article PubMed Central PubMed Google Scholar
de los Campos G, Gianola D, Rosa GJ: Reproducing kernel Hilbert spaces regression: a general framework for genetic evaluation. J Anim Sci. 2009, 87: 1883-1887. 10.2527/jas.2008-1259.
Article CAS PubMed Google Scholar
Kondor IR, Lafferty J: Diffusion kernels on graphs and other discrete input spaces. Proceedings of 19th International Conference on Machine Learning. 2002, Sydney, NSW, Australia: Morgan Kaufmann, 315-322.
Google Scholar
Smola AJ, Kondor R: Kernels and regularization on graphs. COLT/Kernel 2003, LNAI 2777. Edited by: Schölkopf B, Schölkopf B, Warmuth MK. 2003, Heidelberg, Berlin: Springer Verlag, 144-158.
Google Scholar
Lafferty J, Lebanon G: Diffusion kernels on statistical manifolds. J Mach Learn Res. 2005, 6: 129-163.
Google Scholar
Fouss F, Francoisse K, Yen L, Pirotte A, Saerens M: An experimental investigation of graph kernels on collaborative recommendation and semi-supervised classification. Neural Net. 2008, 31: 53-72.
Article Google Scholar
Vishwanathan SVN, Schraudolph NN, Kondor IR, Borgwardt KM: Graph kernels. J Mach Learn Res. 2010, 11: 1201-1242.
Google Scholar
Tsuruta S, Misztal I, Lawlor TJ: Changing definition of productive life in US Holsteins: effect on genetic correlations. J Dairy Sci. 2005, 88: 1156-1165. 10.3168/jds.S0022-0302(05)72782-X.
Article CAS PubMed Google Scholar
Gianola D, Okut H, Weigel KA, Rosa GJM: Predicting complex quantitative traits with Bayesian neural networks: a case study with Jersey cows and wheat. BMC Genetics. 2011, 12: 87-
Article PubMed Central CAS PubMed Google Scholar
Kimeldorf G, Wahba G: Some results on Tchebycheffian spline functions. J Math Anal Appl. 1971, 33: 82-95. 10.1016/0022-247X(71)90184-3.
Article Google Scholar
Strandén I, Christensen OF: Allele coding in genomic evaluation. Genet Sel Evol. 2011, 43: 25-10.1186/1297-9686-43-25.
Article PubMed Central PubMed Google Scholar
Meuwissen TH, Solberg TR, Shepherd R, Woolliams JA: A fast algorithm for BayesB type of prediction of genome-wide estimates of genetic value. Genet Sel Evol. 2009, 41: 2-10.1186/1297-9686-41-2.
Article PubMed Central PubMed Google Scholar
Ober U, Ayroles JF, Stone EA, Richards S, Zhu D, Gibbs RA, Stricker C, Gianola D, Schlather M, Mackay TF, Simianer H: Using whole-genome sequence data to predict quantitative trait phenotypes in Drosophila melanogaster. PLoS Genet. 2012, 8: e1002685-10.1371/journal.pgen.1002685.
Article PubMed Central CAS PubMed Google Scholar
Gianola D, Rosa GJ, Weigel KA, Crossa J, de los Campos G: Semi-parametric genomic-enabled prediction of genetic values using reproducing kernel Hilbert spaces methods. Genet Res. 2010, 92: 295-308. 10.1017/S0016672310000285.
Article Google Scholar
Kimeldorf G, Wahba G: A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Ann Math Stat. 1970, 41: 495-502. 10.1214/aoms/1177697089.
Article Google Scholar
Vert JP, Kanehisa M: Graph-driven features extraction from microarray data using diffusion kernels and kernel CCA. Proceedings of the Sixteenth Annual Neural Information Processing Systems (2002). Edited by: Becker S, Thrun S, Obermayer K. 2003, British Columbia, Canada: MIT Press, 1425-1432.
Google Scholar
Gärtner T: A survey of kernels for structured data. SIGKDD Explorations. 2002, 5: S268-275.
Google Scholar
Morota G, Valente BD, Rosa GJM, Weigel KA, Gianola D: An assessment of linkage disequilibrium in Holstein cattle using a Bayesian network. J Anim Breed Genet. 2012, 129: 474-487.
CAS PubMed Google Scholar
Crossa J, de los Campos G, Pérez P, Gianola D, Burguen̈o J, Araus JL, Makumbi D, Singh RP, Dreisigacker S, Yan J, Arief V, Banziger M, Braun HJ: Prediction of genetic values of quantitative traits in plant breeding using pedigree and molecular markers. Genetics. 2010, 186: 713-724. 10.1534/genetics.110.118521.
Article PubMed Central CAS PubMed Google Scholar
Gönen M, Alpaydın E: Multiple kernel learning algorithms. J Mach Learn Res. 2011, 12: 2211-2268.
Google Scholar
Evans LC: Partial Differential Equations. 2nd Edition. 2010, Providence, Rhode Island: American Mathematical Society
Google Scholar

Download references

Acknowledgements

The authors thank the associate editor and two anonymous reviewers for their valuable comments. This work was supported by the Wisconsin Agriculture Experiment Station and by a Hatch grant from the United States Department of Agriculture.

Author information

Authors and Affiliations

Department of Animal Sciences, University of Wisconsin-Madison, Madison, WI, USA
Gota Morota, Guilherme J M Rosa & Daniel Gianola
Department of Mathematics, University of Wisconsin-Madison, Madison, WI, USA
Masanori Koyama
Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, USA
Guilherme J M Rosa & Daniel Gianola
Department of Dairy Science, University of Wisconsin-Madison, Madison, WI, USA
Kent A Weigel & Daniel Gianola

Authors

Gota Morota
View author publications
You can also search for this author in PubMed Google Scholar
Masanori Koyama
View author publications
You can also search for this author in PubMed Google Scholar
Guilherme J M Rosa
View author publications
You can also search for this author in PubMed Google Scholar
Kent A Weigel
View author publications
You can also search for this author in PubMed Google Scholar
Daniel Gianola
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Gota Morota.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

GM conceived, carried out the study, and drafted the manuscript; MK provided mathematical assistance and helped to derive the SNP grid kernel; GJMR and KAW provided critical insights and revised the manuscript; DG supervised the study and revised the manuscript. All authors read and approved the final manuscript.

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Morota, G., Koyama, M., M Rosa, G.J. et al. Predicting complex traits using a diffusion kernel on genetic markers with an application to dairy cattle and wheat data. Genet Sel Evol 45, 17 (2013). https://doi.org/10.1186/1297-9686-45-17

Download citation

Received: 06 December 2012
Accepted: 31 May 2013
Published: 13 June 2013
DOI: https://doi.org/10.1186/1297-9686-45-17

Predicting complex traits using a diffusion kernel on genetic markers with an application to dairy cattle and wheat data

Abstract

Background

Results

Conclusions

Background

Methods

Data

Kernel ridge regression

Additive genomic relationship kernels

Gaussian kernel

Non-Euclidean metric space

Diffusion kernel on a non-Euclidean metric space

Diffusion kernel indexed by observed covariates

Diffusion kernel for binary genotypes

Combining SNP grid kernels and hypercube kernels

Bayesian treatment of kernel ridge regression

Assessment of predictive ability

Results

Discussion

Conclusions

Appendix A

Connection between a diffusion kernel and a Gaussian kernel

Appendix B

Proof of equation (13)

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Competing interests

Authors’ contributions

Authors’ original submitted files for images

Authors’ original file for figure 1

Authors’ original file for figure 2

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Genetics Selection Evolution

Contact us