Predicting complex traits using a diffusion kernel on genetic markers with an application to dairy cattle and wheat data

Background Arguably, genotypes and phenotypes may be linked in functional forms that are not well addressed by the linear additive models that are standard in quantitative genetics. Therefore, developing statistical learning models for predicting phenotypic values from all available molecular information that are capable of capturing complex genetic network architectures is of great importance. Bayesian kernel ridge regression is a non-parametric prediction model proposed for this purpose. Its essence is to create a spatial distance-based relationship matrix called a kernel. Although the set of all single nucleotide polymorphism genotype configurations on which a model is built is finite, past research has mainly used a Gaussian kernel. Results We sought to investigate the performance of a diffusion kernel, which was specifically developed to model discrete marker inputs, using Holstein cattle and wheat data. This kernel can be viewed as a discretization of the Gaussian kernel. The predictive ability of the diffusion kernel was similar to that of non-spatial distance-based additive genomic relationship kernels in the Holstein data, but outperformed the latter in the wheat data. However, the difference in performance between the diffusion and Gaussian kernels was negligible. Conclusions It is concluded that the ability of a diffusion kernel to capture the total genetic variance is not better than that of a Gaussian kernel, at least for these data. Although the diffusion kernel as a choice of basis function may have potential for use in whole-genome prediction, our results imply that embedding genetic markers into a non-Euclidean metric space has very small impact on prediction. Our results suggest that use of the black box Gaussian kernel is justified, given its connection to the diffusion kernel and its similar predictive performance.


Background
Prediction of yet-to-be observed phenotypes for complex quantitative traits in agricultural species [1,2] or for disease status in medicine [3] exploits connections between phenotypes, genealogies, and DNA variations potentially representing functional diversity of organisms. Systems biology approaches have uncovered abundant epistasis in model organisms, including the mouse and the rat [4], Drosophila melanogaster [5], and Saccharomyces cerevisiae [6]. In this context, Loewe [7] proposed an evolutionary systems biology framework for arriving at a better understanding of molecular interactions, given that *Correspondence: morota@ansci.wisc.edu 1 Department of Animal Sciences, University of Wisconsin-Madison, Madison, WI, USA Full list of author information is available at the end of the article epistatic interactions between mutations are commonly observed. Therefore, it seems reasonable to argue that genotypes and phenotypes may be connected in forms that are not well addressed by the linear additive models that are standard in quantitative genetics. Bayesian regularized parametric linear additive smoothers, e.g., [8,9] may not be fully adequate for capturing genetic signals under epistatic scenarios [10,11]. Furthermore, attempts to account for epistasis by including interactions in a linear model produces a highly parameterized model structure, possibly yielding a poor predictive ability in cross-validation, and which does not scale well if high-order interactions are included in the model.
Genetic risk prediction in medicine relies on using genomic information to predict the chance of contracting a disease, for example, in personalized medicine for preventive treatment and clinical health care. Prediction http://www.gsejournal.org/content/45/1/17 of genetic risk derived from pre-selected marker variants is mainstream in this domain, as opposed to prediction based on fitting whole-genome markers simultaneously, as done with great success in animal and plant breeding [8][9][10][11]. However, the variants detected in this way are typically not useful for genetic risk prediction, because they explain only a small fraction of the total genetic variance as estimated from covariances between relatives, for example using twin and family studies. Moreover, it has been shown that a large number of variants that do not reach genome-wide statistical significance contribute to the total additive genetic variance [12].
Development of statistical models to predict phenotypic outcomes from all available molecular information that are capable of capturing complex genetic network architectures is therefore important. Arguably, a good predictive model should account for most of the genetic variability, as well as reflect the underlying genetic architecture properly. Also, a predictive model should be flexible with respect to type of input data, e.g., high-throughput chip-based genotypes or whole-genome sequences, and mode of gene action.
An appealing alternative is provided by a kernel-based parametric method known as BLUP (Best Linear Unbiased Prediction) of genetic effects, developed in the 1950's by C. R. Henderson, an animal breeder [13]. BLUP can also be viewed as a regression of a phenotype on a pedigree-based relationship matrix A (when the model is additive), and it has been used for genetic improvement of livestock species for decades. This method was recently extended to incorporate SNP (single nucleotide polymorphisms) by replacing A by a genomic relationship matrix G [14], although there is no impediment to using A and G together [15]. BLUP is suited for handling a massive amount of genetic information, because the computational burden can be proportional to the number of data points rather than the number of predictor variables (e.g., markers), and this is particularly true if a common weight is assigned to a each marker. Recently, kernelbased non-parametric models e.g., [15][16][17][18] have been proposed. A non-parametric treatment can accommodate nonlinear dependencies of phenotypes on predictor variables without explicitly modeling them. This suggests that these procedures can potentially pick up various forms of gene action without posing richly parametrized structures that require making strong distribution and genetic architecture assumptions a priori [10,15]. For example, Long et al. [16] used a computer simulation and found that the predictive ability of a non-parametric smoother was superior to that of a parametric linear counterpart when non-additive effects were strong. These authors also gave evidence that non-parametric smoothing is competitive to linear smoothing, even when additivity accounts for most of the total genetic variability.
Kernel ridge regression [19], a kernel generalization of standard ridge regression [20], is also a nonparametric smoothing method. Ridge regression has received some attention in quantitative genetics in the context of mixed linear models [10,15,[21][22][23][24], and the non-parametric version is carried out by constructing a spatial distance-based relationship matrix called kernel, as opposed to using additive genomic relationship kernels, A or G, which only embed correlations due to additive genetic effects. The choice of a kernel is equivalent to modeling covariance structure among individuals, and phenotypes are regressed on this kernel to obtain estimates of non-parametric regression coefficients.
A simulation study [18] found that in the presence of non-additive effects, a spatial distance-based kernel can outperform an additive genomic relationship kernel in predictive performance, but this has not been explored thoroughly with real data. Furthermore, while the set of all SNP genotype configurations on which a model is built is finite, past research has employed spatial distancebased kernels with infinite, unbounded domains, such as the Gaussian kernel. Our first objective in this study is to compare a spatial distance kernel with a non-spatial distance kernel. Secondly, we assess the performance of a non-Gaussian spatial distance kernel by deploying kernels on graphs as the choice of a basis function, a procedure that is suitable for discrete input data structures. Instead of encoding SNP data in a continuous Euclidean space, as in the case of the Gaussian kernel, we investigated kernels on a non-Euclidean space. We examined a diffusion kernel proposed by Kondor and Lafferty [25], Smola and Kondor [26] and Lafferty and Lebanon [27], which is a kernel defined for functions on discrete spaces, such as a graph. A brief review on 'kernels on graphs' is given by [28], and "graph kernels" are discussed in [29]. As shown later, the diffusion kernel can be viewed as a discretization of the Gaussian kernel. We also tested the sensitivity of applying the same bandwidth parameter to autosomes and allosomes in the spatial distance kernels. This paper investigates the use of several kinds of kernels in a kernel ridge regression framework for genomeassisted prediction of quantitative traits. Two data sets representing dairy cattle and wheat were employed for this purpose. The paper is organized as follows. In the Methods section, we describe the data and introduce basic notions of kernel ridge regression. We then apply the diffusion kernel to strings of dummy variable marker sequences; the motivation of the non-Euclidean metric space is followed by an introduction of the diffusion kernel. In the next section, main results are presented. Finally, in the Discussion section, we address the implication of our results and make concluding remarks. http://www.gsejournal.org/content/45/1/17

Data
Dairy cattle and wheat data were used. The dairy data was provided by the USDA-ARS Animal Improvement Programs Laboratory (Beltsville, MD) and represented 7902 Holstein bulls, each with 43 134 SNPs (minor allele frequency > 0.025) spanning across the whole genome. The target response variable was progeny test predicted transmitting ability (PTA) of productive life (PL). PL is a measure of the observed length of time that a cow stays in the herd, from first calving to culling, and PTA is an estimate of half of the breeding value of a bull, which is a smoothed average assuming additive inheritance. PL is lowly heritable, with heritability estimated at 0.1 [30]. The genotype for each of 42 438 SNP loci on autosomes was coded as 0 (homozygous for allele "a"), 1 (heterozygous), or 2 (homozygous for allele "A"), according to the number of copies of the "A" allele. The remaining 696 loci on the X chromosome were coded as either 0 or 2, representing absence or presence of the "A" allele, respectively. Missing genotypes, due to either low call rates for some SNPs or poor DNA quality, were imputed via random sampling of genotypes with probabilities corresponding to observed genotype frequencies at each locus. Note that other more precise methods are available but were not used here.
The wheat data included 599 inbred lines collected by the International Maize and Wheat improvement Center in Mexico (CIMMYT). Each line was genotyped with 1279 Diversity Array Technology (DArt) markers generated by Triticarte Pty. Ltd. These binary markers take the form of presence (1) or absence (-1). The phenotype here was average grain yield of each line in the first out of four environments represented in the data set, scaled to have zero mean and variance one. Missing genotypes were imputed as for the cattle data above. This data set has been also analyzed with support vector regression and neural network methods [17,31].

Kernel ridge regression
Our goal is to predict an unobserved response y, for example PL in R from a vector of genotypes x at a large number of SNP loci; when p SNP are considered, x is in Z p 3 . To this end, we would like to establish a function g : Z p 3 → R mapping sequences of SNP genotypes onto the real line. A general setting is: where y i is a response variable on case i(i = 1, 2, . . . , n), x i is a p × 1 vector of genotypes obtained on i, g(x i ) is a genetic effect interpretable as the conditional expectation function g(x i ) = E(y i |x = x i ), and i is a residual.
We use kernel ridge regression to infer the unknown function g, and select an appropriate kernel K via a reproducing kernel Hilbert space H (RKHS) of functions on Z p 3 , and optimize: with respect to g, where the first term is the residual sum of squares, g 2 H is the squared norm of g under a Hilbert space, and λ is a regularization parameter. The representer theorem [32] is used to find the optimal g.
In non-parametric regression, the search space is infinite, but the representer theorem allows confining the search to a specific set of functions. It has been shown [10,15,24,32] that the optimizer will be in the span of the functions indexed by the observed covariates, and that the problem simplifies to optimization of: } is a n × n symmetric positive (semi) definite matrix; α is an unknown n × 1 vector of non-parametric regression coefficients; and g = Kα, is the function that minimizes (1). By properties of a reproducing kernel, Kα 2 H = α Kα, so that the function to be minimized with respect to α is: This is equivalent to writing: and then maximizing a penalized likelihood. This penalized likelihood is obtained by assuming that p(y|α, σ 2 e ) and that α follows N(0, K −1 σ 2 α ), where σ 2 e is the variance of the residuals, and σ 2 α is a variance component. Next, we review additive genomic relationship kernels and the Gaussian kernel, and then present how one can build a kernel on a graph with discrete inputs. Hereafter, we denote K as the kernel matrix indexed by the observed covariate; and K(i, j) indicates particular elements of K; K is the infinite dimensional Gaussian kernel, or the 3 p × 3 p dimensional kernel for the diffusion kernel.

Additive genomic relationship kernels
Two types of additive genomic relationship kernels were tested in this study. First, an additive genomic relationship matrix (G1) was constructed following VanRaden [14] as: where Z = {Z ij } is a n × p matrix of centered SNP marker codes, with the entry for ith individual and the jth marker taking the form Here p j is the frequency of allele "A" computed from a base population. The denominator of G1 is a scaling parameter. In practice, the allele frequencies are estimated from the data at hand, but this yields semi-positive definite matrices as discussed by Strandén and Christensen [33]. A second additive genomic relationship matrix (G2) was also as in VanRaden [14] where W is a matrix of standardized genotypes [34] with its jth column being where z .j is the jth column of Z and p represents the number of SNPs.
Since the Holstein data set led to non-positive G1 and G2 matrices because of numerical issues, G i (i = 1, 2) was modified to G * i = 0.95G i + 0.05I, yielding G * matrices that provided valid kernels. This may also avoid numerical instability in the eigenvalue decomposition of the kernel as explained later. The wheat data produced semi-positive definite genomic relationship kernel matrices. The kernel G1 has been applied to several inbred line populations in the past, e.g., Ober et al. [35].

Gaussian kernel
In a Gaussian kernel, the distance between a pair (i, j) of genotypes is represented as a squared Euclidean norm. Given a positive bandwidth parameter θ, the kernel takes the form A small Euclidean distance between two individuals reflects a strong similarity in state between their genotypes. On the one hand, as θ increases, the kernel evaluation approaches K(x i , x j ) = 0, producing a "sharp" or "local" kernel. On the other hand, as θ → 0, the kernel approaches 1, that is, a situation where the two individuals "match" perfectly, providing a "global" kernel.

Non-Euclidean metric space
The SNP data on p loci on some individual often come as , which is clearly a discrete space, as there are 3 p possible configurations of genotypes (not all of which are observable). Before defining the diffusion kernel, consider the meaning of 'diffusion on a graph' . Suppose p = 1, and consider a function k x that measures the spread of 'influence' of the genotype at this locus over the other possible genotypes by assuming that the 'influence' diffuses like heat dissipates. Let kx(0, x) = 1 x=x (x), be the indicator function for genotypex on Z 3 . We call this the time 0 diffusion, since in this casex has absolutely no influence on other genotypes; that is, the influence ofx does not diffuse out to its neighbors. Now, define the time t diffusion of the 'influence' of genotypex on genotype x to be: where α is a constant rate of diffusion and each summand is the differential gradient of the 'influence' between genotypes x and x . This is illustrated in Table 1. As stated above, there is no diffusion at t = 0. Subsequently, the time 1 diffusion with α = 0.1 whenx = 1 is computed as: As shown in Table 1, as t increases the 'influence' spreads over all genotypes more evenly; also, the larger α is, the faster the diffusion is with respect to time t.
Writing (3) in vector form, with kx(t, x) = kx(t), we get: where I is a 3 × 3 identity matrix; kx(0) is a constant 3 × 1 matrix of initial values, and with the first, second, and third rows of the H matrix corresponding to k 0 , k 1 , and k 2 respectively. The negative of this matrix is called the Laplacian of a graph , given by: (6) http://www.gsejournal.org/content/45/1/17 Table 1 Example of diffusion on a graph Let be an undirected graph with vertex set V ( ). In general, the Laplacian of a graph , L( ), is a V ( ) dimensional square matrix given by where A is an adjacency matrix and is a diagonal matrix with ii = n j=1 A ij . We can therefore generalize this 'diffusion' for any graph by using H( ) = −L( ) . Under this definition, given any V ( ) dimensional vector w, The most naive way of constructing a graph on Z p 3 is a Hamming graph. For the case p = 1, a Hamming graph is simply a complete graph of size 3, and has the form On this graph, the distance from genotype 0 ('aa') to genotype 2 ('AA') is the same as that from 0 ('aa') to 1 ('Aa'). Since genotype 'aa' has no copies of the 'A' allele, it may be more reasonable to assume that genotype 'Aa' is closer to 'AA' , which has two copies of the ' A' allele. This can be viewed from a mutational perspective as well. Genotype 0 ('aa') requires two mutations to become genotype 2 ('AA'), while genotype 1 ('Aa') requires only one mutation. Thus, the graph of interest would be given by (6). The latter is a path graph for SNP data, which will be taken as a minimal basis for our graph. In a path graph, all vertices are on a straight line, as in (6).
A SNP grid of p loci is a p dimensional grid with vertices in Z For example, the graph below is the grid for 2 loci derived from the Cartesian graph product of two path graphs as in (6): The graph Laplacian for graph (8) is a square matrix of dimension 3 2 × 3 2 : where the subscripts denote the vertices of graph (8). When there are p loci, the p-dimensional grid graph has 3 p vertices corresponding to sequences of genotypes, such that two vertices are adjacent if and only if just one SNP locus differs by 1. Now, suppose p = 3. The Cartesian graph product of (6) and (8) yields a 3 dimensional grid graph with 3 3 vertices, as shown in Figure 1. The diffusion kernel computes a similarity between two vertices on this graph, and projects this information into a more interpretable space.

Diffusion kernel on a non-Euclidean metric space
Consider now the continuous analog of the diffusion scheme above. This can be done by making 'time' or 'space' continuous, and 'time' will be made continuous first. Let α = θh (θ > 0) and t = 1/h. By using a small h, we can http://www.gsejournal.org/content/45/1/17 achieve a discretization of the 'diffusion time' on a much finer scale, and the coefficient matrix is: If an infinitesimal scale is considered by taking h → 0, (9) converges to: If a graph with a Laplacian L( ) is considered, then exp(−θL( )) is called the diffusion kernel or heat kernel for graph , where θ is a rate of diffusion [25]. Here, putting K = exp(θH) and taking the derivative with respect to θ gives: which is a discrete diffusion equation (heat equation) on a graph with H = −L( ). Note that diffusion kernels always need to be associated with a graph. A Gaussian kernel is obtained by making this diffusion kernel "space" continuous. The connection between the two kernels is provided in Appendix A.

Diffusion kernel indexed by observed covariates
When a graph is large and asymmetric, the computation of the diffusion kernel K( ) can be extremely difficult. For instance, for a SNP grid with 43 134 loci, the dimension of K is 3 43134 by 3 43134 . However, symmetry helps. If a closed form of K can be arrived at, there is no need to compute K(x, x ) for all pairs of genotype sequences x, x . This is indeed the case for the Gaussian kernel, where the dimension of K is infinite. With Kondor and Lafferty's result given in [25], we may obtain the closed form of the diffusion kernel from the sample for our SNP grid.
First, one needs to consider the Cartesian graph product for the diffusion kernel of a graph. Let K 1 (θ) and K 2 (θ) be the kernels for graphs 1 and 2 , respectively. The diffusion kernel for = 1 2 is given by [25]: where denotes the Cartesian graph product and ⊗ is the tensor product (infinite dimensional Kronecker product). Consider a graph with one locus, 0 , with form 0 − 1 − 2.
Then, we see that the diffusion kernel of the SNP grid on p loci with bandwidth parameter θ is given by: To this end, we just need to compute K θ ( 0 ) = exp(θH) with H in (5).
With this result, one can create the H matrix for a SNP grid as follows. Let x and x be SNP data for p loci; n s be http://www.gsejournal.org/content/45/1/17 the number of loci for which |x i − x i | = s, and m 11 be the number of loci for which x i = x i = 1. In other words, n 1 is the number of loci at which two individuals differ by 1, and m 11 is the number of loci at which two individuals share heterozygous states. Then: with proportionality constant (e −3θ + 3e −θ + 2) q , where q = n 1 + n 2 + m 11 . The last term is a contribution from heterozygosity. We refer to this as SNP grid kernel, specifically developed to model SNP data in this study. A proof of this result is given in Appendix B.

Diffusion kernel for binary genotypes
Another diffusion kernel tailored for binary genotypes is required for the chromosome X of sires or for the wheat inbred lines. In this setting, instead of (6), the path graph for one locus (p = 1) is: 0 − 2 and the corresponding graph Laplacian is given by: = 1 −1 −1 1 , as opposed to (5). For two loci (p = 2), the Cartesian product of graphs 1 (0−2) and 2 (0−2) yields the graph: 00 − 01 | | 10 − 11 (15) where the first digits ∈ V ( 1 ) and the second digits ∈ V ( 2 ). Then, the associated graph Laplacian is: where the subscripts denote the rows and columns of vertices of graph (15). Specifically, we compute K θ = exp(θH) with H defined in (14) and perform the tensor product p times. With this, the kernel is given by: where d(x, x ) is the Hamming distance, that is, number of coordinates at which x and x differ [25]. Following Kondor and Lafferty [25], this diffusion kernel for binary markers will be referred to as the hypercube kernel.

Combining SNP grid kernels and hypercube kernels
In addition in the Holstein data, we combined the two kernels derived from autosomes and from chromosome X to see the influence of applying the same value of the bandwidth parameter to different types of chromosomes. This is given by: where # is a Hadamard product of matrices. In general, given a set of n individuals, we may partition SNP into several subsets, say x = (x 1 , x 2 , ...., x r ). If K q is the diffusion kernel corresponding to subset x q , then the diffusion kernel for all sets can be computed as: This result also holds for the Gaussian kernel, but not necessarily for every kernel, e.g., the exponential kernel defined with the Euclidean distance (||x i − x j ||) does not hold this property.

Bayesian treatment of kernel ridge regression
Once the choice of the kernel is determined, (2) can be maximized by taking the derivative of (α) with respect to α to obtain: where λ is a regularization parameter. Here, implementation of kernel ridge regression was cast in a Bayesian framework with λ = σ 2 σ 2 α , where σ 2 and σ 2 α are the residual variance and the variance attached to α respectively. Then, note that [36,37]: This is proportional to p(α|y, σ 2 that is, the posterior density of α (given σ 2 e and σ 2 α ) for the linear model: with ∼ N(0, Iσ 2 e ) and with prior α ∼ N(0, K −1 σ 2 α ). Minimizing (α) will maximize exp(− 1 2 (α)), soα is the conditional posterior mode of α. One may change the basis K using the eigenvalue decomposition K = , where is the matrix of eigenvectors of K and is a diagonal matrix in which diagonals are the eigenvalues, as http://www.gsejournal.org/content/45/1/17 shown in de los Campos et al. [36], such that, for δ = one gets, in a fully Bayesian model, Once a prior is assigned to σ 2 e and σ 2 α , a Monte Carlo Markov Chain (MCMC) scheme can be used to infer all unknown parameters, including λ. Scaled inverse chisquare prior distributions were assigned to σ 2 e and σ 2 α , each with 3 degrees of freedom and a scale parameter equal to 1. Samples from posterior distributions were obtained by the Gibbs sampler in [36], and each of the analyses was based on 100 000 MCMC samples with the first 60 000 samples discarded as burn-in. After burn-in, samples were thinned at a rate of 10, resulting in 4000 mildly correlated samples for posterior inference. Convergence was monitored by inspecting trace plots of the variance parameters. A bandwidth parameter θ yielding high predictive ability is needed as well. However, sampling of the bandwidth parameter in MCMC sampling requires computation of kernels at each iteration, which is very demanding given the number of individuals and SNP considered in our study. For this reason, evaluation of the diffusion kernel was performed over a fixed grid of values of θ. The range of θ considered provided average values of K(x, x ) that were evenly spaced, approximately, between 0.13 to 0.99. Computation of kernels and Gibbs sampling was carried out in Fortran and in R, respectively.

Assessment of predictive ability
The predictive ability of RKHS models with either a diffusion kernel or a Gaussian kernel was assessed by crossvalidation. A subset of 5403 bulls born from 1952 through 2003 was used as the training set for the Holstein data. A testing set of 2499 bulls born from 2004 through 2006 was used to evaluate predictive ability. For the wheat data, a 10 fold cross-validation scheme was applied by assigning 599 lines randomly to one of 10 disjoint subsets. Each set was used for validation in turn, while the other 9 subsets were used to train the model. To illustrate, we estimated α in the Holstein data using the training set y = (y 1 , · · · , y 5403 ) and their corresponding SNP genotypes x 1 , · · · , x 5403 , and then predicted responses in the testing set as: whereŷ test is the 2499 × 1 vector of predicted responses of bulls in the testing set; 1 is a 2499 × 1 vector of ones;μ train is the posterior mean of the intercept estimated from the training set; K test↔train is a 2499 × 5403 matrix with elements k(j, i) test↔train representing the allelic similarity between bulls in the testing (j = 1, . . . , 2499) and training (i = 1, . . . , 5403) sets, with the same bandwidth parameter employed in the training set, andα train is the vector of posterior means of 5403 non-parametric regression coefficients obtained from the training set. In the equation above, K test↔train was either the diffusion or the Gaussian kernel.
In a Bayesian setting, however, one can embed all the above steps in a convenient way. Prior to Gibbs sampling, first we construct a full kernel matrix containing both training and testing data sets. We treat the responses of testing set individuals as unobserved, and these values are predicted via a predictive distribution. This is easy to incorporate in the Gibbs sampling scheme. Pearson's correlation between the predicted values (mean of the predictive distribution) and the observed PTA, Cor(ŷ test , y PTA ), and predictive mean-squared error (MSE) defined as 2499 i=1 (ŷ test i − y PTA i ) 2 /n were computed to evaluate the predictive ability of the two kernels. Here,ŷ test i is the mean of the predictive distribution of response i in the testing data set, which is the ith element of the K test↔trainαtrain .

Results
To illustrate the effect of the bandwidth parameter (θ) on the SNP grid kernel, Figure 2 contains histograms showing how θ controls similarities among individuals based on evaluating the kernel on the SNP data. The larger θ is, the stronger the prior inter-correlation structure. It is important to note that the diagonal elements in our SNP grid kernel matrices are not necessarily equal to one, as opposed to what happens in a Gaussian kernel; here, K is a correlation matrix. Table 2 shows the average of diagonal, K(x i , x i ), and off-diagonal, K(x i , x j ), elements for diffusion, Gaussian and two additive genomic relationship kernels at varying bandwidth values. The mean values of the diagonal elements of the four diffusion kernels shown in Figure 2 (see Table 2) were 0.369, 0.693, 0.874, and 0.952 for θ = 10, 11, 12, and 13, respectively. This is because in equation (13), even when x = x , so that n 1 = n 2 = 0, m 11 (the number of ' Aa' genotypes shared by x and x ) may not be zero. This implies that our diffusion kernel accounts for the degree of heterozygosity in a sample. From the perspective of the kernel as a smoothing function, the diffusion kernel performs smoothing for all elements based on heterozygosity as well as allelic similarity. As explained below, the larger the heterozygosity, the weaker the smoothing, leading to a smaller penalty; this is not the case, however, in the Gaussian kernel. In the kernel computation, each factor in (13) is < 1, and in particular, the factor corresponding to m 11 is the largest. Henceforth, if the sample contains few heterozygotes, our K will be large. Consequently, the penalty from the opti-http://www.gsejournal.org/content/45/1/17 mizer function f, ||f H || = α T Kα, will tend to be large. This is interpretable as imposing stronger smoothing for samples with low heterozygosity. As for the "correlation" with itself, an individual with low heterozygosity will have diagonal elements close to one, as in the case of a Gaussian kernel. Therefore, in addition to the 'distance' between genotypes of two individuals, the diffusion kernel takes into account the extent of heterozygosity, while the Gaussian kernel incorporates only the former. Also, the two kernels differ in their definition of distance. The diffusion kernel on the SNP grid is based on the Manhattan distance, while the Gaussian kernel is defined on the Euclidean distance. The Manhattan distance is the distance between two points measured by the the sum of the absolute differences of their coordinates. As shown in Table 2, the average of off-diagonal elements of the diffusion kernel was smaller than that of diagonal elements. This is because the first two terms of (13) will be different from zero (n 1 , n 2 > 0) for a pair of individuals. Diffusion kernel evaluations between an individual and itself were always larger than kernels evaluated between pairs, that is, diagonal elements had the largest values for each row of K. In the Gaussian kernel, diagonal elements are always equal to 1 and a smaller θ value produces a stronger prior correlation. The first type of additive genomic relationship kernel (G1 * ) had the average diagonal and off-diagonal elements close to 1 and 0, respectively, as expected. Similarly, G2 * had an average off-diagonal close to 0 but it had smaller average diagonal elements than those of G1 * .
The right-most columns of Table 2 give the evaluation of predictive ability of the kernels measured as the correlation between predicted values and observed PTA, and MSE of prediction, for several different bandwidth parameters (G1 * and G2 * do not involve this parameter). The predictive correlation of the diffusion (SNP grid) kernel was best at θ = 11, while with the Gaussian kernel this was achieved at θ = 10 −5 . Although the averages of diagonal and off-diagonal elements varied substantially with different bandwidth parameters in the diffusion and Gaussian kernels, the influence of this variability on predictive correlations was small. Importantly, no major difference was observed between the diffusion and the Gaussian kernels in terms of predictive performance. Differences among kernels were very minor, probably due to the fact that the response (PTA) is already a smoothed mean based on a large number of daughters of each bull. There was consistency between the correlation and the MSE, in the sense that the value of θ with the highest predictive correlation had the smallest MSE. http://www.gsejournal.org/content/45/1/17 Averages of diagonal k(x i , x i ) and off-diagonal k(x i , x j ) kernel elements, predictive correlation, and mean-squared error of prediction (MSE) for the diffusion, Gaussian, and two additive genomic relationship kernels (G1 * and G2 * ) with different values of the bandwidth parameter θ for the Holstein data. Values in parentheses were obtained by combining the SNP grid and the hypercube kernels by applying a same bandwidth parameter. G1 * and G2 * do not involve bandwidth parameters. The best prediction within the same kernel is underlined.
Predictive performance of G1 * was only slightly worse than that of the spatial distance kernels with the best bandwidth parameters. Values in parentheses in Table 2 were obtained by combining the SNP grid kernel from autosomes and the hypercube kernel from allosomes by applying the same bandwidth parameter. Incorporation of X-chromosome information reduced the average off-diagonal elements slightly and deteriorated predictive performance to some extent. The average diagonal and off-diagonal elements remained the same in G1 * and G2 * , but a minor reduction in their predictive abilities was observed.
In the wheat data, the superiority of the spatial distance-based kernels over the additive genomic relationship kernels was clear. Table 3 indicates that the diffusion and Gaussian kernels had the best predictive correlations (MSE) at 0.586 (0.685) and 0.582 (0.686), respectively, whereas those of G1 and G2 were 0.518 (0.709) and 0.521 (0.708). This is likely due to picking up non-additive genetic variation that this wheat data harbors. With binary markers, the diagonal elements of the diffusion kernel are always 1, since in equation (16) the Hamming distance d(x, x ) is always zero. As with the Holstein data, no apparent difference was observed between the diffusion and the Gaussian kernels.

Discussion
Arguably, relationships between phenotypes and genotypes may be non-linear and complex [10,15,31]. For this reason, ignoring non-additive effects such as dominance and epistasis in a model may lead to inferior predictive ability of individual phenotypes.
A spatial distance-based kernel non-parametric regression is capable of mapping genotypes to phenotypes Average of diagonal k(x i , x i ) and off-diagonal k(x i , x j ) kernel elements, predictive correlation, and mean-squared error of prediction (MSE) for the diffusion, Gaussian, and two additive genomic relationship kernels at different values of the bandwidth parameter θ for the wheat data. The predictive correlation and the MSE were obtained from a 10-fold cross-validation. Additive genomic relationship kernels (G1 and G2) do not involve bandwidth parameters. The best prediction within the same kernel is underlined. http://www.gsejournal.org/content/45/1/17 in a way that accurately reflects underlying, albeit unknown, relationships. These kernel methods incorporate non-linearity of a predictor set x through a nonlinear transformation of x, subsequently allowing analysis of the response in terms of features φ(x) in a linear way. This is particularly useful when the response has a linear relationship with respect to the parameters, but is nonlinear on covariates, such as in the case of polynomial regression.
The predictive ability of kernel-based genetic models depends on the choice of a kernel and associated bandwidth parameter(s). If the two data points lie in the real line, x, x ∈ R, it seems reasonable to compute their distance in terms of Euclidean distance. However, SNP genotypes, coded as dummy variables, take a discrete form. Therefore, it may be worthwhile to consider a kernel designed to capture the discrete structure of the input variables. The best predictive kernel and its performance may vary depending on the underlying genetic architecture, quantitative trait loci (QTL) numbers and distribution of effects, data set used, and kernel method applied. Here, we investigated the use of ridge regression with a diffusion kernel to assess if this would enhance predictive ability over that of the Gaussian kernel and two additive genomic relationship counterparts. Kondor and Lafferty [25] obtained promising results when the diffusion kernel was compared with several kernels in classification problems with a set of discrete predictors, and this kernel has been used in a microarray-based gene function prediction problem [38]. Ober et al. [18] used the Matérn covariance function, which contains the Gaussian and the exponential kernel as particular cases. Therein, the smoothing parameter controls the actual form of a kernel, and this is directly driven by sample data. Although they obtained a Gaussian form as a choice of the covariance function, the Matérn function is bounded by the Euclidean norm by definition, which may not be suited for discrete genomic data.
A strength of kernels for structured data is their ability to address similarities between two data points x, x / ∈ R [39]. The diffusion kernel defines the distance between two data points on graphs, namely vertices, and projects this information into a more interpretable space. As shown in the context of modeling linkage disequilibrium [40], various graph structures can be used to represent sets of discrete random variables, such as genotypes. Coupled with the representer theorem, the diffusion kernel allows casting underlying graph structures into a regression on the real line under a Hilbert space. The main idea behind this kernel is the matrix exponentiation of the graph Laplacian. The p-dimensional grid graph with vertices representing a vector of genotypes was chosen for the graph structure. Each grid conveys information on similarity in terms of the Manhattan distance. Two vertices x and x are connected if x i = x i for all i, except at one coordinate. In the Holstein data, with n = 7902 and p = 42 438, it is unlikely that any of two vertices present in our data are connected. However, what grid graphs embrace is how many "steps" separate a vector of genotypes observed in individual i from an observed vector of genotypes in individual j. The diffusion kernel and its associated graph structure are free of parametric structures. They are constructed without posing genetic architecture assumptions a priori. For illustration purposes, we used 0, 1, and 2 for allele coding, but these should be interpreted as mere strings. Kernel computation still remains the same no matter what allele coding method is adopted here. The parametric component in our study is the construction of the path graph in equation (6), but it is not relevant to gene action modes. This allows us to build a flexible non-parametric model without making strong assumptions a priori. This is appealing, because we seldom know the underlying genetic architectures of complex traits. As shown in past studies including animals [31] and plants [41], a non-parametric method stands out when prediction of phenotypes is the primal focus.
Our motivation for applying the diffusion kernel stemmed from the assumption that a non-Euclidean distance may be able to more clearly represent genomic similarities. We carried out a matrix exponentiation of two graph Laplacians created from two path graphs (one for SNP and one for binary markers) for this purpose. This yields a kernel based on the Manhattan distance accounting for the heterozygosity that two individuals share. The two spatial distance kernels resulted in better predictive performance than the two additive genomic relationship kernels in the wheat data. This agrees with the previous simulation study of Ober et al. [18], in which the Gaussian kernel outperformed G1 in the presence of non-additive effects. Superiority of the spatial distance kernels was less obvious in the Holstein data. This may be due to the phenotype we chose for this study, since the PTA response variable is a smoothed average using linear mixed models. Although the differences were small in cattle, two nonparametric kernels applied in this study outperformed the additive relationship kernels in two of the datasets used. This suggests that unknown cryptic genetic architectures are likely to be intrinsic to complex traits and, hence, kernels that can accommodate such structure yielded better predictions.
As for the difference between the diffusion and the Gaussian kernels in terms of predictive ability, the diffusion kernel had the highest predictive correlation and the lowest MSE with θ = 11 in the Holstein data, but the difference with the Gaussian kernel was negligible. The same result was seen in the wheat data. This implies that the Gaussian kernel is robust, even if it incorporates http://www.gsejournal.org/content/45/1/17 genotypes on the real line such as 1.25 or -12.3. Our objective to properly incorporate genotypes into a kernel had a small impact on predictive ability of yet-to-be observed phenotypes. Although the distance between genotypes is certainly not continuous, additional efforts to discretize the Euclidean distance may not be needed. Another possible reason might be that genotypes do not reside in the Euclidean or non-Euclidean spaces explored here, but in a manifold [27].
Incorporation of X chromosome genotypes for building a kernel led to a smaller average diagonal and off-diagonal elements (to some extent) in spatial distance kernels, but no change was observed in the additive genomic relationship kernels. In both types of spatial distance kernels, however, the predictive correlations were worse than when kernels were constructed purely from autosomes. This suggests that applying specific bandwidth parameters to autosomes and allosomes in the spatial distance kernels might be important. A similar decline in predictive performance was observed in the two additive genomic relationship kernels, which do not involve any bandwidth parameter. Further research is needed to investigate what produces this drop in predictive performance, although if no markers contribute to PL on chromosome X, this would add extra noise.
To our knowledge, this study involves one of the largest data sets employed for spatial kernel-based genomeenabled selection of agricultural species. The challenge here was computation of the diffusion kernel, rather than the Gibbs sampler. Approximately, it took four days to compute one diffusion kernel on a Linux workstation equipped with the Intel(R) Xeon(R) CPU E5450 3.00GHz and 16GB of RAM. The Gaussian kernel required slightly less time for building, but with several candidates over a grid of values of the bandwidth parameter θ, this was an expensive task for both kernels. One useful approach might be that of multiple kernel learning (MKL) [36,42], which uses a few kernels with different covariance structure in a single RKHS model. Finally, the SNP grid graph and the hypercube graph used in this study are naive graph structures for modeling discrete inputs. Perhaps developing a graph structure that is more suitable for SNP data might increase predictive correlations.

Conclusions
We investigated the performance of a diffusion kernel, which was specifically developed to model discrete marker inputs, using Holstein cattle and wheat data. On the one hand, the predictive ability of the diffusion kernel was similar to that of non-spatial distance-based additive genomic relationship kernels in the Holstein data, due to the fact that the response (PTA) is already a smoothed mean based on a large number of daughters of each bull, but outperformed the latter in the wheat data. On the other hand, only minor difference was observed between the diffusion and the Gaussian kernels in terms of predictive performance. Although the diffusion kernel as a choice of basis function may have potential for use in whole-genome prediction, the results of this study suggest that the simple Gaussian kernel is robust enough, and that the scope for enhancing predictive ability via kernel refinement may be limited. http://www.gsejournal.org/content/45/1/17 where g θ (x, x ) = G(x−x ) is called a Gaussian kernel with bandwidth θ. Thus, the Gaussian kernel is the 'space' continuous analogue of the diffusion kernel as described on the graph. This analogy works exactly the same in higher dimensions.