Genomic prediction based on data from three layer lines using nonlinear regression models
Genetics Selection Evolution volume 46, Article number: 75 (2014)
Abstract
Background
Most studies on genomic prediction with reference populations that include multiple lines or breeds have used linear models. Data heterogeneity due to using multiple populations may conflict with model assumptions used in linear regression methods.
Methods
In an attempt to alleviate potential discrepancies between assumptions of linear models and multipopulation data, two types of alternative models were used: (1) a multitrait genomic best linear unbiased prediction (GBLUP) model that modelled trait by line combinations as separate but correlated traits and (2) nonlinear models based on kernel learning. These models were compared to conventional linear models for genomic prediction for two lines of brown layer hens (B1 and B2) and one line of white hens (W1). The three lines each had 1004 to 1023 training and 238 to 240 validation animals. Prediction accuracy was evaluated by estimating the correlation between observed phenotypes and predicted breeding values.
Results
When the training dataset included only data from the evaluated line, nonlinear models yielded at best a similar accuracy as linear models. In some cases, when adding a distantly related line, the linear models showed a slight decrease in performance, while nonlinear models generally showed no change in accuracy. When only information from a closely related line was used for training, linear models and nonlinear radial basis function (RBF) kernel models performed similarly. The multitrait GBLUP model took advantage of the estimated genetic correlations between the lines. Combining linear and nonlinear models improved the accuracy of multiline genomic prediction.
Conclusions
Linear models and nonlinear RBF models performed very similarly for genomic prediction, despite the expectation that nonlinear models could deal better with the heterogeneous multipopulation data. This heterogeneity of the data can be overcome by modelling trait by line combinations as separate but correlated traits, which avoids the occasional occurrence of large negative accuracies when the evaluated line was not included in the training dataset. Furthermore, when using a multiline training dataset, nonlinear models provided information on the genotype data that was complementary to the linear models, which indicates that the underlying data distributions of the three studied lines were indeed heterogeneous.
Background
Genomic estimated breeding values (GEBV) are generally predicted by a regression model [1] trained on a set of animals with known phenotypes and genotypes for a dense marker panel that covers the genome [2]. Prediction accuracy of such models depends on several factors, of which the size of the training set is the most important. Several studies [2],[3] consistently claim that the biggest limitation for the accuracy of genomic prediction in livestock is the number of animals with both genotype and phenotype data. In most cases, however, the number of markers is substantially larger than the number of training samples. This means that genomic prediction typically has a small sample-to-size ratio, which is also known as an n << p problem [1]. A major disadvantage is that n << p may lead to severe overfitting, which may affect the accuracy of the predictions in a validation dataset. Dimension reduction [4],[5] could be an alternative approach to retain the most relevant information of the genotype data [6],[7] in a low-dimensional vector space.
Our study aimed at investigating a more straightforward and feasible approach to alleviate the n << p problem, which consists of enlarging the training set by using data from multiple populations. However, studies on across-breed genomic prediction using 50 k genotypes have shown that the use of a multibreed training dataset typically results in a limited or no increase in accuracy compared to using training data from a single breed [8]-[11]. Previous studies have hypothesized that in order to successfully combine training datasets of Holstein-Friesian and Jersey dairy cattle breeds, genotypes on at least 300 000 SNPs (single nucleotide polymorphisms) should be used [12].
Besides insufficient SNP density, another reason that may explain the limited increase in prediction accuracy observed when using multipopulation compared to single-population training data could be that the commonly used models cannot deal appropriately with heterogeneous multipopulation data. To date, all across-population genomic prediction studies have used linear models. These linear models generally assume that the effect of a SNP in one population is the same in another population. This assumption can be violated for several reasons. First, the linkage disequilibrium (LD) may differ between populations. Second, it is quite likely that at least some of the segregating QTL (quantitative trait loci) are population-specific. Third, the absolute effect of a QTL may differ between populations because of differences in genetic background. The assumption of linearity may be too restrictive for any of these situations, especially when using the common 50 k SNP chip. In fact, if differences between populations or lines are too large, predictive ability of across-breed genomic prediction with linear models may be lower than that of within-breed genomic prediction [13]. A few studies have proposed to use multitrait linear models [14]-[16], where trait by line combinations are modelled as separate but correlated traits, to try to accommodate these issues.
As an alternative solution, we propose to use nonlinear models by kernel learning [13],[17],[18]. The basic idea is to predict the breeding value of a test animal using a limited number of training animals with similar genotypes that do not necessarily come from a single population. By doing so, the entire heterogeneous data space spanned by genotypes is decomposed into a large number of locally homogeneous subareas [19]-[21], regardless of their population of origin. Such a model might be able to extract the useful information across populations. At the very least, the nonlinear models by kernel learning are expected to better capture the heterogeneous nature of the data compared to linear models.
The objective of this study was to investigate the accuracy of multiline genomic prediction using nonlinear models by kernel learning and a linear model that modelled trait by line combinations as separate but correlated traits, and to compare the prediction accuracy of these models to that of commonly used linear genomic prediction models presented by Calus et al. [22]. This comparison was performed with a dataset that included three lines of layer hens.
Methods
Linear regression
Linear regression models [23] have been widely used to implement genomic prediction [24]. In concrete terms, the ultimate goal of a regression task is to predict an unseen value y from a vector of observations/features x. In the scenario of genomic prediction, (x, y) corresponds to genotypes (x) and phenotypes (y) of n training individuals. Linear regression uses a linear function to map the observations x to the response value y by a vector w as the linear weights on x:
$$y = w^{t}x, \qquad (1)$$
where the weight vector w can be estimated using the training data. To best approximate the underlying functional relationship between x and y by Equation (1), ridge regression aims at minimizing the regularized quadratic loss (L) between the true response value y_{ i } and w^{t}x_{ i }:
$$L(w) = \sum_{i=1}^{n}\left(y_{i} - w^{t}x_{i}\right)^{2} + \gamma\,\|w\|^{2} = \|y - Xw\|^{2} + \gamma\,\|w\|^{2}. \qquad (2)$$
The vector y refers to a column vector [y_{1}, y_{2} ,…, y_{ n } ]^{t} that contains the phenotypes of all training animals, while the matrix X contains the genotypes of all training animals, with one row per animal. The norm of w is the regularization term. Adding it into the objective function alleviates the overfitting problem, which might be detrimental to prediction performance since the number of genotypes is generally much larger than the number of training samples. Parameter γ refers to the weight given to the regularization term.
Minimization of the loss function L by Equation (2) with regard to w results in the following estimate:
$$w^{*} = \left(X^{t}X + \gamma I\right)^{-1}X^{t}y. \qquad (3)$$
If the following matrix lemma [25] is applied:
$$\left(X^{t}X + \gamma I\right)^{-1}X^{t} = X^{t}\left(XX^{t} + \gamma I\right)^{-1},$$
the solution to w* can be reformulated to:
$$w^{*} = X^{t}\left(XX^{t} + \gamma I\right)^{-1}y. \qquad (4)$$
With this estimate, the prediction y* based on the test vector x_{ t } becomes:
$$y^{*} = w^{*t}x_{t} = y^{t}\left(XX^{t} + \gamma I\right)^{-1}Xx_{t}. \qquad (5)$$
These descriptions provide the basis for the development of the nonlinear models presented below. For comparison, we included two linear models, i.e. ridge regression based on principal component analysis (RRPCA) and genome-enabled best linear unbiased prediction (GBLUP) [26]. More detailed descriptions of these models, and the results obtained with these models on this data, are in [22].
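As an illustration of why the reformulated estimate matters when n << m, the following minimal sketch checks numerically that the m × m (primal) and n × n (dual) forms of ridge regression give the same weights and the same prediction. The simulated data, the 0/1/2 genotype coding and the value of `gamma` are assumptions made only for this example; they are not the data or settings of the study.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 20, 200                                       # toy sizes: n animals, m markers
X = rng.integers(0, 3, size=(n, m)).astype(float)    # genotypes, assumed 0/1/2 coding
y = rng.normal(size=n)                               # simulated phenotypes
x_t = rng.integers(0, 3, size=m).astype(float)       # genotype of one test animal
gamma = 1.0                                          # regularization weight (arbitrary)

# Primal form: solve an m x m system, w* = (X^t X + gamma I)^{-1} X^t y
w_primal = np.linalg.solve(X.T @ X + gamma * np.eye(m), X.T @ y)

# Dual form: solve an n x n system, w* = X^t (X X^t + gamma I)^{-1} y
alpha = np.linalg.solve(X @ X.T + gamma * np.eye(n), y)
w_dual = X.T @ alpha

# Predictions for the test animal from both forms
pred_primal = w_primal @ x_t
pred_dual = alpha @ (X @ x_t)                        # y^t (X X^t + gamma I)^{-1} X x_t
```

Both forms agree up to numerical precision, but the dual form only requires solving a system whose size is the number of training animals.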
Multitrait genome-enabled best linear unbiased prediction (MTGBLUP)
One of the disadvantages of linear regression is that the underlying data structures might not be well characterized by the linear weights. In genomic prediction, this implies that the estimated effects are not necessarily strictly additive genetic effects [17], and in the context of multibreed genomic prediction, this may be further interpreted as the true SNP effects not being the same in different breeds or lines. One straightforward approach to allow estimated SNP effects to differ between lines is to use a multitrait GBLUP (MTGBLUP) model that allows genetic correlations between the lines to differ from 1 [14]. The data available was not large enough to estimate these correlations; however, additional data was available on non-genotyped animals for each line. Therefore, pairwise genetic correlations between lines were estimated by applying REML (restricted maximum likelihood) [27] with a model that used the inverse of a combined pedigree and genomic relationship matrix [28] that included all three lines. Using this combined relationship matrix, the number of training records ranged from 24 906 to 27 896 across the three lines, while when only genotyped animals were considered, it ranged from 1004 to 1023. Using the estimated variance components, the MTGBLUP model was run using a G-matrix as described in [26], such that only genotyped animals were included in the reference population.
Nonlinear kernel regression
Another interpretation of the expectation that the underlying data structures across breeds or lines might not be well characterized by the linear weights is that the inherent mapping function might not be linear. To capture such data features, a common approach is to adopt a nonlinear function φ(·): x → φ(x). The nonlinear function results in new representations of genotypes that may be associated with both additive and non-additive effects [17],[29]. Accordingly, Equation (5) can be modified by replacing x by φ(x):
$$y^{*} = y^{t}\left(\Phi(X)\Phi(X)^{t} + \gamma I\right)^{-1}\Phi(X)\varphi(x_{t}), \qquad (6)$$
where Ф(X) contains the transformed genotypes using φ(x). Interestingly, the predictor does not necessarily depend on the mapping function φ(x) but on the inner products between the vectors φ(x) and φ(t), namely φ(x)φ(t)^{t}, as a result of the following terms in (6):
Ф(X)Ф(X)^{t}: the element of the resultant matrix in the i-th column and j-th row is φ(x_{ i })φ(x_{ j })^{t},
Ф(X)φ(x_{ t }): the i-th element of the resultant vector is φ(x_{ i })φ(x_{ t })^{t}.
This property implies that the design of the kernel function K(x, t) = φ(x)φ(t)^{t} is sufficient to give rise to the predictor without any knowledge on the mapping function φ(x):
$$y^{*} = y^{t}\left(K + \gamma I\right)^{-1}k, \qquad (7)$$
where K is a matrix with elements K(x_{ i }, x_{ j }), i, j = 1,2,…, n and k is a vector with elements K(x_{ i }, x_{ t }), i = 1,2,…, n.
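The kernel predictor described above can be sketched in a few lines. The function name `kernel_ridge_predict`, the simulated data and the value of `gamma` are hypothetical choices for illustration. As a sanity check, plugging in the linear kernel (K = XX^{t}, k = Xx_{t}) should reproduce ordinary ridge regression.

```python
import numpy as np

def kernel_ridge_predict(K, k, y, gamma):
    """Return y* = y^t (K + gamma I)^{-1} k for one or more test animals.

    K: (n, n) kernel matrix among training animals
    k: (n,) or (n, n_test) kernel values between training and test animals
    y: (n,) training phenotypes
    """
    n = K.shape[0]
    # Solve the n x n system rather than forming the inverse explicitly
    alpha = np.linalg.solve(K + gamma * np.eye(n), y)
    return alpha @ k

# Toy check with the linear kernel on simulated data (not the study data)
rng = np.random.default_rng(1)
X = rng.normal(size=(15, 40))
y = rng.normal(size=15)
X_test = rng.normal(size=(3, 40))
gamma = 0.5

pred_kernel = kernel_ridge_predict(X @ X.T, X @ X_test.T, y, gamma)

# Reference: primal ridge regression weights applied to the test genotypes
w = np.linalg.solve(X.T @ X + gamma * np.eye(40), X.T @ y)
pred_ridge = X_test @ w
```

Only the kernel matrices enter the computation, so swapping in a nonlinear kernel changes the model without changing this function.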
Construction of kernels
One possible interpretation of kernel learning is that the kernel function of two vectors x and t, K(x, t), to some extent describes the similarity between x and t by tending to yield a relatively large value when x is similar to t. There are two typical approaches to evaluate the similarity of two vectors: cross-correlation x^{t}t and distance d(x, t). These are intrinsically related: for vectors of fixed norm, x^{t}t decreases as d(x, t) increases if the measure d is the squared Euclidean distance: d(x, t) = ‖x − t‖^{2} = ‖x‖^{2} + ‖t‖^{2} − 2x^{t}t. Therefore, in this study both cross-correlation-based kernels [13],[30] and distance-based kernels [30]-[33] that use those two similarity measures were used.
Cross-correlation-based kernels
The polynomial kernel is the most classical cross-correlation-based kernel [28],[34] that depends on the inner product of two vectors:
$$K(x, t) = \left(x^{t}t\right)^{l}. \qquad (8)$$
This kernel maps the original feature space into one that is spanned by monomials of degree l. A more general definition of the polynomial kernel is:
$$K(x, t) = \left(x^{t}t + 1\right)^{l}, \qquad (9)$$
which is called an inhomogeneous polynomial kernel since a unit shift is added onto the inner product of two vectors. Compared with the homogeneous kernel given by Equation (8), the explicit mapping function of this kernel contains all monomials whose degrees are equal to or smaller than l.
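To make the link between the polynomial kernel and its implicit feature map concrete, here is a small sketch for two features and degree 2. The helper names `poly_kernel` and `phi2` and the toy vectors are assumptions for illustration only.

```python
import numpy as np

def poly_kernel(A, B, degree, inhomogeneous=False):
    """Polynomial kernel between the rows of A and the rows of B."""
    gram = A @ B.T                    # inner products x^t t
    if inhomogeneous:
        gram = gram + 1.0             # unit shift added to the inner product
    return gram ** degree

x = np.array([[1.0, 2.0]])
t = np.array([[3.0, 1.0]])

k_hom = poly_kernel(x, t, degree=2)[0, 0]                       # (x^t t)^2
k_inh = poly_kernel(x, t, degree=2, inhomogeneous=True)[0, 0]   # (x^t t + 1)^2

def phi2(v):
    """Explicit map to all degree-2 monomials of a 2-feature vector."""
    return np.array([v[0] ** 2, np.sqrt(2.0) * v[0] * v[1], v[1] ** 2])

# The inner product in the mapped space equals the homogeneous kernel value
k_explicit = phi2(x[0]) @ phi2(t[0])
```

Here x^{t}t = 5, so the homogeneous kernel gives 25 and the inhomogeneous kernel gives 36; the explicit degree-2 map reproduces the value 25 without the kernel ever having to construct it.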
Distance-based kernels
Similarity can also be measured by the distance d: if x and t are similar, the function value of d(x, t) should be small. Mathematically speaking, the distance function should satisfy the following three properties:
1. d(x, x) ≥ 0,
2. d(x, t) = d(t, x),
3. d(x, t) ≤ d(x, z) + d(z, t).
Then, a valid kernel can be constructed by the following equation:
Distance-based kernels are derived from the L_{p}-norm distance, which has been proven to satisfy the aforementioned requirements [34]:
Two well-known distance kernels are special cases of this general equation: the radial basis function (RBF) kernel (p = 2, also known as the Gaussian kernel) [31] and the Laplacian kernel (p = 1) [33]:
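The displayed kernel equations are not reproduced in this text, so the following is only a sketch of one common form consistent with the description: a kernel of the form $K(x,t)=\exp(-\theta\, d(x,t))$ with $d(x,t)=\sum_{k}|x_{k}-t_{k}|^{p}$, which gives the RBF kernel for p = 2 and the Laplacian kernel for p = 1. The bandwidth symbol `theta`, its value and the function names are assumptions, not taken from the source.

```python
import numpy as np

def lp_distance(A, B, p):
    """d(x, t) = sum_k |x_k - t_k|^p for every pair of rows of A and B.

    Builds an (n_A, n_B, m) array, so this is only meant for small examples.
    """
    diff = np.abs(A[:, None, :] - B[None, :, :])
    return (diff ** p).sum(axis=2)

def distance_kernel(A, B, p, theta):
    """Assumed form K = exp(-theta * d); theta is a hypothetical bandwidth."""
    return np.exp(-theta * lp_distance(A, B, p))

X = np.array([[0.0, 0.0],
              [1.0, 2.0],
              [2.0, 2.0]])

K_rbf = distance_kernel(X, X, p=2, theta=0.1)       # Gaussian / RBF kernel
K_laplace = distance_kernel(X, X, p=1, theta=0.1)   # Laplacian kernel
```

With this form, identical genotypes have kernel value 1 and the value decays towards 0 as the distance grows, which matches the interpretation of the kernel as a similarity measure.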
Comparison of methods
In our study, accuracy of genomic prediction based on multiline training was evaluated for two nonlinear models that were based on two different kernels that are the most representative of the two categories of kernels described in the previous section [35]. The first uses the RBF kernel and is termed “RBF” hereafter, and the second uses the polynomial kernel and is termed “Poly” hereafter. Linear regression, also known as ridge regression (RR), is a special case of kernel linear regression that adopts the linear kernel [13]. A method equivalent to RR, i.e. GBLUP that uses a genomic relationship matrix [26], is applied here for comparison.
Considering that the number of SNPs is relatively large compared to the number of animals with phenotypes, all models were also implemented after performing principal component analysis (PCA) to reduce the data dimensions while still explaining 97% of the variance of the SNP genotypes in the data. These three models are termed RRPCA for RR, RBFPCA for RBF kernel based linear regression and PolyPCA for polynomial kernel based linear regression.
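The dimension-reduction step can be sketched as follows, assuming PCA is computed by a singular value decomposition of the column-centred genotype matrix and that components are kept until 97% of the variance is explained. The function name `pca_reduce` and the simulated genotypes are illustrative assumptions; the software actually used in the study is not specified here.

```python
import numpy as np

def pca_reduce(X, var_explained=0.97):
    """Project X onto the leading principal components.

    Returns the scores, the retained components and the column means,
    so that new animals can be projected with (x - mean) @ components.T.
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    # Thin SVD: at most min(n, m) components
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
    # Smallest number of components whose cumulative share reaches the target
    n_comp = int(np.searchsorted(ratio, var_explained) + 1)
    components = Vt[:n_comp]
    return Xc @ components.T, components, mean

rng = np.random.default_rng(2)
X = rng.integers(0, 3, size=(30, 500)).astype(float)   # toy genotypes, n << m
scores, components, mean = pca_reduce(X)
```

Because the centred matrix has rank at most n − 1, the number of retained components can never exceed n − 1, however many SNPs are genotyped.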
Data, pre-analysis, and experimental configurations
To compare the models, data of two brown and one white lines of layer chickens were analysed. The brown layer lines B1 and B2 were closely related to each other, while the white line (W1) was only distantly related to the brown lines. The phenotype data used was the number of eggs in the first production period until the hens reach the age of 24 weeks. Across the three lines, 3753 female birds had both phenotypes and genotypes for 45 974 SNPs from the chicken 60 k Illumina Infinium iSelect Beadchip [36] after editing. More details on the dataset and on the editing of the SNP data are described in Calus et al. [22].
Seven different training sets and one validation set per line were defined to evaluate the accuracy of genomic prediction with single and multiline training datasets. For each line, the youngest generation, containing 238 to 240 birds, was used as a validation set. Breeding values for the validation animals were predicted using phenotypes of the training set, which were pre-corrected for hatch week. For the validation animals, the correlation coefficient between the GEBV and their observed phenotypes was computed to evaluate the accuracy of genomic prediction with various training datasets. These correlations are hereafter referred to as ‘predictive correlations’. Commonly, such correlations are divided by the square root of the heritability of the trait to reflect the accuracies of predictions of true breeding values. In this case, we did not do that, because such an adjustment assumes that all the captured genetic variance is additive, while the kernel functions may capture some non-additive effects. Approximate standard errors of the predictive correlations were computed using the expected sampling variance of an estimated correlation ($\widehat{\rho}$), as $\frac{1-{\widehat{\rho}}^{2}}{\sqrt{N-2}}$, where N is the number of training animals [24]. The coefficient of the regression of phenotypes on GEBV (b_{1}) was computed to evaluate bias of the predictions. Standard errors of the regression coefficients, denoted as $S{E}_{{b}_{1}}$, were derived with bootstrapping, which involved computing regression coefficients for 10 000 bootstrapping samples of the 238 to 240 validation animals, using the R package “boot” [37]. The regression coefficients were considered as not significantly different from 1 when $\left|{b}_{1}-1\right|<2\times S{E}_{{b}_{1}}$ [38].
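The validation statistics described above can be sketched as follows. This is a NumPy re-implementation for illustration, not the R package “boot” used in the study; the function names, the simulated GEBV and phenotypes, and the hypothetical training size `n_train` are assumptions.

```python
import numpy as np

def predictive_correlation(gebv, pheno):
    """Correlation between GEBV and observed phenotypes."""
    return np.corrcoef(gebv, pheno)[0, 1]

def approx_se(rho, n):
    """Approximate standard error (1 - rho^2) / sqrt(n - 2)."""
    return (1.0 - rho ** 2) / np.sqrt(n - 2)

def regression_slope(gebv, pheno):
    """Slope b1 of the regression of phenotypes on GEBV."""
    return np.cov(gebv, pheno)[0, 1] / np.var(gebv, ddof=1)

def bootstrap_se_slope(gebv, pheno, n_boot=10_000, seed=0):
    """Bootstrap standard error of b1, resampling validation animals."""
    rng = np.random.default_rng(seed)
    n = len(gebv)
    slopes = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)        # sample animals with replacement
        slopes[b] = regression_slope(gebv[idx], pheno[idx])
    return slopes.std(ddof=1)

# Simulated validation set of 240 animals (not the study data)
rng = np.random.default_rng(3)
gebv = rng.normal(size=240)
pheno = gebv + rng.normal(scale=2.0, size=240)
n_train = 1000                                   # hypothetical training size

rho = predictive_correlation(gebv, pheno)
se_rho = approx_se(rho, n_train)
b1 = regression_slope(gebv, pheno)
se_b1 = bootstrap_se_slope(gebv, pheno, n_boot=1000)
not_different_from_one = abs(b1 - 1.0) < 2.0 * se_b1
```

The last line applies the decision rule from the text: a slope is treated as not significantly different from 1 when it lies within two bootstrap standard errors of 1.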
The first three training sets consisted of one of the three lines. The next three training sets included each of the three pairwise combinations of the three lines. The last training set included layers from all three lines. The resulting training sets included ~1000 to 3000 animals, and the number of segregating SNPs ranged from 30 508 to 45 974 [22].
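The seven training sets are simply all non-empty combinations of the three lines, which can be enumerated as in this short sketch (the variable names are illustrative):

```python
from itertools import combinations

lines = ["B1", "B2", "W1"]

# Single lines, pairs of lines, and all three lines together
training_sets = [
    combo
    for size in (1, 2, 3)
    for combo in combinations(lines, size)
]

# For a given evaluated line, flag which training sets include that line
includes_w1 = [("W1" in combo) for combo in training_sets]
```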
Results
Genetic correlations between lines
The estimated genetic correlations between the three lines are in Table 1. The genetic correlation between lines B1 and B2 was equal to 0.63, thus significantly larger than 0, which confirms that B1 and B2 are closely related lines. Genetic correlations between lines B1 and W1 and between lines B2 and W1 were equal to -0.26 and -0.55, respectively. The large standard errors of these estimates show that the estimated genetic correlation between line B1 and W1 is not significantly different from 0, while the correlation between B2 and W1 is significantly lower than 0.
Accuracy of genomic predictions
Tables 2, 3 and 4 show the predictive correlations for each line of six methods using seven training datasets. In the following, we first describe results across the training datasets and then differences between the methods.
Table 2 shows the predictive correlations of line B1 across the training datasets. The impact of multiline training for line B1 differed slightly between models. Results of the two models with the highest predictive correlations are discussed as examples. The GBLUP model achieved the highest predictive correlation when the model was trained exclusively on data from line B1. In other words, enlarging the training set by adding the training animals from any other line deteriorated the prediction performance. However, the second best model, namely RBF, which had a performance that was slightly inferior to that of the GBLUP model, benefited slightly from enhancing training with data from other lines.
Table 3 contains the predictive correlations for line B2. Compared to the scenario for which the training dataset only contained line B2, both linear models GBLUP and RRPCA had a ~0.03 higher predictive correlation with multiline prediction. Predictive correlations for the nonlinear models were, however, very similar to each other across the training datasets.
Interestingly, focussing on the results for line B1 with training on data from line B2 only, or vice versa, the predictive correlations of the linear and RBF models were clearly superior to those of the Poly models. This suggests that the genotypes of lines B1 and B2 shared some structural similarities that benefitted the predictions of the linear and RBF models. In these situations, the Poly models resulted in predictive correlations that were generally close to 0.
Table 4 shows the predictive correlations for the line W1 validation data. Predictive correlations were very similar across models and training datasets whenever line W1 was included in the training data. When line W1 was not included in the training data, the predictive correlations were always negative, except for MTGBLUP and the Poly models.
Overall, the benefit of multiline training was limited, and only clearly observed in a few cases when the training data included a closely related line, e.g. lines B1 and B2. Therefore, enlarging the training set with unrelated or distantly related animals did not significantly improve predictive correlations.
Bias of genomic prediction within and across lines
Bias of genomic predictions was assessed by evaluating coefficients of the regression of phenotypes on GEBV. Bias decreases as regression coefficients get closer to 1. For all three lines (See Additional file 1: Tables S1, S2 and S3), bias was more controlled for all models if the evaluated line was included in the training data, otherwise, large biases were observed, especially for the nonlinear (Poly and RBF) models. These results indicate that GBLUP, RRPCA, MTGBLUP and RBFPCA gave reasonable results in terms of bias, as long as the evaluated line or a closely related line was included in the training dataset.
Model comparison
Among the nonlinear models, the Poly models generally performed worse than the RBF models, both in terms of predictive correlations (Tables 2, 3 and 4) and bias (See Additional file 1: Tables S1, S2 and S3), when the evaluated line was included in the training data. In addition, the predictions of the Poly models had close to 0 predictive correlations and very large biases when based on information from a closely related line (lines B1 and B2).
In the comparison between linear and nonlinear models, it is important to note that the nonlinear RBF models yielded predictive correlations that were comparable to those of the best linear models (either GBLUP or RRPCA) for lines B1 and W1 when the training data included all lines (Tables 2 and 4). For line B2, RBF performed better than the GBLUP model, while RRPCA had the highest predictive correlation in all scenarios (Table 3). For line B1, however, RRPCA had a lower predictive correlation than the RBF and GBLUP models (Table 2). For lines B1 and B2, the MTGBLUP model generally yielded predictive correlations that were similar to those of most of the other models (Tables 2 and 3). The same was observed for W1 when W1 was included in the training data (Table 4). However, when W1 was not included in the training data, MTGBLUP yielded positive predictive correlations but almost all other models yielded negative predictive correlations.
In summary, the results show that the performance of RBF models was fairly similar to that of the linear models, and that the Poly models generally performed worse. The MTGBLUP model in some situations could generate positive predictive correlations when the trait had a negative correlation between the evaluated line and the line(s) included in the training data.
Complementarity analysis
Because linear and nonlinear models focus on different aspects of the genomic data, in this subsection, we analysed the complementarity between models. One way to measure the complementarity between two approaches is based on the correlation between their predictions. Correlations of genomic predictions were computed between models for the training dataset that included all three lines (Table 5). In general, predictions from the Poly models had the lowest correlations with those of other models, which is in line with the observation that, in most cases, the Poly models had the poorest performance in terms of predictive correlation. Ignoring the Poly models, the correlations between predictions from the different models were generally high (>0.9) for line W1. For lines B1 and B2, the predictions from the RBF models had correlations lower than 0.9 with those of GBLUP and RRPCA and higher than 0.9 with those of MTGBLUP. The prediction from the MTGBLUP model deviated substantially from those of GBLUP, with correlations of 0.91 to 0.98. The level of the correlations showed that combining predictions of different models could lead to more accurate predictions. The potential of such an approach was investigated by evaluating combined predictions of two models. A weighted combination of two predictions (â_{1}, â_{2}), can be easily obtained using the following equation:
$$\widehat{a} = \beta\,\widehat{a}_{1} + (1-\beta)\,\widehat{a}_{2},$$
where parameter β defines the weight given to the two approaches. When β is equal to 0 or 1, the combination is reduced to either of the two predictions. Figure 1 shows the predictive correlations of this combined prediction for the linear models GBLUP and RRPCA and the nonlinear model RBF. In Figure 1, each row represents the results for one combination of models and each column represents the results for one of the lines. For line B1, combining predictions from a linear and a nonlinear model improved the predictive correlation, especially for the combination of GBLUP and RBF. For line B2, there was little gain by combining models, which is probably due to the superior performance of the RRPCA model. For line W1, the combined prediction was in all cases slightly more accurate. Interestingly, across all situations, the benefit of combining predictions of two models was largest when the two models had a similar predictive correlation.
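A weighted combination of two sets of predictions, scanned over a grid of weights, can be sketched as follows. Which of the two predictions receives weight β (and which 1 − β) is an assumption here, as are the function names, the grid of weights and the simulated predictions.

```python
import numpy as np

def combine(pred1, pred2, beta):
    """Weighted combination: beta on pred1 and (1 - beta) on pred2."""
    return beta * pred1 + (1.0 - beta) * pred2

def scan_beta(pred1, pred2, pheno, betas):
    """Predictive correlation of the combined prediction for each beta."""
    return np.array([
        np.corrcoef(combine(pred1, pred2, b), pheno)[0, 1]
        for b in betas
    ])

# Simulated example: two noisy predictors of the same underlying signal
rng = np.random.default_rng(4)
signal = rng.normal(size=200)
pheno = signal + rng.normal(size=200)
pred_linear = signal + rng.normal(scale=0.8, size=200)
pred_rbf = signal + rng.normal(scale=0.8, size=200)

betas = np.linspace(0.0, 1.0, 11)
correlations = scan_beta(pred_linear, pred_rbf, pheno, betas)
best_beta = betas[np.argmax(correlations)]
```

In practice the weight would need to be chosen without looking at the validation phenotypes, for example by cross-validation within the training data; the scan above only illustrates how the curves in such a comparison are produced.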
Computational complexity
For practical applications of genomic prediction in livestock, it is important that the predictions can be computed efficiently. Therefore, in this section, we analytically evaluate the computational complexity of linear and nonlinear models. Revisiting both prediction models, they can be generalized by the following expression:
$$y^{*} = y^{t}\left(A + \gamma I\right)^{-1}b,$$
where y is the vector of training phenotypes. For the linear model, A = XX^{t} and b = Xx_{ t } (referring to Equation (5)), while for the nonlinear model A = K and b = k (referring to Equation (7)). The computation cost depends heavily on the inversion of the matrix (A + γ I), which is O(n^{3}) [25], where n is the dimension of matrix A. The computational complexity of the linear and nonlinear models therefore depends on the size of matrix A, which is m × m (i.e. ridge regression BLUP) or n × n (i.e. GBLUP) for the linear models and n × n for the nonlinear models implemented in our study, which means that the complexities are either O(m^{3}) or O(n^{3}).
In genomic prediction, the number of genotypes (m) is typically much larger than the number of training animals (n). When ridge regression is used in the linear model (i.e. matrix A is of size m × m) and combined with the use of PCA (i.e. RRPCA in our case), the size of the matrix decreases to less than n × n, because the number of retained principal components will have a maximum value of n − 1 [4]. Therefore, computational complexity of the nonlinear models implemented in our study is comparable to that of the linear GBLUP model, as summarized in Table 6. Thus, the nonlinear models are expected to be able to deal with datasets of similar size as the commonly used GBLUP model.
Discussion
The objective of this study was to compare the accuracy of multiline genomic prediction when using nonlinear or linear models. In general, when the evaluated line was included in the training data, the nonlinear RBF models yielded similar predictive correlations as the linear models. The nonlinear models appeared to be slightly less sensitive to the structure of multiline training datasets. For example, some of the linear models showed small decreases in predictive correlations for lines B1 and W1 when adding other lines [22], but this did not (or rarely) occur for the nonlinear models. When only information from a closely related line was used for training, the linear models and the nonlinear RBF models had similar performance, indicating that the strong assumptions of the linear models may at least partly hold for the closely related lines used in our study. Our expectation was that the nonlinear models would be better able to use relevant information, without making strong assumptions as done in the linear models [21],[39], but the results showed that, overall, the linear models and nonlinear RBF models performed similarly.
The complementarity analysis is another aspect of our study. It has been shown that combining genomic predictions of different models, a procedure also known as “bagging” [40], may lead to more robust predictions with generally a higher accuracy [41] or at the very least result in similar accuracies as achieved with the underlying models [42]. In our study, except for line B2, for which RRPCA performed significantly better than any other model, both measures of complementarity indicated that combining linear and nonlinear models has the potential to result in slightly more accurate predictions, which means that the linear models capture different features of the data than the nonlinear models. The fact that nonlinear models captured some predictive variation that was not explained by linear models may be partly due to the ability of nonlinear models to capture nonadditive effects. Since many nonadditive effects are not passed onto the next generation, predictions from nonlinear models may be less useful for achieving genetic gain than the linear models. Nevertheless, capturing nonadditive effects does help to better predict the performance of an animal itself.
Another focus of this study was to investigate whether the potential benefit of multiline genomic prediction depends on the genomic similarities of the lines considered. We showed that only some of the lines benefitted from multiline training, which is consistent with previous studies e.g. [8],[12]. The genotype data of the lines analysed in this work were apparently quite heterogeneous and thus, there was no consistent gain in predictive correlations from using multiline training data. In some situations, there was a small benefit for lines B1 and B2 but not for W1. This was as expected based on results of the genotypedistance matrix reported by Calus et al. [22], that showed that animals from lines B1 and B2 were more closely related than animals from lines B1 or B2 with animals from line W1. Training data for which relationships with the predicted data are poor, are expected to have negligible contributions to the nonlinear predictor. In contrast, the distance between two individuals from lines B1 and B2 was relatively small, indicating that the properties of the genotypes of these two lines were similar. These properties include allele frequencies and LD. Similarities between populations in both of these properties were shown to be closely related to genomic relationships between populations [43]. This might explain the improvement in predictive correlations for lines B1 and B2 in some scenarios when line B1 or B2 was added to the training data. Indeed, the estimated genetic correlations between the lines revealed that the trait investigated was highly correlated between lines B1 and B2. There was, however, no clear improvement in or even deterioration of predictive correlations for lines B1 and B2 when line W1 was included in training, or vice versa. 
However, across several linear models, positive predictive correlations of 0.10 to 0.14, although not significantly greater than 0, were consistently obtained for line B2 when only line W1 was used for training [22]. Moreover, genetic correlations were equal to 0.26 between lines B1 and W1 and 0.55 between lines B2 and W1, which suggests that information of line W1 was not useful for lines B1 and B2 and vice versa. In summary, a benefit from using multiline training is especially expected when lines share several common properties, which can be characterized by genomic relationships between lines. Estimating the genetic correlation of the trait between lines may also be very informative. If the distance between the lines is very large and if the estimated correlation is close to 0 or even negative, the benefit of using multiline genomic prediction is expected to be very limited.
Another interesting conclusion of the comparison between models for the three lines is that no single model was superior to all others in every scenario, which is similar to the results obtained when comparing different linear models [22]. The MTGBLUP model did not necessarily perform better than the other models for lines B1 and B2, but was able to yield substantial positive predictive correlations for line W1 when line B1, B2, or both were used for training. However, when line W1 was used to predict lines B1 and B2, MTGBLUP performed considerably worse than the other linear models. For predicting line B2, RRPCA performed much better than the other models. Interestingly, for line B2, the RBFPCA model was also more advantageous than the other regression models. For predicting line W1, all models performed quite similarly whenever line W1 itself was included in the training data.
As an important criterion for model evaluation, the bias of the genomic predictions was evaluated (see Additional file 1: Tables S1, S2 and S3). First, when training and validation data were from the same line, the bias was limited for all models. Second, when the evaluated line was not represented in the training data, the nonlinear models could show very large biases. The genotype distance between a brown hen and a white hen is relatively large, such that the kernel value of those two genotypes computed with Equation (10) becomes small. Therefore, the variance of the GEBV becomes small and the bias can accordingly become very large. In other words, the nonlinear models may yield realistic predictive correlations close to 0 combined with very large biases, while the strong assumptions of the linear model appear to control the bias, but at the same time may result in poor predictive correlations. These results highlight the importance of evaluating bias as well as accuracy if the predicted line or breed is not represented in the training data. Conversely, our results show that including the evaluated line in the training data is the best way to control the bias of the predictions, regardless of the model used.
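The shrinkage mechanism described above can be illustrated with a small simulation. The following is only a sketch under stated assumptions, not the analysis performed in this study: it assumes a Gaussian RBF kernel of the form exp(-||x - z||^2 / (2 sigma^2)), which may differ in parameterisation from Equation (10), and it uses simulated genotypes, a hypothetical bandwidth `sigma`, and a hypothetical ridge parameter `lam`. It shows that kernel values, and hence the spread of the predicted values, collapse for validation animals whose allele frequencies differ from those of the training line.

```python
import numpy as np

def sq_dists(X, Z):
    """Squared Euclidean distances between rows of X and rows of Z."""
    d = (X**2).sum(1)[:, None] + (Z**2).sum(1)[None, :] - 2.0 * X @ Z.T
    return np.maximum(d, 0.0)

def rbf(X, Z, sigma):
    """Gaussian RBF kernel (assumed form, see lead-in)."""
    return np.exp(-sq_dists(X, Z) / (2.0 * sigma**2))

rng = np.random.default_rng(1)
n, n_val, m = 200, 50, 500                 # simulated sizes, not the real data
p_train = rng.uniform(0.1, 0.9, m)         # allele frequencies, training line
p_far = rng.uniform(0.1, 0.9, m)           # unrelated frequencies, distant line

X_train = rng.binomial(2, p_train, (n, m)).astype(float)
X_near = rng.binomial(2, p_train, (n_val, m)).astype(float)  # same line
X_far = rng.binomial(2, p_far, (n_val, m)).astype(float)     # distant line

# Simulated phenotypes for the training animals, centred on zero.
y = X_train @ rng.normal(0.0, 0.1, m) + rng.normal(0.0, 1.0, n)
y -= y.mean()

# Hypothetical bandwidth: a fraction of the median within-line distance.
iu = np.triu_indices(n, k=1)
sigma = 0.25 * np.median(np.sqrt(sq_dists(X_train, X_train)[iu]))
lam = 1.0                                  # hypothetical ridge parameter

# Kernel ridge regression in dual form: alpha = (K + lam * I)^-1 y.
alpha = np.linalg.solve(rbf(X_train, X_train, sigma) + lam * np.eye(n), y)

K_near = rbf(X_near, X_train, sigma)
K_far = rbf(X_far, X_train, sigma)
gebv_near = K_near @ alpha                 # predictions within the line
gebv_far = K_far @ alpha                   # predictions for the distant line

print("mean kernel value, same line   :", K_near.mean())
print("mean kernel value, distant line:", K_far.mean())
print("SD of predictions, same line   :", gebv_near.std())
print("SD of predictions, distant line:", gebv_far.std())
```

Because each prediction is a kernel-weighted sum over training animals, near-zero kernel values compress all predictions for the distant line towards zero. A regression of phenotypes on such compressed predictions can then have a slope far from 1, even when the predictive correlation is close to 0.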
By achieving a significant reduction in the dimension of genotypes, PCA is shown to benefit nonlinear models, similar to what has been observed for the linear RRPCA model [22]. Concentrating on the nonlinear kernel model that produced the highest predictive correlations, i.e. the RBF kernel, PCA had a minor impact on the predictive correlations, as shown by the correlation between the predictions from RBF and RBFPCA. This might be explained by the nature of the nonlinear model: the prediction depends heavily on the distance relationships between training and testing animals, which are not altered by PCA. The Poly models also had very similar predictions whether PCA was performed or not. Regardless, the performance of Poly models was generally worse than that of other models, suggesting that they should not be considered for genomic prediction. Overall, our results with the nonlinear RBF and linear RRPCA models suggest that dimensionality reduction of the genotype data might be helpful to decrease computational complexity while hardly affecting model accuracy.
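The argument that PCA does not alter the distance relationships between animals can be checked numerically. The sketch below uses simulated genotypes and assumes PCA is computed from the singular value decomposition of the centred genotype matrix; the number of components retained in this study is not restated here, so the truncation to 30 components is purely hypothetical. When all components are retained, pairwise Euclidean distances (and therefore any kernel that depends only on those distances) are unchanged; when components are dropped, squared distances can only decrease.

```python
import numpy as np

def sq_dists(A):
    """Pairwise squared Euclidean distances between rows of A."""
    g = A @ A.T
    d = np.diag(g)[:, None] + np.diag(g)[None, :] - 2.0 * g
    return np.maximum(d, 0.0)

rng = np.random.default_rng(0)
X = rng.binomial(2, 0.3, (60, 400)).astype(float)  # 60 animals, 400 SNPs (simulated)
Xc = X - X.mean(axis=0)                            # centre each SNP

# PCA via SVD: the principal component scores are U * s.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores_all = U * s                 # all 60 components, instead of 400 SNPs
scores_30 = scores_all[:, :30]     # hypothetical truncation to 30 components

d_snp = sq_dists(Xc)
d_all = sq_dists(scores_all)
d_30 = sq_dists(scores_30)

same = np.allclose(d_snp, d_all, atol=1e-8)
shrunk = bool(np.all(d_30 <= d_snp + 1e-8))
print("distances preserved with all components:", same)
print("distances never increase after truncation:", shrunk)
```

With all components retained, the dimension drops from the number of SNPs to at most the number of animals without changing the kernel matrix, which is consistent with the observation that RBF and RBFPCA gave highly correlated predictions.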
Conclusions
In this study, we investigated genomic prediction with multiline data. Considering the possibly complex and heterogeneous distributions of genotypes in such data, we used nonlinear models based on kernel linear regression, which rely on the similarity among animals but do not assume linearity in the genotypes, as conventional linear models do. On this basis, it was anticipated that the nonlinear models would capture different features of multiline data than the linear models.
Our results indicate that the nonlinear RBF models had very similar prediction performance to the commonly used linear GBLUP model. Using one line to predict performance in another closely related line yielded similar prediction accuracies with the RBF and the considered linear models, which suggests that the genotypes of closely related lines share some structural similarities. This was supported by the estimated genetic correlation of 0.63 between the trait in the two closely related lines. Using only data from a distantly related line for prediction with a linear model sometimes resulted in small positive predictive correlations, in a few cases in considerable negative predictive correlations, and sometimes in predictions with very large bias. This suggests that genomic prediction using only information from a distantly related line or breed should be avoided. Furthermore, despite the similar predictive correlations, linear and nonlinear models were shown to capture some complementary predictive information, since the combined prediction slightly improved the predictive correlations.
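Why combining two models with similar accuracy can still help may be seen in a toy simulation. This is a sketch under explicit assumptions and does not reproduce the combination procedure used in this study: the two "models" are simulated as the true value plus errors that are only partly shared, and the combination is a plain average. The names `pred_lin` and `pred_rbf` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20000                                    # simulated animals
g = rng.normal(size=n)                       # simulated true genetic values

# Each model's error has a shared part and a model-specific part.
shared = rng.normal(size=n)
e_lin = np.sqrt(0.5) * shared + np.sqrt(0.5) * rng.normal(size=n)
e_rbf = np.sqrt(0.5) * shared + np.sqrt(0.5) * rng.normal(size=n)

pred_lin = g + 2.0 * e_lin                   # stand-in for a linear model
pred_rbf = g + 2.0 * e_rbf                   # stand-in for a nonlinear model
pred_avg = 0.5 * (pred_lin + pred_rbf)       # simple average (assumed rule)

def corr(a, b):
    """Pearson correlation between two vectors."""
    return np.corrcoef(a, b)[0, 1]

r_lin = corr(pred_lin, g)
r_rbf = corr(pred_rbf, g)
r_avg = corr(pred_avg, g)
print("correlation, linear stand-in   :", round(r_lin, 3))
print("correlation, nonlinear stand-in:", round(r_rbf, 3))
print("correlation, averaged          :", round(r_avg, 3))
```

Averaging cancels part of the model-specific error, so the gain shrinks as the errors of the two models become more alike. A small gain from combining, as reported here, is therefore what would be expected when the two sets of predictions are highly correlated but not identical.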
References
 1.
de los Campos G, Hickey JM, Pong-Wong R, Daetwyler HD, Calus MPL: Whole-genome regression and prediction methods applied to plant and animal breeding. Genetics. 2013, 193: 327-345.
 2.
Meuwissen THE, Hayes BJ, Goddard ME: Prediction of total genetic value using genome-wide dense marker maps. Genetics. 2001, 157: 1819-1829.
 3.
Daetwyler HD, Villanueva B, Woolliams JA: Accuracy of predicting the genetic risk of disease using a genome-wide approach. PLoS ONE. 2008, 3: e3395. 10.1371/journal.pone.0003395.
 4.
Yan SC, Xu D, Zhang BY, Zhang HJ, Yang Q, Lin S: Graph embedding and extensions: A general framework for dimensionality reduction. IEEE Trans Pattern Anal Mach Intell. 2007, 29: 40-51. 10.1109/TPAMI.2007.250598.
 5.
Lin YY, Liu TL, Fuh CS: Multiple kernel learning for dimensionality reduction. IEEE Trans Pattern Anal Mach Intell. 2011, 33: 1147-1160. 10.1109/TPAMI.2010.183.
 6.
Dadousis C, Veerkamp RF, Heringstad B, Pszczola M, Calus MPL: A comparison of principal component regression and genomic REML for genomic prediction across populations. Genet Sel Evol. 2014, 46: 60.
 7.
Solberg TR, Sonesson AK, Woolliams JA, Meuwissen THE: Reducing dimensionality for prediction of genome-wide breeding values. Genet Sel Evol. 2009, 41: 29. 10.1186/1297-9686-41-29.
 8.
Weber KL, Thallman RM, Keele JW, Snelling WM, Bennett GL, Smith TPL, McDaneld TG, Allan MF, Van Eenennaam AL, Kuehn LA: Accuracy of genomic breeding values in multibreed beef cattle populations derived from deregressed breeding values and phenotypes. J Anim Sci. 2012, 90: 4177-4190. 10.2527/jas.2011-4586.
 9.
Daetwyler HD, Swan AA, van der Werf JHJ, Hayes BJ: Accuracy of pedigree and genomic predictions of carcass and novel meat quality traits in multibreed sheep data assessed by cross-validation. Genet Sel Evol. 2012, 44: 33. 10.1186/1297-9686-44-33.
 10.
Makgahlela ML, Mantysaari EA, Stranden I, Koivula M, Nielsen US, Sillanpaa MJ, Juga J: Across breed multitrait random regression genomic predictions in the Nordic Red dairy cattle. J Anim Breed Genet. 2013, 130: 10-19. 10.1111/j.1439-0388.2012.01017.x.
 11.
Erbe M, Hayes BJ, Matukumalli LK, Goswami S, Bowman PJ, Reich CM, Mason BA, Goddard ME: Improving accuracy of genomic predictions within and between dairy cattle breeds with imputed high-density single nucleotide polymorphism panels. J Dairy Sci. 2012, 95: 4114-4129. 10.3168/jds.2011-5019.
 12.
De Roos APW, Hayes BJ, Goddard ME: Reliability of genomic predictions across multiple populations. Genetics. 2009, 183: 1545-1553. 10.1534/genetics.109.104935.
 13.
Schölkopf B, Smola AJ: A short introduction to learning with kernels. Advanced Lectures on Machine Learning. Edited by: Bousquet O, Rätsch G. 2003, Springer-Verlag, Berlin, 41-64. 10.1007/3-540-36434-X_2.
 14.
Karoui S, Carabano MJ, Diaz C, Legarra A: Joint genomic evaluation of French dairy cattle breeds using multiple-trait models. Genet Sel Evol. 2012, 44: 39. 10.1186/1297-9686-44-39.
 15.
Legarra A, Baloche G, Barillet F, Astruc JM, Soulas C, Aguerre X, Arrese F, Mintegi L, Lasarte M, Maeztu F, Beltrán de Heredia I, Ugarte E: Within- and across-breed genomic predictions and genomic relationships for Western Pyrenees dairy sheep breeds Latxa, Manech, and Basco-Béarnaise. J Dairy Sci. 2014, 97: 3200-3212. 10.3168/jds.2013-7745.
 16.
Olson KM, VanRaden PM, Tooker ME: Multibreed genomic evaluations using purebred Holsteins, Jerseys, and Brown Swiss. J Dairy Sci. 2012, 95: 5378-5383. 10.3168/jds.2011-5006.
 17.
Gianola D, van Kaam JBCHM: Reproducing kernel Hilbert spaces regression methods for genomic assisted prediction of quantitative traits. Genetics. 2008, 178: 2289-2303. 10.1534/genetics.107.084285.
 18.
Morota G, Koyama M, Rosa GJM, Weigel KA, Gianola D: Predicting complex traits using a diffusion kernel on genetic markers with an application to dairy cattle and wheat data. Genet Sel Evol. 2013, 45: 17. 10.1186/1297-9686-45-17.
 19.
Gönen M, Alpaydin E: Supervised learning of local projection kernels. Neurocomputing. 2010, 73: 1694-1703. 10.1016/j.neucom.2009.11.043.
 20.
Gönen M, Alpaydin E: Localized algorithms for multiple kernel learning. Pattern Recogn. 2013, 46: 795-807. 10.1016/j.patcog.2012.09.002.
 21.
Sun Y, Todorovic S, Goodison S: Local-learning-based feature selection for high-dimensional data analysis. IEEE Trans Pattern Anal Mach Intell. 2010, 32: 1610-1626. 10.1109/TPAMI.2009.190.
 22.
Calus MPL, Huang H, Vereijken A, Visscher J, Ten Napel J, Windig JJ: Genomic prediction based on data from three layer lines: a comparison between linear methods. Genet Sel Evol. 2014, 46: 57. 10.1186/s12711-014-0057-5.
 23.
Saunders C, Gammerman A, Vovk V: Ridge regression learning algorithm in dual variables. ICML-1998 Proceedings of the 15th International Conference on Machine Learning. 1998, Morgan Kaufmann, San Francisco, 515-521.
 24.
Daetwyler HD, Calus MPL, Pong-Wong R, de los Campos G, Hickey JM: Genomic prediction in animals and plants: simulation of data, validation, reporting, and benchmarking. Genetics. 2013, 193: 347-365. 10.1534/genetics.112.147983.
 25.
Golub GH, Van Loan CF: Matrix computations. 2012, JHU Press, Ithaca, New York
 26.
VanRaden PM: Efficient methods to compute genomic predictions. J Dairy Sci. 2008, 91: 4414-4423. 10.3168/jds.2007-0980.
 27.
Gilmour AR, Gogel BJ, Cullis BR, Thompson R: ASReml User Guide Release 3.0. 2009, Hemel Hempstead, VSN International Ltd
 28.
Aguilar I, Misztal I, Johnson DL, Legarra A, Tsuruta S, Lawlor TJ: Hot topic: a unified approach to utilize phenotypic, full pedigree, and genomic information for genetic evaluation of Holstein final score. J Dairy Sci. 2010, 93: 743-752. 10.3168/jds.2009-2730.
 29.
de los Campos G, Gianola D, Rosa GJM, Weigel KA, Crossa J: Semi-parametric genomic-enabled prediction of genetic values using reproducing kernel Hilbert spaces methods. Genet Res. 2010, 92: 295-308.
 30.
Men CQ, Wang WJ: Selection of Gaussian Kernel Parameter for SVM Based on Convex Estimation. Lect Notes Comput Sci. 2008, 5263: 709-714. 10.1007/978-3-540-87732-5_79.
 31.
Wang J, Lu H, Plataniotis KN, Lu JW: Gaussian kernel optimization for pattern classification. Pattern Recogn. 2009, 42: 1237-1247. 10.1016/j.patcog.2008.11.024.
 32.
Prato M, Zanni L: A practical use of regularization for supervised learning with kernel methods. Pattern Recogn Lett. 2013, 34: 610-618. 10.1016/j.patrec.2013.01.006.
 33.
Sotak GE, Boyer KL: The Laplacian-of-Gaussian kernel: a formal analysis and design procedure for fast, accurate convolution and full-frame output. Comput Vision Graph. 1989, 48: 147-189. 10.1016/S0734-189X(89)80036-2.
 34.
Chen L, Ng R: On the marriage of Lp-norms and edit distance. Proceedings of the Thirtieth International Conference on Very Large Data Bases. 2004, 792-803.
 35.
Hofmann T, Schölkopf B, Smola AJ: Kernel methods in machine learning. Ann Stat. 2008, 36: 1171-1220. 10.1214/009053607000000677.
 36.
Groenen MA, Megens HJ, Zare Y, Warren WC, Hillier LW, Crooijmans RP, Vereijken A, Okimoto R, Muir WM, Cheng HH: The development and characterization of a 60 k SNP chip for chicken. BMC Genomics. 2011, 12: 274. 10.1186/1471-2164-12-274.
 37.
Canty A, Ripley B: boot: Bootstrap R (S-Plus) Functions. R package version 1.2-34. 2009.
 38.
Mäntysaari E, Liu Z, VanRaden P: Interbull validation test for genomic evaluations. Interbull Bull. 2010, 41: 17-22.
 39.
Liu Y, Liu Y, Chan KCC: Dimensionality reduction for heterogeneous dataset in rushes editing. Pattern Recogn. 2009, 42: 229-242. 10.1016/j.patcog.2008.06.016.
 40.
Breiman L: Bagging predictors. Mach Learn. 1996, 24: 123-140.
 41.
Gianola D, Weigel KA, Kramer N, Stella A, Schon CC: Enhancing genome-enabled prediction by bagging genomic BLUP. PLoS ONE. 2014, 9: e91693. 10.1371/journal.pone.0091693.
 42.
Heslot N, Yang HP, Sorrells ME, Jannink JL: Genomic selection in plant breeding: a comparison of models. Crop Sci. 2012, 52: 146-160. 10.2135/cropsci2011.06.0297.
 43.
Wientjes YCJ, Veerkamp RF, Calus MPL: The effect of linkage disequilibrium and family relationships on the reliability of genomic prediction. Genetics. 2013, 193: 621-631. 10.1534/genetics.112.146290.
Acknowledgements
The authors acknowledge financial support from the Dutch Ministry of Economic Affairs, Agriculture, and Innovation (Public-private partnership “Breed4Food” code KB12006.03005ASGLR). Hendrix Genetics is gratefully acknowledged for making the data available. Two anonymous reviewers are gratefully acknowledged for their critical comments and very useful suggestions that helped us to considerably improve the manuscript.
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
HH performed most of the analyses and wrote the first draft of the manuscript. AV helped in describing the dataset and interpreting the results. JJW performed the analyses with GBLUP. MPLC supervised the study. All authors read and approved the final version of the manuscript.
Cite this article
Huang, H., Windig, J.J., Vereijken, A. et al. Genomic prediction based on data from three layer lines using nonlinear regression models. Genet Sel Evol 46, 75 (2014). https://doi.org/10.1186/s12711-014-0075-3
Keywords
 Radial Basis Function
 Genetic Correlation
 Training Dataset
 Genomic Prediction
 Radial Basis Function Kernel