Genomic breeding value estimation using nonparametric additive regression models

Genomic selection refers to the use of genomewide dense markers for breeding value estimation and subsequently for selection. The main challenge of genomic breeding value estimation is the estimation of many effects from a limited number of observations. Bayesian methods have been proposed to successfully cope with these challenges. As an alternative class of models, non- and semiparametric models were recently introduced. The present study investigated the ability of nonparametric additive regression models to predict genomic breeding values. The genotypes were modelled for each marker or pair of flanking markers (i.e. the predictors) separately. The nonparametric functions for the predictors were estimated simultaneously using additive model theory, applying a binomial kernel. The optimal degree of smoothing was determined by bootstrapping. A mutation-drift-balance simulation was carried out. The breeding values of the last generation (genotyped) were predicted using data from the second-to-last generation (genotyped and phenotyped). The results show moderate to high accuracies of the predicted breeding values. A predictor-specific determination of the degree of smoothing increased the accuracy.


Introduction
Genomic selection refers to the use of genomewide dense marker genotypes for breeding value estimation and subsequently for selection. Genomic breeding value estimation relies on linkage disequilibrium (LD) between genetic markers and QTL and needs genomewide and dense marker data. The main challenge is the estimation of many effects from a limited number of observations. To cope with this problem, Meuwissen et al. [1] proposed Bayesian methods that used informative priors. Meuwissen et al. [1] and Solberg et al. [2] showed by means of simulations that these methods are able to estimate genomic breeding values with a remarkably high accuracy, even for individuals without own phenotypic observations. This offers the opportunity to speed up genetic gain by reducing the need for progeny testing [3].
Gianola et al. [4] argued that the assumptions made in the Bayesian models of Meuwissen et al. [1] are rather strong (e.g. the priors are very informative) and introduced nonparametric and semiparametric models, which make fewer assumptions. Two ways of modelling the genotypic data are presented by these authors. The first models all genotypes of an individual across the genome simultaneously; see eq. (1) of Gianola et al. [4]. Consequently, the non- or semiparametric estimate includes additive genetic effects as well as dominance and epistasis. From this total genomic value, an additive breeding value can be extracted by performing linear approximations as shown in eq. (8) of Gianola et al. [4]. In the second way of modelling, the genotypes are modelled for each locus separately; see eq. (7) of Gianola et al. [4]. The authors [4] suggest estimating the nonparametric functions of the genotypes of a certain locus by applying additive model theory [5]. This way of modelling ignores epistatic effects.
The total genomic value of an individual is of interest in many cases, favouring the first way of modelling the genotypic data in Gianola et al. [4]. For example, one might think of classifying individuals with respect to their liability to a certain disease. In most livestock selection schemes, however, the breeding values, defined as the sum of the additive effects [6], are in general the most important. Following this, the second way of modelling the genotypic data in Gianola et al. [4], as described above, seems to be an interesting option, because it yields directly the additive effects, if the genotypes are modelled appropriately, and no extra computational step for the linear approximation is needed.
The aim of the present study was to investigate the ability of kernel regression using additive models to estimate genomic breeding values. In particular, the modelling of the genotypic data is shown and a method for the optimal selection of model parameters is presented. Using simulations, the accuracy of predicted breeding values of nonphenotyped animals was evaluated. The results were compared to those obtained from the BLUP method for genomic breeding value estimation.

Nonparametric kernel regression using additive models
Assume that n individuals (i = 1, ..., n) are genotyped at N single nucleotide polymorphisms (SNPs) (j = 1, ..., N). Biallelic SNPs are considered. In this case, q = 2 different alleles are possible at a SNP (l = 1, ..., q). An allele is coded as 0 or 1 and is denoted by x. The individuals are diploid, thus they have two chromosomes (k = 1, 2). Further, the individuals are phenotyped for a heritable quantitative trait. The phenotypes are denoted by y and are free of systematic errors. In the additive allelic model, the phenotype of an individual is represented as

$$y_i = \sum_{j=1}^{N} \sum_{k=1}^{2} g_j(x_{ijk}) + e_i, \tag{1}$$

where $x_{ijk}$ is the kth allele of individual i at marker locus j and $g_j(x_{ijk})$ is the function value of the kth allele at this locus. $e_i$ is a normally distributed random residual. The conditional expectation function is

$$g_j(x_{ijk}) = E(y_i \mid x_{ijk}). \tag{2a}$$

The conditional expectation function for any locus j with its alleles $x_{jl}$ can be written in terms of densities [7]

$$g_j(x_{jl}) = E(y \mid x_{jl}) = \frac{\int y \, p(x_{jl}, y) \, dy}{p(x_{jl})}, \tag{2b}$$

where $p(x_{jl})$ is the density of $x_{jl}$ and can be estimated using a kernel smoother as

$$\hat{p}(x_{jl}) = \frac{1}{2n} \sum_{i=1}^{n} \sum_{k=1}^{2} K(x_{jl}, x_{ijk}, \lambda), \tag{3}$$

where K denotes the kernel and λ a smoothing parameter. In (3), $x_{jl}$ is the point at which the density is estimated; this is termed the focal point [7]. The joint density of $x_{jl}$ and y at point $(x_{jl}, y)$ is estimated as

$$\hat{p}(x_{jl}, y) = \frac{1}{2n} \sum_{i=1}^{n} \sum_{k=1}^{2} K(x_{jl}, x_{ijk}, \lambda) \, K_y(y, y_i), \tag{4}$$

where $K_y$ is a kernel for the phenotype centred at $y_i$. Now, it can be shown [e.g. [4,8]] that substituting (3) and (4) in (2b) results in the Nadaraya-Watson kernel regression estimator [9,10] for the conditional expectation function $g_j(x_{jl})$

$$\hat{g}_j(x_{jl}) = \frac{\sum_{i=1}^{n} \sum_{k=1}^{2} K(x_{jl}, x_{ijk}, \lambda) \, y_i}{\sum_{i=1}^{n} \sum_{k=1}^{2} K(x_{jl}, x_{ijk}, \lambda)}.$$

The additive haplotype model is similar to the allelic model except that haplotypes, formed by pairs of flanking markers, are considered instead of single allelic marker effects. Consequently, the outline shown above holds if it is assumed that $x_{ijk}$ is the kth haplotype at chromosome segment j of individual i and the first summation in (1) is over N segments. The coding of the haplotypes is done so that x can take q = 4 different values, i.e. 1-1, 1-0, 0-1, or 0-0. Similarly, the functions of the segments are estimated using the Nadaraya-Watson regression estimator.
In the following, no distinction is made between the allelic and the haplotype model unless stated otherwise. The loci and segments are both denoted as predictors, and the alleles and haplotypes both as levels of the predictors or, in short, as levels.
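The $x_{ijk}$ are discrete with only q = 2 (q = 4) different values in the allelic (haplotype) model, see above. Therefore we choose the binomial kernel of Aitchison and Aitken [11]. Using this kernel, for each focal $x_{jl}$ and each observed $x_{ijk}$ the number of disagreements d is determined. In the allelic model d takes values of 0 (e.g. $x_{jl}$ is 0 and $x_{ijk}$ is 0) or 1 (e.g. $x_{jl}$ is 0 and $x_{ijk}$ is 1), and in the haplotype model values of 0 (e.g. $x_{jl}$ is 1-1 and $x_{ijk}$ is 1-1), 1 (e.g. $x_{jl}$ is 1-1 and $x_{ijk}$ is 1-0 or 0-1) or 2 (e.g. $x_{jl}$ is 1-1 and $x_{ijk}$ is 0-0). Using this definition of d, the binomial kernel K is

$$K(x_{jl}, x_{ijk}, \lambda) = \lambda^{d_{\max} - d} (1 - \lambda)^{d},$$

where $d_{\max}$ is the largest possible number of disagreements (1 in the allelic and 2 in the haplotype model) and λ is the smoothing parameter with $\tfrac{1}{2} \le \lambda \le 1$ [11].
The Nadaraya-Watson regression applying the binomial kernel for the estimation of the functions is

$$\hat{g}_j(x_{jl}) = \frac{\sum_{i=1}^{n} \sum_{k=1}^{2} \lambda^{d_{\max} - d_{ijk}} (1 - \lambda)^{d_{ijk}} \, y_i}{\sum_{i=1}^{n} \sum_{k=1}^{2} \lambda^{d_{\max} - d_{ijk}} (1 - \lambda)^{d_{ijk}}}, \tag{5}$$

where $d_{ijk}$ is the number of disagreements between $x_{jl}$ and $x_{ijk}$. Extending (2a) to account for multiple predictors, the conditional expectation function can be written as

$$E(y_i \mid x_{i11}, \ldots, x_{iN2}) = \sum_{j=1}^{N} \sum_{k=1}^{2} g_j(x_{ijk}).$$

Assuming additivity of the predictors, this leads to the following iterative backfitting algorithm [12,5] for computing the functions.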
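As a minimal sketch of the Nadaraya-Watson estimate with the Aitchison-Aitken binomial kernel for a single predictor (not the authors' implementation), the following Python code may help. The function names, the coding of levels as strings such as "0" or "10", and the use of the string length as the largest possible number of disagreements are illustrative assumptions.

```python
def binomial_kernel(d, d_max, lam):
    """Binomial kernel weight for d disagreements out of at most d_max."""
    return lam ** (d_max - d) * (1.0 - lam) ** d


def disagreements(a, b):
    """Count the positions at which two equally long level codes differ."""
    return sum(1 for u, v in zip(a, b) if u != v)


def nadaraya_watson(focal, levels, y, lam):
    """Kernel-weighted mean of y at the focal level of one predictor."""
    d_max = len(focal)  # assumption: one character per marker in the level code
    weights = [binomial_kernel(disagreements(focal, x), d_max, lam) for x in levels]
    return sum(w * v for w, v in zip(weights, y)) / sum(weights)


# Toy allelic data: one entry per observed allele and its phenotype.
levels = ["0", "0", "1", "1"]
y = [1.0, 3.0, 5.0, 7.0]

no_smoothing = nadaraya_watson("0", levels, y, lam=1.0)   # mean of y for allele 0
max_smoothing = nadaraya_watson("0", levels, y, lam=0.5)  # overall mean of y
```

With λ = 1 the estimate is the mean phenotype of the observations that carry the focal level, and with λ = 0.5 it collapses to the overall mean, which mirrors the two bounds of the smoothing parameter.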

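1. Initialise: $\hat{g}_j(x_{jl}) = \hat{g}_j^{0}(x_{jl})$ for j = 1, ..., N and l = 1, ..., q.
2. Cycle over j = 1, ..., N and l = 1, ..., q: $\hat{g}_j(x_{jl}) = \mathrm{NWR}(\tilde{y}_i \mid x_{ijk})$, followed by centring of the $\hat{g}_j(x_{jl})$.
3. Repeat step 2 until convergence is reached.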
In step one the nonparametric function values are initialised with some small numbers.
Step two comprises the application of the Nadaraya-Watson regression (denoted by NWR) in the form described in (5), but using $(\tilde{y}_i \mid x_{ijk})$ instead of $y_i$. The term $(\tilde{y}_i \mid x_{ijk})$ is called the partial residual and denotes the phenotype corrected for every predictor except for the level k of individual i at predictor j. Collinearities between the predictors result in a non-uniqueness of the estimates [5]. Therefore, the $\hat{g}_j(x_{jl})$ are centred in the second step by subtracting the mean of the fitted function values of the 2n chromosomes at predictor j. This centring ensures that the overall mean of the fitted function values is zero at every cycle of the backfitting and that the algorithm converges to one possible solution [5]. It might be noted that the backfitting algorithm is very similar to the Gauss-Seidel algorithm; further details can be found in [5].
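The backfitting cycle described above can be sketched in Python for the allelic model as follows. This is an illustration under stated assumptions rather than the authors' code: phenotypes are assumed to be centred, every predictor is assumed to segregate (both alleles observed), no relaxation factor is applied, and the names `kernel` and `backfit` as well as the data layout are hypothetical.

```python
def kernel(a, b, lam):
    """Binomial kernel for single alleles: lam if they agree, 1 - lam otherwise."""
    return lam if a == b else 1.0 - lam


def backfit(genotypes, y, lam, tol=1e-10, max_iter=1000):
    """genotypes[i][j] holds the two alleles (0 or 1) of individual i at predictor j."""
    n, n_pred = len(y), len(genotypes[0])
    g = [[0.0, 0.0] for _ in range(n_pred)]  # step 1: initialise function values
    total = [0.0] * n                        # sum of all function values per individual
    for _ in range(max_iter):                # step 3: repeat step 2 until convergence
        max_change = 0.0
        for j in range(n_pred):              # step 2: cycle over the predictors
            # One partial residual per observed allele: the phenotype corrected
            # for everything except this allele's own function value.
            obs = []
            for i in range(n):
                for allele in genotypes[i][j]:
                    obs.append((allele, y[i] - total[i] + g[j][allele]))
            new = []
            for focal in (0, 1):
                w = [kernel(focal, a, lam) for a, _ in obs]
                new.append(sum(wi * r for wi, (_, r) in zip(w, obs)) / sum(w))
            # Centre: subtract the mean fitted value over the 2n chromosomes.
            mean_fit = sum(new[a] for a, _ in obs) / len(obs)
            new = [v - mean_fit for v in new]
            max_change = max(max_change, abs(new[0] - g[j][0]), abs(new[1] - g[j][1]))
            for i in range(n):
                for allele in genotypes[i][j]:
                    total[i] += new[allele] - g[j][allele]
            g[j] = new
        if max_change < tol:
            break
    return g


# Toy example: one predictor, purely additive phenotypes (allele 0: -1, allele 1: +1).
genotypes = [[(0, 0)], [(0, 1)], [(1, 1)], [(1, 0)]]
y = [-2.0, 0.0, 2.0, 0.0]
g_unsmoothed = backfit(genotypes, y, lam=1.0)
g_smoothed = backfit(genotypes, y, lam=0.5)
```

In this toy case the unsmoothed fit recovers the two allelic values, whereas maximum smoothing shrinks both function values to zero.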
Choosing the smoothing parameter λ
In applying kernel regression, one key question is which value of the smoothing parameter λ should be used. As stated above, when a binomial kernel is applied, the lower and upper bounds of λ are 0.5 and 1, respectively. When λ = 1 the whole weight of $K(x_{jl}, x_{ijk}, \lambda)$ is concentrated at $x_{ijk} = x_{jl}$ and $\hat{p}(x_{jl})$ in (3) is just the proportion of cases in which $x_{jl}$ was observed in the sample. In contrast, when λ = 0.5, the degree of smoothing is at its maximum and $K(x_{jl}, x_{ijk}, \lambda)$ gives the same weight to each of the $x_{jl}$ [11,7]. One way of selecting an appropriate λ is to apply bootstrapping as follows [13]. Assume a number of B bootstrap samples (b = 1, ..., B). In each b, the data points are split into two sets. The first set, denoted as the estimation set, is formed by the entire bootstrap sample and the second, denoted as the test set, is formed by the individuals not found in the corresponding bootstrap sample. Since a bootstrap sample is generated by drawing n observations out of the original pool of n observations with replacement [13], the probability of any given individual being chosen after n drawings is

$$1 - \left(1 - \frac{1}{n}\right)^{n} \approx 1 - e^{-1} \approx 0.632.$$

For a given λ, the functions are estimated in the estimation set of each bootstrap sample and used to predict the phenotypes of the individuals in the corresponding test set. The residual sum of squares of individual i is

$$RSS_i = \frac{1}{B_i} \sum_{b \in C_i} \left(y_i - \hat{y}_i^{b}\right)^2, \tag{7a}$$

where $C_i$ is the set of the $B_i$ bootstrap samples that do not contain individual i and $\hat{y}_i^{b}$ is the phenotype of individual i predicted from the functions estimated in bootstrap sample b. This means that only those bootstrap samples are considered where the corresponding individual i was not in the estimation set, but in the test set. Averaging over all individuals yields

$$aveRSS = \frac{1}{n} \sum_{i=1}^{n} RSS_i. \tag{7b}$$

Note that the subscript i denotes the individual. The λ that produced the smallest aveRSS can be chosen to analyse the original sample. This method is termed the equal lambda method (ELM) in the following, because λ takes the same value for each predictor.
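The bootstrap criterion can be sketched in Python as shown here. The callback `fit_predict(sample, i, lam)` is a hypothetical stand-in for fitting the additive model to a bootstrap sample and predicting the phenotype of individual i. The toy callback used here (a shrunken sample mean) is only an assumption to keep the example self-contained and is not the model of this study.

```python
import random


def bootstrap_samples(n, n_boot, seed=0):
    """Draw n_boot bootstrap samples of n individual indices with replacement."""
    rng = random.Random(seed)
    return [[rng.randrange(n) for _ in range(n)] for _ in range(n_boot)]


def ave_rss(y, samples, fit_predict, lam):
    """Average squared prediction error, using for each individual only the
    bootstrap samples in which that individual does not occur (the test set)."""
    per_individual = []
    for i in range(len(y)):
        errors = [(y[i] - fit_predict(s, i, lam)) ** 2 for s in samples if i not in s]
        if errors:  # individuals present in every sample cannot be evaluated
            per_individual.append(sum(errors) / len(errors))
    return sum(per_individual) / len(per_individual)


def choose_lambda(y, samples, fit_predict, grid):
    """Return the grid value with the smallest average prediction error."""
    return min(grid, key=lambda lam: ave_rss(y, samples, fit_predict, lam))


# Toy illustration with hand-picked "bootstrap" samples and a stand-in model.
y = [1.0, 2.0, 3.0]
samples = [[0, 0, 1], [1, 2, 2]]


def shrunken_mean(sample, i, lam):
    """Stand-in predictor: lam times the mean phenotype of the sample."""
    return lam * sum(y[k] for k in sample) / len(sample)


best = choose_lambda(y, samples, shrunken_mean, [0.5, 0.75, 1.0])
```

In practice the samples would come from `bootstrap_samples(n, B)` and the callback would run the backfitting algorithm on each estimation set, which is where the computational cost arises.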
Different λ might be optimal for different predictors, and a predictor-specific determination of λ is desirable. In principle, the bootstrap strategy can be expanded accordingly. However, this would need B times N times the number of λ values in the grid calculations, which is computationally not feasible. Additionally, the constellation that results in the smallest aveRSS might be difficult to find. In a preliminary analysis we investigated the optimal degree of smoothing for predictors, taking the knowledge of the simulated QTL into account. The degree of smoothing was less for predictors in LD with a QTL compared to predictors not in LD with a QTL. Additionally, predictors that showed a similar variance of their function values also showed a similar optimal λ. This led to the following algorithm for the group-wise predictor-specific determination of λ, subsequently named the unequal lambda method (ULM).
1. Determine one λ valid for all predictors using ELM.
2. Estimate the variance of the q function values for each predictor (q = 2 in the allelic and q = 4 in the haplotype model, see above).
3. Select those m (e.g. m = 5) predictors which show the highest variance and determine an optimal λ for them using bootstrapping, but letting the lower bound of λ be as determined in ELM. The λ for the remaining predictors are fixed at the value determined by ELM.
4. Repeat step 3 for the next set of m predictors, which show the next highest variance. Here, keep λ for the remaining predictors fixed at their determined value, i.e. from ELM for predictors with a lower variance, and from step (3) otherwise.
5. Repeat step 4 until all predictors have been processed.
Finally, the original sample is analysed with the group-wise predictor-specific λ.
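The group-wise search can be sketched in Python as below. The callback `criterion(lams)` is a hypothetical stand-in for the bootstrap aveRSS obtained with a given list of predictor-specific λ values. The toy criterion, the function names and the use of the population variance for ranking are assumptions made only for illustration.

```python
from statistics import pvariance


def variance_ranked_groups(g, m):
    """Order predictors by decreasing variance of their function values
    and split them into consecutive groups of size m."""
    order = sorted(range(len(g)), key=lambda j: pvariance(g[j]), reverse=True)
    return [order[s:s + m] for s in range(0, len(order), m)]


def unequal_lambda(g, lam_elm, grid, m, criterion):
    """Group-wise search for predictor-specific smoothing parameters."""
    lams = [lam_elm] * len(g)                             # step 1: start from ELM
    candidates = [lam for lam in grid if lam >= lam_elm]  # lower bound from ELM
    for group in variance_ranked_groups(g, m):            # steps 2 to 5
        def score(lam):
            trial = list(lams)
            for j in group:
                trial[j] = lam
            return criterion(trial)
        best = min(candidates, key=score)
        for j in group:
            lams[j] = best
    return lams


# Toy illustration: function values of four predictors and a stand-in criterion.
g = [[-0.1, 0.1], [-1.0, 1.0], [0.0, 0.0], [-0.5, 0.5]]
target = [0.6, 0.9, 0.5, 0.9]  # hypothetical optimum per predictor


def toy_criterion(lams):
    return sum((a - b) ** 2 for a, b in zip(lams, target))


groups = variance_ranked_groups(g, 2)
lams = unequal_lambda(g, 0.6, [0.5, 0.6, 0.7, 0.8, 0.9, 1.0], 2, toy_criterion)
```

Each group requires one pass over the candidate grid, so the number of bootstrap analyses grows with N/m times the grid size rather than with all combinations of predictor-specific values.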

BLUP method for genomic breeding value estimation
The BLUP model of Meuwissen et al. [1] can be applied in an allelic model or in a haplotype model. For simplicity, only the allelic BLUP model will be considered in the following. In Meuwissen et al. [1] it is assumed that the variance of a marker effect is $\sigma_a^2/(2N)$, with $\sigma_a^2$ being the additive genetic variance. Note that each marker affects the phenotype two times, via the paternal and the maternal allele, hence the 2N in the denominator. If the unequal gene frequencies at the markers are taken into account, the variance of a marker effect becomes $\sigma_a^2/(4N\bar{H})$, with $\bar{H}$ being the average heterozygosity across markers. The derivation is given in Appendix 1, and can also be found in Habier et al. [14] using a different approach. If $\bar{H}$ equals 0.5 (i.e. the allele frequency at every marker is 0.5), the expression reduces to $\sigma_a^2/(2N)$.
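As a small numerical check of the two variance expressions, the following Python sketch computes the marker effect variance from the allele frequencies. It assumes that the heterozygosity of a marker is the expected value 2p(1 - p); the function name is hypothetical.

```python
def marker_effect_variance(sigma2_a, allele_freqs):
    """Additive genetic variance divided by 4 * N * average heterozygosity."""
    n_markers = len(allele_freqs)
    # Assumption: expected heterozygosity 2p(1 - p) at each marker.
    mean_het = sum(2.0 * p * (1.0 - p) for p in allele_freqs) / n_markers
    return sigma2_a / (4.0 * n_markers * mean_het)


# With all allele frequencies at 0.5 the value equals sigma2_a / (2N).
equal_freq = marker_effect_variance(1.0, [0.5] * 100)
# Lower heterozygosity gives a larger variance per marker effect.
skewed_freq = marker_effect_variance(1.0, [0.1] * 100)
```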

Simulations
In order to test the ability of the additive nonparametric regression models to predict reliable breeding values, and to compare the results with those obtained from BLUP, a simulation study was conducted. The simulations were performed as described by Solberg et al. [2]. Briefly, a population was simulated over 1000 generations with mutations and random selection and mating with an effective population size of 100. Ten chromosomes each of 100 cM length and each with 100 potential QTL evenly distributed over the chromosome were generated. The number of segregating QTL depended on the mutation rate at the QTL, which was assumed to be $2.5 \times 10^{-5}$ [2]. For each mutation at the QTL an additive effect was sampled from the gamma distribution with a shape and a scale parameter of 1.66 and 0.4, respectively [15]. This implied that many QTL had small and only few had large effects. QTL effects were sampled such that they had equal probability of positive or negative effects. QTL effects were simulated to be additive. The marker density was 1 cM, 0.5 cM or 0.25 cM. The mutation rate at the markers was assumed to be $2.5 \times 10^{-3}$ [2]. Markers showed in general multiple alleles. In order to reflect SNP markers, they were converted to biallelic markers by assuming that only one of the mutations was visible as described by Solberg et al. [2]. The proportion of segregating SNPs (segregating QTL) was around 98% (5-6%) of the number of simulated markers (QTL) at generation 1000. In generation 1001, the number of animals was increased to 1000 by factorial mating. The LD of pairs of segregating markers was estimated as the $r^2$ value in generation 1001. The average $r^2$ of two adjacent segregating markers was 0.158, 0.222, and 0.295 for the marker density 1 cM, 0.5 cM and 0.25 cM, respectively [2]. The animals in generation 1001 produced 1000 offspring for generation 1002 by random mating.
Animals in generations 1001 and 1002 were genotyped at the SNP markers and animals in generation 1001 were also phenotyped. The phenotypes were the sum of their simulated breeding value and a random deviation e ($e \sim N(0, \sigma_e^2)$). $\sigma_e^2$ was chosen such that the heritability of the trait was $h^2 = 0.25$ or $h^2 = 0.5$. For the haplotype model, the simulated haplotypes were used (no extra haplotype determination was performed). The number of replicates was 10 for each marker density and each $h^2$.
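The sampling of the QTL effects described above might look as follows in Python. This is only a sketch of that single step, not the simulation program of Solberg et al. [2]; the function name and the seed are assumptions. In Python's `random.gammavariate(alpha, beta)`, `alpha` is the shape and `beta` the scale parameter.

```python
import random


def sample_qtl_effects(n_qtl, shape=1.66, scale=0.4, seed=1):
    """Gamma-distributed effect sizes with a random sign (equal probability)."""
    rng = random.Random(seed)
    effects = []
    for _ in range(n_qtl):
        size = rng.gammavariate(shape, scale)       # alpha = shape, beta = scale
        sign = 1.0 if rng.random() < 0.5 else -1.0  # positive or negative effect
        effects.append(sign * size)
    return effects


effects = sample_qtl_effects(20000)
mean_abs = sum(abs(e) for e in effects) / len(effects)        # near shape * scale
share_positive = sum(1 for e in effects if e > 0) / len(effects)
```

The mean absolute effect is close to the product of shape and scale (about 0.66), and the distribution is strongly right-skewed, so most sampled effects are small.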
In the additive nonparametric regression, the functions were estimated using the data from generation 1001. These were used to predict the breeding values (EBV) of generation 1002 as

$$EBV_i = \sum_{j=1}^{N} \sum_{k=1}^{2} \hat{g}_j(x_{ijk}).$$

The smoothing parameter λ was varied as λ = 0.5, 0.525, .... A total of B = 50 bootstrap samples were generated. For ULM, the group size for the group-wise predictor-specific determination of λ was m = 5, 10 and 20 for a marker density of 1 cM, 0.5 cM and 0.25 cM, respectively. The convergence criterion to exit the backfitting algorithm was an average change of the function values of two consecutive iterations below $2.5 \times 10^{-5}$. A relaxation factor [e.g. [16]] of 0.7 was included. Additionally, generation 1001 was analysed using the BLUP model described above, assuming the variance of the effects of each marker is $\sigma_a^2/(4N\bar{H})$, with $\bar{H}$ the average heterozygosity across markers, and using the simulated variance components.
The BLUP system of equations was solved iteratively by applying the Gauss-Seidel algorithm [e.g. [16]]. The same convergence criterion as for the nonparametric additive model was used. These estimates were also used to predict the breeding values of generation 1002.
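A Gauss-Seidel solver for a marker BLUP system can be sketched in Python as follows. The sketch assumes centred phenotypes, a design matrix `Z` of marker covariates, and a single variance ratio `ratio` (residual variance divided by the marker effect variance) added to the diagonal of the normal equations. These names, the missing overall mean, the maximum-change stopping rule and the absence of a relaxation factor are simplifying assumptions, not details taken from the study.

```python
def gauss_seidel_blup(Z, y, ratio, tol=1e-10, max_iter=10000):
    """Solve (Z'Z + ratio * I) a = Z'y one marker effect at a time."""
    n, p = len(y), len(Z[0])
    a = [0.0] * p
    resid = list(y)  # current residuals y - Z a (a starts at zero)
    diag = [sum(Z[i][j] ** 2 for i in range(n)) + ratio for j in range(p)]
    for _ in range(max_iter):
        max_change = 0.0
        for j in range(p):
            # Right-hand side for marker j with all other effects held fixed.
            rhs = sum(Z[i][j] * resid[i] for i in range(n)) + (diag[j] - ratio) * a[j]
            new = rhs / diag[j]
            delta = new - a[j]
            for i in range(n):  # keep the residuals in step with the new effect
                resid[i] -= Z[i][j] * delta
            a[j] = new
            max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return a


# Toy example with two markers and two records.
Z = [[1.0, 1.0], [1.0, 0.0]]
y = [3.0, 1.0]
effects = gauss_seidel_blup(Z, y, ratio=1.0)
```

Updating the residuals after every single effect avoids building the coefficient matrix explicitly, which matters when the number of markers is large.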
The correlation between the true breeding value (TBV) and the EBV of the individuals in generation 1002, as well as the regression coefficient of the TBV on the EBV, were estimated and served as empirical measures of the ability of the methods to predict accurate and unbiased breeding values of individuals without own phenotypic observations [1]. Unbiased means here E(TBV|EBV) = EBV, and a regression coefficient below one (above one) indicates that the EBV vary too much (too little). Unbiased EBV are important if selection has to be carried out from multiple generations using estimated marker effects in one generation. Assume selection will be done across two year classes, where the marker effects are estimated in the older year class only. Further assume that the younger year class is in general superior (i.e. has a higher population mean) due to selection response. If the EBV vary too much (too little) then too many animals will be selected from the older (younger) year class.
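The two evaluation measures can be computed as in the following Python sketch. The function name and the toy values are illustrative assumptions.

```python
def accuracy_and_bias(tbv, ebv):
    """Return the correlation between TBV and EBV and the regression
    coefficient of TBV on EBV (covariance divided by the EBV variance)."""
    n = len(tbv)
    mean_t, mean_e = sum(tbv) / n, sum(ebv) / n
    cov = sum((t - mean_t) * (e - mean_e) for t, e in zip(tbv, ebv))
    var_t = sum((t - mean_t) ** 2 for t in tbv)
    var_e = sum((e - mean_e) ** 2 for e in ebv)
    return cov / (var_t * var_e) ** 0.5, cov / var_e


# Toy case: EBV are perfectly correlated with TBV but vary twice as much.
tbv = [1.0, 2.0, 3.0, 4.0]
ebv = [2.0, 4.0, 6.0, 8.0]
accuracy, slope = accuracy_and_bias(tbv, ebv)
```

Here the accuracy is one while the regression coefficient is 0.5, which illustrates that a high correlation does not rule out EBV that vary too much.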

Results
The results are shown in Tables 1 and 2. Summarized over all genetic configurations analyzed, the accuracies of EBVs obtained from ULM were highest. However, these were also most biased, as indicated by the in general lower regression coefficients. The accuracies from ELM and BLUP were very similar.
The impact of the heritability can be seen when comparing the results reported in Table 1 with those in Table 2. As expected, the accuracies of the EBVs were higher for a heritability of 0.5. Additionally, the EBVs were in general less biased for the higher heritability. This was most obvious for ULM. Increasing marker density led to higher accuracies of EBVs for all methods. With increasing marker density the regression coefficient of the true on the estimated breeding value decreased for ELM and ULM, resulting in general in an increased bias with increasing marker density. One exception is ELM at a marker density of 1 cM, where the EBVs vary too little. Here, the bias decreased when moving to a marker density of 0.5 cM (see second row of Tables 1 and 2). In contrast, with increasing marker density the regression coefficient increased for BLUP.
[Table 1 notes: The heritability was 0.25. Averages from 10 replicates. ELM and ULM denote the equal lambda and unequal lambda method, respectively. (a) Correlation between true and estimated breeding value; standard deviations are in parentheses. (b) Regression of true on estimated breeding value; standard deviations are in parentheses.]
The differences between the allelic and the haplotype model were small, regardless of the method used (Tables 1 and 2). The haplotype model produced slightly better results in low marker density situations, but with dense markers the accuracies from the allelic and the haplotype model were very similar. The same was reported for the BayesB method [17,2].
The computational demand increased in the order BLUP, ELM and ULM. For example, one replicate with a marker density of 1 cM analysed with the allelic model took less than one minute when using BLUP, around one hour for ELM and several hours for ULM. The reason is that ELM and ULM included bootstrapping to determine the optimal λ. Naturally, the computation time would be even higher if the number of bootstrap samples (B) were larger. It seems that B = 50 is at the lower bound when compared with literature reports [13]. However, increasing B did not produce significantly different results (not shown), indicating that B = 50 was sufficient here.
The time to reach convergence depended on λ and the marker density. With increasing λ and increasing marker density, more iterations were needed until convergence was reached. For example, in general the number of iterations for λ = 0.6 was ~15 and for λ = 0.9 was ~50 for a marker density of 1 cM. The same figures for a marker density of 0.25 cM were ~20 and ~90, respectively. Figures 1 and 2 show that during the grid search for the optimal λ, the accuracy increased monotonically with increasing λ and decreased monotonically after the optimum λ was passed. Therefore, in order to speed up computations, the grid was started at the lower bound of λ and was ended when the aveRSS from (7a) and (7b) stopped decreasing, assuming that the optimal λ was reached or is not far away. The grid was started at the lower bound because convergence is reached fast if λ is small (see above). Additionally, if aveRSS failed to decrease due to random sampling before the optimal λ was reached, this would result in over-smoothing, and hence the results would be conservative.
For ULM the numbers of predictors with a λ within a defined bin are shown in Tables 3 and 4. A higher marker density results in more predictors that are less smoothed, i.e. showing a λ closer to one. This is due to the higher number of predictors in LD with the QTL. Also, with an increased heritability more predictors are less smoothed (top and bottom of Tables 3 and 4). The grid search for finding the optimal λ is more powerful in high heritability situations, leading to this lesser degree of smoothing. Additionally, as for ELM, more smoothing is done in the haplotype model than in the allelic model. This can be seen in the higher number of predictors showing a λ > 0.9 in the allelic model (Tables 3 and 4).

Discussion
As stated in the introduction, in genomic breeding value estimation we are faced with the problem of estimating many effects from a limited number of observations, and, additionally, many effects show collinearities due to the LD between the SNPs. The BLUP model overcomes these problems by treating the predictors as random variables and estimating them simultaneously. In the nonparametric kernel regressions (ELM and ULM), the numerous effects are estimable by smoothing the phenotypes against one predictor at a time, assuming that the effects of the remaining predictors are removed from the phenotypes. Of course, the true effects of the remaining predictors are unknown and have to be estimated themselves, resulting in the iterative backfitting algorithm [5]. Nuisance factors can be included in the algorithm and can be estimated parametrically using least squares. The model is then semiparametric and the backfitting algorithm iterates between the parametric part (i.e. estimating the effects of the nuisance factors by least squares) and the nonparametric part (i.e. estimating the SNP function values by the Nadaraya-Watson regression), without changing the general structure of the algorithm [5].
[Figure 1: Results from the allelic additive nonparametric regression]
[Figure 2: Results from the haplotype additive nonparametric regression]
Using kernel regression, the choice of the appropriate degree of smoothing is important, and it depends on the sample size. Naturally, if the sample size grows to infinity, smoothing is almost not required [7] and hence λ should be close to 1. However, sample size is never infinite, and, therefore, λ has to be chosen carefully, taking the sample size into account. Indeed, in ELM the optimal λ for a marker density of 1 cM, a heritability of 0.5 and applying the allelic model is 0.74 (Figure 1a). If the size of the data set were only 500, the optimal λ would be 0.65 (not shown elsewhere). The applied bootstrap strategy takes the sample size into account, because the estimation set is of equal size as the full data set. In ELM the λ determined by bootstrapping was very close to the optimal λ. This can be seen by comparing the results reported in Table 2 for the ELM with the maximum achievable accuracies shown in Figures 1 and 2. Alternatively, leave-one-out cross validation is suggested [13,7]. Using this method, for a given λ, the functions are fitted using all but one observation and then the prediction error of this observation is calculated given the fitted functions. This is repeated for all observations. The λ that produces the lowest average prediction error is chosen to be the optimal λ. However, this strategy would require running the analysis n times, which would be computationally too demanding in the present data sets. The bootstrap as applied in this study is related to this cross-validation strategy; see [13] for a detailed discussion.
When nuisance factors are included in the model and the number of data points in some classes is very low, it might happen that in some bootstrap samples these effects are not estimable or estimated poorly. One obvious solution is to use only those bootstrap samples where the number of data points in each class is above a defined threshold.
Since it is assumed that the nuisance effects and the SNP effects are independent, this would not affect the results regarding the choice of the appropriate λ.
The optimal λ depended on the marker density. With increasing density, more smoothing (i.e. a lower λ) was required. This is because a QTL effect is represented by all SNPs that are in LD with the QTL. With an increasing number of SNPs being in LD with the QTL, each SNP captures a smaller part of the QTL effect, and hence, requires more smoothing. Naturally, the number of SNPs in LD with the QTL is higher in high marker density situations. Additionally, with an increasing number of SNPs, more SNPs show spurious effects by chance, and hence, more smoothing is required to minimise the impact of these spurious effects. In this study the markers were equally distributed across the chromosomes. In practice it might happen that this is not the case and some QTL are in LD with many markers (requiring more smoothing) whereas others are in LD with only few markers (requiring less smoothing). It can be assumed that ULM might cope with unequal marker densities better than ELM and BLUP, because of the group-wise specific estimation of λ.
The results from the allelic BLUP and the allelic ELM are very similar (Tables 1 and 2). This might be intuitively surprising, because of the different assumptions underlying these models. However, we compared both models formally and found close similarities between them, leading to the similar results. [...] Tables 3 and 4), resulting in the higher accuracies of the EBVs estimated by ULM (Tables 1 and 2). The standard deviations in Tables 3 and 4 are high for λ > 0.7. This might be due to the difficulty in finding the optimal λ and additionally due to the unequal distribution of the simulated QTL effects. As described above, these followed a gamma distribution with a high density for small and a low density for large effects [15]. Hence, some replicates might show several big QTL, resulting in more predictors with a large λ, whereas other replicates might show only small or medium sized QTL, so that the number of predictors with a λ close to one is small in these replicates as well.
In ULM a critical question is how large the group size (m) should be. Alternatively, the algorithm could have been repeated several times with updated λ and stopped when the λ did not change anymore, which would be, however, computationally very demanding.
It may be possible to estimate λ by the use of a prior distribution in ULM. One possibility for such a procedure would be to sample λ from a mixture of two distributions, one for predictors in LD with a QTL and the second component of the mixture for predictors not associated with a QTL. The latter distribution would put significantly more, if not all, probability mass at λ equal to 0.5 (smoothing is at maximum), whereas the first one would support less smoothing. However, as the models were implemented in this study, they do not use any prior information, in contrast to BayesB of Meuwissen et al. [1]. A comparison of the results presented in Table 2 with those of Solberg et al. [2], who simulated the same genetic configuration but applied BayesB, suggests that the accuracy of ULM is lower compared to the accuracies of BayesB in the allelic case.

Conclusion
Nonparametric additive regression models for genomic breeding value estimation were shown to estimate breeding values of individuals without phenotypic information with moderate to high accuracy. The optimal degree of smoothing was determined either for all predictors jointly (ELM) or for groups of predictors separately (ULM). The latter increased the accuracies of the EBVs. The accuracies of the superior model, the ULM model, are in general slightly lower compared to BayesB. The behaviour of these models for the estimation of genomic breeding values [...]

[...] The BLUP estimate of the M2 effect can be derived in the same way.
According to eq (5) of the main text, the nonparametric function value of M1 can be written as

$$\hat{g}(M1) = \frac{\sum_{i} \sum_{k} v_{ik} \, y_i}{\sum_{i} \sum_{k} v_{ik}}, \tag{A3}$$

with $v_{ik} = K(M1, x_{ik}, \lambda)$. As shown in the main text, in the allelic model q equals 2 and d can take the values 0 or 1, depending on the number of disagreements between the focal (M1) and the observed allele $x_{ik}$, and therefore $v_{ik}$ can take only two values, $v_1$ ($v_2$) for phenotypes associated with M1 (M2). Following this, (A3) results in

$$\hat{g}(M1) = \frac{v_1 \sum_{i=1}^{n_{M1}} y_{M1,i} + v_2 \sum_{i=1}^{n_{M2}} y_{M2,i}}{n_{M1} v_1 + n_{M2} v_2}, \tag{A4}$$

where $y_{M1,i}$ and $y_{M2,i}$ denote the phenotypes associated with M1 and M2, respectively, and $n_{M1}$ and $n_{M2}$ are their numbers. This can be written as

$$\hat{g}(M1) = w_1 \bar{y}_{M1} + w_2 \bar{y}_{M2}, \qquad w_1 = \frac{n_{M1} v_1}{n_{M1} v_1 + n_{M2} v_2}, \quad w_2 = \frac{n_{M2} v_2}{n_{M1} v_1 + n_{M2} v_2}, \tag{A5}$$

where $\bar{y}_{M1}$ and $\bar{y}_{M2}$ are the mean phenotypes associated with M1 and M2, with $w_1 + w_2 = 1$ and both weights nonnegative. Here $w_1$ and $w_2$ depend on the degree of smoothing (λ) and on $n_{M1}$ and $n_{M2}$. The nonparametric function value of M2 can be expressed in the same way. Eq (A5) has the same form as (A2); hence, by choosing λ appropriately, such that the weights $w_1$ and $w_2$ are similar or the same in BLUP and in the nonparametric regression, both models become similar or the same. If one λ is used across all loci, it becomes impossible to choose a λ such that the weights $w_1$ and $w_2$ are equal for both models for all loci. It may however be possible to choose λ such that these weights are very similar.
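The weighted-mean form of the nonparametric function value can be checked numerically with the short Python sketch below. It assumes the allelic binomial kernel with values λ for agreement and 1 - λ for disagreement (any common factor cancels in the ratio). The function names and toy data are hypothetical.

```python
def nw_allelic(focal, alleles, y, lam):
    """Nadaraya-Watson estimate at the focal allele with the binomial kernel."""
    w = [lam if a == focal else 1.0 - lam for a in alleles]
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)


def mean_weights(lam, n_same, n_other):
    """Weights on the mean phenotypes of the focal and the alternative allele."""
    denom = n_same * lam + n_other * (1.0 - lam)
    return n_same * lam / denom, n_other * (1.0 - lam) / denom


# Toy data: three phenotypes associated with M1 (coded 1), one with M2 (coded 0).
alleles = [1, 1, 1, 0]
y = [2.0, 4.0, 6.0, 10.0]
lam = 0.8

w1, w2 = mean_weights(lam, n_same=3, n_other=1)
direct = nw_allelic(1, alleles, y, lam)
via_means = w1 * 4.0 + w2 * 10.0  # 4.0 and 10.0 are the two allele means
```

Both routes give the same value, and the weight on the alternative allele's mean goes to zero as λ approaches one.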