Bayesian neural networks with variable selection for prediction of genotypic values.

BACKGROUND
Estimating the genetic component of a complex phenotype is a complicated problem, mainly because there are many allele effects to estimate from a limited number of phenotypes. In spite of this difficulty, linear methods with variable selection have been able to give good predictions of additive effects of individuals. However, prediction of non-additive genetic effects is challenging with the usual prediction methods. In machine learning, non-additive relations between inputs can be modeled with neural networks. We developed a novel method (NetSparse) that uses Bayesian neural networks with variable selection for the prediction of genotypic values of individuals, including non-additive genetic effects.


RESULTS
We simulated several populations with different phenotypic models and compared NetSparse to genomic best linear unbiased prediction (GBLUP), BayesB, their dominance variants, and an additive by additive method. We found that when the number of QTL was relatively small (10 or 100), NetSparse had 2 to 28 percentage points higher accuracy than the reference methods. For scenarios that included dominance or epistatic effects, NetSparse had 0.0 to 3.9 percentage points higher accuracy for predicting phenotypes than the reference methods, except in scenarios with extreme overdominance, for which reference methods that explicitly model dominance had 6 percentage points higher accuracy than NetSparse.


CONCLUSIONS
Bayesian neural networks with variable selection are promising for prediction of the genetic component of complex traits in animal breeding, and their performance is robust across different genetic models. However, their large computational costs can hinder their use in practice.


Background
The biochemical mechanisms that underlie phenotypes work through non-linear interactions between molecules and proteins. Nevertheless, in the practice of animal breeding, additive prediction methods, which assume that phenotypes depend on markers individually and without interactions between them, have been successful. The U-shaped allele frequency of causal loci explains why these microscopic interactions give rise to traits mainly due to additive genetic variance [1], and therefore the success of additive methods. However, traits still have an epistatic component and therefore methods that can fit more than the additive genetic component have the potential to better predict genotypic values and phenotypes of animals.
In reality, the causal variants of a trait are not necessarily among the markers that are used for genomic prediction. Therefore, not all markers may aid in the prediction of genetic values. Prediction methods may therefore be optimized if they allow to model a proportion π of the total number of markers as irrelevant for phenotype prediction. Additive methods such as BayesB [2] and BayesCπ [3] allow for this variable selection of markers van Bergen et al. Genet Sel Evol (2020) 52:26 (sparsity). Depending on the genetic architecture of the trait, additive Bayesian variable selection methods may have a small advantage over genomic best linear unbiased prediction (GBLUP) [4][5][6].
Parametric methods assume a specific functional form. Members of this family of methods are the additive and dominance methods of quantitative genetics. In addition, parametric methods assume that genetic variance can be decomposed orthogonally into additive, dominance, additive × additive, dominance × additive etc. variance components. This orthogonal decomposition is valid only under restricted assumptions such as linkage equilibrium, random mating and no inbreeding [7]. Since these assumptions are invalid in practice, prediction methods that do not make them have the potential to obtain better predictive performance of traits than methods that do.
Non-parametric methods assume neither a particular form of the unknown relation from genetic material to genotypic values, nor the aforementioned partitioning of genetic variance. Because these models do not distinguish between the different variance components, they predict genotypic values instead of breeding values, where breeding values correspond only to the additive component of genotypic values. In genetics, reproducing kernel Hilbert space regression [8] and neural networks have been studied as non-parametric methods for the prediction of phenotypes. Neural networks, in particular, are powerful and interesting non-parametric methods because they can approximate any function, and current software packages make them easy to use.
Neural network models that have been investigated in animal breeding include Bayesian regularized artificial neural network (BRANN) [9,10], scaled conjugate gradient artificial neural network (SCGANN) [10] and approximate Bayesian neural network [11] methods. BRANN is a neural network method that avoids overfitting by means of Bayesian regularization via Bayesian prior distributions, and achieved higher accuracy than additive methods for prediction of milk production in Jersey cows. For marbling score in Angus cattle, BRANN had a higher predictive accuracy than both additive methods and SCGANN, which is a neural network method without Bayesian regularization. These results imply that Bayesian regularization can have a benefit for prediction of traits. BRANN has been used for the ranking of markers based on their impact on the network. [12] While this approach can help to identify the most important markers, it does not promote sparsity during inference since the ranking is performed as a separate step after inference.
There are several neural network methods that try to achieve sparsity with different regularizations on the weights (parameters of the neural network). For example, ℓ 1 (Lasso) regularization causes as many weights as possible to reach zero and has previously been studied in animal breeding [13], and group Lasso, which allows for pruning weights in groups instead of individually, has been studied for image classification [14]. Sparsity based on an ℓ 0 regularization for individual weights has also been studied for image classification [15]. These approaches are based on Maximum a Posteriori inference, and typically focus on sparsity of all network nodes or weights, rather than sparsity of inputs (markers) specifically.
In summary, there are phenotype prediction methods that allow for variable selection of markers and there are non-parametric methods that allow for fitting non-additive effects of individuals. However, there are no methods that allow for both. In this study, we introduce a method called NetSparse to fill the previously unexplored combination of both Bayesian variable selection and Bayesian neural networks for prediction of total genotypic values.
In the section Methods, the framework for Bayesian phenotype prediction is set and we explain how both additive Bayesian methods and NetSparse fit within this framework. The section Simulations describes the simulation of data used to compare NetSparse with other methods. In the section Results, we compare NetSparse to the reference methods GBLUP, BayesB, GBLUP-AD, BayesB-AD and GBLUP-ADAA.

Methods
First, we set up a general Bayesian framework for phenotype prediction methods, then we will describe how the different methods considered fit in this framework.

Bayesian phenotype prediction
We assume that N individuals (i = 1, . . . , N ) are phenotyped and genotyped, such that the observed phenotype y i is the sum of a genotypic value g i and a residual e i . In addition, we assume that the genotypic value of individuals can be computed from their marker genotypes by a function f (·; u) , depending on unknown parameters u (different for each method), and that the residuals e i follow a normal distribution with mean 0 and variance σ 2 e , which we write as e i ∼ N 0, σ 2 e . These assumptions lead to the following model: ⊺ is the vector with marker genotypes of individual i, the exact encoding of which depends on the method. We gather the vectors in the matrix X = (x 1 x 2 · · · x N ) . Model (1) implies that the likelihood of the data is: The likelihood is combined with a prior distribution p(u | η) , depending on hyperparameters η , which indicates what a priori are considered to be plausible values of u that could have generated our data.
In our terminology, we use the term "parameter" for u , these are variables which directly determine the model predictions. The term "hyperparameter" is used for η , these are the parameters which only indirectly influence the predictions of the model. The hyperparameters η and σ 2 e can crucially influence the performance of a model. There are roughly two approaches for these hyperparameters: they can be either estimated or integrated out. For the GBLUP models in particular, where the hyperparameters are the variance components, 1 this estimation can be done for instance with restricted maximum likelihood (REML) [16]. Alternatively, if estimation of these hyperparameters is difficult, they can be given prior distributions and integrated out together with the other parameters via Markov chain Monte Carlo sampling ("MCMC sampling" section).
The joint distribution of y and u conditioned on X is the product of the likelihood and prior distribution: The posterior distribution p u | X, y via Bayes' Theorem is then: This posterior distribution is the distribution over u obtained by combining the information in the prior distribution and the information in the observed data.
Model (1) can be used to make phenotype predictions y * for an individual with markers x * by computing the posterior predictive distribution: The expected value of y * with respect to the posterior predictive distribution is: p y, u | X = p y | X, u p(u).

Bayesian inference
Using the previous framework, we will briefly describe the five methods that we used for reference (GBLUP, BayesB, GBLUP-AD, BayesB-AD and GBLUP-ADAA), as well as our new method, NetSparse. Of these methods, GBLUP [17] was chosen because it is the most used in practice. We chose BayesB [2] because it is a common method that includes variable selection, similar to NetSparse.

GBLUP
The SNP-BLUP model is an additive model where each marker p is assigned an additive effect a p with shared prior variance σ 2 a . Specifically, in SNP-BLUP, f is chosen as a linear function f SNP-BLUP (x; u) = x ⊺ a , with u = a . The prior distribution over a is p a | σ 2 a = N a | 0, σ 2 a 1 . The hyperparameters σ 2 e , σ 2 a are estimated with REML. The posterior distribution over a is: where � −1 = σ 2 e 1 + σ 2 a X ⊺ X . The matrix X ⊺ X is proportional to the additive genomic relationship matrix G . The posterior predictive distribution is also Gaussian: This equivalent formulation of SNP-BLUP is called GBLUP and uses the additive relationship matrix G instead of allele effects, based on markers. For derivations of these formulas, see for instance [18]. We used the GBLUP implementation of MTG2 [19].

BayesB
The BayesB model, like SNP-BLUP, is an additive model where every marker is assigned an additive effect [2]. However, contrary to SNP-BLUP, BayesB also includes marker selection, which we will indicate by a (binary) marker selection vector s ∈ {0, 1} P . If an entry in this marker selection vector has the value 1, the corresponding marker is selected for inclusion in the model. If the entry is equal to 0 the marker is not selected. If the posterior distribution is concentrated at s p = 1 , then the p-th marker contributes significantly to phenotype prediction, (3) E[y * = f (x * ; u)] = y * p y * | X, y, x * , σ 2 e dy * = y * N y * | f (x * ; u), σ 2 e p u | X, y du dy * = f (x * ; u)p u | X, y du.
p a | X, y = N a | σ 2 a �Xy, � , (4) p y * | X, y, x * = N y * | σ −2 e x ⊺ * �Xy, x ⊺ * �x * . van Bergen et al. Genet Sel Evol (2020) 52 :26 while if most of the probability is concentrated at s p = 0 then the p-th marker does not contribute significantly to phenotype prediction. In addition, there is a hyperparameter π that is equal to the proportion of non-contributing markers, i.e. π = 1 − p s p /P . SNP-BLUP, in contrast, assumes that all additive effects come from the same normal distribution, making it similar, but not exactly equal, to BayesB with π = 0 . If a dataset contains relatively few quantitative trait loci (QTL), this mismatch should result in BayesB having a better performance than GBLUP.
Specifically, in BayesB, the function f is f BayesB (x; u) = a ⊺ (x ⊙ s) , with u = (a, s) and ⊙ the element-wise (Hadamard) product (x ⊙ s) p = x p s p . For each marker p, marker effect a p has prior distribution p a p | σ 2 a p = N a p | 0, σ 2 a p and hyperprior distribu- . The expression for the posterior distribution over u is: and the expected value of the posterior predictive distribution is 2 : The last line comes from the identity a ⊺ (x * ⊙ s) = (a ⊙ s) ⊺ x * and means that prediction can be obtained by averaging allele effects and then making predictions using those, instead of averaging predictions directly. The expectation value cannot be computed analytically, but it can be approximated by sampling ("MCMC sampling" section).

AD methods
The aforementioned additive methods can be adapted to fit additive and dominance effects. For the additive effects, markers were encoded as: and for the dominance effects, markers were encoded as: Note that for the GBLUP implementation and assuming Hardy-Weinberg equilibrium (HWE), the use of these encodings leads to additive and dominance relationship matrices as described in [20]. For the BayesB implementation, the two encodings are appended, such that every individual is represented by an array twice as long as for the additive models. Using GBLUP and BayesB with these longer arrays allows dominance to be fitted as well and we call the resulting methods GBLUP-AD and BayesB-AD [21,22]. Because they explicitly model additive and dominance effects, these methods should work best on data where both additive variance and dominance variance are significant.

GBLUP-ADAA
The AD construction for GBLUP can be extended further to fit additive by additive epistasis (section Simulations), in addition to additive and dominance effects, by adding a third covariance matrix, given by G ⊙ G . We call this method GBLUP-ADAA. As with GBLUP, the MTG2 software was also used for GBLUP-AD and GBLUP-ADAA.

NetSparse
In our NetSparse model ( Fig. 1), f is chosen as a neural network with one hidden layer 3 :  van  prediction of continuous outcomes, the output activation function g was chosen as the identity. For classification, a different transfer function, such as softmax, would be more appropriate, but such analyses fall outside the scope of this study. Given that the computational resources are sufficient, one would determine H via a cross-validation procedure. However, we did not have access to such resources, thus we used H = 20 , such that the model was able to fit complex non-linear interactions within reasonable computation time. A value of H larger than 20 led to an impractical increase in computation time.
A neural network can be interpreted as repeatedly taking linear combinations and elementwise application of an activation function ( tanh ). Given x ⊙ s , the value of h i depends linearly on the j-th column of W , and given each tanh (h i ) , the output of the network depends linearly on w . The non-linearity from tanh makes sure that the neural network can fit more than additive relations; if tanh , as well as g, was replaced by the identity function the network would only be able to fit linear functions.
Some of the prior distributions are: 4 where N 0, 2 2 denotes the half-normal distribution with scale parameter 2 2 , which is the same as a normal distribution with standard deviation 2 restricted to positive values only, and β is the precision parameter in the (Gaussian) likelihood. The prior distribution over s is the same as that of BayesB. The posterior distribution over u in NetSparse (2) is: As with BayesB, this expression can not be computed analytically, but it can be approximated by sampling.

MCMC sampling
The integral in (3) can be computed analytically for GBLUP, but not for BayesB (5) and NetSparse (11). To obtain an approximation to E[y * ] for these models, we do MCMC sampling to obtain samples from the joint posterior distribution over (u, η) . Given such samples , the expectation value of y * can be estimated as: For BayesB, we implemented a Gibbs sampler in the BGLR R package [23]. Instead of averaging predictions, the sampler averages allele effects, but this is equivalent (see (6)). For NetSparse, we used the PyMC3 package [24] to sample from the NetSparse posterior distribution, p u, η|X, y , which is the integrand of (11). The conditional distributions cannot be sampled from, directly, so a Gibbs sampler cannot be used, therefore PyMC3 uses a composite sampler, which alternatively uses MCMC samplers for the discrete ( s p ) and for the continuous variables (the rest). To sample s , we used a Metropolis-Hastings algorithm, where we iterate over the individual components in a random order. For each s p , we evaluate P 1 = p s p | rest and P 2 = p 1 − s p | rest , then we set s p ← 1 − s p with probability min(1, P 2 /P 1 ).
For the continuous parameters, we have the conditional posterior distribution p θ | X, y, s , where we write θ for the combination of all continuous variables: W, w, b h , b o , and η . This conditional posterior distribution is the integrand of (11). To sample these parameters, we used the Hamiltonian Monte Carlo sampler (HMC) [25,26]. HMC uses the same Metropolis-Hastings procedure as for s , but with a more complicated proposal. To generate a proposal, initialize θ(0) ← θ , and for each θ i draw a new variable r i (0) from a normal distribution and compute the energy . Given this initial state, generate a proposal state from (θ(0), r(0)) by numerically evolving it for a time T according to the Hamiltonian dynamics: This new state (θ(T ), r(T )) will have energy E T = H(θ(T ), r(T )) . This proposal is evaluated with a Metropolis-Hastings acceptance criterion: set θ ← θ(T ) with probability min (1, exp (E 0 − E T )) , otherwise θ ← θ(0) . The r i are discarded. We note that only the gradient of the posterior distribution is required, but not the matrix of second derivatives.
We used the NUTS variant of HMC [27]. For highdimensional models with continuous variables, using the gradient of the posterior distribution allows HMC to explore the parameter space faster than either Metropolis-Hastings or Gibbs [28] samplers [29], and therefore requires fewer sampler steps.
Besides computing the posterior distribution, simulation of the Hamiltonian dynamics also requires the gradient of the posterior distribution. PyMC3 calculates this gradient by the automatic differentiation capabilities of Theano [30].
We drew four independent chains of 1000 samples each, where for each chain the first 500 samples were used to tune the sampler and discarded, the last 500 samples of each chain were used for predictions. We also ran a few longer chains, but this did not change the results.

Simulations
To compare the performance of these methods, we evaluated them on populations in which the traits have different phenotypic models (additive, dominance and epistatic).

Population structure
Our aim was to simulate a population with a family structure and linkage disequilibrium pattern that roughly resemble those of livestock populations, using QMSim [31]. The historical population was simulated by mating 250 males with 250 females for 1900 generations to reach mutation-drift equilibrium. To mimick breed formation, a bottleneck was introduced by gradually decreasing the population size to 75 males and 75 females during the next five generations. This population size was maintained for 95 generations, and, then, population size was increased to 1050 (50 males and 1000 females) in the last historical generation. From the last historical generation, all males and females were randomly mated for 15 generations to create the current population. Litter size in the current population was 10, and at each generation all sires and dams were replaced to create non-overlapping generations. For all scenarios, the reference population consisted of 500 randomly sampled individuals from generation 14, and the validation population consisted of 2000 randomly sampled individuals from generation 15.

Genome
The genome consisted of 10 chromosomes, of 100 cM each. For each chromosome, 40 000 biallelic loci were simulated. Mutation rate in the historical generations was 2.5 · 10 −6 , and there was no mutation in the last 15 generations. From all loci segregating in generation 14, m loci were selected to become QTL, which varied across scenarios, and 5000 loci were selected to become markers. Although this density is lower than a typical commercial livestock SNP chip (60K), we chose this lower density to decrease computational demand. The markers were selected based on their allele frequency; the allele frequency distribution of markers was approximately uniform. The QTL were randomly selected and the allele frequency distribution of QTL was approximately U-shaped.

QTL effects
Additive effects (a) of QTL were sampled from a normal distribution with mean 0 and variance 1. Dominance factors ( δ ) were also sampled from a normal distribution, with varying mean and variance across scenarios.
Dominance effects (d) were computed as δ|a| [32,33]. Similar to dominance effects, we assumed that the magnitude of epistatic effects were proportional to the additive effects of the interacting QTL. For all m(m − 1)/2 pairwise combinations of QTL, epistatic factors ( γ ) were sampled from a normal distribution with mean 0 and variance 1. The epistatic effects ( ǫ ) between QTL k and l were computed as γ √ |a k a l |.

Breeding values, dominance deviations, epistatic deviations, and phenotypes
Breeding values ( A ) and dominance deviations ( D ) were simulated with genotype coefficient matrices that followed the natural and orthogonal interactions (NOIA) parameterization, as in [20]. With NOIA, the coefficient matrices are constructed such that the genetic effects ( A and D ) are statistically orthogonal, even in the absence of HWE. However, the epistatic values were simulated with epistatic coefficient matrices that followed one of three biological models for epistasis (Fig 2). The resulting epistatic values are not orthogonal to A and D , which means that A and D change when epistasis is simulated. Thus, we begin by explaining the simulation of epistatic deviations and subsequently discuss how A and D were computed. The first step was to compute epistatic values for all nine possible combinations of genotypes at loci k and l as c kl = tǫ kl , where ǫ kl is the epistatic effect between loci k and l, and t is a vector containing 9 ( 3 × 3 ) epistatic coefficients, following one of three epistasis models (Fig. 2). The coefficients in t were ordered from top-to-bottom and left-to-right (AABB, AaBB, aaBB, ..., aabb). Then, using the NOIA parameterization and the two-locus genotype frequencies, epistatic values were partitioned into nine statistically orthogonal effects following the procedure described in [20]: This procedure was repeated for all m(m − 1)/2 pairwise interactions between QTL.
The epistatic deviation of individual i was computed as: where h k a,i ( h l a,i ) is the additive genotype coefficient of individual i at locus k (l), and h k d,i ( h l d,i ) is the dominance genotype coefficient of individual i at locus k (l). Elements of the additive genotype coefficients, h k a,i , were encoded as in (7), where p AA , p Aa , and p aa are the genotype frequencies of marker k in the base generation (generation 14). Elements of the dominance genotype coefficients were encoded as in (8). The breeding value of individual i was computed as: where α k is the average effect of locus k, which was computed as: where p k is the allele frequency of locus k in generation 14. The dominance deviation of individual i was computed as: where d k ′ was computed as: Total genetic values were computed as TGV = BV + D + E . Phenotypes were computed as y = TGV + e , where e is a vector of random residuals, sampled from a normal distribution with mean zero and variance σ 2 e = σ 2 TGV , such that the broad sense heritability H 2 is equal to 50%.

Scenarios
As a base scenario, a purely additive trait with 300 QTL was simulated (Base). We varied the number of QTL to be 1000 ( S 1000 , 100 ( S 100 ), or 10 (S 10 ). Hereafter, we will call this characteristic of the trait "Sparsity". Dominance was varied by sampling dominance factors δ from N 0.6, 0.3 2 with the D medium scenario, or from N (1.2, 0.3 2 ) with the D extreme scenario, which is extreme overdominance. Following [1,34], epistasis was varied by applying the additive × additive model ( E A ), complementary model ( E C ), or interaction model ( E I ). The relative variance components in the simulated scenarios are listed in Table 1. The location and additive effects of QTL in each scenario were not resampled for the dominance and epistasis scenarios, so they were the same as in the base scenario.

Table 1 Summary of the scenarios used in the simulations
The rightmost columns contain the average proportions of additive, dominance, and epistatic variance in the replicate genotypes. In all scenarios, the broad sense heritability is H 2 = 50%

#QTL Explanation Var(A) Var(D) Var(E)
Base  Table 2 Mean accuracy and standard error of the mean of each method, calculated over ten replicates each, times 100 Each row corresponds to a scenario, as summarized in  Table 3 Mean accuracy increase of NetSparse relative to each other method and its standard error on the mean calculated over ten replicates each, times 100 Significant entries, determined with the Benjamini-Hochberg procedure for α = 0.05 for the one-sided paired t-test corresponding to the hypotheses E ρ NetSparse − ρ Method = 0 , are marked in italic Scenario GBLUP BayesB GBLUP-AD BayesB-AD GBLUP-ADAA Base 1.2 ± 0.6 1.0 ± 0.5 2.2 ± 0.7 3.3 ± 0.6 2.5 ± 0.6 S 10 27.7 ± 1.6 6.7 ± 0.8 28.1 ± 1.5 9.3 ± 1.1 28.5 ± 1.4

Comparison of methods
To evaluate the performance of the different methods, each one was trained on the 500 animals in the training population, and the accuracy was obtained by taking the Pearson correlation coefficient between predictions and the total genotypic values of the 2000 animals in the validation population.
Direct comparison of the average accuracies per scenario (Table 2) required many replicates, because the accuracies fluctuated considerably between replicates.

Table 4 p-values of the one-sided paired t-test for the hypotheses
Each row corresponds to a scenario, as summarized in Table 1 The line x = y is added in red for reference. A marker that is above the line means a replicate with higher accuracy for NetSparse than the method it is compared to, and a marker that is below the line means a replicate with lower accuracy for NetSparse than the other method Therefore, instead of comparing the average accuracies of the methods, we used the mean and standard error of the difference in accuracy between methods, ρ NetSparse − ρ Method , which fluctuated much less (Table 3). In addition, we calculated the p-values corresponding to the one-sided paired t-test for the null hypotheses H 0 : E ρ NetSparse − ρ Method = 0 for each reference method. Significance of p-values with respect to the treshold of 0.05 were corrected for multiple testing via the Benjamini-Hochberg procedure (Table 4).

Results
First, we considered the effect of sparsity on the prediction of genotypic values in the additive scenarios for all methods (Fig. 3). In the sparse scenario with 10 QTL ( S 10 ), the accuracy with Netsparse was about 0.28 higher than with GBLUP(-AD,-ADAA), and about 0.08 higher than with BayesB(-AD). In the scenario with 100 QTL ( S 100 ), NetSparse had an increase in accuracy of 0.06 over the GBLUP(-AD,-ADAA) methods, of 0.02 over BayesB and 0.05 over BayesB-AD. In the "Base" scenario with 300 QTL, NetSparse was better than the methods that fit dominance, but not significantly better than the additive methods In the 1000 QTL scenario NetSparse was significantly better than BayesB and BayesB-AD, but not significantly better than the methods based on GBLUP. Now, we consider the simplest possible phenotypic model after the additive one, the dominance model. In the medium dominance scenario ( D medium ), all methods performed roughly the same (Fig. 4). Hence, methods that tried to fit dominance did not result in higher accuracies than methods that did not. In the extreme dominance scenario ( D extreme ), GBLUP-AD, BayesB-AD and GBLUP-ADAA methods had better performance than the other methods, which matched our prior expectation.
The epistatic scenarios (Fig. 5) contain components which can be fitted only by NetSparse and GBLUP-ADAA, thus we expected that in the additive × additive scenario, GBLUP-ADAA would have the best fit and that NetSparse would have the best fit among the other two scenarios. In the additive × additive scenario ( E A ), NetSparse had a significantly higher accuracy than the other methods except BayesB. Surprisingly, GBLUP-ADAA did not fit this scenario better than the other methods. In the complementary ( E C ) scenario, Net-Sparse had 0.6 to 1.5 percentage points higher accuracy on average than the other methods, but these results were not consistent across replicates. The accuracy of NetSparse in the interaction scenario ( E I ) was on average three or more standard errors above the other methods.

Discussion
In this study, we compared methods that differed in flexibility. For example, the GBLUP-AD method is more flexible than the GBLUP method, because it allows for dominance effects to be fitted. In theory, the more flexible method should be able to give the same predictions as simpler methods by setting the additional hyperparameters to zero. In reality, however, these additional hyperparameters also have to be estimated from the data because the genetic architecture of the trait is unknown.
In this study, we used the default prior distributions from the BGLR package and estimated hyperparameters from the training dataset. We chose not to fine-tune the prior distributions on test performance, because in reality this is not possible. As a result, when the actual genetic architecture of a trait is simple (e.g. additive and not sparse), a more flexible method will perform worse than a simpler method. Our results indeed showed that sometimes more flexible methods performed worse than simpler methods. For example, if we consider the scenario with complementary epistatic effects and consider the GBLUP and BayesB methods, BayesB with hyperparameter π set to zero is equivalent to GBLUP, but when fitting the value of π in BayesB to the data, a non-zero value of π is estimated, which in this scenario gives worse test performance than π = 0 . In [5], it also was seen that, in certain cases, GBLUP can have higher accuracy than BayesC, which is a sparse method similar to BayesB. The particular observation that NetSparse has higher accuracy than BayesB for the S 10 and S 100 scenarios was unexpected because BayesB is a sparse additive method, while NetSparse is a sparse non-additive method. Since the underlying data generating process is sparse additive, the expectation is that BayesB matches the simulated data better than NetSparse. The difference in method between NetSparse and BayesB is that NetSparse includes nonadditivity and that NetSparse and BayesB use different priors for the variances. Therefore, we also made a comparison with LinSparse (Fig. 6), which is NetSparse without non-additive effects. The accuracy obtained with LinSparse for these scenarios was higher than for BayesB, which strongly suggests that the difference in accuracy between them originated from the different prior distributions for the variances. In BayesB, the prior distributions for the variances are scaled inverse chi-squared distributions, which are conjugate priors for the likelihood function, which makes Gibbs sampling possible. The NUTS sampler in PyMC3 does not require conjugate priors and, following the suggestions of [35], we chose half-normal distributions for the standard deviations. The main difference between the scaled inverse chi-squared and half-normal distributions is that the half-normal distribution decays faster than exponentially for large values, which gives much lighter tails than the scaled inverse chi-squared distribution, which decays only polynomially.
The epistatic method GBLUP-ADAA did not seem to give better fits than methods that did not fit epistasis. We think this is due to a lack of data for estimating epistatic effects accurately. Inaccurate estimates of these effects will not improve predictive ability.
Given that neural networks are able to fit more than additive relations, we expected that NetSparse would also be able to fit dominance and epistatic effects. This expectation was confirmed in scenarios with strong dominance effects and scenarios with epistatic effects because accuracy of NetSparse was higher than the accuracy of additive models. However, NetSparse may be at a disadvantage with traits that have negligible non-additive effects. Nevertheless this indicates a potential for Sparse Bayesian Neural Networks for improving phenotype prediction.
The main limitation of NetSparse is running time. On our hardware, training NetSparse took around 4 h per scenario with 500 animals and 5000 SNPs. The running time of NetSparse scales approximately linearly with both the number of animals and the number of SNPs, and can therefore become prohibitive when applied to larger datasets. The other methods have running times that were less than 2 min on these datasets, making them much more feasible for use on larger datasets. Considering the promising results of NetSparse, further studies could try to increase the computational performance so that larger datasets can be analyzed. As sampling of independent MCMC chains can be done in parallel on different machines, additional computational resources can speed up the wall time of sampling by a factor equal to the number of chains used. The MCMC sampling could The line x = y is added in red for reference. A marker above the line means a replicate with higher accuracy for NetSparse than the method it is compared to, a marker below the line means a lower accuracy of NetSparse than the other model also be replaced by variational inference, where the real posterior is approximated by a simpler variational posterior from which samples can be drawn directly, which would help NetSparse scale to larger datasets. The discrete variable s could be handled, for instance, using the Concrete Distribution. [36] Conclusions This study shows that in nearly all scenarios the accuracy of NetSparse is not significantly lower than that of all other methods investigated. In particular, the Net-Sparse method performed as well or better than GBLUP and BayesB for all scenarios evaluated. On data generated from a sparse QTL simulation model, accuracies obtained with NetSparse were significantly higher than accuracies obtained with all the other methods investigated. In the medium dominance scenarios, accuracy obtained with NetSparse was 0.0 to 0.8 percentage points higher than that with the other methods investigated. In the extreme dominance scenario, accuracy obtained with Netsparse was 0.6 percentage points higher than that with other methods that did not explicitly model dominance. For methods that did explicitly model dominance, the accuracy was 5.8 to 6.3 percentage points lower for NetSparse. In the epistatic scenarios, accuracy obtained with NetSparse was 0.6 to 3.9 percentage points higher than that with the other methods. However, running time can be limiting, as NetSparse inference took about 200 times as long as the other methods.