A computationally efficient algorithm for genomic prediction using a Bayesian model

Background Genomic prediction of breeding values from dense single nucleotide polymorphisms (SNP) genotypes is used for livestock and crop breeding, and can also be used to predict disease risk in humans. For some traits, the most accurate genomic predictions are achieved with non-linear estimates of SNP effects from Bayesian methods that treat SNP effects as random effects from a heavy tailed prior distribution. These Bayesian methods are usually implemented via Markov chain Monte Carlo (MCMC) schemes to sample from the posterior distribution of SNP effects, which is computationally expensive. Our aim was to develop an efficient expectation–maximisation algorithm (emBayesR) that gives similar estimates of SNP effects and accuracies of genomic prediction than the MCMC implementation of BayesR (a Bayesian method for genomic prediction), but with greatly reduced computation time. Methods emBayesR is an approximate EM algorithm that retains the BayesR model assumption with SNP effects sampled from a mixture of normal distributions with increasing variance. emBayesR differs from other proposed non-MCMC implementations of Bayesian methods for genomic prediction in that it estimates the effect of each SNP while allowing for the error associated with estimation of all other SNP effects. emBayesR was compared to BayesR using simulated data, and real dairy cattle data with 632 003 SNPs genotyped, to determine if the MCMC and the expectation-maximisation approaches give similar accuracies of genomic prediction. Results We were able to demonstrate that allowing for the error associated with estimation of other SNP effects when estimating the effect of each SNP in emBayesR improved the accuracy of genomic prediction over emBayesR without including this error correction, with both simulated and real data. When averaged over nine dairy traits, the accuracy of genomic prediction with emBayesR was only 0.5% lower than that from BayesR. However, emBayesR reduced computing time up to 8-fold compared to BayesR. Conclusions The emBayesR algorithm described here achieved similar accuracies of genomic prediction to BayesR for a range of simulated and real 630 K dairy SNP data. emBayesR needs less computing time than BayesR, which will allow it to be applied to larger datasets. Electronic supplementary material The online version of this article (doi:10.1186/s12711-014-0082-4) contains supplementary material, which is available to authorized users.


Background
Genomic prediction uses information from high-density genetic polymorphisms, such as single nucleotide polymorphisms (SNP) panels, to predict the genetic merit of individuals for quantitative traits. Selection based on these estimated breeding values could substantially increase the rates of genetic improvement for quantitative traits in animal and plant species [1]. Implementation of genomic selection is a two-step process: (1) estimation of the effects of SNPs in a reference population given the phenotypes and SNP genotypes of reference individuals and (2) calculation of genomic estimated breeding values (GEBV) for selection candidates based on their genotypes [1]. If the SNP effects are random variables drawn from a prior distribution, the accuracy of GEBV is maximised if, in step (1), SNP effects are estimated by their expected value conditional on the data.
Several methods, which differ in the assumed prior distribution of SNP effects, have been proposed to estimate SNP effects for genomic prediction. The prior assumption that SNP effects are all drawn from the same normal distribution results in the statistical method called best linear unbiased prediction (BLUP). BLUP for genomic prediction can be implemented using two equivalent models [2]. Either the SNP effects are estimated directly, termed SNP_BLUP (e.g. [1]), or a genomic relationship matrix is calculated from SNP genotypes, termed genomic BLUP (GBLUP) [2,3]. Other models assume that the SNP effects follow a non-normal distribution. For example, in the model called BayesA, the SNP effects follow a Student's t distribution [1], while mixture distributions are used in BayesB [1], BayesC, BayesCπ [4] and BayesR [5], and exponential distributions are used in BayesLASSO [6]. With real data and for some traits, GBLUP methods achieve levels of accuracy of genomic prediction similar to non-normal distributions methods such as BayesA, BayesB, and BayesR when moderate SNP densities (e.g. 50 K in dairy cattle; less in some crop species with extensive linkage disequilibrium) were used [7][8][9][10][11]. As described by several authors, GBLUP has the advantage that it is computationally efficient [12][13][14]. However, for traits with quantitative traits loci (QTL) of large to moderate effect, the Bayesian methods can give higher accuracies of prediction than GBLUP [15][16][17]. Moreover, genomic prediction models that assume non-normal distributions of effects in some cases give higher accuracies than GBLUP when very large numbers of SNPs (e.g. 630 K or whole-genome sequence data) are used, particularly for multi-breed and across-breed predictions [5,[18][19][20][21][22]. A disadvantage of these methods, however, is that it is difficult, if not impossible, to write closed form solutions for estimates of SNP effects or other parameters, so Markov chain Monte Carlo (MCMC) sampling is used to derive posterior distributions for these effects (e.g. [1]). However, this is computationally expensive, particularly when the number of SNPs is large. For example, the BayesB method can result in the highest accuracy of genomic prediction in some situations, but, since it uses a Metropolis Hastings algorithm, computing time with large numbers of SNPs (e.g. 800 000 SNPs) is very long. Other methods, such as BayesA, BayesLASSO, and BayesR, are usually implemented using Gibbs sampling. While Gibbs sampling is faster than the Metropolis Hasting algorithm, it is still slow with very large numbers of SNPs genotyped in large numbers of individuals.
In dairy cattle routine genomic evaluations, different genomic prediction methods have been implemented by different countries and organisations [23]. According to Mantysaari [23], GBLUP, or its single-step implementation [24,25], is one of the most popular genomic prediction methods implemented for official genomic evaluation in many countries, including Canada, New Zealand, Australia, Germany and Ireland. By contrast, only two countries, i.e. The Netherlands and Switzerland have implemented MCMC non-linear models (BayesA and BayesC) for genomic prediction. In addition, non MCMC versions of BayesA (also termed nonlinear A [2]) are used for genomic prediction in the USA. In the future, genomic evaluations may be based on whole-genome sequence data and Bayesian methods may be required to take advantage of this data [26,27]. Therefore, a way to implement Bayesian models that is faster to compute than the MCMC methods is desirable.
There have been a number of proposals to reduce the computing time required to arrive at satisfactory estimates of the SNP effects from Bayesian methods (e.g. [28][29][30]). These proposals use algorithms other than Gibbs sampling. For instance, VanRaden [2] described an iterative method to implement approximations of both BayesA and BayesB. Meuwissen [29] described a method termed fas-tBayesB by using iterative conditional expectation (ICE) in the BayesLASSO model. FastBayesB iteratively calculated each SNP's posterior mean, conditioning on current estimates of all other SNPs as if they were true effects. FastBayesB greatly reduces computing time but several parameters required to describe the prior distribution of SNP effects are assumed to be known. This issue was dealt with in a later publication by an expectationmaximisation (EM) algorithm that estimated those parameters by maximising a joint posterior probability based on the prior distribution of SNP effects, in a method called EmBayesB [31]. Lower prediction accuracies were observed for these methods compared with MCMC implementations [29,31]. Two potential reasons for this are: (1) the errors in the estimates of SNP effects other than the SNP for which the effect is being estimated were ignored [29], and (2) the prior distribution of SNP effects that they assume (a double exponential) may not match the true distribution of SNP effects as well as the mixture distribution assumed by BayesB and BayesR.
Our aim in this paper was to develop a fast EM counterpart to MCMC BayesR (emBayesR). BayesR assumes that SNP effects are drawn from a mixture of normal distributions, one with zero variance (and hence zero effects). BayesR shares some of the advantages of BayesB, in that SNP effects can be zero, moderate, or large, but is more computationally efficient since it can be implemented with Gibbs sampling [5]. In BayesR, the proportion of SNPs in each normal distribution is estimated from the data, instead of being pre-set as a constant value in BayesB. Consequently, BayesR is able to approximate a wide range of possible true distributions of SNP effects. With real data, BayesR achieves accuracies comparable to BayesA [5] and BayesB (Goddard and Meuwissen, unpublished data).
Our EM algorithm retains the BayesR model assumption that SNP effects are assumed to be derived from four different normal distributions, but requires much less computing time than BayesR. It also differs from other EM methods by estimating the effect of each SNP while accounting for the errors in the estimates of all other SNPs. It does this by treating the combined effect of the other SNPs as a residual breeding value, and approximating its prediction error variance from a GBLUP prediction. To compare speed and accuracy of prediction of emBayesR with that from BayesR, we used both a simulated dataset and a real dataset on 630 K SNPs for dairy cattle.

Methods
In this section, we first describe the model of BayesR (here also named MCMC_BayesR) for genomic prediction and second, an EM algorithm named emBayesR. Finally, the 10 K simulated data and 630 K real dairy data that were used to evaluate the performance of emBayesR, are described.

Statistical model for emBayesR and prior distributions of parameters
The linear model for phenotypes is: where, y is a n × 1 vector of phenotypic records (n is the number of animals); 1 n is a n x 1 vector of 1 s, μ is the population mean; Z is a n × m design matrix with ele- , in which x i is the n × 1 vector of genotypes for the i th SNP (0, 1 or 2 copies of the second allele), and p i is the allele frequency of each SNP i (m is the number of SNPs); e is a n × 1vector of random normal deviates, e e N 0; Iσ 2 e À Á ; g is a m × 1 vector of SNP effects.
For convenience, polygenic effects were not included in the model but they can be readily added (and have been added in the MCMC version of BayesR, e.g. [5]).
BayesR [5] assumes that SNP effects (g) are drawn from a mixture of four normal distributions N 0; σ 2 k À Á according to the proportion vector Pr = {Pr k |k = 1, 2, 3, 4}. Variances used were σ 2 for the analysis of the real dairy data and σ 2 k ¼ 0; 0:0006 Ã σ 2 g ; 0:006 Ã σ 2 g ; 0:06 Ã σ 2 g n o for the analysis of the simulated data, where σ 2 g is total genetic variance [5]. Here, the coefficients of σ 2 g used to define σ 2 k for the simulated data were different to those used for the real data because of the criterion that the sum of the variance across all SNPs approaches the overall genetic variance explained by SNPs. In the simulation data, with 10 050 SNPs, there were only 50 QTL (17 QTL in σ 2 k 2 ½ , 16 QTL in σ 2 k 3 ½ and 17 QTL in σ 2 k 4 ½ ). To make the overall variance summed over all the SNPs approximately equal to σ 2 g , vector σ 2 k for the simulated data was set to 0; 0:0006 Ã σ 2 g ; 0:006 Ã σ 2 g ; 0:06 Ã σ 2 g n o . For the real data (with high-density SNP panels), the value of σ 2 k that is 0; 0:0001 Ã σ 2 g ; 0:001 Ã σ 2 g ; 0:01 Ã σ 2 g n o was assumed as in [5]. In addition, the proportion of SNPs in each normal distribution Pr k ; Σ 4 k ¼ 1 Pr k ¼ 1 À Á was assumed to follow a Dirichlet distribution with parameter α = (1, 1, 1, 1) T , which is a 4 × 1 vector of the pseudocounts of the number of SNPs in each distribution. Therefore, the BayesR model has two fixed parameters as input: σ 2 k and α (the prior for Pr).
For each SNP i, there is a latent binary variable b ik (b ik = 0 or 1) that indicates whether or not the effect of SNP i follows the normal distribution with variance σ 2 k k ¼ 1; 2 ; 3 ; 4 ð Þ . Therefore: Then, the prior distribution of each SNP effect (g i ) conditional on variable b ik is: where δ(g i ) denotes the Dirac delta function with all probability mass at g i = 0.
Then, the joint distribution p(g i , b i ) conditional on Pr is:

Expectation-maximisation steps for emBayesR
An EM algorithm is applied to BayesR to obtain estimates of parameters, including SNP effects (ĝ) and the proportion of SNP effects in each distribution c Pr . The aim of emBayesR is to predict Zg by Zĝ as accurately as possible. The best predictor for g i would be g i = E(g i |y), but we approximated this by estimating ĝ i by the value of g i that maximises the posterior probability Pðg i jy; c Pr;μ;σ 2 e Þ , where c Pr,μ andσ 2 e are the MAP (Maximum A Posterior) estimator of Pr, μ, and σ 2 e , conditional on y. In the following, we first deal with estimating ĝ i and then return to c Pr.
For estimation of g i , we maximised the marginal posterior of g i rather than the joint posterior of all g. To do this, we first introduce two vectors of missing data (u, b i ), and use the EM algorithm to integrate them out of the posterior distributions. Here, u is the combined effects of all other SNPs except the current SNP, i.e. u = Zg − Z i g i , and the other vector b i = {b ik |k = 1, 2, 3, 4} is for indicator variables that determine which normal distribution each SNP effect is derived from, as described above. Then Equation (1) can be re-written as: The full posterior distribution with the missing data, p g i ; u; μ; b i y;PrÞ À is (following Bayes' theorem): Where is the likelihood of the data given y * and u, and y Ã ¼ y−1 nμ . Then, the log of the posterior is: This can be re-written as: logf yjg i ; u;μ; c σ 2 e ; c Pr In the E-step of emBayesR, we will take expectation of the log posterior function of Equation (6) over the missing data (u, b). Only the second term (6b) in the equation logp g i ; u; b i jy;μ; c σ 2 e ; c Pr involves b i . Therefore: , which is the posterior probability for each SNP to belong to each of the four normal distributions. The derivation of P ik is explained in Additional file 1. Next, we take the expectation over missing data u.
To calculate the expectation of Equation (6a) over u, we only need to take the expectation of Q over u. Applying Searle's expectation rule [32] to E û (Q), we obtain: Where û = ∑ j ≠ i Z j ĝ j and PEV is the predicted error variance.
Substituting P ik = E(b ik |y) and using the above E û (Q), the expectation of Equation (6) over û, b is: The calculation of PEV(û) is approximated from a GBLUP model, and is explained in Additional file 2.
The M-step of emBayesR involved estimation of the SNP effect0073 (g i ). Differentiating Equation (7) with regard to g i gives: Setting this equal to 0 results in the following posterior mode estimate for each SNP effect (g i ).
where, Z i is the i th column of matrix Z, and y † ¼ y− u−1 nμ .
The mean of the posterior distribution can also be calculated as follows: which reduces to: The mode estimation of SNP effects (Equation 8a) was implemented in our EM iterations, unless otherwise stated. The posterior mean of Equation (8b) was used in some cases to evaluate the accuracy of genomic prediction using either the mode or mean estimates of SNP effects. Furthermore, to investigate the degree of shrinkage, the least square estimate of the SNP effect was also calculated for some examples: Similar EM steps used for estimating ĝ i (but with different full models) are applied to estimate other parameters, including the proportion of SNP effects in each distribution (Pr), the error variance σ 2 e À Á , and the mean (μ).
To obtain c Pr , we return to the full model Equation (1) with all SNP effects (g) included. We introduce the missing variables b, so the full likelihood is: Note that p(y|b) does not involve Pr, so when we differentiate with respect to Pr, this term drops out and can, therefore, be ignored, resulting in: logp Pr ð Þ ¼ P 4 k¼1 logPr k ; and E bjy logp bjPr ð Þ¼ P n i¼1 P 4 k¼1 P ik logPr k ; where Then, considering that P 4 k¼1 Pr k ¼ 1 , we use Lagrange multiplier λ and differentiate with respect to Pr k . Given that Pr follows a Dirichlet distribution: Therefore, the solution is: Finally, to estimate the error variance σ 2 e and μ, we simplify Equation (5) Z iĝ i and then the full likelihood based on this model is: The expectation for the full log likelihood based on this model is: Therefore, differentiating Equation (10) with regard to σ 2 e and μ, we get: for which computation of the term tr PEV c u Ã is explained in Additional file 2.
In order to demonstrate the importance of the PEV correction for SNP effect estimates, the accuracy of emBayesR with and without accounting for PEV will be compared in the Results section. emBayesR without PEV has a similar EM step as emBayesR with PEV to derive the parameters P ik ;ĝ i ; Pr k ; σ 2 e and μ but differs in the equations of emBayesR with PEV to calculate P ik (Equation A3 in Additional file 1) and σ 2 e (Equation 11) in that the term tr(PEV(û)) is not included in emBayesR without PEV.

The emBayesR algorithm
The emBayesR algorithm can be described as follows: Step 1 Initialise starting values for g; Pr; σ 2 e ; σ 2 g ; α and σ 2 k . There are two groups of parameters: fixed parameters and changing parameters. α ¼ 1; 1; 1; 1 ð Þ ; σ 2 g and σ 2 k are fixed parameters, where α is the prior parameter for Pr, and σ 2 g is used to set the value of σ 2 k . The other variables g; Pr; σ 2 e À Á are updated during EM iterations. We used g = 0.01 and Pr = {0.5, 0.487, 0.01, 0.003}, as in [5]. To initialise σ 2 e and σ 2 g , we used GBLUP implemented through ASREML3.0 [33] to estimate the error variance σ 2 e and the genetic variance σ 2 g as inputs for the next steps. Then, as mentioned before, the value of σ 2 g defines σ 2 k , using for the real data and σ 2 for the simulated data.
Step 2 Calculate PEV with Equation (A7) of Additional file 2 (or it can be taken from ASREML in the step above). Then for each SNP i (i in 1:m): Step 3 Correct y for the effects of all other SNPs except the current SNP i, using: Step 4 Estimate the probability that the effect of SNP i is from one of four normal distributions logl ik with Equation (A5) of Additional file 1.
Step 5 Calculate P ik with Equation (A6) of Additional file 1.
Step 6 Estimate the effect of SNP i with Equation (8a).
Step 7 After all SNP effects have been estimated, calculate Pr k with Equation (9), update σ 2 e with Equation (11), and update μ with Equation (12).

Return to
Step 3 and iterate until convergence. Here, the convergence criterion evaluated at each iteration q was (ĝ q − ĝ q − 1 ) ' (ĝ q − ĝ q − 1 )/((ĝ q' ĝ q ) < γ. The criterion γ = 10 − 10 was selected after trialling the algorithm in a number of datasets and investigating changes in SNP effect estimates across iterations.
We calculated the time complexity of the algorithm (the function with parameters number of SNPs and number of animals that determines the time taken for the algorithm to run) based on the above eight steps. Time complexity is estimated in computer science applications by counting the number of innermost loops for elementary operations, which is notated O.

Simulated data
Simulated data were used to determine how close the genomic prediction accuracy of emBayesR was to that of BayesR. The simulated dataset described in [21] was used. Briefly, FREGENE was used to simulate wholegenome sequence data in a population with an effective size (Ne) of 25 900 and a genome size of 50 Mb split equally over 10 chromosomes. The genome size of 50 Mb was chosen for computing efficiency. The accuracy of prediction in a c times larger genome (i.e. 50c Mb) would be approximately the same as found in our 50 Mb genome, provided the number of animals was c times larger than used here (i.e. 5000c) [27]. The mutation rate per bp was 9.38 × 10 −9 and the recombination rate was 1 × 10 −8 per base pair per generation [21], based on estimates for these rates in mammals. To ensure a drift-recombination-mutation equilibrium, the population was run for 370 000 generations. A total of 10 050 markers (including 50 QTL) were randomly selected as SNPs for genomic prediction. The SNP density was equivalent to~600 000 SNPs on a 3000 Mb genome, similar to many mammals. Fifty QTL were randomly picked from the segregating loci, which is equivalent to 3000 QTL on a human or bovine genome. To evaluate the genomic prediction performance of emBayesR, BayesR and other algorithms, we generated two genetic architectures that differed in the distribution of true QTL effects. For this first dataset, named HD_Mix, the 50 QTL allele substitution effects were sampled from an equal mixture of three normal distributions with variances 0; 0:0006σ 2 g ; 0:006σ 2 g ; 0:06σ 2 g . For the second genetic architecture (referred to as HD_One), QTL allele substitution effects were sampled from a single normal distribution. For the breeding values on simulation data, true breeding values (TBV) for individuals were obtained by summing genetic values across QTL. For each genetic architecture, heritabilities (h 2 ) of either 0.45 or 0.1 were used. For each set, phenotypes of 5000 individuals were generated by adding a random residual value to the TBV of each individual. This residual value was sampled from a normal distribution, TBV is the variance of TBV in the population. Thus, we generated four datasets named HD_Mix_45 (five replicates following the mixture data model with heritability 0.45), HD_Mix_10 (five replicates following the mixture data model with heritability 0.10), HD_One_45 (five replicates following the one normal data distribution with heritability 0.45) and HD_One_10 (five replicates following the one normal distribution with heritability 0.10). Each replicate entailed sampling new SNP effects and generating new phenotypes.
To compare prediction accuracies and computing efficiencies of emBayesR, BayesR, GBLUP and fastBayesB, 5000 individuals were randomly separated into reference sets and validation sets. With an h 2 of 0.45, there were 2500 individuals in the reference set and 2500 in the validation set. With an h 2 of 0.1, there were 3750 individuals in the reference set and 1250 in the validation set. Accuracies were the correlations between GEBV and TBV.

Real data
A total of 3354 Holstein-Friesian bulls were genotyped for both the Illumina Bovine HD SNP array (632 003 SNPs following quality controls as described in [5]), and the Bovine SNP 50 array (43 025 SNPs). Bulls genotyped at the lower density were imputed to the higher density using Beagle 3.0 [34], and applying quality controls as described in [5]. Phenotypes were daughter trait deviations (DTD) from two groups of traits: functional traits, including angularity, mammary conformation, stature, fertility (calving interval) and somatic cell count (SCC), and production traits, including milk yield, protein yield, protein % and fat %. For some of these traits, known QTL with moderate to large effects segregate in this population, for example a mutation in the DGAT1 gene affects fat % [35]. Bulls were split into reference and validation sets by age, with the youngest bulls in the validation set. The numbers of bulls in the reference and validation sets for each trait are listed in Table 1. As a surrogate for prediction accuracy, the correlation of GEBV and DTD in the validation set was used. To investigate the computing time required for emBayesR relative to BayesR with  different numbers of SNPs, we also ran genomic predictions in the same data but with the 50 K SNP chip genotypes (38 968 SNPs) extracted from the 630 K data on 3354 animals, for milk yield.

Results
The results are presented in three sections. First, we investigated the convergence of parameters estimated by emBayesR and how close parameter estimates from emBayesR were to the true parameter values, and those estimated by BayesR, in terms of SNP effects and Pr, in the simulated data. We also evaluated the effect of the PEV correction on estimates of these parameters, and the accuracy of genomic prediction. Moreover, the accuracy of genomic prediction from the joint posterior mode estimation from emBayesR was compared to the accuracy when the posterior mean estimate of SNP effects was used. The mode estimation for SNP effects (Equation 8a) of emBayesR was used for the evaluation of performance of emBayesR. Thus, we also compared the accuracy of prediction with mode (8a) and mean (8b) Equations for estimates of SNP effects (Equation 8b). In the second section of results, we compared the accuracy of genomic prediction from emBayesR to that of BayesR, as well as computing speed in simulated and real datasets. Finally, the sensitivity of prediction accuracy from emBayesR to the underlying genetic architecture (multinormal distribution, normal distribution of QTL effects, real 630 K data) was investigated.

Convergence of parameter estimates with emBayesR
The algorithm is considered to have "converged" when estimated SNP effects from the previous iteration are very close to estimated SNP effects in the current iteration. The convergence criterion of emBayesR was (ĝ q − ĝ q − 1 ) ' (ĝ q − ĝ q − 1 )/((ĝ q' ĝ q ) < 10 − 10 , where q is the current iteration number. Since the convergence criterion assessed only changes in SNP effect estimates, it does not guarantee that the estimates of the other parameters, i.e. Pr (the proportion of SNPs in each distribution) and the error variance, have converged. In the simulated dataset HD_Mix_45, convergence was reached after 2500 iterations, and at that point, there was also very little change in the error variance and Pr from the previous iteration ( Figure 1).

Comparison of parameter estimates
Estimates of SNP effects and Pr from emBayesR can be compared to the corresponding estimates from BayesR. For the HD_Mix simulated data, estimates of large SNP effects were very similar for BayesR and emBayesR ( Figure 2). The plot of BayesR and emBayesR estimated effects against true effects is in Figure 3. However, for smaller effects, emBayesR shrunk effects to a greater degree than BayesR, in some replicates. The degree of shrinkage from the BayesR algorithms relative to other algorithms can be demonstrated by plotting estimates of SNP effects (HD_Mix data set) from BayesR, FastBayesB, emBayesR and SNP-BLUP against their least square estimates (Figure 4). Both BayesR and emBayesR regressed moderate size SNP effects towards 0 more than SNP-BLUP and FastBayesB. However, BayesR and emBayesR did not shrink large SNP effects nearly as much as SNP-BLUP.
Estimates of Pr from emBayesR and BayesR are compared with the true proportion of SNP effects in each of the four normal distributions in Table 2. The genetic architecture of the HD_Mix data was such that 50 QTL were distributed evenly in three normal distributions with nonzero variances. The true proportion of the SNP effects (around 10 000 markers) in the four normal distributions with different variances 0; 0:0006σ 2 g ; 0:006σ 2 g ; 0:06σ 2 g was (0.995, 0.0017, 0.0016, 0.0017). As shown in Table 2, when h 2 = 0.45, both BayesR and emBayesR estimated the proportions of SNP effects from the four distributions to be roughly 0.99, 0.01, 0.001, and 0.001. However, when h 2 = 0.1, BayesR over-estimated the proportion of SNP effects in the smallest non-zero distribution σ 2 2 ¼ 0:0006σ 2 g and this tendency was even greater with emBayesR. This agrees with results in Figure 2, where emBayesR shrunk small effects to very small effects more than BayesR and this may have contributed to the over-estimation of the proportion of   SNP effects from the distribution with the smallest non-zero variance 0:0006σ 2 g . In the 630 K dairy cattle data, the posterior mean estimates of Pr from emBayesR were similar to those from BayesR, as shown in Table 3.

Sensitivity to the prior for the Dirichlet distribution
Another feature of estimates of Pr, may be sensitivity to its prior parameter α (the pseudo-count of SNPs in each distribution in the Dirichlet distribution). To evaluate the sensitivity of emBayesR to α, we used different values for α and investigated the effect on Pr with the dataset HD_Mix_45 (

Effect of PEV
We also compared estimates of parameters and accuracies of genomic prediction with and without accounting for PEV or estimates of all other SNPs in the emBayesR algorithm. When the PEV was accounted for in the emBayesR algorithm, there was a 6% improvement in the accuracy of genomic prediction in the simulated data when h 2 = 0.45, and 5% when h 2 = 0.1 (Table 5), compared to when PEV was not accounted for. Estimates of SNP effects from emBayesR with and without PEV were plotted against estimates of SNP effects from BayesR ( Figure 5A). Estimates of SNP effects from emBayesR without accounting for PEV were considerably shrunken, particularly for small effects, compared with estimates of SNP effect from BayesR. Estimates of SNP effects with emBayesR when PEV were accounted for were much closer to those from BayesR, although there was still some over-shrinkage, particularly of small effects. Figure 5B, in which estimates of SNP effects obtained with BayesR, emBayesR, emBayesR_without_PEV are plotted, illustrates this result.
We also compared the accuracy of prediction based on the joint posterior mean (Equation 8b) versus the mode (Equation 8a) in the simulated data ( Table 6). As shown in Table 6, using either the mean (emBayesR_Mean) or the mode (emBayesR_Mode) for estimates of SNP effect gave similar prediction accuracies.

Accuracy of genomic prediction with emBayesR and BayesR
In the simulation data, the accuracy of genomic prediction with emBayesR was the same as with BayesR when heritability was 0.10, but 1% lower when heritability was 0.45 (Table 7). However, both methods resulted in GEBV that were close to unbiased, based on the regression of TBV on GEBV being close to 1, although for HD_Mix_10, the regression was 0.89 with both BayesR and emBayesR.
emBayesR across the nine dairy cattle traits was 0.4% lower than with BayesR. The accuracy with emBayesR was on average 5% better than with FastBayesB. The average accuracy of BayesR across the nine traits was 3% higher than with GBLUP, which was due to very similar accuracies for four of the nine traits, and only protein % and fat % showing clear improvements in accuracy compared to GBLUP. For these traits, several QTL with moderate to large effects are known to exist [35,36].

Computing performance of emBayesR compared with BayesR
We compared the speed of emBayesR with BayesR and fastBayesB using three criteria: the time complexity of each iteration (the function in terms of number of SNPs and individuals that determines the time taken to do one iteration), the number of iterations to convergence (or in the case of BayesR until changes in SNP estimates were sufficiently small so that the accuracy of genomic prediction did not change), and total computing time required with the 630 K real data. First, as mentioned in the method section, the time complexity for emBayesR is O(nm), which is the same as with the MCMC method for BayesR and with ICE iterations for fastBayesB, and with the nonlinear A method of VanRaden [2] and SNP_BLUP [1].
Second, for BayesR, the accuracy of prediction exceeded 0.61 at 20 000 iterations, and did not improve with a larger number of iterations, as shown in Figure 6. For five traits (milk, protein, fertility, fat % and protein %) and using the 630 K real data, the numbers of iterations required for convergence for emBayesR and fastBayesB are given in Table 9. FastBayesB required slightly more iterations to reach convergence than emBayesR for most traits.
Finally, the overall computing times for emBayesR, BayesR and fastBayesB with the same implementation (each trait on one processor) were compared (Figure 7). The algorithms were implemented on a range of datasets with different sizes, including 10 K simulated data (HD_Mix model, 2500 animals with around 10 000 SNPs), 50 K data (3049 animals with 38 968 SNPs), and 630 K data (3049 animals with 632 003 SNPs). As shown in Figure 7, the speed advantage of emBayesR compared to BayesR was greater as the number of SNPs in the dataset increases. For example, with the 630 K data, BayesR needed approximately 4 days of real computing time, while emBayesR required just 4 hours (including the time to calculate PEV in GBLUP) to achieve the final solutions.  (3) an unknown model of QTL effects in the 630 K real data. emBayesR and BayesR gave higher accuracies than GBLUP for the HD_Mix model data (M45_2), while for the HD_One data, the advantage of emBayesR and BayesR was smaller than that of GBLUP (Table 10), as might be expected given that the HD_Mix data has a proportion of QTL with larger effects. In estimating Pr, emBayesR generally had somewhat poorer agreement with the underlying data model than BayesR (Table 10), especially for the HD_One_45 data.
However, on 630 K real data, emBayesR gave very similar estimates of Pr and accuracy of genomic prediction than BayesR and GBLUP (accuracy only for the later comparison) (Tables 3 and 8). One conclusion from the relative performance of emBayesR to BayesR in the 10 K simulated data and in the 630 K real data, is that emBayesR cannot distinguish SNP effects with zero variance from those with a very small variance when there is little information in small datasets, as in the HD_One simulated data. However, among the 630 K SNPs there are likely more SNPs in the non-zero distributions, which should increase the precision of estimates of Pr.

Discussion
Genomic prediction with non-linear Bayesian methods, including BayesR, can be more accurate than GBLUP in some situations, such as when QTL with moderate to large effects segregate [2,3], but at the cost of longer computing time. To retain the accuracy of BayesR while reducing computing time, we propose here an EM algorithm, termed emBayesR, for genomic prediction, as an alternative to the MCMC implementation of BayesR. In both 10 K SNP simulated data and 630 K real dairy cattle data, emBayesR gave accuracies of genomic prediction similar to BayesR, with greatly reduced computing time. As in BayesR, emBayesR estimates SNP effects, error variances and posterior probabilities of each SNP belonging to the k th distribution (here, there were four distributions, one with zero variance).
Results from BayesR and emBayesR differed in three ways, albeit to a small degree. Estimates of Pr with emBayesR tended to have more SNP effect estimates in the smallest non-zero distribution than BayesR; emBayesR shrunk small SNP effects towards 0 somewhat more than BayesR; and the accuracy of emBayesR predictions was approximately 0.5% lower than the accuracy of BayesR. Our EM algorithm differed from the MCMC BayesR in several respects, which may explain these results. The EM algorithm estimates the SNP effect (g i ) by the mode of the posterior distribution when the mixing proportions (Pr) and the error variance σ 2 e À Á are held at their MAP estimates, whereas the MCMC version estimates g i by the mean of the posterior distribution while Pr and σ 2 e vary over their posterior distributions. Also, when we used the mean instead of the mode of the posterior distribution of g i as an estimate of g i , we found that it makes no discernible difference in prediction accuracy, as shown in Table 6. However, varying Pr and σ 2 e across their posterior distributions in BayesR, but not emBayesR, may explain differences in results. In addition, emBayesR uses an approximation of the prediction error variance of all other SNPs when estimating g i .
Bayesian estimates are sensitive to the prior if the data does not contain enough information to overwhelm the prior. Estimates of Pr with both BayesR and emBayesR were affected by the prior α but not to a large degree, considering that the simulated data contained only 50  causal mutations and the prior had little effect on the accuracy of genomic predictions. Results from using emBayesR with the simulated data indicate the algorithm was unable to consistently distinguish a SNP with no effect from a SNP with a very small effect. We would expect that, in data in which more causal mutations are segregating and with many more animals, estimates of Pr would be less sensitive to the prior. Other EM algorithms for genomic prediction have been described using thick-tailed t-distributions or exponential distributions as priors for the SNP effects. These include EM-BSR [37] and FastBayesA [38], which aim at enhancing the computing efficiency of BayesA. emBayesR differs from most previous non-MCMC implementations of Bayesian methods for genomic prediction in two respects, i.e. it uses the BayesR model with a mixture of four normal distributions for SNP effects and it accounts for errors in all other estimated SNP effects when estimating the effect of the current SNP by including the PEV term in the model. When we implemented the EM algorithm without the PEV term, the accuracy of prediction declined by 8%. The accuracy of fastBayesB was, on average, 9% lower than that of emBayesR, suggesting that much of the loss in accuracy of fastBayesB is due to ignoring the errors in all other SNP effects when estimating a particular SNP effect. Consistent with this interpretation, both fastBayesB and our EM algorithm without accounting for the PEV shrink estimates of SNP effects more severely than emBayesR or BayesR. Most of the current fast algorithms, such as fastBayesB [29], emBayesB [31], em_BSR [37], and MixP [39], ignore the error produced by the estimation of other SNP effects. That is, they use an unrealistic assumption that the current solutions of all other SNPs effects are known without error when estimating the current SNP effect, which is one of the reasons why accuracies of prediction from these algorithms are typically lower than that of their counterpart MCMC methods. MCMC methods account for the error in the  Our results demonstrated the computing speed of emBayesR over the MCMC implementation of BayesR. The time complexity for emBayesR at each iteration is proportional to the number of markers and the number of records, as it is in the MCMC methods. However, much fewer iterations were required for the emBayesR SNP effects to converge than for BayesR to sample sufficiently from the posterior distributions of SNP effects to achieve maximum accuracy of genomic prediction. Specifically, compared with 20 000 iterations of MCMC sampling (Figure 6), emBayesR required only 300 to 1000 iterations with the 630 K real dairy data ( Table 9). As the size of datasets increased, this advantage could be even greater, as shown in Figure 7.
With high-density SNP data (630 K), the prediction accuracy of emBayesR and BayesR was greater than GBLUP only for yield traits. Similar results (an advantage of a Bayesian approach over GBLUP for yield traits only) were obtained using the nonlinear iterative A method with imputed high-density data from 15 842 reference animals and 28 traits [40]. Computing time with high-density data for this nonlinear A method is also O(nm), with reported times similar to emBayesR. One difference between BayesR and the nonlinear A method is that SNP effects can actually be 0 with BayesR, whereas in   nonlinear A, SNPs will always have a non-zero effect, although it may be very small. This difference between the algorithms apparently does not affect accuracies of prediction with the 630 K real data, although it may become more important with whole-genome sequence data, for which the number of variants is much larger. However, this is yet to be demonstrated. It should also be noted that some reduction in computing time can be achieved by "pruning" SNPs that are in very high linkage disequilibrium from the dataset, since these SNPs carry redundant information. For example, Su et al. [41] reduced a dataset from 770 K to 492 K SNPs by pruning SNPs that were in very high linkage disequilibrium in a Nordic Holstein population prior to estimation of SNP effects.
Our aim is to eventually integrate emBayesR into genetic evaluations for Australian dairy cattle. Currently, the Australian National DNA reference population has more than 20 000 cattle, including 3719 Holstein bulls, 9630 Holstein cows, 1017 Jersey bulls and 4249 Jersey cows. For the evaluation of these national reference populations, GBLUP is currently used to calculate the Australia Genomic Breeding Value on 50 K SNP genotypes. However, even with the current data, prediction accuracy is higher with Bayes R than with GBLUP for some traits and GBLUP is unable to take advantage of the extra information that would be contained in whole-genome sequence data. Therefore, we anticipate moving to a Bayesian method to take advantage of whole-genome sequence data and increase prediction accuracies, and we expect that an EM algorithm will be part of this methodology in order to limit computing time.
In this paper, we used only bulls in the reference and validation sets, to avoid the added complexity of weighting bull and cow trait deviations differently. However, further development of the method described in this paper is needed to include appropriate weighting of phenotypes, multi-breed effects, polygenic effects in the model (as implemented in the MCMC version [19]) and to imbed the Bayesian method within a single-step genetic evaluation [42,43], so that it can be applied to the Australian national dairy evaluations. Also, efficient approaches for inversion of the animal by animal matrix to obtain the PEV need to be investigated to retain the efficiency advantage of emBayesR with very large numbers of animals.
Conclusions emBayesR uses an EM-based method to estimate the posterior mode of SNP effects, rather than the MCMC sampling used in BayesR. emBayesR can reduce computing time up to 8-fold compared to BayesR. Results with simulated data and real 630 K SNP dairy cattle data show that genomic prediction accuracy of emBayesR is similar to that of BayesR (0.5% accuracy loss averaged over traits). The computing advantages of emBayesR make it attractive for implementation of genomic prediction in very large datasets.

Additional files
Additional file 1: Calculation of P ik ¼E b ik jy; c Pr k