Parallel Markov chain Monte Carlo  bridging the gap to highperformance Bayesian computation in animal breeding and genetics
 XiaoLin Wu^{1, 2}Email author,
 Chuanyu Sun^{1},
 Timothy M Beissinger^{1, 3},
 Guilherme JM Rosa^{2, 4},
 Kent A Weigel^{1},
 Natalia de Leon Gatti^{3} and
 Daniel Gianola^{1, 2, 4}
DOI: 10.1186/129796864429
© Wu et al.; licensee BioMed Central Ltd. 2012
Received: 7 February 2012
Accepted: 28 August 2012
Published: 25 September 2012
Abstract
Background
Most Bayesian models for the analysis of complex traits are not analytically tractable and inferences are based on computationally intensive techniques. This is true of Bayesian models for genomeenabled selection, which uses wholegenome molecular data to predict the genetic merit of candidate animals for breeding purposes. In this regard, parallel computing can overcome the bottlenecks that can arise from series computing. Hence, a major goal of the present study is to bridge the gap to highperformance Bayesian computation in the context of animal breeding and genetics.
Results
Parallel Monte Carlo Markov chain algorithms and strategies are described in the context of animal breeding and genetics. Parallel Monte Carlo algorithms are introduced as a starting point including their applications to computing singleparameter and certain multipleparameter models. Then, two basic approaches for parallel Markov chain Monte Carlo are described: one aims at parallelization within a single chain; the other is based on running multiple chains, yet some variants are discussed as well. Features and strategies of the parallel Markov chain Monte Carlo are illustrated using real data, including a large beef cattle dataset with 50K SNP genotypes.
Conclusions
Parallel Markov chain Monte Carlo algorithms are useful for computing complex Bayesian models, which does not only lead to a dramatic speedup in computing but can also be used to optimize model parameters in complex Bayesian models. Hence, we anticipate that use of parallel Markov chain Monte Carlo will have a profound impact on revolutionizing the computational tools for genomic selection programs.
Background
In recent decades, Bayesian inference has been increasingly used for analysis of complex statistical models, in part because of increased availability and performance of personal computers and workstations. However, such models are generally not analytically tractable and, hence, computationally demanding numerical techniques are inevitably required. This is especially true of Bayesian computation for genomeenabled prediction and selection, which aims at using wholegenome molecular data to predict the genetic merit of candidate animals for breeding purposes [1]. Typically, implementation of a highdimensional model based on Markov chain Monte Carlo (MCMC) techniques is notoriously intensive in computing and often requires days, weeks, or even months of CPU (Central Processing Unit) time on personal computers and workstations [2]. Therefore, in order to overcome such computational burden, parallel computing becomes appealing [3, 4].
Parallel computing operates on the principle that a large problem can be split into smaller components and solved concurrently (i.e., ”in parallel”), each on a separate processor (or CPU core) [5]. An instance of a computer program and its activities that are taking place on each processor is referred to a process. Thus, parallel computing involves activating multiple processes that concurrently carry out related computing jobs and combining results by the main “controlling” process. Parallel computing can be achieved by programming with C, C++, or Fortran, e.g., using the MPI (Message Passing Interface) library to handle interprocess communication [6]. Highperformance computing communities have developed parallel programs for decades but were previously limited to programs running on expensive supercomputers. In the past twenty years, interest in parallel computing has grown markedly due to physical constraints that prevent frequency scaling [5] and to the need to handle datasets of unprecedented dimensionalities that are being generated [7]. Parallel computing has now become a dominant paradigm in current computer architectures, mainly in the form of multicore processors [8].
Parallel MCMC methods have recently been adopted in statistics and informatics [4, 9] and in image processing [10] but they have not received much attention in animal breeding and genetics. There are several reasons for this gap. First, MCMC algorithms are seemingly serial, and parallelism is not as straightforward as one would expect. Second, many intensive computational tasks in breeding and genetics applications have been handled via some simple data parallelism, implemented through the “multipletasking” mechanism provided by multicore Linux workstations. Multipletasking allows each processor to switch between tasks being executed on it, without having to wait for each task to finish, but this type of “parallel” computing is not scalable with the number of jobs. Recently, parallel MCMC algorithms and strategies have become a focal point for scientific computing in the postgenome era [4]. This is largely due to the need to handle genomic datasets of unprecedented sizes, such as genomewide dense markers or sequences for genomeenabled selection programs [2]. With a set of wholegenome markers (say 50K SNP markers or higher density) in a model, the computing task is highly challenging, particularly with sophisticated Bayesian models via MCMC implementation [1, 11–13].
In this paper, we present a technical description of parallel MCMC methods in the context of animal breeding and genetics. These algorithms typically fall into two categories: running multiple MCMC chains or parallelization within a single chain; some variants of these algorithms are discussed as well. A major purpose of this paper is to advocate the use of parallel MCMC methods, hence infusing highperformance computing technologies into animal selection programs in the postgenome era.
Methods
Parallel Monte Carlo methods
We start with parallel Monte Carlo methods, as a prelude to parallel MCMC. In practice, many statistical problems involve integrating over hundreds or even thousands of dimensions but usually these problems are not analytically tractable. Instead, Monte Carlo simulation methods [14] can be used to tackle highdimensional integrals. Standard Monte Carlo integration algorithms distribute the evaluation points uniformly over the integration regions.
Parallel computing for evaluating integrals
for some highdimensional θ with density $f\left(\mathit{\theta}\right)$. Suppose the integral cannot be evaluated analytically. If n realizations of θ can be sampled independently from $f\left(\mathit{\theta}\right)$ then, according to the strong law of large numbers, the sample average $\frac{1}{n}\underset{i=1}{\overset{n}{\Sigma}}p\left({\mathit{\theta}}^{\left(\mathit{i}\right)}\right)$ provides an approximation to $E\left(\mathit{p}\left(\mathit{\theta}\right)\right)$ when $n\to \infty $.
Simple Monte Carlo algorithms proceed by averaging large numbers of values that are generated independently of each other. Obviously, Monte Carlo simulation is parallel in computing because it can be conducted concurrently. By parallel computing, the entire set of samples can be divided among the available CPU cores and then each core generates a portion and summarizes its local samples. After all processors have finished their tasks, a master program summarizes all the partial data and outputs the final result.

➔ Process 0 (master process):
 (a)
computes and passes m to each process.
 (a)

➔ Each (slave) process (say j):
 (a)
simulates m independent realizations of $\mathit{\theta}\text{;}$
 (b)
computes ${\mathit{S}}_{\mathit{j}}=\underset{i=1}{\overset{m}{\Sigma}}p\left({\mathit{\theta}}^{\left(\mathit{i}\right)}\right)$ _{,} and passes S _{ j } back to the master program.
 (a)

➔ Process 0 (master program):
 (a)
sums S _{ j } and generates the final sum $S=\underset{j=1}{\overset{K}{\Sigma}}{\mathit{S}}_{\mathit{j}}\text{;}$
 (b)
computes the Monte Carlo estimate as $E\left(\mathit{p}\left(\mathit{\theta}\right)\right)=\frac{S}{T}\text{.}$
 (a)
Note that, in this example, the master process does not involve computing the sum of a portion of the data but it actually can. Also, note that each process is given the same number of samples. This works well if all CPU cores process the data at the same speed or approximately so. In practice, however, clock frequencies (i.e., computing speed) can vary markedly among processors. Hence, it can be more effective for each processor (or CPU core) to process a different number of samples, roughly proportional to its computing speed, and then let the master program compute the weighted average of all samples obtained from the K cores.
Parallel computing of singleparameter models
Hence, the posterior mean of ${\mathit{\sigma}}^{2}$ is $\frac{{\upsilon}_{0}{\mathit{\sigma}}_{0}^{2}+n{S}^{2}}{{\upsilon}_{0}+n2}$ for ${\upsilon}_{0}+n>2.$ Numerically, the posterior distribution of ${\mathit{\sigma}}^{2}$ can be inferred based on posterior samples generated from (4). Computing for this singleparameter normal model can follow exactly the same algorithm as parallel Monte Carlo simulation. Briefly, K parallel processes are executed, each generating a portion of the posterior samples of ${\mathit{\sigma}}^{2}\text{.}$ Then, the master process generates the final sum and computes the estimated posterior mean of ${\mathit{\sigma}}^{2}$ as a weighted average of all sample averages.
Parallel computing of multipleparameter models
Many models involve more than one unknown. Although many parameters are involved, conclusions are often drawn about one or only a few parameters at a time. In Bayesian analysis, the aim is to obtain the marginal posterior distribution of each parameter of interest. Often, we can construct the joint posterior distribution of all unknowns and then integrate this distribution over the unknowns that are not of immediate interest, leading to the desired marginal distribution of the parameter of interest.

➔ Sampling ${\mathit{s}}^{2\left(t\right)}$ from (8),

➔ Sampling ${\mu}^{\left(t\right)}$ from (9).
Parallel Markov Chain Monte Carlo
Analytical solutions are not always available for most multipleparameter models. Instead, MCMC simulation can be used to draw samples from the joint posterior distribution and then evaluate sampled values for the parameter(s) of interest while ignoring the values of other unknowns. MCMC methods are a variant of Monte Carlo schemes in which a Markov chain $\left\{{\mathit{X}}_{\mathit{j}},j=1,2\right\}$ is constructed with equilibrium distribution π equal to some distribution of interest, such as a posterior distribution in a Bayesian analysis [16]. Typically, the initial value is not a draw from the distribution π but if the chain is constructed properly, then ${X}_{\mathit{t}}\stackrel{d}{\to}\pi $ (here, d means convergence in distribution) and, under certain conditions, an estimator $\widehat{h}$ converges to ${h}_{\mathit{\pi}}$ as $t\to \infty \text{.}$ However, a Markov chain is sequential by nature because the distribution of ${X}_{\mathit{t}+1}$ depends on the value of ${X}_{\mathit{t}}\text{,}$ where t indexes the order of MCMC iterations. This introduces a difficulty to parallelization of a Markov chain.
Parallel MCMC by running multiple chains
A naive yet natural approach to parallel MCMC is simply to generate several independent Markov chains on different processors and then combine results appropriately [17, 18]. Given that running multiple chains is simple and that they scale well with the number of available processors (or CPU cores), this type of “multiplechain” parallelism is usually the strategy to strive for in the first instance.
Assume that we want to estimate some target distribution$p\left(X\right)$ but samples of X cannot be drawn directly from $p\left(X\right)\text{.}$ Instead, a Markov chain ${X}_{0},{X}_{1},\dots $ can be generated, which, through some transition density $u\left({X}_{\mathit{t}+1}{X}_{\mathit{t}}\right)\text{,}$ converges to $p\left(X\right)$ at equilibrium. Let there be $i=1,2,\dots ,K$ parallel chains, each initialized and burnedin independently for ${B}_{\mathit{i}}$ updating steps before more samples are drawn at intervals. As $K\to \infty $ and all${B}_{i}\to \infty $_{,} it can be shown that the ensemble is ergodic (tending in the limit) to $p\left(X\right)$[19].
where n is the number of iterations after burnin. Thus, parallel MCMC computing by virtue of running multiple chains is rewarding only when ρ is small. However, if ρ is large, the gain in computation through running multiple chains instead of a single long chain can be very disappointing.
Although running multiple Markov chains is theoretically straightforward, chains are not necessarily ergodic. Hence, some variant multiple MCMC methods have been proposed. For example, samples from multiple Markov chains may be confined to isolated modes if the target distribution is multimodal, or the chains may mix poorly when there are strong correlations between variables. Unfortunately, the latter is a common problem of Gibbs samplers [21]. Hence, pooling samples from multiple short chains may not necessarily give a better representation of p(X) than using a single long chain. If several chains are drifting to disparate modes, they will tend to be strongly influenced by the chains that they confine, because the weights will not necessarily be proportional to their relative masses.
Several strategies have been proposed for handling the aforementioned issues for single chains, such as adaptive MCMC algorithms [16, 22] and tempering [23, 24]. Metropoliscoupled MCMC is an algorithm that is related to simulated tempering and tempered transitions [23, 24]. It proceeds by simultaneously running a number of different Markov chains that are governed by different (but related) Markov chain transition probabilities. Occasionally, the algorithm “swaps” values between two different chains, with probability governed by the Metropolis algorithm to preserve the stationarity of the target distribution. These swaps can speed up convergence of the algorithm substantially [4]. Craiu et al. [25] targeted the posterior with an ensemble of chains, using the covariance of samples across all chains to adapt the proposal covariance for a set of MetropolisHastings chains. While these multiplechain methods use synchronous exchange of samples to expedite convergence, Murray [26] proposed mixing in an additional independent proposal, representing some hitherto best estimate or summary of the posterior, and cooperatively adapting across chains. The idea is to construct a global best estimate of the posterior at any given step and then mix this estimate as a remote component with whatever local proposal that a chain has adopted. This does not preclude adaptive treatment or tempering of that local proposal. It also permits a heterogeneous blend of remote proposals, so that the ensemble of chains can mix well.
Parallelization within a single chain
By running multiple Markov chains, we often observe that samplers mix poorly and each chain may require a very long burnin time. Hence, it would be preferable to develop parallelism within a single chain, instead of running multiple chains. As mentioned in the previous section, Markov chain simulation is an iterative procedure, in the sense that simulation of the next value of the chain depends on the current value. This creates difficulty for delivering parallelism for a single Markov chain. Nevertheless, we will show that a single chain can be parallelized, subject to assumptions of conditional independence in the model. The key is to identify such steps that can be implemented in parallel.

➔ Master program:
 (a)
samples a new σ, given realization of θ and the data y, and
 (b)
distributes the new σ to each process.
 (a)

➔ Each process (k):
 (a)
updates a subset of θ s that have been assigned to it, conditional on σ and y,
 (b)
computes summary statistics for the updated $\mathit{\theta}\text{s}\text{,}$ and
 (c)
passes the summary statistics back to the master program.
 (a)
Often, the above algorithm works quite well when the θ are all independent of one another, given σ and y. In practice, however, such independence may not necessarily hold and strategies must be developed to deliver efficient parallel MCMC algorithms given specific dependence between elements [3].
Applications
Parallel simulation for a singleparameter normal model
Intuitively, the posterior mean of θ is expressed as a weighted average of the prior mean (${\mu}_{0}$) and of the sample mean ($\overline{y}$), with weights equal to the corresponding precisions, $\frac{\mathbf{1}}{{\tau}_{0}^{2}}$ and $\frac{n}{{\mathit{\sigma}}^{2}}$, respectively. Because this is a singleparameter model, posterior samples of μ can be simulated in parallel by following the same algorithm as for Monte Carlo simulation.
Posterior summary statistics of average body weight daily gain based on a singleparameter normal model
Sample set  Min  Q1  Median  Mean  Q3  Max 

1  3.357  3.388  3.394  3.394  3.400  3.431 
2  3.356  3.388  3.394  3.394  3.400  3.429 
3  3.352  3.388  3.394  3.394  3.400  3.43 
4  3.357  3.388  3.394  3.394  3.400  3.436 
5  3.353  3.388  3.394  3.394  3.400  3.432 
6  3.355  3.388  3.394  3.394  3.400  3.43 
7  3.355  3.388  3.394  3.394  3.400  3.431 
8  3.354  3.388  3.394  3.394  3.400  3.428 
9  3.356  3.388  3.394  3.394  3.400  3.431 
10  3.358  3.388  3.394  3.394  3.400  3.433 
Pooled  3.352  3.388  3.394  3.394  3.400  3.436 
The purpose of this example was to show parallel computing using the MPI (Message Passing Interface) library. The change in computing time for this example was, however, almost insignificant because sampling from a normal distribution is very quick. In addition, with parallel simulation, interprocess communication requires some extra time as overhead, which offset gains from parallel computing.
MPI is a languageindependent communication protocol used to program parallel computers that is extensively used for highperformance computing. More specifically, MPI is a library of routines for creating parallel programs e.g., in C or Fortran 77, that allow users to create programs that can run on most parallel computer architectures. (Note that there is a language extension to Fortran90 called High Performance Fortran – HPF, which supports highperformance computing.) In the example code, the MPI library was used to handle interprocess communications in the C program. With MPI, each task can have its own local memory during computation (but multiple tasks can reside in the same physical machine and/or an arbitrary number of machines). Typically, tasks exchange data by sending and receiving messages but data transfer usually requires cooperation among processors, that is, a “send” operation must have a matching “receive” operation.
A few details about this program in the Appendix are described in the following. MPI_Comm_rand() is used to find out the ID of all participating processors and MPI_Comm_size() is used to get the number of participating processors. A common pattern of interaction among parallel processors is to use MPI_Send() and MPI_Receive() to allocate work among them. In the present example, however, this was done in a slightly different manner. MPI_Bcast() is used to send common parameter values (e.g., number of simulation steps) to all participating processors. Then, after each processor has finished its work, the subroutine MPI_Reduction() is used to sum up the posterior values from all processors. Subroutine MPI_Reduction() collects data from all processors, reduces the data to a single value (e.g., by summation), and then stores the results on the master process (and on all processes as well). There are several predefined operations that MPI_Reduction() can provide. In addition to summation, it can also conduct multiplication, and find minimum or maximum values. Finally, the master processor computes the means and standard deviation (and other posterior statistics, when relevant) for the mean of the normal model. Note that, in this illustration, we used sequential functions to generate random numbers (http://apps.nrbook.com/c/index.html), with process ID used as the random number seed. Preferably, one can use a parallel random number generator, such as the Scalable Parallel Random Number Generators (SPRNG) Library (http://sprng.cs.fsu.edu/).
Running multiple chains for Bayesian LASSO modeling
where y is a vector of phenotypes, μ is an effect common to all observations, β is a vector of unknown marker effects, X is an incidence matrix, and e is a vector of residuals. The prior specification follows de los Campos et al. [12] but without the terms for additive genetic effects. An R package, BLR, was used to implement the Bayesian LASSO model [28].
The dataset used here consisted of 147 Angus cattle, each genotyped for 37 892 polymorphic SNP (Single Nucleotide Polymorphisms) markers and with estimated breeding values (EBV) for marbling score as response variable. In addition to running a single long chain of 100 000 iterations (after a burnin of 1000 iterations), we also ran 10 chains, each consisting of 10 000 iterations (after a burnin of 1000 iterations). All jobs were submitted and run on a Condor cluster at the University of Wisconsin – Madison [29]. This cluster provides 1860 cores for distributed parallel computing. Among them, 1847 run a Linux operation system and the remaining a Windows operation system. Memory size ranges from 256M to 214G: 7.04% (<1G), 67.58% (13G), 22.69% (48G), and 2.67% (>10G). A Perl script was used, that installs the R system and required libraries (such as the SuppDists package) onto remote nodes prior to the computing and then executes the Bayesian LASSO program. This Perl script served as the executable in the Condor job batch file.
The above is an alternative form of (10) with $b=\rho n$ and K treated as unknown. Ideally, if $b=0\text{,}$ this means “perfectly parallel computing”, in which the potential reduction in runtime is K fold. However, when b is a significant proportion of n, the actual speedup falls well short of its potential.
Thus, when running multiple chains, each with a significant length of burnin, the speedup does not scale well with the number of available CPU cores. Let $n=t\times b\text{.}$ Then, the speedup S is a function of t and K, as depicted in Figure 3b. Clearly, given the fixed relationship between n and b, the speedup will reach a plateau after the number of CPU cores reaches a certain level.
Parallel BayesCpC for wholegenome prediction
Here, ${\mathit{\sigma}}_{\alpha}^{2}$ is a variance common to all nonzero SNP effects, which is assigned a scaled inverse chisquare distribution, ${\chi}^{2}\left({\mathit{v}}_{\alpha},{\text{s}}_{\alpha}^{2}\right)\text{.}$ Furthermore, the value of π is unknown and needs to be inferred, given the prior distribution of π that is taken as uniform between 0 and 1, $\mathit{\pi}\sim U\mathit{n}\mathit{i}\mathit{f}\mathit{o}\mathit{r}\mathit{m}\left(0,1\right)\text{.}$ A Bernoulli indicator variable, ${\delta}_{j}\text{,}$ is introduced to facilitate sampling from the mixtures of the SNP effects, $p{\delta}_{\mathit{i}}\pi ={\pi}^{1{\delta}_{\mathit{i}}}{\left(1\mathit{\pi}\right)}^{{\delta}_{\mathit{i}}}\text{.}$ Hence, unconditionally, the variable ${\alpha}_{\mathit{j}}$ follows a multivariatet distribution, $t\left(0,{S}_{\mathit{\alpha}}^{2},{\upsilon}_{\mathit{\alpha}}\right)\text{,}$ if ${\delta}_{\mathit{j}}=1\text{,}$ or equals zero otherwise [32]. Posterior inference of unknown parameters in the Bayesian model via Markov chain Monte Carlo implementation is described by Habier et al. [11].
With a subset of, say $k\le p$ markers selected in a certain iteration of the MCMC for the BayesC π model, then the next iteration assumes that all the k selected SNP have nonnull effects on the quantitative trait. The above defines BayesC model with $\pi =0\text{,}$ which takes the same form as (15) but with $\pi =0$ and p replaced by k. Posterior inference in BayesC $\left(\mathit{\pi}=0\right)$ is the same as for BayesC$\pi \text{,}$ except that π is fixed at zero and sampling indicator variables is no longer relevant.
Typically, Kfold CV is often used to evaluate predictive models, in which the whole dataset is divided into K portions of approximately equal size. The model is then trained in the set of K1 portions of the data and predicted in the remaining one portion. Portioning of training and testing sets is then rotated K times in each CV experiment. Furthermore, each CV experiment can be randomly replicated a number of times in order to increase the stability of model evaluation (but we did not do that in the present study).
As an example, we used the parallel BayesCpC package to select the optimal number of SNP to predict rib eye area in a beef cattle population. The data consisted of 2919 animals each with estimated breeding values for rib eye area and genotypes obtained from the Illumina 50K SNP Beadchip. After data editing and data quality control, 46 723 polymorphic SNP markers were used. The optimal panel search started with the top 50 SNP markers according to the posterior model probability for having a nonzero effect for this marker, and then increased from the top 100 to the top 3000 markers at an increment of 100 markers each time. We did not exhaust all possible panels beyond 3K because the prediction accuracy showed a constant trend of decrease after the panel reached 1000 SNP.
In the parallelBayesCpC package, MPI is used for data quality control and parallel MCMC is employed by both FS and CV. In our example, distributed jobs were submitted to a local Condor pool with 128 cores. These are Linux workstations, with Intel® Xeon® CPU (2.67 GHz), > 8G memory per slot and a cache size of 12 288 KB. The submit machine is also a Linux workstation with Intel® Xeon® CPU (3.00 GHz), 16G memory and 6144 KB cache size. Each parallel MCMC chain for feature selection consisted of 10 000 iterations after a burnin of 2000 iterations, thinned every onetenth. Each parallel MCMC for CV consisted of 25 000 iterations, after a burnin of 5000 iterations, thinned every onetenth.
We examined three computing strategies to search for an optimal panel size for predicting genetic merit using SNP markers. The first strategy executed 30 metajobs in parallel, each consisting of one round of parallel FS jobs using all SNP markers and one round of parallel postFS inference and CV for a specific panel (X = 50, 100, 200, …, 3000, respectively). In the second strategy, one round of parallel FS jobs was executed, followed by 30 rounds of parallel postFS inference and CV jobs conducted sequentially for all panel sizes. Parallel CV jobs for the 500SNP panel started only when parallel CV jobs for the 400SNP jobs had finished. The third strategy consisted of 30 metajobs executed in series, with each metajob consisting of one round of parallel FS jobs using all SNP markers and one round of parallel postFS inference and CV for a specific panel. The difference between the second and the third strategy is that different optimal panel sizes were selected based on the same set of FS results with the second computing strategy but each panel was selected based on a different set of FS results with the third computing strategy.
With the Bayesian regression models explored here, feature selection may be important since a model with all SNP does not necessarily give the best predictions. This situation is unlike ridge regression best linear unbiased prediction [34] or prediction using the GBLUP method [35], for which a model with all markers would typically be favored. While selecting models of varying dimensions may be an issue to explore, it brings tremendous challenges to computing, particularly when the dataset is large. In this regard, highperformance computing offers a markedly competitive edge, not only in reducing computing time but also in tuning optimal models for wholegenome prediction.
Discussion
To date, almost all statistical software packages for animal breeding and genetics have been developed for serial computation. In such programs, only one instruction is executed at a time and after that instruction is finished, the next instruction begins. Hence, serial computing performance depends heavily on the speed (clock frequency) of CPU, and the runtime of a program is approximately equal to the number of instructions multiplied by the average time per instruction. Keeping everything else constant, higher clock frequency leads to faster computing speed and thus decreased runtime for all computationbounded programs [36]. This was the situation with the performance enhancement of microprocessors based on a single CPU from the 1980s to the early 2000s. However, rate of improvement has slowed down since 2003 due to hardware limitations incurred by energy consumption and heat dissipation. On the other hand, parallel computing has gained impetus as a result of increasingly available multiplecore computers, computer clusters, and networking and has been referred to as the concurrency revolution [37]. In theory, multiple threads of execution can cooperate to complete the work faster than a serial setting.
Parallel computing uses multiple processing elements concurrently to solve a problem. To implement parallel computing, one first needs to break the problem into discrete “chunks” of work, so that they can be distributed to run on multiple processors. This is known as task decomposition or partitioning. The next fundamental step in designing the parallel algorithm involves identifying independencies that are assumed explicitly in the model. Without loss of generality, let P_{ i } and P_{ j } be two program jobs, where i < j indexes the order of execution. Bernstein’s conditions [38] can be used to identify whether or not the two jobs are independent and can be executed in parallel [39]. Let I_{ i } and O_{ i } be the input variables and output variables, respectively, of P_{ i }. Likewise, the same definitions hold for I_{ j } and O_{ j } of P_{ j } . Then, P_{ i } and P_{ j } are independent if they satisfy (1) ${\mathit{I}}_{\mathit{j}}\cap {\mathit{O}}_{\mathit{i}}=\varnothing \text{,}$ (2) ${\mathit{I}}_{\mathit{i}}\cap {O}_{\mathit{j}}=\varnothing \text{,}$ and (3) ${O}_{\mathit{i}}\cap {O}_{\mathit{j}}=\varnothing \text{.}$ Violation of condition (1) introduces a flow dependency because the results from the first job (P_{ i }) are used by the second job (P_{ j }). Violation of condition (2) introduces an antidependency, because P_{ j } would overwrite a variable needed by P_{ i } . The third condition represents an output dependency: when two statements write to the same location, the final result is determined by the last executed statement.
MCMC algorithms have revolutionized the application of Bayesian inference, because it tackles a large range of complex inferential problems that were previously not considered possible, tractable. In the meantime, statisticians are becoming ever more ambitious in the range (complexity) of models they consider and MCMC algorithms for large complex models often require enormous amounts of computing power. Consequently, effective exploitation of parallel algorithms is of high relevance to Bayesian computation. A difficulty, however, is that MCMC algorithms are serial by nature and do not easily migrate onto a parallel system. Nevertheless, various strategies can be used to design parallel MCMC methods. The key is to identify steps with data independence or conditional independence on which parallelism can reside. Several algorithms or strategies exist for running parallel MCMC, which is straightforward and it requires minimal interprocess communication. However, poorly mixing MCMC algorithms with long burnin periods are not ideally suited to this situation, because a long period of burnin must be repeated on every available CPU core. Thus, it is often desirable to explore strategies for parallelism within single chains.
The actual performance of parallel MCMC depends on several issues. Among them, interprocess communication is a primary factor. Many parallel applications require processes to share data with each other, which is known as intertask communication and implies overhead. Intertask communication can offset the gain in computing speed from parallel computing, because it frequently requires some type of synchronization between tasks, causing processes to spend time “waiting”. In the worst case, competing communication traffic can saturate the available network bandwidth, leading to poor parallel computing performance. Thus, there is always a need to balance the distribution of work among tasks. There is no simple rule for this type of load balancing. Ideally, all tasks are to be kept busy all of the time, so that task idle time is minimized [9, 10].
The ratio of computation to communication is qualitatively measured by the concept of granularity (https://computing.llnl.gov/tutorials/parallel_comp/). On one hand, a low computation to communication ratio (finegrain parallelism) facilitates load balancing, as relatively small amounts of computational work are done between communication events but this can imply high communication overhead and less opportunity for performance enhancement, because communication and synchronization between tasks may take longer than the computation. On the other hand, a high computation to communication ratio (coarsegrain parallelism) allows more opportunity for performance enhancement; because relatively large amounts of computational work are done between communication and synchronization events. However, it is difficult to balance loads efficiently with coarsegrain parallelism and computing time may differ dramatically between computer cores. Therefore, there is a tradeoff between computing and communication and the optimal granularity depends on the problem at hand. In most parallel MCMC problems, it is advantageous to have coarse granularity because the overhead associated with communication and synchronization is high relative to execution speed, but finegrain parallelism can sometimes help reduce overhead due to load imbalance.
Conclusions
In this paper, we have shown the principles and examples of parallel MCMC, with applications to wholegenome prediction of breeding values. Parallel computing operates on the principle that a large problem can be split into smaller components and solved concurrently (i.e., “in parallel”), each on a separate processor (or CPU core). In the context of parallel MCMC, two basic algorithms exist: running multiple chains and parallelism within a single chain, yet some variants can be useful as well. In principle, all Bayesian models can be parallelized in computing but the associated algorithms and strategies may differ, leading to varied computing efficiencies. Although many technical details have yet to be explored, we expect that the use of parallel MCMC methods will revolutionize computational tools for research and breeding programs for animals in the postgenome era.
Appendix
(a) Example C code using MPI for parallel simulation of a singleparameter normal model with unknown mean and known variance
/* This is an example C program using MPI for parallel computing of *
* a singleparameter normal model with unkown mean and known *
* variance. For illustration purpose, the step for data input is omitted. *
* Instead, the sample mean and standard deviation, as well as the prior *
* mean and standard are used. *
* Contact: XL Wu, nick.wu@ansci.wisc.edu; UWMadison, 09122011 */
#include < stdio.h>
#include < math.h>
#include < mpi.h>
#include “ranNum.h”
main(int argc, char **argv)
{int proc_id, root_process, nprocs, ierr, niters, i;
double xi, xi2, sum, psum, sum_xi2, psum_xi2;
double tar0, tar1, sdn, tarn, varn, mun, mumu, sdmu;
MPI_Status status;
/* Prior mean and standard deviation */
double mu0 = 4.000;
double sd0 = 1.000;
/* sample size, mean and variance */
int nind = 7670;
double mu1 = 3.394;
double sd1 = 0.580;
/* compute posterior statistics */
tar0 = 1.0/(sd0*sd0);
tar1 = (1.0*nind)/(sd1*sd1);
varn = 1.0/(tar0 + tar1);
sdn = sqrt(varn);
mun = varn * (tar0*mu0 + tar1*mu1);
/* Parallel simulation begins, starting from here */
/* process 0 as the root process. */
root_process = 0;
/* Replicate this process to create parallel processes. */
ierr = MPI_Init(&argc, &argv);
/* Allocate memory for random seed variable*/
long* idum;
MPI_Alloc_mem(sizeof(long), MPI_INFO_NULL, &idum);
/* Find out process ID and number of participating processes. */
ierr = MPI_Comm_rank(MPI_COMM_WORLD, &proc_id);
ierr = MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
/* Root process gets the number of simulation steps */
if(proc_id == root_process) {
printf(“Please enter the number of simulation steps: ”);
scanf(“%i”, &niters);}
/* Broadcast the number of simulation steps to all participating processes */
ierr = MPI_Bcast(&niters, 1, MPI_INT, root_process,MPI_COMM_WORLD);
/*process id as the random number seed */
*idum = proc_id;
/* Each process computes a partial sum of simulated values */
psum = 0;
psum_xi2 = 0;
for(i = proc_id + 1; i < niters + 1; i + = nprocs) {
xi = mun + sdn * randomnormal(idum);
xi2 = xi * xi;
psum = psum + xi;
psum_xi2 = psum_xi2 + xi2;}
printf(“proc %i computes: %f\n”, proc_id, (float)psum);
/* Do a reduction in which all partial sums are combined into the grand sum */
ierr = MPI_Reduce(&psum, &sum, 1, MPI_DOUBLE,
MPI_SUM, root_process, MPI_COMM_WORLD);
ierr = MPI_Reduce(&psum_xi2, &sum_xi2, 1, MPI_DOUBLE,
MPI_SUM, root_process, MPI_COMM_WORLD);
/* Finally, the root process prints posterior mean and standard error of mu. */
if(proc_id == root_process) {
mumu = sum / niters;
sdmu = sqrt((sum_xi2  niters*mumu*mumu)/(niters1));
printf(“The posterior mean and variance of mu is %f and %f\n”, (float)mumu,(float)sdmu);}
/* Close down this processes. */
ierr = MPI_Finalize();}
(b) Condor job batch files (Note: these Condor scripts were written automatically be a R scripts based on a input parameter file. Using the BayesCpC package, a user does not need to write this type of Condor job batch files.)
# Condor DAGMan job
JOB input job_InputData
JOB selection job_Selection
SCRIPT POST selection /usr/local/wgse_beta/V2/cmd_postscript_SelectionSummary
JOB validation job_Validation
SCRIPT POST validation /usr/local/wgse_beta/V2/cmd_postscript_OutputResults
PARENT input CHILD selection
PARENT selection CHILD validation
# Data input and quality control
Universe = vanilla
Executable = /usr/local/wgse_beta/V2/rungeneric.pl
Arguments = ‐‐new ‐‐type = R ‐‐tarball = builtsl5R2.10.1.tar.gz ‐‐installfrom = R2.10.1 ‐‐cmdtorun = p1_data_input_n_processing.R ‐‐unique = input
Log = step_input.log
Output = step_input.out
Error = step_input.error
notification = NEVER
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files =(data path and files are omitted), /home/nickwu/BayesCpC/V2_UW/data/datafile.csv, /usr/local/wgse_beta/V2/rungeneric.pl, /usr/local/wgse_beta/V2/SLIBS.tar.gz, /usr/local/wgse_beta/V2/cmd_data_input_n_processing, /usr/local/wgse_beta/V2/p1_data_input_n_processing.R,
Queue
# Feature selection
Universe = vanilla
Executable = /usr/local/wgse_beta/V2/rungeneric.pl
Arguments = ‐‐new ‐‐type = R ‐‐tarball = builtsl5R2.10.1.tar.gz ‐‐installfrom = R2.10.1 ‐‐cmdtorun = p4_BayesCpC_validation.R ‐‐unique = validation_1 1
Log = step_validation.$(process).log
Output = step_validation.$(process).out
Error = step_validation.$(process).error
notification = NEVER
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = .linkPar, /usr/local/wgse_beta/V2/rungeneric.pl, /usr/local/wgse_beta/V2/SLIBS.tar.gz, /usr/local/wgse_beta/V2/cmd_cross_validation, /usr/local/wgse_beta/V2/p4_BayesCpC_validation.R,
Queue
Arguments = ‐‐new ‐‐type = R ‐‐tarball = builtsl5R2.10.1.tar.gz ‐‐installfrom = R2.10.1 ‐‐cmdtorun = p4_BayesCpC_validation.R ‐‐unique = validation_2 2
Queue
Arguments = ‐‐new ‐‐type = R ‐‐tarball = builtsl5R2.10.1.tar.gz ‐‐installfrom = R2.10.1 ‐‐cmdtorun = p4_BayesCpC_validation.R ‐‐unique = validation_3 3
Queue
# PostFS inference and crossvalidation
Universe = vanilla
Executable = /usr/local/wgse_beta/V2/rungeneric.pl
Arguments = ‐‐new ‐‐type = R ‐‐tarball = builtsl5R2.10.1.tar.gz ‐‐installfrom = R2.10.1 ‐‐cmdtorun = p4_BayesCpC_validation.R ‐‐unique = validation_1 1
Log = step_validation.$(process).log
Output = step_validation.$(process).out
Error = step_validation.$(process).error
notification = NEVER
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = .linkPar, /usr/local/wgse_beta/V2/rungeneric.pl, /usr/local/wgse_beta/V2/SLIBS.tar.gz, /usr/local/wgse_beta/V2/cmd_cross_validation, /usr/local/wgse_beta/V2/p4_BayesCpC_validation.R,
Queue
Arguments = ‐‐new ‐‐type = R ‐‐tarball = builtsl5R2.10.1.tar.gz ‐‐installfrom = R2.10.1 ‐‐cmdtorun = p4_BayesCpC_validation.R ‐‐unique = validation_2 2
Queue
Arguments = ‐‐new ‐‐type = R ‐‐tarball = builtsl5R2.10.1.tar.gz ‐‐installfrom = R2.10.1 ‐‐cmdtorun = p4_BayesCpC_validation.R ‐‐unique = validation_3 3
Queue
Declarations
Acknowledgements
This research was supported by the University of Wisconsin (UW) Foundation, a genomic selection grant by Merial Ltd., and National Research Initiative competitive grant no. 20093520505099 from the USDA National Institute for Food and Agriculture Animal Genome Program (Washington, DC, USA). Drs. Stewart Bauck and Brent Woodwart were partially involved in this study. Dr. Jeremy F. Taylor is acknowledged for kindly providing the beef cattle data used in this study. William Taylor at the UW Center for HighThroughput Computing (CHTC) is thanked for his excellent technical assistance in running multiple chains for Bayesian LASSO on the CHTC Condor pool and for helping design the SOAR interface for the BayesCPC. Special thanks go to the two anonymous reviewers and the editors (Drs. Helene Hayes and Jack Dekkers) whose edits and suggestions have improved this paper a lot. In particular, Dr. Jack Dekkers is acknowledged who has thoroughly and critically revised the whole manuscript; his inputs are not only editorial but highly technical as well. KW acknowledges financial support from the National Association of Animal Breeders (Columbia, MO).
Authors’ Affiliations
References
 Meuwissen THE, Hayes BJ, Goddard ME: Prediction of total genetic value using genomewide dense marker maps. Genetics. 2001, 157: 18191829.PubMed CentralPubMed
 Wu XL, Beissinger TM, Bauck S, Woodward B, Rosa GJM, Weigel KA, de Leon Gatti N, Gianola D: A primer on highthroughput computing for genomic selection. Front Genet. 2011, 2: 4PubMed CentralView ArticlePubMed
 Wilkinson DJ: Parallel Bayesian computation. Handbook of Parallel Computing and Statistics. Edited by: Kontoghiorghes EJ. 2005, Chapman and Hall/CRC, Boca Raton, FL, USA, 477508.View Article
 Rosenthal JS: Parallel computing and Monte Carlo algorithms. Far East J Theor Stat. 2000, 4: 207236.
 Almasi GS, Gottlieb A: Highly Parallel Computing. 1989, BenjaminCummings publishers, Redwood City
 Gropp W, Lusk E, Skjellum A: Using MPI: Portable Parallel Programming with the Message Passing Interface. 1999, The MIT Press, Cambridge, 2
 Benson DA, KarschMizrachi I, Lipman DJ, Ostell J, Sayers EW: GenBank. Nucleic Acids Res. 2011, 39: D32D37. 10.1093/nar/gkq1079. Database issuePubMed CentralView ArticlePubMed
 Asanovic K, Bodik R, Catanzaro BC, Gebis JJ, Husbands P, Keutzer K, Patterson DA, Plishker WL, Shalf J, Williams SW, Yelick KA, University of California at Berkeley: The landscape of parallel computing research: A view from Berkeley. Technical Report No. UCB/EECS‐‐ 2006183. 2006,http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS2006183.pdf,
 Brockwell AE: Parallel Markov chain Monte Carlo simulation by prefetching. J Comput Graph Stat. 2006, 15: 246261. 10.1198/106186006X100579.View Article
 Ye J, Wallace A, Thompson J: Proceedings of the 17th European Signal Processing Conference: 2428 August 2009; Glasgow. Parallel Markov chain Monte Carlo computation for varying dimension signal analysis. 2009, 26732677.
 Habier D, Fernando RL, Dekkers JCM: Genomic selection using lowdensity marker panels. Genetics. 2009, 182: 343353. 10.1534/genetics.108.100289.PubMed CentralView ArticlePubMed
 de los Campos G, Naya H, Gianola D, Crossa J, Legarra A, Manfredi E, Weigel K, Cotes JM: Predicting quantitative traits with regression models for dense molecular markers and pedigree. Genetics. 2009, 182: 375385. 10.1534/genetics.109.101501.PubMed CentralView ArticlePubMed
 Zhong S, Dekkers JCM, Fernando RL, Jannink JL: Factors affecting accuracy from genomic selection in populations derived from multiple inbred lines: a barley case study. Genetics. 2009, 182: 355364. 10.1534/genetics.108.098277.PubMed CentralView ArticlePubMed
 Fishman GS: Monte Carlo: Concepts, Algorithms, and Applications. 1995, Springer, New York
 Gelman A, Carlin JB, Stern HS, Rubin DB: Bayesian Data Analysis. 2004, Chapman and Hall, New York
 Gilks WR, Roberts GO, Sahu SK: Adaptive Markov chain Monte Carlo through regeneration. J Am Stat Assoc. 1998, 93: 10451054. 10.1080/01621459.1998.10473766.View Article
 Gelman A, Rubin DB: Inference from iterative simulation using multiple simulations. Stat Sci. 1992, 7: 457511. 10.1214/ss/1177011136.View Article
 Bradford R, Thomas A: Markov chain Monte Carlo methods for family trees using parallel processor. Stat Comput. 1996, 6: 6775. 10.1007/BF00161575.View Article
 Geyer CJ: Markov chain Monte Carlo maximum likelihood. Proceedings of the 23rd Symposium on the Interface: Computing Science and Statistics: 2124 April 1991; Seattle. Edited by: Keramidas E. 1991, 156163.
 Amdahl GM: Validity of the single processor approach to achieving largescale computing capabilities. In Proceedings of the American Federation of Information Processing Societies: 1416 November; Anaheim. 1967, 30: 483485.
 Geman S, Geman D: Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans Pattern Anal Mach Intell. 1984, 6: 721741.View ArticlePubMed
 Andrieu C, Thoms J: A tutorial on adaptive MCMC. Stat Comput. 2008, 18: 343373. 10.1007/s112220089110y.View Article
 Marinari E, Parisi G: Simulated tempering: a new Monte Carlo scheme. Europhys Lett. 1992, 19: 451458. 10.1209/02955075/19/6/002.View Article
 Neal RM: Sampling from multimodal distributions using tempered transitions. Stat Comput. 1996, 6: 353366. 10.1007/BF00143556.View Article
 Craiu RV, Rosenthal JS, Yang C: Learn from thy neighbor: Parallelchain and regional adaptive MCMC. J Am Stat Assoc. 2009, 104: 14541466. 10.1198/jasa.2009.tm08393.View Article
 Murray L: Proceedings of Neural Information Processing Systems Workshop on Learning on Cores, Clusters and Clouds: 11 December 2010; Mt. Currie South. Distributed Markov chain Monte Carlo.http://lccc.eecs.berkeley.edu/papers.html,
 Park T, Casella G: The Bayesian Lasso. J Am Stat Assoc. 2008, 103: 681686. 10.1198/016214508000000337.View Article
 Perez P, Delos Campos G, Crossa J, Gianola D: Genomicenabled prediction based on molecular markers and pedigree using the Bayesian linear regression package in R. Plant Genome. 2010, 3: 106116. 10.3835/plantgenome2010.04.0005.PubMed CentralView ArticlePubMed
 Thain D, Tannenbaum T, Livny M: Distributed computing in practice: the Condor experience. Concurrency Computat: Pract Exper. 2005, 17: 323356. 10.1002/cpe.938.View Article
 Wu XL, Yao C, Long N, Stewart B, Woodward B, Mujibi DFN, Rosa GJ, Weigel KA, Gianola D: Proceedings of the Plant and Animal Genome XIX Conference: 1519 January 2011; San Diego. Highthroughput computing for genomeenabled selection – Preliminary deployment of a HTC pipeline for postgenomeera breeding programs.http://www.intlpag.org/2013/index.php/abstracts/abstractsarchive,
 Wu XL, Hayrettin O, Duan H, Beissinger T, Bauck S, Woodward B, Rosa GJM, Weigel KA, de Leon Gatti N, Taylor J, Gianola D: Proceedings of the Plant and Animal Genome XX Conference: 1418 January 2012; San Diego. ParallelBayesCpC on OSG: Gridenabled Highthroughput computing for genomic selection in practice.https://pag.confex.com/pag/xx/webprogram/Paper4104.html,
 Gianola D, Sorensen D: Quantitative genetic models for describing simultaneous and recursive relationships between phenotypes. Genetics. 2004, 167: 14071424. 10.1534/genetics.103.025734.PubMed CentralView ArticlePubMed
 Bishop CM: Pattern Recognition and Machine Learning. 2006, Springer, New York
 Piepho HP: Ridge regression and extensions for genomewide selection in maize. Crop Sci. 2009, 49: 11651176. 10.2135/cropsci2008.10.0595.View Article
 Luan T, Woolliams JA, Lien S, Kent M, Svendsen M, Meewissen THE: The accuracy of genomic selection in Norwegian red cattle assessed by crossvalidation. Genetics. 2009, 183: 11191126. 10.1534/genetics.109.107391.PubMed CentralView ArticlePubMed
 Hennessy JL, Patterson DA: Computer Architecture: A Quantitative Approach. 2002, Morgan Kaufmann Publishers, San Francisco, 3
 Sutter H, Larus J: Software and the concurrency revolution. Association for Computing Machinery Queue. 2005, 3: 5462.
 Bernstein AJ: Analysis of programs for parallel processing. IEEE Trans Computers. 1966, EC15: 757762.View Article
 Roosta SH: Parallel Processing and Parallel Algorithms: Theory and Computation. 2000, Springer, New YorkView Article
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.