A comparison of strategies for Markov chain Monte Carlo computation in quantitative genetics

In quantitative genetics, Markov chain Monte Carlo (MCMC) methods are indispensable for statistical inference in non-standard models like generalized linear models with genetic random effects or models with genetically structured variance heterogeneity. A particular challenge for MCMC applications in quantitative genetics is to obtain efficient updates of the high-dimensional vectors of genetic random effects and the associated covariance parameters. We discuss various strategies to approach this problem including reparameterization, Langevin-Hastings updates, and updates based on normal approximations. The methods are compared in applications to Bayesian inference for three data sets using a model with genetically structured variance heterogeneity.


INTRODUCTION
Given observations of a trait and a pedigree for a group of animals, the basic model in quantitative genetics is a linear mixed model with genetic random effects. The correlation matrix of the genetic random effects is determined by the pedigree and is typically very high-dimensional but with a sparse inverse. Maximum likelihood inference and Bayesian inference for the linear mixed model are well-studied topics [16]. Regarding Bayesian inference, with an appropriate choice of priors, the full conditional distributions are standard distributions and Gibbs sampling can be implemented relatively straightforwardly. The assumptions of normality, linearity, and variance homogeneity are in many cases not valid. One may then consider generalized linear mixed models where the genetic random effects enter at the level of the linear predictor. San Cristobal-Gaudy et al. [15] proposed another extension of the linear mixed model, introducing genetic random effects influencing the log residual variances of the observations, thereby producing a genetically structured variance heterogeneity. Considerable computational problems arise when abandoning the standard linear mixed model. Classical maximum likelihood and Bayesian inference are complicated since the likelihood function cannot be evaluated explicitly. Using conventional Gibbs sampling to approximate the posterior distribution is difficult since the full conditional distributions are no longer of standard form.
The aim of this paper is to discuss strategies to obtain efficient Markov chain Monte Carlo (MCMC) algorithms for non-standard models of the kind mentioned in the previous paragraph. Such algorithms may be used either to approximate posterior distributions or to generate samples for importance sampling approximations of the likelihood [6]. In particular we focus on the problem of constructing efficient updating schemes for the high-dimensional vectors of genetic random effects. We review the methodological background and discuss the various algorithms in the context of the heterogeneous variance model. Apart from being a model of great interest in its own right, this model has proven to be a hard test for MCMC methods. We compare the performances of the different algorithms when applied to three real datasets which differ markedly both in size and regarding the inferences concerning the genetic covariance parameters. Section 2 discusses general strategies for obtaining efficient MCMC algorithms while Section 3 considers these strategies in the specific context of the San Cristobal-Gaudy et al. [15] model. Section 4 presents results of applying two MCMC schemes to data sets with pig litter sizes, rabbit litter sizes, and snail weights. Some concluding remarks are given in Section 5.

MCMC STRATEGIES FOR HIGH-DIMENSIONAL PROBLEMS
We initially discuss MCMC strategies in a rather general framework where, given the vector of random additive genetic effects a = (a_1, ..., a_M) and a parameter vector β, the vector y of observed traits follows some density f(y|a, β). As usual in quantitative genetics, a is assumed to be zero-mean normal with covariance matrix σ²_a A, where A is the additive genetic relationship matrix that reflects the family structure, typically known, and σ²_a is the additive genetic variance. The posterior density of a given the data y is

p(a|y) ∝ f(y|a, β) p(a|σ²_a),  (1)

where we for notational convenience omit β and σ²_a on the left hand side. An algorithm for sampling a can typically easily be extended with updates of the lower-dimensional quantities β and σ²_a in order to sample the full posterior distribution of (a, σ²_a, β) (Sect. 2.5). An introduction to MCMC can be found e.g. in [16].

The Metropolis-Hastings algorithm
The Metropolis-Hastings algorithm generates a Markov chain a^1, a^2, ... as follows. Given the current value a^i of the Markov chain, a proposal a_prop is generated from a proposal density q(a_prop|a^i). With probability min{1, r}, where r is the Metropolis-Hastings ratio

r = [p(a_prop|y) q(a^i|a_prop)] / [p(a^i|y) q(a_prop|a^i)],  (2)

the new state a^{i+1} is given by a_prop; otherwise a^{i+1} = a^i. Under weak regularity conditions and after a suitable 'burn-in', the generated Markov chain provides a dependent sample from the posterior distribution of a. The question is now how to choose a suitable proposal density q. A simple and often used proposal density is a multivariate normal density centered at the current value a^i of the chain and with covariance matrix hI, where h is a user-specified proposal variance and I is the identity matrix, i.e. q(a_prop|a^i) is the density of N(a^i, hI). The resulting Metropolis-Hastings algorithm is known as a random-walk Metropolis algorithm. In high-dimensional problems, the random-walk Metropolis algorithm may converge very slowly and produce highly auto-correlated samples.
A simple step forward is to use gradient information in the proposal density. The proposal of a Langevin-Hastings algorithm [1,11] is given by

a_prop = a^i + (h/2) (d/da) log p(a^i|y) + ε,  ε ∼ N(0, hI),  (3)

where (d/da) log p(a|y) is the gradient of the log posterior density, i.e. the vector of derivatives (d log p(a|y)/da_1, ..., d log p(a|y)/da_M).
Intuitively, the use of gradient information helps to direct the algorithm towards regions of high posterior density. In applications in spatial statistics [2] the Langevin-Hastings algorithm has proven superior to the random walk Metropolis algorithm. In the context of quantitative genetics, Langevin-Hastings has been successfully applied to implement Bayesian inference in [3,9,12,17].
When choosing the proposal variance h, rules of thumb suggest that one should aim at acceptance rates of about 25% for random walk and 60% for Langevin-Hastings updates. Single-site schemes where the components in a are updated in turn may lead to poorly mixing Markov chains due to high correlation between the components of a.
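As a concrete illustration of the two proposal types, the following minimal sketch (our own code, not from the paper) implements one random-walk Metropolis step and one Langevin-Hastings step for a generic log target, and runs the latter on a toy 10-dimensional standard normal target; the function names and the toy target are our own choices.

```python
import numpy as np

def rw_step(a, h, log_p, rng):
    """One random-walk Metropolis update: proposal N(a, h I)."""
    prop = a + np.sqrt(h) * rng.standard_normal(a.shape)
    log_alpha = log_p(prop) - log_p(a)      # symmetric proposal: q cancels
    if np.log(rng.uniform()) < log_alpha:
        return prop, True
    return a, False

def mala_step(a, h, log_p, grad_log_p, rng):
    """One Langevin-Hastings update: proposal N(a + (h/2) grad, h I)."""
    def log_q(to, frm):                     # proposal log density (up to a constant)
        mean = frm + 0.5 * h * grad_log_p(frm)
        return -np.sum((to - mean) ** 2) / (2 * h)
    prop = a + 0.5 * h * grad_log_p(a) + np.sqrt(h) * rng.standard_normal(a.shape)
    log_alpha = log_p(prop) + log_q(a, prop) - log_p(a) - log_q(prop, a)
    if np.log(rng.uniform()) < log_alpha:
        return prop, True
    return a, False

# Toy target: standard normal in 10 dimensions.
rng = np.random.default_rng(0)
log_p = lambda a: -0.5 * np.sum(a ** 2)
grad = lambda a: -a
a = np.zeros(10)
acc = 0
for _ in range(2000):
    a, ok = mala_step(a, 0.5, log_p, grad, rng)
    acc += ok
print(acc / 2000)   # empirical acceptance rate
```

In practice h would be tuned towards the acceptance rates mentioned above (about 25% for the random walk, 60% for Langevin-Hastings).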

Reparameterization
Simulation studies in [7] show that Langevin-Hastings updates may not work well if the components of a have very different posterior variances. In applications in quantitative genetics, the individuals may contribute different numbers of observations and may have different numbers of relatives with records. Hence posterior variances may be very different. Moreover, the correlation structure of the Langevin-Hastings proposal described in the previous section typically differs markedly from the posterior correlation structure, where the components are not independent. It may therefore be useful to transform a into a quantity whose components are less correlated a posteriori. Using the factorisation A = TDT^T [8], one may let a = σ_a γB^T where B = TD^{1/2} and γ is a priori standard normal N(0, I) (note that we regard vectors as row vectors). The posterior correlation matrix of γ given y is then closer to the correlation matrix I of the Langevin-Hastings proposal. Note that we compute a = σ_a γB^T by solving a(T^{-1})^T = σ_a γD^{1/2} with respect to a. This computation is fast due to the sparseness of T^{-1}.
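The transformation and the triangular solve can be sketched as follows; the relationship matrix A, σ_a and all numbers are made-up stand-ins (in a real application T^{-1} is sparse and is read off the pedigree directly, so the dense factorization below is for illustration only).

```python
import numpy as np

# Sketch of the reparameterization a = sigma_a * gamma B^T with B = T D^{1/2},
# where A = T D T^T and T is unit lower triangular. In practice a is found by
# solving T^{-1} a^T = sigma_a D^{1/2} gamma^T, exploiting the sparse T^{-1}.

rng = np.random.default_rng(1)
M = 5
L = np.tril(rng.standard_normal((M, M))) + M * np.eye(M)
A = L @ L.T                               # stand-in SPD relationship matrix

C = np.linalg.cholesky(A)                 # A = C C^T
d = np.diag(C) ** 2                       # D = diag(d)
T = C / np.diag(C)                        # unit lower triangular, A = T D T^T
B = T * np.sqrt(d)                        # B = T D^{1/2}

sigma_a = 2.0
gamma = rng.standard_normal(M)            # a priori N(0, I)

a_direct = sigma_a * gamma @ B.T          # direct transformation
Tinv = np.linalg.inv(T)                   # sparse and cheap in real pedigrees
a_solve = np.linalg.solve(Tinv, sigma_a * np.sqrt(d) * gamma)
print(np.allclose(a_direct, a_solve))     # True
```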
The posterior of γ given y is of the form

p(γ|y) ∝ f(y|σ_a γB^T) p_Γ(γ),  (4)

where f is the sampling density in (1) and p_Γ(γ) denotes the multivariate standard normal density of γ. Given a current value γ^i, the Langevin-Hastings proposal is, in analogy with (3), of the form

γ_prop = γ^i + (h/2) (d/dγ) log p(γ^i|y) + ε^i = (1 − h/2)γ^i + (h/2) σ_a (d/da) log f(y|a^i) B + ε^i,  (5)

where −(h/2)γ^i = (h/2)(d/dγ) log p_Γ(γ^i), a^i = σ_a γ^i B^T, ε^i is N(0, hI) distributed, and the chain rule for differentiation is used to obtain the second equation.
Letting q_Γ denote the corresponding proposal density, the Metropolis-Hastings ratio becomes

[p_Γ(γ_prop) f(y|a_prop) q_Γ(γ^i|γ_prop)] / [p_Γ(γ^i) f(y|a^i) q_Γ(γ_prop|γ^i)].  (6)

A posterior sample a^1, a^2, ... is straightforwardly obtained by back-transforming the sample γ^1, γ^2, ... drawn from (4). If one prefers to work with the original posterior (1), an equivalent approach to obtain a sample a^1, a^2, ... is to use a proposal a_prop obtained by transforming the proposal (5). More specifically, given the current value a^i, first compute the corresponding γ^i, then generate γ_prop according to (5), and finally set a_prop = σ_a γ_prop B^T. The covariance matrix of a_prop then becomes hσ²_a BB^T = hσ²_a A, i.e. h times the prior covariance matrix. Moreover, the Metropolis-Hastings ratio in (2) coincides with the ratio (6), since both the posterior density and the proposal density for a are obtained by multiplying the corresponding densities for γ with the (constant) Jacobian for the transformation from γ to a, which therefore cancels. A more general perspective on the use of reparameterization is given in Appendix A.

Normal approximation of the posterior
Suppose for a moment that q(a prop |a i ) is equal to the target density p(a prop |y). The Metropolis-Hastings algorithm then produces independent draws from the posterior. This indicates that an efficient Metropolis-Hastings algorithm might be obtained by constructing a proposal density which is a good approximation of the posterior density.
Consider the second-order Taylor expansion

log p(a|y) ≈ log p(â|y) + (a − â) (d/da) log p(â|y)^T − (1/2)(a − â) H(â) (a − â)^T,  (7)

around a value â, where H(â) = −(d²/da²) log p(â|y) denotes the negative Hessian of the log posterior density at â. Completing the square, using the identity xb^T − (1/2)xCx^T = −(1/2)(x − m)C(x − m)^T + (1/2)mCm^T with m = bC^{-1} (for appropriate row vectors x, m, b and a positive definite matrix C), the exponential of the right hand side of (7) is proportional to the density of a multivariate normal distribution with mean μ(â) = â + (d/da) log p(â|y)H(â)^{-1} and precision matrix H(â).
Several options are available for choosing â. One may e.g. let â be given by the current value a^i in the Markov chain or by the result of one Newton-Raphson step starting from a^i. The proposal density is then the normal density

q(a_prop|a^i) = N(a_prop; μ(â), H(â)^{-1}).  (8)

When â depends on a^i then so does the value â_prop, say, used when evaluating the reverse proposal density q(a^i|a_prop). Thus in the Metropolis-Hastings ratio (2) we need to evaluate the ratio of determinants |H(â_prop)|^{1/2}/|H(â)|^{1/2} using e.g. sparse matrix methods (see Sect. 2.4). Alternatively one may iterate Newton-Raphson to convergence so that â becomes the mode of the posterior, which does not depend on a^i. Then μ(â) = â and the ratio of determinants conveniently becomes one. Sampling from a normal approximation is discussed in Section 2.4.
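The construction of the proposal mean μ(â) and precision H(â) can be sketched as below; `normal_approx` is our own hypothetical helper, and we use a Gaussian target, for which the approximation is exact, as a sanity check.

```python
import numpy as np

def normal_approx(a_hat, grad, neg_hess):
    """Mean and precision of the proposal N(mu(a_hat), H(a_hat)^{-1}),
    with mu(a_hat) = a_hat + grad(a_hat) H(a_hat)^{-1}."""
    H = neg_hess(a_hat)
    mu = a_hat + np.linalg.solve(H, grad(a_hat))
    return mu, H

# Gaussian target N(m, S): the second-order expansion is exact, so the
# proposal mean must equal the true mean m regardless of a_hat.
m = np.array([1.0, -2.0])
S = np.array([[2.0, 0.3], [0.3, 1.0]])
P = np.linalg.inv(S)                     # precision of the target
grad = lambda a: -(a - m) @ P            # (d/da) log p(a)
neg_hess = lambda a: P                   # -(d^2/da^2) log p(a)

mu, H = normal_approx(np.zeros(2), grad, neg_hess)
print(mu)   # equals m
```

For a non-Gaussian target one would re-evaluate grad and neg_hess at the chosen expansion point â (current state, one Newton step, or the mode).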

Implementation of the normal approximation
Typically, the genetic random effects enter the sampling density f(y|a) via a linear predictor η = aZ (i.e. f(y|a) = f̃(y|η)) where Z is an incidence matrix relating the observed traits to the random effects. The precision matrix H(â) in the normal approximation proposal density (8) then takes the form

H(â) = ZΣ^{-1}Z^T + (1/σ²_a)A^{-1},

where Σ is the diagonal matrix whose ith diagonal entry is the inverse of −(d²/dη_i²) log f̃(y|η̂). The normal approximation proposal distribution is thus formally equivalent to the conditional distribution of the random effects in a 'virtual' linear normal model ỹ = aZ + ε̃, where ε̃ ∼ N(0, Σ); ε̃ and ỹ represent 'virtual' noise and data. Hence we may sample from the normal approximation by applying the García-Cortés and Sorensen [4] algorithm based on the decomposition

a_prop = E[a_prop|ỹ] + (a_prop − E[a_prop|ỹ]),

where the 'prediction error' e = a_prop − E[a_prop|ỹ] and the 'prediction' μ(â) = E[a_prop|ỹ] are independent. Hence if e_sim is a simulation of e then μ(â) + e_sim is a conditional simulation of a_prop given ỹ. The simulated prediction error e_sim may be generated as follows: 1) simulate (a_sim, ỹ_sim) from the joint distribution of (a_prop, ỹ) (using the Henderson factorization A = TDT^T); 2) compute μ(â_sim) = E[a_prop|ỹ_sim] by solving the standard mixed model equations; 3) set e_sim = a_sim − μ(â_sim). Alternatively, one may exploit the sparseness of H(â), which enables fast computation of the Cholesky factorization of H(â) (see [13] and [14]). For the latter approach the C library GMRFLib (www.math.ntnu.no/∼hrue/GMRFLib/ and Appendix B in Rue and Knorr-Held [14]) provides an extensive suite of procedures for computation of and sampling from normal approximations. Using this library, sophisticated MCMC algorithms can be constructed with little programming effort. Note that with sparse Cholesky factorizations, it is straightforward to evaluate the ratios of determinants |H(â_prop)|^{1/2}/|H(â)|^{1/2} possibly appearing in the Metropolis-Hastings ratio (2). GMRFLib is used in [18] to implement Bayesian inference for a multiple trait model.
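A dense-matrix sketch of the Cholesky-based sampling step follows; the sparse-matrix machinery of GMRFLib is replaced by plain numpy for illustration, and the toy precision matrix and mean are made up.

```python
import numpy as np

def sample_precision_normal(mu, H, rng):
    """Draw from N(mu, H^{-1}) given the precision matrix H,
    via its Cholesky factor H = L L^T."""
    L = np.linalg.cholesky(H)
    z = rng.standard_normal(len(mu))
    # Solve L^T x = z so that Cov[x] = (L L^T)^{-1} = H^{-1}.
    x = np.linalg.solve(L.T, z)
    return mu + x

rng = np.random.default_rng(2)
H = np.array([[4.0, 1.0], [1.0, 3.0]])   # toy precision matrix
mu = np.array([0.5, -0.5])               # toy proposal mean
draws = np.array([sample_precision_normal(mu, H, rng) for _ in range(20000)])
print(draws.mean(axis=0))                # close to mu
print(np.cov(draws.T))                   # close to inv(H)
```

The same Cholesky factor also yields the log-determinant, log |H|^{1/2} = Σ_i log L_ii, which is what makes the ratio of determinants in (2) cheap to evaluate with sparse factorizations.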

Updating σ 2 a
Introducing updates of the parameters σ²_a and β is in principle straightforward, but regarding σ²_a there are some issues depending on whether we consider the original random effects a or the reparameterized random effects γ, cf. Section 2.2.
Let p(σ²_a) denote the prior for σ²_a and suppose that given a current value σ²_cur, we generate a proposal σ²_prop from a proposal density q_Σ(σ²_prop|σ²_cur). Without reparameterization, σ²_a enters the posterior through the prior p(a|σ²_a) of the random effects, and the Metropolis-Hastings ratio is

[p(σ²_prop) p(a^i|σ²_prop) q_Σ(σ²_cur|σ²_prop)] / [p(σ²_cur) p(a^i|σ²_cur) q_Σ(σ²_prop|σ²_cur)].  (9)

With reparameterized random effects, the prior of γ does not depend on σ²_a, which instead enters the likelihood via a = σ_a γB^T, and the ratio becomes

[p(σ²_prop) f(y|σ_prop γ^i B^T) q_Σ(σ²_cur|σ²_prop)] / [p(σ²_cur) f(y|σ_cur γ^i B^T) q_Σ(σ²_prop|σ²_cur)],  (10)

and (9) and (10) do not coincide. In particular, note that updating σ²_a with reparameterized random effects effectively updates a too, with a^{i+1} = σ^{i+1}_a γ^i B^T. This suggests a joint update of (σ²_a, a) in the case of no reparameterization. If (σ²_cur, a^i) denotes the current joint value, the proposal is (σ²_prop, a_prop) with a_prop = (σ_prop/σ_cur)a^i. Note that the M + 1 dimensional proposal is generated using a one-dimensional proposal density. Hence we need a slightly more general version of the Metropolis-Hastings algorithm, reviewed in [19] and in [16]. The Metropolis-Hastings ratio becomes

[p(σ²_prop) f(y|a_prop) q_Σ(σ²_cur|σ²_prop)] / [p(σ²_cur) f(y|a^i) q_Σ(σ²_prop|σ²_cur)],  (11)

which coincides with (10) since a_prop = σ_prop γ^i B^T. Suppose now that the prior p(σ²_a) is a scaled inverse chi-square (χ^{-2}) distribution. Then the full conditional

p(σ²_a|y, a) ∝ p(σ²_a) p(a|σ²_a)  (12)

becomes χ^{-2} too. Hence a Gibbs update might be used, in which case we sample the new value directly from the full conditional. Note, however, that when a is of high dimension M, the full conditional (12) becomes highly concentrated around the maximum likelihood estimate aA^{-1}a^T/M = arg max_{σ²_a} p(a|σ²_a) given a. Hence, a and σ²_a are strongly correlated a posteriori, and it is our experience that it is advantageous to use instead the approach with reparameterized random effects or the joint update of (a, σ²_a). In our data examples we could apply a 3-5 times larger proposal standard deviation in q_Σ when used in combination with (10), while maintaining the same acceptance rates as for the case without reparameterization (9). More discussion on joint updates of random effects and variance parameters can be found in [10].
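The joint update of (σ²_a, a) amounts to a rescaling of a so that the underlying γ is kept fixed. A minimal sketch of the proposal step (our own illustration, with a made-up random-walk proposal on log σ²_a):

```python
import numpy as np

def joint_proposal(sigma2_cur, a_cur, step, rng):
    """Propose (sigma2_prop, a_prop) jointly from a one-dimensional
    random walk on log sigma2_a; a is rescaled so that the implicit
    gamma = a / sigma_a (up to B^T) is unchanged."""
    log_prop = np.log(sigma2_cur) + step * rng.standard_normal()
    sigma2_prop = np.exp(log_prop)
    a_prop = np.sqrt(sigma2_prop / sigma2_cur) * a_cur
    return sigma2_prop, a_prop

rng = np.random.default_rng(3)
a = rng.standard_normal(6)                 # toy current random effects
s2, a_new = joint_proposal(1.5, a, 0.3, rng)
print(s2 > 0)                              # True
```

The accept/reject decision would then use ratio (11); with the log-scale random walk an additional Jacobian factor σ²_prop/σ²_cur enters the proposal ratio.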

Comparison of samplers in terms of Monte Carlo error and computational cost
Given an MCMC sample (a^1, θ^1), (a^2, θ^2), ..., (a^n, θ^n) from the posterior distribution of (a, θ), θ = (β, σ²_a), and some function h(a, θ), the posterior expectation E[h(a, θ)|y] = ∫ h(a, θ)p(a, θ|y) da dθ is estimated by the average h̄_n = Σ_{i=1}^n h(a^i, θ^i)/n. The Monte Carlo variance of h̄_n is given by V_asymp/n, where

V_asymp = Var[h(a, θ)|y] (1 + 2 Σ_{m=1}^∞ ρ_m)

is the so-called asymptotic variance, given in terms of the posterior variance Var[h(a, θ)|y] and the Markov chain lag-m autocorrelations ρ_m = Corr[h(a^i, θ^i), h(a^{i+m}, θ^{i+m})]. To attain a given Monte Carlo variance of size V, we need a sample size of n_V = V_asymp/V, and if the cost of generating one sample is c then the total cost becomes c n_V = cτ Var[h(a, θ)|y]/V, where τ = 1 + 2 Σ_{m=1}^∞ ρ_m is the integrated autocorrelation. Thus cτ is an appropriate performance measure for an MCMC algorithm. The so-called effective sample size is given by n/τ. The integrated autocorrelation can be estimated as suggested e.g. in [5].
Note that the ratio τ 2 /τ 1 of integrated autocorrelations for two MCMC samplers is equal to the ratio n 2 /n 1 of numbers of iterations n 2 and n 1 required to obtain the same MCMC variance V with the two samplers.
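The quantities τ and n/τ can be estimated from a chain with a few lines of code. The sketch below uses a simple truncation rule, stopping the sum at the first non-positive estimated autocorrelation (a crude variant of the estimators referred to in [5]), and checks it on an AR(1) chain whose true integrated autocorrelation is (1 + φ)/(1 − φ).

```python
import numpy as np

def integrated_autocorr(x, max_lag=None):
    """Estimate tau = 1 + 2*sum_m rho_m, truncating the sum at the
    first non-positive estimated autocorrelation."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    x = x - x.mean()
    var = np.dot(x, x) / n
    if max_lag is None:
        max_lag = n // 3
    tau = 1.0
    for m in range(1, max_lag):
        rho = np.dot(x[:-m], x[m:]) / (n * var)
        if rho <= 0:
            break
        tau += 2.0 * rho
    return tau

# Check on an AR(1) chain x_t = phi*x_{t-1} + e_t, for which the true
# integrated autocorrelation is (1 + phi)/(1 - phi) = 9 when phi = 0.8.
rng = np.random.default_rng(4)
phi, n = 0.8, 200000
e = rng.standard_normal(n)
x = np.empty(n)
x[0] = e[0]
for t in range(1, n):
    x[t] = phi * x[t - 1] + e[t]
tau_hat = integrated_autocorr(x)
print(tau_hat, n / tau_hat)   # tau estimate and effective sample size
```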

A MODEL WITH GENETICALLY STRUCTURED VARIANCE HETEROGENEITY
We now discuss the methods of the previous sections within the context of the San Cristobal-Gaudy et al. [15] model for genetically structured variance heterogeneity. For ease of presentation, systematic and environmental effects are omitted. Let a* denote random effects affecting the residual variance of y. Given a and a*, the components y_i of y are independent N(μ + a z_i^T, exp(μ* + a* z_i^T)) where z_i is an incidence vector with components equal to zero or one. Hence the same type of linear model is applied for the mean and the log residual variance of y_i, where the mean depends on a and the log residual variance on a*. The joint distribution of a and a* is zero-mean normal with covariance matrix G ⊗ A, where

G = ( σ²_a        ρσ_a σ_a*
      ρσ_a σ_a*   σ²_a*  ).

The correlation between the two types of random effects is given by ρ, and σ²_a and σ²_a* are the variances for the genetic random effects.
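For concreteness, the model can be simulated as follows; the relationship matrix A, the incidence structure and all parameter values below are made-up toy choices of our own.

```python
import numpy as np

# Toy simulation of the genetically structured variance heterogeneity model:
# y_i ~ N(mu + a[z_i], exp(mu_star + a_star[z_i])) with (a, a*) ~ N(0, G kron A).
rng = np.random.default_rng(5)
M, N = 4, 8                                   # animals, records
mu, mu_star = 10.0, -1.0
sigma2_a, sigma2_as, rho = 1.0, 0.25, 0.5
G = np.array([[sigma2_a, rho * np.sqrt(sigma2_a * sigma2_as)],
              [rho * np.sqrt(sigma2_a * sigma2_as), sigma2_as]])
A = 0.5 * np.eye(M) + 0.5                     # toy SPD relationship matrix

# Draw (a, a*) jointly from N(0, G kron A).
L = np.linalg.cholesky(np.kron(G, A))
aa = L @ rng.standard_normal(2 * M)
a, a_star = aa[:M], aa[M:]

idx = rng.integers(0, M, size=N)              # animal index for each record
eta = mu + a[idx]                             # mean
eta_star = mu_star + a_star[idx]              # log residual variance
y = rng.normal(eta, np.sqrt(np.exp(eta_star)))
print(y)
```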

Reparameterizations
Various reparameterization strategies are possible for this model. Let U_G denote the upper triangular Cholesky factor of G (i.e. G = U_G^T U_G) and let (a, a*) = (γ, γ*)(U_G ⊗ B^T), where (γ, γ*) is a standard normal vector and B is defined in Section 2.2. Langevin-Hastings updates for the vector of a priori uncorrelated random effects (γ, γ*) are used in [9,12,17]. An alternative reparameterization is based on (a, a*) = (a, αa + u), where α = ρσ_a*/σ_a and u = a* − E[a*|a]. Then u is N(0, σ²_a*(1 − ρ²)A) and a priori independent of a. Hence one might update a and u in turn, hoping that these quantities are only weakly correlated a posteriori. Note that it is not guaranteed that the Hessian matrix with respect to u is positive definite.
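The (a, u) reparameterization can be checked empirically; in this sketch A and the parameter values are toy choices of our own, and we verify that u is uncorrelated with a and has variance σ²_a*(1 − ρ²)A.

```python
import numpy as np

rng = np.random.default_rng(6)
M, n = 3, 200000
sigma2_a, sigma2_as, rho = 1.0, 0.25, 0.75
sigma_a, sigma_as = np.sqrt(sigma2_a), np.sqrt(sigma2_as)
G = np.array([[sigma2_a, rho * sigma_a * sigma_as],
              [rho * sigma_a * sigma_as, sigma2_as]])
A = 0.6 * np.eye(M) + 0.4                     # toy SPD relationship matrix

# Draw n samples of (a, a*) ~ N(0, G kron A); rows of `draws` are samples.
L = np.linalg.cholesky(np.kron(G, A))
draws = (L @ rng.standard_normal((2 * M, n))).T
a, a_star = draws[:, :M], draws[:, M:]

alpha = rho * sigma_as / sigma_a              # regression of a* on a
u = a_star - alpha * a                        # u = a* - E[a*|a]

# Empirical cross-covariance of u and a (should be near zero) and Var(u).
cross = (u - u.mean(0)).T @ (a - a.mean(0)) / (n - 1)
var_u = (u - u.mean(0)).T @ (u - u.mean(0)) / (n - 1)
print(np.max(np.abs(cross)))                  # close to zero
```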

Normal approximations
Let v_i = exp(μ* + a* z_i^T) denote the conditional variance given a* of the ith observation, let r_i = y_i − μ − a z_i^T denote the corresponding residual, and let R be the diagonal matrix with entries 1/v_i. The precision matrix in the normal approximation of the posterior for (a, a*) is then

( WZ^T + g^{11}A^{-1}                     Z diag(r_i/v_i) Z^T + g^{12}A^{-1}
  Z diag(r_i/v_i) Z^T + g^{21}A^{-1}      (1/2) Z diag(r_i²/v_i) Z^T + g^{22}A^{-1} ),  (13)

where W = ZR and g^{ij} are the entries of G^{-1}. Due to the factor 1/2 in the lower right block, this matrix cannot be recognized as the precision matrix of the conditional distribution in a 'virtual' linear normal model, and there is in fact no guarantee that it is positive definite. This is further illustrated in the toy example in Section 3.3, which shows that the joint posterior of (a, a*) can be far from multivariate normal. In Section 4 we use the normal approximation in turn for a and a* separately. In this case the precision matrices are given by the diagonal blocks in (13).

Toy example
We illustrate the various reparameterization strategies in the very simple case where y = (−2.62, −2.42) consists of two observations, and a and a* are one-dimensional. Figure 1 shows the posterior densities of (a, a*), (γ, γ*) and (a, u) in the case where μ = 0, μ* = −1, σ²_a = 1, σ²_a* = 0.25, and ρ = 0.75. The plots demonstrate for the given parameter settings that (a, a*) are highly correlated a posteriori and that the joint posterior distribution of (a, a*) is not well approximated by a normal distribution. The transformed random effects γ and γ* seem approximately uncorrelated but have different posterior variances. As expected, a and u are less correlated a posteriori than a and a*, but the joint distribution is far from normal.

The plots in Figure 2 illustrate the random walk and Langevin-Hastings proposals. [Figure 2 caption: In both plots, the bold dot represents a current state a^i and 75% of the proposals fall within the circle. In the right plot the proposal mean is given by the current value plus (h/2) d/da log p(a^i|y), indicated by the arrow.] The Langevin-Hastings proposal mean lies in a region of higher posterior density than the current value. This means that proposals in a rather large region around the proposal mean have a good chance of being accepted. Figure 3 shows that the normal approximation is poor for the joint posterior of (a, a*) while it works well for the conditional posteriors of a|y, a* and a*|y, a separately.

EXAMPLES
In this section we compare the performance of Langevin-Hastings and normal approximation MCMC algorithms applied to three data sets which have been previously analyzed in [9,12,17]. The first data set originates from a selection experiment for pig litter size and contains 10 060 litter size records from 4149 sows. The pedigree file includes 6437 individuals. The second data set contains 2996 litter sizes from a divergent selection experiment for rabbit uterine capacity with 1161 individuals in the pedigree. The third and largest data set consists of weights for each of 22 033 adult snails and the pedigree file includes 22 454 individuals. For all three datasets we consider the model from Section 3 extended with systematic and environmental effects and impose prior distributions on the unknown location and variance parameters. The posterior means of the genetic covariance parameters are given in the last three columns of Table II. The genetic correlation parameter ρ is negative for the pig and rabbit litter size data while it is positive for the snail weights. More details on the data, priors, and posterior results can be found in [9,12,17].
The first algorithm LH is the one employed in [9,12,17] where reparameterized random effects (γ, γ * ) are considered and the components of the posterior distribution are updated in turn using Langevin-Hastings updates for (γ, γ * ) and either Langevin-Hastings or random walk updates for the other components (for the positive variance parameters we apply the random walk updates to the log variances). The second algorithm NX is as LH except that the Langevin-Hastings update for (γ, γ * ) is replaced with normal approximation updates for a and a * separately. Following Section 2.5, for NX we use a joint update of the genetic random effects and the variance parameters in order to obtain an update that is equivalent to the update of the variance parameters for the LH algorithm. The normal approximation updates are implemented using GMRFLib. The proposal variances for the Langevin-Hastings and random walk updates are chosen according to the rules of thumb mentioned in Section 2.1. Table I shows ratios of integrated autocorrelations τ obtained using respectively LH and NX. The integrated autocorrelations are evaluated for quadratic forms involving a and a * and the first two components of a and a * . Considering e.g. the random effect a 1 for the pigs data, the integrated autocorrelation is 873 times larger for LH than for NX. This means that 873 times more iterations are needed with LH to obtain the same precision as for the NX algorithm (see Sect. 2.6 for details concerning integrated autocorrelation and MCMC precision). Columns 2-4 in Table II show ratios of integrated autocorrelations for the genetic covariance parameters using respectively LH and NX. Regarding integrated autocorrelation, NX clearly outperforms LH.
To evaluate the performance of the algorithms, the computing time c must be taken into account. Column 5 in Table II shows that the computing time for one MCMC iteration is between 20 and 100 times higher for the NX algorithm than for the LH algorithm. The ratios of the products cτ for LH relative to NX are between 3 and 6 for the rabbits, 1 and 9 for the pigs, and 5 and 13 for the snails, depending on the parameters considered. Hence, NX is superior for all datasets.
For the data set of snail weights with high-dimensional a*, the acceptance rate for the normal approximation updates of a* is quite small (around 5%), while larger acceptance rates of 50% and 30% are obtained for the smaller rabbit and pig data sets. If the MCMC algorithm is initialized in values far from the posterior mode, the acceptance probability for the normal approximation update of a* may be very small, in which case a large burn-in is needed.
The long computing times for NX and the large integrated autocorrelations for LH suggest considering an algorithm where one alternates between normal approximation and Langevin-Hastings updates. More specifically, we obtain an algorithm NXLH by replacing the normal approximation update with a Langevin-Hastings update in every second iteration. The Langevin-Hastings update is obtained via a Langevin-Hastings update for the transformed genetic effects as described at the end of Section 2.2. We considered NXLH for the rabbits data; compared with NX, the computing cost is roughly halved while the integrated autocorrelations increase only slightly. The ratios of cτ for LH relative to NXLH are between 4 and 9, so that NXLH in fact works better than both NX and LH. This algorithm may also be advantageous in cases where NX requires a long burn-in.
We also tried out normal approximations for the (a, u) reparameterization but this did not offer any improvement.

DISCUSSION
Normal approximation proposal distributions are intuitively appealing and provide smaller integrated autocorrelations than Langevin-Hastings updates in the examples in Section 4. A distinct advantage of the normal approximation updates is moreover that they do not require user tuning of proposal variances. The computing time for the normal approximation updates is high but sensitive to the choice of implementation. The normal approximation updates used in Section 4 are implemented using general routines in GMRFLib based on numerical methods for sparse matrices. This approach very much reduces the programming effort but one loses the computational advantages offered by the specific structure of the genetic correlation matrix. For the rabbit data, [9] reduces the computing cost of the NX algorithm by a factor of three using the approach described in Section 2.4 where samples from the normal approximation are obtained using the García-Cortés and Sorensen [4] algorithm and the Henderson factorization.
Finally, using a mixture of Langevin-Hastings and normal approximation updates may work even better than pure Langevin-Hastings or normal approximation algorithms since the use of Langevin-Hastings saves computing time while the normal approximation maintains small integrated autocorrelations. This option is also helpful in cases where a pure normal approximation algorithm requires a long burn-in.