A bivariate quantitative genetic model for a threshold trait and a survival trait

Many of the functional traits considered in animal breeding can be analyzed as threshold traits or survival traits with examples including disease traits, conformation scores, calving difficulty and longevity. In this paper we derive and implement a bivariate quantitative genetic model for a threshold character and a survival trait that are genetically and environmentally correlated. For the survival trait, we considered the Weibull log-normal animal frailty model. A Bayesian approach using Gibbs sampling was adopted in which model parameters were augmented with unobserved liabilities associated with the threshold trait. The fully conditional posterior distributions associated with parameters of the threshold trait reduced to well known distributions. For the survival trait the two baseline Weibull parameters were updated jointly by a Metropolis-Hastings step. The remaining model parameters with non-normalized fully conditional distributions were updated univariately using adaptive rejection sampling. The Gibbs sampler was tested in a simulation study and illustrated in a joint analysis of calving difficulty and longevity of dairy cattle. The simulation study showed that the estimated marginal posterior distributions covered well and placed high density to the true values used in the simulation of data. The data analysis of calving difficulty and longevity showed that genetic variation exists for both traits. The additive genetic correlation was moderately favorable with marginal posterior mean equal to 0.37 and 95% central posterior credibility interval ranging between 0.11 and 0.61. Therefore, this study suggests that selection for improving one of the two traits will be beneficial for the other trait as well.


INTRODUCTION
Because of their economic and ethical importance, functional traits have been given increasing priority in breeding programs for livestock during the last decade. Functional traits can generally be regarded as those traits, which increase net income by reducing the cost of input rather than increasing the output of saleable products. Numerous functional traits are considered in dairy cattle breeding including longevity, conformation scores, calving difficulty, and resistance to diseases (e.g. [10]). In pig breeding focus has mainly been on leg characteristics [22,30] and resistance to diseases [19].
As with many other traits, it is assumed that the genotypic value affecting a functional trait results from the sum of a very large number of independent contributions from independently segregating loci, each with a small effect. The Central Limit Theorem leads to the result that the additive genetic value is approximately normally distributed [4,14]. However, phenotypically these traits often have non-normal distributions, and many of the functional traits considered in animal breeding can be analyzed as threshold traits or survival traits with examples including disease resistance, conformation scores, calving difficulty and longevity.
Analysis of threshold characters often relies on the threshold liability concept first proposed by Wright [41]. Application of this model in animal breeding dates back to Robertson and Lerner [34]. During the last decade, survival analysis based on the proportional hazards model has become the method of choice for inferring longevity [11]. Survival analysis was first proposed in animal breeding by Smith and Quaas [35] for studying longevity of dairy cows. Since then survival models have also been used to infer environmental and genetic aspects of resistance to diseases in beef bulls [23], in fish [20] and in pigs [19].
Knowledge of genetic parameters such as heritabilities and genetic correlations are required to predict response to selection, to select among various breeding programs based on e.g. their economic revenue, and to estimate breeding values of selection candidates.
Multivariate quantitative genetic models for inferring an arbitrary number of threshold characters, survival traits and linear Gaussian traits only exist for censored linear Gaussian survival traits [25]. A recent methodology contribution includes a bivariate quantitative genetic model for a linear Gaussian trait and a Weibull survival trait [8].
The objective of this study was to extend the methodology of Damgaard and Korsgaard [8] to a bivariate quantitative genetic model of a threshold character and a survival trait that are genetically and environmentally correlated. Firstly, the Bayesian model is presented and the fully conditional distributions needed for implementing the Gibbs sampler are described. Secondly, the Gibbs sampler is tested by simulation and a joint analysis of longevity and calving difficulty of dairy cattle is presented for illustration of the model.

MATERIALS AND METHODS
Let Y 1i be a random variable of the ordered categorical trait of animal i for i = 1, . . . , n, where n is the total number of animals with records. Y 1i can take values in one out of K for K ≥ 2 mutually exclusive ordered categories. The outcome of Y 1i is equal to k if τ k−1 < L 1i ≤ τ k for k = 1, . . . , K, where L 1i is a continuous unobserved random variable often denoted the liability and τ = (τ 0 , τ 1 , . . . , τ K ) is a vector with K + 1 thresholds defined on the liability scale with τ 0 = −∞ and τ K = ∞. For the survival trait let T i and C i be random variables representing a survival time and a censoring time. In what follows, we assume that all animals have records of both traits such that data on animal i is given by (y 1i , y 2i , δ 2i ), where y 1i is an observed value of Y 1i , y 2i is an observed value of Y 2i = min(T i , C i ), and δ 2i is the outcome of a censoring indicator variable equal to 1 if T i ≤ C i and 0 otherwise. Later we consider the case where data on one of the two traits are missing at random.
In this paper we augment the joint posterior distribution with the vector of unobserved liabilities. By doing so the model specification is very similar to the one already given for a bivariate model of a survival trait and a Gaussian trait [8]. In this paper we define parameters and give the prior distribution and the augmented posterior distribution. Regarding the fully conditional posterior distributions we will only explicitly give them for liabilities and thresholds. For the remaining parameters they are identical to the ones already given for a bivariate model of a Gaussian trait and survival trait [8] if the sampled liabilities are considered as data from a Gaussian process.
The sampling distribution for the bivariate model will be represented by the conditional hazard function of T i and the joint distribution of L 1 and e 2 λ i (t|θ, e 2 ) = ρt (ρ−1) exp x 2i (t)β 2 + z 2i a 2 + e 2i where e 2 with elements (e 2i ) i=1,...,n is a vector of residual effects of the survival traits on the log-frailty scale, which accounts for variation in log-frailty not otherwise accounted for by the specification of the model with covariates and random effects. Here λ i (t|θ, e 2 ) is the hazard function of T i conditional on model parameters (θ, e 2 ), where θ = (ρ, β 1 , β 2 , τ, a 1 , a 2 , G, R e ). The Weibull baseline hazard function is generally given as λ ρ ρt (ρ−1) with parameters ρ and λ. Here the term λ ρ is included on the log-frailty scale as ρ log(λ) and is the first element of the vector β 2 . The p 1 dimensional vector β 1 and the p 2 dimensional vector β 2 represent systematic effects of the threshold trait and the survival trait. a 1 and a 2 of dimension q are the vectors of additive genetic effects, where q is the total number of animals in the pedigree, and the vectors x 1i , x 2i , z 1i , z 2i are incidence arrays relating parameter effects to observations.
is the genetic covariance matrix, and R e = R 11 R 12 R 21 R 22 is the residual covariance matrix.

Prior specification
A priori model parameters (β 1b ) b=1,...,p 1 , (β 2b ) b=1,...,p 2 , τ, ρ, (a 1 , a 2 , G) and (e 1 , e 2 , R e ) are assumed to be mutually independent. For K ≥ 3 we assume that all elements of the residual covariance matrix R e are stochastic, implying that only K − 3 thresholds can be identified [36]. We set τ 1 = 0 and τ K−1 = 1, and assume a priori that the remaining unknown thresholds are distributed as order statistics from a uniform (0, 1) distribution according to p(τ 2 , . . . , For the binary response (K = 2) it is necessary for reasons of identifiability to constrain both the threshold and the residual variance component: we set τ 1 = 0 and R e22 = 1. Improper uniform priors are assigned to β 1b b=1,...,p 1 , β 2b b=1,...,p 2 and ρ over their range of positive support. The prior distribution of additive genetic effects is by assumption of the additive genetic infinitesimal model [4] assumed to be multivariate normally distributed a 1 , a 2 |G ∼ N 0, G ⊗ A q , where A q is the additive genetic relationship matrix. Finally the covariance matrices G and R e are a priori assumed to be inverse Wishart distributed according to G ∼ IW(F g , f g ) and R e ∼ IW(F e , f e ).

Augmented posterior distribution
The augmented posterior distribution of (θ, e 2 , L 1 ) is obtained using Bayes' theorem. Here we augment the parameter vector with the vector of unobserved liabilities L 1 = (L 11 , . . . , L 1n ) and posterior distribution is given by where in the second step we used that (Y 2 , δ 2 ) and (Y 1 , L 1 ) are assumed to be conditional independent given θ and e 2 . Further, conditional on (θ, e 2 ), censoring is assumed to be independent and non-informative [3] implying ds is the conditional survival function. Under the assumption of a proportional Weibull log-normal animal frailty model, p(y 2 , δ 2 |e 2 , θ) is up to proportionality given by It follows that the distribution p(y 1 |l 1 , e 2 , θ) is degenerate because the probability that a categorical record falls in a given category conditional on liabilities and thresholds is completely specified. p(y 1 |l 1 , e 2 , θ) is written as Finally, the conditional distribution of L 1 , e 2 |θ is multivariate normally distributed according to (1).

Fully conditional distributions
The fully conditional distributions of each or groups of model parameters were obtained up to proportionality by retaining from the posterior distribution (2) the terms depending on the parameter of interest. By regarding the sampled values of the liabilities as observations from a linear Gaussian trait, the model parameters, with exception of thresholds, have fully conditional posterior distribution which are identical to those given for a bivariate model of a linear Gaussian trait and a survival trait [8]. For the liabilities assume that Y 1i = k for k ∈ {1, . . . , K} , then the fully conditional distribution of a liability L 1i for i = 1, . . . , n follows a truncated normal distribution on the interval [τ k−1 < l 1i ≤ τ k ) with mean µ l 1i and variance V l 1i before truncation given by The fully conditional distribution of a threshold τ k for k = 2, . . . , K − 2 is uniformly distributed on the interval Note that this notation accommodates for the possibility of missing observations in one or more categories [1].

Model with missing observations
Missing observations for one of the two traits are often the case in field data. If the observations are missing at random [27], then a common approach in Bayesian analysis is to augment the joint posterior distribution with the residual effects associated with missing records (e.g. [36]). The augmented residuals are treated as unknown parameters, so at each iteration of the Gibbs sampler the augmented residual effects are sampled from their fully conditional posterior distributions. This approach was described in details for the bivariate model of a linear Gaussian trait and a survival trait [8]. Again if we regard the sampled values of liabilities as observations from a linear Gaussian trait the fully conditional distributions of augmented residuals are normally distributed with the same mean and variance as those given for the bivariate model of a linear Gaussian trait and a survival trait [8].

Implementation
A Gibbs sampler for the bivariate animal model (1) with no random environmental effects was implemented in Fortran 90 for data without missing observations of the two traits. The implementation is an extension of the one described already for a bivariate model of a linear Gaussian trait and a survival trait [8] and will therefore only be described briefly in this study. Inferences of model parameters were based on a single Gibbs chain. Updating of the model parameters β 1 , (τ) i=2,...,K−2 , (a 1i ) i=1,...,n , G, R e and of the liabilities (L 1i ) i=1,...,n for which the fully conditional distributions can be recognized in closed form, was performed using standard methods. Note that the elements of the vector β 1 and the elements of the matrices G and R e were jointly updated. The parameters ((β 2i ) i=2,...,p 2 , (a 2i ) i=1,...,n , (e 2i ) i=1,...,n ) for which the fully conditional distributions could not be recognized in closed form, were updated univariately using adaptive rejection sampling (ARS) [16]. Finally, the two Weibull baseline parameters (ρ, β 21 ) were updated jointly by a Metropolis-Hastings step [18,32] using a large sample bivariate normal distribution as proposal distribution [8].

Simulation study
The proposed bivariate model was illustrated in a simulation study in which the same model was used to generate and analyze data. Thus, focus was on the estimation of parameters under conditions where all model assumptions were satisfied.
Records of both traits were generated for 6000 animals after 100 unrelated sires each having 60 offspring (balanced half-sib design) using the bivariate model (1). The number of categories for the threshold character was three with observed frequencies; 1: (32%), 2: (45%) and 3: (23%). Lifetimes higher than 1500 were right censored resulting in a data set with approximately 15% censored records. The model of the threshold character included a mean effect, a sire effect and a residual effect. The model for the survival trait included the two baseline Weibull parameters, a systematic effect with two levels, a sire effect and a residual effect [8]. We adopted improper uniform priors for the genetic sire covariance matrix (G s ) and the residual covariance matrix (R e ).
The model parameters used to simulate data and the results from the Bayesian analysis are given in Table I. Starting values of the Gibbs sampler of the sire and residual variances and of the two Weibull parameters were set equal to the true values used in the simulation of data. All of the remaining model parameters were initially set to zero. A Gibbs chain of length, 600 000 iterations, was run. The first 10 000 iterations were considered as burnin and therefore discarded from the post Gibbs analysis. The interval between saved sampled values was 100, so that the total number of iterations kept was 4900. Effective number of samples (N e ) of each parameter was calculated by the method of batching based on 30 batches (e.g. [36]). Table I. Posterior summary statistics for the simulation study: marginal posterior mode, mean, 2.5% and 97.5% percentiles, and effective sample size (N e ) of Weibull parameters (ρ, β 21 ), of time-independent systematic effects (β 22 , β 23 = 0), of sire variance components (G s11 , G s22 ) and sire correlation (ρ G s ), of residual variance components (R e11 , R e22 ) and residual correlation (ρ R e ).

Parameter
True

Data
Since 1985, Danish dairy farmers have recorded calving difficulty in one out of five ordered categories; 1: easy without assistance, 2: easy with assistance, 3: difficult without veterinary assistance, 4: difficult with veterinary assistance, and 5: caesarian delivery. Because of few observations, categories 3, 4, 5 were grouped together so that calving difficulty in first lactation was defined by three categories. Note that here we only analyzed calving difficulty in first lactation with observed frequencies; 1 : 53%, 2: 38% and (3, 4, 5): 9%. Longevity was defined as time from first calving until culling.
Data was extracted from the Danish national database for cattle [5] and consisted of records of both traits from 16 345 Danish Holstein cows originating from 33 herds. Only cows having their first calving in the period from January 1990 to June 2002 were included in the analysis. The herds were selected so that the number of first calvings in 1990 was higher than 100 and in the following years was within plus and minus 15% of the level in 1990. This strategy was chosen in order to avoid herds with substantial changes in herd size during the study period. Lifetimes of cows sold or still alive at the time of last registered milk recording date in the extracted data were right censored, corresponding to 26% censored records. Pedigree of the cows with records was traced back only to their sires, implying that sires were assumed unrelated (half-sib design). The cows were daughters after 590 sires, and the daughtergroup size ranged between 5 and 939 with an average of 28.

Model and data analysis
The data was analyzed using a sire model equivalent to the animal model (1). The threshold character was modelled with an effect of herd and an effect of year at first calving with 12 levels, which were defined by the first of January every year for 1991 to 2001. The survival trait was modelled with a timeindependent effect of herd and a time-dependent effect of stage of lactation with four levels (β S L1 , . . . , β S L4 ). The stage of lactation effect changed at each calving and at 60, 180 and 305 days after calving in all lactations. A sire effect and a residual effect were included for both traits allowing for additive genetic and environmental correlation between the two traits. We adopted improper uniform priors for the genetic sire and residual covariance matrix.
Two independent Gibbs chains of length, 600 000 iteration each, were run. Based on visual inspection of all trace plots the first 10 000 iterations were considered as burnin and therefore discarded from the post Gibbs analysis. The interval between saved sampled values was 100, so that the total number of iterations kept was 5900 for each chain. The starting values of the parameters of the first chain were ρ = 1.7, β 21 = −12.0, G s11 = 0.02, G s12 = 0.006, G s22 = 0.05 and R e11 = 0.5, R e12 = 0.02 and R e22 = 0.2. All of the remaining model parameters were started at zero. In the second chain, the starting values of the two Weibull parameters were changed according to ρ = 1.2, β 21 = −8.5, whereas the remaining parameters were started as for the first chain. These starting values are to a higher degree similar to the ones used in the Danish routine evaluation of dairy cows (ρ = 1.07 and β 21 = −8.1) [9].
The marginal posterior summary statistics of the first chain were very similar to those of the second chain and in the following we will therefore only give results from the second chain (Tab. II). The agreement between the two chains provides evidence that the Gibbs sampler converged and that samples can be regarded as generated from the posterior distribution of interest.
Because of large computation time in the joint mixed model (1) the convergence of the Gibbs sampler for survival parameters were prior to the joint analysis also assessed in univariate systematic analyses of longevity. This was done by varying the starting values for β 21 (−6 to −15) and for R e22 (0.02 to 0.8). In all combinations tested, the marginal posterior summary statistics were basically the same and similar to the corresponding ones obtained in the joint analysis. This further suggests satisfactory convergence of the Gibbs sampler in the joint analysis. Table II. Posterior summary statistics for the analysis of calving difficulty and longevity: marginal posterior mode, mean, 2.5% and 97.5% percentiles, and effective sample size (N e ) of Weibull parameters (ρ, β 21 ), of time-dependent stage of lactation effects (β S L1 , β S L2 , β S L3 = 0, β S L4 ), of sire variance components (G s11 , G s22 ) and sire correlation (ρ G s ), of residual variance components (R e11 , R e22 ) and residual correlation (ρ R e ), of heritability of calving difficulty on the liability scale (h 2 1 = 4G s11 /(G s11 + R e11 )), and of heritability of longevity on the log-frailty scale (h 2 2 = 4G s22 /(G s22 + R e22 )).

Simulation study
The results from the simulation study showed that parameters of the bivariate model (1) can be correctly inferred using the proposed methodology. The central posterior density (CPD) regions [6] defined by 2.5% and 97.5% percentiles covered well the parameter values used in the simulation of data (Tab. I). For example, there is 95% marginal posterior probability that the additive genetic correlation lies between 0.25 and 0.70, which covers the true value of 0.5 used in the simulation of data.

Example, calving difficulty and longevity
For calving difficulty in first lactation, the marginal posterior mean of heritability on the liability scale of 0.15 is slightly larger than previously reported heritabilities obtained from univariate threshold models (0.07 to 0.12) [17,31,38]. For longevity the marginal posterior mean of heritability on the log-frailty scale (h 2 nor = 4G s21 /(G s22 + R e22 )) is 0.82 [24]. This shows that the residual effect mainly describes additive genetic variation (i.e. genetic differences between dams and Mendelian segregation). Note that the heritability on the log-frailty scale for the survival trait ignores the underlying extreme value variation (π 2 /6) and therefore is higher than the heritabilities most commonly reported in survival studies [11]. The reason why we prefer to define heritability on the additive linear scale is that it is the most general definition of heritability that can be used both in semi-parametric and parametric proportional hazards models, also when extended to time-dependent genetic effects [7]. Further discussion on heritabilities and their interpretation can be found in [24,42].
The moderate positive marginal posterior mean of the additive genetic correlation (0.37) suggests that selection for improving one of the two traits (i.e. easier calving or reduced risk of culling) will have a beneficial effect on the other trait as well. Note that a negative sire value for longevity corresponds to low risk of culling, and that a low value for calving difficulty means less problems at calving.
The low marginal posterior mean of the residual correlation (0.023) is a bit unexpected as a calving with severe difficulties is expected to increase the risk of culling immediately after calving. A possible explanation for the low residual correlation is that the estimate obtained here represents the average residual association between calving difficulty and risk of culling in course of an animals lifetime. Therefore a high momentary residual association between calving difficulty and risk of culling will not be exploited by the model applied. Clearly this points to the need for alternative survival models that allow for a time varying residual effect. Figure 1 shows the hazard function for the first five lactations conditional on a zero value for the systematic time-independent effects, the sire effect and the residual effect. The average lifetime at second, third, fourth and fifth calving defined the calving times. The hazard function shows that at the end of each lactation the risk of culling is elevated. This result agrees with the fact that voluntary culling, which is assumed to be the major reason for culling, mainly takes place late in lactations [13]. Figure 2 shows the three conditional survival functions corresponding to sire effects equal to zero, minus and plus two standard deviations of the estimated sire variance (−0.38, 0.38), a zero value for the time-independent systematic effect and the residual effect. These survival curves clearly illustrate that daughters from the best sire, say s best = −0.38, have a substantial better longevity than daughters of the worst sire, say s worst = 0.38. For example at the time of fourth calving (day 1134) 49% of the daughters of the best sire is still alive compared to only 22% of the daughters of the worst sire.

DISCUSSION
We have presented a Gibbs sampler for joint Bayesian analysis of a threshold character and a survival trait. Simulation results established that the estimated marginal posterior distributions covered well the true values used in the simulation of data. In conclusion the model parameters and functions thereof including the additive genetic and environment correlations can be correctly inferred applying the method proposed.
The proposed method allows inferring the additive genetic correlation between a threshold character and a survival trait under the assumption of the additive genetic infinitesimal model [4,14] without having to rely on approximations. The additive genetic correlation provides information about how selection for an ordered categorical trait simultaneously may affect a survival trait and vice versa. This information is essential for planning and administrating of genetic improvement programs. For example, in pig breeding the implemented breeding strategy implies that breeding sows are culled already after their first or second litter so that direct selection for improved longevity is hardly feasible. In that case the model can be used to identify and evaluate correlated threshold characters (e.g. locomotion traits), which can be used to select indirectly for improved longevity. The bivariate model has also been suggested as a way to improve the accuracy by which especially young sires of dairy cattle are evaluated for longevity [26,39,40]. The idea is to combine longevity records with, for instance, correlated ordered categorical type traits, which can be recorded early in life, and thereby increase the information on longevity.
The joint analysis of calving difficulty and longevity provides a first illustration of a bivariate analysis of a threshold character and a survival trait. We did not assess here the important step of evaluating the plausibility of the posited model [15]. However, in agreement with previous studies we found that genetic variation exists for both traits [13,17].
For the genetic parameters of longevity, it is important to note that the marginal Weibull sire model proposed in this study is different from the univariate Weibull sire frailty model proposed by Ducrocq and Casella [11].
The model proposed here includes as an extra effect a normally distributed residual effect in log-frailty. This effect is assumed to account for unobserved individual heterogeneity due to omitted covariates, and three quarters of the genetic variation in a sire model. Moreover, if the residual effect is ignored then the univariate Weibull sire model is inconsistent with assumptions of the additive genetic infinitesimal model [2,4,14], and simulation results suggest that the estimated parameters will be biased towards zero [8]. This may in part explain why the marginal posterior means of the Weibull parameter ρ = 1.70 and the sire variance G s22 = 0.037 obtained in this study are greater than the ones estimated and used in the Danish routine genetic evaluation of sires for longevity (ρ = 1.07, σ 2 s = 0.030) [10]. These parameters are obtained from an univariate Weibull sire model without a residual effect in log-frailty [11] using the Survival Kit [12]. Another part of the explanation may be that the data analyzed in this study only is a subset of the data used for the Danish routine genetic evaluations. Furthermore, the systematic part of the two models is not exactly the same. However, for the purpose of ranking of selection candidates simulation studies suggest that the difference between the two survival models (with or with a residual effect) is inconsiderable for a sire variance in the range 0.03 to 0.05 [11].
In this study, we augmented the parameter vector with the unobserved liabilities associated with the threshold trait. The advantage of this approach is that the fully conditional distributions of parameters associated with the threshold trait reduce to standard distribution, which are easy to sample from. While this approach facilitates programming it is also known to cause slow mixing properties of particular thresholds (e.g. [17,37]). Joint updating of parameters is known to improve mixing in hierarchical models [28,29,33]. Along this line, Sorensen et al. [37] suggested sampling the thresholds and the liabilities jointly to improve the mixing of thresholds. In this study we were only concerned with analyzing categorical data with three categories so we were not obliged with the problem of slow mixing thresholds because they were all fixed for reason of identifiability [37].
For some of the parameters in this paper we assumed improper uniform prior distributions. A disadvantage with improper priors is that the posterior distribution may turn out to be improper as well. In the case where the posterior distribution is unavailable in closed form it is very difficult to demonstrate its property. If instead all parameters are given proper priors then the posterior distribution is guarantied to be proper [21]. A frequent used alternative to improper priors is to use bounded proper priors [37]. The two univariate models that define the bivariate model are both identifiable and that together with the fact that the gibbs sampler fot the bivariate model worked satisfactory suggests that the associated posterior distribution is proper and therefore is valid for drawing inferences The method described in this study can easily be extended to allow for both direct and maternal additive genetic effects, QTL effects, time dependent additive genetic effects for the survival trait, and an arbitrary baseline hazard function. Damgaard and Korsgaard [8] discussed how to implement these extensions in the context of a bivariate model of a linear Gaussian trait and a survival trait. They also sketched how this bivariate model could be generalized to an arbitrary number of ordered categorical characters, survival traits and linear Gaussian traits. This paper provides a first step towards such an analysis.