Estimation of covariance components between one continuous and one binary trait

On decrit une methode pour estimer les composantes de variance et de covariance, dans le cas de deux caracteres, l'un continu et l'autre binaire. La procedure d'estimation des variances et covariances genetiques, et de la variance residuelle du caractere continu est equivalente a celle d'un maximum de vraisemblance restreint dans le cas normal


INTRODUCTION
Discrete traits are predominant in certain areas of animal production, e.g., fertility, prolificacy, viability and animal health. These traits are of increasing importance since many producers are under quota systems and a means of increasing income is to improve efficiency through these non-production traits. However, traditional traits like milk yield, milk composition, growth and feed efficiency are still important, and so are the relationships between the discrete and the traditional traits, most of which are continuous. Non-linear procedures based on the threshold concept (Dempster & Lerner, 1950) and Bayesian arguments have become available to analyze categorical observations Harville & Mee, 1984).
Bayesian methodology has been established as a general framework for the analysis of any type of observation arising in animal production (Foulley & Gianola, 1984;Gianola et al., 1986;H6schele et al., 1986, 1987Foulley et al., 1987aFoulley et al., , 1987b. Observation structures that include both continuous and binary traits have been discussed as far as the estimation of genetic and environmental effects is concerned (Foulley et ad., 1983). The objective of this paper is to present a method for the estimation of dispersion parameters of the joint distribution of continuous and binary traits. A simulation study was conducted to study the effects of small subclass size on parameter estimates.

MODEL AND DATA STRUCTURE
Each animal is recorded for two different traits. Let y. il be the continuous trait 1 (e.g., milk yield) for the i-th animal, and let c i be the response for binary trait 2 (e.g., mastitis), which is either 1 or 0. Categories of response for the binary trait are assumed to be mutually exclusive and exhaustive. Observations for trait 1 form the s x 1 vector y i , and the observations for the binary trait can be represented as an s x 1 vector, c'= (c l , c 2 , ..., c,)', where c i = 0 if no disease is observed, A non-observable underlying continuous variate, Y i 2 , is assumed, and Yil and y 22 follow a multivariate normal distribution. According to the threshold concept (Dempster & Lerner, 1950), y 22 can be thought of as liability to contract the disease. Only if the value of y 22 exceeds a fixed threshold on the underlying scale, does animal i show the disease.
Let y 2 be an s x 1 vector containing Yi2 for animals 1 to s, then the linear bivariate model in matrix notation is where : p j = p x 1 vector of fixed effects for trait j, u j = q x 1 vector of additive genetic effects for trait j, ej= s x 1 vector of residuals for trait j, X = s x p incidence matrix corresponding to P j , Z = s x q incidence matrix corresponding to u j , p = the number of levels of all fixed effects, and q = the number of additive genetic effects, which could be greater than or equal to s. The expectations and variance-covariance matrices of the random variables are: where: G = 2 x 2 additive genetic variance-covariance matrix, A = q x q additive genetic relationship matrix, R = 2 x 2 residual variance-covariance matrix, I = s x s identity matrix, * = direct (Kronecker-) product, gjj= additive genetic variance of trait j, r!! = residual variance of trait j, 912 = additive genetic covariance between traits 1 and 2, and p e = residual correlation between traits 1 and 2.
The unknown parameters are the location parameters, and the dispersion parameters The residual variance structure is a function of only two parameters, because Yi2 is a conceptual variable and, therefore, an arbitrary value for r 22 can be chosen. Note that this model assumes that every animal is observed for both traits and that the same model applies to each trait.

METHODS OF INFERENCE
Assume that P has a flat prior distribution (i.e., nothing is known a priori about the elements of !), and that u has a multivariate normal distribution with expectation null and variance-covariance matrix equal to G * A. The vectors P and u are a priori assumed to be independent. The joint posterior distribution of all unknowns can be written as: where f ( Y ) is the joint prior density of the dispersion parameters (Foulley et al., 1987a).

Estimation of location parameters
Integrating T out of (2) would hardly be possible, as was pointed out by Gianola et al. (1986). An alternative is to replace !y by some value, -y!t!, so that as elaborated by Foulley et al. (1983). Ignoring the superscript [t] and assuming that there is only one continuous and one binary trait and that both traits are described by the same model, then following the arguments of Foulley et al. (1983), the residual variance of the conceptual variable is chosen to be: and therefore: For lp, < 1. R is positive definite and a lower triangular matrix T exists, so that TT' = R. T-1 is used to perform a Cholesky transformation to remove the residual covariance between transformed variables where Let the tilde symbol, -, above a variable indicate that same variable on the transformed scale, then for i = 1 to s, and similarly for elements ofp, u, and e. On the transformed scale, variances and covariances are Foulley et al. (1983) show that under multivariate normality where x' t and z' are rows of X and Z pertaining to animal i, and the residuals on the transformed scale are uncorrelated, one can proceed as in a single binary trait analysis, a simplication of  from several categories to two, computing where 0(.) and 4)(.) are the density and distribution function of a standard normal variate, respectively. Let: then from (4), The mode of the posterior density (3) can be calculated by applying the Newton-Raphson algorithm to the log of (3), which leads to the nonlinear system of equations for the [k + l]th round of iteration  where an s x s diagonal matrix with w i in the diagonal and Note, that W and Y2 in round [k + 1] are computed based on the solutions from the preceding round !k!.
Let and : Iteration on (6) is continued until d' d < e, where e is an arbitrarily small number. Suppose this criterion is met in round !k'!, then the converged solutions have to be transformed back to the original scale (the superscript [t] indicating the set of dispersion parameters used): Estimation of genetic variances and covariances Foulley et al. (1983) discuss briefly the case of the variance-covariance structure being unknown and suggest the application of the restricted maximum likelihood (RE1VIL) algorithm described by Schaeffer et al. (1978) to estimate the entire genetic variance-covariance structure as well as the residual variance of the continuous trait only, and the residual covariance between continuous and binary traits.
Methods to estimate dispersion parameters have been developed from a Bayesian background for different data structures. Procedures for single polychotomous traits have been described by Harville & Mee (1984) and by H6schele et al. (1987).
Methods for multivariate binary traits have been suggested by Foulley et al. (1987a), and for multivariate continuous traits by Gianola et al. (1986). Essentially, all authors recommend an algorithm analogous to the expectation-maximization (EM) algorithm (Dempster et al., 1977) for REML and its multitrait extension, respectively.
Let which requires integration of (2) with respect to 0. Foulley et al. (1987a) show that (7) can be written as where E e is the expectation taken with respect to f (u ! Y, y l , T ). Furthermore, Foulley et al. (1987a) show that whenever a flat prior for g is used, the joint posterior distribution of T can be maximized with respect to g by maximizing E!{ln f( u I g) I at each iterate. The resulting iterative scheme for the [t + 1]th round is where tr (.) is the trace operator and These terms cannot be derived explicitly, but approximations have been suggested (Harville & Mee, 1984;Stiratelli et al., 1984). Making use of these approximations and performing one estimation step on the transformed scale, we replace in (9) and C ii , by the U t x u i , part of the inverted left hand side of (6). The new estimates are on the transformed scale and need to be transformed back: Foulley et al. (1987a) have shown that if the initial G is positive definite, then the subsequent estimates of G are also positive definite, which holds in combination with the applied Cholesky transformation.
Estimation of the residual variance of the continuous trait Maximum a posteriori predictors tend to converge to trivial results when flat priors for the dispersion parameters are used (Lindley & Smith, 1972). As an alternative Gianola et al. (1986) suggest estimators arising from an approximate integration of the variances. For g these are equivalent to the algorithms described in section B as long as flat priors for the variances are used. For the residual variance of the continuous trait estimator is which can be shown to lead to the same results as iterating on assuming full column rank of X. Estimators on the original scale are obtained by: which yields Estimation of the residual correlation between traits Foulley et al. (1983) suggest the use of restricted maximum likelihood estimation as described by Schaeffer et al. (1978) for multivariate normal data. However, this requires that the residual correlations between continuous and binary traits be known. Foulley et al. (1983) show that proportionately except for a constant.
The problem is to find y [t] and 8 (which is a function of T 14 ) which maximize this log posterior density. Estimates for 0, g, and r ll can be obtained under the pretense that p e = pe similar to the one dimensional grid search technique proposed by Smith & Graser (1986). In this case the strategy would be: -1) choose a set of possible values of p e and a set of starting values for g, and r ll ; -2) for ith value of p e in the set, iterate on (6) using the dispersion parameters (g, r ll , Pe i) until convergence of 0, then do one step of (9) and (10) and continue to iterate until convergence of (g and r ll ) is achieved. Then compute the value of (11) defined as h (p ei ). Repeat for all p, i in the set; -3) to find the p, i that maximizes the log likelihood, fit a second degree polynomial through the points (p ei , h (p e2 )). Add the mode of this curve to the set of possible values of p e and repeat the second step with this new value. Continue this iterative process until convergence of p e is achieved. This strategy is computationally very demanding as convergence is very slow. An alternative empirical approach using a maximum likelihood procedure to estimate p e was suggested by equation 4.4 of Tate (1955) for the biserial correlation problem. The [t + l]th improved estimate of p e is computed as: where v = s x 1 vector with elements v i , i = 1, s, 1 = s x 1 vector of I's, Yi ( y i-1 * ! I ) * l/( d tl) l / 2 , ! I = phenotypic mean of y 21 for all animals, and OW ] is the estimated value of the threshold which can be calculated from !lk'] analogously to a least squares mean. Note that this procedure yields a new estimate of p e on the original scale, and therefore, no backtransformation is needed.

Computing algorithm
Combining all the different steps described, an obvious computing strategy would be: -1) choose a set of starting values yl'], set (t] = 1; -2) iterate on (6) until convergence to get 9 ! Y = T 14 ; -3) do one step each of (9), (10) and (12) to get a new set of dispersion parameters T l' + '], set [t] = [t + 1] and continue with the precedent step. Iterate on the above scheme until the estimates of y are converged. Empirically, (10) and (12) were found to converge more quickly than (9), and the estimates in r seemed to be relatively independent of those in g. This suggests the use of a different strategy: -1) choose a set of starting values g l] and rI l ], set [t] = 1 and [s] = 1; -2) do one round of (6) to get an improved estimate of 0 g = gI t ], r = r1 4 ; -3) do one step each of (10) and (12) to get a new set of parameters r Is+1] , set [s] = [s + 1] and continue with the precedent stage -4) if convergence in the second step is reached, do one step of (9) to get a new estimate of gI t+1] , set [t] = [t + 1] and continue with the second step.
Iteration is stopped when the estimates in g are converged. This algorithm was found to be much faster than the first, which is paralleled by the findings of Foulley et al. (1987a) for the multiple binary trait case. Iterating on r only for some rounds with every new value of g improved the procedure even further, as the estimates of r converge quite rapidly.

Model and methods
A simulation study was done to assess the sampling properties of the proposed estimators. The assumed true model was: where: -Y ij kl = phenotypic record (i = 1) or underlying liability value (i = 2) of animal I in herd j with sire k, pz = mean of trait i, -hij = effect of herd j on trait i, sz! = effect of sire k on trait i, and -e2!! = residual term.
Pairs of herd, sire and residual effects were generated for each animal as random samples from a bivariate normal distribution. Trait 2 was then dichotomized, applying a threshold at 1.282 standard deviations, which assigns 90 and 10 per cent of the observations to the phenotypic classes 0 and 1, respectively. Records were generated for 10,000 individuals per replicate, which were randomly assigned to sires (of which there were 50) and to herds (of which there were either 100 or 1000). With only 100 herds there were 13.5% empty sire by herd subclasses, and with 1 000 herds there were 81% empty sire by herd subclasses. The objective of having two data sets was to study the effect of small sire by herd subclass size on the parameter estimates from the above methods.
The true dispersion parameters were: The model of analysis was equal to (13), except that herd effects were treated as fixed and the non observable y 2 was replaced by the phenotypic observation in the binary trait. The starting values of iteration were and the stopping criterion was considering the relative change to account for the different magnitude of the true values. Due to computing time restrictions, only twenty-five replications of each pair of data sets were generated and analyzed.

Results
The mean estimates and 95% confidence intervals are given in Table I. Fisher's x-transformation for correlations was applied to hi, h', r 9 and p e , and a t-test was applied to examine differences between true and estimated parameters. Estimates from data set I with 100 herds agreed quite closely with the true parameters. The only significant bias was observed for the residual correlation, which is partly due to the small confidence interval for this estimate. The genetic correlation had a very large confidence interval compared with the other estimates.
The results for data set II (1000 herds) were similar with the exception of a 50% overestimation of the heritability of the binary trait. H6schele et al. (1987) reported similar results in a simulation study considering estimation of heritability for binary traits. The bias was observed when the average subclass size was below 2, which agrees with the present results, as the mean subclass sizes were approximately 2.3 and 1.1 in data set I and II, respectively. H8schele et al. (1987) considered the approximation of the conditional distribution in the derivation of (9) as the main cause of the bias, as this approximation is based on a normality assumption (Stiratelli et al., 1984), which does not hold with small cell size.

DISCUSSION
A method has been presented for the estimation of variances and covariances for the simplest case of one continuous and one binary trait. Theoretically, generalization to more complex observation structures is straightforward. For the case of n continuous and one binary trait, the method to estimate location parameters is as given by Foulley et al. (1983), and the procedures to estimate dispersion parameters can be readily extended, using results of Hannan & Tate (1965) for the maximum likelihood estimation of residual correlations. Further generalizations, including the use of informative priors, may be derived through Bayesian arguments as employed by Gianola et al. (1986) and Foulley et al. (1987a).
In practice, however, an analysis of even the simplest case poses a formidable computational task with a large body of data, in which case the optimization of computing strategies is of crucial importance. Actually, the methodology presented was developed for a joint analysis of more than 200,000 records on production and disease data in dairy cattle.
Observations on both traits for all animals were assumed complete throughout the data, which seldom is the case in practice, where selection and hence missing data due to selection play an important role. This has been addressed by Henderson (1975)  The multiple trait methodology allows for sequential selection where the decision whether an animal is given the opportunity to be observed for the discrete trait is made on the basis of its performance for the continuous trait, but not vice versa, as was indicated by Foulley et al. (1983). General approaches to the more complex types of selection are needed, as most of the reasons for inevitable natural selection (diseases, fertility) are of categorical nature.
The simulation study was an illustration of the methodology for two special cases. A general assessment of the sampling properties of the estimates would require a series of simulations over a wide range of parameter combinations and more realistic population structures as was done by Hbschele et al. (1987), for the single categorical trait case. Since the general framework of the methods is the same, a similar behaviour of the estimates can be expected, which was shown for the biased estimate of heritability for the binary trait when the average size of the smallest subclass was less than two. This is a serious problem in practical applications, e.g. when dairy progeny test data are analyzed by a model in which sire and herd x year x season effects are cross-classified.
The complexity of the Bayesian approach in the context of the given model and observation structure makes it necessary to use certain assumptions and restrictions as discussed by Gianola et al. (1986) for the multivariate normal case. Although the approach of basing the inference about 0 on the conditional distribution f (9 ! Y, Yl'Y * ) where y * is the mode of f (r ! Y, y l ) has proven to be very useful H6schele et al., 1987), alternative approaches are possible as discussed by Foulley et al. (1987a). Thus, further development of Bayesian methodology may suggest changes in the proposed algorithm.