Likelihood inferences in animal breeding under selection: a missing-data theory viewpoint

The Editorial Board here introduces a new kind of scientific report in the Journal, whereby a current field of research and debate is given emphasis, being the subject of an open discussion within these columns. As a first essay, we propose a discussion about a difficult and somewhat troublesome question in applied animal genetics: how should one take proper account of the fact that the observed data are selected data? Several attempts have been made in the past 15 years, without any clear and unanimous solution. In the following, Im, Fernando and Gianola propose a general approach that should make it possible to deal with every problem. In addition to the interest of an original article, we hope that their own discussion and response to the comments given by Henderson and Thompson will provide the reader with a sound insight into this complex topic. This paper is dedicated to the memory of Professor Henderson, who gave us here one of his latest contributions.

Summary - Data available in animal breeding are often subject to selection. Such data can be viewed as data with missing values. In this paper, inferences based on likelihoods derived from statistical models for missing data are applied to production records subject to selection. Conditions for ignoring the selection process are discussed.

animal genetics - selected data - missing data - likelihood inference

Résumé - Likelihood-based inference methods in animal genetics: accounting for data arising from selection by means of missing-data theory. Data available in animal genetics often result from a prior selection process. The (unobserved) attributes associated with the culled individuals can therefore be regarded as missing, and the data collected can be analysed as coming from a sample with missing data. In this article, inference methods based on likelihoods are developed, with the selection process that induces the missing data made explicit in their computation. The conditions under which selection can be ignored, so that only the likelihood of the data actually collected need be considered, are discussed.


INTRODUCTION
Data available in animal breeding often come from populations undergoing selection. Several authors have considered methods for the proper treatment of data subject to selection in animal breeding. Examples are Henderson et al. (1959), Curnow (1961), Thompson (1973), Henderson (1975), Rothschild et al. (1979), Goffinet (1983), Meyer and Thompson (1984), Fernando and Gianola (1989), and Schaeffer (1987). Data subject to selection can be viewed as data with missing values, selection being the process that causes missing data. The statistical literature discusses missing data that arise intentionally. Rubin (1976) has given a mathematically precise treatment which encompasses frequentist approaches that are not based on likelihoods as well as inferences from likelihoods (including maximum likelihood and Bayesian approaches). Whether it is appropriate to ignore the process that causes the missing data depends on the method of inference and on the process that causes the missing values. Rubin (1976) suggested that in many practical problems, inferences based on likelihoods are less sensitive than sampling distribution inferences to the process that causes missing data. Goffinet (1987) gave alternative conditions to those of Rubin (1976) for ignoring the process that causes missing data when making sampling distribution inferences, with an application to animal breeding. The objective of this paper is to consider inferences based on likelihoods derived from statistical models for the data and the missing-data process, in the analysis of data from populations undergoing selection. As in Little and Rubin (1987), we consider inferences based on likelihoods, in the sense described above, because of their flexibility and avoidance of ad hoc methods. Assumptions underlying the resulting methods can be displayed and evaluated, and large-sample estimates of variances, based on second derivatives of the log-likelihood taking into account the missing-data process, can be obtained.
MODELING THE MISSING-DATA PROCESS
Ideas described by Little and Rubin (1987) are employed in subsequent developments. Let $y$, the realized value of a random vector $Y$, denote the data that would occur in the absence of missing values, or complete data. The vector $y$ is partitioned into observed values, $y_{obs}$, and missing values, $y_{mis}$. Let

$$f(y \mid \theta) \equiv f(y_{obs}, y_{mis} \mid \theta) \tag{1}$$

be the probability density function of the joint distribution of $Y = (Y_{obs}, Y_{mis})$, and $\theta$ be an unknown parameter vector. We define for each component of $Y$ an indicator variable, $R_i$ (with realized value $r_i$), taking the value 1 if the component is observed and 0 if it is missing. In order to illustrate the notation, 3 types of missing data are described in Table I. Consider 2 correlated traits measured on $n$ unrelated individuals; for example, first and second lactation yields of $n$ cows. The 'complete' data are $y = (y_{ij})$, where $y_{ij}$ is the realized value of trait $j$ in individual $i$ ($j = 1, 2$; $i = 1, \ldots, n$). Suppose that selection acts on the first trait (case (a) in Table I). As a result, a subset of $y$, $y_{obs}$, becomes available for analysis. The pattern of the available data is a random variable. For example, if the better of two cows ($n = 2$) is selected to have a second lactation, the complete data would be

$$y = (y_{11}, y_{12}, y_{21}, y_{22})'.$$

Then, when $y_{11} > y_{21}$:

$$y_{obs} = (y_{11}, y_{12}, y_{21})', \quad r = (1, 1, 1, 0)',$$

and when $y_{11} < y_{21}$:

$$y_{obs} = (y_{11}, y_{21}, y_{22})', \quad r = (1, 0, 1, 1)'.$$

Thus, in the analysis of selected data, the pattern of records available for analysis, characterized by the value of $r$, should be considered as part of the data. If this is not done, there will be a loss of information.
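The two-cow example can be written out as a small sketch. The function name `observe_better_cow`, the ordering of the complete-data vector as (y11, y12, y21, y22) and the yields used are assumptions made only for illustration; they are not taken from the paper.

```python
def observe_better_cow(y):
    """Apply the 'better of two cows' rule to complete data y.

    y is assumed to be ordered as (y11, y12, y21, y22): first and second
    lactation yields of cow 1, then of cow 2. Returns (y_obs, r), where
    r[k] = 1 if component k of y is observed and 0 if it is missing.
    """
    y11, y12, y21, y22 = y
    if y11 > y21:
        r = (1, 1, 1, 0)  # cow 1 is kept; second lactation of cow 2 is missing
    else:
        r = (1, 0, 1, 1)  # cow 2 is kept; second lactation of cow 1 is missing
    y_obs = tuple(value for value, seen in zip(y, r) if seen)
    return y_obs, r


# Hypothetical yields (kg), used only to show the pattern r.
y_obs, r = observe_better_cow((6200.0, 6900.0, 5800.0, 6400.0))
print(y_obs, r)
```

The pattern `r` is computed from the data, which is why it is treated as a random variable rather than as a fixed design. Ties in first-lactation yield, which have probability zero for continuous traits, are sent to the second branch here.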
To treat $R = (R_i)$ as a random variable, we need to specify the conditional probability that $R = r$, $f(r \mid y, \psi)$, given the 'complete' data $Y = y$; the vector $\psi$ is a parameter of this conditional distribution. The density of the joint distribution of $Y$ and $R$ is

$$f(y, r \mid \theta, \psi) = f(y \mid \theta)\, f(r \mid y, \psi). \tag{2}$$

The likelihood ignoring the missing-data process, or marginal density of $y_{obs}$ in the absence of selection, is obtained by integrating out the missing data $y_{mis}$ from equ.(1):

$$f(y_{obs} \mid \theta) = \int f(y_{obs}, y_{mis} \mid \theta)\, dy_{mis}. \tag{3}$$

The problem with using $f(y_{obs} \mid \theta)$ as a basis for inferences is that it does not take into account the selection process. The information about $R$, a random variable whose value $r$ is also observed, is ignored. The actual likelihood is

$$f(y_{obs}, r \mid \theta, \psi) = \int f(y_{obs}, y_{mis} \mid \theta)\, f(r \mid y_{obs}, y_{mis}, \psi)\, dy_{mis}. \tag{4}$$

The question now arises as to when inferences on $\theta$ should be based on the joint likelihood (equ.(4)), and when they can be based on equ.(3), which ignores the missing-data process. Rubin (1976) has studied conditions under which inferences from equ.(3) are equivalent to those obtained from equ.(4). If these hold, one can say that the missing-data process can be ignored. The conditions given by Rubin (1976) are: 1) the missing data are missing at random, i.e., $f(r \mid y_{obs}, y_{mis}, \psi) = f(r \mid y_{obs}, \psi)$ for all $\psi$ and $y_{mis}$, evaluated at the observed values $r$ and $y_{obs}$; and 2) the parameters $\theta$ and $\psi$ are distinct, in the sense that the joint parameter space of $(\theta, \psi)$ is the product of the parameter space of $\theta$ and the parameter space of $\psi$. Within the context of Bayesian inference, the missing-data process is ignorable when 1) the missing data are missing at random, and 2) the prior density of $\theta$ and $\psi$ is the product of the marginal prior density of $\theta$ and the marginal prior density of $\psi$.

IGNORABLE OR NON-IGNORABLE SELECTION
Without loss of generality, we examine ignorability of selection when making likelihood inferences about $\theta$ for each of the three examples given in Table I. Suppose individuals $1, 2, \ldots, m$ ($m < n$) are selected.

Case (a)
Selection is based on observations on the first trait, which are part of the observed data, so all the data used to make selection decisions are available. The likelihood for the observed data, ignoring selection, is

$$f(y_{obs} \mid \theta) = \prod_{i=1}^{m} f(y_{i1}, y_{i2} \mid \theta) \prod_{i=m+1}^{n} f(y_{i1} \mid \theta) \tag{5}$$

$$= \prod_{i=1}^{n} f(y_{i1} \mid \theta) \prod_{i=1}^{m} f(y_{i2} \mid y_{i1}, \theta). \tag{6}$$

Because selection is based on the observed data only, the conditional probability satisfies $f(r \mid y, \psi) = f(r \mid y_{obs}, \psi)$: it does not depend on the missing data. Applying this condition in equ.(4), one obtains as likelihood function

$$f(y_{obs}, r \mid \theta, \psi) = f(r \mid y_{obs}, \psi)\, f(y_{obs} \mid \theta). \tag{7}$$

It follows that maximization of equ.(7) with respect to $\theta$ will give the same estimates of this parameter as maximization of equ.(6). Thus, knowledge of the selection process is not required, i.e., selection is ignorable. Note that with or without normality, $f(y_{obs} \mid \theta)$ can always be written as equ.(5) or (6). Under normality of the joint distribution of $Y_{i1}$ and $Y_{i2}$, Kempthorne and Von Krosigk (Henderson et al., 1959) and Curnow (1961) expressed the likelihood as equ.(6). These authors, however, did not justify clearly why the missing-data process could be ignored.
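A small simulation can illustrate why selection on the observed first trait is ignorable. It is a sketch under assumptions that are not in the paper: two standard normal traits with correlation `rho`, second records kept only for the top half on the first trait, and interest in the mean of the second trait (true value 0). The estimate labelled `ml` uses the standard maximum likelihood result for bivariate normal data in which the second variable is missing for part of the sample: the mean of the first trait is taken from all records and the regression of the second trait on the first from the complete pairs. No model of the selection rule enters the calculation.

```python
import random


def simulate_case_a(n=4000, rho=0.6, seed=1):
    """Simulate two standard normal traits; keep trait 2 for the top half on trait 1."""
    rng = random.Random(seed)
    y1, y2 = [], []
    for _ in range(n):
        a = rng.gauss(0.0, 1.0)
        b = rng.gauss(0.0, 1.0)
        y1.append(a)
        y2.append(rho * a + (1.0 - rho ** 2) ** 0.5 * b)
    cutoff = sorted(y1)[n // 2]
    selected = [i for i in range(n) if y1[i] >= cutoff]
    return y1, y2, selected


def estimate_mu2(y1, y2, selected):
    """Return (naive, ml) estimates of the mean of trait 2."""
    # Naive: average of the second records that happen to be available.
    naive = sum(y2[i] for i in selected) / len(selected)
    # ML ignoring selection: mean of trait 1 from all animals,
    # regression of trait 2 on trait 1 from the selected pairs.
    mu1 = sum(y1) / len(y1)
    x = [y1[i] for i in selected]
    z = [y2[i] for i in selected]
    xbar = sum(x) / len(x)
    zbar = sum(z) / len(z)
    sxx = sum((a - xbar) ** 2 for a in x)
    sxz = sum((a - xbar) * (b - zbar) for a, b in zip(x, z))
    slope = sxz / sxx
    ml = zbar + slope * (mu1 - xbar)
    return naive, ml


y1, y2, selected = simulate_case_a()
naive, ml = estimate_mu2(y1, y2, selected)
print(naive, ml)
```

Under these assumptions the naive mean should sit well above zero, because only the better animals have second records, while the likelihood-based value should be close to zero even though the selection rule was never modeled.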
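In order to illustrate the meaning of the parameter $\psi$ of the conditional probability of $R = r$ given $Y = y$, we consider a 'stochastic' form of selection: individual $i$ is selected with probability $g(\psi_0 + \psi_1 y_{i1})$, so $\psi = (\psi_0, \psi_1)'$. This type of selection can be regarded as selection based on survival, which depends on the first trait via the function $g(\psi_0 + \psi_1 y_{i1})$. We have for the data in Table I

$$f(r \mid y, \psi) = \prod_{i=1}^{m} g(\psi_0 + \psi_1 y_{i1}) \prod_{i=m+1}^{n} \left[1 - g(\psi_0 + \psi_1 y_{i1})\right].$$

The actual likelihood for the observed data $y_{obs}$ and $r$ is

$$f(y_{obs}, r \mid \theta, \psi) = f(y_{obs} \mid \theta) \prod_{i=1}^{m} g(\psi_0 + \psi_1 y_{i1}) \prod_{i=m+1}^{n} \left[1 - g(\psi_0 + \psi_1 y_{i1})\right]. \tag{8}$$

It follows that when $\psi$ and $\theta$ are distinct, inference about $\theta$ based on the actual likelihood, $f(y_{obs}, r \mid \theta, \psi)$, will be equivalent to that based on the likelihood ignoring selection, $f(y_{obs} \mid \theta)$. As shown in equ.(8), the two likelihoods differ by a multiplicative constant which does not depend on $\theta$.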
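The statement that the two likelihoods differ only by a factor free of the data parameters can be checked numerically. The sketch below makes several assumptions for illustration only: five hypothetical animals, unit variances, a parameter vector (mu1, mu2, b) in which b is the regression of the second trait on the first, and the standard normal distribution function as the selection function g.

```python
import math

# Hypothetical records: first trait for all animals, second trait only if selected.
Y1 = [1.2, 0.4, -0.3, 0.9, -1.1]
Y2 = [1.0, 0.1, None, 0.7, None]


def log_phi(x):
    """Log density of the standard normal distribution."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * x * x


def norm_cdf(x):
    """Standard normal distribution function, used here as g."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def loglik_ignoring(theta):
    """Log of the observed-data likelihood, ignoring selection."""
    mu1, mu2, b = theta
    total = sum(log_phi(a - mu1) for a in Y1)
    total += sum(log_phi(z - mu2 - b * (a - mu1))
                 for a, z in zip(Y1, Y2) if z is not None)
    return total


def log_selection_term(psi):
    """Log probability of the observed pattern r given the first-trait records."""
    psi0, psi1 = psi
    total = 0.0
    for a, z in zip(Y1, Y2):
        p = norm_cdf(psi0 + psi1 * a)
        total += math.log(p) if z is not None else math.log(1.0 - p)
    return total


def loglik_full(theta, psi):
    """Log of the actual likelihood of the observed data and the pattern r."""
    return loglik_ignoring(theta) + log_selection_term(psi)


psi = (0.2, 1.5)
gap_a = loglik_full((0.0, 0.0, 0.5), psi) - loglik_ignoring((0.0, 0.0, 0.5))
gap_b = loglik_full((0.8, -0.3, 0.1), psi) - loglik_ignoring((0.8, -0.3, 0.1))
print(gap_a, gap_b)
```

The two gaps agree for any pair of parameter values, so both log-likelihoods are maximized at the same point; only the additive term depending on the selection parameters changes.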
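It should be noted that in general, although the conditional distribution of $R_{i2}$ given $y$ does not depend on $\theta$, this is not so for the marginal distribution. For example, when $Y_{i1}$ is normal with mean $\mu_1$ and variance $\sigma_1^2$, and $g$ is the standard normal distribution function ($\Phi$), we have

$$\Pr(R_{i2} = 1 \mid \theta, \psi) = \Phi\!\left(\frac{\psi_0 + \psi_1 \mu_1}{\sqrt{1 + \psi_1^2 \sigma_1^2}}\right),$$

which depends on $\theta$ through $\mu_1$ and $\sigma_1^2$. The condition given by Goffinet (1987) for ignoring the process that causes missing data is not satisfied in this situation.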
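The dependence of the marginal selection probability on the parameters of the first trait can be checked with a short numerical sketch. It relies on the standard identity that, for a normal variable with mean mu and variance sigma^2, the expectation of Phi(psi0 + psi1 Y) equals Phi((psi0 + psi1 mu) / sqrt(1 + psi1^2 sigma^2)). The function names, the integration range of eight standard deviations and the parameter values are assumptions chosen for illustration.

```python
import math


def norm_pdf(x):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def norm_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def marginal_selection_prob(mu, sigma, psi0, psi1, steps=4000):
    """Midpoint-rule integral of Phi(psi0 + psi1*y) against the N(mu, sigma^2) density."""
    lower = mu - 8.0 * sigma
    width = 16.0 * sigma / steps
    total = 0.0
    for k in range(steps):
        y = lower + (k + 0.5) * width
        total += norm_cdf(psi0 + psi1 * y) * norm_pdf((y - mu) / sigma) / sigma * width
    return total


def closed_form(mu, sigma, psi0, psi1):
    """Closed-form value of the same probability."""
    return norm_cdf((psi0 + psi1 * mu) / math.sqrt(1.0 + psi1 ** 2 * sigma ** 2))


numeric = marginal_selection_prob(mu=1.0, sigma=2.0, psi0=0.2, psi1=1.5)
exact = closed_form(mu=1.0, sigma=2.0, psi0=0.2, psi1=1.5)
print(numeric, exact)
```

Changing `mu` or `sigma` changes the result, which is the sense in which the marginal distribution of the selection indicator carries information about the data parameters even though its conditional distribution given the records does not.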

Case (b)
Data are available only on selected individuals because observations are missing for the unselected ones. In what follows, we consider truncation selection: individual $i$ is selected when $y_{i1} > t$, where $t$ is a known threshold.
The likelihood of the observed data ($y_{obs}$) ignoring selection is

$$f(y_{obs} \mid \theta) = \prod_{i=1}^{m} f(y_{i1}, y_{i2} \mid \theta). \tag{9}$$

The conditional probability that $R = r$ given $Y = y$ depends on the observed and on the missing data. We have

$$f(r \mid y, t) = \prod_{i=1}^{m} 1_{(t,\infty)}(y_{i1}) \prod_{i=m+1}^{n} \left[1 - 1_{(t,\infty)}(y_{i1})\right],$$

where $1_{(t,\infty)}(y_{i1}) = 1$ if $y_{i1} > t$, and 0 if $y_{i1} \le t$.
The actual likelihood, accounting for selection, is

$$f(y_{obs}, r \mid \theta) = \prod_{i=1}^{m} f(y_{i1}, y_{i2} \mid \theta) \prod_{i=m+1}^{n} \Pr(Y_{i1} \le t \mid \theta). \tag{10}$$

Comparison of equs.(9) and (10) indicates that one should make inferences about $\theta$ using equ.(10), which takes selection into account. If equ.(9) is used, the information about $\theta$ contained in the second term in equ.(10) would be neglected.
Clearly selection is not ignorable in this situation.
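The effect of neglecting the selection term under truncation can be sketched with a simplified, single-trait version of this case. The assumptions are not from the paper: records are N(mu, 1) with true mean 0, the threshold is 0, the number of culled animals is known, and the likelihood is maximized by a crude grid search. The function names are hypothetical.

```python
import math
import random


def norm_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def simulate_truncation(n=2000, mu=0.0, t=0.0, seed=7):
    """Draw n records from N(mu, 1) and keep only those above the threshold t."""
    rng = random.Random(seed)
    records = [rng.gauss(mu, 1.0) for _ in range(n)]
    return [v for v in records if v > t], n


def loglik_full(mu, selected, n, t):
    """Log-likelihood with the selection term: density of the kept records
    plus (n - m) times the log probability of falling at or below t."""
    m = len(selected)
    kernel = sum(-0.5 * (v - mu) ** 2 for v in selected)
    return kernel + (n - m) * math.log(norm_cdf(t - mu))


def grid_argmax(func, lower, upper, step):
    """Return the grid point in [lower, upper] with the largest func value."""
    best_x, best_val = lower, func(lower)
    k = 1
    while lower + k * step <= upper + 1e-12:
        x = lower + k * step
        val = func(x)
        if val > best_val:
            best_x, best_val = x, val
        k += 1
    return best_x


selected, n = simulate_truncation()
# Maximizing the likelihood that ignores selection gives the mean of the kept records.
naive = sum(selected) / len(selected)
full = grid_argmax(lambda mu: loglik_full(mu, selected, n, 0.0), -1.0, 1.5, 0.01)
print(naive, full)
```

In this setting the estimate that ignores selection should be far above the true mean, while the estimate that includes the term for the culled animals should be close to it, in line with the conclusion that truncation selection on unavailable records is not ignorable.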

Case (c)
Often selection is based on an unknown trait correlated with the trait for which data are available (Thompson, 1979). As in case (c) in Table I, suppose the data are available for the second trait on selected individuals only, following selection, e.g. by truncation, on the first trait. The likelihood ignoring selection is

$$f(y_{obs} \mid \theta) = \prod_{i=1}^{m} f(y_{i2} \mid \theta). \tag{11}$$

We have

$$f(r \mid y, t) = \prod_{i=1}^{m} 1_{(t,\infty)}(y_{i1}) \prod_{i=m+1}^{n} \left[1 - 1_{(t,\infty)}(y_{i1})\right].$$

The likelihood of the observed data, $y_{obs}$ and $r$, is

$$f(y_{obs}, r \mid \theta) = \prod_{i=1}^{m} f(y_{i2} \mid \theta) \prod_{i=1}^{m} \Pr(Y_{i1} > t \mid y_{i2}, \theta) \prod_{i=m+1}^{n} \Pr(Y_{i1} \le t \mid \theta). \tag{12}$$

Inferences based on the likelihood (equ.(11)) would be affected by a loss of information represented by the second and the third terms in equ.(12).
Under certain conditions one could use $f(y_{obs} \mid \theta)$ to make inferences about parameters of the marginal distribution of the second trait after selection. Suppose the marginal distribution of the second trait depends only on parameters $\theta_2$, and that the marginal and conditional (given the second trait) distributions of the first trait do not depend on $\theta_2$. In this case, likelihood inferences on $\theta_2$ from equs.(11) and (12) will be the same.
In summary, the results obtained for the 3 cases discussed indicate that when selection is based only on the observed data it is ignorable, and knowledge of the selection process is not required for making correct inferences about parameters of the data. When the selection process depends on observed and also on missing data, selection is generally not ignorable. Here, making correct inferences about parameters of the data requires knowledge of the selection process to appropriately construct the likelihood.

Selection based on data
In this section, we consider the more general type of selection described by Goffinet (1983) and Fernando and Gianola (1987). The data $y_0$ are observed in a 'base population' and used to make selection decisions which lead to observing a set of data, $y_{1obs}$, among $n_1$ possible sets of values $y_{11}, y_{12}, \ldots, y_{1n_1}$. Each $y_{1k}$ ($k = 1, \ldots, n_1$) is a vector of measurements corresponding to a selection decision. The observed data at the first stage, $y_{1obs}$, are themselves used (jointly with $y_0$) to make selection decisions at a second stage, and so forth. At stage $j$ ($j = 1, \ldots, J$), let $y_j$ be the vector of all elements from $y_{j1}, \ldots, y_{jn_j}$, without duplication. The vector $y_j$ can be partitioned as

$$y_j = (y_{jobs}, y_{jmis}),$$

where $y_{jobs}$ and $y_{jmis}$ are the observed and the missing data, respectively. For the $J$ stages, the data can be partitioned as $y = (y_{obs}, y_{mis})$, where

$$y_{obs} = (y_0, y_{1obs}, \ldots, y_{Jobs})$$

and

$$y_{mis} = (y_{1mis}, \ldots, y_{Jmis})$$

are the observed and missing parts, respectively, of the complete data set. The complete data set $y$ is a realized value of a random variable $Y$. When the selection process is based only on the observed data, $y_{obs}$, the observed missing-data pattern, $r$, is entirely determined by $y_{obs}$. Thus,

$$f(r \mid y_{obs}, y_{mis}, \psi) = f(r \mid y_{obs}, \psi),$$

and the actual likelihood can be written as in equ.(7). In this case, the selection process is ignorable and inferences about $\theta$ can be based on the likelihood of the observed data, $f(y_{obs} \mid \theta)$. This agrees with Gianola and Fernando (1986).

Selection based on data plus 'externalities'
Suppose that external variables, represented by a random vector $E$, and the observed data $y_{obs}$ are jointly used to make selection decisions. Let $f(y, e \mid \theta, \xi)$ be the joint density of the complete data $Y$ and $E$, with an additional parameter $\xi$ such that $\theta$ and $\xi$ are distinct. The actual likelihood, the density of the joint distribution of $Y_{obs}$ and $R$, is

$$f(y_{obs}, r \mid \theta, \xi, \psi) = \int\!\!\int f(y_{obs}, y_{mis}, e \mid \theta, \xi)\, f(r \mid y_{obs}, e, \psi)\, dy_{mis}\, de, \tag{13}$$

where $f(r \mid y_{obs}, e, \psi)$ is the distribution of the missing-data process (selection process).
In general, inferences about $\theta$ based on $f(y_{obs}, r \mid \theta, \xi, \psi)$ are not equivalent to those based on $f(y_{obs} \mid \theta)$. However, if for the observed data, $y_{obs}$,

$$f(y_{obs}, y_{mis}, e \mid \theta, \xi) = f(y_{obs}, y_{mis} \mid \theta)\, f(e \mid \xi)$$

for all $y_{mis}$ and $e$, then equ.(13) can be written as

$$f(y_{obs}, r \mid \theta, \xi, \psi) = f(y_{obs} \mid \theta) \int f(e \mid \xi)\, f(r \mid y_{obs}, e, \psi)\, de.$$

Thus, under the above condition, which is satisfied when $Y$ and $E$ are independent, inferences about $\theta$ based on the actual likelihood $f(y_{obs}, r \mid \theta, \xi, \psi)$ and those based on $f(y_{obs} \mid \theta)$ are equivalent. Consequently, the selection process is ignorable. Note that the condition for all $y_{mis}$ and $e$ does not require independence between $Y$ and $E$ because it holds only for the observed data $y_{obs}$ and not for all values of the random variable $Y_{obs}$.
The results can be summarized as follows: 1) the selection process is ignorable when it is based only on the observed data, or on the observed data and independent externalities; 2) the selection process is not ignorable when it is based on the observed data plus dependent externalities. In the latter case, knowledge of the selection process is required for making correct inferences.

DISCUSSION
Maximum likelihood (ML) is a widely used estimation procedure in animal breeding applications and has been suggested as the method of choice (Thompson, 1973) when selection occurs. Simulation studies (Rothschild et al., 1979; Meyer and Thompson, 1984) have indicated that there is essentially no bias in ML estimates of variance and covariance components under some forms of selection, e.g., data-based selection. Rubin's (1976) results for analysis of missing data provide a powerful tool for making inferences about parameters when data are subject to selection. We have considered ignorability of the selection process when making inferences based on likelihood and given conditions for ignoring it. The conditions differ from those given by Henderson (1975) for estimation of fixed effects and prediction of breeding value under selection in a multivariate normal model. For example, Henderson (1975) requires that selection be carried out on a linear, translation invariant function. This requirement does not appear in our treatment because we argue from a likelihood viewpoint.
In this paper, the likelihood was defined as the density of the joint distribution of the observed data and the missing-data pattern. In Henderson's (1975) treatment of prediction, the pattern of missing data is fixed, rather than random, and this results in a loss of information about parameters (Cox and Hinkley, 1974). It is possible to use the conditional distribution of the observed data given the missing-data pattern. Gianola et al. (submitted) studied this problem from a conditional likelihood viewpoint and found conditions for ignorability of selection even more restrictive than those of Henderson (1975). Schaeffer (1987) arrived at similar conclusions, but this author worked with quadratic forms, rather than with likelihood. The fact that these quadratic forms appear in an algorithm to maximize likelihood is not sufficient to guarantee that the conditions apply to the method per se.
If the conditions for ignorability of selection discussed in this study are met, the consequence is that the likelihood to be maximized is that of the observed data, i.e., the missing-data process can be completely ignored. Further, if selection is ignorable, $f(y_{obs}, r \mid \theta, \psi) \propto f(y_{obs} \mid \theta)$, so

$$\frac{\partial^2 \ln f(y_{obs}, r \mid \theta, \psi)}{\partial \theta\, \partial \theta'} = \frac{\partial^2 \ln f(y_{obs} \mid \theta)}{\partial \theta\, \partial \theta'}.$$

Efron and Hinkley (1978) suggested using observed rather than expected information to obtain the asymptotic variance-covariance matrix of the maximum likelihood estimates. Because the observed data are generally not independent or identically distributed, simple results that imply asymptotic normality of the maximum likelihood estimates do not immediately apply. For further discussion see Rubin (1976).
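As a minimal sketch of the observed-information idea, the negative second derivative of a log-likelihood can be approximated by central differences at the maximum likelihood estimate and inverted to give a large-sample variance. The single-parameter normal model with unit variance, the five records and the function names below are assumptions for illustration only; they stand in for whatever observed-data likelihood applies once selection is ignorable.

```python
def observed_information(loglik, theta_hat, h=1e-3):
    """Negative second derivative of a one-parameter log-likelihood,
    approximated by central differences at theta_hat."""
    second = (loglik(theta_hat + h) - 2.0 * loglik(theta_hat)
              + loglik(theta_hat - h)) / (h * h)
    return -second


# Hypothetical records, assumed N(mu, 1).
records = [4.1, 5.3, 4.8, 5.9, 5.0]


def loglik(mu):
    """Log-likelihood of mu up to an additive constant."""
    return sum(-0.5 * (y - mu) ** 2 for y in records)


mu_hat = sum(records) / len(records)      # ML estimate of the mean
info = observed_information(loglik, mu_hat)
variance = 1.0 / info                     # large-sample variance of mu_hat
print(mu_hat, info, variance)
```

For this model the observed information equals the number of records, so the sketch can be checked by hand; with several parameters the same idea applies to the matrix of second derivatives.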
We have emphasized likelihoods and little has been said on Bayesian inference. It is worth noting that likelihoods constitute the 'main' part of posterior distributions, which are the basis of Bayesian inference. The results also hold for Bayesian inference provided the parameters are distinct, i.e., their prior distributions are independent. For data-based selection, our results agree with those of Gianola and Fernando (1986), who used Bayesian arguments.
In general, inferences based on likelihoods or posterior distributions have been found more attractive by animal breeders working with data subject to selection than those based on other methods. This choice is confirmed and strengthened by application of Rubin's (1976) results to this type of problem.

COMMENT
C.R. Henderson †*
The paper by Im, Fernando and Gianola provides an interesting and invaluable contribution to estimation and prediction in an almost universal situation in animal breeding. Very few data are available for parameter estimation or prediction of breeding values that have not arisen from either selection experiments or from field data in herds that have undergone selection.
For several years after the adoption of BLUP, a mixed linear model was assumed, and the usual description of the model was that $E(U)$ and $E(e)$ are both null, and in an additive genetic model $\mathrm{Var}(U) = A\sigma_a^2$. The assumption of $E(U) = 0$ is clearly untenable, because if selection has been effective, the expectations of the subvectors of $U$ for successive generations are increasing. Our models differ in that mine is considerably more restrictive, requiring as it does a fixed incidence matrix with conceptual repeated sampling. This of course is the traditional approach taken by classical statisticians. The problem is more difficult, however, with selection problems, as compared to nicely designed experimental situations. No attempt was made in the 1975 paper to solve the problem of estimation of variances and covariances. Rather, I solved the problem of BLUE of estimable functions of $\beta$ and BLUP of random variables, given multivariate normality and with variances and covariances known to proportionality. I pointed out that, in contrast to no-selection models, the estimators and predictors are biased if incorrect ratios are employed. Thus, it is critical to obtain the best possible estimates of these parameters. Im et al. address this problem. Several workers have speculated that REML applied to a selection model estimates the variances and covariances that existed prior to selection and which may have been altered by selection. In contrast to most of these speculations, I suggested that when selection is on observed records, the linear selection functions should be translation invariant. I think this is true under my selection model but may well not be true for other selection models. Im et al. strongly emphasize the desirability of likelihood methods. I agree with them, and in many meetings and papers have recommended these methods over some of my own, such as Method 3.
I doubt the accuracy of the last sentence of the paper under review, which states that animal breeders find likelihood methods more attractive. A study of the animal breeding literature of the past 5 years would probably disclose that animal breeders have used Method 3 much more often than REML or ML. If this is true, I certainly agree with the minority and with Im et al.
The fact is that BLUE and BLUP under my selection model are ML estimators of $\beta$ and of the conditional mean of $U$.
* Formerly of the Department of Animal Science, Cornell University, Ithaca, NY, USA.
I should now like to discuss how results compare under my selection model and under the model of the present authors. We agree partially regarding estimation when selection is on observable records. The authors' model clearly shows ignorability of selection in this case. Under my model, linear selection functions of $Y$ must either be translation invariant or it must be true that $E(L'Y_s) = E(L'Y_u)$, where $Y_s$ and $Y_u$ refer to selection and to no selection, respectively. This difference is simply a consequence of different models.
We agree that if selection is on unobserved random variables, selection is not ignorable. A special case of this has been of interest to me. Base population animals have been selected on translation invariant linear functions of data, but these are not available for analysis. Assuming that such selection results in $E(U_b) \neq 0$, a simple modification of the regular mixed model equations leads to BLUE and BLUP, and presumably these modified equations could be used to derive REML estimation of the variances and covariances (Henderson, 1988). I believe that this final question is justified, namely, "What are the operating characteristics of the authors' estimators?" Likelihood methods for variance estimation have known desirable properties only in large samples. We need studies for various methods of bias, MSE, and maximization of selection progress using BLUP with estimated variances and covariances. Probably this can be done only through extensive simulation for a wide range of parameter values, selection intensity, etc.
The authors have made a valuable contribution to the problem of estimation in selection models. This paper should motivate further studies on this problem.

It is valuable to know when selection is ignorable. I have always found it confusing that in extensions of case (a) a likelihood approach would say that selection on $y_2$ is always ignorable but Henderson (1975) suggests that selection is only ignorable if selection is on a culling variate ($w$) that is translation invariant.
In an interesting paper the same 3 authors (Gianola et al., 1988) have constructed the joint density of the data and random effects conditional on the culling variate ($D_c$). Inferences based on $D_c$ suggest that selection can be ignored only if it is based on functions of the data that do not depend on the fixed effects. It would have been instructive to relate $D_c$ to terms used in the present paper, as presumably $r$ can be related to the culling variate, and this might help to answer 3 comments I have on the use of $D_c$.
First, Gianola et al. (1988) condition on $w$, the culling variate, by integrating over $y$ and the random effects. I am not sure of the need to integrate over the random effects. One might sometimes want to consider repeated samples over (or conditioning on) all possible genetic material and sometimes only repeated samples over the same genetic material. Henderson (1988) has recently suggested a procedure that involves no integration over $y$ or the random effects, i.e. conditioning on the observed value of $w$. What should one do?
Secondly, Gianola et al. (1988) highlighted differences between using D! and Henderson's (1975) approach when selection is on random effects or residuals (w = L'u or L'e). This case is artificial in the sense that random effects and residuals will never be known exactly. But if selection is on known random effects it scarcely seems necessary to predict them using only the data. It might be more interesting to compare the 2 predictions. Similarly, if w = L'e is known, this known value could improve estimation and prediction of the other parameters.