Mixture model for inferring susceptibility to mastitis in dairy cattle: a procedure for likelihood-based inference

A Gaussian mixture model with a finite number of components and correlated random effects is described. The ultimate objective is to model somatic cell count information in dairy cattle and to develop criteria for genetic selection against mastitis, an important udder disease. Parameter estimation is by maximum likelihood or by an extension of restricted maximum likelihood. A Monte Carlo expectation-maximization algorithm is used for this purpose. The expectation step is carried out using Gibbs sampling, whereas the maximization step is deterministic. Ranking rules based on the conditional probability of membership in a putative group of uninfected animals, given the somatic cell information, are discussed. Several extensions of the model are suggested.


INTRODUCTION
Mastitis is an inflammation of the mammary gland associated with bacterial infection. Its prevalence can be as large as 50%, e.g., [16,30]. Its adverse economic effects are through a reduction in milk yield, an increase in veterinary costs and premature culling of cows [39]. Milk must be discarded due Routine recording of mastitis is not conducted in most nations, e.g., France and the United States. Instead, milk somatic cell score (SCS) has been used in genetic evaluation as a proxy measure. Heritability estimates of SCS average around 0.11 [29]. Pöso and Mäntysaari [32] found that the genetic correlation between SCS and clinical mastitis ranged from 0.37 to 0.68. Hence, selection for a lower SCS is expected to reduce prevalence of mastitis. On this basis, breeders have been encouraged to choose sires and cows having low estimated breeding values for SCS.
Booth et al. [4] reported that 7 out of 8 countries had reduced bulk somatic cell count by about 23% between 1985 and 1993; however, this was not accompanied by a reduction in mastitis incidence. Schukken et al. [38] stated that a low SCS might reflect a weak immune system, and suggested that the dynamics of SCS in the course of infection might be more relevant for selection. Detilleux and Leroy [8] noted that selection for low SCS might be harmful, since neutrophils intervene against infection. Also, a high SCS may protect the mammary gland. Thus, it is not obvious how to use SCS information optimally in genetic evaluation.
Some of the challenges may be met using finite mixture models, as suggested by Detilleux and Leroy [8]. In a mixture model, observations (e.g., SCS, or milk yield and SCS) are used to assign membership into groups; for example, putatively "diseased" versus "non-diseased" cows. Detilleux and Leroy [8] used maximum likelihood; however, their implementation is not flexible enough.
Our objective is to give a precise account of the mixture model of Detilleux and Leroy [8]. Likelihood-based procedures are described and ranking rules for genetic evaluation are presented. The paper is organized as follows. The second section gives an overview of finite mixture models. The third section describes a mixture model with additive genetic effects for SCS. A derivation of the EM algorithm, taking into account presence of random effects, is given in the fourth section. The fifth section presents restricted maximum likelihood (REML) for mixture models. The final section suggests possible extensions.
Mixture models with random effects 5

FINITE MIXTURE MODELS: OVERVIEW
Suppose that a random variable y is drawn from one of K mutually exclusive and exhaustive distributions ("groups"), without knowing which of these underlies the draw. For instance, an observed SCS may be from a healthy or from an infected cow; in mastitis, the case may be clinical or subclinical. Here K = 3 and the groups are: "uninfected", "clinical" and "sub-clinical". The density of y can be written [27,45] as: where K is the number of components of the mixture; P i is the probability that the draw is made from the ith component ( K i=1 P i = 1); p i (y|θ i ) is the density under component i; θ i is a parameter vector, and θ = θ 1 , θ 2 , ..., θ K , P 1 , P 2 , ..., P K includes all distinct parameters, subject to K i=1 P i = 1. If K = 2 and the distributions are normal with component-specific mean and variances, then θ has 5 elements: P, the 2 means and the 2 variances. In general, the y may be either scalar or vector valued, or may be discrete as in [5,28].
Methods for inferring parameters are maximum likelihood and Bayesian analysis. An account of likelihood-based inference applied to mixtures is in [27], save for models with random effects. Some random effects models for clustered data are in [28,40]. An important issue is that of parameter identification. In likelihood inference this can be resolved by introducing restrictions in parameter values, although creating computational difficulties. In Bayesian settings, proper priors solve the identification problem. A Bayesian analysis with Markov chain Monte Carlo procedures is straightforward, but priors must be proper. However, many geneticists are refractory to using Bayesian models with informative priors, so having alternative methods of analysis available is desirable. Hereafter, a normal mixture model with correlated random effects is presented from a likelihood-based perspective.

Motivation
Detilleux and Leroy [8] argued that it may not be sensible viewing SCS as drawn from a single distribution. An illustration is in [36], where different trajectories of SCS are reported for mastitis-infected and healthy cows. A randomly drawn SCS at any stage of lactation can pertain to either a healthy or 6 D. Gianola et al. to an infected cow. Within infected cows, different types of infection, including sub-clinical cases, may produce different SCS distributions.
Genetic evaluation programs in dairy cattle for SCS ignore this heterogeneity. For instance, Boichard and Rupp [3] analyzed weighted averages of SCS measured at different stages of lactation with linear mixed models. The expectation is that, on average, daughters of sires with a lower predicted transmitting ability for somatic cell count will have a higher genetic resistance to mastitis. This oversimplifies how the immune system reacts against pathogens [7].
Detilleux and Leroy [8] pointed out advantages of a mixture model over a specification such as in [3]. The mixture model can account for effects of infection status on SCS and produces an estimate of prevalence of infection, plus a probability of status ("infected" versus "uninfected") for individual cows, given the data and values of the parameters. Detilleux and Leroy [8] proposed a 2−component mixture model, which will be referred to as DL hereafter. Although additional components may be required for finer statistical modelling of SCS, our focus will be on a 2−component specification, as a reasonable point of departure.

Hierarchical DL
The basic form of DL follows. Let y and a be random vectors of observations and of additive genetic effects for SCS, respectively. In the absence of infection, their joint density is p 0 y, a|β 0 , A, σ 2 a , σ 2 e = p 0 y|β 0 , a,σ 2 e p a|A, σ 2 a .
The subscript 0 denotes "no infection", β 0 is a set of fixed effects, A is the known additive genetic relationship matrix between members of a pedigree, and σ 2 a and σ 2 e are additive genetic and environmental components of variance, respectively. Since A is known, dependencies on this matrix will be suppressed in the notation. Given a, the observations will be supposed to be conditionally independent and homoscedastic, i.e., their conditional variancecovariance matrix will be Iσ 2 e . A single SCS measurement per individual will be assumed, for simplicity. Under infection, the joint density is where subscript 1 indicates "infection". Again, the observations are assumed to be conditionally independent, and β 1 is a location vector, distinct (at least some elements) from β 0 . DL assumed that the residual variance and the distribution of genetic effects were the same in healthy and infected cows. This can be relaxed, as described later. 7 The mixture model is developed hierarchically now. Let P be the probability that a SCS is from an uninfected cow. Unconditionally to group membership, but given the breeding value of the cow, the density of observation i is where y i and a i are the SCS and additive genetic value, respectively, of the cow on which the record is taken, and β = β 0 , β 1 . The probability that the draw is made from distribution 0 is supposed constant from individual to individual.
Assuming that records are conditionally independent, the density of all n observations, given the breeding values, is The joint density of y and a is then and the marginal density of the data is p y|β, σ 2 a , σ 2 e , P = p y, a|β, σ 2 a , σ 2 e , P p a|σ 2 a da.
When viewed as a function of the parameters θ = β 0 , β 1 ,σ 2 a , σ 2 e , P , (5) is Fisher's likelihood. This can be written as the product of n integrals only when individuals are genetically unrelated; here, σ 2 a would not be identifiable. On the other hand, if a i represents some cluster effect (e.g., a sire's transmitting ability), the between-cluster variance can be identified.
DL assume normality throughout and take y i |β 0 , a,σ 2 e ∼ N 0 x 0i β 0 + a i ,σ 2 e and y i |β 1 , a,σ 2 e ∼ N 1 x 1i β 1 + a i ,σ 2 e . Here, x 0i and x 1i are known incidence vectors relating fixed effects to observations. The assumption about the genetic effects is a|A,σ 2 a ∼ N 0, Aσ 2 a . Let now z i ∼ Bernoulli (P) , be an independent (a priori) random variable taking the value z i = 0 with probability P if the datum is drawn from process N 0 , or the value z i = 1 with probability 1 − P if 8 D. Gianola et al. from N 1 . Assuming all parameters are known, one has is the probability that the cow belongs to the "infected" group, given the observed SCS, her breeding value and the parameters.
A linear model for an observation (given z i ) can be written as A vectorial representation is where Diag (z i ) is a diagonal matrix with typical element z i ; X 0 is an n × p 0 matrix with typical row x 0i ; X 1 is an n × p 1 matrix with typical row x 1i ; a = {a i } and e = {e i }. Specific forms of β 0 and β 1 (and of the corresponding incidence matrices) are context-dependent, but care must be exercised to ensure parameter identifiability and to avoid what is known as "label switching" [27]. For example, DL take X 0 β 0 = 1µ 0 and X 1 β 1 = 1µ 1 .

Joint distribution of missing and observed data
We extremize (5) with respect to θ via the expectation-maximization algorithm, or EM [6,25]. Here, an EM version with stochastic steps is developed.
The EM algorithm augments (4) with n binary indicator variables z i (i = 1, 2, ..., n), taken as independently and identically distributed as Bernoulli, with probability P. If z i = 0, the SCS datum is generated from the "uninfected" component; if z i = 1, the draw is from the other component. Let z = [z 1 , z 2 , ..., z n ] denote the realized values of all z variables. The "complete" data is the vector a , y , z , with [a , z ] constituting the "missing" part and y representing the "observed" fraction. The joint density of a, y and z can be written as p a, y, z|β 0 , β 1 ,σ 2 a , σ 2 e , P = p (z|P) p a|σ 2 a p y|z, a, β 0 , β 1 ,σ 2 e .
Given z, the component of the mixture generating the data is known automatically for each observation. Now for i = 1, 2, ..., n. Then, (7) becomes

Fully conditional distributions of missing variables
The form of (8) leads to conditional distributions needed for implementing the Monte Carlo EM algorithm.
• The density of the distribution [z|β 0 , β 1 , a, σ 2 a , σ 2 e , P, y] ≡ [z|β 0 , β 1 , a, σ 2 e , P, y] is This is the distribution of n independent Bernoulli random variables with probability parameters (6). • The density of distribution a, z|β 0 , β 1 ,σ 2 a , σ 2 e , P, y can be written as 10 D. Gianola et al. As shown in the Appendix, the density of the distribution [a|z, β 0 , β 1 , with λ as in (41). This indicates that the distribution is the Gaussian process Recall that, given z, the probability P does not intervene in any distribution. • Integrating (42) in the Appendix with respect to a gives the conditional distribution z|β 0 , β 1 ,σ 2 a , σ 2 e , P, y with probability mass function where Computing the denominator of (11) is tedious because of the many sums involved.

Complete data log-likelihood
The logarithm of (8) is called the "complete data" log-likelihood Mixture models with random effects 11

The E and M steps
The E−step [6,25,27] consists of finding the expectation of L complete taken over the conditional distribution of the missing data, given y and some current (in the sense of iteration) values of the parameters, say , where k denotes round of iteration. This expectation is known as the Q function The M−step finds parameter values maximizing E a,z|θ [k] ,y L complete . These are called the "complete data" maximum likelihood estimates. Taking partial derivatives of Q β,σ 2 a , σ 2 e , P; θ [k] with respect to all elements of θ gives ∂E a,z|θ [k] ,y Q β,σ 2 a , σ 2 e , P; and Setting derivatives to 0 and solving for the parameters leads to the "complete data" maximum likelihood estimates; these provide new values for the next and where E [k] expression = E a,z|θ [k] ,y expression .

Monte Carlo implementation of the E-step
In mixture models without random effects (i.e., with fixed a), calculation of E z|θ [k] ,y expression is direct, as the distribution z|a, θ [k] , y is Bernoulli with probability (6). In the fixed model a is included in the parameter vector, assuming it is identifiable; e.g., when the a s represent fixed cluster effects. Here, the iterates are linear functions of the missing z, and the computation involves replacing z i by its conditional expectation, which is (6) evaluated at θ = θ [k] . This was employed by DL, but it is not correct when a is random.
In a mixture model with random effects, the joint distribution a, z|θ, y , with density (42) is not recognizable, so analytical evaluation of E a,z|θ [k] ,y expression is not possible. We develop a Monte Carlo E−step using the Gibbs sampler. Observe that distributions z|a, θ, y and a|z, θ, y are recognizable. Given a, each of the elements of z is independent Bernoulli, with probability (6). Likewise, a|z, θ, y is multivariate normal, as in (10). The Gibbs sampler [35,41] draws from a, z|θ, y by successive iterative sampling from z|a, θ, y and a|z, θ, y . For example, draw each of the z i from their Bernoulli distributions, then sample additive effects conditional on these realizations, update the Bernoulli distributions, and so on. The process requires discarding early samples ("burn-in"), and collecting m additional samples, with or without thinning. The additive effects can be sampled simultaneously or in a piece-wise manner [41].
Let the samples from a, z|θ [k] , y be a ( j,k) , z ( j,k) , j = 1, 2, ..., m, recalling that k is iterate number. Then, form Monte Carlo estimates of the complete data 13 maximum likelihood estimates in (18)-(22) as and The E and M steps are repeated until parameter values do not change appreciably.
When m → ∞, the limiting form of the algorithm is the standard EM. The deterministic EM algorithm increases the likelihood function monotonically [6,25] until convergence to some stationary point, although this may not be a global maximum. Monte Carlo implementations of EM are discussed in [12,46]. For finite m, however, it is not true that the likelihood increases monotonically, although this may reduce the chance that the algorithm gets stuck in some local stationary point of little inferential value. Tanner [44] suggests keeping m low at early stages of the algorithm and then increase sample size in the neighborhood of some maximum. Due to Monte Carlo error, convergence may be declared when fluctuations of the iterates appear to be random about some value. At that point it may be worthwhile to increase m [44].

Inference about additive genetic effects
Genetic evaluation for SCS would be based on a i , the ith element of a, this being the mean vector of the distribution a|β 0 = β 0 , β 1 = β 1 , σ 2 a = σ 2 a , σ 2 e = σ 2 e , y . While β 0 , β 1 , σ 2 e , σ 2 a and P follow from the maximum likelihood 14 D. Gianola et al. procedure, a must be calculated using Monte Carlo methods. From standard theory Using (10), one then has that where z i is the mean of m draws from z i |β 0 = β 0 , β 1 = β 1 ,σ 2 a = σ 2 a , σ 2 e = σ 2 e , y , obtained from running a Gibbs sampler applied to the process a, z|θ = θ, y . One can also estimate (28) directly from the m draws for a obtained from such sampler. In theory, however, (29) is expected to have a smaller Monte Carlo error.
Another issue is how the SCS information is translated into chances of a cow belonging to the "uninfected" group. A simple option is to estimate (6) as For cow i (30) induces the log odds-ratio The first term is constant across cows, so rankings based on (30) are driven by the ratios between the densities of the SCS record of the cow in question under the "healthy" and "diseased" components. Statistically (30) does not take into account the error of the maximum likelihood estimates of all parameters. If the likelihood function is sharp and unimodal (large samples), this would be of minor concern. However, asymptotic arguments in finite mixtures are more subtle than in linear or generalized linear models [27] and multimodality is to be expected in small samples, as illustrated in [1].

General
Maximum likelihood ignores uncertainty associated with simultaneous estimation of nuisance parameters [41]. For example, if variance components are inferred, no account is taken of degrees of freedom lost for estimating fixed effects [31]. This is clear when casted from a Bayesian perspective [14]. Maximum likelihood estimates can be viewed as components of the mode of a joint posterior distribution obtained after assigning flat (improper) priors to all parameters. On the other hand, restricted maximum likelihood estimates (of variance components) correspond to the mode of the posterior distribution of the variance components after integration of fixed effects, these viewed as nuisances. We extend this idea to our mixture model, i.e., account for uncertainty about fixed effects.

Distributions
The sampling model is as in (3). Following the EM−RE ML algorithm in [6], we treat β 0 and β 1 as missing data and assign an improper uniform prior to each of these two vectors. The missing data includes now β 0 , β 1 , a and z, and the joint density of the missing data and of the observations is proportional to (8). Pertinent distributions follow.
• From (31) it can be deduced that the distribution β 0 |a, z, β 1 ,σ 2 a , σ 2 e , P, y has density Given z, the only observations contributing information about β 0 are those for which z i = 0. Hence, (32) can be recognized [41] as the density of the normal distribution where is the density of the normal distribution where D z = Diag (z i ) .

The E and M steps
The complete data log-likelihood, apart from a constant, is as in (12). Since the fixed effects must now be integrated out, the pertinent complete data maximum likelihood estimates are those of P, σ 2 e and σ 2 a . The integration over the missing data β 0 , β 1 , a and z needed for the E−step is done by sampling from a, z, β 0 , β 1 |P [k] , σ 2[k] e , σ 2[k] a , y . The Gibbs sampler draws successively from (6), (10), (33) and (35), leading to samples a ( j,k) , z ( j,k) , β Convergence issues are as for full maximum likelihood.

Inference about additive genetic effects
From now on, let the REML estimates be P, σ 2 e , σ 2 a . Genetic evaluation for SCS may be based on Monte Carlo estimates of the mean vector of a| P, σ 2 e , σ 2 a , y . These can be obtained from running a second Gibbs sampler applied to a, z, β 0 , β 1 | P, σ 2 e , σ 2 a . This takes into account uncertainty about β 0 and β 1 (in the usual REML sense), but not that associated with the variance component estimates and with P.

EXTENSIONS OF MODEL
A first extension is to allow for heterogeneous residual variance in the two components. The parameter vector would now be θ = β 0 , β 1 ,σ 2 a , σ 2 e0 , σ 2 e1 , P , and the complete data log-likelihood would take the form The Monte Carlo complete data full maximum likelihood estimates (23)- (27) have the same form as before, except for the residual components of variance, which are now falling into each of the two groups in sample j at iteration k. The imputations of z and a in the E-step need to be modified as well. The probability of membership (6) must be amended to reflect residual heteroscedasticity; this is straightforward. The density of the distribution a, z|β 0 , β 1 ,σ 2 a , σ 2 e0 , σ 2 e1 , P, y , counterpart of (9), can be expressed as Using similar algebra as before, this can be written as Then, as in (40) and (42), as in the Appendix, one is led directly to With a|β 0 , β 1 , z,σ 2 a , σ 2 e0 , σ 2 e1 , P, y recognized as above, and with z|β 0 , β 1 , a,σ 2 a , σ 2 e0 , σ 2 e1 , P, y ≡ z|β 0 , β 1 , a, σ 2 e0 , σ 2 e1 , P identified, this completes the ingredients for carrying out the E−step via Gibbs sampling.
A second extension is as follows. It is known from microarray studies (e.g., [47]) that there is differential expression of genes in diseased and healthy individuals. For example, some genes are expressed in tissues sampled from cancerous tumors whereas other genes are expressed in negative biopsies. Mixture models for diseases with a genetic basis should be flexible enough to allow for differential expression. A cow cannot be mastitic and healthy at the same time, but there may be genes that are expressed when infection takes place, but to a different extent in the absence of disease. This can be modelled by introducing a genetic correlation, much in the same way that one can think of a genetic covariance between ovulation rate and scrotal circumference in sheep [24], or between direct and maternal effects in cattle [48]. In the context of mastitis, and in a 2−component mixture, a sufficient condition for statistical identification of the genetic correlation is that some healthy animals have relatives that contract the disease. If a K−component mixture model is fitted, one would need that at least some "families" (in a loosely defined sense) are represented in all components of the mixture. In dairy cattle, where large half-sib families are common, this requirement can be met, unless K is large. Also, the assumption that genetic variances are the same for each of the two components of the mixture can be relaxed.
In mixture models, it is not always clear whether asymptotic properties of the maximum likelihood estimates hold well when one departs from standard settings (e.g., [15,22]). Hosmer [20] reported poor mean squared error behavior of parameter estimates in a 2−component mixture model, especially when the distributions overlapped and when P was close to 1. DL found that estimates of P were not reliable when the means of the mixture components were close. This implies that posterior probabilities based on maximum likelihood estimates should be viewed with caution. DL reported that posterior probabilities of infection were lower in "truly" non-infected that in infected cows, with the differences increasing as mixing proportion increased. Posterior probabilities were close to zero both for non-infected and infected cows with small SCS values, and close to 1 for infected cows with large SCS. From plots in [8], it seems that if one used the rule that the posterior probability should be larger than 1 2 to assign a cow to the "possibly infected" class, only a few cows would be false positives. It is difficult to assess the proportion of false negatives (infected cows classified as non-infected) that would result from such an assignment. DL assumed homogeneous variances in the two populations, and the maximum likelihood estimates agreed with the simulated values, particularly when the two distributions were well separated and P was far away from 1 2 . The assumption of homogeneity of variances was to ensure a global maximum of the likelihood, but perhaps at the expense of realism, as one would expect the variances to be larger in the "infected" component, merely from scaling considerations. A Bayesian analysis, exploring the entire posterior distribution of the parameters, could be useful here. If the model calls for heterogeneous dispersion parameters, likelihood-based inference would run the risk of placing the (conceptual) asymptotic distribution at a local stationary point, whereas the Bayesian analysis would integrate over the entire parameter space.
Our approach does not allow for hierarchical modelling of the prior probability of membership to the components, assumed constant for every observation. In the context of longitudinal information, evidence indicates that the incidence of mastitis is much higher around calving than at other stages of lactation (e.g., [18]). In such a setting, it would be sensible to allow for a time-varying probability of membership.
Mastitis is complex, and more than two components may be needed for adequate modelling. Different types of bacterial agents may lead to pathogenspecific distributions of SCS. Hence, a mixture model with an unknown number of components may be in order, although the difficulty of implementing such a specification must be recognized. For example, a Bayesian approach with an unknown number of components may require a reversible jump Markov chain Monte Carlo algorithm [11,34], but it is difficult to tune this type of procedure [13]. A possibly more feasible alternative is to fit a sequence of models with 2, 3, ...K components, and carry out the analysis for each of the settings. Then, assuming that models are equi-probable a priori, choose the one with the largest Bayes factor support [21] or infer breeding values or future observations using Bayesian model averaging [19]. Bayesian model averaging accounts for uncertainty about the models, and is expected to lead to better prediction of future observations in some well-defined sense [26,33].
Although SCS is claimed to be nearly normally distributed [2], the problem of outlier protection in multivariate data is a difficult one [23]. A possibility is to use thick-tailed distributions, instead of normal processes, in mixture model analysis. Robust distributions were introduced for cross-sectional data in quantitative genetics by [10,42,43]; extensions are in [37,41]. McLachlan and Peel [27] have described mixtures using the t−distribution, one of the many possible thick-tailed processes, and warn that this distribution, like the normal, cannot accommodate asymmetrically distributed errors. However, an asymmetric t-distribution [9], or skewed-Gaussian distributions can be used in mixture models without much difficulty. Bayesian Markov chain Monte Carlo procedures allow all these extensions.
A mixture model may be even more powerful when other traits are brought into the picture. For example, the analysis might involve SCS, protein yield and some udder type trait simultaneously. A mixture model for vectors would need to be fitted here.

CONCLUSION
In conclusion, finite mixture models provide an elegant manner of dealing with SCS data in the context of genetic improvement of resistance to mastitis. Irrespective of whether Bayesian or likelihood-based methods are employed, the main challenges may reside in developing meaningful mixture models. These will need to be validated using information from cows where the disease outcome is known. In short, research is needed to establish their usefulness, to identify models having an adequate predictive ability, and to develop feasible computational procedures. For example, a non-Bayesian analysis may be carried out more efficiently with algorithms other than EM.
The density of the conditional distribution a|z, β 0 , β 1 ,σ 2 a , σ 2 e , P, y is arrived at by fixing z in (42)