The analysis of disease biomarker data using a mixed hidden Markov model (Open Access publication)

A mixed hidden Markov model (HMM) was developed for predicting breeding values of a biomarker (here, somatic cell score) and the individual probabilities of health and disease (here, mastitis) based upon the measurements of the biomarker. At a first level, the unobserved disease process (Markov model) was introduced and at a second level, the measurement process was modeled, making the link between the unobserved disease states and the observed biomarker values. This hierarchical formulation allows joint estimation of the parameters of both processes. The flexibility of this approach is illustrated on the simulated data. Firstly, lactation curves for the biomarker were generated based upon published parameters (mean, variance, and probabilities of infection) for cows with known clinical conditions (health or mastitis due to Escherichia coli or Staphylococcus aureus). Next, estimation of the parameters was performed via Gibbs sampling, assuming the health status was unknown. Results from the simulations and mathematics show that the mixed HMM is appropriate to estimate the quantities of interest although the accuracy of the estimates is moderate when the prevalence of the disease is low. The paper ends with some indications for further developments of the methodology.


INTRODUCTION
Studies have shown variability among cows for natural resistance to intramammary infection (IMI). Selection is therefore possible but direct measures of IMI are not readily available. Usually, information on IMI is based upon biomarkers such as somatic cell scores (SCS), electrical conductivity, immunoglobulin or acute phase proteins (reviewed in [8]). One important difficulty in using these biomarkers to find the most resistant animals is that factors known to influence their expression may be different in healthy (IMIÀ) and in infected (IMI+) cows. Since these are usually unidentified, breeding values tend to be biased. To reduce this bias and to infer more precisely the cows' individual probabilities to be IMIÀ or IMI+, several authors have used the mixture model methodology on SCS [2,9,12,17]. A generalization of the mixture model is the hidden Markov model (HMM) that presents the advantages of not only estimating individual probabilities of being infected but also of predicting individual probabilities of new infection and of recovery. Both are useful to compute epidemiological measures of IMI spread within a population and to assist mastitis control programs.
The objective of this study was to present the mathematical formalism behind the HMM methodology as it may apply to the analysis of infectious disease biomarkers assumed to be dependent upon the genetic make-up of the cows. The fit of the HMM was assessed on simulated data based on parameters obtained in a survey of clinical mastitis cases. Bayesian estimates of the parameters were obtained using the Gibbs sampler. Finally, limitations and possible extensions of the current approach are discussed.

MATERIALS AND METHODS
Throughout, k indexes the individual cow, t (t = 1-T ) is the follow-up time point during the lactation (e.g., month-in-milk), y t k is the value of the biomarker observed at t on animal k, and z t k is the corresponding unknown health status (IMIÀ or IMI+). Let z t k ¼ 0 if y t k is from an unknown IMIÀ sample and z t k ¼ 1 if y t k is from an unknown IMI+ sample. For simplicity, T is assumed constant for all cows. We use the notation of Ødegård et al. [17] in their finite mixture model, with slight modifications.

General formulation of the model
Conditionally on the unknown vector z, it was assumed that the vector of observations y could be described by the linear model: where y is the (NT · 1) data vector of y t k , l 0 and l 1 are (T · 1) vectors of fixed effects for data on an IMIÀ or IMI+ cow, respectively, a is the (Na · 1) vector of random additive genetic effects; M 0 is the (NT · T) matrix with elements = 1 if z t k ¼ 0 and ¼ 0 otherwise; M 1 is the (NT · T) matrix with elements = 1 if z t k ¼ 1 and ¼ 0 otherwise; e is the (NT · 1) vector of residuals; Z is the (NT · Na) incidence matrix relating a to y, N is the number of animals with data and Na is the number of animals with pedigree records. The conditional distribution of y, given the vector z, the location, and scale parameters, was assumed to be: where F i is the (NT · NT) diagonal matrix with elements = 1 if z t k ¼ i and = 0 otherwise. The parameters r 2 0 and r 2 1 are the residual variances associated to a record on an IMIÀ and IMI+ cow, respectively. For the additive effects, it was assumed that ðajr 2 a Þ $ N ½0; A r 2 a , where r 2 a is the additive genetic variance and A is the matrix of additive genetic relationship between animals.

Sampling distribution of the observations given group status
The density of the vector y for the subset of the N i observations with z t k ¼ i, i.e. {z = i}, given the location parameters and the residual variances, can be written as:

Prior distributions of parameters and of the unknown status vector
For i = 0 or 1, normal prior densities were assumed for the location parameters: where 1 is the (T · 1) vector of 1. The prior density for the additive effects, conditionally on the additive variance, was: : Under simple mixture models, the individual elements of the classification vector z are assumed to be independent a priori and to follow the same Bernoulli distribution with the mixing proportion as the parameter. Here, under an equally simple mixed HMM, the variables z t k do not follow the same distribution. The first element of the series ðz 1 k Þ follows a Bernoulli distribution with k k as the parameter while the other elements follow Bernoulli Mixed hidden Markov model 493 distributions with state transition probabilities from z tÀ1 k to z t k as parameters. Formally, the unknown state at time t may be decomposed in: where pðz t k ¼ i z tÀ1 k ¼ jÞ are the state transition probabilities with i, j = 0 or 1. The state transition probabilities are assumed to possess the first-order Markov property namely that, given the present state, the future and past states are independent or that the current value z t k À Á depends solely on the most recent past value ðz tÀ1 k Þ. Transition probabilities are also independent of the actual time at which the transition takes place (stationarity assumption). Then, we have pr z t

Priors for variance components and probabilities
Scale-inverse chi-square distributions with m degrees of freedom and scale parameters; ðs 2 a ; s 2 0 , and s 2 1 Þ were used for the variance components: Finally, k k , p 00 k , and p 01 k were assigned uniform (i.e. Beta(1, 1)) prior distributions.

Fully conditional posterior distributions
The conditional posterior distributions of each parameter (or block of parameters) are required for implementing a Gibbs sampler. Conditional on y and z, these conditional posterior densities are analytical because they only involve one of the possible realizations in the space of all possible sequences of z. For the location parameters, we have: where H refers to values of all parameters that the conditional distributions depend upon (i.e. all parameters except the one under consideration), g i;t k is the number of cows with IMIÀ (i = 0) or IMI+ (i = 1) unknown state at the tth time.
Let W ¼ ½Z M 0 M 1 and the vector of parameters h ¼ ½a l 0 l 1 0 . Hence, one can write the model as: y = Za + M 0 l 0 + M 1 l 1 + e = Wh + e. By partitioning the parameter vector h as h 1 ¼ a and h 2 = ½ l 0 l 1 0 , we can compute Mixed hidden Markov model 495 the conditional posterior distribution of the vector of additive genetic values as ðajH; y; zÞ $ N ðâ 1 ; C À1 11 Þ withâ ¼ C À1 11 r 1 À C 12 h 2 ½ and r 1 , C 11 , C 12 = the corresponding partition of C = [W 0 R À1 W + A À1 /r 2 a ] and r = W 0 R À1 y. The fully conditional posterior density of the genetic variance is: which is in the form of a scale-inverse chi-square density, with [N + m] degrees of freedom and scale parameter [a 0 A À1 a þ ms 2 a ]. Likewise, the fully conditional densities of the residual variances for IMIÀ and IMI+ observations are: which are in the form of scale-inverse chi-square densities, with [N i + m] degrees of freedom, and with scale parameter ¼ ms For the kth cow, the fully conditional posterior densities of the parameters k k , p 00 k , and p 01 k are: prðk k jH; y; zÞ / k K 0;1 k þ1 ð1 À kÞ K 1;1 k þ1 ; prðp 00 k jHÞ / ðp 00 k Þ n 00 k þ1 ð1 À p 00 k Þ n 10 k þ1 ; prðp 01 k jH; y; zÞ / ðp 01 k Þ n 01 k þ1 ð1 À p 01 k Þ n 11 k þ1 which are in the form of beta distributions. Finally, one must compute the fully conditional distribution for individual z t k . These may be obtained either from the pr(z| H; y) or by considering represent the hidden vector z without z t k , as suggested by one referee. Under the first alternative, pr zjH ð Þ can be decomposed as: which leads to a stochastic version of the forward-backward algorithm in which z 1 k is sampled from a Bernoulli distribution with parameter pr z 1 k may be stored gradually as t increases from 1 to T: The forward and backward probabilities can be efficiently calculated by the following recursion formulae [10]: Note that this alternative uses T different components while the first alternative generates a realization of z directly from its conditional pðzjy; H) it presents also a more complicated correlation structure (since each z t k depends on both z tÀ1 k and z tþ1 k ) than the first alternative, which may lead to a slower mixing chain.

Implementation of a Gibbs sampler
The following steps describe how a Gibbs sampling can be implemented for our model, using the stochastic version of the forward-backward algorithm to sample z: (1) Set initial values for parameters as needed.
(3) Replace l i (i = 0 and 1) with Compute and store f 0j;t k for t = 2, ..., T and j = 0 or 1. Then, sample z t Sample k k and p ij k , from their corresponding beta distributions with parameters K i;1 k þ 1 and n ij k þ 1, for i, j = 0 and 1, respectively. (9) Repeat (2)-(8) q times for burn-in as needed. Then, sample all parameters d times. The total number of cycles is q + d.

Simulations
The model was evaluated using simulated values for the biomarker (here, SCS) with genetic effects considered as having the same distributions for cows with IMI+ and IMIÀ samples. Each simulation was replicated 10 times. Simulated rather than real data were used because a negative diagnosis, even based on the absence of bacteria in cell culture, is not a guarantee of health and the opposite has also been observed [22].

Simulated data
The results from the field study of de Haas et al. [6,7] on pathogen-specific somatic cell count (SCC) curves among multiparous cows were used to simulate the means of monthly samples from IMIÀ and IMI+ cows. Figure 3b of de Haas's paper [6], shows that in cows clinically infected with Escherichia coli, SCC increase rapidly after infection occurring around the second month-in-milk, peak at 2000 cells per lL above pre-infection values, and return to pre-infection levels one month later. On the contrary, the presence of a long increased SCC, without recovery within four consecutive months, was common in lactations with clinical Staphylococcus aureus mastitis. In the cows without clinical mastitis, SCC followed an approximate inverse lactation curve. The SCC values were log 2 -transformed in SCS and used to simulate the SCS means, as explained below. In the simulations, it was also considered that cows might be classified as high and moderate responders on the basis of the extent of their immune 498 response to a particular infection [14]. Therefore, SCS were considered at higher values and of longer duration in high than that in moderate responders (Fig. 1).
In the simulations, three discrete generations were considered with 400 cows per generation. No selection was applied, sires were selected from 30 different bulls, each cow was replaced by a daughter and mating was at random. Breeding values for base animals were sampled from a normal distribution with null mean and additive variance of 0.15 or 0.25. Values for the additive variance were taken from the literature [4]. Breeding values for non-base animals were sampled from a normal distribution with the mid-parent value as mean and variance = 0.15/2 or 0.25/2. Inbreeding was ignored.
Somatic cell scores under healthy (SCS 0 ) and infected (SCS 1 ) states were simulated as follows: where l 0 and l 1 are the (T · 1) vector means of both distributions, a is the (N · 1) vector of breeding values (computed as above), and M 0 and M 1 are the incidence matrices relating l 0 and l 1 to SCS 0 and SCS 1 , respectively. The number of observations per cow was set at T = 10 or 20. The vectors e 0 and e 1 were sampled from two normal distributions with null means and residual variances set at 1.0 and 1.4. The values for the residual variances were found in the literature [13]. Each element of l 0 and l 1 was taken from the curves observed in cows without and with mastitis, and for high and low responders (Fig. 1). The cows were assigned to a group (IMI+ or IMIÀ) at random using appropriate membership probabilities: the proportion of cows with at least one IMI+ sample was set at P cow = 20 and 50% and, among IMI+ cows, the proportion infected with E. coli was set at P coli = 0, 50, and 100% (the other IMI+ cows were considered infected with S. aureus). If a cow was assigned to the IMI+ group, the time at which the clinical episode starts (= t*) was sampled from an exponential distribution with a scale parameter 3, which is in agreement with the reported median time of first occurrence of mastitis, i.e. two to three months [6].

Evaluation of the accuracy of the estimates
The estimates ðl t i ;r 2 0 ;r 2 1 ;r 2 a ;âÞ of the parameters ðl t i ; r 2 0 ; r 2 1 ; r 2 a ; aÞ were computed, after burn-in, as the means of the posterior distributions. Their accuracies were assessed over the range of parameter values (sensitivity analysis) as follows. For the predicted breeding values, the Spearman correlation coefficient (corr BV ) with the true breeding values was computed for each replicate and averaged over the 10 replicates. For residual and additive variances, the differences (bias r0 , bias r1 , and bias ra ) between estimates and simulated values were computed for each replicate and averaged over the 10 replicates. For the location parameters, the biases (bias l0 and bias l1 ) were calculated between the estimates and y t i , where y t i ¼ P k¼1;n i t ðy t k z t k ¼ iÞ=n i t is computed with known values for z t k : Finally, sensitivity (SE), specificity (SP), and probability of correct classification (PCC), were computed at each iterative step as: After burn-in, these were averaged over the d Gibbs rounds and the 10 replicates.

RESULTS AND DISCUSSION
Results are shown in Tables I and II of the appendix. Visual inspection of the algorithmic convergence showed that a total of 1000 cycles and a burn-in (q) 500 of 200 runs were sufficient to remove the influence of the prior values and obtain stable estimates. Thus, all results presented correspond to the last (d = 800) runs of the Gibbs algorithm. This may seem very few cycles but results were checked for three simulated data sets over a higher number of cycles of the Gibbs sampler. Convergence rates were also checked with an EM algorithm and the Gibbs sampler on models similar to those used in the simulation of this study but without genetic covariance structure (SCS i = M i l i + e i ). Explanations may be linked to the simplicity of the pedigree structure, small number of cows and the fact that values for m 0 and s 2 p were obtained from the data.

Overall accuracy of the estimates
Overall, the sensitivity was high (SE~90%) but the specificity low (SP6 0%). Because of this high sensitivity, we can be confident that a cow witĥ z t k ¼ 0 is healthy and spare the costs of further testing (e.g. bacteriological cultures) or useless treatment. On the other end, the low specificity indicates that cows withẑ t k ¼ 1 should be further tested to confirm the clinical suspicion. These observations may suggest some economic interest in HMM.
Before any testing, the probability for a cow to be IMI+ can only be estimated from the prevalence of the disease in the population, while, after testing, this probability is estimated from the posterior probability of being IMI+ given a positive test (also called the positive predictive value). With SE = 90% and SP = 60%, the difference between prior and posterior probabilities is maximum at disease frequencies between 20 and 50%, with posterior probabilities 20% higher than the prior probabilities. These frequencies are within the range of prevalence typically reported for mastitis, as illustrated in the following few studies. In Finland, Pitkälä et al. [18] reported 31% of cows with SCC > 300 000 mL À1 (mastitis) in 2001. In Switzerland, Roesch et al. [19] reported 40% cows showing at least one positive California Mastitis Test in at least one quarter at 31 days and 102 days post partum. In a survey of clinical and subclinical mastitis in England and Wales, the mean incidence of clinical mastitis recorded by the farmer was 47 cases per 100 cows per year [3]. In Canada, Sargeant et al. [21] have observed that 19.8% of cows experienced one or more cases of clinical mastitis during a two-year observational study. Therefore, HMM may also be of interest in field studies, when it is necessary to precisely identify infected cows.
Breeding values from the HMM seemed accurate in predicting the true additive genetic merit of the cows. Indeed, the correlation (corr BV ) between simulated and estimated breeding values varied from 65 to 79% over the whole data sets. This is close to the correlations of 70-75% computed as the square root of the coefficient of determination (CD), where CD ¼ 1 À PEV=V, PEV = prediction error Mixed hidden Markov model variance = ½W 0 R À1 W þ A À1 =r 2 a À1 and V = true additive variance = Ar 2 a [11]. The PEV was computed with the values of the parameters used in the simulation and weighted by the true proportion of IMIÀ and IMI+ per cow. On the contrary, the HMM was less efficient in estimating the parameters for the IMI+ group. Indeed,r 2 1 had a tendency to underestimate andl t 1 to overestimate the values used in the simulation. The biases varied from À1.33 to À0.13 (mean = À0.59) forr 2 1 and from À0.02 to 3.26 (mean = 1.14) for l t 1 . The magnitude of the biases decreased when the amount of information available on the IMI+ cows increased, as discussed in the sensitivity analyses below.

Sensitivity analyses
The robustness of the HMM approach was assessed by computing the biases in the estimates over a wide range of values for the simulated parameters. Overall, estimates of means and variances were rather insensitive to the values of the corresponding simulated values but they were sensitive to the proportion of cows with at least one IMI+ sample (P cow ) and to the proportion of E. coli among infected cows (P coli ). This suggests that HMM estimates are sensitive to the amount of data available to compute them. For example, biases in the estimation of both location parameters ðl t 0 ;l t 1 Þ were the highest when P cow was the lowest (Fig. 2), suggesting that it is necessary to have a sufficient number of observations per cow when the disease prevalence is low.
Similarly, SE, SP, and PCC decreased as the proportion of E. coli infection (P coli ) increased (Fig. 3). This was not surprising because, in cows infected with

502
E. coli, only a few simulated SCS were higher than SCS for the IMIÀ samples, as is observed in naturally occurring E. coli infections usually of short duration. The level of response to infection influenced estimates of transition probabilities, on the contrary to estimates of both location parameters and breeding values. For example, SE and PCC were higher among high (SE = 92%; PCC = 64%) than moderate (SE = 80%; PCC = 60%) responders, suggesting that HMM is more accurate when IMIÀ and IMI+ distributions are further apart. Conversely, accuracy ofr 2 1 worsened when the distance between IMIÀ and IMI+ distributions increased with bias r1 = À0.51 for moderate and bias r1 = À0.80 for high responders.
Note that SE and SP were insensitive to change in disease frequency (P cow ), as they should be by definition, conversely to PCC that is, by definition, a function of the disease frequency: PCC = [SE * pr(IMI+)] + [SP * pr(IMIÀ)].
Finally, note that SE and SP reported here are different from SE and SP in Ødegård et al. [17] in which where PPM i is the posterior mean of the estimates of z i averaged over Gibbs samples (after burn-in), t i = 0 if IMIÀ, t i = 1 if IMI+, and i = 1-n cows.

GENERAL DISCUSSION
The main advance of this paper is the presentation of an HMM in which genetic random effects are added to the conditional model for the observed data. In the subject-area literature, HMM with random effects have been used in a very limited way. Only recently, has Altman [1] introduced a mixed HMM to study lesion counts in multiple sclerosis patients. In her model, parameters for the observed and hidden data are allowed to vary randomly among patients, although they are assumed independent from each other (no genetic relationship). This suggests a natural extension of the present HMM, i.e., allowing the parameters of the hidden Markov chain to vary randomly among cows. However, the interpretation of the results of such an extended model will be delicate because sets of identical genes may be associated to both IMI and SCS (confounding effects). Stated otherwise, the total genetic effects on SCS would be a combination of the effects of genes responsible for presence or absence of IMI (resistance to infection) and for the magnitude of the SCS response after IMI (tolerance after infection).
Structural equation modeling is a technique to evaluate models with different hypothesized relationships among variables. In this context, it would be interesting to evaluate the different models proposed in Figure 4 to determine the amount of relationships between genes insuring tolerance or resistance to infection.
In the model proposed here, the biomarker value at one specific time is independently influenced by the IMI status and by some genes. However, both the IMI status and the biomarker values could also be under the influence of this same set of genes (model b of Fig. 4). The relationship between genes, biomarker, and IMI status can become even more complicated with different sets of correlated genes influencing the expression of both traits (model e).This is important for the long term because some epidemiological models predict that selection for resistant cows (no infection) may not be as durable as selection for tolerant (infection but no disease) cows [16,20]. Increased resistance would reduce disease transmission, reducing the fitness advantage of carrying the resistant genes, and possibly impose pressure upon the pathogen to evade the control strategy. By contrast, as genes conferring disease tolerance spread within a population, the disease incidence rises, increasing the evolutionary advantage of carrying the tolerance genes, without leading to genetic changes in the parasite population.
Other extensions of the HMM are possible. Trends and seasonality in SCS can be readily accommodated to relax the assumption of timeindependence between transition probabilities [15]. Prior information on the parameters can be included to increase accuracy and speed up convergence.

504
Location parameters can be made more realistic by considering the effects affecting SCS values, such as age, herd or season. Elements of the M matrices could take different values than zero or ones to reflect the different effects on SCS for different parts of the lactation. The genetic variance could also be different for IMIÀ and IMI+ samples and would allow for genetic difference in the response in SCS to IMI.
The first-order Markov assumption is also a limiting feature of the HMM and mechanisms of transmission of the IMI between cows could also be considered more precisely in deriving the transition probabilities. Indeed, transmission of infection is a complex process that involves the mixed structure of the population (as it determines the probability of contact between animals), the infectiousness of the contagious animal (or infective dose), and the susceptibility of a healthy cow (i.e., its probability of getting infected after contact with a contagious animal). To solve these issues, Cooper and Lipsitch [5] have proposed to model the transition probabilities of the hidden Markov chain in terms of the parameters of epidemiological models used to describe the transmission of an infectious disease at the population level.

CONCLUSIONS
In summary, it is shown that the mixed HMM provides a good fit to the data sets simulated in this study. The advantages of the HMM over other approaches are the prediction of health or disease status, the reduction of confirmatory diagnosis costs and the increased accuracy in breeding values. However, future work is necessary to extend the HMM proposed here, one of the most important   508  Table II. Accuracy of the estimates of the mixed HMM as a function of the level of response to infection, high (H) or moderate (M), number of samples per cow (T), percentage of cows with at least one IMI+ sample (P cow ), percentage infected with E. coli (P coli ) and residual and additive genetic variances (r 2 0 ; r 2 1 ; r 2 a ). The accuracy is determined by using the differences between values used in the simulations and estimates of means (bias l0 , bias l1 ) and residual variances (bias r0 , bias r1 ) in IMIÀ and IMI+ cows, respectively; the differences between values used in the simulations and estimates of additive genetic variance (bias ra ); and the correlation between predicted and simulated breeding values (corr BV ). Data sorted by corr BV . corr BV bias r0 bias r1 bias ra bias l0 bias l1 T P cow P coli r 2