A periodic analysis of longitudinal binary responses: a case study of clinical mastitis in Norwegian Red cows

A Bayesian procedure for analyzing longitudinal binary responses using a periodic cosine function was developed. It was assumed that, after adjustment for "seasonal" effects, the oscillation of the underlying latent variables for longitudinal binary responses was a stationary series. Based on this assumption, a single dimension sinusoidal analysis of longitudinal binary responses using the Gibbs sampling and Metropolis algorithms was implemented in a study of clinical mastitis records of Norwegian Red cows taken over five lactations.


INTRODUCTION
The study of stationary time series by means of auto-covariance functions is often referred to as a "time domain" analysis. This can be shown to be equivalent to a "frequency domain" analysis, which is based on a spectral representation of a time series [2]. In the latter, characteristics of a process are summarized in terms of frequency, amplitude and phase. Frequency is the number of complete cycles of an oscillation occurring within a period of time; amplitude is the maximum value of a periodically varying series, and phase is the distance between the origin and the nearest peak in time [2].
The analysis of co-movement between time series is of interest in economics, and different approaches have been developed for testing whether several economic variables have common trends or common cycles [9,10,16]. However, a detectable periodic component must be present in each time series before these tests are applicable. Stoffer [15] proposed a random effects approach for real-valued time series, in which frequency was partitioned into a frequency common to all sequences, plus a random frequency effect peculiar to an individual series. This can enhance a standard frequency analysis, because, rather than focusing on an individual series only, use is made of the relationships among all time series. If an individual series contains the frequency of interest but with small amplitude, or with a lot of noise, information on other series containing the similar harmonic curve may help to capture a particular feature of the process [15].
Mastitis is an inflammation of the mammary gland associated with bacterial infection; it is the most costly disease in dairy cattle [6]. A periodic analysis implies that a signal repeats itself periodically, and this seems to be so for clinical mastitis in cows when followed over several lactations [7]. In statistical analysis, presence or absence of infection is typically related to some latent variable, called "liability". The trajectory of the liability to clinical mastitis for each animal could be represented as the sum of some systematic effects, plus an oscillation of a cosine function at certain frequency, plus white noise (random residual). The frequency and amplitude of the oscillation provide information about the variation of liabilities over time. and phase (θ = 1). For example, if a cow had a larger amplitude, it would be more susceptible to clinical mastitis at certain periods of lactation than a cow with a lower amplitude. In Figure 1(b), two cows have the same amplitude (a = 0.3) and phase (θ = 1), but different frequencies ( f = 0.2 versus f = 0.5). The cow with higher frequency (dotted curve) would be expected to contract mastitis more frequently. The phase parameter θ indicates how early a first mastitis case would occur. As shown in Figure 1(c), the cow associated with the solid curve has highest risk of first mastitis at period 1, whereas the other cow is more susceptible to a first mastitis case at period 3 of the oscillations. On this basis, animal breeders would select cows that, genetically, have smaller amplitude and frequency, and larger phase of the cycle.
If a random effect is included in a sinusoidal model, each cow would have a specific frequency, amplitude and phase. In the sense of the time domain, this means that each cow would have different mastitis risks at different time points. For instance, the amplitude of the signal might be modeled with fixed and random components, and a larger absolute value of the amplitude would indicate a higher risk of mastitis. If two cows have the same frequency parameter, the one with a larger amplitude would have a higher chance of developing mastitis.
Our objective is to embed Stoffer's [15] approach into a generalized nonlinear mixed effects model for binary response variables. Latent variables representing, e.g., "liability to disease", will be modelled such that each subject has specific frequency and amplitude parameters over time. A Bayesian model with Markov chain Monte Carlo computations is presented in Section 2 and an application to clinical mastitis data in dairy cattle is given.

Statistical model
The classical threshold model of quantitative genetics was adopted [4,5,17]. For a binary response (the idea can be extended to ordered polychotomous data), this model postulates the existence of an unobserved underlying or latent variable l (e.g., liability to disease) and a fixed conceptual threshold such that, if liability exceeds the threshold, then disease occurs. In a binary response situation, the threshold can be set arbitrarily to zero. Consider a longitudinal case, where the binary response variable y it for cow i at period t takes the value 1 if the cow has the disease, and 0 otherwise, according to the rule: where l it is the liability underlying y it . Liability was modeled as a function of some systematic effects and of a cosine function of amplitude, frequency and phase, plus random noise. The model for liability was l hi jkt = x hi jkt β + z h,hi jkt h + z p,hi jkt p + a j,k cos{2π f j,k (t − θ)} + e hi jkt , (2) where l hi jkt is the liability of cow i at period t, calving in herd h, age-season j and daughter of sire k. Here, x hi jkt is an incidence vector relating liability to some systematic effects contained in β; z h,hi jkt is an incidence vector relating liability to herd effects h; z p,hi jkt is an incidence vector relating liability to cow-specific effects contained in the vector p. Further, a j,k and f j,k are the amplitude (the maximum or minimum value of the cosine series) and frequency (the number of complete cycles of the cosine oscillation within a period of time), respectively, of a cosine function peculiar to age-season of first calving j and sire family k. θ is the distance between the origin and the first peak (phase), assumed to be a known constant across all series; this can be relaxed if different phases are evident across series. Finally, e hi jkt ∼ NIID(0, 1) where NIID stands for "normally, independently and identically distributed." The assumption is that if liability is "adjusted" for the effects of β, h and p, the remaining part of the model produces a stationary series. The oscillation of the cosine curve depends on a "fixed" part (i.e., the age-season of first calving j) and a random part (i.e., the sire family k). The term e hi jkt is a random noise, assumed to follow a normal process with mean 0 and variance 1. Since liability cannot be observed, the residual standard deviation is taken as unit of measurement. The conditional distribution of the liability of cow i at time t is then where TN stands for a normal distribution truncated following the rule in (1). That is, the liability has to be larger or equal to zero if a cow had mastitis and smaller than zero, otherwise. The amplitude and frequency parameters were modeled hierarchically as Here, b j is the "fixed" amplitude effect, peculiar to age-season of first calving class j, and ω j is a frequency that is peculiar to all cows in age-season of first calving class j. Further, s 1,k and s 2,k are random amplitude and frequency effects, respectively, which are peculiar to all cows that are daughters of sire k. For simplicity, we will assume that the frequency ω j is a known constant across all series, and drop the subscript for age-season class from the notation. The model for liability is nonlinear with respect to frequency, but linear related to amplitude. Bayesian MCMC methods, such as Gibbs sampling and the Metropolis algorithm, were used for drawing samples from posterior distributions of interest. The sampling space of each element of β, p and h is the real line. The parameter space for amplitude, a j,k = b j + s 1,k , is also the real line. At least two time units are required for a cycle to be finished, so the "fastest" curve (that having the highest frequency) that can be observed is one with a period of two time units. On the other hand, the "slowest" cosine curve is one with a period of T time units, where T is the maximum number of time units. Therefore, the frequency f k = ω + s 2,k lies between 1 T (slowest curve) and 1 2 (fastest curve).

Prior and posterior distributions
A bounded uniform prior distribution U[β min , β max ] was assigned to each of the elements of β , where β min = −99 and β max = 99. The vector of herd effects h was assumed to follow the normal process where σ 2 h is the variance between herds, and the cow-specific effects p (usually termed "permanent environmental effects" in animal breeding) were assumed to be normally distributed, a priori, as where σ 2 p is the variance of the distribution. Independent bounded uniform priors were assigned to the "fixed" ampli- The random sire variables s 1,k and s 2,k , interpretable as half of the additive genetic values of a sire, were assumed to follow a truncated bivariate normal distribution, which in the absence of truncation would be Here, G 0 is the 2 × 2 (co)variance matrix between the random sire amplitude and frequency effects. The truncation points for the normal distribution are described later. Let s = [s 1 , . Thus, the vector s ∼ N(0, G 0 ⊗A), where A is the known additive genetic relationship matrix between sires, and q is the total number of sires in the pedigree. As mentioned, the frequency f k = ω + s 2,k of the curve must be such that where 1 T is the bound pertaining to the slowest curve and 1 2 is the upper bound for the fastest curve. These conditions had to be met during the MCMC sampling processes.
Independent scaled inverse Chi-square prior distributions were assigned to both the herd and permanent environmental variances, that is are the degrees of freedom, and τ h = 0.06 and τ p = 0.1 are the scale parameters. The (co)variance matrix G 0 was assumed to follow, a priori, the inverse Wishart distribution where ν G 0 = 4 is the degrees of freedom and V G 0 is the scale matrix with diagonal elements equal to 0.0002 and off-diagonal element equal to 0.00001.
The joint prior density of all uncertain quantities, after augmentation with the liabilities (represented by the vector l), was assumed to have the form where the dependence on hyper-parameters is suppressed in the notation. The joint posterior density above can be written explicitly as, where N is the number of cows with records, T i is the number of periods for cow i, and N h is the number of herds. This density is defined within the boundaries defined by the truncation points of the priors and by the parameter relationships.
The fully conditional posterior distributions of all unknowns can be derived from the joint posterior density (e.g., [1,13,14]). Given the liabilities, all conditional posterior distributions are recognizable, except those of the frequency parameters of the cosine function, since these enter non-linearly into the model. The vector β has a truncated multivariate normal distribution, and the conditional posterior distributions of both h and p are multivariate normal.
Thus, each element of β has a truncated univariate normal conditional posterior distribution, and each element of h and p follows a univariate normal distribution.
The "fixed" (b j ) and random parts (s 1,k ) of the amplitude enter linearly in the model for liabilities, so their conditional posterior distributions can be derived from the joint posterior distribution directly. Let l * be a vector of offsets with elements equal to where l hi jkt is liability, as before. Note that l * hi jkt = b j cos{2π f k (t − θ)} + e hi jkt . Also, let l * j = {l * hi jkt } be a vector of all offsets within the j th age-season of first calving, and c j be a column vector, with elements equal to cos{2π f k (t − θ)} for all records in the j th age-season class. The conditional posterior distribution of the "fixed" amplitude b j is then The conditional posterior distribution of the random sire amplitude s 1,k can be derived in a similar fashion. Let now l * be a vector of offsets with elements equal to Also, let l * k = {l * hi jkt } be a vector of all offsets of daughters of sire k. In addition, let c k be a column vector, with elements equal to cos{2π f k (t − θ)} for all records of sire k. The conditional posterior distribution of the amplitude of sire k is, Here, g i j is the i th row and j th column of matrix G −1 0 ; A k,k is the k th diagonal element of the matrix A −1 ; A k,−k is the k th row of the matrix A −1 without the k th element; s i,−k , where i = 1, 2, is the vector obtained from s i by deleting its k th row.
The conditional posterior distributions of the frequency (s 2,k ) of sires with progeny records do not have a closed form. A Metropolis algorithm was used to sample these elements from their conditional posterior distributions. The algorithm is described in the following section.
The herd and cow-specific variances have scaled inverse Chi-square conditional posterior distributions. The genetic (co)variance matrix has a conditional posterior distribution that is in an inverse Wishart form. In short, Gibbs sampling can be used to draw from all recognizable conditional distributions, whereas Metropolis steps can be embedded in the scheme in order to draw the sire specific frequency effects.

Metropolis algorithm
A single-site Metropolis algorithm was tailored to sample sire-specific frequency parameters for sires with daughter records from their posterior distributions. The Metropolis implementation was done within each sire class. Univariate normal proposals were used for s 2,k , and new candidates were generated from: where s (p−1) 2,k is the value of s 2,k at the (p − 1) th iteration of the sampler, and σ 2 s 2 is some fixed positive numbers leading to a reasonable rejection rate in the sampling process. Let again l * hi jkt = l hi jkt − x hi jkt β−z h,hi jkt h−z p,hi jkt p, and recall that f k = ω + s 2,k for all records of cows that are daughters of sire k. Then, the probability of accepting s * 2,k as a draw from its posterior can each be calculated as: Here, N k is the number of daughters of sire k and G −1 = G −1 0 ⊗ A −1 . For sires that were present in pedigree file but lacked progeny records, the fully conditional posterior distributions of their frequency parameter can be shown to be the normal processes s 2,k ∼ N((g 22 A k,k ) −1 (−g 22 A k,−k s 2,−k − g 12 A k, s 1 ), (g 22 A k,k ) −1 ).
Above, A k,−k is the k th row of the A −1 matrix obtaining after deleting the A k,k element, and A k, is the k th row of the A −1 matrix. In addition, s i,−k denotes a vector obtained by deleting the k th row of s i .

Data
The data set represents a cohort of 36 178 cows of the Norwegian Red (NRF) breed from 5286 herds, which are the progeny of 245 sires. These cows were followed longitudinally for presence or absence of clinical mastitis until culled, or until completion of their fifth lactation, whichever occurred first. The interval ranging from 30 days prior to calving to 300 days post-partum was divided into five biologically meaningful periods: (−30,0), (1,30), (31,120), (121,210) and (211,300) for each cow in each of the first five lactations. These periods correspond to distinct physiological stages of lactation: pre-calving, calving and start of milking, early lactation, mid-lactation and late lactation. Although the periods span different lengths of time, we interpret them as units of "biological time". The idea here is that periodicity reflects "biological" and not chronological time. The response variables were presence or absence of clinical mastitis in each of the potential 25 periods. If a cow had at least one mastitis treatment within a given period (which is rare), the record was scored as "presence", and as "absence" otherwise. Hence, each cow had at most five records on presence or absence of clinical mastitis in each lactation; a missing record occurred when a cow was culled before 300 days in the corresponding lactation. Culled cows had mastitis records for the period in which they were culled even if this period was not complete. This overstates somewhat resistance to mastitis in that period, because the opportunity of infection is smaller.
As mentioned, a cow had at most 25 binary records over the five lactations. About 72% and 45% of the cows had complete first and second lactation records, respectively. Out of the original 36 178 cows, 11 219, 5797 and 2643 cows stayed in the herd until the last period of the third, fourth and fifth lactations, respectively. Table I gives the number of cows and the corresponding mastitis incidence rates, by period, for each of the five lactations. The highest incidence rates of clinical mastitis were found for the first month after calving (periods 2, 7, 12, 17 and 22); here, incidence rate ranged from 10.1% to 18.6%. Subsequently, there was a decrease in incidence. This indicates a periodical pattern in the process of infection.
Mastitis incidences for sires with the highest and lowest amplitude and frequency along the first five lactations are presented in Figure 2. The raw series were not stationary, since incidence of clinical mastitis increased with parity. A similar pattern was observed across sires in the four groups. The first peaks were observed in the second period for almost all sires; mastitis increases after calving and then decreases, with this pattern repeated lactation after lactation. Variation in amplitude was clearer than variation in frequency. Taller peaks were observed in the high amplitude groups of sires. The phase parameter, θ, was fixed at 2 for all series in this study, given that the first peak occurred at the second period across all series, without any apparent variation among sires. A common frequency of ω = 1 5 across all series seems evident from the figure, since the pattern repeated itself every five periods in all series.
For the model in Equation (2), the location vector β included 12 ageseason of first calving classes, 5 levels of parity and 9 years of calving effects; 5286 levels of herd effects were included in h; the vector p included 36 178 cow-specific permanent environmental effects. Since there were 437 sires presented in the pedigree file, there were 437 sire-specific effects for amplitude (s 1 ) and for frequency (s 2 ).

Convergence diagnostics
A single chain of 200 000 iterations was generated. Convergence diagnostics were based on trace plots of all (co)variance components and on Raftery and Lewis [11]. Based on the diagnostics employed, 20 000 iterations were discarded as burn-in, and inferences were based on the remaining 180 000 samples. The acceptance rate of the Metropolis algorithm for sire frequency was 29.4%.

RESULTS
Posterior means of the "fixed" amplitudes (b j ) ranged from 0.1764 to 0.2488 over age-seasons at first calving. Larger amplitude was observed for cows that g 11 = variance between sires for amplitude; g 22 = variance between sires for frequency; g 12 = covariance between sire effects on amplitude and frequency; r g 12 is the genetic correlation between amplitude and frequency; σ 2 h and σ 2 p are the variances between herds and between cows, respectively.
were older at first calving (≥ 28 months). The posterior means, standard deviations, Monte Carlo error and 95% posterior intervals for the sire (co)variances of amplitude and frequency, and for herd and cow-specific variances are given in Table II. The posterior distributions of sire variances for amplitude and frequency are in Figure 3, and the distributions were symmetric with posterior means of 0.0068 and 0.000023 for amplitude and frequency, respectively. The posterior coefficient of variation was 15.6% for frequency and 19% for amplitude. A strong negative genetic correlation (posterior mean was −0.75) between amplitude and frequency was observed, and its posterior distribution was unimodal but slightly skewed to the left (Fig. 3(c)). Figure 3(d) shows the distribution of estimated sire amplitudes which ranged between −0.07 and 0.23. The posterior means of herd and cow-specific variances were 0.072 and 0.086, respectively. Between sire variation for amplitude and frequency was much less important than between-herd and between-cow variations.

DISCUSSION
If a sire has a larger value of amplitude, his daughters would be expected to have a higher risk of mastitis, due to genetic transmission of susceptibility. The estimates of frequency provide information on how frequent the curve repeats itself, and daughters of sires with higher frequency would be expected to contract mastitis more frequently. Combinations of amplitude and frequency can give information on the variation of liability to clinical mastitis over the course of several lactations. On this basis, one should select sires that, genetically, have smaller amplitude and frequency. However, the negative genetic correlation between amplitude and frequency suggests that if sires with smaller amplitude were selected as parents, there would be an undesirable correlated response in frequency, which would be expected to increase. Although there is small variation between sire amplitudes, the difference between sire amplitudes (ranged between −0.07 and 0.20 for sires with daughter records) indicates possible selection against susceptibility to mastitis. Most sires with negative effect on amplitude were also on the top 20 sires for resistance to mastitis when multivariate threshold models were applied to first-lactation records only [3]. The small Monte Carlo errors for herd and cow-specific variances indicate that these posterior means were estimated precisely. Our estimates agree with an earlier analysis of clinical mastitis in multiparous cows using a multivariate probit model [7].
Given the strong triangular wave pattern of mastitis incidence across lactations in this study, only a single dimension cosine function was included in the analysis. The "fixed" common frequency was set equal to a constant across all age-season of first calving classes, based on the data. However, this assumption can be relaxed if heterogeneity in frequency is expected. For example, one would expect the series to reveal more variation if smaller intervals (larger number of periods) were defined. Higher harmonics for the cosine functions would be more likely to capture variation between animals when finer intervals are used to construct the longitudinal binary responses. Although finer intervals would reveal more variation between animals, it would also require much more intense computing time since the number of records for each animal would be larger. Further, it would also result in smaller amplitudes and weaker signals.
In addition, the phase of the series was fixed at 2, given the evidence that the first peak occurred at the second period across all series. However, it is possible to include both "fixed" and "random" phases in the model, if preferred. However, posterior intercorrelation between frequency and phase parameters inside the cosine function slows down convergence considerably. This requires further research and algorithm development.
Heringstad et al. [8] studied the number of cases of mastitis in first-lactation Norwegian Red cows using a censored ordinal threshold model. Repeatability ordinal threshold models could be an alternative to study multiparous mastitis records. However, the information on time to contracting mastitis would be lost. Similarly, Rodrigues-Motta et al. [12] studied the number of cases of mastitis in first-lactation Norwegian Red cows using Poisson and zero-inflated Poisson models. Repeatability or multivariate Poisson models would be useful to study the genetic variation of the mastitis cases in later lactations. However, more research is needed to account for censored records of cows culled early in lactation.
In conclusion, this is a first attempt at modelling longitudinal clinical mastitis using a periodic analysis. Our results suggest potential usefulness of a periodic analysis of longitudinal binary responses using sinusoidal functions if a periodically repeated pattern is present. However, we did not find evidence that genetic variation (as measured by variance between sires of cows) for amplitude or frequency is sizable in clinical mastitis.