Character process model for semen volume in AI rams: evaluation of correlation structures for long and short-term environmental effects

The objective of this study was to build a character process model taking into account serial correlations for the analysis of repeated measurements of semen volume in AI rams. For each ram, measurements were repeated within and across years. Therefore, we considered a model including three environmental effects: the long-term environmental effect, which is a random year* subject effect, the short-term environmental effect, which is a random within year subject* collection effect, and the classical measurement error. We used a four-step approach to build the model. The first step explored graphically the serial correlations. The second step compared four models with different correlation structures for the short-term environmental effect. We selected fixed effects in the third step. In the fourth step, we compared four correlation structures for the long-term environmental effect. The model, which fitted best the data, used a spatial power correlation structure for the short-term environmental effect and a first order autoregressive process for the long-term environmental effect. The heritability estimate was 0.27 (0.04), the within year repeatability decreased from 0.56 to 0.44 and the repeatability across years decreased from 0.43 to 0.37.


INTRODUCTION
Artificial insemination (AI) is an important tool applied in programs of genetic improvement of dairy and meat sheep. In France, AI is used essentially with fresh semen, during summer, a period when semen production units need to produce large amounts of useful semen per day from a limited number of rams. For a given conservation technique, the number of doses produced per ejaculate depends on the volume, sperm concentration and sperm motility. Each ram in an AI centre is collected repeatedly and frequently during the summer season over several years. Semen records of an adult ram may be considered as repeated measures of a single trait. The classical approach for repeated measures of a single trait is the simple repeatability model. This is widely used for the genetic analysis of semen characteristics in domestic species [1,10,20,23,24,33]. However, the simple repeatability model assumes that residual effects are independent and that permanent environmental effects are constant over the productive life of each subject, which may be too restrictive. This may result in biased estimates of variance components [31] and may invalidate inferences about the mean response profile in longitudinal data [9]. Several general models that consider a smooth change over time in the mean as well as in the variance structures are given in the literature: the random regression model, which attempts to model the forms of the functions of time for each component of the phenotype (mean, genetic additive value, permanent environment) [14], the structured antedependence model, which defines the observation at time t as a function of the previous observation [17] and the character process model, which focuses on modelling the covariance functions themselves [27]. The aim of this study was to find the character process model that fits best the semen volume data recorded on adult rams of one French AI centre.

Data
Data on the ejaculates used in this study originate from rams collected during the 1996-2004 period in a French AI centre. We have focussed our study on data recorded between May and August, which corresponds to the period of intensive semen collection. A total of 36 480 ejaculates from 974 adult males, sons of 230 different sires, was analysed. The final pedigree file consisted of 20 761 animals. The rams were between 2 and 7 years old. The intervals between successive collections within one year varied from 1 to 28 days. The rams were collected over 1 to 5 successive years. The number of records per animal was on average 36 and ranged from 1 to 183. Three traits (volume, sperm concentration and mass motility) were measured at each collection. The character of interest in the present study was the volume, which was read directly in millilitres from a graduated collection tube.

Model
The assumptions of the simple repeatability model are not compatible with the analysis of this semen data. Firstly, the assumption of independence between residual effects could not be fulfilled. Indeed, uncontrolled daily environmental (temperature) or biological (animal health) influences on semen production on a particular day may be similar to those on neighbouring days. Secondly, the random animal effect may not be constant over years. For instance, animals are assigned to different pens each year and thus a ram may be dominant one year and not be dominant another year, which can affect its semen production. Thus, it is more realistic to update yearly the permanent environmental effect. As a result, three random environmental components influencing each semen record may be considered. Following notations used by Carvalheira et al. [4], the first is the long-term environmental effect (LTE), which includes all uncontrolled events that permanently influence the semen production of a particular ram during one year. This LTE effect includes past events specific to each animal that occurred during its first months of life (health, rearing conditions) and within year conditions such as feeding management, relations with other animals in the pen and health events. The second component is the short-term environmental effect (STE), which comprises all other sources of unaccounted temporary variations (correlated) that affect semen production. The third is the classical measurement error.
To fit the data with these new assumptions, the following character process model was used: where: y is a vector of ejaculate volumes of order K (K = N i=1 n i , N being the number of animals, n i the number of measurements per animal). β is a vector of fixed environmental effects of order q. a is a vector of random genetic additive effect of order s (s = 20 761). p is a vector of random LTE effects of order r (r = N i=1 c i , where c i is the number of years of collections for subject i). ε 1 is a vector of random correlated STE effects of order K. ε 2 is a vector of independent residuals of order K. X, Z and W are incidence matrices of order K × q, K × s, K × r, respectively.

and H 2 are
and h ic, jl = g 1 x ic, jl where g 1 and g 2 are unknown correlation functions, ⊕ represents the direct sum. L ci is the number of observations made during year c for subject i. x ic, jl is a measurement of "distance" between collections j and l of subject i during year c. d imc is a measurement of "distance" between collections performed during year m and c for subject i. The vector of observations y is assumed to be normally distributed with E (y) = Xβ and var (y) = σ 2 a Z AZ + σ 2 p W H 2 W + τ 2 H 1 + σ 2 e I K . Under this model, a heritability estimate isσ 2 a /σ 2 T whereσ 2 T =σ 2 a +σ 2 p + τ 2 +σ 2 e . The within year repeatability estimate is, for all subjects i, a function of x of the forms: σ 2 a +σ 2 p +τ 2 g 1 (x) /σ 2 T and the repeatability estimate across years is, for all subjects i, a function of d of the forms σ 2 a +σ 2 p g 2 (d) /σ 2 T .

Model selection
Selection of appropriate correlation structures for LTE and STE is a nontrivial step in the model selection process. The method used is inspired from methods proposed by Verbeke and Molenberghs [34], Verbeke et al. [35] and Lesaffre et al. [21] and involves four steps.
The first step consisted in exploring graphically the correlation between collections using the empirical semi-variogram [9,34]. To detect this correlation, the following model was used: where y, β, ε 1 and ε 2 are the same as in model (1.1), f is the random subject intercept ( f ∼ N(0, I N σ 2 f )) with the corresponding incidence matrix M, we draw the scatter plot of 1 2 E e i j − e il 2 = σ 2 e + τ 2 1 − g 1 x i jl vs. the corresponding "distance" x i jl where e is the vector of independent residuals of model (1.3): where the notations are the same as in model (1.2). The total process variance For the exploration of the function g 1 , the distance considered was the interval in days between observations. For the exploration of g 2 , we used residuals obtained with a model similar to (1.3) where y corresponded to the subject average trait per year and the distance considered was the interval in years between mean records. In this step, β is the vector of all the potential fixed effects and all the two-way interactions are described in detail in David et al. [7]. The second step consisted in choosing g 1 fitting and comparing several serial correlation structures for the model (1.2). We considered four different structures for H 1 . The first model (C0 S ) assumed independent STE (g 1 (x) = 0), the second model (AR S ) assumed a first-order autoregressive correlation structure for H 1 (g 1 (x) = ρ x where x is the difference between ranks of collection), the third (SP S ) and fourth (SG S ) models assumed a spatial power (g 1 (x) = ρ x ) and a spatial Gaussian (g 1 (x) = ρ x 2 ) where x is the interval between collections in days, respectively. The models were fitted using the mixed procedure of SAS  8.2 [30] and the restricted maximum likelihood method to estimate variance components. We used restricted likelihood ratio tests to compare the C0 S model to the three other models (nested models). The corresponding null and alternative hypotheses were ρ = 0 and ρ 0, respectively. Under the null hypothesis, the test statistic is expected to be distributed as a chi-square with two degrees of freedom. We used the Akaike information criteria (AIC) to compare the non-nested models (AR S , SP S , SG S ). The best model will be the one that has the lowest value of −2AIC.
The third step selected fixed model terms in (1.2) based on the model retained in step 2. This has been done step by step by comparing nested models with the likelihood ratio test. For this step, models were fitted using the mixed procedure of SAS  8.2 [30] and the maximum likelihood method.
The fourth step selected an appropriate correlation structure for the LTE according to the correlation structure for H 1 selected in step 2 and the mean structure retained in step 3. We compared four different correlation structures for H 2 using the model (1.1). The first model assumed no correlation between LTE (C0 L , g 2 (d) = 0), the second model assumed a correlation of 1 between LTE (C1 L , g 2 (d) = 1), the third model assumed a first-order autoregressive correlation structure for H 2 (AR L , g 2 (d) = η d , where d is the absolute difference between ranks of year of collection) and the fourth, a uniform correlation structure (UC L , g 2 (d) = constant). Restricted likelihood ratio tests were used to compare C0 L or C1 L to AR L or UC L structures while AIC criteria were used to compare the other models. For this step, data were analysed with an animal model using the ASReml software [13]. Variance components and genetic parameters were estimated at the end of this step using the restricted maximum likelihood method.
Finally, to clarify the relative role of the various model components, we compared the four models obtained by fitting the AR L or C1 L structure to the LTE variation and the SP S or C0 S structures for the STE variation. Thereafter in the text, the various sub-models (1.1) are labelled according to the structure of H 1 and H 2 , for instance AR L /SP S .

Semi-variogram
The empirical semi-variograms drawn for the exploration of g 1 and g 2 are presented in Figures 1 and 2, respectively. In both cases, this suggests a serial correlation dependent on time that may be described by an increasing function of the interval between observations. Nevertheless, in both cases, the semivariogram did not give enough information to choose among the potential g(x) functions. In Figure 1, the empirical estimates of variance components for σ 2 f , τ 2 and σ 2 e were 0.052, 0.02 and 0.045, respectively.
where x is the absolute interval in days between observations, AR S ⇔ g 1 (x) = ρ x , where x is the absolute interval between ranks of observation.

Selection of short-term environmental effect correlation matrix
Variance component estimates, minus twice-restricted log-likelihood and AIC criteria (−2AIC), obtained with the different correlation structures to model STE are presented in Table I. When we compared the C0 S structure to AR S , SP S and SG S structures, the smallest likelihood ratio statistic of two models was 2138, indicating that the C0 S structure fitted the data less well than the other. Concerning the comparison among SP S , SG S and AR S structures, the −2AIC criteria were smaller for the SP S and all differences in the −2AIC criteria were higher than 80, indicating that the SP S structure was the best. The estimates of variance components with the SP S structure were close to those estimated with the empirical semi-variogram 0.048, 0.019, 0.040 vs. 0.052, 0.02 and 0.045 respectively. The correlation estimate was very high between two STE separated by one day (ρ = 0.97).

Fixed effects
The fixed effects included in the final model for the mean structure were the age at collection, the week, the year of collection, the interval with previous collection, the daily variation (AM/PM) and the interaction week * year. The number of collections during the previous year, the rank of the within season collection, the breeding value for milk production in quartile and all other twoway interactions were discarded. Minus twice the difference of maximum loglikelihood values between the full fixed effect model and the reduced models equals 210. The associated number of degrees of freedom is 180, implying that the reduced model was acceptable at the 5% level of significance (P = 0.0623). Estimates of the selected fixed effects are described in detail in David et al. [7].

Selection of long-term environmental effect correlation matrix
Variance components, heritability estimates, minus twice AIC criteria and restricted log-likelihood obtained with the different correlation structures for the LTE are presented in the first four columns of Table II. The smallest value of the restricted likelihood ratio tests corresponding to the comparison of AR L /SP S model to the C0 L /SP S or the C1 L /SP S model was 55, and the AR L /SP S model presenting a lower −2AIC criteria than the UC L /SP S model. These results indicate that the AR L /SP S model has the best fit. The correlation estimate for LTE was η = 0.88 and was significantly different from 1 (by the LRT test with C1 L /SP S ).
Repeatability within and across years, estimated with the AR L /SP S model, were decreasing functions of the interval between collections. Within year repeatability varied from 0.56 to 0.44 (Fig. 3) and repeatability across years varied from 0.43 to 0.37 (Fig. 4).
The four last columns in Table II contains the results for the four models obtained by fitting the AR L or C1 L structure to the LTE variation and the SP S or C0 S structures for the STE variation. The log-likelihood ratio values indicate that models with the AR L structure were significantly better than the models with the C1 L structure, and that including the STE component (SP S vs. C0 S ) was also significantly better. The simple repeatability model (C1 L /C0 S ) just    fitting the subject effects inflated the proportion of variance in the residual relative to the other models. A major improvement in fit was observed when either the AR L correlation to LTE or a correlation structure to the STE was added to the simple repeatability model. Both of these models effectively allow a lower correlation between records from different years than for records within the same year. Having both terms in (AR L /SP S ) gives a small, though significant, further gain, and results in a lower heritability estimate than the other models. The correlation structures between observations (overall repeatability) modelled with these four models are presented in Figure 5. The simple repeatability model (C1 L /C0 S ) assumes a constant correlation between observations, which is quite different from the correlations modelled with the three other models. The AR L /C0 S model, which considers that the correlation between observations made during the same year is constant, tended to underestimate the correlation between observations separated by less than 15 days, and to overestimate the correlation between observations separated by more than 15 days and less than 2 years. The correlations modelled with the AR L /SP S and C1 L /SP S models were quite similar but may differ significantly when the number of years between observations increases.

Model
The model proposed here, for the analysis of repeated measurements of semen volume, is a special case of the parametric character process models that Figure 5. Evolution of the correlation between observations depending on the model where g 1 and g 2 are the functions of the correlation processes for the STE and the LTE, respectively, d is the absolute difference between ranks of year of observation and x is the absolute interval between collections in days. model the covariance functions [27]. Such models also accommodate a correlation other than 1, over time, for the genetic component. However, we did not use this approach since we checked that the genetic correlations between traits, in a multiple trait analysis with one trait per age, were all high (>0.95). Similar models to the one proposed here have been used to study cow lactations [4,31] but the modelling process differed from ours by omitting the formal justification of the choice of the correlation structures. Other models can be used for the analysis of repeated and correlated data. For example, random regression or structured antedependence models have been proposed for the evaluation of cattle [17][18][19]25,26]. Nevertheless, Jaffrézic and Pletcher [15] have shown that character process models perform well in comparison to alternative methods, often providing a better fit to the covariance structure.
A first attempt to extend the character process model to the multiple-trait case has been proposed by David et al. [6] considering cross-correlations equal to 0, identical ρ and identical η for all traits. Other models that do not ignore cross-correlations have been proposed by Jaffrézic et al. [16] using a method that involves an eigen transformation of the variance covariance matrices while Gilmour and Thompson [12] proposed to model the covariance matrix on the inverse scale. However, in both cases these methods assume that the correlation process has the same form for all traits. These assumptions may be too restrictive and cannot be extended in the case of different correlation processes for each single trait. Exploring graphically the cross-correlation could be an interesting first step to this extension. If we consider two traits A and B, the method for exploring the cross-correlation process for STE would be to draw 1

Model selection
We used a graphical technique to detect serial correlation. This first step in the analysis has three advantages: it indicates whether a correlation structure exists or not, it can help the user in determining an initial form of the covariance matrix, and it gives initial values of variance components that may help convergence to a maximum likelihood solution. However, this technique cannot be used to select an appropriate correlation function. Another way to explore the correlation structure would be to fit a mixed model with an unstructured covariance structure and to use the resulting estimated covariance matrix to suggest a more parsimonious structure. Nevertheless, when the number of time points is large, the REML algorithm may not converge [8]. This was the case for semen production records and the graphical technique was then more suitable.
Fitting linear models implies that an appropriate mean structure as well as a covariance structure be specified, but these structures are not independent of each other [34]. For our analysis, the covariance structure for the STE was determined first, using a full fixed model, while the fixed effects were tested later. Nevertheless, to confirm the choice of the correlation structure for the STE, the different correlation structures were also compared with the reduced mean model and the spatial power structure was still "optimal". A robust procedure for choosing the fixed model was proposed by Liang and Zeger [22]. It is relatively insensitive to the structure of the variance covariance assumed for the data. Verbeke and Molenberghs [34] showed that this estimator is consistent as long as the mean is correctly specified in the model. In agreement with Robert-Granié et al. [29], the results we obtained with this approach showed that, in most cases, the robust standard error was smaller than the naïve one.
Concerning the selection of g 1 (x), our analysis clearly showed an STE correlation of sequential measurements. Rams were collected with unequal time lags, and models using a function of time-lag in days between collections (SP S and SG S structures) fitted the data better than the AR S model based on the order of collections rather than the actual time interval. The daily correlation of the STE in the final model (AR L /SP S ) is high: 0.96. However, this correlated component is one-third the size of the uncorrelated component (τ 2 /σ 2 e 0.27). These results indicate that an important component of unexplained and uncorrelated daily variation remained.
With regards to the selection of g 2 (x), the assumption of independent LTE was discarded (C0 L /SP S model) indicating that some influences on semen production persist over years. However, not all influences remain constant since the C1 L /SP S model with a correlation of 1 was not as good as the AR L /SP S and UC L /SP S model with a smaller correlation. The autoregressive correlation was better than the uniform correlation. Hypotheses similar to those proposed by Carvalheira et al. [3] for cow milk production may explain this result: many events that affect a ram's capacity for semen production occur during productive life, and impart a correlation structure between years of production that decays with time. Several authors [4,31] have already used the first order autoregressive correlation structure for the analysis of LTE in cow lactations. Similar to the result obtained when omitting the permanent environmental effect in the case of repeated measurements [5], assuming independent LTE (C0 L /SP S model) yielded higher genetic variance and heritability (0.43 vs. 0.27, 0.29) estimates than considering a correlation between LTE because the genetic component picks up the covariance omitted from the environmental component. The results found concerning the correlation processes for STE and LTE mean that there may be some uncontrolled factors, which modify over years the so called "permanent" environmental effect (i.e. change in the location of the pen in the shed, modification of social behaviour when changing pen. . . ) and other factors, which induce within year correlations between residuals of the simple repeatability model (health status, body weight change. . . ). Therefore, one has to identify and control these factors to improve semen production and increase heritability and repeatability of semen traits.

Variance components and genetic parameters estimates
Once the appropriate correlation structures for LTE and STE were selected, we compared the component sub-models with the classic repeatability model. The results showed that the AR L /SP S model fitted the data the best. The heritabilities of these four models ranged from 0.27 to 0.33 and the correlations between estimated breeding values for semen volume estimated with these models were high (>0.98). These results may be specific to the data and the consequence of the high correlations for LTE and STE, indeed the results obtained for similar comparisons in other studies differed a little [3,4,31,32]. However, the results obtained with the AR L /SP S model vs. the three other models relaxed the assumptions of permanent environmental effect and independent residuals, which are the usual assumptions used in the study of repeated semen trait [2,10,11,20,23,24].
The estimate of heritability with the AR L /SP S model was moderate, higher than that reported in the literature for ram semen volume (0.07, 0.11 [28], 0.15, 0.20 [10]) and in the middle of the range of heritability reported on all species (from 0.07 to 0.58 [28,33]).

CONCLUSION
We used a character process model to analyse semen volume in AI rams that has previously been used by several authors [3,4] for test-day records of cows, but with a different modelling process. An empirical semi-variogram gives an informal check for serial correlation. The spatial power correlation and the first order autoregressive structure gave the best fit to the correlation between STE and LTE, respectively. Even if variance components were not markedly affected in this study by the introduction of STE and a first order autoregressive correlation structure for LTE vs. the classical simple repeatability model, the proposed model insured an appropriate selection of fixed effects, identified uncontrolled factors and gave more information about the decline of repeatability with time. The heritability estimate was moderate indicating that selection of AI rams on their ability to produce semen volume could lead to a substantial improvement in the number of doses produced per animal. However, the impact of this selection on other traits must be evaluated.