Estimating covariance functions for longitudinal data using a random regression model

- A method is described to estimate genetic and environmental covariance functions for traits measured repeatedly per individual along some continuous scale, such as time, directly from the data by restricted maximum likelihood. It relies on the equivalence of a covariance function and a random regression model. By regressing on random, orthogonal polynomials of the continuous scale variable, the coefficients of covariance functions can be estimated as the covariances among the regression coefficients. A parameterisation is described which allows the rank of estimated covariance matrices and functions to be restricted, thus facilitating a highly parsimonious description of the covariance structure. The procedure and the type of results which can be obtained are illustrated with an application to mature weight records of beef


INTRODUCTION
Covariance functions have been recognized as a suitable alternative to the conventional multivariate mixed model to describe genetic and phenotypic variation for longitudinal data, i.e. typically data with many, 'repeated' measurements per individual recorded over time. They are especially suited for traits which are changing with time so that repeated measurements do not completely represent the same trait. The example considered here forth is growth of an animal with weights taken at a number of ages, but the concept is readily applicable to other characters and other continuous scales or 'meta-meters'.
In essence, covariance functions are the 'infinite-dimensional' equivalent to covariance matrices in a traditional, 'finite' multivariate analysis [15]. As the name indicates, a covariance function (CF) describes the covariance between records taken at certain ages as a function of these ages. A suitable function is a higher order polynomial. This implies that when fitting a CF model, we need to estimate the coefficients of the polynomial instead of the covariance components in a finite-dimensional analysis. The number of coefficients required is determined by the order of fit of the polynomials. A finite-dimensional, multivariate analysis is equivalent to 'full fit' CF analysis where the order of fit is equal to the number of ages measured, i.e. the covariance matrices for the ages in the data generated by the estimated CFs are equal to the estimates that would have been obtained in a conventional, multivariate analysis. In practice, however, a reduced order fit often suffices.
This reduces the number of parameters to be estimated and thus sampling errors, resulting in a smoothing of the estimated covariance structure.
Kirkpatrick et al. [15,16] modelled CFs using orthogonal polynomials of age, choosing Legendre polynomials. Let E denote a covariance matrix of size q x q, and 4i of size q x k the matrix of orthogonal polynomials evaluated at the given ages with elements !2! _ !!(ti), the jth polynomial for the ith age t i . The order of fit of the CF is given by k < q. This allows the covariance matrix to be rewritten as E = 4iK4i' with K = f k i j a matrix of coefficients, and gives CF Here, t m are the ages adjusted to the range for which the polynomial is defined. Let t m with elements tj for i = 0, ... , k -1 denote the row vector of powers of t m and A the matrix of polynomial coefficients. This gives the mth row of 4) as 4!! = t m A. For instance for k = 3 and Legendre polynomials and Thus equation (1) can be rewritten as 0 (t m , t i ) = t!AKA't! = t mo t §, i.e. the coefficient matrix !2 with elements c!2! is obtained from K by including the terms of the polynomial chosen.
Kirkpatrick et al. [15] described a generalized least-squares procedure to determine the coefficients of a CF from an estimated covariance matrix.
Often, however, this is not available or computationally expensive to obtain. Meyer and Hill [24] showed that the coefficients of CFs can be estimated directly from the data by restricted maximum likelihood (REML) through a simple reparameterization of existing, 'finite-dimensional' multivariate REML algorithms. For the special case of a simple animal model with equal design matrices computational requirements were restricted to the order of fit of the genetic CF. In the general case, however, their approach required a multivariate mixed model matrix proportional to the number of ages in the data to be set up and factored, even for a reduced order fit. This severely limited practical applications, especially for data with records at 'all ages'.
Polynomial regressions have been used to describe the growth of animals for a long time [35], but only recently has there been interest in random regression (RR) models. These have by and large been ignored in animal breeding applications so far, although they are common in other areas; see, for instance, Longford [19] for a general exposition. RRs in a linear mixed model context have been considered by Henderson !9!. Jennrich and Schluchter [12]  This paper describes an alternative procedure for the estimation of covariance functions to that proposed by Meyer and Hill [24], which overcomes the limitations discussed above. It is shown that the CF model is equivalent to a RR model with polynomials of age as independent variables, and that REML estimates of the coefficients of the CF can be obtained as covariances among the regression coefficients. A mechanism is described to restrict the rank of the estimated covariance matrices (of regression coefficients) and thus the CFs, reducing the number of parameters to be estimated. The method is illustrated with an application to beef cattle data.  [15,16] used the well-known Legendre polynomials (see, for instance Abramowitz and Stegun [1]) in fitting covariance functions. These have a range of &mdash;1 to 1. Let tij denote the jth age for animal i standardized to this interval, and let 0 ,(t* ) be the mth Legendre polynomial evaluated for t!.. We can then rewrite equation (2) as a RR model with aim and !y2.&dquo;, representing the mth additive genetic and permanent environmental random regression coefficients for animal i, respectively, and k A and k R denoting the respective orders of fit. This formulation (3) implies that the vector of q breeding values in a 'finitedimensional', multivariate analysis is replaced by the vector of k A additive genetic, random regression coefficients. Note, however, that with k A chosen appropriately (i.e. the minimum order of fit modelling the data adequately), there is virtually no loss of information. In other words, equation (3) can be employed as an effective tool to reduce the number of traits to be handled (and breeding values to be reported) for 'traits' measured over a continuous time scale such as weights (e.g. birth, weaning, yearling, final and mature weight) in beef cattle or test day records for dairy cows. Moreover, the RR model (3) yields a description of the animal's genetic potential for the complete time period considered, for instance, an estimate of the growth or lactation curve.

Covariance structure
The covariance between two records for the same animal is then Generally measurement errors are assumed to be i.i.d. with variance QE , so that Cov(e2!,e2!!) = a2 for j = j' and 0 otherwise, but other assumptions, such as heterogeneous variances or autoregressive errors, are readily accommodated.
Clearly, the first two terms in equation (4) are CF with the covariances between random regression coefficients equal to the coefficients of the corre-coefficients (N A > N D denoting the total number of animals in the analysis, including parents without records), y the vector of k R x N D permanent environmental random regression coefficients, e the vector of N measurement errors, and X, Z * and Z z denoting the corresponding 'design' matrices.
Here ZD is the non-zero part of Z * (for k A = h R ), i.e. the part of Z * corresponding to animals in the data. The superscript ' * ' marks matrices incorporating orthogonal polynomial coefficients. Assuming y is ordered for animals, ZD is blockdiagonal, the block for animal i is of dimension q i x k R , and has elements øm(tij). Note that each observation gives rise to k R (or k A for Z * ) non-zero elements rather than a single element of 1 in the usual, finitedimensional model, i.e. the design matrices are considerably denser than in the latter case.
Let K A with elements K Amt = Cov(am, a l ) and K R with elements K Rml = Cov( 7 ,,,,,y i ) denote the coefficient matrices for the additive genetic and permanent environmental covariance functions A and R, respectively. In terms of analysis, this is analogous to treating RR coefficients as correlated 'traits'. Assume that the fixed part of the model accounts for systematic age effects, so that a N N(0, K A 0 A) and y -N(O, K R 0 I ND ) ' and that a and y are uncorrelated. For generality, let V( E ) = R, but assume R is blockdiagonal for animals with blocks equal to submatrices of the q x q matrix Sg. The mixed model matrix pertaining to equation (5) is then where A is the numerator relationship matrix between animals, IN is an identity matrix of size N, and Q9 denotes the direct matrix product. M * has N F + kAN A + k R N D + 1 rows and columns (with N F being the total number of levels of fixed effects fitted), i.e. its size and thus computational requirements are proportional to the order of fit of the CFs. For R = o, 6 21, o! can be be factored from M * , resulting in a matrix which can be set up as for a univariate analysis.
Estimates of the distinct elements of K A and K R and the parameters determining E E can be obtained by REML, applying existing procedures for multivariate analyses under a 'finite' model. This may involve a simple, derivative-free algorithm !21) or, more efficiently, a method utilizing information from derivatives of the likelihood, such as Johnson and Thompson's [13] 'average information' algorithm; see Madsen et al. [20] or Meyer [22] for a description of the latter in the multivariate case.
While true measurement errors are generally assumed to be i.i.d., there may be cases in which we need to allow for heterogeneous variances or correlations between 'temporary' environmental effects. This may, to some extent, compensate for suboptimal orders of fit for permanent environmental or genetic covariance functions. In other cases E E may include parameters, such as the autocorrelation p for measurement errors following a stationary time series, for which V(y) is non-linear and for which derivatives are thus not straightforward to evaluate. In these instances, a two-step procedure combining a derivativefree search (e.g. a quadratic approximation) for the 'difficult' parameter(s) with an average information algorithm to maximize log G with respect to the 'linear' parameters can be envisaged. A similar strategy has been employed by Thompson [32] in estimating the regression on maternal phenotype as well as additive genetic and environmental components of variance. Alternatively, estimation may be carried out in a Baysian framework using a Monte Carlo based technique, see Varona et al. [33] for an application in a linear RR model.
Calculation of the log likelihood (G) requires factoring M * to calculate the log determinant of the coefficient matrix (log [C * [) and the residual sums of squares (y'P * y) (see Meyer [21] for details). The likelihood is then For i.i.d. measurement errors, the error variance can be estimated directly as QE = y'P * y/(N -r(X)), as for univariate analyses.

Extensions to other models
So far only the case of a simple, 'univariate' animal model has been considered. More complicated models, however, are readily accommodated in the framework described. For instance, additional random effects such as maternal genetic effects or litter effects can be taken into account analogously by modelling each as a series of random regression coefficients. Correlations between random effects, e.g. non-zero direct-maternal genetic covariances, can be modelled by allowing for covariances between the respective regression coefficients, which then yield a CF describing the covariance between random effects over time.
Similarly, 'multivariate' CF [24] for series of measurements for different traits (e.g. height and weight measured at different times) can be estimated simply by fitting sets of RR coefficients for each trait and allowing for covariances between corresponding sets for different traits. An expectation-maximization type algorithm for a bivariate analysis under a RR model has recently been described by Shah et al. !30!. As mentioned above, a variety of assumptions about the structure of the within-individual, temporary environmental covariance matrices can be accommodated; see, for instance, Wolfinger [36] for a description of some commonly used models.

Reduced rank covariance functions
For q correlated measurements, the information supplied (or most of it) can generally be summarized as a set of k G q linear combinations. These can be determined by a singular value decomposition of the corresponding covariance matrix. Typically, this yields one or a few (k) large, dominating eigenvalues with the remainder (q-k) being small or zero. Setting the latter to zero and backtransforming (by pre-and postmultiplying the diagonal matrix of eigenvalues with the matrix of eigenvectors and its transpose, respectively) then yields a modified, reduced rank covariance matrix. In estimating covariance matrices, this could be used to reduce the number of parameters to be estimated and thus sampling variation. A parameterization to the elements of the eigenvalue decomposition and setting eigenvalues k + 1, ... , q and the corresponding eigenvectors to zero would achieve this but reduce the number of parameters to be estimated only for k < q/2. Though not perceived for this explicit purpose, the 'symmetric coefficients' CF model of [15] provides an alternative way of estimating reduced rank covariance matrices [22].
As outlined by Kirkpatrick et al. [15], there is an equivalent to the eigenvalue decomposition of covariance matrices for covariance functions, with a corresponding interpretation. Estimates of the eigenvalues of a CF fitted to order k are simply the eigenvalues of the corresponding, estimated matrix of coefficients (K). Similarly, estimates of the eigenfunctions of a CF, the infinitedimensional equivalent to eigenvectors, can be obtained from the eigenvectors of K. Let v i denote the ith eigenvector of K with elements Vij and 0;(t * ) the jth order Legendre polynomial. The ith eigenfunction of the CF is then [15] Note that 0;(t * ) is not evaluated for any particular age, but includes polynomials of the standardized age t * . Hence, * i is a continuous, polynomial function in t * . As discussed by Kirkpatrick et al. [15], eigenfunctions of genetic CF are especially of interest, as they represent possible deformations of the mean (growth) trajectory which can be effected by selection, while the corresponding eigenvalues describe the amount of genetic variation in that direction. In particular, the eigenfunction associated with the largest eigenvalue gives the direction in which the mean trajectory will change most rapidly.
Fitting a CF to order k requires k(k + 1)/2 coefficients, i.e. covariances between random regression coefficients, to be estimated, and gives estimates of the first k eigenfunctions and eigenvalues of the CF. In some instances, one or several eigenvalues of the CF may be close to zero or small compared to the other eigenvalues. This implies that we require a kth order fit to model the shape of the (growth) curve adequately, but that a subset of m directions (= eigenfunctions) suffices. In other words, we might obtain a more parsimonious fit of the CF by estimating a reduced rank coefficient matrix, forcing km eigenvalues of K to be zero. Consider the Cholesky decomposition of K, pivoting on the largest diagonal where L is a lower diagonal matrix with diagonal elements of unity, l i the ith column vector of L, and D is a diagonal matrix. For a covariance matrix K, the ith element of D, di, can be interpreted as the conditional variance of variable i, given variables 1, ... , i -1. A reparameterization to the non-zero off-diagonal elements of L and the diagonal elements of D has been advocated for REML estimation of covariance components to remove constraints on the parameter space or improve rate of convergence in an iterative estimation scheme [6, 18, 25!. Other parameterizations in this context, based on the eigenstructure of the covariance matrix, have been considered by Pinheiro and Bates !26!. An alternative form of the Cholesky decomposition is K = L * L * ' where L * has diagonal elements 1*i = !2. L * is often interpreted as K 1/2 . The eigenvalues of the power of a matrix are equal to the power of the eigenvalues of the matrix, and the eigenvalues of a triangular matrix are equal to its diagonal elements (5!. Hence, the estimate of K can be forced to have rank m by assuming elements d,,,, +1 to d k in equation (9) are zero (elements d i are assumed to be in descending order). This yields a modified matrix The vectors l i corresponding to the zero d i are then not needed, i.e. K + is described by km &mdash; m(m &mdash; 1)/2 parameters, m elements d i and (k -1)mm(m &mdash; 1)/2 elements of l ij (j > i) of the l i . Clearly, this is not equivalent to fitting a (full rank) CF to the order m (which would involve m(m + 1)/2 parameters) -for instance for k = 4 and m = 2 we fit a cubic regression assuming there are only two independent directions in which the trajectory is likely to change, while for k = m = 2 we fit a linear regression.
Strictly speaking, equation (6) has to be of full rank. Hence, for practical computations, d i are set to a small positive value (e.g. . Alternatively, a REML algorithm which allows for a semi positive definite covariance matrix of random effects could be employed, c.f. Harville [8] or Frayley and Burns !3!.
Obviously, this parameterization can also be used to estimate reduced rank covariance matrices for finite-dimensional, multivariate analyses.  [24] fitted covariance functions to January weights of 913 beef cows, weighed from 2 to 6 years of age, 2 795 records in total with up to five records per cow available. Their analysis used age at weighing in years and fitted measurement errors and fixed effects for each age separately. These data were re-analysed using the random regression model and fitting age at weighing in months. Analyses were carried out using program DxMRR [23], employing a derivative-free algorithm to maximize log G.

Meyer and Hill
There were a total of 22 ages in the data, ranging from 19 to 70 months. Figure 1 gives the mean weight and number of records for each age class. Analyses were carried out fitting a separate measurement error variance component for each year of age (five variances). Fixed effects fitted were year-paddock of weighing subclasses (86 levels), year of birth effects (16 levels) and a cubic regression on age at weighing. The model for fixed effects was 'univariate', i.e. the effects were assumed to be similar for cows of all ages. Additive genetic and permanent environmental covariance functions were fitted to the same order throughout (k A = k R = k). Orders of fit considered ranged from 1 to 6.
In addition, the usual 'repeatability model' was fitted, i.e. a CF model with k = 1 and a single measurement error variance, assumed to be the same for all ages.
For each order of fit, the number of non-zero eigenvalues allowed for each coefficient matrix to be estimated was set to the same value (r) for K A and K R , considering values of r < k of 1 to 3. In several instances, analyses resulted in estimates of K R with one small eigenvalue. In these cases, the rank of K R was reduced by one. In every case, this yielded a further improvement in log G when continuing the analysis, i.e. earlier convergence had been to a false maximum as the search procedure had become 'stuck' at the bounds of the parameter space.
In general, analyses took a considerable time to converge, markedly longer for an order of fit of k than for a comparable k-variate, finite-dimensional analysis. Furthermore, several restarts were required for each analysis before likelihoods stabilized. Convergence was especially slow when attempting to estimate 'unnecessary' parameters, i.e. an order of fit or rank of CF with one or more eigenvalues close to zero.

Likelihoods
Maximum likelihood values from all analyses together with eigenvalues of the estimated coefficient matrices and estimates of the measurement error variances are summarized in table 7. Clearly, a repeatability model (first line) was inappropriate in this case, estimates of o,2 (for k = 1) being considerably higher for older cows than for 2-year-old cows. Forcing the rank r of an estimated coefficient matrix -and thus CFto be 1 is equivalent to the assumption that all corresponding correlations have a value of unity. For r = 1 and k > 1, the higher order coefficients then model heterogeneity of variances.
For all orders of fit, increasing r from 1 to 2 yielded significant increases in log G. Increasing r further to 3, however, did not raise log G significantly, increases of 5.33 (k = 5) and 5.75 (k = 6) being outweighed by the number of additional parameters (6 and 8, respectively). For r = 2, log G increased significantly until k reached 4. Estimated CFs for this model were with t i denoting the ith standardized age. When fitting polynomials, however, we should also examine an order of fit of k + 2, i.e. the next order of fit with the same type (even versus uneven) of exponent, as an odd-degree polynomial is likely to contribute little when an even-degree polynomial fits the data (4!. Indeed, while adding a quartic coefficient (k = 5) did not increase log G significantly over a cubic polynomial (k = 4), fitting a quintic term (k = 6) did.
Fitting years rather than months of age as meta-meter for the CFs, Meyer and Hill [24] found log G not to increase significantly beyond k = 2. As shown in figure 1, there was only a small spread of ages within each year, i.e. the change in scale was expected to have little effect on estimates. The model employed by Meyer and Hill !24!, however, was multivariate, i.e. fitted separate fixed effects for each year of age. Presumably, the stronger trend in covariances observed in this analysis can be attributed to less variation being removed by fixed effects. In addition, likelihood ratio tests were carried out considering full rank CFs and the associated number of parameters. Figure 2 shows the estimated phenotypic standard deviations for the ages in the data. For k = 1, deviations from a horizontal line reflected differences in estimates of QE over years. Estimates for k > 3 were similar, the cubic term for k > 4 causing estimates for 6-year-old cows to rise sharply. As shown in table I, this was accompanied by estimates of a; 5 of zero, i.e. presumably to some extent due to a restriction imposed on the parameter space, forcing QE > 0. Except for these last age classes, estimates agreed closely with those from a finitedimensional multivariate analysis treating records at different years of age as separate traits, which were 40.0, 60. 5 [24].

Eigenfunctions
As for phenotypic standard deviations, estimates of the first eigenvalue (table I! and corresponding eigenfunction of A (figure 3) were similar for orders of fit k > 3. Values of the eigenfunction for individual ages were roughly proportional to the genetic variance for the age, as determined from the corresponding estimate of genetic CF ,A. Positive values throughout imply that selection for increased weight at any age is likely to increase weight at all other ages. Estimates for the second eigenvalue and -function, however, changed considerably from k = 3, 4 to k = 5, 6, in particular for r = 3. This was accompanied by a significant increase in log G from k = 4 to k = 5 and from k = 5 to k = 6 for r = 3 but not r = 2. Eigenvalues of estimates of A and R up to orders of fit of k = 4 were similar to those reported by Meyer and Hill !24!, suggesting that the third sizeable eigenvalues appearing or k ) 5 might be an artefact of the order of the polynomial function and the 'univariate' fixed effects part of the model of analysis. Figure 4 shows genetic covariances obtained from estimates of ,A for k = 3, 6 (r = 3), evaluated for the range of ages spanned by data. For k = 3 and k = 4, the resulting surfaces were very smooth, clearly following a quadratic and cubic function, while for k > 5 surfaces became 'wiggly', following the data more closely. A peak in genetic variance at 4 years of age with a subsequent decline was consistent with estimates of 672, 1277, 2 221, 722 and 1988 (kg 2 ) for ages 2, 3, 4, 5 and 6 years, respectively, from a finite-dimensional, multivariate analysis [24].

Genetic covariances
Corresponding genetic correlations for k = 4 and 6 are shown in figure 5. equivalent to a special class of random regression models. While not common practice, covariance functions can be formulated for the sources of variation modelled by random regression coefficients in any RR model. Regression models, fixed or random, require some assumptions about the parametric form of the regression equation. Regressing on orthogonal polynomials of time (or any other meta-meter) in order to fit Kirkpatrick et al.'s [15] CF model, only invokes the assumption that the trajectory to be modelled can be described by such polynomials. As shown above, in a maximum likelihood framework we can let the data determine the order of polynomial fit required and the rank of the CF which suffices.
Likelihood likely to be small. In practical applications, the error probability a has been 'doubled' to account for this conservatism, i.e. A has been contrasted against X!+1,2Ü' instead of X q + ,,,,, [27].
Alternatively, we can decide on an order of fit a priori or decide only to use the first n eigenfunction of the CF to describe the covariance structure. This choice may depend on the computational resources available, or may be influenced by the knowledge that higher order polynomials are notorious for magnifying small sampling errors and producing 'wiggly' functions. Even if this choice does not maximize the likelihood, we are likely to obtain more appropriate estimates than under the simple repeatability model when this clearly does not apply, both in terms of estimates of genetic parameters and prediction of breeding values. Further research is required to develop guidelines as how to choose the 'best' model for particular situations.
Great care has to be taken when defining and interpreting RR models. As shown in the numerical example, data points at the extremes of the independent variable can have a big influence on estimates of regression coefficients, yielding quite erratic estimates of CF or measurement error variances. Various authors estimated genetic parameters for test day milk yields of dairy cows fitting a random regression model. In most cases, resulting heritability estimates were high at the beginning and end of lactation and low in the middle of lactation, in marked contrast to previous estimates under a 'finite-dimensional' model, and thus regarded with justified scepticism ( [10,14]; Van der Werf et al., unpublished). This emphasizes the sensitivity of RR models to the effects and orders of CFs fitted. It has to be stressed that judicious modelling of fixed and random effects, careful choice of the orders of fit of CFs, and meticuluous screening of the data are paramount for successful RR model analyses.
As emphasized by Kirkpatrick et al. !15!, many families of functions can be employed in the estimation of CFwhile the choice of orthogonal function does not affect the estimate of the CF at the ages in the data, it does affect interpolation. Following their choice, we have used Legendre polynomials. These, however, generate a weight function with comparatively heavy emphasis on records at the outer parts of the interval for which they are defined (compared to, say, weights following a kernel from a Normal distribution) (!7!; section 3.3). Other functions might thus be more suitable. Alternatively, some scaling of the observed ages can be envisaged, for instance, a logarithmic transformation should reduce the impact of few observations on very old animals.

CONCLUSIONS
Random regression models provide a valuable tool to model repeated records in animal breeding adequately, especially if traits measured change gradually. They allow covariance functions to be formulated which describe genetic and environmental covariances among records over time. Moreover, they impose a structure on covariance matrices. Fitting regressions on orthogonal polynomials of time (or equivalent) we can estimate genetic covariance functions as suggested by Kirkpatrick et al. !15!, whose eigenvalues and eigenfunctions provide an insight into the way selection is likely to affect the mean trajectory of the records considered and can be used to characterize differences between populations, e.g. breeds of animals. 6. SOFTWARE A program is available for the estimation of covariance functions by REML, fitting a random regression animal model, as described above. 'DxMRR' is written in Fortran 90 and is part of the D F R EML package, version 3.0; see Meyer [23] for further details. This is contained on the CD-ROM of the Sixth World Congress on Genetics Applied to Livestock Production, or can be downloaded from the D F R EML home page at http: //agbu.une.edu.au/&dquo;kmeyer/dfreml.html.