Restricted maximum likelihood estimation of covariances in sparse linear models

This paper surveys the theoretical and computational development of the restricted maximum likelihood (REML) approach for the estimation of covariance matrices in linear stochastic models. A new derivation of this approach is given, valid under very weak conditions on the noise. Then the calculation of the gradient of restricted loglikelihood functions is dis- cussed, with special emphasis on the case of large and sparse model equations with a large number of unknown covariance components and possibly incomplete data. It turns out that the gradient calculations require hardly any extra storage, and only a small multiple of the number of operations needed to calculate the function values alone. The analytic gradient procedure was integrated into the VCE package for co- variance component estimation in large animal breeding models. It resulted in dramatic improvements of performance over the previous implementation with finite difference gradients. An example with more than 250 000 normal equations and 55 covariance components took hours instead of days of CPU time, and this was not an untypical case.


INTRODUCTION
Best linear unbiased prediction of genetic merit [25] requires the covariance structure of the model elements involved. In practical situations, these are usually unknown and must be estimated. During recent years restricted maximum likelihood (REML) [22,42] has emerged as the method of choice in animal breeding for variance component estimation [15][16][17][34][35][36].
Initially, the expectation maximization (EM) algorithm [6] was used for the optimization of the REML objective function [26,47].
In 1987 Graser et al. [14] introduced derivative-free optimization, which in the following years led to the development of rather general computing algorithms and packages [15,28,29,34] that were mostly based on the simplex algorithm of Nelder and Mead [40]. Kovac [29] made modifications that turned it into a stable algorithm that no longer converged to noncritical points, but this did not improve its inherent inefficiency for increasing dimensions. Ducos et al. [7] used for the first time the more efficient quasi-Newton procedure approximating gradients by finite differences. While this procedure was faster than the simplex algorithm it was also less robust for higher-dimensional problems because the covariance matrix could become indefinite, often leading to false convergence. Thus, either for lack of robustness and/or excessive computing time often only subsets of the covariance matrices could be estimated simultaneously.
A comparison of different packages [45] confirmed the general observation of Gill [13] that simplex-based optimization algorithms suffer from lack of stability, sometimes converging to noncritical points while breaking down completely at more than three traits. On the other hand the quasi-Newton procedure with optimization on the Cholesky factor as implemented in a general purpose VCE package [18] was stable and much faster than any of the other general purpose algorithms. While this led to a speed-up of between two for small problems and (for some examples) 200 times for larger ones as compared to the simplex procedure, approximating gradients on the basis of finite differences was still exceedingly costly for higher dimensional problems [17].
It is well-known that optimization algorithms generally perform better with analytic gradients if the latter are cheaper to compute than finite difference approximations.
In this paper we derive, in the context of a general statistical model, cheap analytical gradients for problems with a large number p of unknown covariance components using sparse matrix techniques. With hardly any additional storage requirements, the cost of a combined function and gradient evaluation is only three times that of the function value alone. This gives analytic gradients a huge advantage over finite difference gradients. Misztal and Perez-Enciso [39] investigated the use of sparse matrix technique in the context of an EM algorithm which is known to have much worse convergence properties as compared to quasi-Newton (see also Thompson et al. [48] for an improvement in its space complexity), using an LDL T factorization and the Takahashi inverse [9]; no results in a REML application were given. A recent papers by Wolfinger et al. [50] (based again on the W transformation) and Meyer [36] (based on the simpler REML objective formulation of Graser et al. [14]) also provide gradients (and even Hessians), but there a gradient computation needs a factor of O(p) more work and space than in our approach, where the complete gradient is found with hardly any additional space and with (depending on the implementation) two to four times the work for a function evaluation.
Meyer [37] used her analytic second derivatives in a Newton-Raphson algorithm for optimization. Because the optimization was not restricted to positive definite covariance matrix approximations (as our algorithm does), she found the algorithm to be markedly less robust than (the already not very robust) simplex algorithm, even for univariate models.
We test the usefulness of our new formulas by integrating them into the VCE covariance component estimation package for animal (and plant) breeding models [17]. Here the gradient routine is combined with a quasi-Newton optimization method and with a parametrization of the covariance parameters by the Cholesky factor that ensures definiteness of the covariance matrix. In the past, this combination was most reliable and had the best convergence properties of all techniques used in this context [45]. Meanwhile, VCE is being used widely in animal and even plant breeding.
In the past, the largest animal breeding problem ever solved ( [21], using a quasi-Newton procedure with optimization on the Cholesky factor) comprised 233 796 linear unknowns and 55 covariance components and required 48 days of CPU time on a 100 MHz HP 9000/755 workstation. Clearly, speeding up the algorithm is of paramount importance. In our preliminary implementation of the new method (not yet optimized for speed), we successfully solved this (and an even larger problem of more than 257 000 unknowns) in only 41 h of CPU time, with a speed-up factor of nearly 28  In section 2 we fix notation for linear stochastic models and mixed model equations, define the REML objective function, and review closed formulas for its gradient and Hessian. In sections 3 and 4 we discuss a general setting for practical large scale modeling, and derive an efficient way for the calculation of REML function values and gradients for large and sparse linear stochastic models. All our results are completely general, not restricted to animal breeding. However, for the formulas used in our implementation, it is assumed that the covariance matrices to be estimated are block diagonal with no restrictions on the (distinct) diagonal blocks.
The final section 5 applies the method to a simple demonstration case and several large animal breeding problems.

LINEAR STOCHASTIC MODELS AND RESTRICTED LOGLIKELIHOOD
Many applications (including those to animal breeding) are based on the generalized linear stochastic model with fixed effects )3, random effects u and noise 11 . Here cov(u) denotes the covariance matrix of a random vector u with zero mean. Usually, G and D are block diagonal, with many identical blocks. By combining the two noise terms, the model is seen to be equivalent to the simple model y = X(3 + 11 ', where rl' is a random vector with zero mean and (mixed model) covariance matrix V = ZGZ T + D. Usually, V is huge and no longer block diagonal, leading to hardly manageable normal equations involving the inverse of V. However, Henderson [24] showed that the normal equations are equivalent to the mixed model equations This formulation avoids the inverse of the mixed model covariance matrix V and is the basis of most modern methods for obtaining estimates of u and j3 in equation (1). Fellner [10] observed that Henderson's mixed model equations are the normal equations of an augmented model of the simple form where Thus, without loss in generality, we may base our algorithms on the simple model [3], with a covariance matrix C that is typically block diagonal. This automatically produces the formulas that previously had to be derived in a less transparent way by means of the W transformation; cf. [5, 11, 23, 50J. The 'normal equations' for the model [3] have the form where Here AT denotes the transposed matrix of A. By solving the normal equations (4), we obtain the best linear unbiased estimate (BLUE) and, for the predictive variables, the best linear unbiased prediction (BLUP) for the vector x, and the noise e = Ax -b is estimated by the residual If the covariance matrix C = C(w) contains unknown parameters w (which we shall call 'dispersion parameters', these can be estimated by minimizing the 'restricted loglikelihood' quoted in the following as the 'REML objective function', as a function of the parameters w. (Note that all quantities in the right-hand side of equation (6) depend on C and hence on w.) More precisely, equation (6) is the logarithm of the restricted likelihood, scaled by a factor of -2 and shifted by a constant depending only on the problem dimension.
Under the assumption of Gaussian noise, the restricted likelihood can be derived from the ordinary likelihood restricted to a maximal subspace of independent error contrasts (cf. Harville [22]; our formula (6) is the special case of his formula when there are no random effects). Under the same assumption, another derivation as a limiting form of a parametrized maximum likelihood estimate was given by Laird [31]. When applied to the generalized linear stochastic model (1) in the augmented formulation discussed above, the REML objective function (6) takes the computationally most useful form given by Graser et al. [14].
The following proposition contains formulas for computing derivatives of the REML function. We write for the derivative with respect to a parameter w! occurring in the covariance matrix.
Proposition [22,32,42,50]. Let where A and B are as previously defined and Then where (Note that, since A is nonsquare, the matrix P is generally nonzero although it always satisfies PA = 0.) 3 Generally, All is very sparse with few rows and many columns, most of them zero, since only a small subset of the variables occurs explicitly in the vth element.
Each model equation has only one noise term. Correlated noise must be put into one element. All elements of the same type are assumed to have statistically independent noise vectors, realizations of (not necessarily Gaussian) distributions with zero mean and the same covariance matrix. (In our implementation, there are no constraints on the parametrization of the C o y, but it is not difficult to modify the formulas to handle more restricted cases.) Thus the various elements are assigned to the types according to the covariance matrices of their noise vectors.

Example animal breeding applications
In covariance component estimation problems from animal breeding, the vector x splits into small vectors /3 k of (in our present implementation constant) size n trait called 'effects'. The right-hand side b contains measured data vectors y, and zeros. Each index v corresponds to some animal. The various types of elements are as follows.
Measurement elements: the measurement vectors y&dquo; E lR nt ra 't are explained in terms of a linear combination of effects (3 i C 7Rnt!a't, Here the i w i form an n rec x n e ff index matrix, the J. 1vl form an n rec x n eff coefficient matrix, and the data records y! are the rows of an n rec x n tra i t measurement matrix.
In the current implementation, corresponding rows of the coefficient matrix and the measurement matrix are concatenated so that a single matrix containing the floating point numbers results. If the set of traits splits into groups that are measured on different sets of animals, the measurement elements split accordingly into several types.
Pedigree elements: for some animals, identified by the index T of their additive genetic effect (3T, we may know the parents, with corresponding indices V (father) and M (mother). Their genetic dependence is modeled by an equation The indices are stored in pedigree records which contain a column of animal indices T(v) and two further columns for their parents (V(v), M(v)). Random (3) with where -y(v) denotes the type of element v. A practical algorithm must be able to account for the situation that some components of b, are missing. We allow for incomplete data vectors b by simply deleting from the full model the rows of A and b for which the data in b are missing. This is appropriate whenever the data are missing at random [43]; note that this assumption is also used in the missing data handling by the EM approach [6,27].
Since dropping rows changes the affected element covariance matrices and their Cholesky factors in a nontrivial way, the derivation of the formulas for incomplete data must be performed carefully in order to obtain correct gradient information. We therefore formalize the incomplete element formulation by introducing projection matrices P, coding for missing data pattern [31]. If we define P, as the (0,1) matrix with exactly one 1 per row (one row for each component present in b,), at most one 1 per column (one column for each component of b,), then P&dquo;A&dquo; is the matrix obtained from A, by deleting the rows for which data are missing, and P,b, is the vector obtained from b, by deleting the rows for which data are missing. Multiplication by p T on the right of a matrix removes the columns corresponding to missing components. Conversely, multiplication by p T on the left or P on the right restores missing rows or columns, respectively, by filling them with zeros.
Using the appropriate projection operators, the model resulting from the full element formulation (13) in the case of some missing data has the incomplete element equations where The incomplete element equations can be combined to full matrix form (3), with and the inverse covariance matrix takes the form where Note that C!, M v , and log det C! (a byproduct of the inversion via a Cholesky factorization, needed for the gradient calculation) depend only on type q(v) and missing data pattern P,, and can be computed in advance, before the calculation of the restricted loglikelihood begins.

THE REML FUNCTION AND ITS GRADIENT IN ELEMENT FORM
From the explicit representations (16) and (17), we obtain the following formulas for the coefficients of the normal equations After assembling the contributions of all elements into these sums, the coefficient matrix is factored into a product of triangular matrices using sparse matrix routines [8,20]. Prior to the factorization, the matrix is reordered by the multiple minimum degree algorithm in order to reduce the amount of fill in. This ordering needs to be performed only once, before the first function evaluation, together with a symbolic factorization to allocate storage. Without loss of generality, and for the sake of simplicity in the presentation, we may assume that the variables are already in the correct ordering; our programs perform this ordering automatically, using the multiple minimum degree ordering 'genmmd' as used in 'Sparsepak' [43].
Note that R is the transposed Cholesky factor of B. (Alternatively, one can obtain R from a sparse QR factorization of A, see e.g. Matstoms [33].) To take care of dependent (or nearly dependent) linear equations in the model formulation, we replace in the factorization small pivots < sB 2i by 1. (The choice E = (macheps)2!3, where macheps is the machine accuracy, proved to be suitable. The exponent is less than 1 to allow for some accumulation of roundoff errors, but still guarantees 2/3 of the maximal accuracy.) To justify this replacement, note that in the case of consistent equations, an exact linear dependence results in a factorization step as in the following In the presence of rounding errors (or in case of near dependence) we obtain entries of order eB ii in place of the diagonal zero. (This even holds when B ii is small but nonzero, since the usual bounds on the rounding errors scale naturally when the matrix is scaled symmetrically, and we may choose the scaling such that nonzero diagonal entries receive the value one. Zero diagonal elements in a positive semidefinite matrix occur for zero rows only, and remain zero in the elimination process.) If we add B i i to R i i when Rii < eB ii and set Rii = 1 when Bii = 0, the near dependence is correctly resolved in the sense that the extreme sensitivity or arbitrariness in the solution is removed by forcing a small entry into the ith entry of the solution vector, thus avoiding the introduction of large components in null space directions. (It is useful to issue diagnostic warnings giving the indices of the column indices i where such near dependence occurred.) The determinant is available as a byproduct of the factorization. The above modifications to cope with near linear dependence are equivalent to adding prior information on the distribution of the parameters with those indices where pivots changed. Hence, provided that the set of indices where pivots are modified does not change with the iteration, they produce a correct behavior for the restricted loglikelihood. If this set of indices changes, the problem is ill-posed, and would have to be treated by regularization methods such as ridge regression, which is far too expensive for the large-scale problems for which our method is designed. In practice we have not seen a failure of the algorithm because of the possible discontinuity in the objective function caused by our procedure for handling (near) dependence.
Once we have the factorization, we can solve the normal equations R T Rx = a for the vector x cheaply by solving the two triangular systems (In the case of an orthogonal factorization one has instead to solve Rx = y, where y = Q T b.) From the best estimate x for the vector x, we may calculate the residual as with the element residuals Then we obtain the objective function as Although the formula for the gradient involves the dense matrix B-1 , the gradient calculation can be performed using only the components of B-1 within the sparsity pattern of R T + R. This part of B-' is called the 'sparse inverse' of B and can be computed cheaply; cf. Appendix 1. The use of the sparse inverse for the calculation of the gradient is discussed in Appendix 2.
The resulting algorithm for the calculation of a REML function value and its gradient is given in table I, in a form that makes good use of dense matrix algebra in the case of larger covariance matrix blocks Cl,. The symbol EB denotes adding a dense subvector (or submatrix) to the corresponding entries of a large vector (or matrix). In the calculation of the symmetric matrices B', W, M' and K', it suffices to calculate the upper triangle.
Symbolic factorization and matrix reordering are not present in table I since these are performed only once before the first function evaluation. In largescale applications, the bulk of the work is in the computation of the Cholesky factorization and the sparse inverse. Using the sparse inverse, the work for function and gradient calculation is about three times the work for function evaluation alone (where the sparse inverse is not needed). In particular, when the number p of estimated covariance components is large, the analytic gradient takes only a small fraction 2/p of the time needed for finite difference approximations.
Note also that for a combined function and gradient evaluation, only two sweeps through the data are needed, an important asset when the amount of data is so large that it cannot be held in main memory.

ANIMAL BREEDING APPLICATIONS
In this section we give a small numerical example to demonstrate the setup of various matrices, and give less detailed results on two large problems. Many other animal breeding problems have been solved, with similar advantages for the new algorithm as in the examples given below [1-3, 19, 38, 49]. 5.1. Small numerical example Table II gives the data used for a numerical example. There are in all eight animals which are listed with their parent codes in the first block under 'pedigree'. The first five of them have measurements, i.e. dependent variables listed under 'dep var'. Each animal has two traits measured except for animal 2 for which the second measurement is missing. Structural information for independent variables is listed under 'indep var'. The first column in this block denotes a continuous independent variable, such as weight, for which a regression is to be fitted. The following columns are some fixed effect, such as sex, a random component, such as herd and the animal identification. Not all effects were fitted for both traits. In fact, weight was only fitted for the first trait as shown by the model matrix in table IIZ The input data are translated into a series of matrices given in table IV. To improve numerical stability, dependent variables are scaled by their standard deviation and mean, while the continuous dependent variable is shifted by its mean only.
Since there is only one random effect (apart from the correlated animal effect), the full element formulation [13] has three types of model equations, each with an independent covariance structure C. Y .
Measurement elements (type y = 1): the dependent variables give rise to type = 1 as listed in the second column in table IV. The second entry is special in that it denotes the residual covariance matrix for this record with a missing observation. To take care of this, a new mtype is created for each pattern of missing values (with mtype = type if no value is missing) [20]; i.e. the different values of mtype correspond to the different matrices C!. However, it is still based on C 1 as given in table V which lists all types in this example.
Pedigree elements (type q = 2): the next nine rows in table IV are generated from the pedigree information. With both parents known, three entries are generated in both the address and coefficient matrices. With only one parent known, two addresses and coefficients are needed, while only one entry is required if no parent information is available. For all entries the type is -y = 2 with the covariance matrix C 2 .
Random effect elements (type y = 3): the last four rows in table IV are the entries due to random effects which comprise three herd levels in this example. They have type -y = 3 with the covariance matrix C 3 . All covariance matrices are 2 x 2, so that p = 3 + 3 + 3 = 9 dispersion parameters need to be estimated.
The addresses in the following columns in table IV are derived directly from the level codes in the data (table 77)  The coefficients corresponding to the above addresses are stored in another matrix as given in table IV. The entries are 1 for class effects and continuous variables in the case of regression (shifted by the mean).
The address matrices and coefficient matrices in table IV form a sparse representation of the matrix A of equation (3) and can thus be used directly to set up the normal equations. Note that only one pass through the model equations is required to handle data, random effects and pedigree information. Also, we would like to point out that this algorithm does not require a separate treatment of the numerator relationship matrix. Indeed, the historic problem of obtaining its inverse is completely avoided with this approach.
As an example of how to set up the normal equations, we look at line 12 of Starting values for all C w for the scaled data were chosen as 3 for all variances and 0.0001 for all covariances, amounting to a point in the middle of the parameter space. With C w specified as above we have for its inverse Optimization was performed with a BFGS algorithm as implemented by Gay [12]. For the first function evaluation we obtain a gradient given in As can be seen, the statistical model is different for the various traits.
Because traits 1 through 4 and traits 5 and 6 are measured on different animals no residual covariances can be estimated, resulting in two types la and lb, with 4 x 4 and 2 x 2 covariance matrices C la and C 16 . Together with the 6 x 6 covariance matrices C 2 and C 3 for pedigree effect 9 and random effect 8, respectively, a total of 55 covariance components have to be estimated. The coefficient matrix of the normal equations resulted in 3 961 594 nonzero elements in the upper triangle, which lead to 5 993 686 entries in the Cholesky factor. We compared the finite difference implementation of VCE [17] with an analytic gradient implementation based on the techniques of the present paper. An unconstrained minimization algorithm written by Schnabel et al. [44] that approximates the first derivatives by finite differences was used to estimate all 55 components simultaneously. The run performed 37 021 function evaluations at 111.6 s each on a Hewlett Packard 755 model amounting to a total CPU time of 47.8 days. To our knowledge, it was the first estimate of more than 50 covariance components simultaneously for such a large data set with a completely general model. Factorization was performed by a block sparse Cholesky algorithm due to Ng and Peyton [41].
Using analytic gradients, convergence was reached after 185 iterations taking 13 min each; the less efficient factorization from Misztal and Perez-Enciso [39] was used here because of the availability of their sparse inverse code. An even slightly better solution was reached and only 41 h of CPU time were used, amounting to a measured speed-up factor of nearly 28. However, this speed-up underestimates the superiority of analytical gradients because the factorization used in the Misztal and Perez-Enciso's code is less efficient than Ng and Peyton's block sparse Cholesky factorization used for approximating the gradients by finite differences. Therefore, the following comparison will be based on CPU time measurements made on Misztal and Perez-Enciso's factorization code.
For the above data set the CPU usage of the current implementation -which has not yet been tuned for speed (so the sparse inverse takes three to four times the time for the numerical factorization) -is given in table X. As can be seen from this table computing one approximated gradient by finite differencing takes around 202.6 * 55 = 11 143 s, while one analytical gradient costs only around four times the set-up and solving of the normal equations, i.e. 812 s. Thus, the expected speedup would be around 14. The 37 021 function evaluations required in the run with approximated gradients (which include some linear searches) would have taken 86.8 days with the Misztal and Perez-Enciso code. Thus, the resultant superiority of our new algorithm is nearly 51 for the model under consideration. This is much larger than the expected speed-up of 14 mainly because, with approximated gradients, 673 optimization steps were performed as compared to the 185 with analytical gradients.
Such a high number of iterations with approximated gradients could be observed in many runs with higher numbers of dispersion variables and can be attributed to the reduced accuracy of the approximated gradients. In some extreme cases, the optimization process even aborted when using approximated gradients, whereas analytical gradients yielded correct solutions. 5 Table XI presents data on a number of different runs that have been performed with our new algorithm. The statistical models used in the datasets vary substantially and cover a large range of problems in animal breeding. The new algorithm showed the same behaviour also on a plant breeding dataset (beans) which has a quite different structure as compared to the animal data sets.
The datasets (details can be obtained from the second author) cover a whole range of problem sizes both in terms of linear and covariance components. Accordingly, the number of nonzero elements varies substantially from a few ten thousands up to many millions. Clearly, the number of iterations increases with the number of dispersion variables with a maximum well below 200. Some of the runs estimated covariance matrices with very high correlations well above 0.9. Although this is close to the border of the parameter space it did not seem to slow down convergence, a behaviour that contrasts markedly with that of EM algorithms.
For the above datasets the ratio of obtaining the gradient after and relative to the factorization was between 1.51 and 3.69 substantiating our initial claim that the analytical gradient can be obtained at a small multiple of the CPU time needed to calculate the function value alone. (For the large animal breeding problem described in table X, this ratio was 2.96.) So far, we have not experienced any ratios that were above the value of 4. From this we can conclude that with increasing numbers of dispersion variables our algorithm is inherently superior to approximated gradients by finite differences.
In conclusion, the new version of VCE not only computes analytical gradients much faster than the finite difference approximations (with the superiority increasing with the number of covariance components), but also reduces the number of iterations by a factor of around three, thereby expanding the scope of REML covariance component estimation in animal breeding models considerably. No previous code was able to solve problems of the size that can be handled with this implementation.