Restricted maximum likelihood to estimate variance components for animal models with several random effects using a derivative-free algorithm

Summary - A method is described for the simultaneous estimation of variance components due to several genetic and environmental effects from unbalanced data by restricted maximum likelihood (REML). Estimates are obtained by evaluating the likelihood explicitly and using standard, derivative-free optimization procedures to locate its maximum. The model of analysis considered is the so-called Animal Model which includes the additive genetic merit of animals as a random effect, and incorporates all information on relationships between animals. Furthermore, random effects in addition to animals’ additive genetic effects, such as maternal genetic, dominance or permanent environmental effects are taken into account. Emphasis is placed entirely upon univariate analyses. Simulation is employed to investigate the efficacy of three different maximization techniques and the scope for approximation of sampling errors. Computations are illustrated with a numerical example. variance components - restricted maximum likelihood - animal model - additional random effects - derivative - free approach


INTRODUCTION
Over the last decade, restricted maximum likelihood (REML) has become the method of choice for estimating variance components in animal breeding and related disciplines trying to partition the phenotypic variation into genetic and other components. This has been facilitated not only by an increase in the general level of computational resources available, but by the development of numerous specialized algorithms, exploiting specific features of the data structure or model of analysis as well as utilizing a variety of numerical techniques. So far, REML has found most practical use in the analysis of dairy cattle data under a &dquo;sire model&dquo;. For this model, records of progeny are used only to obtain information on half of their sires' breeding value, while dams and relationships between females are ignored. Recently, interest has increased in more detailed models, in particular the conceptually simplest breeding value or &dquo;Animal Model&dquo; (AM) where each record is taken to provide information on the additive genetic merit of the animal measured. By including animals which do not have records themselves but are parents, this allows for all information on relationships to be taken into account.
A large proportion of REML applications have been restricted to models with one random factor (e.g. sires) apart from random residual errors, estimating two variance components only in a univariate analysis, or p (p -f-1) for a multivariate analysis of p traits. While algorithms for more complicated models have been described, they are by and large computationally demanding. Often they involve inversion of a matrix of size equal to the total number of levels of all random effects fitted. This can be prohibitive for practically sized data sets. Thus REML has found comparatively little use so far for models fitting several random effects.
Maximum likelihood estimation involves, by definition, location of the maximum of the likelihood function for a given set of data, model of analysis and parameters to be estimated. Estimating variance components for unbalanced data generally requires iterative schemes. Standard textbooks on numerical analysis classify procedures to find the optimum (minimum or maximum) of a function according to the amount of information required from derivatives of the function. The so-called Newton methods utilize both first and second derivatives, i.e. geometrically speaking slope and curvature, and are thus quickest to converge. Methods relying on first derivatives only include steepest descent, conjugate gradient and Quasi-Newton procedures approximating second derivatives. Finally, there are derivative-free methods involving direct search strategies or numerical approximation of derivatives (see for example Gill et al., 1981).
In the main, REML algorithms currently employed in animal breeding fall into the first two categories. Fisher's Method of Scoring is a special case of the Newton procedures, requiring expected values of second derivatives of the log likelihood function (G) to be evaluated. As these are often difficult to obtain, Expection-Maximization (EM) type algorithms (Dempster et al., 1977), exploiting first derivative information, are used more widely.
A derivative-free REML algorithm has been suggested by Graser et al. (1987) for univariate analyses to estimate the additive genetic and error variance under an animal model. Exploiting sparse matrix techniques, they showed that their procedure was suitable for data from large selection experiments involving several thousand animals. This paper describes the use of a derivative-free approach to estimate variance components by REML for AMs which include not only animals' additive genetic merit but also additional random effects, and thus cover a wide range of models suitable for the analysis of animal breeding data. Univariate analyses only are considered at present, extensions to multivariate situations will be discussed elsewhere.

CALCULATING THE LIKELIHOOD The Model
Let: denote the linear model of analysis with: Y the vector of N observations, b the vector of NF fixed effects (including any linear of higher order covariables) X the NxNF incidence or design matrix for fixed effects with column rank NF * , u the vector of all NR random effects fitted, Z the NxNR incidence matrix for random effects, and e the vector of N random residual errors. Assume: &dquo;'{TI--B r&dquo;I which gives: The mixed model equations (MME) pertaining to [1] are then (Henderson, 1973): or C F = r. If C is not of full rank, as it is often the case, estimates for b are not unique.
The Likelihood REML operates on the likelihood of linear functions of the data vector with expectations zero, so-called error contrasts, or, equivalently, on the part of the likelihood (of the data vector) which is independent of fixed effects. This results in the loss in degres of freedom due to fitting of fixed effects being taken into account (Patterson & Thompson, 1971). For Y !N(Xb, V), the log likelihood is (e.g. Harville 1977): -where X * (of order NxNF * ) is a full rank submatrix of X. Using matrix equalities given by Harville (1977) and Searle (1979), [3] can be rewritten as: where C * is the coefficient ' matrix in [2] with X replaced by X * , and P is a matrix: Calculation of the first two terms required in [4] depends on the specific structure of R and G in a given analysis. The latter two, however, can be determined in a general fashion, as suggested by Graser et al. (1987), by Gaussian Elimination (as described in most Numerical Analysis textbooks, or by Smith & Graser (1986)) applied to the mixed model array: the coefficient matrix in [2] augmented by the right hand side and a quadratic in the data vector. Calculation ofy'Py and log C*! The mixed model array for [1] is: &dquo;Absorbing&dquo; rows and columns pertaining to random effects into the rest to the matrix then gives: and eliminating rows and columns for fixed effects correspondingly, yields y!Py, the weighted sum of squared residuals required to evaluate log .C. Absorption is most easily carried out by Gaussian elimination: repeated absorption of one row and column at a time. This will also allow log I C * to be determined simultaneously. Subdivide M of size KxK (K=NF + NR + 1) with elements m2! and column vectors mi into rows 1 to K&mdash;1, and row K: Partitioned matrix results then give with Mx_ 1 * = M K -1 -M KM 'K/M KK = I M ij-M i K mjk!mKK} _ fm ij * } the matrix resulting when &dquo;absorbing&dquo; row and column K, or &dquo;pivoting&dquo; on mK x . Repeated use of this result shows that the required determinant is then simply the sum of the log of pivots log mii*, i = 2,..., K) arising when absorbing all rows and columns of M into the first row, as required to evaluate jrpy. If X is not of full rank, M has to be set up replacing X by X * or, equivalently, absorptions have to be carried out skipping the NF-NF * rows with zero pivots.

Univariate analyses
Results presented so far hold for any model of form (1!. Consider now a univariate analysis with identically and independently distributed errors, i.e. For given values of the other variance components, the error variance can be estimated directly in this case, from the residual sums of squares as (see Harville, 1977;or Graser et al., 1987) Let the other parameters to be estimated, i.e. (co)variances of the random effects fitted, be denoted by oi with i = 1, ..., p -l, and p the total number of components with up = 0-2 E* As discussed by Harville & Callanan (1988), a function of REML estimates of a set of parameters is also the REML estimate of this function. Hence, instead of maximizing log G with respect to the p components Qi , we can reparameterize to 9 and p&mdash;1 functions fi (!i, (TÐ of the other components and the error variance. An obvious choice is to express the Q i as a proportion (À i i/u2 ) of the latter, so that having found REML estimates of u § and the a i , we can estimate 6 i = 62 E' Furthermore, for fixed values of A i , log G attains its maximum with respect to u § at the REML estimate of U2 E. This allows estimation to be conducted in two steps: Firstly, a &dquo;concentrated&dquo; likelihood is maximized with respect to the A i only which yields REML estimates !i. Secondly, &2 is obtained (from [9]) for the iz (Harville & Callanan, 1988). The advantage of this approach is that it reduces the dimension of the numerical search for the maximum of log L by one. As the number of iterates and likelihoods to be evaluated to find the maximum of log L usually increases substantially with the number of parameters to be estimated, this can lead to a considerable saving in computational resources required. From [8] it follows immediately that: Log !G! depends on the random effects fitted. For the simplest model with animals as the only random effect, as considered by Graser et al. (1987): where QA is the additive genetic variance, A the numerator relationship matrix between animals, a the vector of (direct) genetic effects for animals, and NA denotes the number of animals. Since log IAI does not depend on the parameters to be estimated, it is a constant and does not need to be calculated in order to maximize log G. The inverse of A is required in [6] (for G-1 ) though, but this can be set up efficiently from a list of pedigree information, following rules described, for instance, by Quaas (1976).
Often, animals in the same environmental subclass are subject to a so-called common environment effect, for example a pen or litter effect in pig or mouse data. Let c of length NC denote a vector of such effects to be included in the model of analysis, with . This gives: In other cases, the model of analysis may involve two random effects for each animal. Let m, of length NA, denote the second animal effect and assume each element has variance a ' . If there are repeated records per animal for a trait, m represents the permanent effects due to animals, excluding additive genetic effects. These are usually assumed to be uncorrelated with any other effects in the model, so that If m had variance 0 & d q u o ; ! 1 D, [13] would be augmented by log !D!. As with log IAI, this term is constant and does not need to be evaluated. Note though that G-1 and consequently D-1 is required in (6!. A typical example for this kind of structure is a model where m stands for dominance effects, u m 2 for the respective variance and D for the dominance covariance matrix among animals. For other traits, for example measures of reproductive performance, we distinguish between a direct and a maternal (or paternal) additive-genetic component, allowing for a covariance between the two. In that situation, there may not be a record supplying information on m for each animal, but information is acquired indirectly via links arising from the genetic covariance and relationships. With UAM denoting the covariance between a and m and r AM the corresponding correlation, and partitioned matrix results give For all models discussed so far, computational requirements to determine the part of log !G!, which depends on the parameters to be estimated, are trivial. This results from random effects being either uncorrelated, so that G is blockdiagonal (!12! and !13!), or G being the direct product of a matrix of parameters and a matrix describing correlations amongst levels of random effects as in [14]. Extensions to other models are straightforward, as long as G can be partitioned into blocks of such structure. For example, fitting permanent environmental effects (c) as well as direct and maternal additive genetic effects, [14] would be augmented simply by (NC log & O E l i g ; & e t h ; ) , provided c was uncorrelated to a and m. Table I summarizes log ,C for 10 models which may arise in the analysis of animal breeding data, with up to 3 random effects and involving up to 5 (co) variance components. Otherwise, G (or a submatrix thereof) needs to be set up explicitly and its determinant be obtained using techniques as described above for log !C*!. For instance, if G contained a block of form the contribution to log IGI would be Assume V(a ) = QA A, Cov(a, c') = Cov(m, c') = 0 and V(c) = o-c I for all models.
Terms are assumed to be the result of Gaussian Eliminations performed for M with aE 2 factored out.
Terms in light italic are constant and not required to maximize the likelihood.

Computational Considerations
Typically, the augmented coefficient matrix M is very large but also very sparse. Hence use of sparse matrix techniques, storing the non-zero elements of M only, is advantageous and allows matrices of order of thousands to be handled. Since M is symmetric, only the lower (or upper) triangle is required. One form of sparse matrix storage, described in standard text books such as Knuth (1973), is a socalled &dquo;linked list&dquo; . Such linked lists, one list for each row of M in conjunction with a vector pointing to the first element in each row, are well suited, and allow the Gaussian Elimination steps required to evaluate y!Py and log !C* ! to be carried out efficiently.
In setting up M, the order of equations can be of vital importance as it affects the &dquo;fill-in&dquo; during the absorption process, i.e. the number of additional non-zero offdiagonal elements arising. For computational efficiency this should be kept as small as possible. There is extensive literature concerned with numerical operations on sparse matrices. Tewarson (1973), for example, discusses techniques for the choice of pivot in each Gaussian Elimination step which yields the least local fill-in, and also considers the scope of a priori column permutations. A number of strategies for re-ordering matrices exists, often utilizing graph theory; see for instance Duff et al. (1986). Such general techniques, making little or no assumptions about the matrix structure can be computationally expensive. This may be prohibitive for situations where the direct solution of a large sparse system of equations is required a few times only, but may be worthwhile for our application where numerous likelihood evaluations are to be performed. Future research should consider this topic.
In the meantime, critical inspection of the data and relationship structure with their implications for the pattern of off-diagonal elements in the mixed model array, and judicious ordering of effects may achieve a large proportion of the potential benefits from general reordering algorithms. A standard strategy in attempting to minimize fill-in is to process rows with the fewest off-diagonals first. Graser et al. (1987) therefore suggested selection of pivots corresponding to the youngest animals first. For the models with several random effects for each animal, these should be assigned to successive rows. In other cases, it may be possible to exploit additional features of the data structure. For data from a multi-generation selection experiment with selection within families, for example, grouping of animals according to female &dquo;founders&dquo; appears preferable to a grouping according to generation. On the other hand, if animals are nested within contemporary (fixed) groups, it may be advantageous to order equations so that animals directly follow their group effects.
For R of form (8], QE!' is usually factored from (6]. In this case, calculations to determine Y Py and log [C * as described above, do not yield the terms required in [4] directly, but (y p y !E) and (log IC * I + (NF * +NR) log U2 ), which has to be born in mind when assembling the likelihood.

MAXIMIZING THE LIKELIHOOD
Choice of a strategy to locate the maximum of the likelihood function, or equivalently the minimum of -2 log G, is determined by several considerations. Firstly, each function evaluation, i.e. likelihood calculation, is computationally very much more demanding than any calculations required by the optimization procedure as such. Hence, a method which requires a small number of function evaluations in each iterate is desirable. Secondly, the procedure should be robust, i.e. cope with starting values for parameters considerably different from the eventual estimates, and should be little affected by problems of numerical accuracy, yielding sufficiently precise estimates of the minimum even for very flat functions. Thirdly, constraints on the parameter space should be accommodated and, preferably, not require extra function values or reduce the speed of convergence.
The suitability of three different approaches was examined using simulated data for models 1, 2, 4 and 8 as specified in Table I. Records were sampled according to the model of analysis for one or several generations (up to four), each comprising a given number of full-sib families (ranging from 25 to 800) of variable size (2 to 10), with dams nested within sires and each sire mated to a specified number of dams (1 to 5). Error variances were estimated directly, while all other components were expressed as a proportion of the phenotypic variance, i.e., 8 A , Om, Oc and O AM for a 2, 0fi, 02 and Q p M , respectively. Obviously, B A is the heritability and Oc what is commonly referred to as &dquo;c 2 effect&dquo; . As described above, this reduced the dimension of search to 1, 2, 3 and 4 for Models 1, 2, 4 and 8, respectively. This parameterization rather than expressing components as a proportion of the error variance (À¡) was chosen since it allowed checks for parameter estimates out of bounds more readily and, for the limited cases examined, as it appeared to be more robust against bad starting values.
Quadratic approximation For a model with animals as the only random effect, Graser et al. (1987) fitted a quadratic function in r = 0,2!la2 E to the log likelihood, predicting the maximum of log G as the maximum of this function. For one parameter, this required function values for 3 different r values per approximation. Having calculated log £ for 3 initial points, each iterate then involved one function evaluation, for r * which maximized the quadratic function of the previous step. This value and those pertaining to the two r values either side closest to r * were then utilized in determining the next quadratic approximation to log G. As reported by Graser et al. (1987), simulations for this model showed rapid convergence. A bad initial guess for r generally did not affect the estimation procedure greatly, as long as the three points in the initial approximation spanned a sufficiently large range. Though the number of iterates and likelihood evaluations required tended to increase, the same maximum of log G as for &dquo;good&dquo; starting values was attained without increasing computational demands excessively. This approach extends to the case of multiple parameters. For t, with elements Bi, denoting the vector of parameters with respect to which log G is to be maximized, and log G (t) the corresponding log likelihood, the quadratic approximation is: The vector maximizing [16] is then, for Q positive definite, For p parameters, a total of z = 1 + p(p +3)/2 different values of t and log G (t) are required in each iterate to set up and solve a system of z equations for the intercept q, the vector of linear coefficients q and the symmetric matrix of quadratic coefficients Q. This number increases rapidly with the number of paramaters, e.g, z = 6, 10, 15 and 21 for p = 2, 3, 4 and 5, respectively.
For one parameter, choice of the point to be replaced in each iterate was straightforward. In the multi-dimensional case, however, it was less obvious. Two strategies were explored. After z initial points had been obtained, the first involved, as for p =1, in the regular case one function evaluation per iterate, i.e. calculation of log G (t * ) for t * from the last iterate. This new point was added to the set of z points which formed the basis to predict t * in the previous step. The worst of the resulting set of z + 1 points was then eliminated, and a new vector t * determined. If the quadratic approximation failed, i. e. if log £ (t * ) was lower than all z function values in the set, t * was replaced by (t * + t m ) / 2, where t nt was the parameter vector with highest function value in the set. If necessary, this was repeated until the replacement was successful. Hence, each iterate increased the average likelihood of the z current points.
The second strategy comprised z function evaluations per iterate. Given a vector of starting values to (t * from the previous iterate), np vectors t i were derived by multiplying the i-th element of to by a factor reflecting a chosen step size, 1.10 for steps of 10% in this case. Following a scheme described by Nelder & Mead (1965), further parameter vectors were then determined as (ti + tj ) / 2 for i < j = 0, ..., p.
This yielded the required total of z grid points and subsequent estimate t * . For both strategies, all vectors t were checked for elements out of the parameter space, and if necessary these were set to their respective bounds.
The quadratic approximation performed well for Model 2, though, for the limited number of examples considered, it was not consistently better than the two alternative procedures studied, in terms of the number of likelihood evaluations required. For Models 4 and 8, however, where the data structure was such that only a small proportion of animals had direct information on the second genetic effect, problems of numerical accuracy occurred. Often the system of z equations to be solved was indeterminate or almost so. Typically this yielded non-positive definite estimates of Q and useless predictions of t * . For the second strategy, an alternative approach, slightly more robust, was tried. This consisted of estimating elements of q and Q by numerical differentiation, i.e. as forward-difference approximations to the first and second derivatives of log G, respectively.
On the whole, quadratic approximation of the likelihood function involving multiple parameters appeared to be unsuitable as a general search procedure. For a one-dimensional search, however, it performed consistently best among the 3 strategies examined.

Quasi-Newton
Procedures which do not require second derivatives of the function to be minimized, but approximate the Hessian matrix (= matrix of second derivatives) are referred to as Quasi-Newton methods. This approximation is usually performed iteratively, starting from an identity matrix, utilizing rank-two update techniques based on the vectors of changes in gradients (= first derivatives) and estimates between iterates.
While most Quasi-Newton procedures require first derivatives, some are derivativefree, approximating the vector of gradients using finite differences. These have been. found to show quadratic convergence and are recommended as the derivative-free methods to be used for smoth functions with continuous derivatives. Further details are beyond the scope of this paper and can be found in standard textbooks, for instance Gill et al. (1981).
Statistical library subroutines to find the minimum of a function using a Quasi-Newton algorithm, namely NAG routine E04JBF and IMSL routine ZXMIN, have been applied to -2 log 1 :-. These routines require the user to supply a subroutine to evaluate the function to be minimized, passing the number and vector of parameters as arguments and returning the function value. In addition, starting values for the parameters, the maximum number of iterates or function calls allowed and some criteria of accuracy of evaluation required and rounding errors tolerated have to be specified. E04JBF provided the facility to constrain parameters between fixed upper and lower bounds (the IMSL equivalent ZXMWD was not tested), i.e. 0 to 1 for 0 A , 0 M and 0c, and -1 to 1 for B AM . However, to impose these constraints, the routine required function values, setting all parameters simultaneously to their upper or lower limits. Obviously, this violated other restrictions on the parameter space in the genetic context, i.e. that the sum of components is bounded correspondingly (0 < 8 A + 8 M + 9 c + 9 n n,t <_ 1), and that the absolute value of the genetic correlation has a maximum of unity (9A M <_ 0 A 0 M ). Consequently, log G could often not be determined for the required constellation since M became negative definite or Y Py assumed a negative value. Hence minimization was carried out unconstrained. Techniques to implement more complicated constraints exist, and further research should investigate their suitabililty for the kind of models which are of interest in animal breeding. Unless a parameter vector was encountered for which -2 log G could not be evaluated, the Quasi-Newton algorithms performed well for all models examined. Each iterate required p function evaluations to approximate the vector of first derivatives. The number of iterates performed depended on user-specified criteria of accuracy and maximum number of function evaluations allowed. If likelihood functions were wery flat, routines would stop before the minimum of -2 log G was determined as accurately as desired, flagging problems of numerical accuracy. Figure 1 illustrates the typical pattern of changes in likelihood and estimates observed for an analysis under Model 4 for a &dquo;good&dquo; and a &dquo;bad&dquo; initial guess of parameter values. The simulated data for this example comprised 2 generations with 100 full-sib families of size 4 to 8 and 25 half-sib families each. Records were sampled for population values of o-P = 100 (phenotypic variance), 9 A = 0.50, Om = 0.20 and I T AM = -0.05. Starting values used were the population values (Set I) and ITA = 0.30, o-M = 0.15 and ITAM = 0.05 (Set II), respectively. For Set I, ZXMIN required 93 likelihood evaluations, for a given significance level of 6 significant digits. For Set II, however, the routine used 204 function calls before it considered the maximum of -2 log G to be found, although Figure 1 suggests that likelihood and estimates were essentially identical after 60 function evaluations.

Simplex
The Simplex method of Nelder & Mead (1965) is generally advocated as the derivative-free procedure to use if the multivariate function to be minimized is discontinuous, though, initially, it has been developed with the maximization of a likelihood function in mind. It relies on a comparison of function values without attempting to utilize any statistics related to derivatives of the function. Such optimization techniques are generally refered to as direct search procedures. While they often have been developed by heuristic approaches without proof of convergence, they have found to be effective in practice (Swann, 1972). The Simplex or Polytope method, as some authors prefer to call it to avoid confusion with the Simplex technique used in Linear Programming, was initially suggested by Spendley et al. (1962). It operates on a set of parameter vectors and their pertaining function values, which form the vertices of a simplex in the parameter space, hence its name. As reviewed by Swann (1972), it is based on the conceptes of &dquo;evolutionary operations&dquo; , developed to optimize productivity of industrial plants, in which the parameter space is searched following some geometric configuration. The design which requires the least number of points and hence makes most efficient use of the function values calculated, is a regular simplex. This is defined simply as a set of mutually equidistant points, n + for a simplex of dimension n. For two dimensions, for example, the regular simplex is an equilateral triangle. A useful property of a regular is that a new simplex can be formed the existing simplex by addition of a single new point.
The search proceeds as follows. To begin, a simplex of specified size is set up, including the point representing an initial guess for the optimum, and corresponding function values are obtained. The aim in each iterate then is to replace the worst point, i.e. for a minimization problem the point with the highest function. The new point, defining the next simplex, is chosen so as to preserve the geometric shape in a direction away from the discarded point, but passing through the center of the remaining points. This cycle of rejection and regeneration of a vertex is repeated until the simplex straddles the optimum. Reducing the size of the simplex, search then recommences with the new, smaller design, or terminates if the simplex becomes smaller than a specified limit (Swann 1972).
Subsequently, the Simplex method has been modified by Nelder & Mead (1965).
Abandoning the regularity of design, their procedure allows the simplex to rescale itself automatically in each iterate, changing shape and size according to the landscape of the surface searched. This adaptability is achieved by a combination of so-called reflection, expansion and contraction steps. Consider a function F=f(t) to be minimized with respect to p independent variables, and let to denot the guess for the optimum to start with. The initial simplex then comprises to and p other points t i (i = 1, ..., p) obtained by modifying one co-ordinate of to at a time by a chosen step size. Each iterate begins by ordering and renumbering the points in the simplex according to the pertaining function values. The following three points are identified: to with function value F o is now the currently best point in the simplex (F o < F.t, i = 1, ..., p), tp with function value Fp is the worst point (Fp > Fi, i = 0, ..., p-1) which is to be replaced in this iterate, and tp-, is the next to worst point (Fp_ 1 < Fp and F!,_1 > Fi, i = 0, ..., p -2).
Next, the center of the points to remain in the simplex is obtained as: i.e. by averaging each coordinate over the set of points. A trial point is then generated by &dquo; reflecting&dquo; tp towards the center: where the reflection coefficient a is a positive constant. The corresponding function value F,. is compared with F o to Fp to determine further steps in the iterate. Three possible outcomes are distinguished. Firstly, t r can be better than the worst point but not better than the best point (F o < F r < Fp). In this case, t r replaces tp and a new iterate is started. Secondly, if t,. is the new best point (F,. > F o ), it is assumed that the direction of search has been a good one and search is &dquo;expanded&dquo; in this direction by examining a point with corresponding function value F e . The expansion coefficient 7 is positive constant. If F e is less than F r , the expansion has been successful and t e replace tp. Otherwise t e is discarded and t,. is substituted for tp to complete the iterate.
Thirdly, t r may be worse than any of the remaining p points (F r > Fp_ 1 ). This leads to &dquo;contraction&dquo; of the simplex, obtaining a new point t e with pertaining function value F,, as: if F T is less than Fp, and as otherwise. The contraction coefficient 0 is a constant in the range from 0 to 1. If F e is less than the smaller of F r and F N , the contraction has been successful and tc. replaces tp to end the iterate. If not, the complete simplex is shrunk by moving each point halfway towards the best point, i.e. for i = 1, ..., p, before a new iterate is started. Suggested values for a, ( 3 and 7 are 1.0, 0.5 and 2.0, respectively (Nelder & Mead 1965). Full computational details together with a flow diagram can be found in Nelder & Mead (1965), and a FORTRAN implementation for example in O'Neill (1971).
Since differences in function values are not utilized in establishing the direction of search, restrictions on the parameter space can be imposed easily by assigning a very large positive function value to parameter vectors out of bounds. As pointed out by Nelder & Mead (1965), this will be followed automatically by a contraction step which will eventually keep estimates within their bounds. The convergence criterion suggested by Nelder & Mead (1965) is the variance, or standard deviation, of function values in the simplex, rather than the more conventionally used change in estimates between iterates. The rationale for this was that in statistical problems such as maximum likelihood estimation, the curvature of the surface near the minimum reflects the amount of information available on the parameters. If the surface is flat, sampling errors are large and it is not worthwhile to determine the minimum with great accuracy.
For REML estimation of multiple variance components, the Simplex method proved robust and easy to use. For a given vector of starting values, to, the initial simplex was made up of to and p vectors t i , obtained by multiplying the i-th element of to by 1.20. A factor this large, corresponding to a step size of 20% was chosen to ensure quick convergence even for bad starting values. This implied, however, that for starting values close to the estimates, the procedure would search unnecessarily in the wrong direction. A smaller step size may be sufficient and perhaps more eflicient i.e. yield estimates with less likelihood evaluations. For cases where parameter vectors out of bounds were not a problem, the Quasi-Newton algorithm performed generally somewhat better than the Simplex method. The variance of function values in the Simplex had to be less than 10-8 or even 10-9 for the Simplex to find the minimum of -2 log £ with the same accuracy as ZXMIN (for an accuracy level of 5 or 6 significant digits). In terms of changes in parameter estimates, however, a limit of 10-5 to 10-6 generally appeared sufficient. Figure 2 illustrates the convergence behaviour of the Simplex method for the same analyses as depicted in Figure 1. Oscillations in likelihood and estimates clearly reflect the &dquo;trial and error&dquo; mechanism of this approach. For both sets of starting values, 88 likelihood evaluations were required before the variance of function values in the Simplex dropped below the specified limit of 10-9 .

SAMPLING VARIANCES
The matrix of approximate, large sample covariances among variance component estimates is given by the inverse of the information matrix. This in turn can be approximated by the inverse of the Hessian matrix, i.e. the matrix of second derivatives of the log likelihood function with respect to the parameters to be estimated. A quadratic approximation of log £ and Quasi-Newton procedures provide an approximate Hessian matrix (Q) as a by-product. Using the Simplex method this has to be obtained explicitly. Nelder & Mead (1965) described an appropriate strategy, equivalent to the approximation by numerical differentiation using forward differences (see e.g. Gill et al., 1981). While the mechanics for approximating second derivatives are straightforward, it can be problematic in practice.
As noted when examining the quadratic approximation of log G for multiple parameters, Q was often not positive definite, whether derived by least-squares as described by Smith & Graser (1986) or using finite differences, and consequently yielded inappropriate predictions of the parameters maximizing log G. Similarly, the approximate Hessian produced by Quasi-Newton routines did not necessarily give meaningful estimates of sampling variances. Indeed, a large proportion of the literature on Quasi-Newton algorithms is concerned with procedures to deal with bad or non positive definite approximations to the matrix of second derivatives.
Simulation was employed to examine the sampling distribution of variance component estimates and their predicted variances for models 1, 2 and 4. Data were sampled for one to four generations consisting of 25 to 800 full-sib families of size 2 to 10 each. Dams were nested within sires with each sire mated to 1 to 5 dams. Variance component estimates were obtained using the Simplex method with population parameters as starting values and a convergence criterion of V(-2 log G) < 10-8 . Second derivatives were approximated as described by Nelder & Mead (1965) using a step size of 0.1% of the estimates.
Approximation of sampling variances worked well for model 1, i.e. estimating the additive genetic and error variance only. It was satisfactory for model 2 which included a additional environmental component due to full-sib families (litters), but it failed for analyses under model 4, including a maternal genetic effect as an additional random effect for each animal.
Table II summarizes empirical and predicted sampling variances and empirical correlations between estimates for a variety of design for model 2. Results clearly emphasize the need to ensure that the data provide sufficient information to estimate all effects fitted. Considering only one generation, variance components were derived solely from the covariances between full-and half-sibs, causing a high negative sampling correlation between a 2 and <7!. or, equivalently, between heritability and c 2 effect. In terms of the likelihood surface, this implied a maximum along a flat ridge, i.e. an area where for a constant sum of the two parameters the value of the likelihood changed very little with changes in the parameter values. If each sire was mated to only one dam, i.e. there were no half-sib families, there was little scope to partition the within family variance into its genetic and environmental components. This yielded a likelihood surface of a shape which did not allow sampling variances to be approximated.
Including data for a second generation provided the covariance between parents and their offspring as an additional source of information, thus allowing a considerably better discrimination between 0 &dquo;1 and o-c 2 as evidenced by the decreased (absolute value) sampling correlation between the two components. While for one generation sampling variances decreased with increasing number of dams per sire the reverse generally held for data spanning several generations.
These relationships are also illustrated in Figure 3 for analyses under Model 4. Considering one generation only (top row), no animal had records expressing both its direct and maternal genetic effects. Hence the sampling correlation between 0 &dquo;1 and am 2 was large and negative. This was accompanied by little variation in estimates of QAM depicting that this was determined by the genetic covariance structure among animals only. For data including 3 generations (bottom row) sufficient comparisons between and within generations were available to yield estimates of Q A and am 2 virtually independent of each other, while estimates of 0 &dquo; AM were highly variable and showed a strong negative association with or M, 2 NUMERICAL EXAMPLE Table III contains simulated data for 282 animals in two generations. Each generation consists of 18 full-sib families of size 6 to 10, where each sire is mated to 3 dams.
Parents of generation one did not have records, which introduced 24 base animals, yielding 306 animals in total. Records were sampled according to Model 8 (see Ta-ble I) for population values of u £ = 40, u % = 15,0&dquo; AM = -5, u$ = 10 and U2 = 50, a phenotypic mean of 200 and a fixed effect of 20 for generation 2.
Data were analysed under Models 1, 2, 3, 4, 7 and 8, as described in Table I. Analyses were carried out using the Simplex method to find the maximum of the likelihood. Two sets of starting values were chosen, namely the population values as a good (Set I: B A =0.40, 0Nj = 0.15, O AM = -0.05 and Oc = 0.10) and B A = 0.10, om = 0.30, O AM = 0.10, Oc = 0.20 (= Set II) as a bad initial guess. For models other than 8, the appropriate subset of parameters was used. Estimates for two values for the minimum variance of function values in the Simplex are summarized in Table  IV. Characteristics of the augmented coefficient matrix M and intermediate results for the first likelihood evaluated in each run are given to help with the validation of computer programs. Gaussian elimination steps are not dependent on the model of analysis, hence the example given by Graser et al. (1987) should suffice as an illustration. Numerical examples to be used in checking the building of the mixed model array for the various models can be found elsewhere in the literature. In particular, Henderson (1988) gives examples for a variety of animal models. For model 1, data were also analyzed using an EM-algorithm with tridiagonalisation of the coefficient matrix as described by Smith & Graser (1986). The derivative-free method required 0.6 seconds CPU time for set-up operations and 2.0 seconds to obtain estimates, using the Simplex procedure with population parameters as starting values (Run I) and performing 16 likelihood evaluations. The quadratic approximation for this case was somewhat faster, needing 1.5 seconds and 12 likelihood evaluations. The EM-algorithm required 36.0 seconds for various set-up steps, 31.0 seconds alone to tridiagonalize the coefficient matrix of order 306. Estimation was then fast, needing 0.7 s only to carry out 120 iterates. Part of the differences in computational requirements could be attributed to differences in computing techniques: programs for the EM-algorithm operated on the full coefhcient matrix (though halfstored) rather than the derivative-free method which dealt with its non-zero elements only. Smith & Graser's (1986) procedure could also have been utilised to perform an analysis under model 2. This would have required at least 5 tridiagonalizations of the coefficient matrix, however, i.e. more than 180 seconds CPU time, in contrast to 9.5 seconds used by the derivative-free algorithm (Run I).
CONCLUSIONS Direct maximization of the likelihood provides an attractive alternative to REML algorithms relying on information from derivatives. Though it has been discussed here only with reference to Animal Models, it is extremely flexible and can be adapted to a wide range of models of interest in the analysis of animal breeding data. Graser et al. (1987) describe the application to a Reduced Animal Model which reduces the number of Gaussian Elimination steps required by not setting up equations for animals which do not have progeny but absorbing these directly into equations for their parents. Though not considered here, this equivalent model can be used to reduce computations for a number of animal models containing additional random effects though it is generally not feasible for non-additive genetic models; see Henderson (1988) for a discussion and examples. For AMs, a set of computer programs has been written accommodates all 10 models of Table I (Meyer 1988).
Using sparse matrix techniques, models involving several thousand random effects levels can be handled computationally. Limitations on models and size and structure of data sets are imposed by the fact that each analysis requires numerous likelihood evaluations.
For univariate analyses, a reduction in the dimension of search is possible by estimating the variance of residual errors directly. The Simplex method is recommended as a robust and easy-to-use optimization procedure when the likelihood is to be maximized with respect to several parameters. For one parameter, the quadratic approximation used by Graser et al. (1987) appeared best. Extensions to multivariate analyses are straightforward though computationally considerably more demanding and will be considered in a subsequent paper.
Simulation showed the Simplex method to perform well. Means of estimates over replicates agreed with the population values for which data were sampled, indicating that the global rather than a local maximum of the likelihood function had been located. Limited work investigated the effect of different starting values on parameter estimates: while the number of likelihood evaluations required to obtain estimates varied with the initial guess, estimates for a particular data set did not depend on it. This suggests that local maxima are not a problem in derivativefree REML estimation. H6schele (1988) recently reported that REML is less likely to converge to local maxima than Bayesian procedures of variance component estimation. In general, however, this remains an as yet unsolved problem. Restarting the search at the predicted maximum or repeating it with different starting values will provide a reasonable degree of confidence in the assumption that the global maximum has been found, but we cannot be sure. Especially for complicated designs, the number of maxima of the likelihood surface over the parameter space is not known.
For models fitting animals' additive genetic merit as the only genetic effect, approximation of second derivatives of the likelihood function provided appropriate estimates of sampling covariances between estimates. For models containing other genetic effects, both non-additive genetic and maternal genetic, large negative sampling correlations between estimates were observed for a number of designs. This resulted in a shape of the likelihood surface which in general did not allow second derivatives to be approximated by numerical differentiation.