Parameter expansion for estimation of reduced rank covariance matrices (Open Access publication)

Parameter expanded and standard expectation maximisation algorithms are described for reduced rank estimation of covariance matrices by restricted maximum likelihood, fitting the leading principal components only. Convergence behaviour of these algorithms is examined for several examples and contrasted to that of the average information algorithm, and implications for practical analyses are discussed. It is shown that expectation maximisation type algorithms are readily adapted to reduced rank estimation and converge reliably. However, as is well known for the full rank case, the convergence is linear and thus slow. Hence, these algorithms are most useful in combination with the quadratically convergent average information algorithm, in particular in the initial stages of an iterative solution scheme.


INTRODUCTION
Restricted maximum likelihood (REML) is one of the preferred methods for estimation of genetic parameters in animal breeding applications. Algorithms available to locate the maximum of the likelihood function differ in efficiency, computational requirements, ease of implementation and sensitivity to starting values in iterative schemes. The so-called 'average information' algorithm has been found to be highly effective, often converging in few rounds of iteration [40]. However, there have been some, albeit largely anecdotal, observations of convergence problems for analyses with 'bad' starting values, many 4 K. Meyer random effects or large numbers of traits. On the other hand, 'expectationmaximisation' (EM) type methods are noted for their stability, yielding estimates within the parameter space and an increase in likelihood with each iterate. Unfortunately, these desirable features often come at the price of rather slow convergence rates.
Over the last decade or so, a number of new, 'fast' EM procedures have been proposed. Of particular interest is the PX-EM or 'parameter expanded' algorithm of Liu et al. [20]. Foulley and van Dyk [6] considered its application for several types of mixed model analyses, demonstrating a dramatic increase in speed of convergence over the standard EM algorithm. Yet, there has been virtually no practical use in variance component estimation so far.
Covariance matrices in multivariate analyses by and large have been treated as 'unstructured', i.e. apart from symmetry and requiring eigenvalues to be non-negative, no further assumption is made. There has been growing interest, however, in analyses considering the leading 'factors' or 'principal components' of a set of correlated effects only. As discussed by Kirkpatrick and Meyer [16], omitting any factors explaining negligible variation reduces the number of parameters to be estimated, yielding a highly parsimonious model. The resulting estimates of covariance matrices then have a factor-analytic structure e.g. [15] or, assuming specific variances are zero, have reduced rank (RdR). Average information algorithms for these scenarios have been described by Thompson et al. [39] and Meyer and Kirkpatrick [29], respectively.
On closer inspection, it is evident that the PX-EM algorithm [20] involves a reparameterisation of the standard, linear mixed model of the same form as REML algorithms to estimate RdR covariance matrices [29]. This can be exploited to obtain EM type estimators for factorial and RdR models. After a brief review of pertinent algorithms, this paper extends the approach of Foulley and van Dyk [6] to EM and PX-EM estimation for models fitting the leading principal components only. Convergence behaviour of the resulting algorithms is examined for a number of practical examples, and contrasted to that of the average information algorithm.

REVIEW
Maximum likelihood estimation of variance components almost invariably represents a constrained optimisation problem which needs to be solved iteratively [8].
PX-EM for PC models 5

Average information algorithm
A widely used optimisation procedure is the Newton-Raphson (NR) algorithm. It utilises both first and second derivatives of the function to be optimised, and thus provides an efficient search strategy e.g. [35]. A particular variant of NR used in REML analyses is the 'average information' (AI) algorithm, proposed by Thompson and co-workers (see [40]), which replaces second derivatives of log L by the average of observed and expected values. NR algorithms perform unconstrained optimisation while REML estimates are required to be within the bounds of the parameter space [8]. Fortunately, constraints are readily implemented by estimating functions of the variance components for which the parameter space is not limited. Pinheiro and Bates [36] compare several options. The most commonly used is a parameterisation to the elements of the Cholesky decompositions of the covariance matrices, taking logarithmic values of the diagonal elements [19,31]. As well as enforcing permissible estimates, this can improve rates of convergence of iterative maximisation schemes [7,24]. In addition, NR type algorithms do not guarantee log L to increase. While an initial, small step in the 'wrong direction' might result in a better position for subsequent steps, NR algorithms frequently do not recover from steps away from the maximum of log L (log L max ). The step size in a NR iterate is proportional to the product of the inverse of the information (or AI) matrix and the vector of first derivatives of log L. A simple modification to control 'overshooting' is to reduce the step size until an increase in log L is achieved.
Optimisation theory divides the convergence of NR algorithms into two phases [1]: Phase I comprises iterates sufficiently far away from log L max that step sizes need to be 'damped' to increase log L. Convergence in this phase is generally at least linear. Jennrich and Sampson [14] suggested a simple strategy of successive 'step halving' for this purpose. More sophisticated, 'backtracking' line search algorithms are available which attempt to optimise step sizes and guarantee convergence; see, for instance, Boyd and Vandenberghe [1], Chapter 9. In particular, Dennis and Schnabel [4] describe a quadratic approximation to choose a scale factor τ. Utilising derivatives of log L yields an estimate of τ without the need for an additional function evaluation. If this step size fails to improve log L, updates can be obtained using a cubic approximation. Phase II, the 'pure' Newton phase, is reached when no further step size modifications are required. Typically, this phase shows quadratic convergence rates and involves relatively few iterates.
In addition, successful optimisation via NR algorithms requires the Hessian matrix (or its approximation) to be positive definite. While this is guaranteed 6 K. Meyer for the AI matrix, which is a matrix of sums of squares and crossproducts, it can have eigenvalues close to zero or a large condition number (i.e. ratio of largest to smallest eigenvalue). Such ill-conditioning can result in a vector of overly large step sizes which, in turn, may need excessive scaling (τ 1) to enforce an increase in log L, and thus hamper convergence. It is then advisable to modify the Hessian to ensure that it is 'safely' positive definite. Strategies based on the Cholesky decomposition of the Hessian matrix have been described [5,37] that are suitable for large optimisation problems. For problems small enough to compute the eigenvalues of the Hessian matrix, we can directly modify the vector of eigenvalues and compute a corresponding modified Hessian matrix, or add a small multiple of the identity matrix. The latter results in an update of the parameters intermediate between that from a NR step and a method of steepest descent algorithm. Choices of modification and for minimum eigenvalues are discussed by Nocedahl and Wright [35], Chapter 6.

Expectation maximisation algorithm
A widely used alternative to NR for maximum likelihood estimation is the EM algorithm, described by Dempster et al. [3]. It involves computing the expectation of the (log) likelihood, pretending any 'missing data' are known, the so-called E-step. Secondly, in the M-step, this expectation is maximised with respect to the parameters to be estimated; see, for example, Ng et al. [34] for an exposé, or McLachlan and Krishnan [21] for an in-depth treatment. The popularity of the EM type algorithm is, in part at least, due to its property of monotone convergence under fairly general conditions, i.e. that the likelihood increases in each iterate. In addition, for variance component problems based on the linear, mixed model, estimates are guaranteed to be within the parameter space, and terms in the estimators are usually much easier to calculate than those for NR type methods. An early formulation for an EM type algorithm to estimate covariances for multiple trait models has been presented by Henderson [11].
The main disadvantage of EM type algorithms is that they can be rather slow to converge. While NR methods are expected to exhibit quadratic rates of convergence, EM algorithms are expected to converge linearly [34]. This behaviour has motivated numerous modifications of the basic EM algorithm, aimed at improving its rate of convergence. In the simplest cases, it is attempted to predict changes in parameters based on changes over the past iterates, e.g. the 'accelerated EM' [17], which employs a multivariate form of Aitken acceleration. Other modifications involve approximations to derivatives 7 of the likelihood to yield Quasi-Newton e.g. [13,22] or gradient type procedures e.g. [12,18]. In addition, several generalised EM type algorithms have been proposed over the last decade. Strategies employed in these include maximisation of the likelihood conditional on subsets of the parameters, switching between the complete and observed likelihoods, or alternating between schemes to augment the observed by the missing data; see Meng and van Dyk [23] for a review.
Less attention has been paid to the effects of choice of parameterisation on convergence behaviour of EM type algorithms. Thompson and Meyer [38] showed that estimation of linear functions of variance components, similar in form to mean squares between random effects in balanced analyses of variance, instead of the variance components could dramatically improve convergence of the EM algorithm. While a reparameterisation to the non-zero elements of Cholesky factors of covariance matrices is routinely used with NR and Quasi-Newton type algorithms e.g. [31,33], this has found virtually no use in practical EM estimation of variance components. Largely this is due to the fact that estimates are ensured to be within the parameter space, so that there is no pressing need for a reparameterisation.
Lindstrom and Bates [19] described an EM algorithm for maximum likelihood and REML estimation in linear mixed models which utilised the Cholesky factorisation of the covariance matrices to be estimated. More recently, Meng and van Dyk [24] and van Dyk [41] proposed EM type algorithms which transformed the vector of random effects in the mixed model to a vector with diagonal covariance matrix, showing that substantial reductions in numbers of iteration could be achieved. The transformation utilised was the inverse of the Cholesky factor of the covariance matrix among random effects, and parameters estimated were the elements of the Cholesky factor.

Parameter expansion
Probably the most interesting proposal among the modern 'fast' EM type methods is the Parameter Expanded (PX) algorithm of Liu et al. [20]. Like the approach of Meng and van Dyk [24] it involves conceptual rescaling of the vector of random effects. However, there are no specific assumptions about the structure of the matrix α defining the transformation. Liu et al. [20] considered application of PX-EM for a number of examples, including a random coefficient, mixed model. Foulley and van Dyk [6] derived detailed formulae for PX-EM based on the standard mixed model equations for common univariate 8 K. Meyer models. As for the standard EM algorithm, the likelihood is ensured to increase in each iterate of the PX-EM algorithm [20].
Briefly, the basic procedure for PX-EM estimation of variance components is as follows [6]: The E-step of the PX-EM algorithm is the same as for standard EM. Similarly, in the first part of the M-step, covariance matrices for random effects, Σ, are estimated 'as usual', i.e. assuming α is equal to an identity matrix. Subsequently, the elements of α are estimated as additional parameters -this represents the expansion of the parameter vector. However, expansion is only temporary: pre-and postmultiplying the estimate of Σ byα andα , respectively, then yields an updated estimate of Σ, effectively collapsing the parameter vector again to its original size. Finally, estimates of the residual covariances are obtained as in the standard EM algorithm, after adjusting estimates of random effects forα.
For most algorithms, computational requirements of REML estimation increase with the number of parameters, both per iterate and overall. Hence it seems somewhat counter-intuitive to estimate a substantial number of additional parameters. For instance, if we have q traits in a multivariate analysis, there are q(q + 1)/2 elements of Σ to be estimated and, making no assumptions about the structure of α, an additional q 2 elements of α. However, the PX-EM algorithm can yield dramatically faster convergence than the standard EM algorithm [6,20].
Loosely speaking, the efficacy of the PX-EM algorithm can be attributed to the additional parameters capturing 'information' which is not utilised in the standard EM algorithm. In each iterate of the EM algorithm we treat the current values of the parameters as if they were the 'true' values, i.e. the values maximising the likelihood. Hence, before convergence, in the E-step the 'missing data' are imputed and the expectation of the complete likelihood is computed with error. This error is larger, the further away we are from log L max . The deviation ofα from the identity matrix gives a measure of the error. Adjusting the estimate of Σ forα effectively involves a regression of the vector of parameters on the vector of differences betweenα and its assumed value in the E-step. Liu et al. [20] described this as a 'covariance adjustment'.

Standard EM
Consider the standard linear, mixed model with y, β, u and e denoting the vectors of observations, fixed effects, random effects and residuals, respectively, and X and Z the corresponding incidence matrices.
The model given by (Eq. 1) is general and encompasses multiple random effects, as well as standard multivariate and random regression models. However, for simplicity of presentation, let u represent a single random effect for q traits, with subvectors u i for i = 1, . . . , q and covariance matrix G = Σ U ⊗ A. For u representing animals' genetic effects, A is the numerator relationship matrix. Σ U is the q × q covariance matrix between random effects with elements σ U i j , and ⊗ denotes the direct matrix product. Assume u and e are uncorrelated, and let Var(e) = R. Further, let Σ E be the matrix of residual covariances with elements σ E i j for i, j = 1, . . . , q. Ordering e according traits within individuals, R is block-diagonal with the k-th block equal to the submatrix of Σ E corresponding to the traits recorded for individual k.
This gives the vector of parameters to be estimated, θ = vech (Σ U ) |vech (Σ E ) of length p (with vech the operator which stacks the columns in the lower triangle of a symmetric matrix into a vector e.g. [9]). Standard formulation considers the likelihood of θ, given the data. Vectors u and β in (Eq. 1) cannot be observed and are thus treated as 'missing data' in the EM algorithm. In the E-step, we need to compute the expectation of the complete data log likelihood (log Q), i.e. the likelihood of θ given y, β and u. This can be split into a part due to the random effects, u, and a part due to residuals, e, [6], Each part comprises a quadratic form in the respective random vector and the inverse of its covariance matrix, and the log determinant of the latter. Strictly speaking, (Eq. 2) (and the following equations) should be given conditional on θ being equal to some current value, θ t , but this has been omitted for clarity; see, for instance, Foulley and van Dyk [6] or Ng et al. [34] for more rigorous formulations.
In the M-step, we take first derivatives of log Q with respect to the elements of θ, θ k . The resulting expressions are equated to zero and solved for θ k , k = 1, . . . , p.

Random effects covariances
Matrix Δ U i j has elements of unity in position i, j and j, i, and zero otherwise. With all subvectors of u of the same length, N U , and using that we obtain -after some rearranging -the well known estimators [11]σ where C is the inverse of the coefficient matrix in the mixed model equations (MME) pertaining to (Eq. 1), and C UU i j is the submatrix of C corresponding to the vectors of random effects for traits i and j, u i and u j .

Residual covariances
Similarly, estimators for the residual covariances σ E i j are obtained setting ∂log Q E /∂σ E i j = 0. Inserting R −1 R into the trace term (in Eq. 3) and rearranging, yields [11] tr Using that R is block-diagonal, we can then rewrite the left hand side of (Eq. 5) as (6) with N the number of individuals, and (Δ E i j ) k for the k-th individual equal to Δ E i j with rows and columns pertaining to traits not recorded set to zero. Likewise, the right hand side of (Eq. 5) can be accumulated across individuals, PX-EM for PC models 11 with X k , Z k and e k the sub-matrices and -vector of X, Z and e, respectively, for the k-th individual. This yields a system of q(q + 1)/2 linear equations to be solved to obtain estimates of θ E = vech (Σ E ) with elements F E i j,mn and t E i j of F E and t E as defined in (Eq. 6) and (Eq. 7), respectively.

PX-EM
For the 'Parameter Expanded' EM algorithm, (Eq. 1) is reparameterised to with Var u + = Σ + U ⊗ A. The elements of α represent the additional parameters to be estimated, i.e. the expanded parameter vector is Θ = vech(Σ + U ) |vech(Σ E ) |vec (α) (with vec the operator which stacks the columns of a matrix into a vector [9]). Depending on assumptions on the structure of α, there are up to q 2 additional parameters.
In the E-step, log Q is conditioned on α = α 0 . Choosing α 0 = I, the E-step is identical to that described above for the standard EM algorithm, i.e. the difference between u + and u is merely conceptual. This implies that steps to set up and manipulate the MME are largely 'as usual', making implementation of the PX-EM algorithm a straightforward extension to standard EM. For the reparameterised model (Eq. 9), e = y − Xβ − Z (I ⊗ α) u + . Hence, for Θ k = α i j only derivatives of log Q E are non-zero. For unstructured α, ∂log Q E /∂α i j has a single non-zero element of unity in position i, j. As shown by Foulley and van Dyk [6], equating derivatives to zero then yields -after some manipulations -a linear system of q 2 equations to be solved,θ α = F −1 α t α with θ α = vec(α). Elements of F α and t α are where u + i and Z i denote the subvector and -matrix of u + and Z, respectively, for trait i, and C XU i is the submatrix of C corresponding to the fixed effects and random effects levels for trait i.

Reduced rank estimation
Considering the direct estimation of principal components (PCs), Meyer and Kirkpatrick [29] reparameterised (Eq. 1) to The eigenvalue decomposition of the covariance matrix among random effects is Σ U = EΛE , with E the matrix of eigenvectors of Σ U and Λ the diagonal matrix of corresponding eigenvalues, λ i . As it is standard practice, let eigenvectors and -values be in descending order of λ i . For Q = E, u comprises random effect values for the PCs of the q traits considered. For Q = EΛ 1/2 , PCs are standardised to variances of unity and Σ U = QQ . This is the parameterisation used by Meyer and Kirkpatrick [29], who truncated Q to columns 1, . . . , r < q to obtain reduced rank estimates of Σ U . A more convenient alternative is Q = L with L the Cholesky factor of Σ U . This uses that L = EΛ 1/2 T with TT = I [9]. Assuming that the Cholesky decomposition has been carried out pivoting on the largest diagonals, this implies that we can obtain reduced rank estimates of a matrix considering the leading PCs only, by estimating the non-zero elements of corresponding columns of L.
At full rank (Eq. 12) gives an equivalent model to (Eq. 1). Truncating Q to the first r < q columns, yields an estimate of Σ U which has, at most, rank r. Clearly, (Eq. 12) is of the same form as (Eq. 9). However, there is a major conceptual difference: essentially, the roles of extra parameters and those of interest are reversed. The 'modifiers' of Z are now the parameters to be estimated, rather than auxiliary quantities. Conversely, the covariance matrix of random effects, Var(u ) is assumed to be an identity matrix for standard EM and AI REML algorithms. In a PX-EM algorithm, these covariances are estimated as additional parameters, Var(u ) = α , which is symmetric with r(r + 1)/2 elements α i j .

Random effects parameters
The mechanics of taking derivatives of log Q E with respect to the elements of Q are analogous to those for α i j in the full rank PX-EM algorithm. However, 13 there is no conditioning on Q = Q 0 = I. Consequently, we need to distinguish MME involving Z and Z . For generality, let Θ k = f q i j where q i j is the i j-th element of Q and f (·) is some function of q i j (but not involving any other elements of Q). This gives a matrix of derivatives Δ Q i j = ∂Q/∂Θ k which has a single non-zero element ω i j = ∂q i j /∂ f q i j in position i, j. In most cases, ω i j is unity. However, if we choose to take logarithmic values of the diagonal elements of L, ω ii = log(q ii ). For Using that ê = y − Xβ − Z u , expanding Q to Q = r m=1 q n=m Δ Q mn f (q mn ) and equating (Eq. 13) to zero then yields, after some rearrangement, (14) with u i the subvector of u for the i-th principal component. Subscript ranges, i = 1, . . . , r and j = i, . . . , q as well as m = 1, . . . , r and j = m, . . . , q in (Eq. 14), pertain to Q consisting of the first r columns of the Cholesky factor L, and are readily adapted to other choices of Q. This gives a system of r(2q − r + 1)/2 linear equations to estimate θ Q consisting of the non-zero elements of vech (Q), with elements C in (Eq. 16) and (Eq. 17) is the inverse of the coefficient matrix in the MME pertaining to (Eq. 12), i.e. involving Z rather than Z, and with numbers of equations proportional to r rather than q, with submatrices as defined above. Similarly, u i and β are the (sub-)vectors of effects in (Eq. 12). Terms Z j R −1 Z n , Z j R −1 X and Z j R −1 y, however, are submatrices and -vectors of the data part of coefficient matrix and right hand side of the mixed model equations on the 'original scale', i.e. pertaining to (Eq. 1). Hence, implementation of an EM algorithm for reduced rank estimation requires part of a second set of MMEproportional to the number of traits q -to be set up for each iterate. 14 K. Meyer

PX-EM: auxiliary parameters
Estimates of α can be obtained in the same way as the estimates of covariance components due to random effects in the standard EM algorithm (see Sect. 3.1.1 above).
for i = 1, . . . , r and j = i, . . . , r, and with C as in (Eq. 16) and (Eq. 17). Updated estimates of Q are then obtained as the first r columns of the Cholesky decomposition ofQα Q .

Examples
The performance of algorithms described above was examined for three, relatively small practical examples analysed previously. Table I summarises characteristics of the data and analyses. Further details can be found in the respective publications. Example 1 (from Meyer and Kirkpatrick [29]) consisted of four 'carcass traits' measured by live ultra-sound scanning of beef cattle in a single herd. Treating records for males and females as different traits, resulted in 8 traits in a multivariate analysis. With distinct subsets, the 16 residual covariances between traits measured on animals of different sex were zero. The model of analysis was a simple animal model, fitting animals' direct additive genetic effects as the only random effect.
Example 2 comprised records for birth, weaning, yearling and final weights of Polled Hereford cattle in the Wokalup selection experiment see [32]. While most animals had records for the first two weights, only replacement animals remaining in the herd after weaning had records for the later weights (35-40% of those with birth weight). The model of analysis fitted direct and maternal additive genetic effects, assuming direct-maternal covariances were zero, as well as maternal permanent environmental effects as random effects.
Example 3 considered repeated records for mature cow weights, also from the Wokalup selection experiment, taken between 19 and 84 months of age. Cows were weighed monthly, except during the calving season. This resulted in up to 63 records per animal, with 75% of cows having at least 13 records. With short mating and calving periods in the experiment, there was a strong association between age at and month of weighing. Previous analyses at the phenotypic level [25] thus had found a strong annual, cyclic pattern in both weights and variances. Hence, analyses fitted a random regression (RR) on quadratic B-splines of age at weighing, with 11 equi-distant knots at 6 months intervals resulting in 13 RR coefficients, for both additive genetic and permanent environmental effects of the animal. Measurement error variances were assumed to be heterogeneous with 12 classes, corresponding to the calendar month of recording.

Analyses
Full rank and RdR estimates of covariance matrices were obtained by REML, employing an AI, standard EM and PX-EM algorithm as well as a combination, consisting of 4 initial iterates of the PX-EM algorithm followed by AI (PX+AI). Residual covariance matrices were assumed to have full rank throughout. The same set of starting values for the covariance components to be estimated was used in all analyses for a particular example. Calculations were carried out using our REML program Wombat [28].
All analyses parameterised to the leading columns of the Cholesky decomposition of the covariance matrices to be estimated, pivoting on the largest K. Meyer diagonal elements. PX-EM and standard EM-algorithms for RdR estimation were implemented as described above (Sect. 3.3). In calculating the sparse inverse of the coefficient matrix (C), only the elements corresponding to the non-zero elements in the Cholesky factorisation of the original matrix were determined. Any other elements which might have been required to compute the terms in (Eq. 14) were treated as if they were zero. Convergence was assumed to have been reached when the change in log L between iterates (ΔL) was less than 10 −6 or if the relative change in the vector of parameters to be estimated, |θ t −θ t−1 |/|θ t |, was less than 10 −7 [6] (with |·| denoting the vector norm, andθ t the estimate of θ from iterate t).
The AI algorithm used was as described by Meyer and Kirkpatrick [29], but parameterising to the leading columns of Cholesky factors (see Sect. 3.3) and calculating the average information as described in the Appendix. Pivots were constrained to a minimum value of 10 −6 and transformed to logarithmic scale if small values (< 0.2) were encountered during the course of iterations. In each iterate, log L was forced to increase by scaling step sizes if necessary, using the line search procedure of Dennis and Schnabel [4]. In addition, the AI matrix was ensured to be 'safely' positive definite, by adding an appropriate multiple of the identity matrix to it, if the smallest eigenvalue was less than the minimum of 0.002 and 10 −6 × λ 1 , with λ 1 representing the largest eigenvalue of the AI matrix. The AI algorithm was deemed to have converged if the ΔL < 10 −5 and the corresponding Newton decrement [1] was greater than −0.01.

Example 1
Starting values for covariance components for Example 1 were the set of 'bad' values used by Meyer [28] to compare PX-EM, EM and AI algorithms for standard, full-rank multivariate REML analyses. These consisted of estimates from four-trait analyses for measures on females, repeated for males and all genetic covariances set to 0.01. Analyses were carried out fitting from 1, . . . , 8 principal components for additive genetic effects. Characteristics of the convergence patterns are summarised in Table II, and Figure 1 shows values of the relative log likelihood, i.e. log L deviated from the highest value found across all corresponding analyses (log L max ), for selected numbers of PCs fitted. With very stringent convergence criteria, almost all analyses for a given number of PCs converged to the same value, up to the third decimal. Table II. Number of iterates (N) needed and deviation of log likelihood (log L) from best value (D, multiplied by 1000) for change in log L between iterates to reach a minimum value, and N for log L to reach a given D, for Example 1. Both EM and PX-EM required hundreds of iterates to locate the maximum of log L. With a linear convergence pattern, reaching a stage where the ΔL dropped to less than 10 −5 generally doubled the amount of iterates required, compared to a less stringent value of 0.005, while increasing log L by less than 0.04. For all orders of fit, estimates of the matrix of auxiliary parameters for PX-EM, α , approached an identity matrix in relatively few iterates. While the PX-EM yielded slightly bigger improvements in log L than the EM algorithm initially, there was only little advantage over standard EM overall, even when all PCs were fitted. In stark contrast, there were substantial differences between the two algorithms for full rank estimation on the original scale [28], K. Meyer i.e., as suggested by Meng and van Dyk [23], parameterisation to the elements of the Cholesky factor greatly improved convergence of the EM algorithm.
In contrast, the AI algorithm converged in few iterates. With a quadratic convergence pattern, generally only a few additional iterates were required when increasing the stringency of the convergence criterion tenfold or more. The eigenvalue for the last PC of the 8 traits was very small (< 0.001). In turn, this yielded an AI matrix with small minimum eigenvalue, so that a constant needed to be added to its diagonal and multiple steps requiring step size scaling. Omitting this PC (Fit 7) removed the need for these control measures and improved the rate of convergence. Reducing the rank of fit further had comparatively little effect on the convergence of the AI algorithm, as long as the eigenvalues of the PCs not fitted were small. Fitting less than 5 PCs, however, there was a trend for the number of iterates required to increase with the number of PCs omitted. This was especially evident for an analysis fitting 2 PCs (see Fig. 1). While this did not cause a need for step size scaling or modification of the AI matrix, there was a sequence iterates with small changes in log L only. For these scenarios, a few initial iterates of the PX-EM algorithm tended to 'bypass' this area of search and thus reduced the number of iterates required by roughly 40%.

Example 2
For Example 2, analyses were carried out fitting all 4 PCs for direct genetic (A), maternal genetic (M), permanent environmental (C) and residual (E) covariance matrices (Model 4444), fitting 3 PCs for A and M and 2 PCs for C (Model 3324), and fitting 2 PCs for A, M and C (Model 2224), yielding 40, 33 and 30 parameters to be estimated, respectively. Convergence characteristics are summarised in Table III. As for Example 1, the PX-EM and EM (not shown) algorithms required substantial numbers of iterates to locate the maximum of log L, while the AI algorithm converged in about 20 iterates. With multiple random effects and highly correlated traits, both RdR analyses shown omitted only PCs with small eigenvalues and thus converged more quickly than the full rank analysis.

Example 3
For Example 3, RdR analyses considered 7 and 9 PCs (Model 79), 5 and 7 PCs (Model 57), and 5 PCs (Model 55) for both genetic and permanent environmental covariances, respectively c.f. [26]. For this example, the number of iterates required for the (PX-)EM algorithm were excessive, especially for the analysis fitting only 5 PCs for both random effects. With relative 'good' starting values, full rank AI (Model 13 13) converged quickly despite representing a highly overparameterised model, requiring 30 iterates for ΔL to drop below 0.0005 with a corresponding deviation from log L max of −0.01; see Table III. For RdR analyses, the number of AI iterates required was again reduced at first (Model 79) but tended to increase when PCs with non-negligible eigenvalues were omitted. The latter was due to a series of AI steps with small, monotonically declining improvements in log L, yielding more a linear than a quadratic convergence pattern.

DISCUSSION
RdR estimation of covariance matrices decreases the number of parameters to be estimated. Moreover, omitting PCs with negligible eigenvalues alleviates 20 K. Meyer problems associated with attempting to estimate parameters close to the boundary of their permissible space, and tends to improve convergence rates compared to full rank analyses. One of the main obstacles in multivariate analyses involving more than a few traits is the computational effort involved. While the size of the MME to be manipulated in REML estimation is proportional to the number of PCs fitted for random effects, the number of operations required in each iterate increases more than quadratically with the number of PCs. Thus even a small reduction in the number of PCs considered can have a dramatic effect on the computational requirements e.g. [27]. For Example 1, for instance, total computing times required using the AI algorithm (with a convergence criterion of ΔL < 0.0005) were 2678, 1076, 723 and 624 seconds for analyses fitting 8, 7, 6 and 5 PCs, respectively (using a 64-bit dual core processor, rated at 2.6 Ghz). The combination of greater stability and faster convergence in estimation and reduction in computational requirements per iterate makes RdR analysis a powerful strategy for higher dimensional multivariate analyses.
Caution is required, however, when reducing the number of PCs fitted beyond those with negligible eigenvalues. As results show, this can increase the number of REML iterates required. Moreover, estimates of both the directions and eigenvalues of the subset of PCs fitted tend to be biassed in this case [30].
The examples chosen represent diverse and difficult analyses involving many parameters and, at full rank, somewhat overparameterised models, applied to relatively small data sets. All algorithms examined were capable of maximising log L. The AI algorithm generally required substantially fewer iterates than the PX-EM or EM algorithm, but stringent control of the AI steps and care in choosing an appropriate parameterisation were needed throughout. Earlier work [2,28], considering the PX-EM algorithm for full rank estimation found it to be most useful in conjunction with the AI algorithm, replacing the first few iterates to reduce problems due to poor starting values or initial overshooting. As shown, the PX-EM algorithm is readily adapted to RdR estimation, and again is most useful combined with the AI algorithm for scenarios where AI performs relatively poorly initially.

CONCLUSION
The PX-EM algorithm is a useful, additional 'weapon' in our armoury for REML estimation of variance components. Reduced rank estimation is highly appealing and can reduce the number of iterates required as well as the computational requirements per iterate, thus making multivariate analyses involving more than a few traits more feasible.