The PX-EM algorithm for fast stable fitting of Henderson's mixed model

This paper presents procedures for implementing the PX-EM algorithm of Liu, Rubin and Wu to compute REML estimates of variance-covariance components in Henderson's linear mixed models. The class of models considered encompasses models with several correlated random factors having the same vector length, e.g., as in random regression models for longitudinal data analysis and in sire-maternal grandsire models for genetic evaluation. Numerical examples are presented to illustrate the procedures. Much better results in terms of convergence characteristics (number of iterations and time required for convergence) are obtained for PX-EM relative to the basic EM algorithm in the random regression examples.


INTRODUCTION
Since the landmark paper of Dempster et al. [4], the EM algorithm has been among the most popular statistical techniques for calculating parameter estimates via maximum likelihood, especially in models accounting for missing data, or in models that can be formulated as such. As explained by Meng and van Dyk [23], the popularity of EM stems mainly from its computational simplicity, its numerical stability, and its broad range of applications.
Biometricians, especially those working in animal breeding, have been among the largest users of EM. Modern genetic evaluation typically relies on best linear unbiased prediction (BLUP) of breeding values [13,14] and on restricted (or residual) maximum likelihood (REML) estimation of variance components of Gaussian linear mixed models [12,27]. BLUP estimates are obtained by solving Henderson's mixed model equations, the elements of which are natural components of the E-step of the EM algorithm for REML estimation, which explains the popularity of the BLUP-EM-REML triad.
Unfortunately, the EM algorithm can be very slow to converge in this setting and various alternative procedures have been proposed: see e.g., Misztal's [26] review of the properties of various algorithms for variance component estimation and Johnson and Thompson [15] and Meyer [25] for a discussion of a second order algorithm based on the average of the observed and expected information.
Despite its slow convergence, EM has remained popular, primarily because of its simplicity and stability relative to alternatives [32]. Thus, much work has focused on speeding up EM while maintaining these advantages. Rescaling the random effects, which are treated as missing data by EM, has been a very successful strategy employed by several authors: e.g., Anderson and Aitkin [2] for binary response analysis, Foulley and Quaas [7] for heteroskedastic mixed models, and Meng and van Dyk [24] for mixed effects models using a Cholesky decomposition (see also procedures developed by Lindstrom and Bates [17] for repeated measures analysis and Wolfinger and Tobias [35] for a mixed model approach to the analysis of robust design experiments). To further improve computational efficiency, the principle underlying this rescaling was generalized by Liu et al. [21], who introduced the parameter expanded EM, or PX-EM, algorithm, which in the case of mixed effects models fits the rescaling factor within the iteration.
The purpose of this paper is twofold: (i) to give an overview of this new algorithm to the biometric community, and (ii) to illustrate the procedure with several small numerical examples demonstrating the computational gain of PX-EM.
The paper is organized into six sections as follows. In Section 2, the general structure of the models is described, and in Section 3 a typical EM implementation for these models (called EM0 for clarity) is reviewed. Section 4 briefly introduces the general PX-EM algorithm and gives the appropriate formulae for mixed linear models. Two examples (sire-maternal grandsire models and random coefficient models) appear in Section 5, and Section 6 contains a brief discussion.

MODEL STRUCTURE
We consider the class of linear mixed models including K dependent random factors u_k (k = 1, 2, . . . , K), written in Henderson's notation as

y = Xβ + Σ_{k=1}^{K} Z_k u_k + e,    (1)

or, in more compact form, y = Xβ + Zu + e, where y is the (N × 1) data vector, β is a (p × 1) vector of fixed effects with matrix X of explanatory discrete or continuous variables, u = (u_1', u_2', . . . , u_K')' is a (q_+ × 1) vector of random effects (q_+ = Σ_{k=1}^{K} q_k) formed by concatenating the K (q_k × 1) vectors u_k, and e is an (N × 1) vector of residuals. The usual Gaussian assumption is made for the distribution of (y', u', e')', i.e., y ∼ N(Xβ, ZGZ' + R), where

cov(u_k, u_l') = A_kl g_kl,    (2a)
R = H σ²_e.    (2b)

In (2a), A_kl is a (q_k × q_l) matrix of known coefficients and g_kl is a real parameter known as the (k, l) covariance component, such that G_0 = {g_kl}, the u-covariance matrix, is positive definite; a similar definition applies in (2b) to H and to σ²_e for the residual variance component. We assume that all u-components have the same dimension, i.e., the same number of experimental units, q_k = q for all k, and similarly A_kl = A for all k, l, so that G = var(u) can be written as:

G = G_0 ⊗ A,    (3)

where ⊗ symbolizes the direct or Kronecker product as defined e.g., in Searle [31]. Two important models in genetics and biometrics belong to this class.
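As a concrete illustration (a minimal sketch with made-up numbers), the structure G = G_0 ⊗ A can be assembled directly; block (k, l) of G is g_kl A:

```python
import numpy as np

# Hypothetical example: K = 2 correlated random factors, q = 3 units.
G0 = np.array([[1.0, 0.3],
               [0.3, 0.5]])          # u-covariance matrix {g_kl}, positive definite
A = np.array([[1.0, 0.5, 0.25],
              [0.5, 1.0, 0.5],
              [0.25, 0.5, 1.0]])     # shared (q x q) matrix of known coefficients

G = np.kron(G0, A)                   # G = G0 ⊗ A, a (Kq x Kq) matrix

# Block (k, l) of G is g_kl * A, e.g., cov(u_1, u_2') = g_12 * A:
assert G.shape == (6, 6)
assert np.allclose(G[:3, 3:], 0.3 * A)
```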
First, the "sire-maternal grandsire" model (or SMGS model) as described by Bertrand and Benyshek [3], where u s and u t refer to (q × 1) vectors of sire and maternal grandsire contributions of q males respectively and A is the matrix of additive genetic relationships (or twice Malecot's kinship coefficients) between those males.
and g 0 = vech (G 0 ) = (σ 2 s , σ st , σ 2 t ) is a function of the variance covariance components of additive direct and maternal effects of genes.
The second model, a random coefficient model [22], can be applied for instance to longitudinal data analysis (Diggle et al. [5]; Laird and Ware [16]; Schaeffer and Dekkers [30]), and is usually written as

y_i = X_i β + Σ_{k=1}^{K} Z_ik u_ik + e_i,    (5)

where y_i = (y_i1, y_i2, . . . , y_ini)' is the (n_i × 1) vector of measurements made on the ith individual (i = 1, 2, . . . , q), X_i β is the contribution of fixed effects, and u_ik is the kth random regression coefficient (e.g., intercept, linear slope) on covariate information Z_ik (e.g., time or age) pertaining to the ith individual.
Under the general form (1), individuals are nested within random effects, whereas in random coefficient models the opposite holds: coefficients (factors) are nested within individuals. That is, under (5) the vector u_i = (u_i1, u_i2, . . . , u_iK)' gathers the K coefficients of individual i, with var(u_i) = G_0. Generally, these models assume independence among residuals, var(e_i) = I_ni σ²_e, and independence among individuals, A = I_q, but this is neither mandatory nor always appropriate, e.g., with data recorded on relatives.
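The equivalence of the two orderings can be checked numerically: stacking u by factor (form (1)) with var(u) = G_0 ⊗ A, or by individual (form (5)) with var(u) = A ⊗ G_0, yields the same covariance of the random part. A small sketch with arbitrary simulated design values:

```python
import numpy as np

rng = np.random.default_rng(2)
K, q, ni = 2, 3, 2                 # 2 coefficients, 3 individuals, 2 records each
N = q * ni
G0 = np.array([[1.0, 0.3],
               [0.3, 0.5]])
B = rng.standard_normal((q, q))
A = B @ B.T + q * np.eye(q)        # a general positive-definite A (relatives)
t = rng.standard_normal((q, ni))   # covariate (e.g., time) per individual

# Form (1): u ordered factor by factor, Z = [Z_1 Z_2], var(u) = G0 ⊗ A
Z1 = np.zeros((N, q)); Z2 = np.zeros((N, q))
for i in range(q):
    rows = slice(i * ni, (i + 1) * ni)
    Z1[rows, i] = 1.0              # random intercept
    Z2[rows, i] = t[i]             # random slope on t
Z = np.hstack([Z1, Z2])
Vfac = Z @ np.kron(G0, A) @ Z.T

# Form (5): u ordered individual by individual, var(u) = A ⊗ G0
Zstar = np.zeros((N, q * K))
for i in range(q):
    rows = slice(i * ni, (i + 1) * ni)
    Zstar[rows, i * K] = 1.0
    Zstar[rows, i * K + 1] = t[i]
Vind = Zstar @ np.kron(A, G0) @ Zstar.T

assert np.allclose(Vfac, Vind)     # same marginal covariance of the random part
```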
Readers may be more familiar with one of the two forms (1) or (5), but both are obviously equivalent and we may use whichever is more convenient.

Typical procedure
To define an EM algorithm to compute REML estimates of the model parameter γ = (g_0', σ²_e)', we hypothesize a complete data set, x, which augments the observed data, y, i.e., x = (y', β', u')'. As in [4] and [7], we treat β as a vector of random effects with variance tending to infinity.
Each iteration of the EM algorithm consists of two steps, the expectation or E-step and the maximization or M-step. In the Gaussian mixed model, this separates the computation into two simple pieces. The E-step consists of taking the expectation of the complete data log likelihood L(γ; x) = ln p(x|γ) with respect to the conditional distribution of the "missing data" vector z = (β', u')' given the observed data y, with γ set at its current value γ^[t], i.e.,

Q(γ|γ^[t]) = E[L(γ; x) | y, γ^[t]],    (6)

while the M-step updates γ by maximizing (6) with respect to γ, i.e.,

γ^[t+1] = arg max_γ Q(γ|γ^[t]).    (7)

We begin by deriving an explicit expression for (6) and then derive the two steps of the EM algorithm. By definition,

p(x|γ) = p(y|β, u, γ) p(β, u|γ),    (8)

where y|β, u, γ ∼ N(Xβ + Zu, Hσ²_e) and, with β treated as above, p(β, u|γ) reduces to p(u|γ) with u ∼ N(0, G). Formula (8) allows the formal dissociation of the computations pertaining to the residual σ²_e from those pertaining to the u-components of variance g_0. Combining (6) with (8) leads to

Q(γ|γ^[t]) = Q_e(σ²_e|γ^[t]) + Q_u(g_0|γ^[t]).    (9)

The expressions on the right-hand side can be written explicitly as

−2 Q_e(σ²_e|γ^[t]) = N ln σ²_e + E[e'H⁻¹e | y, γ^[t]] / σ²_e + const.,    (10)

and, under assumption (3),

−2 Q_u(g_0|γ^[t]) = q ln|G_0| + tr(G_0⁻¹ Ω^[t]) + const.,    (11)

where Ω^[t] = {ω_kl^[t]} with ω_kl^[t] = E[u_k' A⁻¹ u_l | y, γ^[t]]. For the M-step, we maximize (10) as a function of σ²_e, and (11) as a function of g_0. For H known, this results in

σ²_e^[t+1] = E[e'H⁻¹e | y, γ^[t]] / N,    (12)

and

G_0^[t+1] = Ω^[t] / q;    (13)

see Lemma 3.2.2 of Anderson [1], page 62. The expectations in (12) and (13), i.e., the E-step, can be computed using elements of Henderson's [14] mixed model equations (ignoring superscripts):

[ X'H⁻¹X   X'H⁻¹Z            ] [ β̂ ]   [ X'H⁻¹y ]
[ Z'H⁻¹X   Z'H⁻¹Z + σ²_e G⁻¹ ] [ û ] = [ Z'H⁻¹y ],    (14)

where β̂ is the GLS estimate of β and û is the BLUP of u. In particular, we compute

E[e'H⁻¹e | y, γ^[t]] = ê'H⁻¹ê + σ²_e [p + q_+ − σ²_e tr(G⁻¹ C^uu)],    (15)

with ê = y − Xβ̂ − Zû, and

ω_kl = û_k' A⁻¹ û_l + σ²_e tr(A⁻¹ C^{u_k u_l}),    (16)

where p = rank(X), q_+ = Kq = dim(u), and C^uu is the block of the inverse of the coefficient matrix of (14) corresponding to u (C^{u_k u_l} being its (k, l) sub-block). Further numerical simplifications can be carried out to avoid inverting the coefficient matrix at each iteration using diagonalization or tridiagonalization procedures (see e.g., Quaas [29]).
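As a concrete sketch of one EM0 iteration (a self-contained toy implementation on simulated data with H = I and A = I; dimensions and values are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, K, q = 30, 2, 2, 5                     # toy dimensions (hypothetical)
X = np.column_stack([np.ones(N), rng.standard_normal(N)])
Z = rng.standard_normal((N, K * q))          # Z = [Z_1 Z_2], each (N x q)
A = np.eye(q)                                # relationship matrix (identity here)
H = np.eye(N)                                # residual structure, eq. (2b)
y = rng.standard_normal(N)

def em0_step(G0, sig2e):
    """One EM0 iteration: E-step via MME (14)-(16), M-step via (12)-(13)."""
    Hin = np.linalg.inv(H)
    Gin = np.linalg.inv(np.kron(G0, A))      # G^-1 with G = G0 ⊗ A, eq. (3)
    T = np.hstack([X, Z])
    C = T.T @ Hin @ T                        # coefficient matrix of (14) ...
    C[p:, p:] += sig2e * Gin                 # ... with sig2e * G^-1 added
    Cinv = np.linalg.inv(C)
    sol = Cinv @ (T.T @ Hin @ y)             # (beta_hat', u_hat')'
    Cuu = Cinv[p:, p:]
    # E[e'H^-1 e | y] as in eq. (15); then eq. (12)
    ehat = y - T @ sol
    Ee = ehat @ Hin @ ehat + sig2e * (p + K * q - sig2e * np.trace(Gin @ Cuu))
    # omega_kl = E[u_k' A^-1 u_l | y] as in eq. (16); then eq. (13)
    Ain = np.linalg.inv(A)
    u = sol[p:]
    Om = np.empty((K, K))
    for k in range(K):
        for l in range(K):
            Ckl = Cuu[k*q:(k+1)*q, l*q:(l+1)*q]
            Om[k, l] = u[k*q:(k+1)*q] @ Ain @ u[l*q:(l+1)*q] \
                       + sig2e * np.trace(Ain @ Ckl)
    return Om / q, Ee / N

G0, sig2e = np.eye(2), 1.0
for _ in range(20):                          # a few fixed-point iterations
    G0, sig2e = em0_step(G0, sig2e)

assert sig2e > 0 and np.allclose(G0, G0.T)
```

The update preserves the symmetry and positivity of G_0 by construction, which is one of the stability properties of EM discussed in the introduction.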

An ECME version
In order to improve computational performance, we can sometimes update some parameters without defining a complete data set. In particular, the ECME algorithm [20] suggests separating the parameter into several sub-parameters (i.e., model reduction) and updating each sub-parameter in turn conditional on the others. For each of these sub-parameters, we can maximize either the observed data log likelihood directly, i.e., L(γ; y), or the expected augmented data log likelihood, Q(γ|γ^[t]).
To implement an ECME algorithm in the mixed effects model, we rewrite the parameter as ζ = (d_0', σ²_e)', where var(y) = Wσ²_e with W = ZDZ' + H, D = D_0 ⊗ A, and d_0 = vech(D_0). We first update σ²_e by directly maximizing L(ζ; y) (without recourse to missing data) under the constraint that d_0 is fixed at d_0^[t]:

σ²_e^[t+1] = y' P^[t] y / (N − p),    (17)

where P^[t] = W^[t]⁻¹ − W^[t]⁻¹ X (X' W^[t]⁻¹ X)⁻ X' W^[t]⁻¹. An ML analogue of this formula (dividing by N instead of N − p) was first obtained by Hartley and Rao [12] in their general ML estimation approach to the parameters of mixed linear models; see also Diggle et al. [5], page 67. Second, we update d_0 by maximizing Q(d_0, σ²_e^[t+1] | ζ^[t]) using the missing data approach:

D_0^[t+1] = G_0^[t+1] / σ²_e^[t+1],    (18)

where G_0^[t+1] is defined in (13). Henderson [13] showed that the expression for σ²_e in (17) can be obtained from his mixed model equation solutions as

σ²_e^[t+1] = y' H⁻¹ (y − Xβ̂^[t] − Zû^[t]) / (N − p),    (19)

where β̂^[t] and û^[t] are defined by (14) with d_0 fixed at d_0^[t]. Incidentally, this shows that the algorithm developed by Henderson [14] to compute REML estimates, as early as 1973, introduced model reduction in a manner similar to recent EM-type algorithms.

Generalities
In the PX-EM algorithm proposed by Liu et al. [21], the parameter space of the complete data model is expanded to a larger set of parameters, Γ = (γ*, α), with α a working parameter, such that (γ*, α) satisfies the following two conditions:
- it can be reduced to the original parameter γ, maintaining the observed data model, via a many-to-one reduction function γ = R(Γ);
- when α is set to its reference (or "null") value α_0, (γ*, α_0) induces the same complete data model as the original one with γ = γ*.
We introduce the working parameter because the original EM (EM0) imputes missing data under a wrong model, i.e., under γ^[t]_EM different from the MLE. The PX algorithm takes advantage of the difference between the imputed value α^[t+1] of α and its reference value α_0 to make what Liu et al. [21] called a covariance adjustment in γ, i.e.,

γ^[t+1] = γ^[t+1]_EM + b_γ|α (α^[t+1] − α_0),    (20)

where γ^[t+1]_EM is the EM iterate and b_γ|α is a correction factor. Liu et al. [21] show that this adjustment improves the rate of convergence of EM, generally in terms of the number of iterations required for convergence.
Operationally, the PX-EM algorithm, like EM, consists of two steps. The PX-E step computes the conditional expectation of the log likelihood of x given the observed data y, with α set at its reference value:

Q(Γ|Γ^[t]) = E[L(Γ; x) | y, γ^[t], α_0].    (21)

The PX-M step then maximizes (21) with respect to the expanded parameter, Γ^[t+1] = arg max_Γ Q(Γ|Γ^[t]), and γ is updated via the reduction γ^[t+1] = R(Γ^[t+1]). In the next section, we illustrate PX-EM in the Gaussian linear model. In particular, we describe a simple method of introducing a working parameter into the complete data model.

Implementation of PX-EM in the mixed model
We begin by defining the working parameter as a (K × K) invertible real matrix α = {α_kl}, which we incorporate into the model by rescaling the random effects:

u = (α ⊗ I_q) ũ,    (22)

so that (1) becomes

y = Xβ + Z(α ⊗ I_q)ũ + e,    (23a)

or alternatively, under (5),

y_i = X_i β + Σ_k Σ_l α_kl Z_ik ũ_il + e_i.    (23b)

By the definition of ũ_i, we have ũ_i ∼ N(0, G_0*) with G_0 = α G_0* α'. Rescaling the random effects by α thus introduces the working parameters into two parts of the model: into (23) and into the distribution of ũ_i, which can be viewed as an extended parametric form of the distribution of u_i (see (2a)); when α = α_0 = I_K, u_i and ũ_i have the same distribution.
To understand why PX-EM works in this case, recall that the REML estimate is the value of γ which maximizes

L(γ; y) = ∫∫ p(y|β, u, γ) p(u|γ) dβ du.    (24a)

What is important here is that for any value of α,

L(γ; y) = L(γ*, α; y) with γ = R(γ*, α),    (24b)

so we can fix α in (24b) to any value at each iteration. Liu et al. [21] showed that fitting α in the iteration can improve the computational performance of the algorithm. Computationally, the PX-EM algorithm replaces the integrand of (24a) with that of (24b). In particular, in the E-step we compute Q(Γ|Γ^[t]) with α set to α_0: any value of α will work, and using α_0 reduces the computations to those of the EM0 algorithm. In the M-step, we update Γ by maximizing Q(Γ|Γ^[t]), i.e., we compute g_0*^[t+1], α^[t+1] and σ²_e*^[t+1] (in the remainder of the paper we fix σ²_e* at σ²_e). We now move to the details of these computations. In order to derive Q(Γ|Γ^[t]), we note that p(y, β, ũ|g_0*, α, σ²_e) ∝ p(y|β, ũ, α, σ²_e) p(ũ|g_0*), so that L(Γ; x) = L(α, σ²_e; e) + L(g_0*; ũ) + const., and thus

Q(Γ|Γ^[t]) = Q_e(α, σ²_e|Γ^[t]) + Q_u(g_0*|Γ^[t]),    (25)

where Γ = (g_0*', (vec α)', σ²_e)'. Maximizing Q_u(g_0*|Γ^[t]) with respect to g_0* is identical to the corresponding calculation for g_0 in EM0, i.e., we set G_0*^[t+1] = Ω^[t]/q, where Ω^[t] is evaluated as in EM0 with (16).
Next, we wish to maximize Q_e(α, σ²_e*|Γ^[t]), which can be written formally as in (10) but with e defined in (23a) or (23b). Setting the K² partial derivatives of this function with respect to the elements of α equal to zero does not involve σ²_e, and is equivalent to solving the linear system

F vec(α') = h,    (26)

where F = {f_(k,l),(m,n)} and h = {h_kl} for k, l = 1, 2, . . . , K. Explicit expressions for the coefficients when K = 2 are given in Appendix A.
We can compute α^[t+1] using Henderson's [14] mixed model equations (14). Suppressing the superscript [t], the coefficients and right-hand sides are

f_(k,l),(m,n) = û_l'(Z_k'H⁻¹Z_m)û_n + σ²_e tr[(Z_k'H⁻¹Z_m) C^{u_n u_l}],
h_kl = û_l'Z_k'H⁻¹y − û_l'(Z_k'H⁻¹X)β̂ − σ²_e tr[(Z_k'H⁻¹X) C^{βu_l}],    (27)

where Z_k'H⁻¹Z_m is the block of the coefficient matrix corresponding to u_k and u_m; Z_k'H⁻¹X is the block corresponding to u_k and β; C^{u_k u_m} and C^{u_k β} = (C^{βu_k})' are the corresponding blocks of the inverse coefficient matrix; Z_k'H⁻¹y is the sub-vector of the right-hand side of (14) corresponding to u_k; and β̂ and û_k are the solutions for β and u_k in (14). Once we obtain α^[t+1], we update G_0 as indicated previously, i.e., G_0^[t+1] = α^[t+1] G_0*^[t+1] α^[t+1]'. Finally, to update σ²_e, we maximize Q_e(α, σ²_e|Γ^[t]) via

σ²_e^[t+1] = E[ẽ'H⁻¹ẽ | y, γ^[t]] / N,    (28)

where the residual vector ẽ is adjusted for the solution α^[t+1] in (27), i.e., ẽ = y − Xβ − Σ_k Σ_l α_kl^[t+1] Z_k ũ_l. A short-cut procedure implements a conditional maximization with α fixed at α^[t] = I_K and results in formula (15) as in the EM0 procedure. One can also derive a parameter-expanded ECME algorithm by applying Henderson's formula (19) with d_0 fixed at d_0^[t]; see van Dyk [32].
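The α update can be sketched numerically as follows (a toy, self-contained illustration with simulated data, H = A = I and K = 2; all names and values are hypothetical, not from the paper). After a standard E-step, the (K² × K²) system for vec(α') is assembled from the mixed model equation blocks, and G_0 is updated through the reduction G_0 = α G_0* α':

```python
import numpy as np

rng = np.random.default_rng(3)
N, p, K, q = 40, 2, 2, 6
X = np.column_stack([np.ones(N), rng.standard_normal(N)])
Zb = [rng.standard_normal((N, q)) for _ in range(K)]   # Z_1, Z_2
Z = np.hstack(Zb)
A = np.eye(q); Hin = np.eye(N)                # A = I and H = I in this sketch
y = rng.standard_normal(N)
G0, sig2e = np.eye(K), 1.0                    # current parameter values

# E-step: solve Henderson's MME (14)
T = np.hstack([X, Z])
C = T.T @ Hin @ T
C[p:, p:] += sig2e * np.linalg.inv(np.kron(G0, A))
Cinv = np.linalg.inv(C)
sol = Cinv @ (T.T @ Hin @ y)
beta, u = sol[:p], sol[p:]
uk = [u[k*q:(k+1)*q] for k in range(K)]
Cub = Cinv[p:, :p]                            # C^{u beta} blocks
Cuu = Cinv[p:, p:]                            # C^{u u} blocks

# PX-M step for alpha: build F vec(alpha') = h from the MME blocks
F = np.zeros((K*K, K*K)); h = np.zeros(K*K)
for k in range(K):
    for l in range(K):
        r = k*K + l                           # row index for equation (k, l)
        h[r] = uk[l] @ Zb[k].T @ Hin @ y \
             - uk[l] @ Zb[k].T @ Hin @ X @ beta \
             - sig2e * np.trace(Zb[k].T @ Hin @ X @ Cub[l*q:(l+1)*q, :].T)
        for m in range(K):
            for n in range(K):
                Cnl = Cuu[n*q:(n+1)*q, l*q:(l+1)*q]
                F[r, m*K + n] = uk[l] @ Zb[k].T @ Hin @ Zb[m] @ uk[n] \
                              + sig2e * np.trace(Zb[k].T @ Hin @ Zb[m] @ Cnl)
alpha = np.linalg.solve(F, h).reshape(K, K)   # vec(alpha') unstacked row-wise

# EM0-style update of G0* (A = I here), then reduction G0 = alpha G0* alpha'
Om = np.empty((K, K))
for k in range(K):
    for l in range(K):
        Om[k, l] = uk[k] @ uk[l] + sig2e * np.trace(Cuu[k*q:(k+1)*q, l*q:(l+1)*q])
G0_new = alpha @ (Om / q) @ alpha.T

assert np.isfinite(alpha).all() and np.allclose(G0_new, G0_new.T)
```

Note that the extra cost over EM0 is only the assembly and solution of the small (K² × K²) system, which is consistent with the paper's point that the per-iteration overhead of PX-C is modest.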

Description
In this section, we illustrate the procedures and their computational advantage relative to more standard methods using two sire-maternal grandsire (model 4) and two random coefficient (model 5) examples.

Sire-maternal grandsire models
The two examples in this section are based on calving scores of cattle [6]. From a biological viewpoint, parturition difficulty is a typical example of a trait involving direct effects of genes transmitted by parents to offspring, and maternal effects influencing the environmental conditions of the foetus during gestation and at parturition. Thus, statistically, we must consider the sire and the maternal grandsire contributions of a male not as simple multiples of each other (i.e., the first twice that of the second), but as two different but correlated variables. This model can be written as

y_ijklm = μ + α_i + β_j + s_k + t_l + e_ijklm,

where μ is an overall mean; α_i and β_j are fixed effects of the factors A = sex (i = 1, 2 for bull and heifer calves, respectively) and B = parity of dam (j = 1, 2 and 3 for heifer, second and third calves, respectively); s_k is the random contribution of male k as a sire and t_l that of male l as a maternal grandsire; and the e_ijklm are residual errors, assumed iid N(0, σ²_e). Letting s = {s_k} and t = {t_l}, it is assumed that var(s) = Aσ²_s, var(t) = Aσ²_t and cov(s, t') = Aσ_st, where A is the matrix of genetic relationships among the several males occurring as sires and maternal grandsires, and g_0 = (σ²_s, σ_st, σ²_t)' is the vector of variance-covariance components. We analyse two data sets; the first is the original data set presented in [6], which we refer to as "Calving Data 1" or CD1, and the second is a data set with the same design structure but with smaller subclass sizes and simulated records, which we refer to as "Calving Data 2" or CD2 (see Appendix B1).
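In coding terms, the distinctive feature of the SMGS model is that the sire and maternal-grandsire incidence matrices map into the same list of q males. A hypothetical sketch (made-up identifiers, not from the calving data):

```python
import numpy as np

# Hypothetical records: for each calf, the male acting as sire and the male
# acting as maternal grandsire, both drawn from the same list of q males.
males = ["M1", "M2", "M3", "M4"]              # q = 4
sire = ["M1", "M1", "M2", "M3", "M2"]         # one entry per record
mgs  = ["M2", "M3", "M3", "M4", "M1"]

q, n = len(males), len(sire)
idx = {m: j for j, m in enumerate(males)}
Zs = np.zeros((n, q)); Zt = np.zeros((n, q))
for r in range(n):
    Zs[r, idx[sire[r]]] = 1.0                 # contribution as sire, s_k
    Zt[r, idx[mgs[r]]] = 1.0                  # contribution as maternal grandsire, t_l

# Each record links to exactly one sire level and one maternal grandsire level
assert (Zs.sum(axis=1) == 1).all() and (Zt.sum(axis=1) == 1).all()
```

Because both matrices index the same q levels, the joint covariance of (s', t')' takes exactly the Kronecker form G_0 ⊗ A of equation (3) with K = 2.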

Random coefficient models
Growth data: We first analyse a data set due to Potthoff and Roy [28], which contains facial growth measurements recorded at four ages (8, 10, 12 and 14 years) in 11 girls and 16 boys. There are nine missing values, as defined in Little and Rubin [19]. The data appear in Verbeke and Molenberghs [33] (see Table 4.11, page 173, and Appendix B2) together with a comprehensive statistical analysis.
We consider model 6 of Verbeke and Molenberghs, which is a typical random coefficient model for longitudinal data analysis with an intercept and a linear slope, and can be written as

y_ijk = (μ + α_i + a_ik) + (β_i + b_ik) t_j + e_ijk,

where the systematic component of the outcome y_ijk involves an intercept μ + α_i varying according to sex i (i = 1, 2 for female and male children, respectively) and a linear increase with time (t_j = 8, 10, 12 and 14 years) whose rate β_i also varies with sex. The a_ik and b_ik are the random homologues of α_i and β_i but are defined at the individual level (indexed k within sex i). Letting u_ik = (a_ik, b_ik)', it is assumed that the u_ik's are iid N(0, G_0). Similarly, letting e_ik = (e_i1k, e_i2k, e_i3k, e_i4k)', the e_ik's are assumed iid N(0, σ²_e I_4) and distributed independently of the u_ik's.
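For one child, the within-individual design implied by this model is tiny; the following sketch (with made-up G_0 and σ²_e values, not estimates from the data) shows the marginal covariance of the four measurements induced by the random intercept and slope:

```python
import numpy as np

ages = np.array([8.0, 10.0, 12.0, 14.0])      # t_j
Zi = np.column_stack([np.ones(4), ages])      # random part: a_ik + b_ik * t_j
G0 = np.array([[4.5, -0.2],
               [-0.2, 0.05]])                 # illustrative values only
sig2e = 1.7                                   # illustrative value only

# Marginal covariance of y_ik: var(y_ik) = Z_i G0 Z_i' + sig2e * I_4
Vi = Zi @ G0 @ Zi.T + sig2e * np.eye(4)

assert Vi.shape == (4, 4) and np.allclose(Vi, Vi.T)
```

This is the per-individual version of the general covariance y ∼ N(Xβ, ZGZ' + R) stated in Section 2.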
Ultrafiltration data: In our second random coefficient model, we consider the ultrafiltration response of 20 membrane dialysers measured at 7 different transmembrane pressures, with evaluations made at 2 different blood flow rates. These data, due to Vonesh and Carter [34], are described and analysed in detail in the SAS manual for mixed models (Littell et al. [18]; data set 8.2 DIAL, Appendix 4, pages 575-577; see also Appendix B3).
Using notation similar to that of the model for the growth data, we write

y_ijk = μ + α_i + Σ_{r=1}^{4} β_r x^r_ijk + a_ik + b_1,ik x_ijk + b_2,ik x²_ijk + e_ijk,

where y_ijk is the ultrafiltration rate in ml·h⁻¹; μ + α_i is the intercept for blood flow rate i (i = 1, 2 for 200 and 300 dl·min⁻¹); Σ_r β_r x^r_ijk is the regression of the response on the transmembrane pressure x_ijk (dm Hg) as a homogeneous quartic polynomial; and a_ik and the b_r,ik represent the random coefficients up to the second degree of the regression, defined at the dialyser level (k = 1, 2, . . . , 20). Again, letting u_ik = (a_ik, b_1,ik, b_2,ik)' and e_ik = {e_ijk}, it is assumed that the u_ik's are iid N(0, G_0), the e_ik's are iid N(0, σ²_e I_7), and the two are distributed independently of each other.

Calculations and results
REML estimates of variance-covariance parameters were computed for each of these four data sets using the EM0 and PX-EM procedures, the latter with both the complete (PX-C) and the triangular (PX-T) working parameter matrices (see van Dyk [32] for a discussion of the use of a lower triangular matrix as the working parameter). For each of the three EM-type algorithms, the standard procedure described in this paper was applied, as well as the versions based on Henderson's formula for the residual variance. The iterations were stopped when a norm on successive parameter estimates fell below a fixed threshold.
Results are summarized in Tables I and II for sire-maternal grandsire (SMGS) and random coefficient (RC) models, respectively. In all cases, the number of iterations required for convergence was smaller for the PX-C procedure than for the EM0 procedure. The decrease in the number of iterations was 15 to 20% for SMGS models and as high as 70% for RC models. (Table I notes: Calving Data I is the calving difficulty score data of [6] shown in Appendix B1; Calving Data II has the same design structure, 4 scores, small progeny group sizes, and simulated records.) The absolute numbers of iterations were 64 vs. 224 for PX-C vs. EM0 for the growth data and 76 vs. 259 for the ultrafiltration data. Since computing time per iteration is larger for PX-C than for EM0, the total computing time remains smaller with EM0 for the CD1 and CD2 examples. However, in the RC models, computing time is about halved in both examples. This impressive advantage of PX-C over EM0 was also observed at different norm values (10⁻⁶, 10⁻⁹) and with a different stopping rule (decrease in −2L equal to or smaller than 10⁻⁸); see also the plot of EM sequences for the intercept variance in the growth data example (Fig. 1) and for all the u-components of variance in the ultrafiltration data example (Fig. 2). (Table II notes: Growth data from Potthoff and Roy [28] with the 9 missing values defined by Little and Rubin [19] (see also Verbeke and Molenberghs [33] for a detailed analysis); first-degree polynomial for the random part (intercept + slope). Ultrafiltration rates of 20 membrane dialysers measured at 7 pressures and two blood flow rates [18,34]; second-degree polynomial for the random part.)
Results obtained with PX-T were consistently poorer than with PX-C, and in the case of the ultrafiltration data even poorer than EM0. Computation for this model is especially demanding because it involves fitting a second-degree polynomial with a very high absolute correlation (0.94) between the first- and second-degree coefficients.
Finally, no practical differences were observed between the standard and the ECME procedures; Henderson's formula can thus be used in practice to compute the residual variance.

DISCUSSION-CONCLUSION
This paper shows that the PX-EM procedure can be easily implemented within the framework of Henderson's mixed model equations. The changes to the EM0 procedure are simple, requiring only the solution of a (K² × K²) linear system at each iteration, with K generally 2, 3 or 4. In particular, the PX-EM procedure can handle random effects that are correlated among experimental units (e.g., individuals), as often occurs in genetics. An ML extension of the PX-EM procedure is straightforward with an appropriate change in the conditional expectations of u_k u_l' and βu_k' along the lines proposed by Foulley et al. [8] and van Dyk [32].
Our examples confirm the potential advantage of the PX-EM procedure for mixed linear models already advocated by Liu et al. [21] and van Dyk [32]. The improvement was especially decisive in the case of random coefficient models. This is important in practice, since these models are becoming more and more popular among biometricians, e.g., epidemiologists and geneticists involved in the analysis of longitudinal data and space-time phenomena.
In this manuscript, the emphasis was on EM procedures and ways to improve them. This does not preclude using alternative algorithms for computing maximum likelihood estimates of variance components, e.g., the Average Information (AI) REML algorithm (Gilmour, Thompson and Cullis [10]).
Nonetheless, EM procedures have two important features: (i) they allow a separation of the computations required for the R matrix (errors, but also time or space processes) and those for the G matrix, for which the PX version turns out to perform especially well; here we assume the elements of H in R fixed, but the EM procedure can easily be extended to parameters of an unknown H [9]; (ii) (PX-)EM generally selects the correct sub-model when estimates are on or near the boundary of the parameter space, a property that second-order algorithms do not always exhibit (e.g., the analysis of the growth data in Foulley et al. [9] and the simulations in Meng and van Dyk [24] and van Dyk [32]).
This is an example of the well-known stable, monotone convergence of EM-type algorithms, which second-order algorithms do not generally exhibit [32].