Estimation of non-linear growth models by linearization: a simulation study using a Gompertz function

A method based on Taylor series expansion for estimation of location parameters and variance components of non-linear mixed effects models was considered. An attractive property of the method is the opportunity for an easily implemented algorithm. Estimation of non-linear mixed effects models can be done by common methods for linear mixed effects models, and thus existing programs can be used after small modifications. The applicability of this algorithm in animal breeding was studied with simulation using a Gompertz function growth model in pigs. Two growth data sets were analyzed: a full set containing observations from the entire growing period, and a truncated time trajectory set containing animals slaughtered prematurely, which is common in pig breeding. The results from the 50 simulation replicates with full data set indicate that the linearization approach was capable of estimating the original parameters satisfactorily. However, estimation of the parameters related to adult weight becomes unstable in the case of a truncated data set.


INTRODUCTION
Non-linear functions are particularly suited to model growth data, because predictions outside the data range can be made more reliably than by linear models, and the entire growth process can be described by few parameters. For example, growth data models commonly apply the Gompertz function, where the estimated parameters can have biological meaning. Non-linear models are, however, more complicated to solve than linear models, and several algorithms have been proposed to estimate the parameters and variance components of non-linear mixed effects models [4].
In animal production research, the Bayesian framework has received much attention in growth curve analysis [2,11]. This popularity is due to the use of Markov chain Monte Carlo methods that allow the solution of numerically complicated posterior density integration and calculation of confidence interval estimates. The cost of this procedure is, however, intensive calculations and the need to assure a sampling equilibrium [3]. Another possibility is to approximate the likelihood function using linearization [4,10,17,18] or numerical integration [10]. Also, both of these alternatives are computationally difficult, yet, linearization may enable the inference of linear mixed effects models. All linearization methods in the literature are quite similar. Early methods use first-order Taylor series expansion of non-linear functions around expectation of the random effects, and are solved by either maximum likelihood (ML) or generalized least squares (GLS) estimation [4]. Lindstrom and Bates [9] suggested a more accurate method of making the expansion around current estimates of the random effects. Subsequent research has focused more on the second-order Taylor series expansion of integrals invoked by the Laplacian approximation [10,17,18].
Because of generality and familiar formulation, the most interesting choice of approximation is based on the second-order Taylor series expansion with respect to random effects that were presented by Wolfinger and Lin [18]. They gave two alternative approaches to select points of expansion: a zeroexpansion method using expected values, and an EBLUP-expansion method using the empirical best linear unbiased predictors of the random effects. Both approximations lead to algorithms that iteratively fit mixed linear models to the suitably transformed data using either ML or restricted maximum likelihood (REML). Therefore, they allow the use of commonly applied methods for linear mixed effects models, and the use of existing programs after small modifications. A similar algorithm was proposed by Breslow and Clayton [3] in the context of generalized linear mixed models. Because conditions to functionality of the approximation methods are difficult to identify, Wolfinger and Lin [18] recommended simulation studies for assessing the performance of the methods in diverse kinds of non-linear models and data sets.
The aim of this work was to describe and examine the performance of the EBLUP-expansion method for the Gompertz function applied to the analysis of growth in the pig through simulation. The EBLUP-expansion is recommended especially for cases where the variance components are large, which may be the case for an adult weight parameter for pigs. Also, Lindstrom and Bates [9] suggested that the expansion around the expected zero value may lead to poor estimates when substantial inter-individual variation exists. We chose to examine the method through the analysis of two data sets. The first analysis tested general performance of the EBLUP-expansion technique, and the second analysis tested performance of the method for incomplete data. Incomplete data are common in pig production, where the adult weight is unavailable due to an earlier slaughter age.

Simulations
The Gompertz function has been shown to fit pig growth data, such as live weight and protein retention, well [14][15][16]. We assumed that weights of an individual i followed the Gompertz model: where n i is the number of observations for individual i, y i j is the observed weight at age t i j (in days), α, β and κ are the parameters of the Gompertz function, and e i j is the random residual. The parameters have biological meaning: α is the adult weight, κ is the rate of exponential decay of the initial growth rate, and β is the logarithm of the ratio of birth weight to adult weight.
Each of the parameters α, β and κ can be described by a linear mixed effects model. In this study, we will consider a sire model, although notation could be for an animal model. The full model for observation j of animal i is -vector of fixed effects, (s α , s β , s κ ) T = s is a l × 1-vector of random additive genetic sire effects and (p α , p β , p κ ) T = p is a q × 1-vector of random animal effects other than sire. Vectors x, z s and z p are from the design matrices of fixed, random sire and random animal effects X, Z s and Z p , respectively. It is assumed that Here, G = G 0 ⊗ A, where A is a matrix of additive relationships between sires and G 0 is a 3 × 3 genetic covariance matrix for the Gompertz parameters. Similarly, P = P 0 ⊗ I q , where P 0 is a 3 × 3 covariance matrix, i.e. the random animal effects p were identically and independently distributed for the animals. Furthermore, the residuals were assumed to be independently distributed and homoscedastic, e ∼ N(0, I n σ 2 e ). The Gompertz function coefficients were generated to simulate pig growth. The model had one fixed effect with two levels, and the random effects were the genetic sire effect and the animal effect other than the sire. The first fixed effect level had values 210, 5 and 0.017 for the parameters α, β and κ, respectively. The other fixed effect level had values 220, 4.7 and 0.016 for α, β and κ, respectively. The random effects were assumed to be normally distributed with mean zero and block diagonal covariance matrices. Variance and covariance components in the matrices for the genetic sire effect and for the animal effect are shown in Tables I and II. The residual variance σ 2 e was one. These parameters approximated the variances calculated by the NLMIXED procedure of the SAS  program that fitted the Gompertz model to growth performance data of Finnish pigs [12].
Simulation of the random sire effect required taking into account the pedigree. The pedigree had three generations of animals with 10 unrelated founder grandsires. Each of the 10 grandsires was mated with 20 unrelated dams that produced one son each, i.e., 20 half-sibs. The half-sibs were mated with unrelated dams to produce 24 progeny per sire. Only the last generation of animals had records. Thus, the data included 4800 tested animals. Two data sets were made: a complete set, and a truncated time trajectory set. The complete data contained 30 equally-distanced observations per animal between 50 and 253 days. The truncated time trajectory data contained slaughter weights up to 115 kg, which is similar to the common slaughter weight in pigs, and occurs at about 120 days of age. Consequently, the number of observations was reduced from 30 to about 11 per animal, i.e., almost two thirds of the data were discarded.

Method to estimate the values of the growth parameters
The non-linear mixed model considered was where y is an n × 1-vector of observations, f is the Gompertz function, and e is an n × 1-vector of random residuals. Vectors b, s and p, with matrices X, Z s Table I. Relative bias, relative standard deviation (Rel. SD) and relative mean squared error (Rel. MSE) (as percent from the true value) for (co)variance components of genetic sire effects from the 50 replicates of full and truncated time trajectory data. Subscripts α, β and κ denote the three parameters in the Gompertz function. and Z p , were defined as before. For the random effects, denote u T = (s T p T ), Now the distribution assumptions were Although R is diagonal here, any form is allowed, so the general form R will be used hereinafter. Unknown elements of covariance matrices G 0 , P 0 and R are denoted by parameter vector θ.
The maximized likelihood function was Only in some cases is the closed form of (2) found, so the integral is often solved numerically. However, numerical methods for the non-linear functions are usually slow to converge and numerically unstable. Instead, the integral may be approximated by quadratic Taylor-series expansion of the exponent. The second-order expansion was made about the EBLUP before integration of the likelihood function (see Appendix). This gave approximation to the logarithm of the likelihood function (2): where Z * = ∂ f ∂u T | u=ũ andũ is the empirical BLUP-estimate of random effects. For the Gompertz function and two random effects in the model, Z * had elements where c is z s or z p depending on the random effect differentiated (see (1)).
Pinheiro and Bates [10] used the approximation (3) in estimation of parameters by the Laplacian approximation. However, no straightforward generalization to the REML-estimation was presented. Wolfinger and Lin [18] developed formula (3) further. Denote V = Z * DZ * T + R | u=ũ . Then, [5]. This led to a similar estimation function for variance component estimation presented by Lindstrom and Bates [9], although through different derivation.

Estimation of the fixed and random effects
Assume that the variance component vector θ is known. Maximum likelihood estimation for the parameters b and u leads to solving equations: where X * = ∂ f ∂b T b=b andb is the estimate of fixed effects b. Elements of X * are similar to Z * , except that coefficient c in (4) is replaced by x due to the differentiated fixed effect. However, in order to arrive to these simple equations, dependency of V on b through Z * has to be ignored. On the basis of arguments made by Bates and Watts [1], Wolfinger and Lin [18] justified this by appealing to intrinsic non-linearity instead of non-linearity of the parameters.
Denote Y = y− f (X,b, Z,ũ)+X * b +Z * ũ . Equations (5) can now be written as This is similar to the mixed model equations (MME) for the linear models. Thus, already established methods for solving linear models can be used to analyse the pseudo-data Y created from the original data y withb andũ equal to their most recent estimates.

Estimation of the variance components
After finding estimates of the location parameters, profile likelihood can be used to estimate the variance components by setting b =b(θ). The logarithmic likelihood function of the parameter vector θ can be written with the pseudodata as Differentiating equation (7) with respect to θ gives Maximum likelihood estimates of variance components are found by equating (8) to zero and solving for θ. Instead of the ML-estimates, REML-estimates are commonly used in practise. These estimates account for losses in degrees of freedom caused by the estimation of fixed effects b [5]. The logarithmic likelihood function is now Differentiation with respect to θ and equating to zero gives where Solutions in θ are REMLestimates of variance components.

EBLUP-algorithm
The approximate ML-solutions of location parameters and variance components can be obtained by iteratively solving the equations (6) and (8) until convergence. Correspondingly, the REML-solutions for the EBLUP-expansion are obtained by iteratively solving the equations (6) and (10). Hence, the algorithm fits the linear mixed effects model Y = X * b + Z * u + e for the pseudo-data Y and the working vectors X * and Z * , where u ∼ N(0, G(θ)) and e ∼ N(0, R(θ)).

Implementation
We chose to implement the REML-based EBLUP-algorithm, because programs to solve the linear mixed effects models are available and commonly used by animal breeders. MiX99 [13] was used to solve the mixed model equations (6), and DMU [6], modified for random regression by Kettunen et al. [7], was used to solve the REML estimates of covariance components (10). The capability of fitting fixed and random regression models is crucial for implementation, because the coefficients in X * and Z * can have any values. Implementation of the linearization procedure required the Gompertz function formulas to be included in MiX99. However, there was no need to make changes to the variance component estimation program.
Starting values for both the location parameter effects and the variance components had to be assigned before first iteration. A natural choice was to initialize random effects with the expected value zero. However, initial values for fixed effects were derived with the model function of growth curve and available data. When the Gompertz model is used, only the asymptotic weight parameter has a natural initial value, which is the maximum value of the dependent variable. In the other cases, complex equations were derived in order to have a stable algorithm. Initial values for covariance matrices of the genetic sire effect and animal effect were diagonal matrices having values diag{100,10,1}. The initial value for the residual variance was 100.
Additionally, variance components were reparametrized for computational reasons, because the variance component κ was close to zero. Convergence was improved by scaling the time before every round of the EBLUP-algorithm. Each time of measurement t i j was multiplied by a scaling factor c, which was set equal to the most recent estimate of κ. Consequently, the variance component estimate for the scaled parameter κ * was 1 c 2 Var(κ), and thus larger than the original parameter κ when c < 1.
Convergence of the EBLUP-algorithm was assumed when the relative round to round change was less than 10 −3 . Furthermore, within every iteration of the EBLUP-algorithm, the location parameters were iterated until the relative difference between right-hand and left-hand sides of the MME was less than 1 × 10 −7 . Covariance component estimates were calculated by the Expectation Maximization (EM) -algorithm, and convergence was assumed when the round to round change was less than 5 × 10 −7 .
The results are from 50 simulation replicates. Relative bias, relative standard deviation (Rel. SD) and relative mean squared error (Rel. MSE), as percentage from the true value, were calculated for the difference of two levels of fixed effect and for the variance component parameter estimates. The relative bias was calculated as (mean-true)/|true|, where mean is the arithmetic average of 50 estimates and true is the original value used to generate the data.

Complete data
The average number of iterations of the EBLUP-algorithm was 7 in the full data simulations. The residual error variance converged well and its estimate was equal to the original value used in the simulations (relative bias and relative standard deviation were 0% and 0.5%, respectively).
The estimated (co)variance components of the genetic sire effects were in fairly good agreement with the original parameter values used to simulate the data (Tab. I). Both the relative bias and the relative SD were higher for the covariance components (12.5% and 61.8% on average, respectively) than for the variance components (1.7% and 15.6% on average, respectively).
The estimated (co)variance components of animal effects were more accurately estimated than for the genetic sire effects (Tab. II). Estimated variance components had negligible relative bias and an average relative SD of 2.1%, but covariance components had an average relative bias of 1.4% and an average relative SD of 8.6%.
Simulation results for the difference of the two levels of fixed effects in the model are shown in Table III. The results of the parameters β and κ showed fairly good agreement with the initial values, with relative bias (Rel. SD) being 3.0% (5.2%). However, estimates for the parameter α were slightly biased. Relative bias was −10.0% and relative SD was 6.1%.

Truncated time trajectory data
The average number of iterations of the EBLUP-algorithm increased from 7 to 8 for the truncated time trajectory data. Residual error variance converged in this case as well and its estimate was equal to the original input value for the simulations (the relative bias and the relative standard deviation as percentages were −0.2% and 0.6%, respectively).
Compared to the full data, analysis for the truncated time trajectory data showed larger bias and SD for both the (co)variance components of genetic sire effect (Tab. I) and animal effect (Tab. II). The genetic sire effect had average relative bias (Rel. SD) of 9.6% (29.4%) for the estimated variance components and 33.4% (118.7%) for the estimated covariance components. The estimation Table III. of covariance between α and κ parameters was especially unstable. Animal effects had an average relative bias (Rel. SD) of 7.3% (5.5%) for the estimates of variance components. This increase was mostly due to increased uncertainty with parameter α. For the same reason, average relative bias (Rel. SD) increased to 18.2% (22.3%) for the estimates of covariance components. Simulation results for the difference of the two fixed effect levels corresponded to results with the full data (see Tab. III). The results of parameters β and κ again showed fairly good agreement with the initial values, but the results for the parameter α were 14% biased. For the truncated time trajectory data, both bias and SD were on average 60% larger than for the full data.

DISCUSSION
The results from the simulation showed fairly good agreement with the original values for the data, when observations from the whole growing period were available. The largest discrepancies were seen in the estimates of covariance components for the genetic effect. When the animals were slaughtered prematurely, the adult weight was not reached and the latter part of the growth function curve contained no data. This especially influenced the estimation of (co)variance components related to adult weight. For the genetic effect, the uncertainty of estimation was also seen in the (co)variance components related to the exponential decay of the initial growth rate.
A direct comparison of our study to the literature cannot be made. Furthermore, the non-linear curves in the animal breeding literature generally rely on Bayesian analysis for case-specific problems [2,11], and therefore comparisons are difficult to make. However, Wolfinger and Lin [18] considered a logistic model where variance component estimation results were similar to our case with full data. With respect to the fixed effects, the results were more biased for the Gompertz model than for the logistic model. For the truncated time trajectory data, no earlier results to compare were found in the scientific literature, where observations were available far over the inflection point [2,8]. This improves the quality of results compared to our truncated data, where the slaughter time exceeded the inflection point only slightly.
The initial parameter values for simulation of pig growth in our study are approximations from field data. Correct values may differ, but the close proximity of slaughter time and the inflection point is a common situation in field data. However, there may be pigs that have observations until adult weight. This is due to selection of tested pigs as breeding animals. To test the effect of partial truncation, the principles of full and truncated time trajectory data simulations were combined, i.e. approximately 5% of the tested progeny had observations until day 253, whereas observations from the rest of the progeny were truncated at 115 kg weight. Even a small proportion of fully observed animals improved the results when compared to the estimates produced from completely truncated data. For the genetic effect, improvements were especially shown as smaller relative bias and standard deviations for the covariance components and the variance component of parameter κ. For the variance and covariance components of genetic sire effect, the average relative biases (Rel. SD) were 3.4% (20.2%) and 6.6% (74.0%), respectively. For the animal effect, smaller biases were seen for the variance and covariance components of parameter α. The average relative bias (Rel. SD) was 4.6% (3.7%) for the variance components and 10.2% (13.5%) for the covariance components. We cannot make general recommendations about the proportion of animals with full data, because the results may be influenced by the population structure.
Starting values are important for the non-linear models in order for the algorithm to converge. We tried different strategies for defining the starting values, but general and simple equations were not discovered. Thus, the convergence of the algorithm with the presented parametrization depends on the proper starting values. However, the convergence may be improved by a different parametrization of the Gompertz model. Alternatively, a completely multiplicative model using log-transformed data can be analysed. This takes account of the common nature of the residuals in real growth data. In addition, logtransformation removes the dependence of the derivative of the adult weight parameter on the others.
The procedure presented is similar to that commonly used in animal breeding for linear models. The variance components estimated by REML are used in the mixed model equations to solve the location parameters. Consequently, even large models and data sets can be analysed when the variance components are assumed known. Easy implementation in already existing programs for linear mixed effects models is an advantage, although the two-step iterative procedure with each step itself being iterative can be regarded as computationally intensive. Another advantage of the method presented by Wolfinger and Lin is generality. It can be used for different types of models because it is developed for general non-linear mixed models. Also, generalization of this simple model to have multiple effects and traits is straightforward.
Linearization enables the linear mixed model procedures given the validity of the linear approximation. Simulation study is a convenient way to verify the appropriateness of the approximation method to specific situations. The full data set in our simulations shows that linearization works moderately well for the Gompertz function. Some enhancement may be achieved by reparametrization, which is a subject for additional research. Another motivation for linearization is allowance for sparse data. This is common in field data, where varying amounts of information are available for the animals. The truncated data analysis showed, however, that if observations are missing from the tails of all animal growth curves, uncertainty increases and the estimation method can be distorted. This distortion diminished considerably when at least some of the animals had observations until or close to their mature weight. Therefore, the success of the Gompertz model greatly depends on the amount and nature of available information.