A generalized estimating equations approach to quantitative trait locus detection of non-normal traits

To date, most statistical developments in QTL detection methodology have been directed at continuous traits with an underlying normal distribution. This paper presents a method for QTL analysis of non-normal traits using a generalized linear mixed model approach. Development of this method has been motivated by a backcross experiment involving two inbred lines of mice that was conducted in order to locate a QTL for litter size. A Poisson regression form is used to model litter size, with allowances made for under- as well as over-dispersion, as suggested by the experimental data. In addition to fixed parity effects, random animal effects have also been included in the model. However, the method is not fully parametric as the model is specified only in terms of means, variances and covariances, and not as a full probability model. Consequently, a generalized estimating equations (GEE) approach is used to fit the model. For statistical inferences, permutation tests and bootstrap procedures are used. This method is illustrated with simulated as well as experimental mouse data. Overall, the method is found to be quite reliable, and with modification, can be used for QTL detection for a range of other non-normally distributed traits.


INTRODUCTION
Various methods have been developed to detect a quantitative trait locus, ranging from the simpler regression based and method of moments, to maximum likelihood and Markov Chain Monte Carlo methods. These methods are mostly based on a continuous (normal) distribution of the trait. However, many traits of scientific and economic interest have a non-normal distribution. For example, binary data are frequently encountered with disease status, mortality, etc.
Count data occur in animal litter size and ovulation rate studies. Ordinal data (e.g. calving ease) and purely categorical traits are also encountered.
During the 1970s and 1980s, the generalized linear model (GLM 1 ) was developed as a uniform approach to handling all these above classes of data [27], and these procedures are now included in most major statistical packages. These methods would be applicable if data could be modeled as coming from one of the distributions of the exponential family (including Poisson for counts, binomial for binary and proportions data, as well as the normal distribution). Departures from the nominal variance-mean relationships can be handled by introducing additional dispersion parameters [27], and using a quasi-likelihood instead of the standard likelihood [43].
However, standard GLMs consider fixed effects only, and do not allow for any correlation structure in the data. Since the late 1980s, various methods have been developed to extend these GLMs to include the additional correlation structures [4,8]. One way to classify such extended GLMs is whether or not additional random effects are included in the model to take account of the correlation. When included, the type of model is usually termed a generalized linear mixed model (GLMM), or otherwise a marginal model. Another split in the type of approach is whether or not full parametric modeling is assumed. Specification of a full probability model for these extended GLMs usually involves numerical integration to evaluate the likelihood [4,28], or computer simulation if Markov Chain Monte Carlo methods are used [45]. An alternative approach has been developed that only makes assumptions about means, variances and covariance structures. This approach, known as generalized estimating equations (GEEs) was pioneered in the human epidemiology and biostatistics field [23,31], and a recent paper by Lange and Whittaker [21] has introduced this method to the field of QTL detection. The GEE approach and will be the basis in this paper for developing QTL models for non-normal data, although a somewhat different method of implementation will be used.
Models to detect QTLs differ fundamentally from the standard statistical linear models (LM), linear mixed models (LMM), as well as the models for non-normal data mentioned above (GLM and GLMM). The unobserved QTL genotypes result in a "missing data" problem, and general mixture methods are used to fit such models, frequently using the E-M algorithm [6,15,16,24].
Although the vast majority of QTL methodology papers are concerned with normally distributed traits, a minority do consider methods for non-normally distributed traits. Jansen's [15,16] general mixture methods provide a framework for modeling such traits as a finite mixture of GLMs. Visscher et al. [40] developed methods for analyzing binary traits from inbred lines, while Xu and Atchley [44] and Kadarmideen et al. [18] considered methods for outbred lines. Hackett and Weller [12] outlined a method for detecting a QTL for traits with an ordinal scale, by means of finite mixture modeling of an underlying liability measure. Other methods for ordinal QTL analysis have been proposed by Rao and Xu [33] and Spyrides-Cunha et al. [36].
The LMM -and in particular BLUP methodology -is central to both the theory and application of animal breeding [14], and these methods have been adapted to QTL detection [29,30,39]. Particularly through the use of Markov Chain Monte Carlo methods, complex pedigree structures are now routinely taken into account, at least for normally distributed traits [2,42].
The current paper provides a framework for QTL detection for non-normal traits with the addition of random polygenetic and/or environmental effects, and is an expansion of the method presented previously by Thomson [38]. This research has been motivated by finding a QTL for litter size in mice, a discrete (non-normal) variable. The method is general enough to be applied to other non-normal traits, especially within the context of inbred lines, and with certain modifications, to outbred lines. However, the method will be derived in terms of the mouse litter size model.

GENETIC EXPERIMENTAL DESIGN AND ASSUMPTIONS
Two inbred strains of mice were available, a highly prolific IQS5 (Inbred Quackenbush Swiss Line 5) strain (labeled S 1 here), and a regular C57BL/6J strain (labeled S 2 ). Their mean litter sizes were 15.5 and 7.0 pups respectively. Both strains can be assumed to be homozygous for all genes, at least for those relevant for the current analysis. These strains were crossed (F 1 generation), then backcrossed with both S 1 and S 2 males yielding BC 1 (= S 1 × F 1 ) and BC 2 (= S 2 × F 1 ). Each backcross female was then mated with a standard reference line of males on four occasions, and the litter size (and other phenotypic data) was recorded at each of the four parities. In addition, each backcross female was genotyped with 66 markers distributed over 18 chromosomes. Further details of the experimental procedures can be found in Silva [35] and Maqbool [25].
We will assume that there is a single QTL gene Q with alleles Q and q responsible for litter size. Similarly, we will denote the set of markers as M k ; k = 1, 2, . . . with alleles M k and m k . Thus we are assuming that parental S 1 genotypes are all QQ and M k M k while all S 2 genotypes are all qq and m k m k . All F 1 individuals are consequently heterozygous for all genes, Qq and M k m k . Genetic heterogeneity occurs in the backcrosses (BC 1 : QQ or Qq at Q; M k M k or M k m k at M k ; and for BC 2 : qQ or qq at Q; m k M k or m k m k at M k ). Relative frequencies of recombinant events (between QTL and markers) are then used to estimate the QTL location, based on flanking-marker methods (in the body of a chromosome) and single-marker methods (at the end of a chromosome).

Model for litter size
The basic model for litter size is a Poisson regression model. However, since there is empirical evidence that the variance:mean ratio is not unity, and that this ratio varies with parity, a dispersion parameter is included for each parity.
Rather than a full parametric model specification, only the first two moments are specified. The conditional means and variances are: where Y ij = litter size; µ = overall constant; α i = fixed parity effect (i = 1, . . . , 4); u j = random animal effect ( j = 1, . . . , n); q j = unobserved QTL genotype indicator variables; γ = (γ QQ , γ Qq , γ qQ , γ qq ) = QTL effects; and φ i = parity − specific dispersion parameter. Note that the terms of the model are additive on a logarithmic scale, i.e., and hence this type of model is also termed a log-linear model [27]. In particular, the effects become multiplicative when back-transformed to the original scale. For example, assuming that α 4 = 0 (parity 4 is reference group), then parity 1 has exp(α 1 )× the number of mouse pups on average, compared with parity 4. The QTL effects, γ, are provided to cater for the four possible QTL genotypes, with genotypes QQ and Qq originating from BC 1 and qq and qQ originating from BC 2 . Note that we do not assume γ Qq = γ qQ since these heterozygous genotypes also have different amounts of background genes coming from the appropriate parental strain (BC 1 has 75% of genetic material originating from S 1 compared with 25% originating from S 1 for BC 2 ). This issue will be discussed in detail later. The unobserved q j may be one of two forms, say q (1) j or q (2) j , with probability of 1/2 for either form, where superscript (1) and (2) indicate the homozygous and heterozygous forms of Q respectively. The observations y ij are assumed to be conditionally independent, given the random animal effect (u j ) and QTL genotype (q j ) and it is also assumed that random effects are normally distributed, u j ∼ N(0, σ 2 U ). It will also be useful subsequently to write the model in a matrix "regression"type form. We write the observed data set as a vector y = (y 1 , y 2 , . . . , y n ) where y j = (y 1j , y 2j , y 3j , y 4j ) . The conditional mean vector is: where u ∼ N(0, σ 2 U I n ); X = design matrix for fixed parity effects; Z = design matrix for random animal effects; and Q = random QTL incidence matrix = (q 1 , q 2 , . . . , q n ) .
In the current application with four records per animal, Z = I n ⊗ 1 4 where ⊗ is the Kronecker product.

An alternative parameterization for the QTL effects
Although it is computationally convenient to parameterize the QTL effects as γ = (γ QQ , γ Qq , γ qQ , γ qq ) (with γ qq = 0), a more useful and interpretable parameterization is to use an extension of the Falconer notation [9], by introducing additive (a) dominance (d) and a backcross effect (b). The backcross effect would act as a "bucket" to account for any additional genes affecting litter size not accounted for by the QTL gene Q. Specifically, the re-parameterization involves setting: where µ is a new overall constant. Note that γ = (γ QQ , γ Qq , γ qQ , γ qq ) is over-parameterized, and that we may set γ qq = 0, so both methods involve three estimable QTL parameters. Again, these effects operate on the log mean scale.

Marginal modeling approach
Since there are relatively few observations per animal for estimating the u j , a marginal modeling approach is used here whereby the dispersion components will be estimated, rather than the individual random effects. An approach similar to that in McCullagh and Nelder ( [27], p. 332) will be used.
Firstly, the dependence on the random effects is removed yielding: The covariance of litter size within an animal (i.e., across parities) is Next, the unknown QTL genotype dependence can be removed. Let µ (1) ij and µ (2) ij be the two possible mean litter sizes, E Y ij |q j , depending on the particular QTL genotype indexed by q j . In particular, µ (1) ij is the mean for the homozygous QTL and µ (2) ij is the mean for the heterozygous QTL. Let π j be the probability for a homozygous QTL genotype for animal j, given the marker genotype(s), m j . This will depend on the recombination fraction between the QTL and single marker (r) or flanking markers (r 1 , r 2 ) which in turn depends on the location of the QTL on the chromosome (d Q ). So the conditional moments, given the marker information, are These results may be expressed in matrix notation as E(Y|M) = µ(Ω) and Note that V has a block diagonal structure, with each block, V j say, corresponding to the four records for each animal y j .

QTL genotype probabilities
For backcross 1, two QTL genotypes are possible, QQ and Qq, whereas for backcross 2, qQ and qq are possible. The QTL genotype probabilities are defined as the probabilities of obtaining the homozygous genotype, given the marker genotype(s) m j of the animal, i.e., For a single marker model, let r be the recombination fraction between the QTL Q and a marker M. Then: For a flanking marker (interval mapping) model, let (0 ≤ d ≤ L) represent the map position on a chromosome of length L, and assume the QTL is located between adjacent markers, M 1 and M 2 , say. Let the positions of the markers and QTL be d 1 , It is assumed that d 1 and d 2 are known without error. Then assuming Haldane's [13] mapping function, we have: and where r 1 and r 2 are the recombination fractions between the two markers and the QTL respectively. In this case, the QTL genotype probabilities are

PARAMETER ESTIMATION
Since the model is not fully parametric, maximum likelihood cannot be used, and we consequently use a generalized estimating equations (GEE) approach [4,11,21,23,27] in which the quasi-likelihood takes the place of the log-likelihood [27,43]. There are two sets of parameters to be estimated, a set of "location" effects, θ = (µ, α , γ ) , and a set of "dispersion" effects, , and so the vector of all parameters is Ω = (θ , ψ ) . In particular, we solve two sets of GEEs simultaneously, one for each of the sets of effects, and this is known as the GEE2 approach [31,32]. Note that these GEEs are the analog of the likelihood estimating (score) equations for maximum likelihood estimation, and the normal equations for standard linear models. A set of linear GEEs is used to estimate θ and a set of quadratic GEEs used to estimate ψ. For this second GEE, we define the following quadratic variables for animal j, 1j , y 1j y 2j , y 1j y 3j , y 1j y 4j , y 2 2j , . . . , y 2 4j . The y j are the data that provide information on location effects, while the z j are the data that provide information on the dispersion (variance, covariance) effects. The following two sets of nonlinear equations are then solved, Expressions for ν j can be obtained by using standard results, namely, that . However, analytical expressions for W j are more difficult as they require further assumptions to made about 3rd and 4th order moments of Y ij . Prentice and Zhao [32] have outlined some possible choices and guidelines for choosing appropriate W j . However, these authors as well as Diggle et al. [4] have noted that the estimation procedure is fairly robust against choices of W j . In the current application, an alternative is to provide an empirical estimate of W assumed common for all animals, i.e., where n Ω is the number of elements of Ω to be estimated (12 here), andν j is the estimate of ν j based onΩ, the current estimate of Ω. Such an approach will in part avoid specific moment assumptions being made.
The sets of GEEs can be solved iteratively using a Newton-Raphson method with Fisher scoring, where the superscript (i) indicates the estimates at the ith iteration.

Parameter estimation in interval mapping
In practice, we want to look for the evidence for a QTL at different map positions (d) along the length of a chromosome. Consequently, we fit the QTL model at each d using the above estimating equations, but leaving out the parameter d Q .
• For d = 0 to L in steps of ∆ d (usually 1 cM): solve the GEEs for a fixed value of d to obtain estimatesθ(d),ψ(d); calculate the quasi-score function for the QTL at position d; However, U(d) = 0 has multiple solutions along the length of the chromosome, corresponding to local maxima of a profile log-likelihood (see Fig. 1). One solution therefore is to calculate the profile log-likelihood of d given the data z j , assuming that z j is multivariate normal N(ν j , W j ), i.e.
ignoring the normalizing constant, where the ν j (and hence W j ) are evaluated using the parameter estimates at the current map position, d. Note that since we have not specified a fully parametric model for litter size, we cannot calculate the likelihood exactly. We are using the normal-based profile log-likelihood as a "first-order"approximation here. However, some independent support for this as a measure is provided by constructing a quasi-likelihood function, as follows.
In standard parametric models, the score function U(θ) for some parameter θ is related to the log-likelihood function L(θ) by means of U(θ) = ∂ ln L(θ)/∂θ, and hence log L(θ) = θ θ min U(t)dt + C for θ min ≤ θ ≤ θ max [3,4]. The same results hold when dealing with profile log-likelihoods and profile score functions. In a similar way, we can construct the profile quasi-likelihood function, say, where C is a normalizing constant. The integral U * (d) can be approximated by a simple cumulative sum approach, Note that as a general rule with GEEs for correlated data, it is not possible to reconstruct the quasi-likelihood function Q(θ) based on the quasi-score function U(θ) = D V −1 (y − µ) ( [27], p. 333). However, it is possible in the current context as we have reduced the parameter space to one dimension (d Q ) by means of a profile quasi-score function, is readily integrated to produce Q(d).
Consideration of an appropriate choice of the normalizing constant C will be considered later. Regardless of the choice of C, the global maximum of Q(d) is the parameter estimate of d Q , corresponding to a solution of U(d) = 0. However, based on simulation studies, it was found that using either L(d) or Q(d) to estimate the QTL location gives extremely similar results. Furthermore, the shape of the two functions is also extremely similar, especially for large numbers of sets of records (n), as shown in Figure 1.

TESTING FOR THE EXISTENCE OF A QTL
Using either L(d) or Q(d), the location of a QTL can be estimated. However there remains the issue of whether or not the QTL actually exists at this map position. To address this, a null model is fitted whereby both QTL parameters a and d are set to zero, i.e., γ QQ = γ Qq and γ qQ = γ qq (= 0). That is, only the backcross effect, b is assumed. Recall that this is used as a "bucket" term for the effects of genes other than Q.
To fit a model only involving backcross effects, the GEE2 approach is again used. However, this model is simpler in that it is a non-mixture model. Writing the backcross effect as γ 0 (= γ QQ = γ Qq ), and s j as a 0-1 indicator variable for backcross 1, the marginal moments of Y ij are Having estimated Ω 0 = (µ, α , γ 0 , σ 2 U , φ ) , the normal based log-likelihood corresponding to the z j is calculated, say L 0 . Hence a likelihood-ratio type test statistic can then be calculated along the length of the chromosome, as This may then be converted into a LOD score, i.e., LOD(d) = L R (d)/ ln (10).
A test statistic may also be constructed based on the quasi-likelihood function. To do this, we set the constant of integration C in such a way that the average of the Q(d) equals the average of the L(d), over the range 0 ≤ d ≤ L, i.e., set Using this choice of C, the quasi-likelihood test statistic may be interpreted like a likelihood-ratio test statistic; we shall label this test statistic Q R (d).
As a very crude measure, we may apply χ 2 approximations to the distribution of L R (d) (and Q R (d)) to assess the significance of the QTL at position d Q . That is we may test However, L R (d Q ) does not behave like an ordinary likelihood-ratio test statistic, as noted in other QTL studies [20,34]. An alternative method is to apply a permutation test to assess the significance of the QTL [5]. In the current model, this is achieved by randomly permuting the maker data m j with the phenotypic data y j . However, permutations must be done within each backcross group so as to preserve the backcross effects. Each permuted data set should contain the same numbers of BC 1 and BC 2 records as in the observed data set. Repeated permutations and subsequent model fitting allow the distribution of L R (d Q ) under H 0 to be obtained, and the significance of the observed L R (d Q ) can then be assessed as the upper tail percentile of the null distribution.
Similarly, the bootstrap can be used as a method to obtain a reliable 95% confidence interval for d Q as well as other parameters [7,41]. For this (unselective bootstrap) approach, we randomly select (with replacement) complete (m j , y j ) records, again using the same number of BC 1 and BC 2 records as in the observed data set. Confidence intervals are obtained based on the appropriate percentiles of the bootstrap distribution, and this can also be used to calculate approximate standard errors for parameter estimates. Further improvements to the confidence intervals could be obtained using a selective bootstrap approach which more closely emulates the actual mapping process [22].
Applying the GEE2 procedure, the interval map as shown in Figure 1 was obtained. As mentioned previously, there is an extremely close agreement between the two test statistic profiles, Q R (d) and L R (d). In addition, the estimated QTL location was essentially the same at 0.27 M, quite close to 0.3 M. Other parameter estimates were similarly quite acceptable:μ = 1.77,α = (−0.328, −0.129, 0.366, 0) ,γ = (0.753, 0.522, 0.231, 0) ,σ 2 U = 0.0935, and φ = (0.399, 0.920, 1.606, 2.118) . Note that these estimates are those based on the maximum L R (d), however estimates of µ, γ, σ 2 U and φ are nearly identical when the maximum of Q R (d) is used. Since the parity effects α are independent of the QTL, their estimates are identical for either criterion; furthermore their estimates do not change along the whole length of the chromosome. The maximum value of L R (d) was 38.06, and using asymptotic χ 2 methods gives P < 0.001 for a test of no linked QTL. As a check, a permutation test was conducted using 1000 permutations. As none of the permutations had a test statistic this large, we can again conclude that P < 0.001. Although these P-values agree, the overall distribution of L R (d) under H 0 is not well approximated by a 1/2χ 2 2 distribution. This is demonstrated in Figure 2 which shows the histogram of the distribution of L R (d) against the 1/2χ 2 2 . As would be expected from this permutation-based distribution with no linked QTL, the means of the QTL estimates for BC 1 were nearly identical (0.515 and 0.517 for γ QQ and γ Qq respectively), and the mean QTL estimate for BC 2 was nearly zero (0.0006 for γ qQ , recall γ qq = 0 by design).
If the 1/2χ 2 1 approximation is used, a 95% confidence interval for d Q is obtained as 0.23 M to 0.32 M. In comparison, a bootstrap confidence interval, based on 1000 bootstrap simulations, gives an interval of 0.23 M to 0.40 M, somewhat wider than the asymptotic theory estimate. However, the histogram ofd Q reveals a bimodality with 87% of the distribution occurring between the markers at 1/6 and 2/6 M, and the balance between 2/6 and 3/6 M (Fig. 3).
In addition, the bootstrap procedure may be used to obtain standard errors (as well as confidence intervals) of any parameter estimates of the model. For a parameter estimateθ of θ, its bootstrap standard error is calculated as: whereθ i is the estimate obtained from the ith bootstrap data set (i = 1, . . . , B), andθ is the mean of the B bootstrap estimates. Further, differences betweenθ from the original data andθ may be used to assess possible bias in the parameter  Table I. For the current model and simulated data, it would appear no substantial bias in estimation does occur.
Estimates and standard errors for the alternative parameterization of the QTL effects (additive, dominance, and backcross terms) can be achieved as follows. Noting that: That is we obtainâ = 0.231 with se(â) = 0.0224,d = 0.000, with se(d) = 0.0265, andb = 0.146 with se(b) = 0.0166.

Mouse data
The method has been used to estimate QTL from the data provided by Silva [35] and Maqbool [25]. The most promising region for a QTL for litter  The results for the analysis are presented here. The estimated QTL location was at the marker (40 cM) (see Fig. 4) and based on a permutation test was significant (P = 0.01); however there was an extremely wide bootstrap 95% confidence interval from 0 to 108 cM. It was apparent that insufficient mice were available for reliably locating a QTL. To evaluate the power for this design to detect a QTL, the permutation (no linkage) and bootstrap (with linkage) distributions were further utilized. The critical value for testing linkage is the upper 5% point of the test statistic L from the permutation distribution: 4.09 here. If we use the parameter estimates as though they were the actual parameter values, the bootstrap distribution provides the distribution under the alternative (linkage) hypothesis. Since only 30% of bootstrap simulations returned L ≥ 4.09, the power or this design to detect a QTL is estimated at 30%.
The other estimates obtained from the data wereμ

MONTE CARLO SIMULATION STUDY
A Monte Carlo study has been conducted to assess the performance of this procedure, particularly to assess the effect of varying the number of animals available. Each Monte Carlo study consisted of 1000 simulations using the parameters as specified in the Simulated data section of Numerical illustrations. Equal numbers of BC 1 and BC 2 animals were considered, with the number in each backcross group being 50, 100, 200, and 500. As well as simulating the linked situation (QTL at 0.3 M), an unlinked situation was also simulated, allowing the distribution of the test statistic under the no linkage hypothesis to be obtained, providing critical values for the calculation of power. Summary results are shown in Table II.   In general, there is relatively little bias in parameter estimation, especially as the number of animals increases. Similarly, there are reductions in standard errors of parameter estimates as the number of animals increases. It is evident that QTL location is extremely difficult to estimate for small numbers of records: with 50 animals per backcross, the bias was +20% with a standard error of about 50% of the mean. This is also demonstrated in the power analysis: a power of less than 50% to detect the QTL when only 50 animals are used per backcross, compared with a power of approximately 80% when the number of animals are doubled. A further doubling results in almost certain detection of the QTL.

DISCUSSION AND CONCLUSIONS
It was mentioned previously that the method presented here could be modified for other non-normal data types. At a more general level, we can write a model in the form g[E(Y|u, Q)] = Xβ+Zu+ZQγ where g(·) is the appropriate link function for the class of data (ln for count, logit for binary, identity for normal). To fit the QTL model for different classes of data, relatively little needs to be modified. We need to: (1) Evaluate the moments (given the marker data), µ(Ω) = E(Y|M), and V(Ω) = var(Y|M). Note that approximations may need to be used here [27]. (2) Evaluate the derivative matrices, D and E.
Having calculated these, all the other theory developed here may be applied without modification.
As mentioned in the Introduction, Lange and Whittaker [21] have also described a QTL detection strategy using GEEs. The approach they develop stems from a generalization of a regression method, as opposed to from a likelihood-based mixture method. If the random animal effects were not included in the model, both the current model and the one proposed by Lange and Whittaker can be expressed as: where g(·) and g −1 (·) are the link and inverse link functions respectively. In the current approach, expressions for the mean response, conditional only upon marker information, were obtained, This contrasts the approach adopted by Lange and Whittaker, Their approach has the benefit that the expression E Q (Xβ + ZQγ|M) will be linear in the parameters (β, γ), and so the resultant structure for µ LW (θ) is a generalized linear model form, allowing implementation within standard GEE software. However, the expression for µ LW (θ) will only approximate the "true" mean expression, µ(θ), since in general, apart from when g(·) is the identity link used for standard linear models. It should be noted that E Q [g −1 (Xβ + ZQγ)|M] is nonlinear in the parameters, so does no longer fit within the usual generalized linear model framework, and consequently requires additional programming effort. Analogous differences can also be made between V(θ) and V LW (θ). Clearly, there is scope for further development of this class of model. As a method of QTL analysis, we need to allow for multiple QTL affecting the trait of interest by means of a composite interval mapping or allied approach [17,46]. This can be implemented in the current model easily by including additional (marker) terms in the "fixed effect" part of the model. Other scope exists for handling repeated measures (longitudinal) data by applying one of the techniques outlined in Diggle et al. [4]. In the litter size example considered here, no serial correlation in the data is assumed: the only correlation is assumed to originate from a common random animal effect (u j ) and common QTL effect (q j ). The illustrative data used here consist of sets of four repeat measurements per animal; with extended longitudinal data sets, this aspect would need to be addressed.
There are several alternative approaches that might be used for modeling litter size data. Firstly, a normal-based model might be used, perhaps after first making some transformation of the data to a more normal scale. However, this would fail to address the underlying discrete data distribution. While the litter size data had a relatively large mean -and consequently normal-based methods might have been a reasonable approximation -the method derived can be applied reliably for animals with smaller litter sizes, such as awassi sheep. Indeed the method can be used on any other count type trait.
Another approach is to model litter size on an ordinal scale, using the methods presented in Hackett and Weller [12]. While attractive in a number of ways, additional parameters need to be estimated for the ordinal scale, and it also fails to capture all the information, since litter size is a measurement scale variable. Ordinal scale analyses usually assume a continuous underlying liability scale with the cut points identifying the particular response category realized. The appeal of an underlying liability may also be assumed in the current approach outlined here. We may consider the (conditional) mean litter size E(Y ij |u j , q j ) as the liability from which the observed litter size is drawn. However, unlike the ordinal scale models, the actual realization is fully stochastic which is biologically more appealing than the extended all-or-none threshold approach of ordinal scale modeling.
Various parametric models have been used to analyze litter size data. Foulley et al. [10] and Matos et al. [26] have used Poisson based models. Templeman and Gianola [37] have added random effects and catered for over-dispersion by fitting negative binomial models to litter size data. To a certain extent, a similar approach was used in the model derived here. Namely, a basically Poisson regression approach was used; however under-as well as over-dispersion was allowed for in the model. In addition, the model was not fully parametric: only assumptions about means, variances, and covariances were made rather than a full probability model. Intuitively, this approach would be expected to be relatively robust against the true (but unknown) underlying probability model.
However, there are difficulties with applying these Poisson-based models to litter size and ovulation rate data. While they may fit the data well empirically, the assumptions that lead to a Poisson process [3] cannot be easily justified for this type of variable. What is required is a mechanistic model for litter size as opposed to a descriptive model. Considerable research has been undertaken on determining the biological determinants that contribute to ovulation rates and litter size [1,19]. Biological models such as these could form the basis for a mechanistic stochastic model of litter size.