Statistical analysis of ordered categorical data via a structural heteroskedastic threshold model

Summary - In the standard threshold model, differences among statistical subpopulations in the distribution of ordered polychotomous responses are modeled via differences in location parameters of an underlying normal scale. A new model is proposed whereby subpopulations can also differ in dispersion (scaling) parameters. Heterogeneity in such parameters is described using a structural linear model and a loglink function involving continuous or discrete covariates. Inference (estimation, testing procedures, goodness of fit) about parameters in fixed-effects models is based on likelihood procedures. Bayesian techniques are also described to deal with mixed-effects model structures. An application to calving ease scores in the US Simmental breed is presented; the heteroskedastic threshold model had a better goodness of fit than the standard one. threshold character / heteroskedasticity / maximum likelihood/ mixed linear model / calving difficulty

The threshold model postulates an underlying (liability) normal distribution rendered discrete via threshold values. The probability of response in a given category can be expressed as the difference between normal cumulative distribution functions having as arguments the upper and lower thresholds minus the mean liability for subpopulation divided by the corresponding standard deviation.
Usually the location parameter ( ?7i ) for a subpopulation is expressed as a linear function 77i = t'O of some explanatory variables (row incidence vector ti) (see theory of generalized linear models, McCullagh and Nelder, 1989;Fahrmeir and Tutz, 1994). The vector of unknowns (e) may include both fixed and random effects and statistical procedures have been developed to make inferences about such a mixed-model structure Harville and Mee, 1984;Gilmour et al, 1987). In all these studies, the standard deviations (also called the scaling parameters) are assumed to be known and equal, or proportional to known quantities (Foulley, 1987;Misztal et al, 1988).
The purpose of this paper is to extend the standard threshold model (S-TM) to a model allowing for heteroskedasticity (H-TM) with modeling of the unknown scaling parameters. For simplicity, the theory will be presented using a fixed-effects model and likelihood procedures for inference. Mixed-model extensions based on Bayesian techniques will also be outlined. The theory will be illustrated with an example on calving difficulty scores in Simmental cattle from the USA.

Statistical model
The overall population is assumed to be stratified into several subpopulations (eg, subclasses of sex, parity, age, genotypes, etc) indexed by i = 1, 2, ... , I representing potential sources of variation. Let J be the number of ordered response categories indexed by j, and y i+ = { Yij+ } be the (J x 1) vector whose element y2!+ is the total number of responses in category j for subpopulation i. The vector y 2+ can be written as a sum and (3) I -1 contrasts among log-scaling parameters (eg, ln(Q i ) -ln(<7i) for i = 2, ... , I) or, equivalently, I -1 standard deviation ratios (eg, O &dquo;d O &dquo;d, with one of these arbitrarily set to a fixed value (eg, < 7i = 1). This makes a total of 2I + J -3 identifiable parameters, so that the full H-TM reduces to the saturated model for J = 3 categories, see examples in Falconer (1960), chapter 18.
More parsimonious models can also be envisioned. For instance, in a twoway crossclassified layout with A rows and B columns (I = AB), there are 16 additive models that can be used to describe the location ( ?7i ) and the scaling (o j ) parameters. The simplest one would have a common mean and standard deviation for the I = AB populations. The most complete one would include the main effects of A and B factors for both the location and dispersion parameters. Here there are 2(A+B)+J-5 estimable parameters, ie, J-1 thresholds plus twice (A-1)+(B-1). Under an additive model for the location parameters 71i , it is possible to fit the H-TM to binary data. For the crossclassified layout with A rows and B columns, there are AB -2(A + B) + 3 degrees of freedom available which means that we need A (or B) ! 4 to fit an additive model using all the levels of A and B at both the location and dispersion levels. Finally, it must be noted that in this particular case, dispersion parameters act as substitutes of interaction effects for location parameters.
Here, the log likelihood L(a; y) can be expressed, apart from an additive constant, as: T T with, given !4!, [5] and !6!, The maximum likelihood (ML) estimator of a can be computed using a secondorder algorithm. A convenient choice for multinomial data is the scoring algorithm, because Fisher's information measure is simple here. The system of equations to solve iteratively can be written as: -where U(a; y) = <9L((x;y)/<9<x and J(a) _ -E [å2L(a;y)/åaåa'] are the score function and Fisher's information matrix respectively; k is iterate number. Analytical expressions for the elements of U(a; y) and J(a) are given in Appendix 1. These are generalizations of formulae given by Gianola and Foulley (1983) and Misztal et al (1988). In some instances, one may consider a backtracking procedure (Denis and Schnabel, 1983)  (over the ni observations made in subpopulation i) of indicator vectors y ir = (Yilr i Yi2r i... i Yijri ... i YiJr)l such that !r=l 1 or 0 depending on whether a response for observation (r) in population (i) is in category (j) or not. Given n i independent repetitions of Yin the sum y i+ is multinomially distributed j with parameters n i = ! y ij+ and probability vector Ii i = {lIi j }. j = l In the threshold model, the probabilities 1I jj are connected to the underlying normal variables X ir with threshold values Tj via the statement with To = -oo and Tj = + 00 , so that there are J -1 finite thresholds. With Xi r rv N(!2,Q2), this becomes: where !(.) is the CDF of the standardized normal distribution.
The mean liability ( ? 7i) for the ith subpopulation is modeled as in Gianola and Foulley (1983) and Harville and Mee (1984), and as in generalized linear models (McCullagh and Nelder, 1989) in terms of the linear predictor Here, the vector (p x 1) of unknowns (0) involves fixed effects only and x i is the corresponding (p x 1) vector of qualitative or quantitative covariates.
In the H-TM, a structure is imposed on the scaling parameters. As in Foulley et al (1990,1992) and Foulley and Quaas (1995), the natural logarithm of Q i is written as a linear combination of some unknown (r x 1) real-valued vector of parameters (6), 1 p' being the corresponding row incidence vector of qualitative or continuous covariates.

Identifiability of parameters
In the case of I subpopulations and J categories, there is a maximum of I(J -1) identifiable (or estimable) parameters if the margins n i are fixed by sampling. These are the parameters of the so-called saturated model.
What is the most complete H-TM (or 'full' model) that can be fitted to the data using the approach described here? One can estimate: (1) J -1 finite threshold values or, equivalently, J -2 differences among these (eg, Tj -T1 for j = 2, ... , J-1) plus a baseline population effect (eg, q i -Ti ); (2) 1 -1 contrasts among q < values;

Goodness of fit
The two usual statistics, Pearson's X Z and the (scaled) deviance D * can be used to check the overall adequacy of a model. These are where fig = 77 tj ((x) is the ML estimate of 1I j , and Above, D * is based on the likelihood ratio statistic for fitting the entertained model against a saturated model having as many parameters as there are algebraically independent variables in the data vector, ie, 1(J -1) here. Data should be grouped as much as possible for the asymptotic chi-square distribution to hold in [9] and [10] (McCullagh and Nelder, 1989;Fahrmeir and Tutz, 1994). The degrees of freedom to consider here are I(J -1) (saturated model) minus ((J -1) + rank(X) + rank(P)] (model under study), where X and P are the incidence matrices for (3 and b respectively. It should be noted that [9] and [10] can be computed as particular cases of the power divergent statistics introduced by Read and Cressie (1984).

Hypothesis testing
Tests of hypotheses about y = 6')' can be carried out via either Wald's test or the likelihood ratio (or deviance) test. The first procedure relies on the properties of consistency and asymptotic normality of the ML estimator. which under H o has an asymptotic chi-square distribution with rank(K) degrees of freedom. Above, r(y) is an appropriate block of the inverse of Fisher's information matrix evaluated at y = y , where y is the ML estimator.
The likelihood ratio statistic (LRS) allows testing nested hypotheses of the where y and y are the ML estimators of y under the restricted and unrestricted models respectively. The criterion !# also can be computed as the difference in (scaled) deviances of the restricted and unrestricted models This is equivalent to what is usually done in ANOVA except that residual sums of squares are replaced by deviances.
Under H o , A# has an asymptotic chi-square distribution with r = dim(D) -dim( Do ) degrees of freedom. Under the same null hypothesis, the Wald and LR statistics have the same asymptotic distribution. However, Wald's statistic is based on a quadratic approximation of the loglikelihood around its maximum.

Including random effects
In many applications, the y i r 's cannot be assumed to be independent repetitions due to some cluster structure in the data. This is the case in quantitative genetics and animal breeding with genetically related animals, common environmental effects and repeated measurements on the same individuals.
Correlations can be accounted for conveniently via a mixed model structure on the 'T7!S, written now as where the fixed component x!13 is as before, and u is a (q x 1) vector of Gaussian random effects with corresponding incidence row vector zi.
For simplicity, we will consider a one-way random model, ie, u ! N(O, Ao, u 2 ) (A is a positive definite matrix of known elements such as kinship coefficients), but the extension to several u-components is straightforward. The random part of the location is rewritten as in Foulley and Quaas (1995) as Z!O&dquo;Ui u* where u * is a vector of standard normal deviation, and a u , is the square root of the u-component of variance, the value of which may be specific to subpopulation i. For instance, the sire variance may vary according to the environment in which the progeny of the sires is raised. Furthermore, it will be assumed that the ratio 0' .,,, /a i , where a i is now the residual variance, is constant across subpopulations. In a sire by environment layout, this is tantamount to assuming homogeneous intraclass correlations (or heritability) across environments, which seems to be a reasonable assumption in practice (Visscher, 1992  with u# = pu * ; (ii) X =J;xi,X2,...,x!,...,x// by S = (S1, · .. , Si, ... , SI)' with S ' = (x!, !izi); (iii) add p-2 A-1 to the coefficients of the u# x u# block pertaining to the random effects on the left hand side and _p-2 A -lU[k] to the u#-part of the right hand side (see Appendix 1). A test example is shown in Appendix 2.
A further step would be to estimate p using an EM marginal maximum likelihood procedure based on This may involve either an approximate calculation of the conditional expectation of the quadratic in u# as in Harville and Mee (1984), Hoeschele et al (1987) and Foulley et al (1990), or a Monte-Carlo calculation of this conditional expectation using, for example, a Gibbs sampling scheme (Natarajan, 1995). Alternative procedures for estimating p might be also envisioned, such as the iterated re-weighted REML of Engel et al (1995).

Material
The data set analyzed was a contingency table of calving difficulty scores (from 1 to 4) recorded on purebred US-Simmental cows distributed according to sex of calf (males, females) and age of dam at calving in years. Scores 3 and 4 were pooled on account of the low frequency of score 4. Nine levels were considered for age of dam: < 2 years, 2.0-2.5, 2.5-3.0, 3.0-3.5, 3.5-4.0, 4.0-4.5, 4.5-5.0, 5.0-8.0, and > 8.0 years. In the analysis of the scaling parameters, six levels were considered for this factor: < 2 years, 2.0-2.5, 2.5-3.0, 3.0-4.0, 4.0-8.0, and > 8.0 years. The distribution of the 363 859 records by sex-age of dam combinations is displayed in table I, as well as the frequencies of the three categories of calving scores. The raw data reveal the usual pattern of highest calving difficulty in male calves out of younger dams. However, more can be said about the phenomenon.

Method
Data were analyzed with standard (S-TM) and heteroskedastic (H-TM) threshold models. Location and scaling parameters were described using fixed models involving sex (S) and age of dam (A) as factors of variation. In both cases, inference was based on maximum likelihood procedures. A log-link function was used for scaling parameters.
With J = 3 categories, the most highly parameterized S-TM that can be fitted for the location structure includes J -1 = 2 threshold values (or, equivalently, the difference between thresholds ( 72 -71 ) and a baseline population effect / -l ), plus sex (one contrast), age of dam (eight contrasts) and their interaction (eight contrasts) as elements of (3; this gives r(X) = 17 which yields 19 as the total number of parameters to be estimated. There were I = 18 sex x age subpopulations so that the maximum number of parameters which can be estimated (in the saturated model) is (3 -1) x 18 = 36. The degrees of freedom (df ) were thus 36 -19 = 17.
The H-TM to start with was as in the S-TM for location parameters (3. With respect to dispersion parameters 6, the model was an additive one, with sex (S * : one contrast) and age of dam (A * : five contrasts) so that r(P) = 6 ( Q = 1 in male calves and < 2.0 year old dams). Thus, the total number of parameters was 19 + 6 = 25 and, the df were equal to 36 -25 = 11.

RESULTS
All factors considered in the S-TM were significant (P < 0.01), especially the sex by age of dam interactions (except the first one, as shown in table II). Hence, the model cannot be simplified further. This means that differences between sexes were not constant across age of dam subclasses, contrary to results of a previous study in Simmental also obtained with a fixed S-TM (Quaas et al, 1988). Differences in liability between male and female calves decreased with age of dam. However the S-TM did not fit well to the data, as the Pearson statistic (or deviance) was X 2 = 419 on 17 degrees of freedom, resulting in a nil P-value. An examination of the Pearson residuals indicated that the S-TM leads to an underestimation of the probability of difficult calving (scores 3 + 4) in cows older than 3 years of age, and to an overestimation in younger cows.
As shown in table II, fitting the H-TM decreased the X 2 and deviance by a factor of 20 and led to a satisfactory fit. The significance of many interactions vanished, and this was reflected in the LRS (P < 0.088) for the hypothesis of no S x A interaction in the most parameterized model. Several models were tried and tested as shown in table III. The scaling parameters depended on the age of the dam, the effect of sex being not significant (P < 0.163). Relative to the baseline population, the standard deviation increased by a factor of about 1.05, 1.15, 1.25, 1.40 and 1.50 for cows of 2.0-2.5, 2.5-3.0, 3.0-4.0, 4.0-8.0, and > 8.0 years of age at calving respectively (table IV).
The H-TM made differences between sex liabilities across ages of dam practically constant as the interaction effects were negligible relative to the main effects. The difference between male and female calves was about 0.5. Eventually, a model including sex plus age of dam (without interaction) for the location structure and only age of dam for the scaling part seemed to account well for the variation in the Logistic heteroskedastic models have been considered by McCullagh (1980) and Derquenne (1995). Formulae are given in Appendix 1 to deal with this distribution.
When the Simmental data are analyzed with the logistic (table VI), the homoskedastic model is also rejected although the fit is not as poor as with the data. Wald's and deviance statistics were in very good agreement in that respect, with P values of 0.08 and 0.16 respectively, for the SA interaction. It should be observed that this heteroskedastic model has even fewer parameters (16) than the two-way S-TM considered initially (19 parameters). In spite of this, the Pearson's chi-square (and also the deviance) was reduced from about 419 (table II) to 32  (table V) with a P-value of 0.04. This fit is remarkable for this large data set (N = 363859), where one would expect many models to be rejected.
Although the H-TM may have captured some extra hidden variation due to ignoring random effects, it is unlikely that the poor fit of the S-TM can be attributed solely to the overdispersion phenomenon resulting from ignoring genetic and other clustering effects. The large value of the ratio of the observed X 2 to its expected value (419/17 = 25) suggests that the dependency of the probabilities 77,! with respect to sex of calf and age of dam is not described properly by a model with constant variance. Whether the poor fit of the S-TM is the result of ignoring random effects, heterogeneous variance, or both, require further study, perhaps simulation.
These results suggest that in beef cattle breeding the goodness of fit of a constant variance threshold model for calving ease can be improved by incorporating scale effects for age of dam either as discrete classes, as in this study, or alternatively Qi as a polynomial regression of log a i on age.

Other distributional assumptions
The underlying distribution was supposed to be normal which is a standard assumption of threshold models in a genetic context (Gianola, 1982). However, other distributions might have been considered for modeling 77! in !3!. A classical choice, especially in epidemiology, would be the logistic distribution with mean 7 7i and variance 1f 2 a'f /3 (Collett, 1991, page 93), where probit. Interestingly, there is not much difference between the complete (S+A+SA) and the additive model (S + A), the interaction (SA) being non-significant (P = 0.30). Taking into account the variation in variance in addition to that explained by the additive model on location parameters does not improve the fit greatly. In that respect, the main source of variation turned out to be sex rather than age of dam.
Other options include the t-distribution (Albert and Chib, 1993), the Edgeworth series distribution (Prentice, 1976) and other non-normal classes of distribution functions (Singh, 1987). In fact, the t distribution tv(!2, s 2 ) with spread parameter 8 2 and v degrees of freedom is the marginal distribution of a mixture of a normal distribution !V(!,<7?), with Q2 randomly varying according to a scaled inverted X 2 (v,s 2 ) distribution (Zellner, 1976). Therefore, a threshold model based on such a t distribution is embedded in our procedure by taking a one-way random model for Incr?, ie, ln<r? = In 8 2 + a i with the density function p(a i lv) of the random variable a i as presented in Foulley and (auaas (1995, formulae 21 and 22); see also Gianola and Sorensen (1996) for a specific study of the threshold model based on the t-distribution in animal breeding.

Relationships with variable thresholds
Conceptually, heterogeneity of the a §s is viewed here in the same way as in Gaussian linear models since it applies to an underlying random variable that is normally distributed. However, the underlying variables are not observable, and the corresponding real line includes cutoff points, the thresholds, that make the outcomes discrete. It is of interest to address the question of how changes in dispersion can be interpreted with respect to the threshold concept. Let us illustrate this by a simple example involving J = 3 categories, and a oneway classification model (i = 1, 2, ... , I) as, for example, sex of calf in the Simmental breed. We will assume that the origin is at the first threshold, and that the unit of measurement is the standard deviation within males ( QM = 1). The difference between the first and second threshold values in males is expressible as: where II M1 , lI M2 are the probabilities of response in the first and second categories, respectively, for male calves. A similar expression is obtained for female calves (F), so This is precisely the ratio of the difference between thresholds 1 and 2 that would be obtained when evaluated separately in each sex. Thus Formulae [19] and [20] are analogous to expressions given by Wright (1934b) (the reciprocal of the distance between the thresholds on this scale gives the standard deviation on the postulated scale on which the thresholds are separated by a unit distance, p 545) and Falconer (1989, formula 18.5, p 307) except that these authors set to unity the difference between thresholds in the baseline population, rather than the standard deviation, which we find more appealing conceptually.
In the case of the Simmental data shown in table I, applying formulae [19] and [20] using observed frequencies of responses gives: If more than three categories are observed, this formula also holds for the differences T3 -72 , T4 -73 ,..., 7J -1 -TJ -2 , so that the ratio between standard deviations in subpopulation (i) and a reference population (R) can be expressed as: which involves (J -2) algebraically independent equalities.
In the case of three categories and a single classification, the saturated model has 21 parameters ( T2 -T i , J.L 1 , U2 -. -;/!7) <!2/<7'i) ---ai!a1, ... , arla1). Numerical values of a i! a 1 computed from [20] are also ML estimates (eg, â 2/ ¡ h = 0.973). Formula [21] indicates that there is a link between H-TM and models with variable thresholds (Terza, 1985). As compared to these, the main features of the H-TM are: i) a multiplicative model on ratios of standard deviations or differences between thresholds, rather than a linear model on such differences; ii) a lower dimensional parameterization due to the proportionality assumption made in [18] rather than a category-specific parameterization, ie: where 6 j is the vector of unknowns pertaining to the difference (T! -Ti ). For J = 3, the two models generate the same number of parameters but they are still different vis-a-vis to (i).

Further extensions
The H-TM opens new perspectives for the analysis of ordinal responses. Interesting extensions may include: i) implementing other inference approaches for mixed models such as Gilmour's procedure based on quasi-likelihood, or a fully Bayesian analysis of parameters using Monte-Carlo Markov-Chain methods along the lines of Sorensen et al (1995); ii) assessing the potential increase in response to selection by selecting on estimated breeding values calculated from an H-TM versus an S-TM; iii) incorporating a mixed linear model on log-variances as described in San Cristobal et al (1993) and Foulley and Quaas (1995) for Gaussian observations; iv) carrying out a joint analysis of continuous and ordered polychotomous traits as already proposed for the S-TM by Foulley et al (1983), Janss and Foulley (1993) and Hoeschele et al (1995).
Further research is also needed at the theoretical level to look at the sampling properties of estimators based on mis-specified models. For instance, one may be interested in the asymptotic properties of the ML estimator of ( T ', 0', 6'l' derived under the assumption of independence of the y i ,'s when this hypothesis does not hold. This problem has been discussed in general by White (1982), and it may be conjectured from the results of Liang et al (1992) that the ML estimators of these parameters remain consistent. It might also be worthwhile to assess the effect of departures from independence on testing procedures. The generalized chi-square procedure for goodness of fit derived by McLaren et al (1994) might be useful in that respect for analyses based on large samples.