A note on QTL detecting for censored traits

Most existing statistical methods for mapping quantitative trait loci (QTL) assume that the phenotype follows a normal distribution and that it is fully observed. However, some phenotypes have skewed distributions and may be censored. This note proposes a simple and efficient approach to QTL detecting for censored traits with the Cox PH model without estimating the baseline hazard function which is "nuisance".


INTRODUCTION
The standard approach for mapping the quantitative trait loci (QTL) contributing to variation in a quantitative trait makes use of the assumption that the phenotype is normally distributed and fully observed [7,8,11,18]. In recent years, many authors have proposed nonparametric or semiparametric methods to solve the problem of model misspecification caused by the assumption of normality [6,10,19]. In addition, assumptions of standard approach are likely to be false when the phenotype pertains to the survival time or failure time and the failure time is often subject to censoring. The incompleteness of the trait values presents a major challenge in the application of the interval-mapping approach [11]. Symons et al. [17] computed LOD scores under the Cox proportional hazards (PH) model using a variant of the EM algorithm using Monte Carlo simulation to make the computations tractable as described by Lipsitz and Ibrahim [13]. The EM algorithm incorporates all possible values of missing covariates according to the appropriate probability distributions. This method is computationally intensive and it needs estimating a "nuisance" parameter, that is, baseline hazard function λ 0 (·) of the Cox PH model. Broman [2] considered a cure-rate model in which the mice that are alive at the end of the study are regarded as cured and in which the survival times among the deaths follow a log-normal distribution. This is a special model which can only deal with the situations in which the potential censoring times are equal among all study subjects. Diao et al. [4] formulated the effects of QTL on the failure time through a parametric PH model with a Weibull baseline hazard function λ 0 (t) = γ 1 γ 2 t γ 2 −1 , γ 1 > 0, γ 2 > 0. Since it is a parametric model, it still has the problem of mis-specification. In a recent study, Moreno et al. [15] proposed two new QTL detection approaches which allow the consideration of censored data. One is similar to Diao et al. [4] based on Weibull distribution and the other one is based on Cox proportional hazards model.
Since the primary reason for using the Cox proportional hazards model and his partial likelihood technique is avoidance of the "nuisance" baseline hazard function λ 0 (·), in this article, we provide a simple interval-mapping method to censor traits without estimating λ 0 (·). In brief, we formulate the effects of QTL on the failure time through the PH model and treat unobserved genotypes of QTL as missing covariates. Then we develop a procedure based on partial likelihood for detecting QTL and show how to assess genome-wide statistical significance. In comparison to Symons et al. [17], Broman [2] and Diao et al. [4], our approach has some advantages. First, we used the Cox semiparametric PH model which is most popular for survival analysis. Second, we avoided estimating a baseline hazard function which is very complicated to be estimated. Third, the test statistic we used was actually the well-known log rank statistic and it is the locally most powerful test. Furthermore, because there was no iteration in calculating the test statistic, the method proposed in the following was computationally efficient.

MAIN RESULTS
We consider populations derived from a cross between two parental inbred lines P 1 and P 2 . There are two kinds of basic populations, F 2 and backcross. In this note we are only concerned with F 2 . Consider n progenies from an F 2 population. Let T i denote the quantitative trait from the ith subject, which pertains to a failure time that can potentially be censored and thus incompletely observed. Let C i be the censoring time for the ith subject. The observation The failure time T i is fully observed only when it is uncensored, i.e., δ i = 1.
Suppose that we have data on a set of genetic markers with a known genetic map. Let M i denote the multipoint marker genotype data for the ith subject. We consider a putative QTL locus d in the genome and define G i = −1, 0 or 1 according to whether the ith subject has genotype qq, Qq or QQ, respectively, at the QTL. We specify a proportional hazards model for the effects of the QTL genotypes on the failure time such that, conditional on the QTL genotype G i , the hazard function of T i takes the form where β 1 and β 2 pertain to the additive and dominant effects of QTL and λ 0 (t) is an unknown baseline hazard function. Diao et al. [4] considered a Weibull hazard function λ 0 (t) = γ 1 γ 2 t γ 2 −1 , γ 1 > 0, γ 2 > 0. In this article, we add no condition on the form of λ 0 (t).
Because G i 's are missing covariates in model (1), we consider conditional hazard function given M i , that is which is also a multiplicative hazards model. Denote the conditional expectation in (2) by a i (t, β), where β = (β 1 , β 2 ) , then we have where Λ 0 (t) = t 0 λ 0 (s)ds is the cumulative baseline hazard function. Suppose now that t 1 < · · · < t k are the ordered distinct failures in the sample and R(t j ) and D(t j ) denote the risk set just prior to t j and the set of subjects failing at t j , respectively, j = 1, · · · , k. If we use an approximation to accommodate tied failure times [1], like Prentice [16], the partial likelihood function is given as where m j is the number of failures at t j and n j is the size of the risk set R(t j ), j = 1, · · · , k. The score function of (4) is where b i (t, β) = ∂a i (t, β)/∂β, i = 1, · · · , n.
In order to test the null hypothesis of no QTL effect, i.e., H 0 : β = 0, we used score test procedure. To do so, let . These values, x 1i 's and x 2i 's, can be found, for example, in [14]. It is easy to verify that a i (t, 0) = 1 and b i (t, 0) = x i , for any i = 1, · · · , n, then the score statistic (5) at β = 0 can be written as We were not excited to find that the above statistic does not depend on the nuisance parameter λ 0 (·). Since the primary reason for using the partial likelihood technique is avoidance of λ 0 (·), the use of statistic (6) will lead to a simple and efficient mapping approach. By some arguments based on the counting process (App. A), we can get an estimate of variance of score statistics (6), where x ⊗2 = xx for x a vector. In fact, similar to Prentice [16], a finite population variance argument applied to h∈D(t j ) x h , given R(t j ) for each j = 1, · · · , k leads tõ where factors n j −m j n j −1 due to the tied data approximation used in (4). Under H 0 : β = 0, the statistic w = s ũ −1 s, (9) will have an asymptotic χ 2 2 distribution. Note that the score s and variance u or u all depend on the locus d of QTL through the dependence of x i 's on d. In the sequel, we include d in the expressions to emphasize their dependence on d, Thus the test statistic curve {w(d), d ≥ 0} for each chromosome can be drawn as in the case of standard interval mapping. For each chromosome, the position with the largest value of the curve is declared to be the QTL location provided that the value exceeds a certain threshold level. In the next section, we will show how to determine the threshold level.

THRESHOLD VALUES
When searching the entire chromosome or whole genome for QTL, one should select a threshold level such that the probability that the test statistic exceeds this level anywhere in the genome equals the desired falsepositive rate. In Appendix B we show that in a dense-map case, the process {ũ −1/2 (d)s(d), d ≥ 0} is asymptotically equivalent to a two dimensional Ornstein-Uhlenbeck process under a null hypothesis. Thus we can get analytical approximations of thresholds which are analogous to those of Lander and Botstein [11], Dupuis and Siegmund [5], etc.
Since the analytical results are based on a number of assumptions that are not likely to be met in practice, we can simply use a permutation test, as described by Churchill and Doerge [3], to obtain an empirical threshold value. In addition, this section concludes with a resampling procedure similar to Diao et al. [4] and Zou et al. [20] by which we approximate the null distribution of sup d w(d) and then get the threshold value of our interval mapping method. First, we generate Z i , i = 1, · · · , n, which are i.i.d. standard normal random variables. Then define and In Appendix C, we can show that the unconditional distribution of 1 √ n sup d w(d) can be approximated by the conditional distribution of 1 √ n w * . To this end, we generate the standard normal random sample (Z 1 , · · · , Z n ) a large number of times. For each sample, we calculate w * . The 100(1−α)th percentile of the simulated w * 's is the threshold value for the genome-wide significance level of α.

A SIMULATION STUDY
To investigate the proposed method in practical situations, we performed a small simulation study. Since the proposed score test is locally most powerful, we did not have to evaluate its power. In this section, we only examined the performance of the proposed interval-mapping method for locating the QTL for two different settings. The first setting was the same as the one in Diao [4] in order to compare their method, where the failure times were generated from the Weibull distribution with baseline hazard function λ 0 (t) = γ 1 γ 2 t γ 2 −1 with γ 1 = 0.01 and γ 2 = 2. In the second setting, the failure time were generated from the log-normal distribution, that is, log T ∼ N(0, 1) under the null hypothesis. In both settings, the censoring times were generated from the uniform (0, τ) distribution, where τ was chosen to yield ∼ 30% censored observations. Assuming no crossover interference, we generated the marker data from the Markov chain. The interval-mapping step size was set at 1 cM. We considered a chromosome with a total length of 100 cM. Genetic maps with different numbers of equally spaced markers were simulated. In both settings, one QTL located at 33 cM was simulated with β 1 = 0.5 and β 2 = 0.25. We generated 200 replicates of 300 observations from an F 2 population. The results are summarized in Table I where the unit of means and standard errors is cM.
In Table I, the results from setting 1 are very similar to those in Table 2 of Diao et al. [4]. In both settings, there is little bias for the estimation of the QTL location. In addition, we also found that the dense-maker map makes a small contribution to the accuracy of the confidence intervals of the QTL location.
Define counting processes {N i (t) = Δ i I(T i ≤ t), 0 ≤ t}, at-risk processes {Y i (t) = I(T i ≥ t, C t ≥ t), 0 ≤ t}, i = 1 · · · , n and their generating filtration Then corresponding to this filtration, for each i = 1, · · · , n, the process is a martingale. Based on these martingales, the score of (5) can be written as This is equivalent to (6). Its predictable variation process is and an estimate of it at β = 0 is where N.(t) = n i=1 N i (t). Since it is easy to verify a i (t, 0) = 1 and b i (t, 0) = x i , for i = 1, · · · , n, the above estimate is equivalent to (7).

APPENDIX B: ANALYTICAL APPROXIMATIONS OF THRESHOLDS
Let g i (d) = (G i (d), 1 − |G i (d)|) , i = 1, · · · , n, and g d = g 1 , where G i (d) is the genotype at position d. Let d 1 and d 2 denote two points on the chromosome, and p be the recombination fraction corresponding to the genetic distance |d 1 − d 2 |. It is easy to see that the correlation of g(d 1 ) and g(d 2 ) is Corr(g(d 1 ), g(d 2 )) = 1 − 2p 0 assuming Haldane's map function.