Model for fitting longitudinal traits subject to threshold response applied to genetic evaluation for heat tolerance

A semi-parametric non-linear longitudinal hierarchical model is presented. The model assumes that individual variation exists both in the degree of the linear change of performance (slope) beyond a particular threshold of the independent variable scale and in the magnitude of the threshold itself; these individual variations are attributed to genetic and environmental components. During implementation via a Bayesian MCMC approach, threshold levels were sampled using a Metropolis step because their fully conditional posterior distributions do not have a closed form. The model was tested by simulation following designs similar to previous studies on genetics of heat stress. Under all simulation scenarios, posterior means of the parameters of interest were close to their true values, and the true values were always included in the uncertainty regions, suggesting an absence of bias. The proposed models provide flexible tools for studying genotype by environment interaction as well as for fitting other longitudinal traits subject to abrupt changes in performance at particular points on the independent variable scale.


Introduction
Reaction norm models have been proposed as an alternative for fitting genotype by environment interactions (GxE) in evolutionary biology and animal breeding [1]. In reaction norm models, the environment is often described by a continuous variable, and the phenotypes are partially explained by the regression of the genotypic values on the environmental values. When an environmental variable is observed on a continuous scale (e.g., temperature), a direct one-to-one relationship is expected between the observed scale and the environmental values. Consequently, the reaction norm model can be fitted by regressing the genotypic values on the observed environmental scale [2,3]. When the observed environmental scale is not continuous (e.g., herd classes), the genotypic values can be regressed on the effect of the categorical variable defining the different environments using, for example, least squares means of the class effects [4] or by inferring the environmental values jointly with the remaining set of parameters in the model [5].
In animal breeding applications of reaction norm models, it was assumed that both the mean and the variances are either continuous, monotone functions of the environmental values [4,6] or that they are such only when the environmental values exceed a certain threshold [2,7,3]. In past studies involving thresholds, the same threshold was assumed for all animals, and it was estimated based on the quality of the fit of the average performances as a function of environmental values.
The objective of this study was to present a Bayesian hierarchical model for fitting a longitudinal trait showing an abrupt linear change at some value of the independent variable. Simulations were inspired by reaction norm models, and the procedure postulates that the environmental variable has no effect until it exceeds a certain unknown value that is particular to each individual with data. Furthermore, the model allows for partitioning individual variability in the threshold into genetic and environmental components.

Model and Prior specification
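A general description of hierarchical Bayesian modelling can be found in [8]. Here the first stage of the hierarchy describes the data generating process, that is, the conditional distribution of the observed phenotypes given the model parameters. The following model was assumed:
$$y_{ijk} = CG_k + \alpha_j + \nu_j \, f(THI_{ij}, \tau_{0,j}) + \varepsilon_{ijk}, \qquad f(THI_{ij}, \tau_{0,j}) = \begin{cases} 0 & \text{if } THI_{ij} \le \tau_{0,j} \\ THI_{ij} - \tau_{0,j} & \text{if } THI_{ij} > \tau_{0,j} \end{cases}$$
where $y_{ijk}$ is the $i$th observation measured on animal $j$ in contemporary group $k$ ($CG_k$), and $THI_{ij}$ is the temperature and humidity index [2,7] associated with the $i$th observation of animal $j$. The random variables $\alpha_j$, $\nu_j$ and $\tau_{0,j}$ associated with animal $j$ represent, respectively, an intercept ($\alpha_j$), or individual value in the absence of heat stress; a slope ($\nu_j$), or change in performance per unit of change in the THI index above the individual threshold; and the individual threshold itself ($\tau_{0,j}$). In this study, the heat load function $f$ [7] was defined in a way that was similar to previous studies on genetics of instantaneous heat stress on daily milk production [2]. Finally, $\varepsilon_{ijk}$ is a random homoskedastic error term associated with each particular observation.
The data were assumed to be normally distributed as follows:
$$y_{ijk} \mid CG_k, \alpha_j, \nu_j, \tau_{0,j}, \sigma^2_\varepsilon \sim N\left(CG_k + \alpha_j + \nu_j \, f(THI_{ij}, \tau_{0,j}),\; \sigma^2_\varepsilon\right)$$
The second stage of the hierarchy consisted of specifying prior distributions for all parameters in the first stage:
$$CG_k \sim U, \quad k = 1, \dots, K; \qquad \sigma^2_\varepsilon \sim U$$
where U indicates the uniform distribution and K is the number of levels of the contemporary group effect.
The underlying variables associated with the $j$th animal, $\alpha_j$, $\nu_j$ and $\tau_{0,j}$, were assumed to follow a multivariate normal distribution:
$$\begin{pmatrix} \boldsymbol{\alpha} \\ \boldsymbol{\nu} \\ \boldsymbol{\tau}_0 \end{pmatrix} \Bigg| \; \boldsymbol{\beta}, \mathbf{a}, \mathbf{R}_0 \sim MVN\left( \begin{pmatrix} \mathbf{X}_\alpha \boldsymbol{\beta}_\alpha + \mathbf{Z}_\alpha \mathbf{a}_\alpha \\ \mathbf{X}_\nu \boldsymbol{\beta}_\nu + \mathbf{Z}_\nu \mathbf{a}_\nu \\ \mathbf{X}_{\tau_0} \boldsymbol{\beta}_{\tau_0} + \mathbf{Z}_{\tau_0} \mathbf{a}_{\tau_0} \end{pmatrix},\; \mathbf{R}_0 \otimes \mathbf{I} \right)$$
where $\boldsymbol{\alpha}$, $\boldsymbol{\nu}$ and $\boldsymbol{\tau}_0$ are vectors including the scalar parameters of the individuals ($\alpha_j$, $\nu_j$ and $\tau_{0,j}$), and $\mathbf{X}$ and $\mathbf{Z}$ denote the corresponding incidence matrices.
The parameters of a given individual were considered to be conditionally independent of those of other individuals and affected at their mean level by systematic ($\boldsymbol{\beta}_\alpha$, $\boldsymbol{\beta}_\nu$ and $\boldsymbol{\beta}_{\tau_0}$) and genetic effects ($\mathbf{a}_\alpha$, $\mathbf{a}_\nu$ and $\mathbf{a}_{\tau_0}$); the residual (co)variance matrix between underlying variables was $\mathbf{R}_0$, which is equivalent to a (co)variance matrix between permanent environmental effects on the observed measures scale.
In a third hierarchical stage, prior distributions for systematic and genetic effects and the residual (co)variance matrix between underlying variables were defined. Systematic effects were considered to be uniformly distributed, and genetic effects were assumed to follow a multivariate normal distribution according to the genetic infinitesimal model [9]:
$$\begin{pmatrix} \mathbf{a}_\alpha \\ \mathbf{a}_\nu \\ \mathbf{a}_{\tau_0} \end{pmatrix} \Bigg| \; \mathbf{G}_0, \mathbf{A} \sim MVN\left(\mathbf{0},\; \mathbf{G}_0 \otimes \mathbf{A}\right)$$
where $\mathbf{G}_0$ is the (co)variance matrix between the additive genetic effects for the underlying variables and $\mathbf{A}$ is the additive relationship matrix. The residual (co)variance matrix was assumed to follow a uniform distribution.
In the fourth and last hierarchical stage, a prior distribution was assigned to the genetic (co)variance matrix for the underlying variables. A uniform distribution was assumed, as in the case of the residual (co)variance matrix.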
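As a minimal sketch of the first-stage mean, the Python functions below encode a heat load that is zero up to an individual threshold and linear above it. The function names are illustrative and are not taken from the original study; the numeric values reuse the simulation means reported in the Data section (overall mean 90, threshold 19, slope -0.5), and an individual intercept deviation of zero is assumed.

```python
def heat_load(thi, threshold):
    """Heat load: zero up to the threshold, linear in the excess above it."""
    return max(0.0, thi - threshold)


def expected_record(cg, intercept, slope, threshold, thi):
    """First-stage mean: contemporary group + intercept + slope * heat load."""
    return cg + intercept + slope * heat_load(thi, threshold)


# Illustrative values only (intercept deviation of 0.0 is an assumption).
for thi in (15.0, 19.0, 23.0):
    print(thi, expected_record(90.0, 0.0, -0.5, 19.0, thi))
```

Below or at the threshold the expected record stays at 90.0; at a THI of 23, four units above the threshold, it drops to 88.0.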

Fully conditional posterior distributions
The fully conditional posterior distributions must be obtained in order to perform a Bayesian MCMC estimation procedure using the Gibbs sampler algorithm. After defining the joint posterior distribution as the product of the conditional likelihood and all the prior distributions [8], the terms involving the parameter of interest in the joint posterior distribution were retained. For the model described, all the fully conditional posterior distributions are exactly the same as those described for a hierarchical model assuming intercept and linear terms [10], except those involving the individual thresholds. For all the position parameters, both in the first and second hierarchical stages, the fully conditional posterior densities were proportional to normal distributions; the fully conditional distribution for the residual variance in the first stage followed a scaled inverted chi-squared distribution, and the genetic and residual (co)variance matrices in the third and second stages, respectively, followed inverted Wishart distributions.
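For the thresholds, the fully conditional posterior distribution had the following form:
$$p(\tau_{0,j} \mid \mathbf{y}, \text{all other parameters}) \propto p(\mathbf{y}_j \mid CG, \alpha_j, \nu_j, \tau_{0,j}, \sigma^2_\varepsilon) \; p(\tau_{0,j} \mid \alpha_j, \nu_j, \boldsymbol{\beta}, \mathbf{a}, \mathbf{R}_0)$$
which can be explicitly expressed as:
$$\exp\left\{ -\frac{1}{2\sigma^2_\varepsilon} \sum_{i \in J} \left[ y_{ijk} - CG_k - \alpha_j - \nu_j \, f(THI_{ij}, \tau_{0,j}) \right]^2 \right\} \times \exp\left\{ -\frac{1}{2} \left[ r^{3,3} e_{\tau_0,j}^2 + 2 e_{\tau_0,j} \left( r^{1,3} e_{\alpha,j} + r^{2,3} e_{\nu,j} \right) \right] \right\}$$
Here $\alpha_j$, $\nu_j$ and $\tau_{0,j}$ are the intercept, slope and threshold of animal $j$; $f$ is the heat load function, equal to zero up to the threshold and to $THI_{ij} - \tau_{0,j}$ above it; and $e_{\alpha,j}$, $e_{\nu,j}$ and $e_{\tau_0,j}$ are the deviations of each underlying variable from its mean (systematic plus additive genetic effects). The first term comes from the likelihood; J refers to the subset of records belonging to animal j. The second term comes from the prior (second hierarchical stage); note that the relationships between animal j and the other individuals in the population are taken into account through the given values of the additive genetic effects. In this second factor, the scalars $r^{i,j}$ refer to the relevant elements of the inverse of $\mathbf{R}_0$, the residual (co)variance matrix in the second hierarchical stage, with rows and columns ordered as intercept, slope and threshold. This fully conditional posterior distribution does not have a known closed form; thus a Metropolis step [11] was used to sample from it.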
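The Metropolis update for a single animal's threshold can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the second-stage prior is passed in as the conditional normal of the threshold given the animal's intercept and slope (arguments `prior_mean` and `prior_var`, which would be derived from the inverse of R0), and all function names and numeric settings are hypothetical.

```python
import math
import random


def log_conditional(threshold, records, cg, intercept, slope, sigma2_e,
                    prior_mean, prior_var):
    """Log conditional density of one threshold, up to an additive constant.

    records: list of (y, thi) pairs for the animal.
    prior_mean, prior_var: assumed conditional normal prior of the threshold.
    """
    sse = 0.0
    for y, thi in records:
        fitted = cg + intercept + slope * max(0.0, thi - threshold)
        sse += (y - fitted) ** 2
    log_lik = -0.5 * sse / sigma2_e
    log_prior = -0.5 * (threshold - prior_mean) ** 2 / prior_var
    return log_lik + log_prior


def metropolis_step(current, proposal_sd, rng, **kwargs):
    """One random-walk Metropolis update; returns (new_value, accepted)."""
    candidate = rng.gauss(current, proposal_sd)  # proposal centred on current value
    log_ratio = log_conditional(candidate, **kwargs) - log_conditional(current, **kwargs)
    if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
        return candidate, True
    return current, False


# Toy data for one animal (all values hypothetical).
rng = random.Random(1)
records = []
for _ in range(20):
    thi = rng.gauss(18.0, math.sqrt(10.0))
    y = 90.0 - 0.5 * max(0.0, thi - 19.0) + rng.gauss(0.0, 1.0)
    records.append((y, thi))

settings = dict(records=records, cg=90.0, intercept=0.0, slope=-0.5,
                sigma2_e=1.0, prior_mean=19.0, prior_var=4.0)
threshold, accepted, chain = 19.0, 0, []
for _ in range(2000):
    threshold, ok = metropolis_step(threshold, 1.0, rng, **settings)
    accepted += ok
    chain.append(threshold)
print("acceptance rate:", accepted / len(chain))
```

Because the normal proposal is symmetric, the acceptance ratio involves only the ratio of conditional densities, which is why no proposal-density correction appears in `metropolis_step`.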
In the model presented, the definitions of the genetic and phenotypic variances in a given environment are slightly more difficult than in the standard reaction norm models because a non-linear function of random correlated variables is involved. Thus, a Monte Carlo approximation of the phenotypic variance was determined for a particular value of THI during the measurement day. For example, in a particular environment (THI value) this quantity was calculated in the r th round of the Gibbs sampler: where n is the number of records, and , with expected value , is a vector of size n with typical elements defined as below: In this expression and are the sampled values for the additive genetic effects for the animal j during the r th iteration; and are random deviates where N is the number of animals in the pedigree; A -1 is the inverse of the additive relationship matrix; is a vector of overall additive genetic effects sampled during the iteration r; and is the expected value of the random variable . The j th element of the vector was computed in each round of the Gibbs sampler using this expression: where and have the same meaning as those previously described in the equation for .
Note that non-zero expected values are considered in these expressions because the terms involved are non-linear functions of random correlated variables, and thus their expected values are non-zero [12]. Also note that the relationships between records were not considered when computing the phenotypic variance because of the added complexity.
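The point about non-zero expected values can be checked numerically with a small Monte Carlo sketch. It does not reproduce the variance formulas of the study; it only uses two standardized, zero-mean normal variables (a hypothetical slope deviation `u` and threshold deviation `v` with correlation `rho`) and shows that the product of `u` with the truncated term `max(0, cutoff - v)` has a non-zero mean when the variables are correlated.

```python
import math
import random


def mean_nonlinear_term(rho, cutoff, n_draws, rng):
    """Monte Carlo mean of u * max(0, cutoff - v) for standard normal u and v
    with correlation rho (both variables have zero mean)."""
    total = 0.0
    for _ in range(n_draws):
        v = rng.gauss(0.0, 1.0)
        # Build u so that corr(u, v) = rho and var(u) = 1.
        u = rho * v + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        total += u * max(0.0, cutoff - v)
    return total / n_draws


rng = random.Random(7)
estimate = mean_nonlinear_term(rho=0.7, cutoff=0.0, n_draws=100_000, rng=rng)
# With cutoff = 0 the exact expectation is -rho / 2, i.e. -0.35 here.
print(round(estimate, 2))
```

With `rho = 0` the same quantity has expectation zero, which is the sense in which the correlation between underlying variables drives the non-zero means.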
Based on these computed variance components, relevant genetic parameters and other genetic quantities can be easily defined for different environments (THI values). For example, heritability or expected genetic response to a selection index could be defined for different environmental values [13].

Data
Simulated data sets were used to investigate the performance of the Bayesian implementation of the model described above.
Different combinations of heritabilities and correlations for the underlying variables were investigated: low (0.1), medium (0.2) and high (0.5) heritabilities; and low (0.2, 0.3) and high (0.7, 0.9) correlations, in absolute value. In addition, two different data set designs were considered, approximately 20 (S20) and 10 (S10) records per animal. Thus, 12 different scenarios were investigated, and for each one ten replications were run.
For both data size scenarios the same genetic structure was considered but with different sizes. For S20 in the first generation, 40 males and 200 females were generated, and in the second generation, each sire was mated to five females, producing four full sibs from each mating. Thus, the entire population consisted of 1,040 animals. For S10 in the first generation, 80 males and 400 females were generated, and in the second generation, each sire was again mated to five females, producing four full sibs from each mating. In this case the entire population consisted of 2,080 animals. This genetic structure resembles populations of prolific species such as swine or rabbits.
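The population sizes quoted above can be verified with a short pedigree builder. This is a sketch that assumes each dam is mated to exactly one sire, which is consistent with the stated counts (40 x 5 = 200 and 80 x 5 = 400) but is not spelled out in the text; the function name and the tuple layout are illustrative.

```python
def build_pedigree(n_sires, n_dams, dams_per_sire=5, sibs_per_mating=4):
    """Return a list of (animal, sire, dam) tuples for the two-generation design."""
    if n_sires * dams_per_sire != n_dams:
        raise ValueError("design assumes every dam is mated to exactly one sire")
    # Founders: sires are numbered first, then dams; parents unknown.
    pedigree = [(i, None, None) for i in range(n_sires + n_dams)]
    next_id = n_sires + n_dams
    for sire in range(n_sires):
        for d in range(dams_per_sire):
            dam = n_sires + sire * dams_per_sire + d
            for _ in range(sibs_per_mating):
                pedigree.append((next_id, sire, dam))
                next_id += 1
    return pedigree


for label, sires, dams in (("S20", 40, 200), ("S10", 80, 400)):
    n_animals = len(build_pedigree(sires, dams))
    print(label, n_animals, round(21500 / n_animals, 1))
```

The last column divides the 21,500 simulated records by the number of animals, giving roughly 20.7 and 10.3 records per animal for S20 and S10, respectively.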
For both data structures 21,500 records were generated according to the described model and assigned to the total number of animals in the population. For generating records, only an overall mean (with a value of 90) was considered in the first hierarchical stage as the CG effect, and overall means for the threshold (19) and for the slope (-0.5) were the only systematic effects considered in the second hierarchical stage. THI values were generated by sampling from a normal distribution with mean 18.0 and variance 10.0, resembling the distribution of THI values in a temperate climate.
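A simplified version of the record-generating step is sketched below. The means (90, 19 and -0.5) and the THI distribution come from the text; everything else is an assumption made for illustration: individual deviations are drawn independently with hypothetical standard deviations (ignoring the genetic structure and the correlations between underlying variables), and records are assigned to animals uniformly at random.

```python
import math
import random


def simulate_records(n_animals, n_records, rng,
                     mean=90.0, mean_threshold=19.0, mean_slope=-0.5,
                     sd_intercept=1.0, sd_slope=0.1, sd_threshold=1.0,
                     sd_error=1.0):
    """Return a list of (animal, thi, y) tuples; all sd_* values are hypothetical."""
    animals = [(rng.gauss(0.0, sd_intercept),
                rng.gauss(mean_slope, sd_slope),
                rng.gauss(mean_threshold, sd_threshold))
               for _ in range(n_animals)]
    thi_sd = math.sqrt(10.0)  # random.gauss expects a standard deviation
    records = []
    for _ in range(n_records):
        j = rng.randrange(n_animals)  # assumed: uniform assignment of records
        intercept, slope, threshold = animals[j]
        thi = rng.gauss(18.0, thi_sd)
        y = (mean + intercept + slope * max(0.0, thi - threshold)
             + rng.gauss(0.0, sd_error))
        records.append((j, thi, y))
    return records


records = simulate_records(n_animals=1040, n_records=21500, rng=random.Random(3))
mean_thi = sum(thi for _, thi, _ in records) / len(records)
print(len(records), round(mean_thi, 1))
```

Note that the variance of 10.0 corresponds to a standard deviation of about 3.16 THI units, so a threshold mean of 19 lies only about a third of a standard deviation above the mean THI.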

Gibbs Sampler implementation
For each replication, a Gibbs Sampler algorithm was run for 100,000 rounds, of which the first 10,000 were discarded as a burn-in period; of the remaining rounds, one in ten was retained. The threshold level was sampled via a Metropolis step by using a proposal density that was normally distributed and centered on the previous value of the threshold. The variance of the proposal density was constant across animals. During the burn-in period, the value of the variance of the proposal was tuned for an average acceptance rate of around 0.5 under all the scenarios. In a post-Gibbs analysis, the convergence of the chains was assessed both by visual inspection of the trace plots for the most relevant parameters and through the Geweke test [14]. In addition, the effective sample size (ESS) was computed using the function effectiveSize() from the coda package in R [15].
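The chain bookkeeping described above can be sketched in a few lines. The thinning arithmetic follows the text directly; the tuning rule, in contrast, is a hypothetical example, since the study states only the target acceptance rate and not how the proposal variance was adjusted.

```python
def retained_rounds(total=100_000, burn_in=10_000, thin=10):
    """Zero-based indices of the rounds kept after burn-in and thinning."""
    return list(range(burn_in, total, thin))


def tune_proposal_sd(sd, acceptance_rate, target=0.5, factor=1.1):
    """Hypothetical burn-in rule: widen the proposal when accepting too often,
    narrow it when accepting too rarely."""
    if acceptance_rate > target:
        return sd * factor
    if acceptance_rate < target:
        return sd / factor
    return sd


kept = retained_rounds()
print(len(kept))                    # number of samples used for inference
print(tune_proposal_sd(1.0, 0.8))   # acceptance too high, so the sd grows
```

With the settings in the text, 9,000 samples per chain remain for computing posterior summaries.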

Results
Tables 1 and 2 show the results of the simulation averaged over 10 replications for the 12 investigated scenarios. For all the parameters and models, the true values were well within the uncertainty regions, which is an empirical indication of the unbiasedness of the inferential method. In addition, the posterior means for all the parameters were very close to their respective true values.
As expected, inference efficiency, measured through the marginal posterior standard deviation averages across parameters in Tables 1 and 2 (except residual variance), was reduced as the correlations between underlying variables were reduced. In contrast, algorithm efficiency, measured through the ESS averages across parameters in Tables 1 and 2 (except residual variance), decreased as correlations increased. In both correlation scenarios, increasing heritability increased inference efficiency for genetic correlations but reduced efficiency for the estimation of heritabilities and environmental correlations. In general, the average algorithm efficiency increased with heritability, but some exceptions can be found, particularly under data structure S10. Figure 1 shows the marginal posterior distributions and trace plots for the overall mean of the threshold level obtained in one replication in the scenarios of high correlation and low, medium and high heritabilities when the data structure was S10. The reduction in quality of the chain as heritability decreases can be observed in Tables 1 and 2.
Patterns of heritability with change in the THI during the measurement day are shown in Figure 2; these plots are estimated from one replication in the scenarios of high correlations and all the cases of heritability with the S10 data structure. Relatively flat patterns were observed, with the 95% HPD region always covering the true pattern, which was computed using the approximate formulas previously described. Table 3 shows averages across replications of Pearson correlations between predicted and true breeding values for the underlying variables for the 12 investigated scenarios. The predictors were assumed to be the means of the marginal posterior distributions. The observed values of these correlations, i.e., accuracies, correspond well with the heritabilities and correlations used during the simulation.
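The accuracy measure described above, the Pearson correlation between posterior-mean predictions and true breeding values, can be sketched as follows. The toy samples and true values are hypothetical and serve only to show the computation; the function names are illustrative.

```python
import math


def posterior_means(samples):
    """samples[r][j] is the sampled breeding value of animal j in retained round r."""
    n_rounds = len(samples)
    return [sum(column) / n_rounds for column in zip(*samples)]


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical toy values: three retained rounds for four animals.
samples = [[0.9, 2.1, 2.8, 4.2],
           [1.1, 1.9, 3.2, 3.8],
           [1.0, 2.0, 3.0, 4.0]]
true_bv = [1.0, 2.0, 3.0, 4.0]
predicted = posterior_means(samples)
accuracy = pearson(predicted, true_bv)
print(round(accuracy, 3))
```

In this toy case the sampled values average out to the true values, so the accuracy is essentially 1; with real chains the correlation would be lower and would depend on heritability and the amount of data per animal.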

Discussion
The model presented in this study provides greater flexibility than traditional reaction norm models when the environmental variable is known, as it allows a semi-parametric form for the reaction norm function. This is a semi-parametric model in the sense that the point at which the linear change is assumed to start is defined by the data themselves. The forms of the functions before and after this point are defined parametrically a priori, i.e., constant before the change point and a linear function afterwards. To increase flexibility, higher order polynomials or spline functions could be fitted within each one of these two separate periods, with the advantage that within each one of the periods, the functions would remain linear in the parameters. The presented inferential procedure appeared to give unbiased estimates, as the uncertainty regions always covered the true values of the parameters.
Several alternative algorithms have been proposed for non-parametric or semi-parametric curve fitting. One of them is a Reversible Jump MCMC algorithm where the optimal number of change points (parameters in the model) is estimated [16]. The model presented in this study is a simplified version of this semi-parametric procedure, as the number of parameters is fixed a priori. However, the indicated study focuses on fitting averages along the independent variable trajectory; in our case we fit individual sources of variation throughout this trajectory. For this purpose and from a computational point of view, the proposed hierarchical structure is particularly suitable, since the dimension of the problem becomes greater than when fitting changes in the mean. By using this hierarchical structure, updating mixed model equations in each round of the Gibbs Sampler can be avoided; only the right hand side needs to be modified. In addition, this hierarchical structure jointly with the Bayesian estimation procedure allows for a more appropriate prior assumption that takes advantage of the family structure in the population. Other general procedures for finding change points in continuous functions are the so-called change point techniques. These approaches were previously used in animal breeding to find points of change when fitting heterogeneous residual variance in the analysis of test day milk records [17]. These approaches provide greater flexibility than the model presented because they allow for non-linear functions within each one of the defined regions. However, these techniques are more complex because of the nonlinearity, and the values of two successive functions at change points need to be constrained explicitly to be identical. Our parameterization can be considered a truncated power representation of a linear spline [18], and in these cases the aforementioned constraints are implicitly considered [19].
Like other previously proposed reaction norm models [2,7,3], the described model could be used for studies and evaluations of genetic tolerance to high heat. The model allows the identification of not only those individuals in the population that are less sensitive to temperature changes after a particular threshold, but also those that become heat stressed at higher values of temperature or THI. This individual variation can be partitioned into environmental and genetic components, both for the threshold and for the intensity of sensitivity to heat stress. This makes it possible to identify genetically superior individuals for a particular underlying variable of interest: intercept, slope, threshold, or some index involving these variables.
The load function used in this study is the same as that used for fitting the effect of instantaneous THI on milk production [2]. However, it is relatively straightforward to consider more complex functions, for example, those used for studying the cumulative effect of THI on carcass weight in pigs [7,3].
Figure 1. Marginal posterior distribution and trace plots for the overall mean of the threshold level in three different scenarios for S10. a) high correlation and high heritability, b) high correlation and medium heritability, c) high correlation and low heritability.
In the described model, the covariate (THI) is assumed to be known; however, a traditional reaction norm model could be fitted by predicting an unobserved environmental covariate from the contemporary groups. This extension can be implemented either in two steps as in Kolmodin et al. [4] or in a more complex manner as in Su et al. [5], by integrating out all the possible values of the contemporary group effects. In these models with unknown covariates, it could be equally reasonable to assume that no effect is observed on the phenotypic performance until some threshold on the environmental scale is reached, beyond which some kind of change in the performance could be expected.
The presented model was applied to study variability on the onset of heat stress tolerance on milk production in dairy cattle. In this study the population size was around 90,000 animals and over 300,000 test-day records were considered. For this data set 250,000 Gibbs iterations took approximately 5.0 CPU days.
Although the methodology presented has been illustrated by focusing on the genetics of heat stress tolerance, more applications could be considered. In particular, longitudinal traits showing a threshold response, i.e., traits with an abrupt change in the response beyond some point on the explanatory variable scale, could be fitted using the model presented.
Figure 2. Patterns of heritability with change in the THI in three different scenarios for S10. a) high correlation and high heritability, b) high correlation and medium heritability, c) high correlation and low heritability; the line represents the true pattern, points are the estimated values for the particular THI pattern and the segments represent 95% highest density regions. True parameter values can be found in Tables 1 and 2.

Conclusion
A model for fitting traits in which the response to an environmental variable is subject to an abrupt linear change was presented. The described statistical procedure performed satisfactorily under the simulated scenarios in estimating the model parameters. As an application example, the model could be useful for identifying animals with higher adaptation to environmental changes, to heat in particular. These animals will be characterized by a smaller phenotypic decline in the performance as well as a later onset of environmental stress. In addition, the proposed methodology can attribute the individual variation on these two expressions of tolerance to environmental stress to genetic and systematic components, which would be useful for the detection of genetically superior breeding animals to be used in selection.