Searching for phenotypic causal networks involving complex traits: an application to European quail
© Valente et al; licensee BioMed Central Ltd. 2011
Received: 20 May 2011
Accepted: 2 November 2011
Published: 2 November 2011
Structural equation models (SEM) are used to model multiple traits and the casual links among them. The number of different causal structures that can be used to fit a SEM is typically very large, even when only a few traits are studied. In recent applications of SEM in quantitative genetics mixed model settings, causal structures were pre-selected based on prior beliefs alone. Alternatively, there are algorithms that search for structures that are compatible with the joint distribution of the data. However, such a search cannot be performed directly on the joint distribution of the phenotypes since causal relationships are possibly masked by genetic covariances. In this context, the application of the Inductive Causation (IC) algorithm to the joint distribution of phenotypes conditional to unobservable genetic effects has been proposed.
Here, we applied this approach to five traits in European quail: birth weight (BW), weight at 35 days of age (W35), age at first egg (AFE), average egg weight from 77 to 110 days of age (AEW), and number of eggs laid in the same period (NE). We have focused the discussion on the challenges and difficulties resulting from applying this method to field data. Statistical decisions regarding partial correlations were based on different Highest Posterior Density (HPD) interval contents and models based on the selected causal structures were compared using the Deviance Information Criterion (DIC). In addition, we used temporal information to perform additional edge orienting, overriding the algorithm output when necessary.
As a result, the final causal structure consisted of two separated substructures: BW→AEW and W35→AFE→NE, where an arrow represents a direct effect. Comparison between a SEM with the selected structure and a Multiple Trait Animal Model using DIC indicated that the SEM is more plausible.
Coupling prior knowledge with the output provided by the IC algorithm allowed further learning regarding phenotypic causal structures when compared to standard mixed effects SEM applications.
Structural equation models or SEM [1, 2] are used to model multiple traits and functional links among them, which may be interpreted as causal relationships. These models were adapted for the context of quantitative genetics mixed models by , and henceforth applied and extended by a number of authors [4–11].
Fitting SEM requires choosing a causal structure a priori. This structure describes qualitatively the causal relationships among traits by determining the subset of traits that imposes causal influence on each phenotype studied. By fitting a SEM, it is possible then to infer the magnitude of each causal relationship pertaining to the causal structure, which is quantified by model parameters called structural coefficients. However, choosing the causal structure may be cumbersome, given the typically very large space of possible causal hypotheses, even when only a few traits are studied. The choice of causal structures for the aforementioned SEM applications that followed the work of  were performed on the basis of prior beliefs, resulting in poor exploration of structures spaces.
Methodologies such as the IC algorithm [12, 13] make it possible to search for recursive causal structures that are compatible with the joint probability distribution of the variables considered. Therefore, applying these methodologies allows the selection of causal structures without relying on prior knowledge alone. Nonetheless, such algorithms are constructed based on specific assumptions regarding the data, such as the causal sufficiency assumption (for more details, see [12, 14]). Under this assumption, the residuals of the SEM for which the causal structure will be chosen are regarded as independent between traits. This construction is necessary to establish the connection between the selected causal structures and the joint probability distribution under study, such that d-separations [12, 14] in causal structures among traits are reflected as null partial correlations. Under this scenario, the IC algorithm takes a correlation matrix as input and searches for causal structures that are capable of producing that matrix, with its conditional dependencies and independencies. However, multiple phenotypes may present unobserved correlated genetic effects which confound such search, as discussed by Valente et al. . When using mixed effects SEM to represent this scenario, this confounding may take place even if model residuals are regarded as independent. As an alternative, Valente et al.  proposed a methodology which couples Bayesian model fitting and the application of the IC algorithm to the joint distribution of phenotypes conditional on the genetic effects.
With the purpose of validating and illustrating their method, Valente et al.  applied it to simulated data based on different scenarios. Here, we present the first application of such methodology to a real data set, by exploring the space of causal structures among five productive and reproductive traits in European quail. The discussion is focused on the challenges and benefits resulting from applying this method to field data, as well as on proposing approaches to overcome such challenges.
Mean and standard deviation (SD) for each trait
Structural equation models
where Ψ = Ψ0⊗I n .
Recursive causal structure selection
Selection of causal structure was performed by following the methods presented by . As mentioned by these authors, there are algorithms that search for recursive causal structures (i.e. causal structures with no cycles or feedback relationships between traits) assuming that conditional independencies in the joint probability distribution of the studied variables mirror d-separations in the causal structure (for more details, see [12, 14–16]). One of such algorithms is the Inductive Causation (IC) algorithm, which is able to search, within typically vast causal structure spaces, for a class of minimal structures that are compatible with the conditional independencies carried by the joint distribution of the data. This class consists of statistically equivalent causal structures that impose the same set of stable conditional independencies in the joint distribution (i.e. they cannot be distinguished on the basis of data evidence) and may be represented by a partially oriented graph, i.e., a causal structure carrying directed and undirected edges, the latter representing causal connections with unspecified causal direction. The edges that are left undirected by the algorithm may present one direction or the other in different structures within the class, such that no direction results in causal cycles or further unshielded colliders (sub-structures consisting of unlinked vertices with a common child, such as y j → y j'' ← y j' , where j, j', and j'' are indexes indicating three different phenotypic traits, and y j → y j' indicates that y j directly affects y j' ). The IC algorithm, when applied to a set P of t phenotypic traits, can be described as follows:
Step 1. For each pair of phenotypic traits y j and y j' (j ≠ j' = 1, 2, ..., t) in P, search for a set of traits S jj' such that y j is independent of y j' given S jj' . If y j and y j' are dependent for every possible S jj' , connect y j and y j' with an undirected edge. This step returns an undirected graph U.
Step 2. For each pair of non-adjacent traits y j and y j' with a common adjacent trait y j'' in U (i.e., y j - y j'' - y j' ), search for a set S jj' containing y j'' such that y j is independent of y j' conditional on S jj' . If there is no such set, then add arrowheads pointing at y j'' (y j → y j'' ← y j' ). Otherwise, continue.
Step 3. In the partially oriented graph returned by the previous step, orient as many undirected edges as possible in such a way that it does not result in new unshielded colliders or in cycles.
An important point to observe regarding the study of causal structures among phenotypic traits is that even if the residual covariance matrix is considered as diagonal, which is a consequence of the causal sufficiency assumption, unobserved correlated genetic effects act as sources of confounding [15, 16]. Such feature damages the connection between causal structures and joint probabilities such that d-separations in the former are not expected to be reflected as conditional independencies in the latter. However, conditionally on the genetic effects, this connection is restored. Assessing this conditional probability distribution is possible since such effects can be 'controlled' based on a genetic distance matrix (e.g. a genetic relationship matrix). The conditional covariance matrix of y given u can be obtained by fitting a standard multiple trait animal model (MTAM, ) and obtaining the estimated residual covariance matrix, here represented by . In some systems, other factors (e.g. correlated maternal effects) may also impose confounding in the search, and in these cases they should also be incorporated in the MTAM from which will be taken as the algorithm's input. Using Bayesian data analysis with a Markov chain Monte Carlo (MCMC) implementation, the following approach was proposed by :
Step 1. Fit a MTAM and draw samples from the posterior distribution of .
Obtain the posterior distribution of residual partial correlation ρj, j'|S. These partial correlations are functions of . Therefore, samples from their posterior distribution can be obtained by computing the correlation at each sample drawn from the posterior distribution of .
Compute the highest posterior density (HPD) interval with some specified probability content for ρj, j'|S.
If the HPD interval contains 0, declare ρj, j'|Sas null. Otherwise, declare y j and y j' as conditionally dependent.
Step 3. Fit a SEM using the selected causal structure (or one member within the class of statistically equivalent structures returned by the IC algorithm) as the 'true' causal structure.
More details on causal structure search based on observational data are given by [12, 14]. Additionally, the approach proposed to select recursive causal structures in the quantitative genetics mixed model context is discussed by  and reviewed in .
Application of the IC algorithm involves performing a set of statistical decisions about declaring partial correlations as null or not. As the posterior distribution of these parameters becomes flatter, the statistical decisions get poorer, i.e. errors become more likely. In this scenario, using a high content HPD interval (such as 95%) protects against declaring a null correlation as non-null, but the algorithm becomes more prone to declaring non-null correlations as null. However, these two types of errors are equally important when exploring causal structure spaces , and therefore, in scenarios where posterior distributions of partial correlations are not sharp, results may be better when decisions are made on the basis of HPD intervals with lower content. In this article we applied several HPD content magnitudes (70, 75, 80, 85, 90, and 95%), and compared the final causal structures obtained. This approach may indicate the edges and the structures that are more stable to changes in the magnitude of HPD contents used for the statistical decisions.
Bayesian inference and fully recursive model
where Λ fr and Ψ fr are, respectively, a matrix of structural coefficients and a diagonal residual covariance matrix, both associated with a fully recursive model. Such approach allowed all the models studied in this article to be fitted by using the same program.
The analyses were carried out using programs written in R , which are available from the authors upon request. As all fully conditional posterior distributions had closed forms, a Gibbs sampler, as discussed in , was applied to obtain a single chain of 300, 000 iterations for each model fitted. On the basis of visual inspection of a subset of parameters, including the structural coefficients, genetic and residual covariances, the initial 100, 000 samples of each chain were discarded as a conservative burn-in. The remaining 200, 000 iterations were regarded as samples from the posterior distributions of the parameters. The retained samples were used as basis for recursive causal structure search via IC algorithm, model comparison, and inferences about the parameters of the model fitted conditionally on the selected causal structure.
where , which is the posterior mean of θ, and were obtained from the posterior samples of θ.
Results and discussion
Posterior means and 95% HPD intervals for the dispersion parameters pertaining to a MTAM
95% HPD Interval
95% HPD Interval
Given the HPD contents applied to the IC algorithm, the output in Figure 2b may be considered as the most stable, since it was consistently selected when using HPD intervals of 75%, 80%, 85% and 90%. This structure is similar to the one obtained using 70% HPD intervals, except for the absence of the edge connecting BW and NE. Another difference from the previous selected structure is that this slightly sparser undirected graph reflects a set of conditional independencies that could effectively result from a recursive SEM. In other words, this undirected graph represents a non-empty class of recursive causal structures, which is in contrast to the graph previously discussed, which suggested features in the joint distribution that could not result from an acyclic SEM under the causal sufficiency assumption. However, every instance of this class conflicts with the prior knowledge regarding the temporal sequence of the studied traits, i.e. every structure of this class considers that at least one trait is affected by some other trait not yet expressed. More specifically, for every member of this causal structure class, AEW is regarded as a cause of W35, or a cause of BW, or both. Here we allowed the temporal sequence information to override the algorithm output, leading to the oriented structure presented in Figure 3b, which involves adding in the unshielded collider BW → AEW ← W35.
Finally, the last selected structure resulted from using the proposed approach based on 95% HPD intervals to make the statistical decisions. As presented in Figure 2c, this structure is also undirected, and consists of two disconnected sub-structures. Unlike the previous outputs, this class of structures carries one structure that is consistent with the temporal information regarding the studied traits, which is depicted in Figure 3c. Moreover, the edges conveyed by this undirected graph were the most stable, as they were present for every HPD interval content that was used in the search methodology.
DIC obtained for SEM with causal structures as in Figure 3a (Model A), 3b (Model B), 3c (Model C), and for a Multiple Trait Animal Model (MTAM)
Posterior means and 95% HPD intervals for the dispersion parameters pertaining to Model C
95% HPD Interval
Posterior means and 95% HPD intervals for the dispersion parameters pertaining to a reduced SEM with causal structure C
95% HPD Interval
95% HPD Interval
It should be stressed that one's interpretation of the output provided by the approach proposed by  must be guided by the (causal) assumptions one is willing to accept. This methodology could be regarded as causal structure inference in situations where the assumptions provided by  are accepted (namely: (1) causal sufficiency, (2) same causal relations for every individual in population, (3) faithfulness of joint distribution to an acyclic directed graph, and (4) correctness of statistical decisions). Some causal learning may still take place even if we do not accept the strong assumption of causal sufficiency (i.e., that every variable which affects two or more variables under study is already in the set of the studied variables). Applying this to the results of the present study, the existence of causal influence of AFE over NE could be claimed by simply accepting the Causal Markov Condition (which is not an assumption as strong as causal sufficiency) and by acknowledging temporal information (W35 before AFE, and the latter before NE) . Nevertheless, structural equation modeling may be used without learning from the causal information carried by it. Under this circumstance, the goal may simply be to represent a joint probability distribution in a more parsimonious fashion. Generally, when a recursive causal structure is applied with this purpose, the residual covariance matrix is constructed as diagonal to achieve parameter identifiability. Nonetheless, this is exactly the statistical consequence of accepting the IC algorithm's causal sufficiency assumption, so that the described methodology may be properly used under this construction. Because the proposed approach searches for minimal causal structures, applying the retrieved structures to fit a recursive SEM would result in parsimonious modeling of joint probability distributions derived from multiple trait models.
BDV, MAS, RBT and RAT acknowledge support from Conselho Nacional de Desenvolvimento Científico e Tecnológico and Coordenação de Aperfeiçoamento de Pessoal de Nível Superior, Brazil. GJMR would like to acknowledge support from the Vilas Associate Award, Graduate School of University of Wisconsin, and Agriculture and Food Research Initiative Competitive Grant no. 2011-67015-30219 from the USDA National Institute of Food and Agriculture.
- Haavelmo T: The statistical implications of a system of simultaneous equations. Econometrica. 1943, 11: 12-View Article
- Wright S: Correlation and causation. J Agric Res. 1921, 201: 557-585.
- Gianola D, Sorensen D: Quantitative genetic models for describing simultaneous and recursive relationships between phenotypes. Genetics. 2004, 167: 1407-1424. 10.1534/genetics.103.025734.PubMed CentralView ArticlePubMed
- de los Campos G, Gianola D, Boettcher P, Moroni P: A structural equation model for describing relationships between somatic cell score and milk yield in dairy goats. J Anim Sci. 2006, 84: 2934-2941. 10.2527/jas.2006-016.View ArticlePubMed
- de los Campos G, Gianola D, Heringstad B: A structural equation model for describing relationships between somatic cell score and milk yield in first-lactation dairy cows. J Dairy Sci. 2006, 89: 4445-4455. 10.3168/jds.S0022-0302(06)72493-6.View ArticlePubMed
- Heringstad B, Wu XL, Gianola D: Inferring relationships between health and fertility in Norwegian Red cows using recursive models. J Dairy Sci. 2009, 92: 1778-1784. 10.3168/jds.2008-1535.View ArticlePubMed
- de Maturana EL, Campos Gdl, Wu XL, Gianola D, Weigel KA, Rosa GJM: Modeling relationships between calving traits: a comparison between standard and recursive mixed models. Genet Sel Evol. 2010, 42: 1-10.1186/1297-9686-42-1.PubMed CentralView ArticlePubMed
- de Maturana EL, Wu XL, Gianola D, Weigel KA, Rosa GJM: Exploring biological relationships between calving traits in primiparous cattle with a Bayesian recursive model. Genetics. 2009, 181: 277-287.PubMed CentralView ArticlePubMed
- Varona L, Sorensen D, Thompson R: Analysis of litter size and average litter weight in pigs using a recursive model. Genetics. 2007, 177: 1791-1799. 10.1534/genetics.107.077818.PubMed CentralView ArticlePubMed
- Wu XL, Heringstad B, Chang YM, de los Campos G, Gianola D: Inferring relationships between somatic cell score and milk yield using simultaneous and recursive models. J Dairy Sci. 2007, 90: 3508-3521. 10.3168/jds.2006-762.View ArticlePubMed
- Wu XL, Heringstad B, Gianola D: Exploration of lagged relationships between mastitis and milk yield in dairy cows using a Bayesian structural equation Gaussian-threshold model. Genet Sel Evol. 2008, 40: 333-357.PubMed CentralView ArticlePubMed
- Pearl J: Causality: Models, Reasoning and Inference. 2000, Cambridge, UK: Cambridge University Press
- Verma T, Pearl J: Equivalence and synthesis of causal models. Proceedings of the 6th Conference on Uncertainty in Artificial Intelligence, 27-29 July1990 Cambridge. Edited by: Henrion M, Shachter R, Kanal L, Lemmer J. 1990, 220-227.
- Spirtes P, Glymour C, Scheines R: Causation, Prediction and Search. 2000, Cambridge, MA: MIT Press, 2
- Valente BD, Rosa GJM, de los Campos G, Gianola D, Silva MA: Searching for recursive causal structures in multivariate quantitative genetics mixed models. Genetics. 2010, 185: 633-U361. 10.1534/genetics.109.112979.PubMed CentralView ArticlePubMed
- Rosa GJM, Valente BD, Campos Gdl, Wu XL, Gianola D, Silva MA: Inferring causal phenotype networks using structural equation models. Genet Sel Evol. 2011, 43: 6-10.1186/1297-9686-43-6.PubMed CentralView ArticlePubMed
- Henderson CR, Quaas RL: Multiple trait evaluation using relative records. J Anim Sci. 1976, 43: 1188-1197.
- Shipley B: Cause and Correlation in Biology. 2002, Cambridge/London/New York: Cambridge University Press
- R Development Core Team: R: A Language and Environment for Statistical Computing. 2009, R Foundation for Statistical Computing
- Spiegelhalter DJ, Best NG, Carlin BR, van der Linde A: Bayesian measures of model complexity and fit. J R Stat Soc Ser B-Stat Methodol. 2002, 64: 583-616. 10.1111/1467-9868.00353.View Article
- Scheines R: An introduction to causal inference. Causality in crisis. 1997, Citeseer, 185-199.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.