Analysis of the real EADGENE data set: Multivariate approaches and post analysis (Open Access publication)

The aim of this paper was to describe, and when possible compare, the multivariate methods used by the participants in the EADGENE WP1.4 workshop. The first approach was for class discovery and class prediction using evidence from the data at hand. Several teams used hierarchical clustering (HC) or principal component analysis (PCA) to identify groups of differentially expressed genes with a similar expression pattern over time points and infective agent (E. coli or S. aureus). The main result from these analyses was that HC and PCA were able to separate tissue samples taken at 24 h following E. coli infection from the other samples. The second approach identified groups of differentially co-expressed genes, by identifying clusters of genes highly correlated when animals were infected with E. coli but not correlated more than expected by chance when the infective pathogen was S. aureus. The third approach looked at differential expression of predefined gene sets. Gene sets were defined based on information retrieved from biological databases such as Gene Ontology. Based on these annotation sources the teams used either the GlobalTest or the Fisher exact test to identify differentially expressed gene sets. The main result from these analyses was that gene sets involved in immune defence responses were differentially expressed.


INTRODUCTION
In the host response to pathogens rather than individual gene actions the biological importance may be exhibited through the combined actions of a group of genes. Since the microarray technology allows us to monitor simultaneously the expression of thousands of genes, employing multivariate statistical methods to analyse these data may enable us to identify gene groups involved in the host response to pathogens.
There are two conceptually different ways of defining gene groups. First, gene groups can be identified from the experimental data at hand using statistical methods developed for clustering genes that show similar expression patterns [8]. For example, cluster analysis can be used to discover classes of genes responding differently to specific pathogens. Another important application of microarray data is to build classifiers that could predict if and when an animal will respond to a specific pathogen.
Second, gene groups can be defined based on prior biological knowledge on gene functions available from public available databases (e.g. Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG)) [1]. Once the gene groups have been identified several statistical methods exist to evaluate the association of the gene group with the biological outcome [12,13].
The aim of this paper was to describe and when possible compare the multivariate statistical methods used by the participants in the EADGENE WP1.4 workshop.

MATERIALS AND METHODS
An EADGENE funded microarray study experiment was performed to gain a better insight into the genes involved in mastitis in dairy cows [15]. The EADGENE partners were provided data from a microarray analysis of mammary tissue sampled at different time points relative to intra-mammary exposure to the pathogen. For a detailed description of the experiment, the expression data as well as methods used by individual teams for data normalisation and identification of differentially expressed genes, we refer to Jaffrézic et al. [14]. Three conceptually different multivariate statistical approaches were used by the EADGENE teams: class discovery and class prediction, differential co-expression of gene sets, and differential expression of gene sets. These statistical approaches will each be presented in more detail. An overview of the different statistical methods and software used can be found in Table I.

Class discovery and class prediction
The first statistical approach was for class discovery and class prediction using evidence from the data at hand. Several teams used hierarchical clustering (HC) or principal component analysis (PCA) to identify groups of differentially expressed genes with similar expression pattern. Hierarchical clustering was based on Euclidian distance similarity measure and average agglomeration method. HC and PCA were based on expression data from genes that were found to be differentially expressed between different time points or pathogens.
The INRA_T team used a slightly different approach in which differentially expressed genes where clustered according to their smoothed expression profile over time. Before the usual clustering step, a smooth expression curve was fitted for each gene [5,17]. The derivative of this curve was computed and will be refered to as a "profile". Each profile was discretized over 13 equidistant time points from 0 to 24 h. For this new data set, the dendrogram produced by HC was used as a guide to choose the final number of clusters (k). Finally, to gain in robustness, a K-means algorithm was performed with k initial centroids from the HC [5].
The INRA_T team also used Classification and Regression Trees (CART) to identify subsets of genes (among all the genes on the array) that best predicted different time points and infective pathogens. CART is a tree-building method that partitions a set of samples into groups [4]. However, slight changes in the expression data can lead to a very different construction of the tree 654 P. Sørensen et al. and therefore two wrapper methods [16] that aggregate the trees were used in combination with CART. The first wrapper method was Random Forest (RF), which creates an ensemble of trees (the forest) using different bootstrap samples [3]. In addition, at each partition of the tree, the best predictor is chosen among a fixed number of randomly selected genes using the "Mean decrease accuracy" as a measure of the predictive ability [16]. The second wrapper method was based on a stochastic algorithm [10]. The idea of this method is to quantify numerically the classification efficiency of each gene with a probability weight (the heavier the weight the better) and learn iteratively this probability using stochastic approximations and a classification task. In this case, the classification method used to measure the classification efficiency of each gene was CART that is well adapted to multi-class problems. This will enable us to select a subset of discriminative genes (with the heaviest probability weight) that hence holds useful information on the microarray experiment.

Differential co-expression of gene sets
The second statistical approach identified groups of differentially coexpressed genes between the E. coli and S. aureus infected quarters. Based only on E. coli, data genes were hierarchically clustered, using 1-r (where r is the Pearson correlation coefficient) as a distance measure and the "average" agglomeration method [19]. Gene groups were identified by cutting the cluster tree at a height of 0.005. Then for each gene group the mean of the pairwise correlations among all genes was estimated in the S. aureus data and if the group of genes is more correlated than is expected by chance, they are said to be differentially co-expressed.

Differential expression of gene sets
The third statistical approach tests for differential expression of a priori defined gene sets using either the GlobalTest [13] or the Fisher exact test [6]. The GlobalTest uses all the genes in the data set and is based on an empirical Bayesian generalised linear model with the regression coefficients between expression data and the sample treatments (e.g. time point or pathogen) as random variables. The method investigates whether samples with similar treatments tend to have similar gene expression patterns. It assumes that the regression coefficients for individual genes belonging to a specific gene set are a sample from some common distribution with expectation of zero and tests (using a score test) if the variance associated with the gene set is different from zero [12]. Since time points and pathogens are categorical variables the GlobalTest was applied using the logistic model. P-values for tests were calculated using the asymptotic distribution. The GlobalTest is implemented in the Bioconductor package [11] GlobalTest [13]. An alternative method to identify differentially expressed gene sets is to use the Fisher exact test to identify over-represented gene sets in a list of differentially expressed genes. In this method it is determined how many of the genes from a specific gene set are in the list of differentially expressed genes and how many are on the whole array. With these counts a Hypergeometric test is performed. This is equivalent to using the Fisher exact test. The teams used this method as implemented in the Bioconductor package GOstats [9], the Expression Analysis Systematic Explorer (EASE) on-line application (http://david.abcc.ncifcrf.gov/) or the Ingenuity software (www.ingenuity.com).

Annotation of bovine genes
Several approaches for obtaining information about bovine gene functions were explored by the EADGENE teams. The original annotation file for the bovine 20K array included 5371 Ensembl Genes, 3847 EntrezGenes and 8127 UniGenes. From these gene identifiers it is possible to retrieve additional information on bovine gene functions using the Functional Annotation Tool on DAVID Bioinformatics website (david.abcc.ncifcrf.gov). Another approach to obtain bovine annotation is to use the Bioconductor package AnnBuilder [20]. Using the AnnBuilder software and provided mappings between microarray cDNA probe ID and Bovine GenBank Locus ID it is possible to build a customised annotation package gathering bovine annotation from several public databases. An alternative approach to annotate bovine genes was based on homology to human orthologs. This was done using data from the Ensembl (www.ensembl.org) and BioMart (www.biomart.org) databases and the Bioconductor package biomaRt [7] as the query interface. For each unique feature on the bovine array, the query ID (Ensembl ID, Entrezgene ID or Unigene ID) was extracted from the original annotation file. For each bovine query ID the human homolog was retrieved and used to obtain annotation from the Homo sapiens dataset in Ensembl. Details about the method for gene orthology prediction can be found here www.ensembl.org/info/data/compara/homology_method.html.

Class discovery and class prediction
The main results from the HC and the PCA of the differentially expressed genes were that E. coli samples clustered into two groups where time 24 h was different from the other time points (Fig. 1). HC analysis clustered genes into two groups, one down regulated at 0-12 h and one up regulated at 24 h. A similar pattern was observed with the K-means clustering of the smoothed expression profiles where five (three) clusters corresponded to increasing (decreasing) expressions, with various magnitudes (Fig. 2). Principal component analysis of all samples showed that S. aureus samples are different from E. coli. Using the classifier algorithms (CART in combination with the aggregation methods) to identify subsets of genes that best predicted different time points enabled the HC analysis to better separate samples taken at different time points compared to an HC analysis performed with a selection of genes with an F-test. The RF algorithm selected 38 genes and the stochastic algorithm selected 70 genes as best predictors of the time points. Although these two aggregation methods use the same classification method (CART) only 18 genes are overlapping. The predictive genes were generally differentially expressed, but not necessarily ranked high based on the F statistic. As opposed to the results obtained from the HC analysis of the differentially expressed genes (Ftest selection), where the selected genes mostly discriminated time 24 against the others, here separation of time points 0, 6 and 24 were better with this unsupervised classification (Fig. 3).  Heatmap showing hierarchical clustering of 38 best predictor genes (horizontally) and microarrays (vertically) corresponding to timepoints 0 (n = 4), 6 (n = 4), 12 (n = 3), and 24 (n = 4) h relative to E. coli infection. The 38 genes were selected using the aggregation method random forest. The light grey (black) colour represents overexpressed (underexpressed) genes.

Differential co-expression of gene sets
The method used to identify groups of differentially co-expressed genes, identified several clusters of genes highly correlated when animals were infected with E. coli but not correlated more than expected by chance when the infective pathogen was S. aureus (Fig. 4).

Annotation of bovine genes
In total, 2254 out of 8126 Unigene ID in the original bovine annotation file could be recognised, but only 1142 UniGene ID were annotated in the

661
DAVID database. A similar number of bovine genes were annotated using the AnnBuilder package where 1331 out of 2173 EntrezGene ID genes were associated with GO terms. Annotating bovine genes based on homology to human orthologs resulted in 3831 out of 5840 query probe ID that were associated with human GO terms.

Differential expression of gene sets
Based on these annotation sources the teams used either the Globaltest or the Fisher exact test to identify differentially expressed gene sets. Both methods identified several gene sets defined by the Biological Process (GO-BP) terms that were differentially expressed in response to infection. These gene sets include biological processes such as 'defense response', 'immune response', 'inflammatory response', 'regulation of apoptosis', 'cell-cell adhesion', 'response to biotic stimulus', 'response to wounding' and 'response to pest, pathogen or parasite'. The functional analysis using the DAVID software also showed that the genes up-regulated at E. coli 24 h infection were enriched for GO terms related to immune activities. These results were also supported by the functional analysis using the Ingenuity software that was based on 63 differentially expressed genes and identified four highly significant biological networks (Fig. 5) which are all related to infection including the Toll-like receptor signalling pathway genes (CD14, NFKBIA and TIRAP) which are essential in the innate immunity response to gram-negative infection.
The results also showed that annotation based solely on bovine genes and annotation based on bovine homology to human orthologs was quite different (Fig. 6). In total, 49 significantly enriched GO terms were identified with the two methods. Of these, only three specific GO terms were found across both methods, which is mainly due to the limited number of identical genes mapping to the exact same GO term (Fig. 6). Hence, if the same genes do not map to the same GO terms in the two methods, it is of course not possible for them to come up as enriched in the two methods. The relative low number of identical genes mapping to the exact same GO terms in the two methods can be explained by the level of specificity by which genes map to biological processes (Fig. 7). An example is the term 'inflammatory response' in the pure bovine annotation approach, which is represented by eight more specific terms in the bovine-human annotation approach.

DISCUSSION
Three conceptually different multivariate statistical approaches were used by the EADGENE teams in the analysis of the microarray data. Because Figure 5. Significant canonical pathway. The most significant canonical pathways, as determined by Ingenuity software, across the entire dataset of 596 genes with a < 0.05 FDR are displayed along the x-axis. The y-axis displays the significance level which is a P-value, calculated using the right-tailed Fisher Exact Test. In this method, the P-value is calculated by comparing the number of user-specified genes of interest (i.e. Functional Analysis Genes) that participate in a given function or pathway, relative to the total number of occurrences of these genes in all functional/pathway annotations stored in the Ingenuity Pathways Knowledge Base. different normalisation procedures and methods for identifying differentially expressed genes were used by the EADGENE teams, a detailed comparison of the different statistical approaches is not possible.
The aim of the first statistical approach was class discovery and class prediction. Hierarchical clustering and principal component analysis of differentially expressed genes demonstrated a different expression profile in tissue samples taken at 24 h following E. coli infection as compared to the other time points. This is mainly due to the fact that a majority of the differentially expressed genes are identified at 24 h. It was also shown that by using classifier algorithms such as CART to identify subsets of genes that best predicted different time points and pathogens enables the HC analysis to better classify the samples taken at different time points. In some cases, highly differentially expressed genes (high rank in the F-test) will be strongly correlated and may not necessarily be useful for prediction. Other genes not correlated with the  top ranking genes but still differentially expressed will be better in terms of prediction and may contain more informative genes that explain the biological experiment. Although HC and PCA are able to display the predominant structures in the data they may fail to capture alternative structures and local behaviour.
In the second statistical approach, the goal is to identify differential coexpression of gene sets. In this approach, HC was used to identify several clusters of co-expressed genes highly correlated when animals were infected with E. coli but not correlated more than expected by chance when the infective pathogen was S. aureus. This approach looks for changes in the relationship among genes themselves and may provide insight into changes in co-regulation of genes. It does, however, rely on HC which has the disadvantages that it imposes a tree structure to the data, is highly sensitive to the distance metric used, and typically requires subjective decisions on the number of clusters. Therefore it may be useful to identify differentially co-expressed genes using gene sets based on known gene functions and pathways (e.g. GO, KEGG).
The third statistical approach tests for differential expression of a priori defined gene sets using either the GlobalTest [13] or the Fisher exact test [6]. Gene sets were defined mostly based on GO and used together with these two methods which primarily relies on associations between the genes and the phenotype of interest. Although the purpose of these methods is the same they are quite different in terms of methodology and in the genes included in the analysis. Despite the different teams used different methods for determination of differentially expressed genes and different annotation sources both methods showed that gene sets involved immune defence responses. Methods for identification of differential expression of individual genes are in general optimised for detecting genes with large changes in gene expression whereas methods for detecting differentially expressed gene sets are more powerful at detecting smaller changes in gene expression patterns of a whole group of genes. Therefore gene set analyses are complementary to analyses at the level of individual genes and represent powerful tools for the dissection of complex changes in gene expression [13].
Methods for detecting differentially expressed gene sets rely on the availability of annotated bovine genes. Because there are a limited number of annotated bovine genes available, an alternative annotation approach is to use the human ortholog genes that take advantage of the well-annotated human genome. Although the two annotation approaches affected the differentially expressed gene set results with respect to identified GO terms and the level of GO term specificity, there was not any major advantage in using one or the other approach. In fact they might be complementary, suggesting that more alternative methods should generally be applied in such post-analysis studies. Furthermore, when using gene expression arrays for genomes of livestock animals that are not fully annotated in terms of gene functions, as is the case for the bovine genome, it is possible to identify additional gene annotations like more GO terms when using several alternative annotation approaches as we have done in the present study. Using several annotation methods may not be worthwhile with more completely annotated genomes like the human and mouse genomes.
It is, however, still a challenge to interpret the biological relevance of the differentially expressed genes and gene sets. Briefly, differentially expressed genes included CD14 whose expression level was up regulated in our study which was consistent with the increase of CD14 protein levels observed in infected cow milk [2]. CD14 is part of the Toll-like Receptor signalling pathway and two other genes in this pathway, NFKBIA and TIRAP, were also up regulated and are known to be involved in the innate immunity response to gram-negative infection [18].

CONCLUSION
Different multivariate statistical methods enabled the EADGENE teams to discover groups of genes that displayed similar changes in gene expression patterns. Statistical methods that use a priori defined gene groups seem to be useful in the search for biological relevant changes in gene expression although the interpretation of the complex changes in gene expression remains a challenge. Although these methods rely on annotated bovine genes it is possible to "borrow" information from other well-annotated species (e.g. human or mouse) until more is known about the function of bovine genes.