Consensus genetic structuring and typological value of markers using multiple co-inertia analysis

Working with weakly congruent markers means that consensus genetic structuring of populations requires methods explicitly devoted to this purpose. The method, which is presented here, belongs to the multivariate analyses. This method consists of different steps. First, single-marker analyses were performed using a version of principal component analysis, which is designed for allelic frequencies (%PCA). Drawing confidence ellipses around the population positions enhances %PCA plots. Second, a multiple co-inertia analysis (MCOA) was performed, which reveals the common features of single-marker analyses, builds a reference structure and makes it possible to compare single-marker structures with this reference through graphical tools. Finally, a typological value is provided for each marker. The typological value measures the efficiency of a marker to structure populations in the same way as other markers. In this study, we evaluate the interest and the efficiency of this method applied to a European and African bovine microsatellite data set. The typological value differs among markers, indicating that some markers are more efficient in displaying a consensus typology than others. Moreover, efficient markers in one collection of populations do not remain efficient in others. The number of markers used in a study is not a sufficient criterion to judge its reliability. "Quantity is not quality".


INTRODUCTION
Today, a large number of studies are aimed at investigating the genetic structuring of populations within species. The goal of such studies is first to provide [...]. Generally, two kinds of methods are used in these studies of genetic structuring: phylogenetic reconstruction [46,57,67] and/or multivariate procedures [8,15,63,65,69]. In phylogenetic reconstruction, a consensus tree is typically built to summarize information and measure the reliability of the tree. Several methods have been proposed for inferring consensus trees, among them the maximum agreement subtree, the strict consensus, the majority tree, the Adams consensus and the asymmetric median tree [12,52].
However, construction of trees using admixed populations, as is the case in livestock species, violates the principles of phylogeny reconstruction [25,64]. In this situation, multivariate procedures are recommended. The most common method to analyze allelic frequency data is principal component analysis (PCA) [6,33,34,36,37,48]. Using such methods may result in a non-consensus representation, due to the incongruence among markers [50]. Weak congruence could also explain some of the low bootstrap values which are typically reported in several studies in the following species: beef cattle [13,43,45,47,51,67], goats [35,42], sheep [63,70], and natural populations, such as white-tailed deer [20].

The markers involved in such studies are chosen to be neutral. One of the main principles of population genomics states that neutral markers across the genome will be similarly affected by demography and the evolutionary history of populations [44]. Accordingly, these markers should be congruent, i.e. should reveal the same typology among populations.
Nevertheless, neutral markers may be influenced by selection on nearby (linked) loci, and, then, reveal different patterns of variation.
Thus, a method explicitly devoted to exhibiting a consensus in a multivariate framework is necessary. In this context, the markers of interest should be both highly variable and congruent in order to yield a consensus typology. The multiple co-inertia analysis (MCOA) is dedicated to this purpose. MCOA was first described by Chessel and Hanafi [17], and is used in ecology [4,30].
In this paper, we address the capacity and efficiency of marker panels to exhibit a genetic structuring and measure the contribution of each specific marker by MCOA. In the genetic framework, this ordination method identifies the structures of populations common to many tables of allelic frequencies. First, single marker analyses were performed. Allelic frequencies are a special case of compositional data [1,3]: they consist of vectors of positive values summing to one. De Crespin de Billy et al. [19] introduced a specifically designed principal component analysis (%PCA) for this kind of data. This method can be used together with a biplot representation [27], which permits an interpretation of the location of a population in terms of its allelic frequencies. Adding confidence ellipses [29] around the population points on the resulting plot improves the visual assessment of the separating power of the markers. It also makes it possible to account for the uncertainty due to the size of the sampled population. Second, MCOA simultaneously finds ordinations from the tables that are most congruent. It does this by finding successive axes from each table of allelic frequencies, which maximize a covariance function. This method permits the extraction of common information from separate analyses, the setting up of a reference typology, and the comparison of each separate typology to this reference typology. Finally, to quantify the efficiency of a marker, we introduce the typological value (TV), which is the contribution of the marker to the construction of the reference typology.
Hence, we reply to the following practical questions. Which markers contribute most to the typology of populations? Do efficient markers in one collection of populations remain efficient in others? Does the number of markers ensure the reliability of the typology? 548 D. Laloë et al. In this article, we provide a short background to MCOA, we describe the typological value and we study the interest and efficiency of this method using a bovine data set.

Single marker analyses
Each marker yields allelic frequencies that define Euclidean distances between the populations in a multidimensional space. Principal component analysis [33,34] can be used to find a plane on which the populations are scattered as much as possible, i.e. conserving the distances among populations as well as possible. However, this method does not take into account the true nature of the data. Since allelic frequencies are positive and sum to one, they are compositional data [1]. Aitchison addressed some issues specific to the multivariate analysis of such data [1,2,3] and showed that centered PCA performs better when compositional data are transformed using log ratios or other logarithmic data transformations [55]. An appealing alternative to these approaches is to use a principal component analysis of proportion data (%PCA) [19]. Indeed, the typologies provided by this analysis are directly interpretable in terms of allelic frequencies, whereas this interpretability is at least debated for the former methods [68].
The %PCA yields the same axes as a classical centered PCA, and the distances between the scores of the populations are exactly the same as in PCA. Thus the typology of the populations is not altered. %PCA differs from PCA in that the cloud of points corresponding to the populations is not constrained to be centered at the origin. Instead, the populations are placed by averaging with respect to their allelic frequencies. The score $s_i$ of a population $i$ on an axis $u$ is computed as the mean of the allele coordinates (denoted $u_j$, $1 \le j \le p$) weighted by the corresponding allelic frequencies ($f_{ij}$):

$$s_i = \sum_{j=1}^{p} f_{ij}\, u_j$$

This method makes it possible to draw meaningful biplots [19], where both populations and alleles are represented, respectively by points and arrows. In such biplots, the closer the populations are to an allele, the higher the corresponding frequencies are.
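The averaging rule described above can be illustrated with a minimal sketch. The allele coordinates and frequencies below are hypothetical values chosen for illustration, and the function name `population_score` is not part of ade4 or of the original analysis.

```python
def population_score(frequencies, allele_coords):
    """Score of one population on one axis: the mean of the allele
    coordinates weighted by the population's allelic frequencies."""
    if len(frequencies) != len(allele_coords):
        raise ValueError("one frequency is needed per allele coordinate")
    if abs(sum(frequencies) - 1.0) > 1e-9:
        raise ValueError("allelic frequencies must sum to one")
    return sum(f * u for f, u in zip(frequencies, allele_coords))


# Hypothetical coordinates of three alleles on a single axis.
allele_coords = [-1.0, 0.0, 2.0]

# Hypothetical allelic frequencies of two populations at this marker.
pop_a = [0.5, 0.5, 0.0]
pop_b = [0.1, 0.1, 0.8]

score_a = population_score(pop_a, allele_coords)
score_b = population_score(pop_b, allele_coords)
```

In this toy case the second population carries the third allele at high frequency, so its score is pulled towards that allele's coordinate, which is the reading of the biplot suggested in the text.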
To improve the typologies of populations obtained by %PCA, we propose confidence ellipses as a visual tool to assess the genetic differences between populations. Indeed, it is valuable to take the precision of the population frequency estimates into account. Since these frequencies are just estimates of the real ones, they may change from one sample to another. The consequence for the typology is that the coordinates of any population fluctuate around the true, unknown position. Hence, we can determine a confidence ellipse [29], inside which the true population can be expected to be located, with a given probability. This probability P is linked to a size factor S that scales the ellipse. Using a PCA appropriate for allelic frequencies and confidence ellipses around population positions should help to interpret the different typologies provided by the markers. At this point, multiple co-inertia analysis makes it possible to carry out a comparison between these typologies.
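The link between the probability P and the size factor S can be sketched as follows. The relation used here, P = 1 - exp(-S^2 / 2), is the standard result for a bivariate normal distribution; it is an assumption of this sketch, as are the function names.

```python
import math


def ellipse_probability(size_factor):
    """Probability mass inside an ellipse of the given size factor,
    assuming (hypothetically) a bivariate normal distribution."""
    return 1.0 - math.exp(-size_factor ** 2 / 2.0)


def ellipse_size_factor(probability):
    """Inverse relation: size factor needed to reach a given probability."""
    if not 0.0 <= probability < 1.0:
        raise ValueError("probability must lie in [0, 1)")
    return math.sqrt(-2.0 * math.log(1.0 - probability))


# Size factor corresponding to a 95% ellipse under the assumption above.
s95 = ellipse_size_factor(0.95)
```

Under this assumption, a 95% ellipse corresponds to a size factor of about 2.45.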

Multiple co-inertia analysis
Multiple co-inertia analysis is an ordination method, which simultaneously analyzes K tables describing the same objects (in rows) with different sets of variables (in columns). The mathematical principles of the method are fully described by their authors [17], but we provide essential steps in the appendix; examples of its utilization can be found in ecology studies [4,30].
Within the MCOA framework, K sets of variables produce K typologies of the same objects on the basis of any single-table analysis, such as PCA or correspondence analysis. MCOA relies on the idea that there may be congruent structures among these typologies. The MCOA coordinates the K separate PCA, in order to facilitate their comparison and emphasize their similarities. A reference ordination is then constructed, which best summarizes the congruent information among the sets of variables. It can thus be considered as a "reference structure" (also called "reference").
We apply the MCOA to analyze a set of n populations typed on K markers. The method provides a set of K coordinated %PCA, each corresponding to a given molecular marker. These analyses can be interpreted like previous %PCA since populations are placed by averaging with respect to the alleles. However, these analyses display both scattered and congruent typologies, which can thus be compared. So, the criterion of the scores of maximum variance (used in %PCA) is no longer sufficient, and the correlation of the scores with the reference must be taken into account. To consider these two aspects, the MCOA maximizes the sum of the co-inertias (i.e. squared covariances) between the scores of populations of the coordinated analyses, and the reference. Let $l_k^r$ be the $r$th scores of populations in the coordinated %PCA of a marker $k$ (with $1 \le k \le K$), and $v^r$ be the $r$th reference scores. The criterion optimized in MCOA is then:

$$\sum_{k=1}^{K} w_k \,\mathrm{cov}^2(l_k^r, v^r) \tag{1}$$

where $w_k$ is a given weight for the marker $k$. These weights can be chosen according to the nature and disparity of the markers. We choose here uniform weights ($w_k = 1/K$) for every marker, but it is possible, for instance, to choose $w_k$ so that markers of different types are on the same level of variation.
The optimized criterion (1) guarantees that the typologies are scattered (maximization of the variance of the scores) and emphasizes their common structure (maximization of the squared correlation), since a squared covariance is the product of the two variances and the squared correlation. This matches our definition of what a "good marker" is, from a typological point of view: a marker which can separate the populations well, and which separates them like many other markers. Mathematically, this exactly corresponds to the contribution of a marker to the MCOA criterion:

$$w_k \,\mathrm{cov}^2(l_k^r, v^r) \tag{2}$$
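The sketch below evaluates this criterion for given scores: a weighted sum of squared covariances between each marker's scores and the reference scores. It does not perform the optimization itself. The scores are hypothetical, populations are given equal weights, and the function names are illustrative only.

```python
def covariance(x, y):
    """Covariance of two score vectors, with uniform population weights."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n


def marker_contributions(marker_scores, reference_scores, weights=None):
    """Weighted squared covariance of each marker's scores with the reference."""
    n_markers = len(marker_scores)
    if weights is None:
        # Uniform marker weights, as chosen in the text.
        weights = [1.0 / n_markers] * n_markers
    return [w * covariance(scores, reference_scores) ** 2
            for w, scores in zip(weights, marker_scores)]


# Hypothetical reference scores of four populations on one axis.
reference = [-1.0, -1.0, 1.0, 1.0]

# Hypothetical coordinated scores of three markers for the same populations.
marker_scores = [
    [-2.0, -2.0, 2.0, 2.0],  # scattered and congruent with the reference
    [-0.5, -0.5, 0.5, 0.5],  # congruent but weakly scattered
    [-2.0, 2.0, -2.0, 2.0],  # scattered but not congruent
]

contributions = marker_contributions(marker_scores, reference)
criterion = sum(contributions)
```

In this toy example only the first marker is both scattered and congruent, and it dominates the criterion; the third marker, although strongly scattered, contributes nothing because its scores are uncorrelated with the reference.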

Typological value
If the maximum of (1) is denoted $\lambda_r$, we can define the typological value (TV) of the marker $k$ as its relative contribution to the previous criterion:

$$TV_r(k) = \frac{w_k \,\mathrm{cov}^2(l_k^r, v^r)}{\lambda_r} \tag{3}$$

Contrary to (2), this expression is a proportion and can be expressed as a percentage. It corresponds to the ability of the marker $k$ to display the $r$th reference structure. The higher it is, the better it displays the $r$th structure of the reference. As a consequence, it can be used to compare the typological values of a set of markers on a given structure. Whenever a structure is expressed by more than one axis of the reference, (3) can be extended by summing separately the numerator and denominator. For example, if an interesting structure of populations is expressed by scores $i$ and $j$, (3) is generalized as:

$$TV_{i,j}(k) = \frac{w_k \left[\mathrm{cov}^2(l_k^i, v^i) + \mathrm{cov}^2(l_k^j, v^j)\right]}{\lambda_i + \lambda_j} \tag{4}$$

A last question to be tackled concerns the number of existing common structures. This is the number of scores to be kept for the reference and for each coordinated analysis. This number is chosen according to the decrease of $\lambda_r$, as is the case in PCA with eigenvalues. However, this choice is easier than in PCA: since MCOA eigenvalues have the status of squared PCA eigenvalues, the differences between high ones (interesting structures) and low ones are clearer in MCOA.

These methods are available in the ade4 package [18] of the R software [54].
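As an illustration of the typological value, the sketch below turns per-marker contributions into proportions, for a single axis or for several axes summed as described above. The contribution values are hypothetical and the function name is illustrative; this is not the ade4 implementation.

```python
def typological_values(contributions_by_axis):
    """Relative contribution of each marker to the criterion.

    contributions_by_axis: one list per retained axis, each holding the
    weighted squared covariance of every marker with the reference.
    Numerators and denominators are summed separately over the axes.
    """
    n_markers = len(contributions_by_axis[0])
    total = sum(sum(axis) for axis in contributions_by_axis)
    return [sum(axis[k] for axis in contributions_by_axis) / total
            for k in range(n_markers)]


# Hypothetical contributions of three markers on two reference axes.
axis_1 = [4.0, 1.0, 0.0]
axis_2 = [1.0, 3.0, 1.0]

tv_axis_1 = typological_values([axis_1])         # single-axis case
tv_both = typological_values([axis_1, axis_2])   # structure spanning two axes
```

Because each set of values sums to one, the results can be read directly as percentages. The ranking of markers may also change from one axis to another, as with the second marker here.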

Application to data
Blood samples of 755 unrelated animals from 16 cattle breeds were analyzed. The Borgu breed is a cross between West African shorthorn cattle and zebu. West African populations were collected in three neighboring countries: Benin, Togo and Burkina Faso. This West African data set has been taken from [49].
All breeds were genotyped for 30 microsatellite loci recommended for genetic diversity studies by the EC-funded European cattle diversity project (Resgen CT 98-118) and the FAO. Details on primers, original references and experimental protocols (conditions of PCR, multiplexing) can be found at http://dad.fao.org/en/refer/library/guidelin/marker.pdf.
To standardize genotypes between our laboratory and Labogena and in order to limit genotyping errors during laboratory experiments, we used three reference animals as controls in each gel run. To limit scoring errors, the results were recorded by two independent scorers [53].

RESULTS AND DISCUSSION
We first ran a %PCA on each microsatellite table of allelic frequencies (single-marker analysis). Corresponding plots are drawn on the same scale for six markers in Figure 1. Alleles are represented by arrows, the most discriminating ones being joined by lines. A confidence ellipse (P = 0.95) accounting for the number of sampled animals is drawn around each population point. The barplot of eigenvalues is drawn at the bottom left. It indicates the relative magnitude of each axis with respect to the total variance. The higher the eigenvalue is, the higher the Euclidean distances are among populations. For example, for HEL13, the first axis accounts for 75% of the total variance and the second axis accounts for 21%.

When all the markers are considered, it is easy to see that the efficiency of each marker differs. Some did not exhibit any clustering (INRA35); others exhibited some clusters, but not always the same ones. For example, HEL1 and HEL13 separated three clusters: French taurine, African taurine and African Zebu. Some microsatellites, e.g. MM12, separated the African taurine breeds from the zebu breed. Within the French cluster, INRA63 separated three breeds and HEL5 isolated the Maine-Anjou breed from the others. Figure 1 is a graphical tool, which compares the usefulness of markers for separating populations. However, the axes of each %PCA differ from one marker to another, and cannot be interpreted in the same way. Axis 1 of the HEL1 plot is not the same as Axis 1 of the MM12 plot. Single-marker structures cannot be easily compared by looking at factorial maps of separate uncoordinated analyses. The multiple co-inertia analysis deals with this problem, through coordinated analyses, where axes of each plot tend to display the same structures.
Coordinated %PCA plots are drawn on the same scale for the six markers on Figure 2. Ellipses and proximities between alleles and populations can be interpreted in the same way as in Figure 1. However, the barplot at the bottom left of the plot no longer represents eigenvalues, but the variance of the scores according to the different axes. For instance, populations are more scattered along the first axis for HEL13 than for HEL1, or INRA63.
A comparison of Figure 1 with Figure 2 shows that some markers fit the common structures quite well. For instance, the first two axes of the plots of HEL1, HEL13 and INRA63 are almost identical. Some others remain inefficient, e.g. INRA35. However, for MM12 and HEL5, the situation is more interesting. For MM12, axis 1 in Figure 1 is more or less axis 2 in Figure 2 of the common structure exhibited by MCOA. Concerning HEL5, in Figure 1 the most obvious feature is the separation of the Maine-Anjou breed from the others. However, this marker also exhibits the common structure, as indicated in Figure 2.
Therefore, the non-coordinated analyses answer the question "does the marker separate the populations?", while the coordinated analyses answer the question "how does the marker separate the populations with regard to the common structure?".
The decrease of eigenvalues shows three main structures in the reference typology. The first three axes of the reference typology are shown in Figures 3A (axes 1 and 2) and 3B (axes 1 and 3). The first axis clearly distinguishes French breeds from African breeds. The second axis separates African breeds into three groups: Taurine breeds, Borgu and Zebu. The intermediate position of the Borgu is explained because this breed is an African shorthorn × Zebu crossbred. The third axis separates French breeds into three clusters. The first cluster is mainly composed of southwestern French breeds and the Montbeliarde breed, the second is composed of Charolaise and Bretonne Pie Noire breeds and the third distinguishes the Maine-Anjou breed. Note that these clusters mainly fit with history and geography except for the Charolaise and Bretonne Pie Noire cluster.
The relationship between a single marker analysis (Fig. 2) and the MCOA (Fig. 3A) is illustrated by a cohesion plot, which is the superimposition of the two corresponding plots (Fig. 4). In this figure, the location of each data point is indicated using an arrow. The tip of the arrow shows the location of the breed in the single marker analysis and the start of the arrow is its location in the MCOA reference. If both typologies strongly agree, the arrows are short. Conversely, a long arrow demonstrates a locally weak relationship between the structures.
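The reading of a cohesion plot can be made concrete by computing the length of each arrow, i.e. the distance between a breed's position in the reference and its position in a coordinated single-marker analysis. The positions and breed labels below are hypothetical, and the function name is illustrative.

```python
import math


def arrow_lengths(reference_positions, marker_positions):
    """Euclidean length of the arrow drawn for each breed, from its
    reference position (start) to its single-marker position (tip)."""
    lengths = {}
    for breed, (ref_x, ref_y) in reference_positions.items():
        tip_x, tip_y = marker_positions[breed]
        lengths[breed] = math.hypot(tip_x - ref_x, tip_y - ref_y)
    return lengths


# Hypothetical positions on the first two axes.
reference_positions = {"breed_A": (1.0, 0.0), "breed_B": (-1.0, 0.5)}
marker_positions = {"breed_A": (1.0, 0.0), "breed_B": (2.0, 4.5)}

lengths = arrow_lengths(reference_positions, marker_positions)
mean_length = sum(lengths.values()) / len(lengths)
```

A breed with a zero-length arrow is placed identically by the marker and by the reference, whereas a long arrow flags a breed that this marker places far from its consensus position.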
Of the six microsatellites, INRA35 exhibits the longest arrows and is thus the least congruent marker. With the MM12 marker, the direction of the arrows is mainly horizontal, showing discrepancies along the first axis (separation between France and Africa), while there is good agreement for the second axis (separation between African taurine breeds and zebu breeds). By contrast, HEL1 reproduces the reference almost perfectly. HEL13 is also a structuring marker for all the breeds except for the Bazadaise breed, which is clustered with African taurine breeds.

Diagrams of typological values are plotted in Figures 5A (1st axis), 5B (2nd axis) and 5C (3rd axis). The heterogeneity of typological values increases with the number of the axis. In order to obtain a total percentage equal to or greater than 50%, nine markers are needed for axis 1, eight markers for axis 2, and only six for axis 3. The minimum value is close to 0 for the three axes (0.11% (INRA35), 0.07% (SPS115) and 0.02% (ILSTS005) for axes 1 to 3, respectively). The maximum percentage (8.3%) for axis 1 is reached by HEL13. This marker is also the most important for axis 2, with a typological value percentage equal to 9.0%. For axis 3, the typological values reach a maximum percentage of 11.5%, for HEL5. Some markers do not contribute to the population structuring, whatever the axes: INRA35, INRA5 and SPS115. However, the typological values vary according to the structures. For example, HEL13, which is the most important marker for axes 1 and 2, is among the worst markers for axis 3 (typological value percentage of 0.21%). Conversely, HEL5 is the most important marker for axis 3, but not for axes 1 and 2. MM12 contributes mostly to axis 2, but not to the other axes.
Thus, efficient markers for distinguishing African from French breeds are not necessarily the same as for distinguishing within Africa or within France.

Figure 4. Cohesion plots showing the differences between the reference typology (labels and arrows origin) and the coordinated single-marker analyses (normed scores) on the first two axes. The arrows represent the typological "mistakes" displayed by the markers. The longer an arrow is, the greater the mistake is. A common scale is used (d = 1) for all plots.
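The count of markers needed to reach a given share of the total typological value, as reported above for each axis, can be obtained by ranking markers and accumulating their values. The sketch below uses hypothetical marker names and values, not the results of this study.

```python
def markers_needed(typological_values, threshold=0.5):
    """Smallest set of top-ranked markers whose cumulated typological
    value reaches the threshold (values are proportions summing to one)."""
    ranked = sorted(typological_values.items(),
                    key=lambda item: item[1], reverse=True)
    cumulative = 0.0
    selected = []
    for name, value in ranked:
        selected.append(name)
        cumulative += value
        if cumulative >= threshold:
            return selected
    raise ValueError("threshold cannot be reached with these values")


# Hypothetical typological values of four markers on one axis.
example_values = {"marker_1": 0.1, "marker_2": 0.4,
                  "marker_3": 0.2, "marker_4": 0.3}

top_markers = markers_needed(example_values, threshold=0.5)
```

The more heterogeneous the typological values are on an axis, the fewer markers are needed to reach the threshold, which is consistent with the smaller count reported for axis 3.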

CONCLUSION
In this paper, we describe the MCOA in the context of a population genetic structuring analysis. This methodology is easy to use and could be of general applicability for livestock species. The efficiency of a set of markers is addressed with graphical tools and quantitative measures. This method is implemented in the ade4 package [18] of the R software [54].
This method is independent of the mutation model of the markers used, and thus can be applied to various types of markers (e.g., proteins, blood groups, microsatellites, amplified fragment length polymorphism, single nucleotide polymorphisms).
The choice of a weighting scheme should be guided by the nature of the markers involved in the study. A uniform weighting may be sensible if only one type of marker is used, as in this paper. However, weighting each marker by the inverse of its total inertia will give the same scale of differentiation for each marker. These two weighting options are available in the ade4 package. Moreover, thanks to the flexibility of the method, the user may supply any weighting scheme of his/her own choice, which could be based, for instance, on the number of alleles of the marker.
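The two weighting options can be sketched as follows. The inertia-based scheme is read here as a weighting by the inverse of each marker's total inertia, which is an assumption of this sketch; the inertia values are hypothetical and the function names do not come from ade4.

```python
def uniform_weights(n_markers):
    """Same weight for every marker."""
    return [1.0 / n_markers] * n_markers


def inverse_inertia_weights(total_inertias):
    """Weights that bring every marker to the same scale of
    differentiation (assumed reading of the inertia-based option)."""
    if any(inertia <= 0.0 for inertia in total_inertias):
        raise ValueError("total inertias must be strictly positive")
    return [1.0 / inertia for inertia in total_inertias]


# Hypothetical total inertias of three markers.
total_inertias = [2.0, 0.5, 1.0]

w_uniform = uniform_weights(len(total_inertias))
w_inertia = inverse_inertia_weights(total_inertias)

# After weighting, each marker carries the same amount of inertia.
weighted_inertias = [w * i for w, i in zip(w_inertia, total_inertias)]
```

With the inverse-inertia scheme a weakly differentiated marker is scaled up and a strongly differentiated one is scaled down, so the choice matters mostly when markers of different types are mixed.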
Separate coordinated plots show how the markers separate the populations regarding a common structure, while superimposed plots visually address the discrepancies among the common structure and one single-marker structure.
The quantitative measure of typological value includes two aspects: the ability to perform a typology of populations and the degree of congruence with the reference. Population structure is more easily exhibited using markers with high typological values, than using those with low values. We show that efficient markers in one collection of populations do not remain efficient in others. Typological values of markers are structure-dependent. When strongly different populations such as French and African populations are considered, all markers roughly equally reproduce the main features of the typology. However, this is not the case for closely related populations because only a few markers reproduce the reference typology. Thus, caution is needed in evaluating populations based on molecular studies if a small number of efficient loci are used. These results contradict the idea [61,62] that increasing the number of markers will increase the reliability of the typology analysis: quantity is not quality.
As such, a marker selection method based on the typological value should select an efficient, not to say the most efficient, subset of markers for exhibiting a consensus population structuring. In this respect, a general algorithm, and particularly stopping rules for determining an optimum number of selected markers, should be investigated, as in [38,40] or [66] in a classical PCA context. Towards a quality process, it is important to check data (sampling strategy, DNA, experimental protocol, tracking of genotyping errors [53], standardization of data), tools (choice of markers [58]), methods (suitability of the method to the data and scientific goal [61,71]) and the computer programs (well established and recommended by experts [21,32]). This process has been initiated in livestock species by FAO guidelines [24], including recommended ISAG/FAO sets of genetic markers for domestic species. In this respect, MCOA should play a major role in the choice of panels of markers, which is essential for an efficient design of population genetic analyses of species. A large number of genetic diversity studies for livestock species have been carried out: some concern livestock from a single country [23,41,67], while others have examined the diversity and distribution of livestock at the regional level [13,22,26] or even at the scale of nearly an entire continent or all over the world [16,28,31,63]. Since such studies are still continuing and have financial constraints, it is important to have a measure that permits the elimination of inefficient markers from studies. If no previous data are available, another application of the MCOA is to study a subset of the populations, and remove the least informative markers when completing the analysis. Luikart et al. [44] advocate the importance of identifying "outlier loci" to avoid biased estimates of population parameters.
In that respect, MCOA and typological values should also be efficient tools for differentiating neutral markers from markers likely to be under selection, for selecting a subset of markers, or for comparing the degree of differentiation at neutral marker loci and at genes coding for quantitative traits [58,64].