Effects of quantitative and qualitative principal component score strategies on the structure of coffee, rubber tree, rice and sorghum core collections

The principal component score strategy (PCSS) is a selection method, based on multivariate analyses, proposed for building core collections from large collections of genetic resources. The method, originally described for quantitative data, is adapted here to qualitative data of the molecular type. Both methods were tested for their impact on the structure of four tropical plants: coffee, rubber tree, rice and sorghum. The results show that, in all cases, the cumulative relative contribution (CRC) increases very rapidly, but at a rate that differs from one species to another. Ten percent of the total collection yields 22 to 58 % of the CRC. As expected, the variability of quantitative characters in the samples is modified little or not at all when selection is qualitative, but is strongly modified by quantitative selection. Qualitative selection appears to be the most effective for conserving rare alleles and increasing overall diversity, with limited effects at the quantitative level. The use of very different species made it possible to compare the respective impacts of the two methods and to highlight the advantages of a selection combining both types of approach.


INTRODUCTION
One of the major issues that gene-bank managers (curators) face is the need to increase the accessibility of their collection to a large group of users. Indeed, the sheer size of many collections, the low degree of characterisation of the accessions and the poor efficiency of data management systems often lead to collections that are difficult to use effectively. Frankel and Brown (1984) first proposed that one way of alleviating this problem lay through the development of core collections (CCs), defined as combining 'the genetic diversity of a crop species and its relatives with minimum repetitiveness'.
The first procedures for establishing CCs were based on neutral characters. Brown (1989b) suggested that a random (R) sampling strategy of 10 % of accessions from the base collection (BC) should yield, for rare but widespread alleles, over 70 % of the variation in the BC. Brown (1989a) introduced hierarchical sampling, which assumes that the BC can be structured into groups. These groups, or clusters, could be based on different types of data such as genetics, ecogeography or country of origin. According to the hierarchical sampling strategies proposed by Brown, each subgroup size is related to the initial group size according to different strategies: 1) a constant number of accessions per group (C strategy); 2) a number of accessions in proportion to the group size (P strategy); or 3) a number of accessions in proportion to the logarithm of the group size (L strategy). Because these strategies were based on the number of accessions in a given group and did not utilise population genetics information, Schoen and Brown (1993) suggested two additional strategies, H and M. The H strategy (heterozygosity) maximises Nei's genetic diversity index, whereas the M strategy (maximisation) is based on maximising the allele diversity in the core collection. In an empirical examination, these authors ranked the strategies from the highest to the lowest expected allele retention, as follows: M > H > P > L > C > R. Bataillon et al. (1996), using computer simulation, confirmed that the M strategy works well in maximising the non-neutral diversity of an autogamous species, or a species subdivided into genetically isolated populations. Reports of the practical applications of the H and M strategies have not yet been published. One major reason is the lack of base collections that are fully characterised at the molecular level.
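The three size-based allocation rules (C, P and L) can be illustrated with a short sketch. The function name `allocate`, the group sizes and the core size below are hypothetical and chosen only for illustration; rounding the quotas to whole accessions is left open because the text does not specify it.

```python
import math

def allocate(group_sizes, core_size, strategy):
    """Split core_size accessions among groups under the C, P or L strategy."""
    if strategy == "C":      # constant number per group
        weights = [1.0 for _ in group_sizes]
    elif strategy == "P":    # proportional to group size
        weights = [float(n) for n in group_sizes]
    elif strategy == "L":    # proportional to the logarithm of group size
        weights = [math.log(n) for n in group_sizes]
    else:
        raise ValueError("strategy must be 'C', 'P' or 'L'")
    total = sum(weights)
    # Real-valued quotas; how to round them is an open choice (assumption).
    return [core_size * w / total for w in weights]

# Hypothetical base collection of 1110 accessions in three groups.
group_sizes = [1000, 100, 10]
for s in "CPL":
    print(s, [round(q, 1) for q in allocate(group_sizes, 111, s)])
```

With these illustrative sizes, the L strategy gives small groups a larger share than the P strategy does, while remaining less uniform than the C strategy.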
Multivariate methods, based on quantitative data, have been developed for the purpose of forming CCs. The approach was first introduced by Spagnoletti-Zeuli and Qualset (1993) using a three-step procedure: 1) groups were defined using cluster analysis; 2) within each group factor scores of each accession were computed using discriminant analysis; and 3) accessions were randomly sampled in zones delimited by factorial scores. As a result, the variance of quantitative characters in the CC was maximised. Basigalup et al. (1995) compared eight putative CC strategies, and found similar results. Zhong and Qualset (1995) suggested the use of a generalised coefficient of phenotypic variation (GCPV), calculated on the basis of the coefficient of variation within and between populations. Mahajan et al. (1996) tested the Shannon diversity index (SDI), which is adapted to morphological qualitative characters. The results suggest that, given rather complete data, principal component and cluster analysis are useful tools for grouping and selecting accessions used to build CCs. In fact, random sampling within groups leads to increases in the global variance. A theoretically more efficient method, the principal component score strategy (PCSS), was described by Noirot et al. (1996). In the PCSS, groups are defined on the basis of possible gene flow and lack of reproductive barriers. In a given group, the diversity of the CC is maximised after elimination of colinearity between variables, then accessions are selected according to their cumulated relative contribution (CRC). The CC size can be defined by determining the number of accessions or by fixing a CRC value.
Whatever the strategy used, the creation of CCs is intended to improve the efficiency of both conservation and utilisation of genetic resources. Recent experience suggests that CCs do, in fact, aid the end-user in discovering useful traits with fewer accessions to screen. As an example, Bouton (1996) found nearly the same frequency of acid soil tolerance in both BCs and CCs of alfalfa (Medicago sativa). The white clover (Trifolium repens) CC was also representative of the BC for total cyanogenesis (Pederson et al., 1996).
In addition, a two-stage screening approach for resistance to late leafspot in the peanut (Arachis hypogaea) CC clearly demonstrated that this CC can be used to improve the efficiency of peanut germ-plasm evaluation (Holbrook and Anderson, 1995). Despite this, some authors have advocated the development of a few situation-specific CCs, or subsets of CCs, rather than a single core from a base collection (Mackay, 1995; Rana and Kochhar, 1996).
In this report, we present 1) an adaptation of the PCSS strategy to qualitative data; 2) the CRC for quantitative and qualitative data; 3) the differences observed when used with four crops that have contrasting biology; 4) the impact of the selections on the final allelic composition; and 5) the modification induced (or not) for means and variances of the quantitative descriptors.

Genetic background of the four crops studied
The evolutionary relationships among cultivars of the four crops tested are summarised here. For rice, two sub-specific groups have been recognised for centuries in China. They were found to reflect the species structure in most other regions of the world (Oka, 1958) and have been named indica and japonica. Isozyme diversity confirmed the existence of the two major types (Second, 1982), which might be related to a 2 to 3 million year differentiation between two populations of wild rice, followed by two independent domestications (Second, 1985). A more precise analysis of isozymic diversity among Asian cultivars (Glaszmann, 1987) showed that several other specific types coexist with the two major groups. Their evolutionary origin is still unclear. This structure of the species was largely confirmed by subsequent analyses with molecular markers, including 1994; Mackill, 1995) and restriction fragment length polymorphisms (RFLPs) (Wang and Tanksley, 1989; Second and Ghesquière, 1994).
Cultivated sorghum forms are all included in the African species Sorghum bicolor and constitute the S. bicolor ssp. bicolor subspecies. They are monoecious, preferentially self-pollinating and exhibit great phenotypic diversity. A simpler classification than that of Snowden (1936) was proposed by Harlan and de Wet (1972) using two morphological criteria: spikelet structure and panicle shape. Five basic races (bicolor, caudatum, durra, kafir and guinea) and ten intermediate races (representing intermediates between two races) have been defined. A quantitative study, involving morphological and physiological traits, led to a classification with three groups characterised by different cropping performances (Chantereau et al., 1989). Isozymic markers do not discriminate the races but highlight a geographical structuring (Morden et al., 1989; Ollitrault et al., 1989). The variation of sorghum cultivars grown throughout the world is included in that of African forms. Nuclear deoxyribonucleic acid (DNA) diversity revealed a racial differentiation and a subrace division within the guinea race (Deu et al., 1994, 1995; de Oliveira et al., 1996).
Cultivated rubber tree (Hevea brasiliensis) is an allogamous diploid species (2n = 36), originating from the Amazon basin. All the elite cultivars (grafted clones) were selected from the few seeds introduced in Southeast Asia at the end of the nineteenth century (Wycherley, 1979). Significant germ-plasm collections were constituted recently with H. brasiliensis accessions surveyed in three Brazilian states (Acre, Rondonia and Mato Grosso). The IDEFOR International Conservation Centre in the Ivory Coast encompasses 2 423 trees surveyed in 1981 in 16 districts of these three states (IRRDB collection; Chapuset et al., 1995). Important agronomic evaluations were carried out on this collection (Chapuset et al., 1995) as well as genetic diversity studies using morphological traits (Nicolas et al., 1988), isozymes (Seguin et al., 1995), nuclear RFLP (Besse et al., 1994) and mitochondrial RFLP markers (Luo et al., 1995). Molecular markers revealed four differentiated genetic groups in accordance with the geographic origin of the accessions, despite the predominance of the genetic diversity at the intragroup level (Seguin et al., 1996). A slight agronomic difference is also observed between the four molecular genetic groups (Chapuset et al., 1995).
Coffee 1988). All species are woody, ranging from small shrubs to robust trees, and originate from the intertropical forests of Africa and Madagascar. Phylogenic relationships among species of coffee are well studied (Lashermes et al., 1996, 1997). Commercial coffee production relies on two species only, C. arabica L. and C. canephora Pierre. Since 1975, more than 1 000 C. canephora genotypes have been collected by ORSTOM and CIRAD, in collaboration with IBPGR and FAO, in five African countries: Guinea and the Ivory Coast in West Africa, and Congo, Central African Republic and Cameroon in Central Africa (Berthaud and Charrier, 1988). A base field collection was established in the Ivory Coast (IDEFOR-DCC, Divo) to conserve the germ-plasm collected. An isozymic evaluation of the diversity, connected with intercrossing behaviour studies and morphological descriptions, revealed evidence of two genetic groups, the guinean and congolian (Berthaud and Charrier, 1988).
The four crops used represent four contrasting cases that are relevant to testing of our methodologies: rice and sorghum are annual and autogamous, whereas coffee (C. canephora) and rubber tree are perennial allogamous species; rice and C. canephora display a strong structure in specific groups, whereas sorghum and rubber tree display only a weak structure.

MATERIALS AND METHODS
Within-population diversity is determined by the level of between-individual differences in one or more traits. Generally, quantitative traits are of heterogeneous type. In order to give the same contribution (weight) to each trait j, the Euclidean distance is weighted by the reciprocal of the standard deviation $\sigma_j$.
The distance $d_{ik}$ between two individuals i and k for the J quantitative traits is defined by the following formula:
$$d_{ik} = \sqrt{\sum_{j=1}^{J} \frac{(x_{ij} - x_{kj})^2}{\sigma_j^2}}$$
where $x_{ij}$ and $x_{kj}$ are the observed values of the trait j on the individuals i and k, respectively. The between-individual distance is directly related to the number of differences. If traits are highly correlated (positively or negatively), this may lead to overestimation of the distance between individuals. To avoid the effect of colinearity among traits, principal component analysis was applied to standardised data, to yield J statistically independent and centred variables, or factors. The distance between two individuals i and k for the J factors is computed using a similar formula:
$$d_{ik} = \sqrt{\sum_{j=1}^{J} \frac{(z_{ij} - z_{kj})^2}{\lambda_j}}$$
where the square root of the eigenvalue $\lambda_j$ allows weighting, and where $z_{ij}$ and $z_{kj}$ are the scores of the individuals i and k, respectively, on the factor j. Such a procedure takes into account all factors with the same weight, including residual components (the result of chance or notation errors), in distance estimation.
Removal of factors for which the eigenvalue is below one is arbitrarily applied to eliminate this disadvantage.
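The two weighted distances can be sketched as follows. The function names, the toy data and the use of the population standard deviation are assumptions made for illustration; the factor scores and eigenvalues are taken as given rather than computed, and the eigenvalue threshold of one follows the rule stated above.

```python
import math
import statistics

def weighted_distance(xi, xk, sds):
    """Euclidean distance with each trait difference divided by its standard deviation."""
    return math.sqrt(sum(((a - b) / s) ** 2 for a, b, s in zip(xi, xk, sds)))

def factor_distance(zi, zk, eigenvalues, min_eigenvalue=1.0):
    """Same distance on factor scores, dropping factors whose eigenvalue is below one."""
    return math.sqrt(sum(
        (a - b) ** 2 / lam
        for a, b, lam in zip(zi, zk, eigenvalues)
        if lam >= min_eigenvalue
    ))

# Hypothetical data: two traits in different units (height in cm, weight in g).
rows = [[150.0, 2.0], [160.0, 2.5], [170.0, 3.0]]
sds = [statistics.pstdev(col) for col in zip(*rows)]
d_cm = weighted_distance(rows[0], rows[1], sds)

# Expressing height in mm rescales the trait and its standard deviation alike,
# so the weighted distance does not change.
rows_mm = [[10 * h, w] for h, w in rows]
sds_mm = [statistics.pstdev(col) for col in zip(*rows_mm)]
d_mm = weighted_distance(rows_mm[0], rows_mm[1], sds_mm)
print(d_cm, d_mm)
```

The comparison of `d_cm` and `d_mm` shows why the weighting gives every trait the same contribution regardless of its unit of measurement.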
The generalised sum of squares (GSS) of a set of N individuals in the factorial space of K standardised (mean = 0; variance = 1) and independent (correlation coefficient = 0) variables is equal to the product N.K (Lebart et al., 1977). The contribution $P_i$ of the individual i to the GSS is equal to the sum of the squares of its K new scores:
$$P_i = \sum_{j=1}^{K} \frac{z_{ij}^2}{\lambda_j}$$
where $z_{ij}$ is the score of the individual i on the factor j and $\lambda_j$ is the corresponding eigenvalue. The relative contribution $CR_i$ of the individual i to the GSS of the set is given by:
$$CR_i = \frac{P_i}{N K}$$
Preserving the greatest variability is equivalent to maximising the score of the subset of sampled individuals using a GSS estimator. The first step consists of keeping, as the initial subset, the individual farthest from the centre of the set, i.e. the individual with the highest relative contribution. Iterative selection of individuals that maximise subset variability increases subset size and provides a core collection. At each iteration, the cumulative GSS of the subset (expressed as a percentage of the total GSS) is calculated. The procedure can thus be stopped according to either the subset size or the GSS expressed in %. The two criteria can be taken into account simultaneously. In this case, the first criterion to be reached defines the stopping point for sampling.
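The selection loop can be sketched as below, under a simplifying assumption: individuals are ranked once by their contribution to the GSS measured from the centre of the whole set, and the CRC is accumulated in that order. The function name `pcss_select` and the toy scores are hypothetical, and the procedure used in the study may differ in how each iteration is evaluated.

```python
import math

def pcss_select(scores, max_size=None, min_crc=None):
    """Rank individuals by contribution to the GSS and accumulate the CRC.

    scores: one row per individual of standardised, independent factor scores.
    Returns (selected indices, CRC in % after each selection).
    """
    contrib = [sum(v * v for v in row) for row in scores]   # contribution of each individual
    gss = sum(contrib)          # equals N.K when the scores are standardised
    order = sorted(range(len(scores)), key=lambda i: contrib[i], reverse=True)
    selected, crc, running = [], [], 0.0
    for i in order:
        # Stop at whichever criterion is reached first.
        if max_size is not None and len(selected) >= max_size:
            break
        if min_crc is not None and crc and crc[-1] >= min_crc:
            break
        running += contrib[i]
        selected.append(i)
        crc.append(100.0 * running / gss)
    return selected, crc

# Hypothetical single-factor scores with mean 0 and variance 1 (N = 4, K = 1).
a, b = math.sqrt(1.8), math.sqrt(0.2)
scores = [[a], [b], [-a], [-b]]
selected, crc = pcss_select(scores, min_crc=50.0)
print(selected, [round(c, 1) for c in crc])
```

In this toy set, the two individuals farthest from the centre are taken first, and the 50 % threshold is passed as soon as the second one is added.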

Qualitative PCSS
The method just described was adapted here to qualitative data. Changes concern only the first step of the PCSS. As for quantitative data, between-variate relationships can also exist; for example, two molecular markers that are highly linked on a chromosome. In order to avoid between-variate relationships and to give the same weight to independent markers, a multivariate method was used to transform initial data into factor scores. Factorial analysis of correspondence (Benzécri, 1972) was adopted here.
The method uses the $\chi^2$ distance instead of the Euclidean distance. A complete disjunctive table has to be used in this case. In this table, the presence and absence of an allele are considered as two different variates taking 0 or 1 as values. With p molecular markers observed on N individuals, we obtain a $2p \times N$ table. Consequently, all individuals show the same margin frequencies equal to p. In addition, the term $p\lambda_i$ ($\lambda_i$ is the eigenvalue of factor i) is equal to the sum of the correlation ratios of the factor with the p variates (Saporta, 1990). This term is equivalent to the eigenvalue observed in principal component analysis. The sum of the $p\lambda_i$ is equal to the number of markers (for the principal component analysis on quantitative data, the sum of eigenvalues is equal to the number of variates). As for quantitative data, factor scores are weighted. For qualitative data, weights are the square root of the respective $p\lambda_i$. Other steps are the same.
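A minimal sketch of the complete disjunctive coding follows. The function names and the toy presence data are hypothetical. The table is stored with one row per individual (the transpose of the layout described above), and the distance shown is the usual chi-square distance between row profiles, given here as an assumption about how it applies to this table rather than as the software's implementation.

```python
def disjunctive_table(presence):
    """presence[i][m] is 1 if individual i shows marker m, else 0.

    Returns one row per individual with 2p columns: (present, absent) per marker.
    """
    table = []
    for row in presence:
        coded = []
        for v in row:
            coded.extend([1, 0] if v else [0, 1])
        table.append(coded)
    return table

def chi2_distance_sq(table, i, k):
    """Squared chi-square distance between the profiles of individuals i and k."""
    n, cols = len(table), len(table[0])
    p = cols // 2
    col_totals = [sum(row[j] for row in table) for j in range(cols)]
    # Columns that are empty in the whole table carry no information and are skipped.
    return (n / p) * sum(
        (table[i][j] - table[k][j]) ** 2 / col_totals[j]
        for j in range(cols) if col_totals[j] > 0
    )

# Hypothetical data: three individuals scored for two markers.
presence = [[1, 0], [1, 1], [0, 0]]
table = disjunctive_table(presence)
print([sum(row) for row in table])        # every individual has margin p
print(chi2_distance_sq(table, 0, 1))
```

Because each difference is divided by the column total, a difference on a rare allele weighs more than the same difference on a common one.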
Software implementing both algorithms was written in Visual Basic (Microsoft copyright). Data were recorded as an Excel sheet (Microsoft copyright), and both algorithms were made available in the 'tools' option of the main menu of Excel.

Evolution of the CRC
For the four species, the initial data were first used to perform multivariate analysis: principal component analysis on the quantitative data and factorial analysis of correspondence on the qualitative data. Then, for each accession, the multivariate scores were used to perform quantitative PCSS (Quant PCSS) or qualitative PCSS (Qual PCSS). For each species, the CRC during the selection process was recorded as a function of the relative size of the initial collection.
For each crop, we selected two CCs at an arbitrary CRC level of 50 %, using Quant PCSS and Qual PCSS: the quantitative CC (Core Quant) and the qualitative CC (Core Qual). With this constraint, the selected samples differed in size: coffee (21/15), rubber tree (30/29), rice (68/29) and sorghum (40/70), where the first number is the size of the Core Quant and the second that of the Core Qual.

Allelic retention and genetic diversity
The allelic frequencies (f) in the BC were calculated and five categories were then defined as follows: f < 5 %, 5 % < f < 10 %, 10 % < f < 20 %, 20 % < f < 40 % and f > 40 %. Because the samples selected with the constraint of CRC = 50 % differed in size, we defined the allelic retention index as the number of alleles found in at least one individual of the core subset for a given category.
The various subsets were also compared to the initial sample on the basis of the global diversity (Nei's diversity index; Nei, 1978).
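Both measures can be sketched in a few lines. Several choices below are assumptions made for illustration: allele frequency is taken as the fraction of individuals showing the allele, a boundary value is assigned to the upper category, and the diversity index is the plain form (one minus the sum of squared allele frequencies, averaged over loci) without a small-sample correction. All names and data are hypothetical.

```python
def frequency_class(f):
    """Frequency category of an allele, with f given as a fraction."""
    bounds = [(0.05, "<5%"), (0.10, "5-10%"), (0.20, "10-20%"), (0.40, "20-40%")]
    for upper, label in bounds:
        if f < upper:
            return label
    return ">40%"

def allelic_retention(presence, core):
    """Per frequency class: (alleles in the base collection, alleles kept in the core).

    presence: one row per individual, one 0/1 column per allele.
    core: indices of the individuals retained in the core subset.
    """
    n = len(presence)
    counts = {}
    for a in range(len(presence[0])):
        f = sum(row[a] for row in presence) / n
        if f == 0:
            continue                      # allele absent from the base collection
        label = frequency_class(f)
        total, kept = counts.get(label, (0, 0))
        retained = any(presence[i][a] for i in core)
        counts[label] = (total + 1, kept + int(retained))
    return counts

def nei_diversity(allele_freqs_by_locus):
    """Mean over loci of 1 - sum(p^2), without small-sample correction."""
    per_locus = [1.0 - sum(p * p for p in freqs) for freqs in allele_freqs_by_locus]
    return sum(per_locus) / len(per_locus)

# Hypothetical base collection of 20 individuals and two alleles.
presence = [[1, 1]] + [[0, 1]] * 9 + [[0, 0]] * 10
print(allelic_retention(presence, core=[1, 2]))
print(nei_diversity([[0.5, 0.5], [1.0]]))
```

In the toy data, a core that misses the single carrier of the first allele loses it, whereas the common allele is retained.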

Plant data used
For the four species, a subsample of the collection was extracted to best represent the genetic diversity and was considered in the study as the BC (base collection).

Rice
The selected BC consists of 270 accessions from the world collection maintained at the IRRI. Characters used to define the BC included geographic origin, the culture type and the position in the isozyme classification. Two hundred sixty-five accessions in the BC were characterised with isozymes at 15 loci, as described by Glaszmann and colleagues (1987, 1988). In all, 49 alleles were observed. Two hundred fifty-six accessions were described for 11 morphological traits: seedling height (SDHT), leaf length (...

Sorghum
The selected BC consists of 347 accessions from the CIRAD collection. Ten enzymatic systems corresponding to 14 polymorphic loci were revealed for 347 accessions by Ollitrault et al. (1989)

Coffee
The BC consists of 73 wild and 62 cultivated Coffea canephora accessions which are maintained in a field collection kept in the Ivory Coast (IDEFOR-DCC, Divo). Genomic DNA was isolated from lyophilised leaves through a nuclei isolation step. Restriction enzyme digestion, gel electrophoresis, alkaline transfer, nonradioactive digoxigenin-labelling of DNA probes and Southern hybridisation were carried out as previously reported by Lashermes et al. (1995). Twelve single-copy nuclear genomic clones from a C. arabica (cultivar N39) pUC18-PstI library were used as probes following digestion by either EcoRI or DraI. Ninety of the 135 accessions were evaluated for ten quantitative traits: the annual yield (YIEL); six foliar morphology traits (leaf length LEAL, leaf width LEAW, leaf shape LEAS, leaf area LEAA, acumen length ACUM and petiole length PETL); two fertility traits (Caracoli bean rate CARA and outturn OUTT); and one bean technological characteristic, the 100-seed weight (SEWE).
... the BC. Altogether, these results show that in the initial phase the PCSS gave a rapid gain, but differences occurred among the crops.
The selection patterns using qualitative data (figure 1b) were similar for three of the four crops. For coffee, rice and sorghum, 50 % CRC was reached with 10 % of the BC and 60 % CRC with 20 % of the BC. The rubber tree curve was slightly different: only 30 % CRC was reached with 10 % of the BC and 45 % CRC with 20 % of the BC. These results show that, whatever the crop, as for the quantitative data, the PCSS gave a rapid gain in variability but the rate depends on the crop.

Allelic retention in the core subsets
For each crop, using Quant PCSS and Qual PCSS, two CCs were selected at CRC = 50 %: the Core Quant and the Core Qual. With this constraint, the selected samples differed in size: coffee (21/15), rubber tree (30/29), rice (68/29) and sorghum (40/70), where the first number is the size of the Core Quant and the second that of the Core Qual.
Using the quantitative selection (table Ia), the number of alleles lost was 11 (33 %) for coffee, six (12 %) for rice, three (6 %) for rubber tree and four (11 %) for sorghum. All the alleles that were lost in the core set had an initial frequency lower than 0.05. Using the qualitative selection (table Ib), the number of alleles lost was two (6 %) for coffee, one (2 %) for rice, one (2 %) for rubber tree and one (2 %) for sorghum. Again, all of them had an initial frequency lower than 0.05.
The global diversity indices calculated using the allele frequencies at all loci gave another perspective. The allele frequency change (data not shown) was significant in several instances after selection based on quantitative data. For rice and sorghum, the average diversity was not affected (table Ia) because changes were in both directions; for rubber tree, the average diversity notably increased because most changes were in the same direction. After selection based on qualitative data, the allele frequencies were more markedly affected. Many loci displayed significant changes, almost all towards an increased diversity; the frequency of the rare alleles increased. For all crops the average diversity index was thus higher than before the selection. Rubber tree had a distinct response as compared to other crops: the selection based on quantitative traits resulted in a limited loss of rare alleles and a global increase in molecular diversity.

Variance and mean differences between core subsets and the initial collection
For the four crops, the variance homogeneity was tested (Levene test) between the BC, Core Qual and Core Quant. Means were compared (Bartlett test) when homogeneity of variances allowed the comparison.
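For readers unfamiliar with the variance homogeneity test, the Levene statistic can be computed as sketched below. This is the mean-centred form of the statistic, which is an assumption since the variant used in the study is not stated; the function name and the toy samples are hypothetical, and the significance level (which requires the F distribution) is not computed.

```python
def levene_w(*groups):
    """Levene W statistic using absolute deviations from each group mean."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    z = []
    for g in groups:
        m = sum(g) / len(g)
        z.append([abs(v - m) for v in g])          # absolute deviations
    z_means = [sum(zi) / len(zi) for zi in z]
    z_grand = sum(sum(zi) for zi in z) / n_total
    between = sum(len(zi) * (zm - z_grand) ** 2 for zi, zm in zip(z, z_means))
    within = sum((v - zm) ** 2 for zi, zm in zip(z, z_means) for v in zi)
    # Degenerate samples with no spread in the deviations are not handled here.
    return (n_total - k) / (k - 1) * between / within

# Hypothetical samples: same spread with shifted mean, then doubled spread.
w_same = levene_w([1.0, 2.0, 3.0], [11.0, 12.0, 13.0])
w_diff = levene_w([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(w_same, w_diff)
```

A shift in the mean alone leaves the statistic at zero, whereas a change in spread increases it, which is why the test isolates variance differences from mean differences.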
For sorghum, the Core Qual compared to the BC showed only one slightly heterogeneous variance and one different mean (table IIa). In contrast, the comparison of the Core Quant and the BC gave seven heterogeneous variances, and three homogeneous variances with no mean differences. The three other crops, rubber tree (table IIb), rice (table IIc) and coffee (table IId), revealed similar situations. In most cases, as expected, variances were homogeneous when the BC was compared to the Core Qual and mostly heterogeneous when the Core Quant was compared to the BC. Nevertheless, this was not systematic and reciprocal situations were found.
... the variances of the characters were not greatly modified (5 of 40) and seven means were slightly different.
Table Ia. Selection of core subset accessions for four different crops using the PCSS strategy on quantitative data. The selected number of alleles are reported according to their frequency in the initial collection and to their presence in the subset.
Table IIa. Comparison, for the sorghum case, of the variances (Vardiff) and means (MeanDiff) of ten traits between the BC and the qualitative subset (Core Qual) and between the BC and the quantitative subset (Core Quant).

Distribution differences in core subsets
For four variables, arbitrarily selected in the coffee example, the distributions of the entire collection, and both the Core Qual and Core Quant, are shown in figure 2a-d.
These distributions clearly show that, whatever the shape of the curve in the BC, the final profile is more or less regular along the x axis. Most of the redundancy is eliminated. In table IId, for CARA, there was no significant difference between the Core Qual, the Core Quant and the BC, and in figure 2b it is clear that Core Qual and Core Quant have similar distributions. Conversely, for LOFLA and SEWE, for which the variances were different, it seems that these distributions are also different.

DISCUSSION
Frankel and Brown (1984) defined the core collection as a collection which is expected to reduce repetitiveness. Consequently, the core should not be a photocopy, in reduction, of the global collection, but a new organisation with a maximum variability in a minimum size. The distribution diagrams for the quantitative traits were illustrative of this objective. The full amplitudes of the variation were represented and the tops of the bell curves were eliminated.
The impact of the PCSS on the diversity for molecular markers was uneven among the crops. For three of them, some rare alleles were lost, in a proportion comparable to that expected through random selection; the global diversity, however, did not seem to be affected. For the fourth crop, rubber tree, most alleles were retained and the global diversity increased, suggesting that the two types of diversity were related. The PCSS selection on the basis of markers, in turn, had an obvious impact on the marker diversity, essentially by over-representing the rare alleles. The impact on quantitative traits was limited to slight deviations for a few characters.
The association between the two types of characters is responsible for cross-impacts of one PCSS basis on the other type of character. Two contrasting cases are worth examining in more detail.
In terms of the mating system, there is rice with autogamy and a marked structure of varietal groups. The PCSS on quantitative traits resulted in an overrepresentation of the japonica group, which hosts a high morphological diversity between tropical forms (so-called Javanica) and temperate forms. This in turn modified the marker frequencies in favour of the alleles predominant in this group, sometimes causing an increase in the diversity index and sometimes a decrease. The PCSS on qualitative traits resulted in an overrepresentation of minor varietal groups that have several rare alleles. Thus, both strategies led to a divergence between the core subsets.
The other extreme is rubber tree, with allogamy and continuous geographic distribution in the Amazon basin. In this case, both PCSS converged to an increase in marker diversity.
The above contrasts illustrate the potential differential impacts of one method or another, depending on the biology and the evolutionary history of the species.
The example of rice can be used to examine the status of the dilemma.
A selection on the single basis of morphological traits will poorly sample types corresponding to the minor marker-based groups that were shown to hold alternative sources of factors for resistance to the major disease of rice (Glaszmann et al., 1996). Conversely, a selection on the single basis of molecular traits will leave little room for covering the wide ecogeographical adaptation of a group such as japonica, which is much more associated with morphological diversity than with markers; only the use of a large number of markers would reveal the differentiation within japonica.
On the other hand, the example of rubber tree illustrates the efficiency of the PCSS on quantitative data in enriching the diversity for both types of traits. When the variation is continuous, a selection on the basis of traits related to the use of the crop, which are of primary interest for the breeders, will be of little detriment to genetic diversity as a whole.
It has already been said that the PCSS should be applied after distinguishing clusters of materials that are separated by restrictions to recombination (Noirot et al., 1996). Molecular markers are best suited to reveal such restrictions, be they due to partial reproductive barriers or to factors such as geographic or seasonal isolation. As an example, an extensive AFLP analysis of the wild bean gene pool provided insights into the genetic structure of the bean CC that were not possible by another approach (Tohme et al., 1996). The difficulty lies in the necessity to split a collection into clusters when the variation is generally continuous between the clusters, when clusters are bridged by local interfaces exhibiting gene flow and introgression and when the determination of the thresholds is essentially arbitrary.
Performing an appropriate classification requires a considerable amount of information that is seldom available. Therefore, the challenge for core-sampling strategies is to make the best use of the data available and to combine information of various kinds in a refined manner. This is obviously an avenue for future research, which will adapt the strategies to a wide array of biological situations.