Sequence heterogeneity and phylogenetic relationships between the copia retrotransposon in Drosophila species of the repleta and melanogaster groups

Although the retrotransposon copia has been studied in the melanogaster group of Drosophila species, very little is known about copia dynamism and evolution in other groups. We analyzed the occurrence and heterogeneity of the copia 5'LTR-ULR partial sequence and their phylogenetic relationships in 24 species of the repleta group of Drosophila. PCR showed that copia occurs in 18 out of the 24 species evaluated. Sequencing was possible in only eight species. The sequences showed a low nucleotide diversity, which suggests selective constraints maintaining this regulatory region over evolutionary time. On the contrary, the low nucleotide divergence and the phylogenetic relationships between the D. willistoni/Zaprionus tuberculatus/melanogaster species subgroup suggest horizontal transfer. Sixteen transcription factor binding sites were identified in the LTR-ULR repleta and melanogaster consensus sequences. However, these motifs are not homologous, neither according to their position in the LTR-ULR sequences, nor according to their sequences. Taken together, the low motif homologies, the phylogenetic relationship and the great nucleotide divergence between the melanogaster and repleta copia sequences reinforce the hypothesis that there are two copia families.


INTRODUCTION
Ty1-copia is present as a highly heterogeneous group of retrotransposons within all higher eukaryotes [10,21]. The Drosophila retrotransposon copia, which is structurally similar to retroviral proviruses, is 5.4 kb in length and flanked by 276-bp direct long terminal repeats (LTR). According to the review of Biémont and Cizeron [2], copia sequences were identified by PCR and Southern blot analysis in 52 different species of the genus Drosophila. Twenty-two species out of this total belong to the melanogaster group, seven to the willistoni group, seven to the obscura group, six to the saltans group, two to the immigrans group, one to the mesophragmatica group, and one to the pinicola group. Although the retrotransposon copia is harbored by the genome of these 52 species, nucleotide sequences have been described only for eight species of the melanogaster group, two species of the repleta group, D. willistoni and Zaprionus tuberculatus. Drosophila copia phylogeny studies are difficult to carry out, not only because of the low number of copia sequences in the group, but also because these sequences are partial, most of them covering only the 5' long terminal repeat (LTR) and the untranslated leader region (ULR).
The 5' LTR-ULR contains sequences responsible for controlling copia transcription, which is a rate-limiting step in the retrotransposition process [3]. The 5' LTR contains promoter sequences and the transcription start site [19,20]. The ULR contains several repeated sequence motifs which function as enhancers [9,26,34,35,45,49]. These repeat motifs are binding sites for host regulatory proteins, and the strength of an enhancer is often positively correlated to the number of repeat motifs it contains [42]. Because of the functional importance of these regulatory sequences, the noncoding LTR-ULR sequences have been commonly used in phylogenetic studies [14,25] and in retrotransposon regulation studies [9,10,26,34,35,49].
The retrotransposon copia has been intensively studied in the melanogaster group of Drosophila species and proven to be a good model system for studying regulatory interactions between retrotransposons and their host genomes [8,10,20,26,34]. However, very little is known about copia dynamism and evolution in other species of the genus Drosophila. In order to broaden our knowledge about the evolutionary history and dynamism of this element, we analyzed the occurrence and heterogeneity of the copia 5'LTR-ULR partial sequence and their phylogenetic relationships in 24 species of three subgroups of the Drosophila repleta group and their relationships with all the corresponding copia sequences of Drosophilidae found in GenBank.

MATERIALS AND METHODS

Fly stocks
All species and strains (isofemales) used in this study are listed in Table I. Each entry includes the taxonomic nomenclature [18], location, and either stock number or collection date.

DNA amplification and sequencing
The regions of copia focused on in this study were the 5' LTR and the untranslated leader region (ULR). PCR reactions were performed in final volumes of 25 µL, using approximately 200 ng of template DNA, 100 µM of each primer, 200 µM of dNTP, 1.5 mM of MgCl2, 1.25 µL of DMSO and 1 unit of TaqBead Hot Start Polymerase (Promega) in 1× polymerase buffer. After an initial denaturation step of 5 min at 94 °C, 35 cycles consisting of 1 min at 94 °C, 1 min at 55 °C, and 1 min at 72 °C were carried out, followed by a final extension step of 15 min at 72 °C. The primers used were the following: CoBuz1 (5'-CCCNTATTCCTCCTTCAAAAA-3') and CoBuz2 (5'-CCGCGAAATTAAGAAACGAG-3'), which anneal within the LTR-ULR copia region and amplify a 615-bp fragment (nucleotides 10 to 625). These primers were designed based on the D. buzzatii (X96972) and D. koepferae (X96971) copia sequences obtained from GenBank, which contain a polymorphism between both species in the fourth position of CoBuz1. It is important to point out that the region amplified by the primers CoBuz1 and CoBuz2 corresponds to the 5' LTR-ULR region studied by Jordan and McDonald [26]. The amplified fragments were separated by electrophoresis in 1% agarose gel. The PCR products were cloned into a TA cloning vector (Invitrogen). Both strands of three randomly chosen clones were sequenced for each species, and the consensus sequence was used for the phylogenetic analysis. The copia sequences obtained in the present study were deposited in NCBI GenBank (accession numbers AY655745 to AY655750, DQ494345 and DQ494346).
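The primer and cycling details above lend themselves to a quick sanity check. The following is a minimal Python sketch, not part of the original protocol: it expands the degenerate base N at the fourth position of CoBuz1 into its four concrete variants, and it computes the nominal cycling time and the DMSO concentration implied by the stated volumes. The function names are hypothetical, and the time estimate ignores ramping between temperatures.

```python
from itertools import product

# IUPAC codes needed for these two primers (N = any base).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "ACGT"}

COBUZ1 = "CCCNTATTCCTCCTTCAAAAA"  # degenerate at position 4
COBUZ2 = "CCGCGAAATTAAGAAACGAG"


def expand_degenerate(primer):
    """Return every concrete sequence encoded by a degenerate primer."""
    choices = (IUPAC[base] for base in primer)
    return ["".join(bases) for bases in product(*choices)]


def program_minutes(initial=5, cycles=35, step_minutes=(1, 1, 1), final=15):
    """Nominal thermocycler hold time in minutes (ramping time not included)."""
    return initial + cycles * sum(step_minutes) + final


def percent_v_v(component_ul, total_ul):
    """Volume/volume percentage of one component in the reaction."""
    return 100.0 * component_ul / total_ul


if __name__ == "__main__":
    for variant in expand_degenerate(COBUZ1):
        print(variant)
    print("hold time (min):", program_minutes())
    print("DMSO (% v/v):", percent_v_v(1.25, 25))
```

Under these assumptions the degenerate primer corresponds to four oligonucleotides, which is consistent with a single polymorphic position between the D. buzzatii and D. koepferae reference sequences.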

Evolutionary analysis
The multiple alignments of copia consensus sequences were performed with CLUSTAL W [47]. The evolutionary relationships among copia sequences were assessed using the maximum parsimony method (branch and bound algorithm), as implemented in PAUP v.4.0b10 [46]. The distance matrix used was built according to the HKY model [23], which was determined as the best fit for the data by a likelihood ratio test using MODELTEST 2.0 [38].
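To illustrate what a model-corrected pairwise distance looks like, here is a minimal Python sketch. It is an assumption-laden stand-in: it implements the simpler Kimura two-parameter correction rather than the HKY model used in this study (HKY additionally accounts for unequal base frequencies), so its values will not reproduce those computed in PAUP. Alignment columns containing gaps or ambiguous bases are skipped.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_a, seq_b):
    """Kimura two-parameter distance between two aligned sequences.

    Simplified stand-in for the HKY distance; not the method used in PAUP.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared    # proportion of transitions
    q = transversions / compared  # proportion of transversions
    # math.log raises ValueError if the sequences are too divergent
    # for the correction to be defined.
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)


if __name__ == "__main__":
    print(k2p_distance("AAAAAAAAAA", "GAAAAAAAAA"))
```

The corrected distance is slightly larger than the raw proportion of differing sites, because the model allows for multiple substitutions at the same position.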

Neutrality tests
The levels of nucleotide diversity and number of segregating sites were determined for the copia LTR and ULR regions using the DnaSP program [40], in order to evaluate whether the sequences evolve randomly or have been subjected to functional constraints.
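The two summary statistics named above can be sketched in a few lines of Python. This is an illustrative implementation under simplifying assumptions, not the DnaSP code: the input is a gap-free alignment of equal-length sequences, nucleotide diversity (π) is taken as the mean number of pairwise differences per site, and a segregating site is any column with more than one state. The toy alignment is invented for the example.

```python
from itertools import combinations


def segregating_sites(alignment):
    """Count alignment columns that contain more than one nucleotide state."""
    return sum(1 for column in zip(*alignment) if len(set(column)) > 1)


def nucleotide_diversity(alignment):
    """Mean pairwise differences per site (pi) for a gap-free alignment."""
    n_sites = len(alignment[0])
    pairs = list(combinations(alignment, 2))
    total_differences = sum(
        sum(a != b for a, b in zip(seq_1, seq_2)) for seq_1, seq_2 in pairs
    )
    return total_differences / (len(pairs) * n_sites)


if __name__ == "__main__":
    # Hypothetical toy alignment, not data from this study.
    toy = ["ACGTACGT", "ACGTACGA", "ACCTACGA"]
    print("S  =", segregating_sites(toy))
    print("pi =", nucleotide_diversity(toy))
```

A region under stronger functional constraint is expected to show lower values of both statistics than a less constrained region of the same element.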

Identification of transcription factor binding sites
The copia 5' LTR-ULR sequence in the melanogaster group has been shown to contain motifs that are binding sites for trans-regulatory proteins and that regulate its transcriptional activity [9,10,34,35,49]. Variants of the 5' LTR-ULR with different numbers of motifs differ in their ability to drive expression [34]. In order to identify transcription factor binding sites in the LTR-ULR, the repleta and melanogaster consensus sequences were submitted to the Alibaba2 software [22] (http://www.gene-regulation.com/pub/programs.html), which is currently considered the most effective tool for predicting transcription factor binding sites in an unknown DNA sequence.
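As a rough illustration of motif scanning, the Python sketch below searches both strands of a sequence for a consensus pattern written with IUPAC codes. It is a deliberately simplified stand-in and does not reproduce Alibaba2, which builds its predictions from TRANSFAC binding-site data rather than exact consensus matching. The consensus `CCAAT` and the test sequence are placeholders chosen for the example, not motifs or sequences reported in this study.

```python
import re

# IUPAC nucleotide codes translated to regular-expression character classes.
IUPAC_REGEX = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_motif(sequence, consensus):
    """Return sorted (1-based start on the forward strand, strand) hits.

    Overlapping matches are reported. A palindromic consensus is reported
    once per strand.
    """
    sequence = sequence.upper()
    body = "".join(IUPAC_REGEX[base] for base in consensus.upper())
    pattern = re.compile("(?=(" + body + "))")  # lookahead keeps overlaps
    n, k = len(sequence), len(consensus)
    hits = [(m.start() + 1, "+") for m in pattern.finditer(sequence)]
    rc = reverse_complement(sequence)
    hits += [(n - m.start() - k + 1, "-") for m in pattern.finditer(rc)]
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical sequence and consensus, for illustration only.
    print(find_motif("GGCCAATCTATTGGAA", "CCAAT"))
```

Counting the hits per motif gives a repetition number comparable in spirit to the counts discussed for the LTR-ULR, although exact consensus matching is stricter than matrix-based prediction.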

RESULTS

Evolutionary analysis
The aim of the sequencing analysis was to investigate the natural copia LTR-ULR nucleotide variation and to propose a phylogenetic relationship between these sequences and those described in the literature. The most parsimonious unrooted tree is shown in Figure 2. From a total of 742 characters, 484 were phylogenetically informative. The consistency index was 0.7800 and the retention index was 0.9156. Branch support was calculated by bootstrap analysis consisting of 1000 replicates. Two well-defined groups of copia sequences can be seen in the tree, one containing the species of the repleta group, separated from the other sequences of the melanogaster species group. Also, the copia sequences of D. mojavensis (cluster mojavensis) and D. pachuca (cluster longicornis) constitute a monophyletic group with species of the buzzatii cluster, the first with D. buzzatii and the second with D. koepferae and D. antonietae.
The distance matrix is shown in Table II. The smallest divergence within the repleta group was 0.01 (between D. koepferae and D. pachuca) and the greatest was 0.18 (between D. koepferae and D. mojavensis). In the species of the melanogaster group, the values varied from 0.00 (between D. teissieri and D. melanogaster) to 0.06 (between D. sechellia and D. melanogaster; D. teissieri and D. sechellia). In spite of the low divergence rates within each species group, the intergroup rates between the melanogaster and repleta groups were very high. The smallest divergence was 0.61 (between D. yakuba and D. seriema; D. yakuba and D. gouveai) and the greatest was 0.82 (between D. koepferae and D. sechellia). In contrast to these high rates, the nucleotide divergence between the melanogaster subgroup and D. willistoni and Z. tuberculatus was very small: 0.00 between D. willistoni and D. teissieri, and 0.02 between Z. tuberculatus and D. yakuba. A low rate was also observed between Z. tuberculatus and D. willistoni (0.03). These results, in agreement with the phylogenetic analysis, indicate the occurrence of two significantly divergent groups of copia sequences, one carried by genomes of the species of the melanogaster group / D. willistoni / Zaprionus tuberculatus and the other by the repleta species group.

Neutrality tests
The nucleotide diversity and the number of segregating sites were calculated for the six species of the buzzatii cluster, in order to determine whether selective constraints have been imposed upon the copia LTR and ULR sequences. Our data showed that the polymorphism is 0.06718 within the LTR, with 45 segregating sites, and 0.01979 within the ULR, with 18 segregating sites. Hence, the nucleotide diversity of the LTR is 3.4 times that of the ULR, that is, the ULR is the more conserved region, which suggests that some degree of selective constraint has been imposed upon the ULR compared to the LTR.

Table II. Genetic distances between copia nucleotide sequences calculated using the HKY method (Hasegawa et al., 1985). GB: GenBank.
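The LTR-to-ULR comparison in the neutrality tests is a simple ratio of the two reported diversity values. The short Python check below reproduces that arithmetic from the figures given in the text; the variable and function names are hypothetical.

```python
# Values reported in the text for the six species of the buzzatii cluster.
PI_LTR = 0.06718  # nucleotide diversity within the LTR
PI_ULR = 0.01979  # nucleotide diversity within the ULR
S_LTR = 45        # segregating sites within the LTR
S_ULR = 18        # segregating sites within the ULR


def fold_difference(larger, smaller):
    """How many times greater the first value is than the second."""
    return larger / smaller


if __name__ == "__main__":
    print("pi(LTR) / pi(ULR):", round(fold_difference(PI_LTR, PI_ULR), 1))
    print("S(LTR) / S(ULR):  ", fold_difference(S_LTR, S_ULR))
```

The diversity ratio rounds to the 3.4 quoted in the text, and the segregating-site counts point in the same direction, with the LTR being the more variable region.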

Identification of transcription factor binding sites
The repeated motifs within enhancers are usually binding sites for host factors, which regulate element expression. We identified the motifs of the copia ULR sequences present in the melanogaster and the repleta groups using the Alibaba2 program [22]. Table III shows the identified motifs, the sequence of each motif and the number of repetitions of each one. The most frequently found motif was the binding site for the CCAAT/enhancer binding protein (C/EBP), which is involved in the control of head segmentation in Drosophila. This motif presents 13 repetitions in the repleta species group and 11 in the melanogaster species group. It has been shown that the number of C/EBP repetitions is responsible for different levels of copia expression in D. melanogaster [27]. Motifs such as HB (hunchback factor), TBP (TATA-binding protein), Oct-1 (octamer-binding factor), FTZ (fushi tarazu) and Zen (Zerknuellt 1) were found in both species groups. In contrast, some motifs were limited to just one group of species. The motifs E47 (hairy), CFF (complex forming factor) and Embry (embryo DNA binding protein) were present only in the melanogaster species group, while Kr (Krueppel), Odd (Odd-skipped), NF-kappa, Ant (antennapedia) and Oct-2.1 (octamer-binding factor) were present only in the repleta species group.

DISCUSSION
Transposable elements can be classified, according to their structure, into classes, subclasses, families, and subfamilies. Jordan and McDonald [25,26] suggested that there were two copia families within the genus Drosophila: the melanogaster family with three subfamilies, and the repleta family (based on the analysis of only two species of the repleta group). Looking for other families and subfamilies within the repleta group, we increased the number of analyzed species to 24 (one species belonging to the hydei subgroup, two to the mercatorum subgroup, and 21 to the mulleri subgroup), but the copia sequences could only be obtained from eight species. The greatest HKY distance value among the copia sequences in the repleta group was 0.18 (D. mojavensis and D. koepferae), which reinforces the idea that this group of species harbors a single copia subfamily. But the great divergence of these sequences compared to those of the melanogaster group (0.82 between D. koepferae and D. sechellia) confirms the proposition of Jordan and McDonald [25,26] that there are at least two copia families in the genus Drosophila. This hypothesis is reinforced by the evolutionary relationships presented here. The unrooted tree obtained by the parsimony method shows two main groups: one with copia sequences of the repleta species and another with copia sequences of the melanogaster species, Z. tuberculatus and D. willistoni.

Table III. Transcription-factor binding sites in the repleta and melanogaster copia consensus sequences according to Alibaba2 (Grabe, 2002). Motifs: the factor name and the consensus sequence identified in the TRANSFAC database; repleta group: the sequences identified as motifs in the copia repleta sequence; nt position: the position of each motif in the copia sequence; melanogaster group: the sequences identified as motifs in the copia melanogaster sequence.

The wide distribution and the heterogeneous occurrence of copia in the Drosophila and Sophophora subgenera suggest that copia might have been present in the common ancestor of the genus Drosophila and have been vertically transmitted over evolutionary time. Hence, the divergence rates between the species groups should be, as observed, so great that the sequences cannot be recognized anymore, or the sequences may have been lost by stochastic events. This might be the case of the repleta species, in which only faint bands (subgroup mercatorum: D. hydei, D. mercatorum; subgroup mulleri, mojavensis cluster: D. navojoa, D. mojavensis) or no amplification at all (subgroup mulleri, mojavensis cluster: D. arizonae, longicornis cluster: D. longicornis, and all four studied species of the eremophila cluster) were observed. Moreover, very low distance values were also found between species belonging to different genera (species of the melanogaster and willistoni groups and the Zaprionus genus) and between different groups of species within the Sophophora subgenus (melanogaster and willistoni groups). Taken together, the nucleotide divergence and the parsimony analysis may be indicative of horizontal transfers among D. willistoni, Zaprionus tuberculatus and the melanogaster species subgroup. However, it is not possible to infer the direction of the postulated events. Horizontal copia transfers have been previously proposed between species of the melanogaster subgroup [25], between D. melanogaster and D. willistoni [27], and between D. melanogaster and D. simulans [41]. When we included in our analysis the copia sequences reported by Jordan and McDonald [25] plus sequences of eight species from the repleta group, we observed the inconsistencies reported by Jordan and McDonald [25] and Jordan et al. [27]. Additional phylogenetic incongruences can be observed between Z. tuberculatus and species of the melanogaster subgroup (D. melanogaster / D. yakuba), and between D. mojavensis, D. pachuca and species of the buzzatii cluster. Since geographic and temporal overlap between donor and recipient species is the minimum requirement to infer horizontal transfer, only the event between Z. indianus and D. yakuba or D. melanogaster might suggest such a transfer because the three species share their range distribution in Africa [28,48]. Horizontal transfer of transposable elements between Drosophila species has been shown to be a very frequent event. In addition to the classical reports [4, 5, 11-13, 16, 17, 44], other examples have been published more recently [1,7,24,30,32,39]. It has been postulated that cross-species transfers may be an effective strategy by which TEs avoid inactivation over evolutionary time [33,37,43]. Since copia is known to be subject to effective host-mediated repression, selective pressure might favor its horizontal transfer over evolutionary time [34,41]. Nevertheless, it is important to point out that a potential area of weakness for this kind of research is the presence of several copies of the transposable element in the same species (ancestral polymorphism). Comparisons of paralogous copies of elements and varying rates of sequence evolution of TE copies within and between species are factors which can yield incongruent phylogenies even under conditions of strict vertical transmission, as stressed by Zupunski et al. [50]. Another possibility, sequence similarity between distantly related species due to conservation of small motifs [6], could explain the similarity between copia sequences of species that do not share the same environments, since the LTR-ULR copia regions analyzed are regulatory protein-binding domains which control copia expression and spreading in natural populations [26].
Positive diversifying selection acting between copia families (melanogaster and repleta) and negative purifying selection acting within these families were reported by Jordan and McDonald [26] when studying the copia LTR-ULR natural variation among seven species of the melanogaster group and two species of the repleta group. According to these authors, the ULR nucleotide diversity (π) in the repleta group of species was 2.2 times greater than that of the LTR. By increasing the number of repleta species analyzed, we showed that this ratio is even higher (3.4). This result reinforces the hypothesis of functional constraint of copia ULR regulatory regions within the repleta family. The π value in the repleta LTR and ULR was 3.0 and 2.8 times higher, respectively, than in the copia melanogaster family; these ratios are lower than those reported by Jordan and McDonald [26]. Despite the differences between the two studies, which are probably due to the smaller number of repleta species studied by those authors, the negative purifying selection explains the very low distance values within each species group. The fact that the nucleotide diversity in the repleta LTR and ULR is greater than in those of the melanogaster group and that their distance values are higher reinforces the idea that the copia of the repleta group is a more ancestral family than that of the melanogaster group.
Another approach we used to compare the copia sequences harbored by the melanogaster species on the one hand and by the repleta species on the other was to analyze the LTR-ULR transcription factor binding sites. DNA-binding transcription factors play a central role in transcription regulation [29]. In this work, we found a similar repetition number of the C/EBP, TBP, FTZ and Zen motifs in the repleta and melanogaster copia families. However, these motifs are not homologous, neither according to their position in the LTR-ULR sequences nor according to their sequences. Moreover, it is known that regulatory regions can maintain their functions in spite of structural reorganization, as a result of species-specific losses and gains of transcription factor binding sites [15,31,36]. Of the 16 motifs found in this study, E47, CFF and Embry were absent in the repleta species, and Oct-2.1, Kr, Odd, NF-kappa and Ant were absent in the melanogaster species (Tab. III). On the other hand, the occurrence of several motifs in common in the repleta and melanogaster LTR-ULR copia sequences could be an indication that both copia families regulate the TE activity in the same way. Because LTR retroelements may be continually generating variation within their noncoding regions, continuous opportunities might exist for natural selection to favor the evolution of adaptive enhancer configurations. Hence, diversifying selection could explain such highly divergent sequences between the repleta and melanogaster species groups and such highly conserved sequences within these groups. Taken together, the low homology of the motif sequences, the phylogenetic relationships and the high level of nucleotide divergence between the melanogaster and repleta copia sequences make the occurrence of at least two retrotransposon copia families in Drosophila a robust hypothesis.
However, additional analysis including species from other groups of the Sophophora and the Drosophila subgenera might fill the gap and clarify whether the discontinuity of copia sequences between the repleta and melanogaster groups is real or not.